Does SVC with class_weight='auto' fail in scikit-learn?

I have the following dataset, which I am classifying with SVC (it has 5 labels). The problem appears when I set class_weight='auto', like this:

X = tfidf_vect.fit_transform(df['content'].values)
y = df['label'].values

from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,
                                                y)

svm_1 = SVC(kernel='linear', class_weight='auto')
svm_1.fit(X, y)
svm_1_prediction = svm_1.predict(X_test)

Then I get this exception:

Traceback (most recent call last):
  File "test.py", line 62, in <module>
    svm_1.fit(X, y)
  File "/usr/local/lib/python2.7/site-packages/sklearn/svm/base.py", line 140, in fit
    y = self._validate_targets(y)
  File "/usr/local/lib/python2.7/site-packages/sklearn/svm/base.py", line 474, in _validate_targets
    self.class_weight_ = compute_class_weight(self.class_weight, cls, y_)
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/class_weight.py", line 47, in compute_class_weight
    raise ValueError("classes should have valid labels that are in y")
ValueError: classes should have valid labels that are in y

To work around that problem, I then tried the following:

svm_1 = SVC(kernel='linear', class_weight='auto')
svm_1.fit(X, y_encoded)
svm_1_prediction = le.inverse_transform(svm_1.predict(X))

The problem with this approach is that I get this exception:

  File "/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py", line 179, in accuracy_score
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
  File "/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py", line 74, in _check_targets
    check_consistent_length(y_true, y_pred)
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 174, in check_consistent_length
    "%s" % str(uniques))
ValueError: Found arrays with inconsistent numbers of samples: [ 858 2598]

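Can anyone help me understand what is wrong with the approaches above, and how to correctly use SVC's class_weight='auto' parameter to balance the data automatically?

Update:

When I run print(y), the output is:

0       5
1       4
2       5
3       4
4       4
5       5
6       4
7       4
8       3
9       5
10      4
11      4
12      1
13      4
14      4
15      5
16      4
17      4
18      5
19      5
20      4
21      4
22      5
23      5
24      3
25      3
26      4
27      5
28      4
29      4
       ..
2568    4
2569    4
2570    4
2571    3
2572    4
2573    5
2574    5
2575    5
2576    5
2577    3
2578    4
2579    4
2580    2
2581    4
2582    3
2583    4
2584    5
2585    4
2586    5
2587    4
2588    4
2589    3
2590    5
2591    5
2592    4
2593    4
2594    4
2595    2
2596    2
2597    5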

Update:

Then I did the following:

mask = np.array(test)
print y[np.arange(len(y))[~mask]]

This is the output:

0       5
1       4
2       5
3       4
4       4
5       5
6       4
7       4
8       3
9       5
10      4
11      4
12      1
13      4
14      4
15      5
16      4
17      4
18      5
19      5
20      4
21      4
22      5
23      5
24      3
25      3
26      4
27      5
28      4
29      4
       ..
2568    4
2569    4
2570    4
2571    3
2572    4
2573    5
2574    5
2575    5
2576    5
2577    3
2578    4
2579    4
2580    2
2581    4
2582    3
2583    4
2584    5
2585    4
2586    5
2587    4
2588    4
2589    3
2590    5
2591    5
2592    4
2593    4
2594    4
2595    2
2596    2
2597    5
Name: label, dtype: float64

Answer:

The problem is here: the label column contains NaN values.

df.label.unique()
Out[50]: array([  5.,   4.,   3.,   1.,   2.,  nan])
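The NaN among the unique labels is the likely trigger for the "classes should have valid labels that are in y" error, since NaN never compares equal to itself. A minimal sketch of how such rows could be spotted before fitting is shown here. The toy frame is made up for illustration; only the column names `content` and `label` come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the question's DataFrame.
df = pd.DataFrame({
    "content": ["good", "bad", "fine", "awful"],
    "label": [5.0, 1.0, np.nan, 2.0],
})

# Count the rows whose label is missing.
n_missing = int(df["label"].isnull().sum())

# Keep only the rows that actually have a label.
clean = df[df["label"].notnull()]
labels = sorted(clean["label"].unique())

print(n_missing, labels)
```

Running a check like this right after loading the file makes it obvious whether any rows need to be dropped or repaired before they reach the classifier.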

Sample code:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# replace with the path to your own data file
df = pd.read_csv('data1.csv', header=0)

df[df.label.isnull()]
Out[52]:
                               id content  label
900   Daewoo_DWD_M1051__Opinio...       5    NaN
1463  Indesit_IWC_5105_B_it__O...       1    NaN

# drop these two rows
df = df[df.label.notnull()]

X = df.content.values
y = df.label.values

transformer = TfidfVectorizer()
X = transformer.fit_transform(X)

estimator = SVC(kernel='linear', class_weight='auto', probability=True)
estimator.fit(X, y)

estimator.predict(X)
Out[54]: array([ 4.,  4.,  4., ...,  2.,  2.,  3.])

estimator.predict_proba(X)
Out[55]:
array([[ 0.0252,  0.0228,  0.0744,  0.3427,  0.535 ],
       [ 0.002 ,  0.0122,  0.0604,  0.4961,  0.4292],
       [ 0.0036,  0.0204,  0.1238,  0.5681,  0.2841],
       ...,
       [ 0.1494,  0.3341,  0.1586,  0.1316,  0.2263],
       [ 0.0175,  0.1984,  0.0915,  0.3406,  0.3519],
       [ 0.049 ,  0.0264,  0.2087,  0.3267,  0.3891]])
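For readers on a recent scikit-learn, a few names have changed: class_weight='auto' was deprecated in 0.17 in favour of class_weight='balanced' and later removed, and train_test_split now lives in sklearn.model_selection rather than sklearn.cross_validation. The sketch below is one way the workflow could look under those assumptions. The toy corpus is made up, and it fits on the training split and predicts on the test split, which also keeps the array lengths consistent and so avoids the second error in the question.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical toy corpus; in practice use df.content.values and
# df.label.values after dropping the rows with missing labels.
texts = np.array(["great washer", "works great", "great value", "really great",
                  "broke quickly", "broke again", "great machine", "broke down"])
labels = np.array([5, 5, 5, 5, 1, 1, 5, 1])

# Split the raw texts first so the vectorizer only sees training data.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(texts_train)
X_test = vectorizer.transform(texts_test)

# 'balanced' weights each class by n_samples / (n_classes * class_count).
clf = SVC(kernel="linear", class_weight="balanced")
clf.fit(X_train, y_train)      # fit on the training split only
y_pred = clf.predict(X_test)   # predict on the held-out split

# y_test and y_pred have the same length, so the metric call is valid.
print(accuracy_score(y_test, y_pred))
```

The key point is that whatever array is passed to predict must correspond to the labels it is scored against: predicting on all of X and comparing with y_test produces exactly the "inconsistent numbers of samples" error.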
