SVC with class_weight='auto' fails in scikit-learn?

I have the following dataset, which I am classifying with SVC (it has 5 labels). When I try to set class_weight='auto', like this:

    X = tfidf_vect.fit_transform(df['content'].values)
    y = df['label'].values

    from sklearn import cross_validation
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y)

    svm_1 = SVC(kernel='linear', class_weight='auto')
    svm_1.fit(X, y)
    svm_1_prediction = svm_1.predict(X_test)

I get this exception:

    Traceback (most recent call last):
      File "test.py", line 62, in <module>
        svm_1.fit(X, y)
      File "/usr/local/lib/python2.7/site-packages/sklearn/svm/base.py", line 140, in fit
        y = self._validate_targets(y)
      File "/usr/local/lib/python2.7/site-packages/sklearn/svm/base.py", line 474, in _validate_targets
        self.class_weight_ = compute_class_weight(self.class_weight, cls, y_)
      File "/usr/local/lib/python2.7/site-packages/sklearn/utils/class_weight.py", line 47, in compute_class_weight
        raise ValueError("classes should have valid labels that are in y")
    ValueError: classes should have valid labels that are in y

To work around that problem, I then tried the following:

    svm_1 = SVC(kernel='linear', class_weight='auto')
    svm_1.fit(X, y_encoded)
    svm_1_prediction = le.inverse_transform(svm_1.predict(X))

The problem with this approach is that I get this exception:

      File "/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py", line 179, in accuracy_score
        y_type, y_true, y_pred = _check_targets(y_true, y_pred)
      File "/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py", line 74, in _check_targets
        check_consistent_length(y_true, y_pred)
      File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 174, in check_consistent_length
        "%s" % str(uniques))
    ValueError: Found arrays with inconsistent numbers of samples: [ 858 2598]

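Can anyone help me understand what is wrong with the approaches above, and how to correctly use SVC's class_weight='auto' parameter to balance the data automatically?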

Update:

When I run print(y), the output is:

    0       5
    1       4
    2       5
    3       4
    4       4
    5       5
    6       4
    7       4
    8       3
    9       5
    10      4
    11      4
    12      1
    13      4
    14      4
    15      5
    16      4
    17      4
    18      5
    19      5
    20      4
    21      4
    22      5
    23      5
    24      3
    25      3
    26      4
    27      5
    28      4
    29      4
           ..
    2568    4
    2569    4
    2570    4
    2571    3
    2572    4
    2573    5
    2574    5
    2575    5
    2576    5
    2577    3
    2578    4
    2579    4
    2580    2
    2581    4
    2582    3
    2583    4
    2584    5
    2585    4
    2586    5
    2587    4
    2588    4
    2589    3
    2590    5
    2591    5
    2592    4
    2593    4
    2594    4
    2595    2
    2596    2
    2597    5

Update:

Then I did the following:

    mask = np.array(test)
    print y[np.arange(len(y))[~mask]]

This is the output:

    0       5
    1       4
    2       5
    3       4
    4       4
    5       5
    6       4
    7       4
    8       3
    9       5
    10      4
    11      4
    12      1
    13      4
    14      4
    15      5
    16      4
    17      4
    18      5
    19      5
    20      4
    21      4
    22      5
    23      5
    24      3
    25      3
    26      4
    27      5
    28      4
    29      4
           ..
    2568    4
    2569    4
    2570    4
    2571    3
    2572    4
    2573    5
    2574    5
    2575    5
    2576    5
    2577    3
    2578    4
    2579    4
    2580    2
    2581    4
    2582    3
    2583    4
    2584    5
    2585    4
    2586    5
    2587    4
    2588    4
    2589    3
    2590    5
    2591    5
    2592    4
    2593    4
    2594    4
    2595    2
    2596    2
    2597    5
    Name: label, dtype: float64

Answer:

The problem is here: the label column contains NaN values alongside the five real classes.

    df.label.unique()
    Out[50]: array([  5.,   4.,   3.,   1.,   2.,  nan])
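As a quick check before fitting, the rows with missing labels can be counted and located. The following is only a minimal sketch: the toy DataFrame and the names `n_missing` and `missing_rows` are made up for illustration and stand in for the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
df = pd.DataFrame({
    "content": ["good", "bad", "fine", "poor"],
    "label": [5.0, np.nan, 4.0, np.nan],
})

# Count the rows whose label is missing
n_missing = int(df["label"].isnull().sum())

# Index values of the rows whose label is missing
missing_rows = df.index[df["label"].isnull()].tolist()

print(n_missing, missing_rows)
```

If the count is non-zero, those rows need to be dropped or relabeled before the labels are passed to the classifier.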

Sample code:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import SVC

    # replace with the path to your own data file
    df = pd.read_csv('data1.csv', header=0)

    df[df.label.isnull()]
    Out[52]:
                                   id content  label
    900   Daewoo_DWD_M1051__Opinio...       5    NaN
    1463  Indesit_IWC_5105_B_it__O...       1    NaN

    # drop these two rows
    df = df[df.label.notnull()]

    X = df.content.values
    y = df.label.values

    transformer = TfidfVectorizer()
    X = transformer.fit_transform(X)

    estimator = SVC(kernel='linear', class_weight='auto', probability=True)
    estimator.fit(X, y)

    estimator.predict(X)
    Out[54]: array([ 4.,  4.,  4., ...,  2.,  2.,  3.])

    estimator.predict_proba(X)
    Out[55]:
    array([[ 0.0252,  0.0228,  0.0744,  0.3427,  0.535 ],
           [ 0.002 ,  0.0122,  0.0604,  0.4961,  0.4292],
           [ 0.0036,  0.0204,  0.1238,  0.5681,  0.2841],
           ...,
           [ 0.1494,  0.3341,  0.1586,  0.1316,  0.2263],
           [ 0.0175,  0.1984,  0.0915,  0.3406,  0.3519],
           [ 0.049 ,  0.0264,  0.2087,  0.3267,  0.3891]])
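For readers on a newer scikit-learn, here is a hedged sketch of the same idea end to end. It makes several assumptions that are not in the original thread: the toy data is invented, `class_weight='balanced'` is used because later releases replaced the `'auto'` option with it, and `train_test_split` is imported from `sklearn.model_selection`, where it lives after the old `cross_validation` module was retired. It also fits on the training split and scores predictions made on the test split, which is likely what the second traceback (arrays of 858 and 2598 samples) was complaining about.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical toy data; two rows deliberately have a missing label
df = pd.DataFrame({
    "content": ["great washer", "great quiet washer", "really great machine",
                "broken washer", "broken noisy machine", "noisy and broken",
                "unlabeled review", "another unlabeled review"],
    "label": [5.0, 5.0, 5.0, 1.0, 1.0, 1.0, np.nan, np.nan],
})

# Drop unlabeled rows before vectorizing so X and y stay aligned
df = df[df["label"].notnull()]
y = df["label"].astype(int).values
X = TfidfVectorizer().fit_transform(df["content"].values)

# Stratify so both classes appear in the training and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, stratify=y, random_state=0)

# 'balanced' weights each class inversely to its frequency in y_train
clf = SVC(kernel="linear", class_weight="balanced")
clf.fit(X_train, y_train)

# Predict on the same split that the true labels come from
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(len(y_test), len(y_pred), acc)
```

The point to take away is that `y_pred` and `y_test` come from the same split, so `accuracy_score` receives arrays of equal length.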
