I have the following dataset. I am classifying it with SVC (it has 5 labels). When I try to set class_weight='auto', like this:
    X = tfidf_vect.fit_transform(df['content'].values)
    y = df['label'].values

    from sklearn import cross_validation
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y)

    svm_1 = SVC(kernel='linear', class_weight='auto')
    svm_1.fit(X, y)
    svm_1_prediction = svm_1.predict(X_test)
Then I get this exception:
    Traceback (most recent call last):
      File "test.py", line 62, in <module>
        svm_1.fit(X, y)
      File "/usr/local/lib/python2.7/site-packages/sklearn/svm/base.py", line 140, in fit
        y = self._validate_targets(y)
      File "/usr/local/lib/python2.7/site-packages/sklearn/svm/base.py", line 474, in _validate_targets
        self.class_weight_ = compute_class_weight(self.class_weight, cls, y_)
      File "/usr/local/lib/python2.7/site-packages/sklearn/utils/class_weight.py", line 47, in compute_class_weight
        raise ValueError("classes should have valid labels that are in y")
    ValueError: classes should have valid labels that are in y
Then, following the answers to an earlier question, I tried the following:
    svm_1 = SVC(kernel='linear', class_weight='auto')
    svm_1.fit(X, y_encoded)
    svm_1_prediction = le.inverse_transform(svm_1.predict(X))
The problem with this approach is that I get this exception:
      File "/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py", line 179, in accuracy_score
        y_type, y_true, y_pred = _check_targets(y_true, y_pred)
      File "/usr/local/lib/python2.7/site-packages/sklearn/metrics/classification.py", line 74, in _check_targets
        check_consistent_length(y_true, y_pred)
      File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 174, in check_consistent_length
        "%s" % str(uniques))
    ValueError: Found arrays with inconsistent numbers of samples: [ 858 2598]
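Can anyone help me understand what is wrong with the approaches above, and how I can correctly use SVC's class_weight='auto' parameter to balance the data automatically?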
Update:

When I run print(y), the output is:

    0       5
    1       4
    2       5
    3       4
    4       4
    5       5
    6       4
    7       4
    8       3
    9       5
    10      4
    11      4
    12      1
    13      4
    14      4
    15      5
    16      4
    17      4
    18      5
    19      5
    20      4
    21      4
    22      5
    23      5
    24      3
    25      3
    26      4
    27      5
    28      4
    29      4
    ..
    2568    4
    2569    4
    2570    4
    2571    3
    2572    4
    2573    5
    2574    5
    2575    5
    2576    5
    2577    3
    2578    4
    2579    4
    2580    2
    2581    4
    2582    3
    2583    4
    2584    5
    2585    4
    2586    5
    2587    4
    2588    4
    2589    3
    2590    5
    2591    5
    2592    4
    2593    4
    2594    4
    2595    2
    2596    2
    2597    5
Update:

Then I did the following:
    mask = np.array(test)
    print y[np.arange(len(y))[~mask]]
This is the output:
    0       5
    1       4
    2       5
    3       4
    4       4
    5       5
    6       4
    7       4
    8       3
    9       5
    10      4
    11      4
    12      1
    13      4
    14      4
    15      5
    16      4
    17      4
    18      5
    19      5
    20      4
    21      4
    22      5
    23      5
    24      3
    25      3
    26      4
    27      5
    28      4
    29      4
    ..
    2568    4
    2569    4
    2570    4
    2571    3
    2572    4
    2573    5
    2574    5
    2575    5
    2576    5
    2577    3
    2578    4
    2579    4
    2580    2
    2581    4
    2582    3
    2583    4
    2584    5
    2585    4
    2586    5
    2587    4
    2588    4
    2589    3
    2590    5
    2591    5
    2592    4
    2593    4
    2594    4
    2595    2
    2596    2
    2597    5
    Name: label, dtype: float64
Answer:

The problem is here (the label column contains NaN in addition to the five real labels):
    df.label.unique()
    Out[50]: array([ 5., 4., 3., 1., 2., nan])
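To make the effect of those NaN entries concrete, here is a minimal plain-Python sketch. The `labels` list and the helper `balanced_class_weights` are hypothetical and only for illustration. The weighting shown is the heuristic that current scikit-learn documents for class_weight='balanced', n_samples / (n_classes * count); the older 'auto' option used in the question may not compute exactly the same values.

```python
import math
from collections import Counter

# Hypothetical label column: five real classes plus two missing values (NaN).
labels = [5.0, 4.0, 5.0, 4.0, 3.0, 1.0, float("nan"), 2.0, 4.0, float("nan")]

# NaN compares unequal to everything, including itself,
# so it cannot behave like an ordinary class label.
nan = float("nan")
print(nan == nan)  # False

# Keep only the entries that carry a real label.
clean = [int(v) for v in labels if not math.isnan(v)]


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {cls: n_samples / (n_classes * cnt)
            for cls, cnt in sorted(counts.items())}


weights = balanced_class_weights(clean)
print(weights)
```

Rare classes receive a larger weight than frequent ones, which is the sense in which the option "balances" the data.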
Example code:
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import SVC

    # replace with the path to your own data file
    df = pd.read_csv('data1.csv', header=0)

    df[df.label.isnull()]
    Out[52]:
                                 id  content  label
    900   Daewoo_DWD_M1051__Opinio...        5    NaN
    1463  Indesit_IWC_5105_B_it__O...        1    NaN

    # remove these two rows
    df = df[df.label.notnull()]

    X = df.content.values
    y = df.label.values

    transformer = TfidfVectorizer()
    X = transformer.fit_transform(X)

    estimator = SVC(kernel='linear', class_weight='auto', probability=True)
    estimator.fit(X, y)

    estimator.predict(X)
    Out[54]: array([ 4., 4., 4., ..., 2., 2., 3.])

    estimator.predict_proba(X)
    Out[55]:
    array([[ 0.0252, 0.0228, 0.0744, 0.3427, 0.535 ],
           [ 0.002 , 0.0122, 0.0604, 0.4961, 0.4292],
           [ 0.0036, 0.0204, 0.1238, 0.5681, 0.2841],
           ...,
           [ 0.1494, 0.3341, 0.1586, 0.1316, 0.2263],
           [ 0.0175, 0.1984, 0.0915, 0.3406, 0.3519],
           [ 0.049 , 0.0264, 0.2087, 0.3267, 0.3891]])
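On a newer scikit-learn release the same fix might look like the sketch below. It assumes class_weight='balanced' (the replacement for the since-removed 'auto') and sklearn.model_selection.train_test_split (the replacement for sklearn.cross_validation). The toy DataFrame is hypothetical and only stands in for data1.csv. The model is fitted on the training split and scored on X_test only, so the predictions line up with y_test. That would also avoid the "inconsistent numbers of samples" error from the question, if that error came from scoring predictions for all rows against the test labels, which seems likely but is not confirmed by the post.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical toy data standing in for the real CSV; two rows lack a label.
df = pd.DataFrame({
    "content": ["good washer", "bad washer", "great dryer", "poor dryer",
                "good machine", "bad machine", "great value", "poor value",
                "unlabelled text", "another unlabelled text"],
    "label": [5.0, 1.0, 5.0, 1.0, 5.0, 1.0, 5.0, 1.0, np.nan, np.nan],
})

# Drop rows whose label is NaN before building X and y.
df = df[df["label"].notnull()]
X = TfidfVectorizer().fit_transform(df["content"].values)
y = df["label"].values.astype(int)

# Split once, then fit on the training part only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = SVC(kernel="linear", class_weight="balanced")
clf.fit(X_train, y_train)

# Predict on the test part so that pred and y_test have the same length.
pred = clf.predict(X_test)
print(len(pred), len(y_test))
print(accuracy_score(y_test, pred))
```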