This returns the following error:
ValueError: The number of classes has to be greater than one; got 1
This makes no sense to me, because the 'car_rating' column definitely has two classes. Running a value count returns:
unacc    1210
acc       518
So there are two classes, one smaller than the other, but both large enough that stratified k-fold should be able to keep both present in every split. So what is causing this error?
The dataset I'm using can be found here. I did change the column names and merged the 'good' and 'vgood' classes into 'acc', but otherwise the data is unchanged.
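Roughly, the preprocessing looks like this (the file name and the non-target column names are placeholders; only the class merge and the resulting counts are the actual changes described above):

import pandas as pd

# Placeholder load: the raw file has no header row, so names are supplied.
# Column names other than 'car_rating' are illustrative guesses.
df = pd.read_csv("car.data",
                 names=["buying", "maint", "doors", "persons",
                        "lug_boot", "safety", "car_rating"])

# Merge the 'good' and 'vgood' classes into 'acc'.
df["car_rating"] = df["car_rating"].replace({"good": "acc", "vgood": "acc"})

print(df["car_rating"].value_counts())
# unacc    1210
# acc       518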
Edit: here is the code for plot_learning_curve:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve


def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 10)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y-values plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 3-fold cross-validation,
          - integer, to specify the number of folds.
          - An object to be used as a cross-validation generator.
          - An iterable yielding train/test splits.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` is used. If the estimator is not a
        classifier or if ``y`` is neither binary nor multiclass,
        :class:`KFold` is used.

        Refer to the :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).

    taken from:
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1,
                     color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt
And here is the full stack trace:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-19-04113e3ff056> in <module>()
      1 # the built in learning curve
      2 clf = SVC(kernel='poly', degree=3, C=1000)
----> 3 plot_learning_curve(estimator=clf, title="Test", X=X, y=y, cv=10)

<ipython-input-9-022f43e40037> in plot_learning_curve(estimator, title, X, y, ylim, cv, n_jobs, train_sizes)
     50     plt.ylabel("Score")
     51     train_sizes, train_scores, test_scores = learning_curve(
---> 52         estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
     53     train_scores_mean = np.mean(train_scores, axis=1)
     54     train_scores_std = np.std(train_scores, axis=1)

~/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in learning_curve(estimator, X, y, groups, train_sizes, cv, scoring, exploit_incremental_learning, n_jobs, pre_dispatch, verbose, shuffle, random_state)
   1126             clone(estimator), X, y, scorer, train, test,
   1127             verbose, parameters=None, fit_params=None, return_train_score=True)
-> 1128         for train, test in train_test_proportions)
   1129     out = np.array(out)
   1130     n_cv_folds = out.shape[0] // n_unique_ticks

~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    777             # was dispatched. In particular this covers the edge
    778             # case of Parallel used with an exhausted iterator.
--> 779             while self.dispatch_one_batch(iterator):
    780                 self._iterating = True
    781             else:

~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    623                 return False
    624             else:
--> 625                 self._dispatch(tasks)
    626                 return True
    627

~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
    586         dispatch_timestamp = time.time()
    587         cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 588         job = self._backend.apply_async(batch, callback=cb)
    589         self._jobs.append(job)
    590

~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback)
    109     def apply_async(self, func, callback=None):
    110         """Schedule a func to be run"""
--> 111         result = ImmediateResult(func)
    112         if callback:
    113             callback(result)

~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self, batch)
    330         # Don't delay the application, to avoid keeping the input
    331         # arguments in memory
--> 332         self.results = batch()
    333
    334     def get(self):

~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
    129
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132
    133     def __len__(self):

~/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
    129
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132
    133     def __len__(self):

~/anaconda3/lib/python3.6/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, error_score)
    456             estimator.fit(X_train, **fit_params)
    457         else:
--> 458             estimator.fit(X_train, y_train, **fit_params)
    459
    460     except Exception as e:

~/anaconda3/lib/python3.6/site-packages/sklearn/svm/base.py in fit(self, X, y, sample_weight)
    148
    149         X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')
--> 150         y = self._validate_targets(y)
    151
    152         sample_weight = np.asarray([]

~/anaconda3/lib/python3.6/site-packages/sklearn/svm/base.py in _validate_targets(self, y)
    504             raise ValueError(
    505                 "The number of classes has to be greater than one; got %d"
--> 506                 % len(cls))
    507
    508         self.classes_ = cls

ValueError: The number of classes has to be greater than one; got 1
Answer:
Yes, the problem is caused by train_sizes.
Its default value is:
train_sizes=np.linspace(.1, 1.0, 10)
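For reference, that expression evaluates to ten evenly spaced fractions of the training set:

import numpy as np

# Ten evenly spaced fractions, from 10% to 100% of the training data.
print(np.linspace(.1, 1.0, 10))
# [0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]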
Inside learning_curve, this is used to compute train_sizes_abs, which simply translates the float fractions of the training set into actual sample counts:
...
n_max_training_samples = len(cv_iter[0][0])
train_sizes_abs = _translate_train_sizes(train_sizes,
                                         n_max_training_samples)
...
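As a rough sketch of that translation (the fold size of 1555 is an assumption: 10-fold CV on the ~1728-row dataset leaves about nine tenths of the rows in each training fold, and the real _translate_train_sizes performs additional validation):

import numpy as np

# Rough illustration only: convert float fractions into absolute sample
# counts. 1555 is an assumed training-fold size (about 9/10 of 1728 rows).
train_sizes = np.linspace(.1, 1.0, 10)
n_max_training_samples = 1555

train_sizes_abs = (train_sizes * n_max_training_samples).astype(int)
print(train_sizes_abs)
# approximately: [ 155  311  466  622  777  933 1088 1244 1399 1555]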
train_sizes_abs is then used to actually select the incremental training data for each fold:
...
else:
    train_test_proportions = []
    for train, test in cv_iter:
        for n_train_samples in train_sizes_abs:
            train_test_proportions.append((train[:n_train_samples], test))
...
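To see why that slicing fails here, consider a minimal sketch with made-up data in which the target happens to be grouped by class; the first learning-curve tick then sees only one class:

import numpy as np

# Made-up stand-in for the target: all 'unacc' rows first, then all 'acc'
# rows, mimicking data grouped by class. Without shuffling, the train
# indices of a fold keep this order.
y = np.array(["unacc"] * 1210 + ["acc"] * 518)
train = np.arange(len(y))  # train indices of one fold, in original order

n_train_samples = 155  # the smallest tick: roughly 10% of the fold
y_first_tick = y[train[:n_train_samples]]

print(np.unique(y_first_tick))  # ['unacc'] -- one class, so SVC.fit raises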
This is what causes the problem: the first chunk of data selected for training (the first entry in train_test_proportions) can happen to contain only a single class, and there is nothing we can do about the slicing itself. But if the training data is shuffled beforehand, the problem goes away (it is still possible for a shuffled selection to contain only a single class, but that is rare).
So we need to add the shuffle parameter to the learning_curve call:
train_sizes, train_scores, test_scores = learning_curve(
    estimator, X, y, cv=cv, n_jobs=n_jobs,
    train_sizes=train_sizes, shuffle=True)
After that, the code runs successfully.
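An alternative workaround, if you would rather not edit plot_learning_curve, is to shuffle the rows once up front; this is a sketch assuming X and y from the question are array-like, and it should have a similar effect of making every prefix of a training fold class-mixed:

from sklearn.utils import shuffle

# Shuffle X and y together once, so that no ordered prefix of a training
# fold consists of a single class. random_state is an arbitrary choice
# made here for reproducibility.
X, y = shuffle(X, y, random_state=42)
plot_learning_curve(estimator=clf, title="Test", X=X, y=y, cv=10)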