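This is a question about scikit-learn (version 0.17.0) on Python 2.7, together with Pandas 0.17.1. I am trying to split my raw data (which has no missing entries) using the approach described in detail here. I find that if I call .fit() on the split data, an error is raised.
The code below is taken mostly from another Stack Overflow question, with the variables renamed. I create a grid and try to fit it on the split data in order to determine the best classifier parameters. The error occurs after the last line of the code below: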
import pandas as pd
import numpy as np

# UCI's wine dataset
wine = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")

# Separate the target variable from the data
y = wine['quality']
X = wine.drop(['quality','color'],axis = 1)

# Stratified split of the train and test data
from sklearn.cross_validation import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(y, n_iter=3, test_size=0.2)

# Split the dataset to obtain the indices of the train and test sets
for train_index, test_index in sss:
    xtrain, xtest = X.iloc[train_index], X.iloc[test_index]
    ytrain, ytest = y[train_index], y[test_index]

# Pick some classifier
from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier()

from sklearn.grid_search import GridSearchCV
# Instantiate the grid
grid = GridSearchCV(decision_tree, param_grid={'max_depth':np.arange(1,3)}, cv=sss, scoring='accuracy')

# This line causes the error message
grid.fit(xtrain,ytrain)
Here is the error message produced by the code above:
Traceback (most recent call last):
  File "C:\Python27\test.py", line 23, in <module>
    grid.fit(xtrain,ytrain)
  File "C:\Python27\lib\site-packages\sklearn\grid_search.py", line 804, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "C:\Python27\lib\site-packages\sklearn\grid_search.py", line 553, in _fit
    for parameters in parameter_iterable
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 800, in __call__
    while self.dispatch_one_batch(iterator):
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 658, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 566, in _dispatch
    job = ImmediateComputeBatch(batch)
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 180, in __init__
    self.results = batch()
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 72, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1524, in _fit_and_score
    X_train, y_train = _safe_split(estimator, X, y, train)
  File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1591, in _safe_split
    X_subset = safe_indexing(X, indices)
  File "C:\Python27\lib\site-packages\sklearn\utils\__init__.py", line 152, in safe_indexing
    return X.iloc[indices]
  File "C:\Python27\lib\site-packages\pandas\core\indexing.py", line 1227, in __getitem__
    return self._getitem_axis(key, axis=0)
  File "C:\Python27\lib\site-packages\pandas\core\indexing.py", line 1504, in _getitem_axis
    self._is_valid_list_like(key, axis)
  File "C:\Python27\lib\site-packages\pandas\core\indexing.py", line 1443, in _is_valid_list_like
    raise IndexError("positional indexers are out-of-bounds")
IndexError: positional indexers are out-of-bounds
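Note: It is important to me to keep X and y as Pandas data structures, similar to the second approach mentioned in the other Stack Overflow question above. That is, I do not want to use X.values and y.values.
Question: With the raw data kept as Pandas data structures (a DataFrame for X and a Series for y), is there a way to run grid.fit() without getting this error message?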
Answer:
You should pass X and y directly to fit(), like this:
grid.fit(X, y)
GridSearchCV will then take care of this step itself:
xtrain, xtest = X.iloc[train_index], X.iloc[test_index]
ytrain, ytest = y[train_index], y[test_index]
A StratifiedShuffleSplit instance, when iterated over, yields pairs of index arrays for the train/test splits:
>>> list(sss)
[(array([2531, 4996, 4998, ..., 3205, 2717, 4983]), array([5942, 893, 1702, ..., 6340, 4806, 2537])),
 (array([1888, 2332, 6276, ..., 1674, 775, 3705]), array([3404, 3304, 4741, ..., 4397, 3646, 1410])),
 (array([1517, 3759, 4402, ..., 5098, 4619, 4521]), array([1110, 4076, 1280, ..., 6384, 1294, 1132]))]
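For readers on a newer scikit-learn, here is a small sketch of the same inspection. It assumes the sklearn.model_selection version of StratifiedShuffleSplit, which takes n_splits rather than y and n_iter and produces the index pairs through .split(X, y). The 100-row X and y are hypothetical placeholders, not the wine data.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical labels and placeholder features; only the row count matters here.
y = np.array([0, 1] * 50)
X = np.zeros((100, 1))

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

# Each element is a (train_index, test_index) pair of positions
# into the full 100-row dataset.
splits = list(sss.split(X, y))
for train_index, test_index in splits:
    print(len(train_index), len(test_index))
```

The point to notice is that every index refers to a position in the full dataset the splitter was given, not to a position in any subset.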
GridSearchCV will use these indices to split the samples it is given. You do not need to do the split manually.
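As a rough sketch of what this looks like end to end while keeping X as a DataFrame and y as a Series: the example below assumes a newer scikit-learn, where the sklearn.cross_validation and sklearn.grid_search modules from the question have been replaced by sklearn.model_selection. The synthetic data and the column names a, b, c, d are assumptions made for illustration; the wine CSV is not downloaded.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption, not the real dataset).
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(200, 4), columns=["a", "b", "c", "d"])
y = pd.Series((X["a"] + X["b"] > 1.0).astype(int), name="quality")

# The splitter is only configured here; it is not given the data yet.
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2]},
    cv=sss,
    scoring="accuracy",
)

# Pass the full X and y; GridSearchCV applies the splits internally.
grid.fit(X, y)
print(grid.best_params_)
```

X and y stay Pandas objects throughout, so there is no need for X.values or y.values.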
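The error occurs because you are feeding xtrain and ytrain (one of the train/test splits) to the cross-validator. The splitter sss was built from the full y, so its indices refer to positions in the full dataset. The cross-validator therefore tries to access items that exist in the full dataset but not in the smaller train/test split, which raises the IndexError.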
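The failure can be reproduced without scikit-learn at all. The sketch below uses a hypothetical 10-row frame and a hand-written train_index standing in for what a splitter built on the full data might produce.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the full dataset: 10 rows.
X = pd.DataFrame({"feature": np.arange(10)})

# Positions as a splitter built on all 10 rows might produce them.
train_index = np.array([0, 2, 3, 5, 6, 7, 8, 9])

# Manual split: xtrain has only 8 rows, so valid positions are 0..7.
xtrain = X.iloc[train_index]

# Re-applying full-dataset positions to the subset fails, because
# positions 8 and 9 do not exist in an 8-row frame.
try:
    xtrain.iloc[train_index]
    raised = False
except IndexError as exc:
    raised = True
    print("IndexError:", exc)

# Applied to the full frame, the same positions are valid.
subset = X.iloc[train_index]
```

This is the same situation as calling grid.fit(xtrain, ytrain) with cv=sss: the indices and the data no longer refer to the same set of rows.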