I ran into a problem while applying K-fold cross-validation. When I use Tfidf, I get the following error:
ValueError: setting an array element with a sequence.
I have seen other people hit the same error, but they were using train_test_split(), which works somewhat differently from K-fold.
    for train_fold, valid_fold in kf.split(reviews_p1):
        vec = TfidfVectorizer(ngram_range=(1,1))
        reviews_p1 = vec.fit_transform(reviews_p1)
        train_x = [reviews_p1[i] for i in train_fold]  # extract training data using the training indices
        train_y = [labels_p1[i] for i in train_fold]   # extract training labels using the training indices
        valid_x = [reviews_p1[i] for i in valid_fold]  # extract validation data using the validation indices
        valid_y = [labels_p1[i] for i in valid_fold]   # extract validation labels using the validation indices
        svc = LinearSVC()
        model = svc.fit(X = train_x, y = train_y)      # fit the model on the fold's training data
        y_pred = model.predict(valid_x)
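I have actually found where the problem comes from, but I cannot find a way to fix it. When the training data is extracted element by element with the training/validation indices, the result is a list of 1-row sparse matrices rather than a single sparse matrix: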
    [<1x21185 sparse matrix of type '<class 'numpy.float64'>' with 54 stored elements in Compressed Sparse Row format>,
     <1x21185 sparse matrix of type '<class 'numpy.float64'>' with 47 stored elements in Compressed Sparse Row format>,
     <1x21185 sparse matrix of type '<class 'numpy.float64'>' with 18 stored elements in Compressed Sparse Row format>,
     ....]
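I tried applying Tfidf after splitting the data instead, but that does not work because the training and validation parts end up with a different number of features.
So, is there a way to split the data for K-fold cross-validation without creating a list of sparse matrices?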
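Answer:
An answer to a similar question, "Do I use the same Tfidf vocabulary in k-fold cross_validation", suggests splitting the raw text first, then fitting the vectorizer on the training fold only and reusing it to transform the test fold, so both parts share the same vocabulary and feature count: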
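    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import accuracy_score
    from sklearn.svm import SVC

    # kf is the K-fold splitter; data_x (raw texts) and data_y (labels) must be
    # NumPy arrays so that they can be indexed with an array of indices.
    for train_index, test_index in kf.split(data_x, data_y):
        x_train, x_test = data_x[train_index], data_x[test_index]
        y_train, y_test = data_y[train_index], data_y[test_index]
        tfidf = TfidfVectorizer()
        x_train = tfidf.fit_transform(x_train)  # learn the vocabulary on the training fold only
        x_test = tfidf.transform(x_test)        # reuse that vocabulary for the test fold
        clf = SVC()
        clf.fit(x_train, y_train)
        y_pred = clf.predict(x_test)
        score = accuracy_score(y_test, y_pred)
        print(score)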
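To make the idea above concrete, here is a minimal self-contained sketch. The toy reviews, the labels, and the helper name fold_shapes are invented for illustration and are not from the original post. It shows two things: that fitting the vectorizer on the training fold and only transforming the validation fold gives both parts the same number of columns, and that the same per-fold logic can be delegated to a scikit-learn Pipeline passed to cross_val_score.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data, invented purely for illustration.
reviews = np.array([
    "great movie loved it",
    "terrible plot and bad acting",
    "loved the acting great fun",
    "bad movie terrible waste",
    "great fun loved the plot",
    "terrible acting bad waste",
])
labels = np.array([1, 0, 1, 0, 1, 0])


def fold_shapes(texts, n_splits=3):
    """Hypothetical helper: return (train_shape, valid_shape) for each fold."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    shapes = []
    for train_idx, valid_idx in kf.split(texts):
        vec = TfidfVectorizer()
        # Indexing a NumPy array with an index array keeps the raw texts together.
        train_x = vec.fit_transform(texts[train_idx])  # vocabulary from training fold
        valid_x = vec.transform(texts[valid_idx])      # same vocabulary, same columns
        shapes.append((train_x.shape, valid_x.shape))
    return shapes


# Equivalent idea with a Pipeline: the vectorizer is refit on each training
# fold automatically, so no manual indexing is needed.
pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), LinearSVC())
kf = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, reviews, labels, cv=kf)
print(scores)
```

If the TF-IDF matrix has already been computed for the whole corpus, indexing it with the index array itself (for example X[train_idx]) instead of a list comprehension also returns a single sparse matrix, although the vocabulary and IDF weights are then learned from the validation texts as well.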