CountVectorizer error: ValueError: setting an array element with a sequence

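I have a dataset of 144 student feedback entries, 72 positive and 72 negative. The dataset has two attributes, data and target: data holds the sentences and target holds the sentiment (positive or negative). Please see the following code: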

import pandas as pd
feedback_data = pd.read_csv('output.csv')
print(feedback_data)
      data    target
0      facilitates good student teacher communication.  positive
1                           lectures are very lengthy.  negative
2             the teacher is very good at interaction.  positive
3                       good at clearing the concepts.  positive
4                       good at clearing the concepts.  positive
5                                    good at teaching.  positive
6                          does not shows test copies.  negative
7                           good subjective knowledge.  positive
8                           good communication skills.  positive
9                               good teaching methods.  positive
10   posseses very good and thorough knowledge of t...  positive
11   posseses superb ability to provide a lots of i...  positive
12   good conceptual skills and knowledge for subject.  positive
13                      no commuication outside class.  negative
14                                     rude behaviour.  negative
15            very negetive attitude towards students.  negative
16   good communication skills, lacks time punctual...  positive
17   explains in a better way by giving practical e...  positive
18                               hardly comes on time.  negative
19                          good communication skills.  positive
20   to make students comfortable with the subject,...  negative
21                       associated to original world.  positive
22                             lacks time punctuality.  negative
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary = True)
cv.fit(feedback_data['data'].values)
X = feedback_data['data'].apply(lambda X : cv.transform([X])).values
X_test = cv.transform(feedback_data_test)
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
target = [1 if i<72 else 0 for i in range(144)]
print(target)
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.50)
clf = svm.SVC(kernel = 'linear', gamma = 0.001, C = 0.05)
# the following line raises the error
clf.fit(X , target)

I don't know what is going wrong. Please help me.


Answer:

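The error comes from how X is built. You cannot pass X to the fit method as it is; it needs one more transformation first. (I could not tell you this in your previous question because I did not have the relevant information.)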

Right now you have the following:

array([<1x23 sparse matrix of type '<class 'numpy.int64'>'with 5 stored elements in Compressed Sparse Row format>,   ...   <1x23 sparse matrix of type '<class 'numpy.int64'>'with 3 stored elements in Compressed Sparse Row format>], dtype=object)

This is enough for the split. We just need to transform it into a form that both you and the fit method can understand:

X = list([list(x.toarray()[0]) for x in X])

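What we do here is convert each sparse matrix to a numpy array, take its first element (it has only one), and then convert that to a list so that it has the right dimensions.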

Now, why do we do this:

X looks like this:

>>> X[0]
<1x23 sparse matrix of type '<class 'numpy.int64'>'
        with 5 stored elements in Compressed Sparse Row format>

So we convert it to see what it actually is:

>>> X[0].toarray()
array([[0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
        0]], dtype=int64)

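You can then see that there is a small problem with the dimensions (the result is a 2-D array with a single row), so we take the first element.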

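Converting it back to a list does nothing functionally; it is only there to help you understand what you are looking at. (For speed, you can omit this step.)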
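The shape issue described above can be checked on a small example. This is only a sketch: the four sentences are a hypothetical stand-in for feedback_data['data'], and np.vstack is used here as one possible way to skip the list conversion, not necessarily what the original code should use.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for feedback_data['data']
sentences = pd.Series([
    "good at teaching.",
    "lectures are very lengthy.",
    "good communication skills.",
    "rude behaviour.",
])

cv = CountVectorizer(binary=True)
cv.fit(sentences.values)

# Same pattern as the question: one 1 x n_features sparse matrix per sentence,
# collected in an object-dtype array
X_obj = sentences.apply(lambda s: cv.transform([s])).values

first_row = X_obj[0].toarray()   # 2-D array, shape (1, n_features)
first_vector = first_row[0]      # 1-D array, shape (n_features,)

# Stack the 1-D rows into one 2-D array without building Python lists
X_dense = np.vstack([x.toarray()[0] for x in X_obj])
```

Here X_dense is an ordinary 2-D numeric array with one row per sentence, which is the layout fit expects.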

Your code now looks like this:

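from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(binary = True)
cv.fit(feedback_data['data'].values)
# one 1 x n_features sparse matrix per sentence
X = feedback_data['data'].apply(lambda s : cv.transform([s])).values
# convert to a 2-D list of 0/1 values that fit() accepts
X = list([list(x.toarray()[0]) for x in X])
clf = svm.SVC(kernel = 'linear', gamma = 0.001, C = 0.05)
clf.fit(X, target)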
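As an alternative to the row-by-row transform, the whole column can be vectorized in one call. The sketch below is not the approach used in the answer above; it swaps in fit_transform, which returns a single sparse matrix that SVC can take directly. The sentences and labels are hypothetical toy data, not the real 144 rows.

```python
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the 144 feedback sentences
sentences = [
    "good at teaching.",
    "good communication skills.",
    "good teaching methods.",
    "lectures are very lengthy.",
    "rude behaviour.",
    "hardly comes on time.",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative

cv = CountVectorizer(binary=True)
# One sparse matrix of shape (n_samples, n_features) for the whole corpus
X = cv.fit_transform(sentences)

clf = svm.SVC(kernel='linear', C=0.05)
clf.fit(X, labels)            # SVC accepts SciPy sparse input
predictions = clf.predict(X)  # one predicted label per sentence
```

With the real data this would presumably become cv.fit_transform(feedback_data['data']), which avoids the object array of 1-row matrices altogether.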
