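I have a dataset of 144 student feedback entries, 72 positive and 72 negative. The dataset has two attributes, data and target: data contains the sentence and target contains the sentiment (positive or negative). Consider the following code: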
import pandas as pd
feedback_data = pd.read_csv('output.csv')
print(feedback_data)
data target
0 facilitates good student teacher communication. positive
1 lectures are very lengthy. negative
2 the teacher is very good at interaction. positive
3 good at clearing the concepts. positive
4 good at clearing the concepts. positive
5 good at teaching. positive
6 does not shows test copies. negative
7 good subjective knowledge. positive
8 good communication skills. positive
9 good teaching methods. positive
10 posseses very good and thorough knowledge of t... positive
11 posseses superb ability to provide a lots of i... positive
12 good conceptual skills and knowledge for subject. positive
13 no commuication outside class. negative
14 rude behaviour. negative
15 very negetive attitude towards students. negative
16 good communication skills, lacks time punctual... positive
17 explains in a better way by giving practical e... positive
18 hardly comes on time. negative
19 good communication skills. positive
20 to make students comfortable with the subject,... negative
21 associated to original world. positive
22 lacks time punctuality. negative
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary = True)
cv.fit(feedback_data['data'].values)
X = feedback_data['data'].apply(lambda X : cv.transform([X])).values
X_test = cv.transform(feedback_data_test)
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
target = [1 if i<72 else 0 for i in range(144)]
print(target)
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.50)
clf = svm.SVC(kernel = 'linear', gamma = 0.001, C = 0.05)
# the following line raises an error
clf.fit(X , target)
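I can't figure out what is going wrong. Please help.
Answer:
The error comes from how X is built. You cannot pass X to the fit method as it is; it needs one more transformation first. (I could not tell you this in your previous question because I did not have the relevant information.)
Right now you have the following: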
array([<1x23 sparse matrix of type '<class 'numpy.int64'>'with 5 stored elements in Compressed Sparse Row format>, ... <1x23 sparse matrix of type '<class 'numpy.int64'>'with 3 stored elements in Compressed Sparse Row format>], dtype=object)
This is enough to do the split. We only need to convert it into a form that both you and the fit method can understand:
X = list([list(x.toarray()[0]) for x in X])
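What this does is convert each sparse matrix to a NumPy array, take its first element (it has only one), and then convert that to a list so that the result has the right dimensions.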
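To make the conversion concrete, here is a minimal sketch on three sentences taken from the sample output in the question. It assumes a plain Python list of per-sentence sparse rows in place of the pandas object array, and the names `rows` and `X_dense` are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Three sentences from the printed sample, standing in for feedback_data['data']
sentences = [
    "good communication skills.",
    "rude behaviour.",
    "lacks time punctuality.",
]

cv = CountVectorizer(binary=True)
cv.fit(sentences)

# One 1 x n_features sparse matrix per sentence, as in the question
rows = [cv.transform([s]) for s in sentences]

# The conversion from the answer: one flat list of ints per sentence
X_dense = [list(r.toarray()[0]) for r in rows]

n_features = len(cv.vocabulary_)
print(rows[0].shape)            # each row is still 2-D: (1, n_features)
print(np.array(X_dense).shape)  # the converted data is (n_samples, n_features)
```

After the conversion, `X_dense` is an ordinary two-dimensional structure with one row per sentence, which is the layout `fit` expects.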
Now, why do we do this:
X looks like this
>>> X[0]
<1x23 sparse matrix of type '<class 'numpy.int64'>' with 5 stored elements in Compressed Sparse Row format>
So we convert it to see what it actually is:
>>> X[0].toarray()
array([[0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]], dtype=int64)
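Then you can see a small dimension problem: the array is nested one level too deep (its shape is 1x23), so we take the first element.
Converting it back to a list does nothing functionally; it is only there to help you understand what you are looking at. (You can skip that step for speed.)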
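As a possible shortcut (an alternative to the row-by-row approach above, not something the question's code does), the vectorizer can transform the whole column in one call. The sketch below, using a few sentences from the sample output and the illustrative names `X_sparse` and `X_rowwise`, checks that both routes give the same matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Sentences taken from the printed sample in the question
sentences = [
    "good communication skills.",
    "rude behaviour.",
    "lacks time punctuality.",
    "hardly comes on time.",
]

cv = CountVectorizer(binary=True)

# One call on all sentences: a single 2-D sparse matrix, one row per sentence
X_sparse = cv.fit_transform(sentences)

# Row-by-row route, converted to dense, for comparison
X_rowwise = np.array([cv.transform([s]).toarray()[0] for s in sentences])

same = np.array_equal(X_sparse.toarray(), X_rowwise)
print(X_sparse.shape, same)
```

scikit-learn's `SVC` accepts a two-dimensional sparse matrix like `X_sparse` directly, so with this route no dense conversion is needed before calling `fit`.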
Your code now looks like this:
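from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(binary = True)
cv.fit(feedback_data['data'].values)
X = feedback_data['data'].apply(lambda X : cv.transform([X])).values
# turn each 1 x n_features sparse row into a plain list of ints
X = list([list(x.toarray()[0]) for x in X])
clf = svm.SVC(kernel = 'linear', gamma = 0.001, C = 0.05)  # gamma is ignored by the linear kernel
clf.fit(X, target)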
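The question also calls train_test_split but then fits on the full X. A hedged sketch of how the split could be used for a quick validation check is below. It uses ten sentences copied from the sample output with labels 1 = positive and 0 = negative. The `stratify` and `random_state` arguments, the one-call `fit_transform`, and the name `val_accuracy` are assumptions added for a reproducible illustration, not part of the original code.

```python
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Rows copied from the printed sample; 1 = positive, 0 = negative
sentences = [
    "facilitates good student teacher communication.",
    "lectures are very lengthy.",
    "the teacher is very good at interaction.",
    "good at clearing the concepts.",
    "does not shows test copies.",
    "good communication skills.",
    "rude behaviour.",
    "hardly comes on time.",
    "lacks time punctuality.",
    "good teaching methods.",
]
labels = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]

# Vectorize everything first, as the question does, then split
X = CountVectorizer(binary=True).fit_transform(sentences)

# stratify keeps both classes present in each half (assumption for this toy set)
X_train, X_val, y_train, y_val = train_test_split(
    X, labels, train_size=0.5, stratify=labels, random_state=0
)

# gamma is omitted because the linear kernel does not use it
clf = svm.SVC(kernel='linear', C=0.05)
clf.fit(X_train, y_train)

val_accuracy = accuracy_score(y_val, clf.predict(X_val))
print(val_accuracy)
```

With only ten sentences the accuracy value itself means little; the point is that the model is fitted on X_train and y_train and scored on the held-out half.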