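I am building a grading system for my graduation project. I preprocessed the data, ran TfidfVectorizer over it, and fit the model with LinearSVC.
The system contains 265 definitions of varying length, and after vectorizing they have shape (265, 8581). When I feed in a new random sentence to get a prediction, I get an error.
The full code is below if you want to look at it (it is long):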
    import re

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from pandas import DataFrame
    # sklearn.cross_validation was removed in scikit-learn 0.20;
    # newer versions provide the same functions in sklearn.model_selection
    from sklearn import cross_validation
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import classification_report


    def normalize(df):
        # Strip basic punctuation from every token, then join the tokens back together
        lst = []
        for x in range(len(df)):
            text = re.sub(r"[,.'!?]", '', df[x])
            lst.append(text)
        filtered_sentence = ' '.join(lst)
        return filtered_sentence


    def stopWordRemove(df):
        # df is a single lowercased definition string; tokenize it once
        stop = stopwords.words("english")
        needed_words = []
        words = word_tokenize(df)
        for word in words:
            if word not in stop:
                needed_words.append(word)
        return needed_words


    def prepareDataSets(df):
        sentences = []
        for index, d in df.iterrows():
            Definitions = stopWordRemove(d['Definitions'].lower())
            Definitions_normalized = normalize(Definitions)
            if d['Results'] == 'F':
                sentences.append([Definitions, 'false'])
            else:
                sentences.append([Definitions, 'true'])
        df_sentences = DataFrame(sentences, columns=['Definitions', 'Results'])
        for x in range(len(df_sentences)):
            df_sentences['Definitions'][x] = ' '.join(df_sentences['Definitions'][x])
        return df_sentences


    def featureExtraction(data):
        vectorizer = TfidfVectorizer(min_df=10, max_df=0.75, ngram_range=(1,3))
        tfidf_data = vectorizer.fit_transform(data)
        return tfidf_data


    def learning(clf, X, Y):
        X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(
            X, Y, test_size=.2, random_state=43)
        classifier = clf()
        classifier.fit(X_train, Y_train)
        predict = cross_validation.cross_val_predict(classifier, X_test, Y_test, cv=5)
        scores = cross_validation.cross_val_score(classifier, X_test, Y_test, cv=5)
        print(scores)
        print("Accuracy of %s: %0.2f(+/- %0.2f)" % (classifier, scores.mean(), scores.std() * 2))
        print(classification_report(Y_test, predict))
Then I run the following script, and that is where I get the error:

    from sklearn.svm import LinearSVC

    test = LinearSVC()
    data, target = preprocessed_df['Definitions'], preprocessed_df['Results']
    tfidf_data = featureExtraction(data)
    X_train, X_test, Y_train, Y_test = cross_validation.train_test_split(
        tfidf_data, target, test_size=.2, random_state=43)
    test.fit(tfidf_data, target)
    predict = cross_validation.cross_val_predict(test, X_test, Y_test, cv=10)
    scores = cross_validation.cross_val_score(test, X_test, Y_test, cv=10)
    print(scores)
    print("Accuracy of %s: %0.2f(+/- %0.2f)" % (test, scores.mean(), scores.std() * 2))
    print(classification_report(Y_test, predict))

    Xnew = ["machine learning is playing games in home"]
    tvect = TfidfVectorizer(min_df=1, max_df=1.0, ngram_range=(1,3))
    X_test = tvect.fit_transform(Xnew)
    ynew = test.predict(X_test)  # the error is raised here
Answer:
Never call fit_transform() on test data. Call only transform(), and use the same vectorizer that was fitted on the training data.
Do this:
    def featureExtraction(data):
        vectorizer = TfidfVectorizer(min_df=10, max_df=0.75, ngram_range=(1,3))
        tfidf_data = vectorizer.fit_transform(data)

        # Here I am also returning the vectorizer that was used to generate the training data
        return vectorizer, tfidf_data

    ...
    ...
    tfidf_vectorizer, tfidf_data = featureExtraction(data)
    ...
    ...
    # Now use the same vectorizer on the test data
    X_test = tfidf_vectorizer.transform(Xnew)
    ...
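In your code you create a new TfidfVectorizer for the new sentence. It knows nothing about the training data, so it cannot know that the training data has 8581 features. It builds a vocabulary from the new sentence alone, and the resulting matrix has a different number of columns from the one the classifier was trained on.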
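To make the mismatch concrete, here is a minimal sketch. The three training sentences are made up and only stand in for the 265 definitions, and I left out min_df and max_df because such a tiny corpus would not survive them. It compares the column counts you get from a freshly fitted vectorizer and from the one fitted on the training text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the real definitions
train_docs = [
    "machine learning is a field of computer science",
    "a compiler translates source code into machine code",
    "an operating system manages computer hardware",
]
new_docs = ["machine learning is playing games in home"]

# Fit once on the training text; this fixes the vocabulary and the column layout
train_vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X_train = train_vectorizer.fit_transform(train_docs)

# A second vectorizer fitted on the new sentence learns its own, unrelated vocabulary
fresh_vectorizer = TfidfVectorizer(ngram_range=(1, 3))
X_wrong = fresh_vectorizer.fit_transform(new_docs)

# Reusing the fitted vectorizer keeps the training column layout
X_right = train_vectorizer.transform(new_docs)

print("training features:        ", X_train.shape[1])
print("fresh vectorizer features:", X_wrong.shape[1])
print("reused vectorizer features:", X_right.shape[1])
```

Only X_right has the same number of columns as X_train, so it is the only one a classifier trained on X_train can accept.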
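Test data should always be prepared in exactly the same way as the training data. Otherwise, even if you do not get an error, the results are wrong and the model will not perform as expected on real inputs.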
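One way to make that hard to get wrong is to bundle the vectorizer and the classifier in a scikit-learn Pipeline, so the fitted vectorizer travels with the model. This is a different arrangement from the code in the question, not a drop-in replacement, and the toy sentences and labels below are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; the labels mirror the 'true'/'false' values in the question
docs = [
    "machine learning builds models from data",
    "machine learning is playing games at home",
    "a compiler translates source code",
    "a compiler is a kind of kitchen tool",
]
labels = ["true", "false", "true", "false"]

# fit() runs fit_transform() on the vectorizer, then fits the classifier on the result
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)), LinearSVC())
model.fit(docs, labels)

# predict() runs only transform() on the already fitted vectorizer
ynew = model.predict(["machine learning is playing games in home"])
print(ynew)
```

With this setup there is no second vectorizer to create by mistake: raw strings go in at both fit and predict time.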
See my other answers covering the same situation with different feature preprocessing techniques:
- https://stackoverflow.com/a/47205199/3374996
- https://stackoverflow.com/a/50461140/3374996
- https://stackoverflow.com/a/44671967/3374996
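I would have flagged this question as a duplicate of one of the above, but since you are creating an entirely new vectorizer and transforming the training data in a different way, I answered it. Next time, please search first and try to understand what is happening in similar situations before posting a question.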