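I ran into this error and would appreciate suggestions for fixing it. Here is my code. I load the training data from train.csv and the test data from a separate file, test.csv. I'm new to machine learning, so I can't tell where the problem is.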
    import quandl, math
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from matplotlib import style
    import datetime
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import LabelEncoder
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn import metrics

    train = pd.read_csv("train.csv", index_col=None)
    test = pd.read_csv("test.csv", index_col=None)

    vectorizer = CountVectorizer(min_df=1)
    X1 = vectorizer.fit_transform(train['question'])
    Y1 = vectorizer.fit_transform(test['testing'])
    X = X1.toarray()
    Y = Y1.toarray()
    #print(Y.shape)

    number = LabelEncoder()
    train['answer'] = number.fit_transform(train['answer'].astype('str'))
    features = ['question', 'answer']
    y = train['answer']

    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X[:25], y)
    predicted_result = clf.predict(Y[17])
    p_result = number.inverse_transform(predicted_result)

    f = open('output.txt', 'w')
    t = str(p_result)
    f.write(t)
    print(p_result)
Answer:
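Your code has several problems, but the one relevant to this question is that you fit the CountVectorizer (vectorizer) on both the training data and the test data. Each call to fit_transform learns a new vocabulary, so the test matrix ends up with different features (columns) from the ones the classifier was trained on.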
What you should do instead is:
    X1 = vectorizer.fit_transform(train['question'])
    # The following line has been changed
    Y1 = vectorizer.transform(test['testing'])
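To see the effect end to end, here is a minimal self-contained sketch of the fit-on-train, transform-on-test pattern. The toy questions and answers are made up for illustration and stand in for the contents of train.csv and test.csv, which are not shown in the question. The names X_train, X_test, and encoder are my own, and random_state is set only to make the run repeatable.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for train.csv / test.csv.
train_questions = [
    "what is your name",
    "how old are you",
    "where do you live",
    "what is your age",
]
train_answers = ["name", "age", "place", "age"]
test_questions = ["tell me your name", "how old is she"]

vectorizer = CountVectorizer(min_df=1)
# Learn the vocabulary from the training text only.
X_train = vectorizer.fit_transform(train_questions).toarray()
# Reuse that vocabulary for the test text; words unseen in training are ignored.
X_test = vectorizer.transform(test_questions).toarray()

# Both matrices now have one column per training-vocabulary word.
print(X_train.shape, X_test.shape)

encoder = LabelEncoder()
y_train = encoder.fit_transform(train_answers)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# predict expects a 2D array, so slice with 0:1 to keep a single row as shape (1, n_features).
predicted = clf.predict(X_test[0:1])
predicted_labels = encoder.inverse_transform(predicted)
print(predicted_labels)
```

Note the X_test[0:1] slice: indexing a single row as Y[17], as in the question, yields a 1D array, and scikit-learn estimators expect 2D input for predict. Also check that the number of rows passed to clf.fit matches the number of labels in y.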