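I have a dataset with two columns, 'studentDetails' and 'studentId'. I trained a model on this dataset and saved it. When I train the model, save it, and then load it for prediction in the same session, it outputs a result. However, when I load the saved model on its own and use it for prediction, I get the error "CountVectorizer - Vocabulary wasn't fitted".
This is the code I used: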
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import pickle
from sklearn.svm import LinearSVC

X_train, X_test, y_train, y_test = train_test_split(df['studentDetails'], df['studentId'], random_state=0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

classificationModel = LinearSVC().fit(X_train_tfidf, y_train)

filename = 'finalized_model.sav'
pickle.dump(classificationModel, open(filename, 'wb'))
Now load the model and predict:
import pickle
from sklearn.feature_extraction.text import CountVectorizer

data_to_be_predicted = "Alicia Scott is from United States"
filename = 'finalized_model.sav'
loaded_model = pickle.load(open(filename, 'rb'))
count_vect = CountVectorizer()
result = loaded_model.predict(count_vect.transform([data_to_be_predicted]))
print(result)
Output:
94120
When I run the second snippet on its own, it raises an error.
Error:
CountVectorizer - Vocabulary wasn't fitted
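I am curious why the error occurs in the second case, because in the first case I did not redefine count_vect = CountVectorizer() and I got the correct result.
Answer:
The problem with the second snippet is that it does not use the CountVectorizer that was already fitted. It creates a new, unfitted CountVectorizer, which has no vocabulary to transform text with.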
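The difference between a fitted and an unfitted vectorizer can be reproduced in isolation. The following is a minimal sketch, assuming a recent scikit-learn; the two sample sentences are made up for illustration, and the exact wording of the error message varies between scikit-learn versions.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sample text, standing in for df['studentDetails']
docs = ["Alicia Scott is from United States", "Bob Lee is from Canada"]
query = ["Alicia Scott is from United States"]

fitted_vect = CountVectorizer().fit(docs)  # learns a vocabulary from docs
fresh_vect = CountVectorizer()             # new object, no vocabulary yet

# The fitted vectorizer maps text onto the vocabulary it learned
counts = fitted_vect.transform(query)
print(counts.shape)

# The fresh vectorizer has nothing to map onto, so transform() fails
try:
    fresh_vect.transform(query)
    raised = False
except NotFittedError as exc:
    raised = True
    print("NotFittedError:", exc)
```

In the original single-session run, count_vect was still the fitted object from training, which is why prediction worked there.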
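I suggest using fit instead of fit_transform: fit returns the fitted CountVectorizer itself, which you can then save the same way you saved the model. (fit_transform also fits the vectorizer in place, so what matters is that the fitted object gets saved.) Do the same for the TfidfTransformer.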
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import pickle
from sklearn.svm import LinearSVC

X_train, X_test, y_train, y_test = train_test_split(df['studentDetails'], df['studentId'], random_state=0)

# Fit each preprocessing step once and keep the fitted object
count_vect = CountVectorizer().fit(X_train)
X_train_counts = count_vect.transform(X_train)
tfidf_transformer = TfidfTransformer().fit(X_train_counts)
X_train_tfidf = tfidf_transformer.transform(X_train_counts)

classificationModel = LinearSVC().fit(X_train_tfidf, y_train)

# Save the model together with both fitted preprocessing objects
filename = 'finalized_model.sav'
pickle.dump(classificationModel, open(filename, 'wb'))
pickle.dump(count_vect, open('count_vect', 'wb'))
pickle.dump(tfidf_transformer, open('tfidf_transformer', 'wb'))
Now you can load all three files when making a prediction:
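import pickle

data_to_be_predicted = "Alicia Scott is from United States"
filename = 'finalized_model.sav'

# Load the model and the fitted preprocessing objects saved during training
loaded_model = pickle.load(open(filename, 'rb'))
count_vect = pickle.load(open('count_vect', 'rb'))
tfidf_transformer = pickle.load(open('tfidf_transformer', 'rb'))

# Apply the same steps as in training: counts, then tf-idf, then predict
counts = count_vect.transform([data_to_be_predicted])
result = loaded_model.predict(tfidf_transformer.transform(counts))
print(result)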
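A different way to keep the three objects together, not part of the answer above, is to wrap them in a scikit-learn Pipeline and pickle that single object. The sketch below assumes a small made-up dataset in place of df, and the step names and the file name text_clf_pipeline.pkl are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up stand-ins for df['studentDetails'] and df['studentId']
texts = [
    "Alicia Scott is from United States",
    "Alicia Scott studies in United States",
    "Bob Lee is from Canada",
    "Bob Lee studies in Canada",
]
labels = [94120, 94120, 10001, 10001]

# One object that holds the vectorizer, the tf-idf step and the classifier
text_clf = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("svm", LinearSVC()),
])
text_clf.fit(texts, labels)

# Save and reload the whole pipeline as a single file (hypothetical name)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "text_clf_pipeline.pkl")
    with open(path, "wb") as f:
        pickle.dump(text_clf, f)
    with open(path, "rb") as f:
        loaded_clf = pickle.load(f)

# Raw text goes straight in; the pipeline applies every fitted step in order
result = loaded_clf.predict(["Alicia Scott is from United States"])
print(result)
```

With this layout there is only one file to keep in sync, and prediction cannot skip a preprocessing step by accident. As with any pickle, the file should be loaded with the same scikit-learn version that wrote it.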