NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted

I'm trying to build a sentiment analyzer with scikit-learn and pandas. Building and evaluating the model works, but classifying new sample text does not.

My code:

import csv
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

infile = 'Sentiment_Analysis_Dataset.csv'
data = "SentimentText"
labels = "Sentiment"


class Classifier():
    def __init__(self):
        self.train_set, self.test_set = self.load_data()
        self.counts, self.test_counts = self.vectorize()
        self.classifier = self.train_model()

    def load_data(self):
        df = pd.read_csv(infile, header=0, error_bad_lines=False)
        train_set, test_set = train_test_split(df, test_size=.3)
        return train_set, test_set

    def train_model(self):
        classifier = BernoulliNB()
        targets = self.train_set[labels]
        classifier.fit(self.counts, targets)
        return classifier

    def vectorize(self):
        vectorizer = TfidfVectorizer(min_df=5,
                                     max_df=0.8,
                                     sublinear_tf=True,
                                     ngram_range=(1, 2),
                                     use_idf=True)
        counts = vectorizer.fit_transform(self.train_set[data])
        test_counts = vectorizer.transform(self.test_set[data])
        return counts, test_counts

    def evaluate(self):
        test_counts, test_set = self.test_counts, self.test_set
        predictions = self.classifier.predict(test_counts)
        print(classification_report(test_set[labels], predictions))
        print("The accuracy score is {:.2%}".format(accuracy_score(test_set[labels], predictions)))

    def classify(self, input):
        input_text = input
        input_vectorizer = TfidfVectorizer(min_df=5,
                                           max_df=0.8,
                                           sublinear_tf=True,
                                           ngram_range=(1, 2),
                                           use_idf=True)
        input_counts = input_vectorizer.transform(input_text)
        predictions = self.classifier.predict(input_counts)
        print(predictions)


myModel = Classifier()
text = ['I like this I feel good about it', 'give me 5 dollars']
myModel.classify(text)
myModel.evaluate()

The error message:

Traceback (most recent call last):
  File "sentiment.py", line 74, in <module>
    myModel.classify(text)
  File "sentiment.py", line 66, in classify
    input_counts = input_vectorizer.transform(input_text)
  File "/home/rachel/Sentiment/ENV/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 1380, in transform
    X = super(TfidfVectorizer, self).transform(raw_documents)
  File "/home/rachel/Sentiment/ENV/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 890, in transform
    self._check_vocabulary()
  File "/home/rachel/Sentiment/ENV/lib/python3.5/site-packages/sklearn/feature_extraction/text.py", line 278, in _check_vocabulary
    check_is_fitted(self, 'vocabulary_', msg=msg),
  File "/home/rachel/Sentiment/ENV/lib/python3.5/site-packages/sklearn/utils/validation.py", line 690, in check_is_fitted
    raise _NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.exceptions.NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.

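I'm not sure what the problem is. In my classify method I create a brand-new vectorizer to process the text I want to classify, separate from the vectorizer used on the training and test data when building the model.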

Thanks


Answer:

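You did fit a vectorizer, but you throw it away: it is a local variable, so it does not exist beyond the lifetime of your vectorize function. The new TfidfVectorizer you create in classify has never seen any data, so it has no vocabulary_ and transform raises NotFittedError. Instead, keep the fitted vectorizer by saving it on the instance inside vectorize, after the fit_transform call: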

self._vectorizer = vectorizer

Then, in your classify function, don't create a new vectorizer. Instead, use the one you already fitted on the training data:

input_counts = self._vectorizer.transform(input_text)
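
To see the two changes working together, here is a minimal sketch rather than a drop-in replacement for the class in the question. The TinyClassifier name and the four-sentence training set are made up for illustration, and min_df and max_df are left at their defaults because the toy data is far too small for the question's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB


class TinyClassifier:
    """Toy stand-in for the question's Classifier (hypothetical name and data)."""

    def __init__(self, texts, labels):
        # Fit the vectorizer once and keep it on the instance so that
        # classify() can reuse the learned vocabulary and IDF weights.
        self._vectorizer = TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2))
        counts = self._vectorizer.fit_transform(texts)
        self.classifier = BernoulliNB()
        self.classifier.fit(counts, labels)

    def classify(self, input_text):
        # Only transform here: the vectorizer was already fitted in __init__.
        input_counts = self._vectorizer.transform(input_text)
        return self.classifier.predict(input_counts)


# Made-up training data, just enough to fit a vocabulary.
train_texts = ["I like this", "I feel good about it", "this is terrible", "I hate it"]
train_labels = [1, 1, 0, 0]

model = TinyClassifier(train_texts, train_labels)
predictions = model.classify(["I like this I feel good about it", "give me 5 dollars"])
print(predictions)
```

Because the same fitted vectorizer is used at prediction time, new text is mapped onto the same feature columns the classifier was trained on. Words that never appeared in the training data are simply ignored.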
