ValueError: dimension mismatch when trying to predict on the test set

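I'm new to machine learning and I'm struggling to get my classifier to make predictions on the test dataset.

At first I thought the dimension mismatch error was caused by fitting the vectorizer on the test set, but I have fixed that and the error is still there.

I now suspect the vectorizer is being overwritten somewhere, but I've looked and can't find where…

Any help would be much appreciated, I've been stuck on this for a long time 🙂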

    import sqlalchemy
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score
    from sklearn import metrics
    import pickle

    ### Connect to MYSQL database###
    dbServerName = "localhost"
    dbUser = "root"
    dbPassword = "woodycool123"
    dbName = "azure_support_tweets"
    engine = sqlalchemy.create_engine('mysql+pymysql://root:woodycool123@localhost:3306/azure_support_tweets')
    pd.set_option('display.max_colwidth', -1)
    df = pd.read_sql_table("preprocessed_tweets", engine)
    data = pd.DataFrame(df)

    ### Training and Test Data Split###
    features_train, features_test, labels_train, labels_test = train_test_split(data['text_tweet'], data['main_category'], random_state = 42, test_size=0.34)

    ### CountVectorizer###
    cv = CountVectorizer(ngram_range=(1,2), stop_words='english', min_df=3, max_df=0.50)
    features_train_cv = cv.fit_transform(features_train)
    # Uncomment to print a matrix count of tokens
    # print(features_train_cv.toarray())
    print("Feature Count\nCountVectorizer() #", len(cv.get_feature_names()))

    ### TF-IDF Transformer###
    tfidfv = TfidfTransformer(use_idf=True)
    features_train_tfidfv = tfidfv.fit_transform(features_train_cv)
    print("Feature Set\nTfidfVectorizer() #", features_train_tfidfv.shape)
    # Remove to print the top 10 features
    # features = tfidfv.get_feature_names()
    # feature_order = np.argsort(tfidfv.idf_)[::-1]
    # top_n = 10
    # top_n_features = [features[i] for i in feature_order[:top_n]]
    # print(top_n_features)

    ### SelectKBest###
    selector = SelectKBest(chi2, k=1000).fit_transform(features_train_tfidfv, labels_train)
    print("Feature Set\nSelectKBest() and chi2 #", selector.shape)

    ### Train Model###
    clf = MultinomialNB()
    clf.fit(selector, labels_train)

    ### Test Model###
    features_test_cv = cv.transform(features_test)
    features_test_cv_two = tfidfv.transform(features_test_cv)
    pred = clf.predict(features_test_cv)

Error:

    Traceback (most recent call last):
      File "/Users/bethwalsh/Documents/classifier-twitter/building_the_classifer/feature_generation_selection.py", line 76, in <module>
        pred = clf.predict(features_test_cv)
      File "/Users/bethwalsh/anaconda3/lib/python3.6/site-packages/sklearn/naive_bayes.py", line 66, in predict
        jll = self._joint_log_likelihood(X)
      File "/Users/bethwalsh/anaconda3/lib/python3.6/site-packages/sklearn/naive_bayes.py", line 725, in _joint_log_likelihood
        return (safe_sparse_dot(X, self.feature_log_prob_.T) +
      File "/Users/bethwalsh/anaconda3/lib/python3.6/site-packages/sklearn/utils/extmath.py", line 135, in safe_sparse_dot
        ret = a * b
      File "/Users/bethwalsh/anaconda3/lib/python3.6/site-packages/scipy/sparse/base.py", line 515, in __mul__
        raise ValueError('dimension mismatch')
    ValueError: dimension mismatch

Answer:

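You need to pass the test set through the selector as well. To do that, you first have to fit the selector and keep the fitted object, rather than keeping only the output of fit_transform as your code does: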

    selector = SelectKBest(chi2, k=1000)
    selector.fit(features_train_tfidfv, labels_train)
    clf = MultinomialNB()
    clf.fit(selector.transform(features_train_tfidfv), labels_train)
    features_test_cv = selector.transform(tfidfv.transform(cv.transform(features_test)))
    pred = clf.predict(features_test_cv)

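The error is raised because the selector reduced the dimensionality of the training set but not of the test set: the classifier was trained on the 1000 selected columns, while the test matrix still has one column per vocabulary entry. Note also that the question's code passes features_test_cv (the raw counts) to predict instead of the TF-IDF output features_test_cv_two, so the test data has to go through all three fitted steps in order.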
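One way to avoid forgetting a step at prediction time is to chain everything in a scikit-learn Pipeline, which fits each step on the training data and replays the same fitted steps on whatever is passed to predict. Below is a minimal sketch of that idea, not the asker's actual setup: the toy texts, the labels, and k=5 are made up for illustration so that it runs without the database.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the tweets table in the question.
train_texts = [
    "vm will not start after reboot",
    "cannot start my vm today",
    "vm restart keeps failing",
    "billing invoice shows wrong amount",
    "wrong charge on my invoice",
    "invoice amount is too high",
]
train_labels = ["compute", "compute", "compute", "billing", "billing", "billing"]
test_texts = ["my vm fails to start", "invoice charge looks wrong"]

# Each step is fitted on the training data only; predict() then applies
# the same fitted transforms to the test data, so the column counts agree.
pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("select", SelectKBest(chi2, k=5)),  # k=5 only because the toy vocabulary is tiny
    ("clf", MultinomialNB()),
])
pipeline.fit(train_texts, train_labels)
pred = pipeline.predict(test_texts)

# Full vocabulary width versus the width the classifier was trained on.
n_vocab = len(pipeline.named_steps["counts"].vocabulary_)
n_selected = int(pipeline.named_steps["select"].get_support().sum())
print(n_vocab, n_selected, list(pred))
```

With the real data you would presumably keep the CountVectorizer arguments and k=1000 from the question; the point is only that the selector lives inside the pipeline, so it is applied to the test set automatically.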
