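I'm new to machine learning and I'm trying to get a classifier to make predictions on a test dataset.
I originally thought the dimension mismatch error came from fitting the vectorizer on the test set, but I've fixed that and the error is still there.
I now suspect the vectorizer is being overwritten somewhere, but I've looked and can't find where...
Any help would be greatly appreciated, I've been stuck on this for a long time 🙂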
    import sqlalchemy
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score
    from sklearn import metrics
    import pickle

    ### Connect to MYSQL database ###
    dbServerName = "localhost"
    dbUser = "root"
    dbPassword = "woodycool123"
    dbName = "azure_support_tweets"
    engine = sqlalchemy.create_engine('mysql+pymysql://root:woodycool123@localhost:3306/azure_support_tweets')
    pd.set_option('display.max_colwidth', -1)
    df = pd.read_sql_table("preprocessed_tweets", engine)
    data = pd.DataFrame(df)

    ### Training and Test Data Split ###
    features_train, features_test, labels_train, labels_test = train_test_split(data['text_tweet'], data['main_category'], random_state = 42, test_size=0.34)

    ### CountVectorizer ###
    cv = CountVectorizer(ngram_range=(1,2), stop_words='english', min_df=3, max_df=0.50)
    features_train_cv = cv.fit_transform(features_train)
    # Uncomment to print a matrix count of tokens
    # print(features_train_cv.toarray())
    print("Feature Count\nCountVectorizer() #", len(cv.get_feature_names()))

    ### TF-IDF Transformer ###
    tfidfv = TfidfTransformer(use_idf=True)
    features_train_tfidfv = tfidfv.fit_transform(features_train_cv)
    print("Feature Set\nTfidfVectorizer() #", features_train_tfidfv.shape)
    # Remove to print the top 10 features
    # features = tfidfv.get_feature_names()
    # feature_order = np.argsort(tfidfv.idf_)[::-1]
    # top_n = 10
    # top_n_features = [features[i] for i in feature_order[:top_n]]
    # print(top_n_features)

    ### SelectKBest ###
    selector = SelectKBest(chi2, k=1000).fit_transform(features_train_tfidfv, labels_train)
    print("Feature Set\nSelectKBest() and chi2 #", selector.shape)

    ### Train Model ###
    clf = MultinomialNB()
    clf.fit(selector, labels_train)

    ### Test Model ###
    features_test_cv = cv.transform(features_test)
    features_test_cv_two = tfidfv.transform(features_test_cv)
    pred = clf.predict(features_test_cv)
Error:

    Traceback (most recent call last):
      File "/Users/bethwalsh/Documents/classifier-twitter/building_the_classifer/feature_generation_selection.py", line 76, in <module>
        pred = clf.predict(features_test_cv)
      File "/Users/bethwalsh/anaconda3/lib/python3.6/site-packages/sklearn/naive_bayes.py", line 66, in predict
        jll = self._joint_log_likelihood(X)
      File "/Users/bethwalsh/anaconda3/lib/python3.6/site-packages/sklearn/naive_bayes.py", line 725, in _joint_log_likelihood
        return (safe_sparse_dot(X, self.feature_log_prob_.T) +
      File "/Users/bethwalsh/anaconda3/lib/python3.6/site-packages/sklearn/utils/extmath.py", line 135, in safe_sparse_dot
        ret = a * b
      File "/Users/bethwalsh/anaconda3/lib/python3.6/site-packages/scipy/sparse/base.py", line 515, in __mul__
        raise ValueError('dimension mismatch')
    ValueError: dimension mismatch
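Answer:
You need to run the test set through the selector as well, but first you have to fit the selector and keep the fitted object: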
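    # Keep the fitted selector itself, not just its transformed output
    selector = SelectKBest(chi2, k=1000)
    selector.fit(features_train_tfidfv, labels_train)

    # Train on the selected features
    clf = MultinomialNB()
    clf.fit(selector.transform(features_train_tfidfv), labels_train)

    # Apply the same three fitted steps to the test set: counts -> tf-idf -> selection
    features_test_cv = selector.transform(tfidfv.transform(cv.transform(features_test)))
    pred = clf.predict(features_test_cv)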
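It throws this error because the selector reduced the dimensionality of the training set but not of the test set: the classifier was trained on the 1000 selected features, while the matrix passed to predict still had every feature produced by the CountVectorizer.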
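
To make the shape mismatch concrete, here is a minimal sketch on made-up toy data. The texts, the labels "compute" and "billing", and k=5 are assumptions chosen only so the example runs without the database; they are not from the question. The second half wraps the same steps in scikit-learn's Pipeline, which is an alternative to the code above rather than part of it.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical stand-ins for features_train / labels_train / features_test
train_texts = [
    "vm will not start after resize",
    "cannot start my vm this morning",
    "vm stuck while starting",
    "invoice shows wrong billing amount",
    "billing charge looks wrong on invoice",
    "question about my billing invoice",
]
train_labels = ["compute", "compute", "compute", "billing", "billing", "billing"]
test_texts = ["my vm does not start", "wrong amount on my invoice"]

cv = CountVectorizer()
tfidf = TfidfTransformer()
selector = SelectKBest(chi2, k=5)  # k=5 only because the toy vocabulary is tiny

# Fit every step on the training data only
X_train_full = tfidf.fit_transform(cv.fit_transform(train_texts))
X_train_sel = selector.fit_transform(X_train_full, train_labels)

# Only transform the test data, using the already fitted steps
X_test_full = tfidf.transform(cv.transform(test_texts))
X_test_sel = selector.transform(X_test_full)

print("train, before selection:", X_train_full.shape)
print("train, after selection: ", X_train_sel.shape)
print("test, before selection: ", X_test_full.shape)
print("test, after selection:  ", X_test_sel.shape)

clf = MultinomialNB().fit(X_train_sel, train_labels)

# Skipping the selector on the test set reproduces the problem
try:
    clf.predict(X_test_full)
except ValueError as err:
    print("predict on unselected features failed:", err)

pred = clf.predict(X_test_sel)
print(pred)

# The same steps as a Pipeline: fit() fits each step on the training data,
# predict() applies every fitted step to the test data in order.
pipe = Pipeline([
    ("cv", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("select", SelectKBest(chi2, k=5)),
    ("clf", MultinomialNB()),
])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(test_texts)
print(pipe_pred)
```

With a Pipeline there is no intermediate variable to overwrite and no step to forget at prediction time, so it may be worth considering once the manual version works.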