I'm trying to run predictions on a pandas DataFrame in Python. For some reason, CountVectorizer cannot transform the data. Does anyone know what is causing this?
Here is my code:
    filename = 'final_model.sav'
    print(response.status_code)
    data = response.json()
    print(data)

    dictionary = pd.read_json('rating_company_small.json', lines=True)
    dictionary_df = pd.DataFrame()
    dictionary_df["comment text"] = dictionary["comment"]

    data = pd.DataFrame.from_dict(json_normalize(data), orient='columns')
    print(data)

    df = pd.DataFrame()
    df["comment text"] = data["Text"]
    df["status"] = data["Status"]
    print(df)

    Processing.dataframe_cleaning(df)
    comment_data = df['comment text']

    tfidf = CountVectorizer()
    tfidf.fit(dictionary_df["comment text"])
    Test_X_Tfidf = tfidf.transform(df["comment text"])
    print(comment_data)
    print(Test_X_Tfidf)

    loaded_model = pickle.load(open(filename, 'rb'))
    predictions_NB = loaded_model.predict(Test_X_Tfidf)
This is the DataFrame:
                             comment text    status
    0                   [slecht, bedrijf]    string
    1  [leuk, bedrijfje, goed, behandeld]  Approved
    2  [leuk, bedrijfje, goed, behandeld]  Approved
    3                   [leuk, bedrijfje]  Approved
The full error message:
    Traceback (most recent call last):
      File "Request.py", line 36, in <module>
        Test_X_Tfidf = tfidf.transform(df["comment text"])
      File "C:\Users\junio\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 1112, in transform
        _, X = self._count_vocab(raw_documents, fixed_vocab=True)
      File "C:\Users\junio\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 970, in _count_vocab
        for feature in analyze(doc):
      File "C:\Users\junio\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 352, in <lambda>
        tokenize(preprocess(self.decode(doc))), stop_words)
      File "C:\Users\junio\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 256, in <lambda>
        return lambda x: strip_accents(x.lower())
    AttributeError: 'list' object has no attribute 'lower'
I expected it to return predictions for the DataFrame.
Answer:
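CountVectorizer cannot work directly on a Series of lists, which is why you get this error (lower is a string method, and each document here is a list). It looks like you want MultiLabelBinarizer instead, which can handle this input structure: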
    from sklearn.preprocessing import MultiLabelBinarizer

    count_vec = MultiLabelBinarizer()
    mlb = count_vec.fit(df["comment text"])
    # Pass classes_ directly; wrapping it in a list would create MultiIndex columns
    pd.DataFrame(mlb.transform(df["comment text"]), columns=mlb.classes_)

       bedrijf  bedrijfje  behandeld  goed  leuk  slecht
    0        1          0          0     0     0       1
    1        0          1          1     1     1       0
    2        0          1          1     1     1       0
    3        0          1          0     0     1       0
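However, this approach does not account for repeated elements within a list: each output value can only be 0 or 1. If you want actual counts instead, you can first join each list into a string and then use CountVectorizer, since it expects strings: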
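    text = df["comment text"].map(' '.join)
    count_vec = CountVectorizer()
    cv = count_vec.fit(text)
    # Take the column names from the fitted vectorizer itself
    # (use get_feature_names() on scikit-learn versions before 1.0)
    pd.DataFrame(cv.transform(text).toarray(), columns=cv.get_feature_names_out())

       bedrijf  bedrijfje  behandeld  goed  leuk  slecht
    0        1          0          0     0     0       1
    1        0          1          1     1     1       0
    2        0          1          1     1     1       0
    3        0          1          0     0     1       0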
Note that this is not the same as the tf-idf of the input strings; here you only get the raw counts. For tf-idf you can use TfidfVectorizer, which on the same example yields:
        bedrijf  bedrijfje  behandeld      goed      leuk    slecht
    0  0.707107   0.000000   0.000000  0.000000  0.000000  0.707107
    1  0.000000   0.444931   0.549578  0.549578  0.444931  0.000000
    2  0.000000   0.444931   0.549578  0.549578  0.444931  0.000000
    3  0.000000   0.707107   0.000000  0.000000  0.707107  0.000000