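AttributeError: 'list' object has no attribute 'lower' when using CountVectorizer

I am trying to make predictions on a pandas DataFrame in Python. For some reason, CountVectorizer cannot transform the data. Does anyone know what is causing this?

Here is my code: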

    filename = 'final_model.sav'
    print(response.status_code)
    data = response.json()
    print(data)
    dictionary = pd.read_json('rating_company_small.json', lines=True)
    dictionary_df = pd.DataFrame()
    dictionary_df["comment text"] = dictionary["comment"]
    data = pd.DataFrame.from_dict(json_normalize(data), orient='columns')
    print(data)
    df = pd.DataFrame()
    df["comment text"] = data["Text"]
    df["status"] = data["Status"]
    print(df)
    Processing.dataframe_cleaning(df)
    comment_data = df['comment text']
    tfidf = CountVectorizer()
    tfidf.fit(dictionary_df["comment text"])
    Test_X_Tfidf = tfidf.transform(df["comment text"])
    print(comment_data)
    print(Test_X_Tfidf)
    loaded_model = pickle.load(open(filename, 'rb'))
    predictions_NB = loaded_model.predict(Test_X_Tfidf)

This is the DataFrame:

                             comment text    status
    0                   [slecht, bedrijf]    string
    1  [leuk, bedrijfje, goed, behandeld]  Approved
    2  [leuk, bedrijfje, goed, behandeld]  Approved
    3                   [leuk, bedrijfje]  Approved

The full error message:

    Traceback (most recent call last):
      File "Request.py", line 36, in <module>
        Test_X_Tfidf = tfidf.transform(df["comment text"])
      File "C:\Users\junio\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 1112, in transform
        _, X = self._count_vocab(raw_documents, fixed_vocab=True)
      File "C:\Users\junio\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 970, in _count_vocab
        for feature in analyze(doc):
      File "C:\Users\junio\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 352, in <lambda>
        tokenize(preprocess(self.decode(doc))), stop_words)
      File "C:\Users\junio\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 256, in <lambda>
        return lambda x: strip_accents(x.lower())
    AttributeError: 'list' object has no attribute 'lower'

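I expect it to return predictions for the DataFrame.

Answer:

CountVectorizer cannot directly handle a Series whose elements are lists. It expects each document to be a string and calls lower() on it, which is a string method, so a list triggers this error. It looks like you need MultiLabelBinarizer, which can handle this input structure: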

    from sklearn.preprocessing import MultiLabelBinarizer

    count_vec = MultiLabelBinarizer()
    mlb = count_vec.fit(df["comment text"])
    # classes_ holds the sorted labels, used here as column names
    pd.DataFrame(mlb.transform(df["comment text"]), columns=mlb.classes_)

       bedrijf  bedrijfje  behandeld  goed  leuk  slecht
    0        1          0          0     0     0       1
    1        0          1          1     1     1       0
    2        0          1          1     1     1       0
    3        0          1          0     0     1       0

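However, the approach above ignores repeated elements within a list: each output value can only be 0 or 1. If you need actual counts instead, join each list into a string first and then use CountVectorizer, since it expects strings: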

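    from sklearn.feature_extraction.text import CountVectorizer

    # Join each token list into one space-separated string
    text = df["comment text"].map(' '.join)
    count_vec = CountVectorizer()
    cv = count_vec.fit(text)
    # Take the column names from the fitted vectorizer itself
    # (get_feature_names_out requires scikit-learn >= 1.0)
    pd.DataFrame(cv.transform(text).toarray(), columns=cv.get_feature_names_out())

       bedrijf  bedrijfje  behandeld  goed  leuk  slecht
    0        1          0          0     0     0       1
    1        0          1          1     1     1       0
    2        0          1          1     1     1       0
    3        0          1          0     0     1       0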
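The sample data contains no repeated tokens, so both approaches produce the same table. A minimal sketch of where they differ, using a made-up Series (`tokens` is a hypothetical example, not the asker's data) in which one comment repeats a word:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical token lists; the first one repeats "goed"
tokens = pd.Series([["goed", "goed", "bedrijf"], ["slecht", "bedrijf"]])

# MultiLabelBinarizer only records presence or absence of each token
mlb = MultiLabelBinarizer()
binary = mlb.fit_transform(tokens)

# CountVectorizer on the joined strings records how often each token occurs
cv = CountVectorizer()
counts = cv.fit_transform(tokens.map(" ".join)).toarray()

# Recover the column order from the fitted vocabulary (term -> column index)
cv_columns = sorted(cv.vocabulary_, key=cv.vocabulary_.get)

print(pd.DataFrame(binary, columns=mlb.classes_))
print(pd.DataFrame(counts, columns=cv_columns))
```

In this sketch the "goed" column of the first row is 1 in the binarized output and 2 in the count output. Keep in mind that CountVectorizer's default token pattern drops single-character tokens and lowercases the text, so the two vocabularies can differ on real data.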

Note that these counts are not the same as the tf-idf of the input strings; here you only get the raw counts. For tf-idf, you can use TfidfVectorizer, which for the same example produces:

        bedrijf  bedrijfje  behandeld      goed      leuk    slecht
    0  0.707107   0.000000   0.000000  0.000000  0.000000  0.707107
    1  0.000000   0.444931   0.549578  0.549578  0.444931  0.000000
    2  0.000000   0.444931   0.549578  0.549578  0.444931  0.000000
    3  0.000000   0.707107   0.000000  0.000000  0.707107  0.000000
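The answer does not show the code behind the tf-idf table. A sketch that should reproduce it, assuming the same four comments as in the question and TfidfVectorizer's default settings (the DataFrame is rebuilt here only to keep the snippet self-contained):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Rebuild the "comment text" column from the question
df = pd.DataFrame({"comment text": [
    ["slecht", "bedrijf"],
    ["leuk", "bedrijfje", "goed", "behandeld"],
    ["leuk", "bedrijfje", "goed", "behandeld"],
    ["leuk", "bedrijfje"],
]})

# TfidfVectorizer also expects strings, so join the token lists first
text = df["comment text"].map(" ".join)

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(text).toarray()

# Column order follows the fitted vocabulary (term -> column index)
columns = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)
result = pd.DataFrame(weights, columns=columns)
print(result.round(6))
```

To score new comments against an existing model, the vectorizer would normally be fitted once on the training text and then only `transform` would be called on the joined new comments, so that the columns match what the model was trained on.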
