I am trying to use the Naive Bayes classifier in NLTK to classify a dataset of tweets. However, instead of classifying a single sentence, like this:
classifier.classify(toDict("this is good"))
I need to classify the whole dataset, like this:
classifier.classify(toDict(tweets))
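In other words, I need to classify the entire dataset rather than one sentence at a time. The attempt I have already made is commented out in the code below.
Here is the rest of my code: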
import nltk
import numpy as np
import pandas as pd
import re
import random
from pandas import DataFrame
from nltk import *
from nltk import classify
from nltk import NaiveBayesClassifier
from nltk import FreqDist
from nltk.corpus import stopwords
from nltk.corpus import twitter_samples
from nltk.tokenize import word_tokenize
from nltk.tokenize import WhitespaceTokenizer as w_tokenizer
nltk.download('punkt')
nltk.download('stopwords')

def toDict(word):
    return {word: True}

posDataset = [(tweet_dict, "Positive") for tweet_dict in posModel]
negDataset = [(tweet_dict, "Negative") for tweet_dict in negModel]
trainingDataset = posDataset + negDataset
random.shuffle(trainingDataset)
trainData, testData = trainingDataset[8000:], trainingDataset[:6000]
classifier = NaiveBayesClassifier.train(trainData)
#print(classifier.classify(dict((item, True) for item in tweets)))
#classifier.classify()
classifier.classify(toDict("this is good"))
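`tweets` is a Pandas Series whose items are stored as lists. A sample image is linked here.
When I run the commented-out code, I get the following error message: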
TypeError                                 Traceback (most recent call last)
<ipython-input-28-957eed734b8a> in <module>
     20 classifier.classify(toDict("this is good"))
     21 
---> 22 print(classifier.classify(dict((item, True) for item in tweets)))
     23 #classifier.classify()

TypeError: unhashable type: 'list'
Answer:
The example below should give you a general how-to.
Create a test dataframe
>>> import pandas as pd
>>> df = pd.DataFrame({"text": ["this is a text", "that was a text", "but, you were a text"]})
>>> df
                   text
0        this is a text
1       that was a text
2  but, you were a text
Define a classification function (example)
>>> def classify_by_size(x):
...     size = len(x)
...     if size < 15:
...         return "small"
...     elif size > 15:
...         return "big"
...     return "medium"
...
Note: the function above stands in for your own classification method.
Classify
>>> df["new_column"] = df["text"].apply(classify_by_size)
>>> df
                   text new_column
0        this is a text      small
1       that was a text     medium
2  but, you were a text        big
So, in your case, you would have something like this:
def my_classification(x): return classifier.classify(toDict(x))
And the calling code:
df["new_column"] = df["text"].apply(my_classification)
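One caveat: if each item in the Series is a list of tokens, as `tweets` is described in the question, passing that list to `toDict` would use the list itself as a dictionary key and raise the same `unhashable type: 'list'` error. Below is a minimal sketch of a per-tweet variant. The helper name `tokens_to_features` is hypothetical, and the training data is toy data invented for illustration, not your real `posModel`/`negModel`.

```python
import pandas as pd
from nltk import NaiveBayesClassifier


def tokens_to_features(tokens):
    """Turn one tweet's token list into the {token: True} dict NLTK expects."""
    return {token: True for token in tokens}


# Toy training data, invented purely for illustration.
train_data = [
    (tokens_to_features(["this", "is", "good"]), "Positive"),
    (tokens_to_features(["really", "good", "day"]), "Positive"),
    (tokens_to_features(["this", "is", "bad"]), "Negative"),
    (tokens_to_features(["really", "bad", "day"]), "Negative"),
]
classifier = NaiveBayesClassifier.train(train_data)

# A Series whose items are token lists, mirroring the described shape of `tweets`.
tweets = pd.Series([["good", "day"], ["bad", "day"]])

# Build one feature dict per tweet, so a list is never used as a dict key.
labels = tweets.apply(
    lambda tokens: classifier.classify(tokens_to_features(tokens))
)
```

Here `labels` is a Series with one predicted label per tweet, aligned with the index of `tweets`.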
This may work well for small dataframes.
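As an alternative to calling the classifier row by row, NLTK classifiers also expose `classify_many`, which takes a list of feature dicts and returns a list of labels in a single call. The sketch below assumes the tweets are already tokenized; the training data and `tokenized_tweets` are made up for illustration. Whether this is any faster than `apply` is not something I have measured, so treat it as a convenience rather than an optimization.

```python
from nltk import NaiveBayesClassifier

# Toy training data, invented purely for illustration.
train_data = [
    ({"good": True, "day": True}, "Positive"),
    ({"great": True, "good": True}, "Positive"),
    ({"bad": True, "day": True}, "Negative"),
    ({"awful": True, "bad": True}, "Negative"),
]
classifier = NaiveBayesClassifier.train(train_data)

# Token lists standing in for the tweets to be labelled (hypothetical data).
tokenized_tweets = [["good", "morning"], ["bad", "traffic"], ["great", "day"]]

# One feature dict per tweet, then a single call to label them all.
featuresets = [{token: True for token in tokens} for tokens in tokenized_tweets]
predicted = classifier.classify_many(featuresets)
```

If the tweets live in a dataframe column, something like `df["tokens"].tolist()` could supply `tokenized_tweets`, and `predicted` can be assigned back as a new column because it keeps the input order.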