I am trying to classify movies by genre using NLTK's Naive Bayes classifier. However, I am getting strange results: at the moment it guesses based only on how many movies of each genre were fed in.
If I train on two action movies and one comedy, every guess comes back as action. Naturally, I want it to classify based on the text of the input:
import re
import nltk

def RemoveStopWords(wordText):
    keep_list = []
    for word in wordText:
        if word not in wordStop:
            keep_list.append(word.lower())
    return set(keep_list)

def getFeatures(element):
    splitter = re.compile('\\W*')
    f = {}
    plot = [s for s in RemoveStopWords(splitter.split(element['imdb']['plot']))
            if len(s) > 5 and len(s) < 15]
    for w in plot:
        f[w] = w
    return f

def FindFeaturesForList(MovieList):
    featureSet = []
    for w in MovieList:
        print w['imdb']['title']
        try:
            for genre in w['imdb']['genres']:
                featureSet.append((getFeatures(w), genre))
        except:
            print "Error when retrieving genre, skipping element"
    return featureSet

featureList = FindFeaturesForList(trainset)
cl = nltk.NaiveBayesClassifier.train(featureList)
So whenever I run cl.classify(movie), it always returns the genre that occurred most often in the training input. What am I doing wrong?
Answer:
In the movie review classification example from the NLTK book, note that the frequencies of all words across all the reviews are collected first, and then only the most common words are selected as feature keys.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# In the NLTK version the book uses, keys() returns samples sorted by
# decreasing frequency, so this keeps the 2000 most common words.
word_features = all_words.keys()[:2000]
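For reference, the book then turns those keys into boolean presence features for each document. A sketch along the lines of the book's document_features function (reproduced from the book's text classification chapter, so treat the exact naming as approximate):

def document_features(document):
    # One boolean feature per candidate word: is it present in the document?
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features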
I want to point out that this is just one choice: nothing forces you to pick the feature keys this way, and other clever feature selections may well yield a better classifier. Choosing good features is the art behind this science.
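For instance, rather than ranking words by overall frequency, you could prefer words whose frequency varies strongly across genres, since those are the words that actually separate the classes. A purely illustrative sketch of that idea (not part of the answer below; it reuses trainset and RemoveStopWords from the question):

splitter = re.compile('\\W*')
# Count words separately per genre.
cfd = nltk.ConditionalFreqDist(
    (genre, s)
    for movie in trainset
    for genre in movie['imdb']['genres']
    for s in RemoveStopWords(splitter.split(movie['imdb']['plot'])))

candidates = set(s for genre in cfd.conditions() for s in cfd[genre])

def spread(word):
    # Range of per-genre relative frequencies: high for genre-specific words.
    freqs = [cfd[genre].freq(word) for genre in cfd.conditions()]
    return max(freqs) - min(freqs)

word_features = sorted(candidates, key=spread, reverse=True)[:2000]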
In any case, perhaps you can try the same idea in your classifier:
def getFeatures(text, word_features):
    # One boolean feature per candidate word: does it occur in the plot text?
    text = text.lower()
    f = {word: word in text for word in word_features}
    return f

def FindFeaturesForList(MovieList):
    featureSet = []
    splitter = re.compile('\\W*')
    # Collect word frequencies over all the plots first, then keep the
    # most common words as the shared feature keys.
    all_words = nltk.FreqDist(
        s.lower() for w in MovieList
        for s in RemoveStopWords(splitter.split(w['imdb']['plot']))
        if len(s) > 5 and len(s) < 15)
    word_features = all_words.keys()[:2000]
    for w in MovieList:
        print w['imdb']['title']
        try:
            for genre in w['imdb']['genres']:
                featureSet.append(
                    (getFeatures(w['imdb']['plot'], word_features), genre))
        except:
            print "Error when retrieving genre, skipping element"
    return featureSet
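One practical detail: word_features is local to FindFeaturesForList, but the exact same keys are needed when you classify an unseen plot, or the feature dictionaries will not line up with what the classifier was trained on. A minimal end-to-end sketch, assuming you also return word_features from the function (that extra return value and new_plot are hypothetical, not part of the code above):

featureSet, word_features = FindFeaturesForList(trainset)  # assumes the extra return value
cl = nltk.NaiveBayesClassifier.train(featureSet)

# At prediction time, build the features with the SAME keys used in training:
new_plot = "A retired cop is pulled back in for one last heist."
print cl.classify(getFeatures(new_plot, word_features))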