
How do I set custom stop words for sklearn's CountVectorizer?

IT Technology · xiaolong · April 12, 2025

I am trying to run LDA (Latent Dirichlet Allocation) on a non-English text dataset.

The sklearn tutorial includes a step that computes term frequencies to feed into LDA:

tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')

I believe this built-in stop word option only works for English. How can I use my own stop word list instead?

Answer:

You can pass your own list of words to the stop_words parameter, for example:

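my_stop_words = ["word1", "word2", "word3"]
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features, stop_words=my_stop_words)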
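To see a custom list in action, here is a minimal sketch. The German stop words and the three sample documents are made up purely for illustration, and the other parameters from the question (max_df, min_df, max_features) are left at their defaults so that the tiny corpus is not filtered away.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stop words and documents, chosen only for illustration.
custom_stop_words = ["der", "die", "das", "und", "ist"]
docs = [
    "der Hund und die Katze",
    "die Katze ist klein",
    "das Haus ist alt und der Hund ist klein",
]

# Pass the custom list through the stop_words parameter.
tf_vectorizer = CountVectorizer(stop_words=custom_stop_words)
tf = tf_vectorizer.fit_transform(docs)

# vocabulary_ maps each kept term to its column index in the count matrix.
vocabulary = sorted(tf_vectorizer.vocabulary_)
print(vocabulary)   # ['alt', 'haus', 'hund', 'katze', 'klein']
print(tf.shape)     # (3, 5)
```

Note that CountVectorizer lowercases text by default before stop words are removed, so the entries in the custom list should be lowercase as well.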

