How to force sklearn's CountVectorizer not to remove special characters (e.g. #, @, $ or %)

Here is my code:

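from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(lowercase=False)
vocabulary = count.fit_transform([words])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
print(count.get_feature_names())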

For example, if:

 words = "Hello @friend, this is a good day. #good."

I would like it to be split like this:

['Hello', '@friend', 'this', 'is', 'a', 'good', 'day', '#good']

Currently, it is split like this:

['Hello', 'friend', 'this', 'is', 'a', 'good', 'day']

Answer:

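You can use the token_pattern parameter of CountVectorizer, as described in the documentation:

Pass a regular expression that tells CountVectorizer what should count as a word. Suppose that in this case we tell CountVectorizer that words containing # or @ should also be treated as a single word. Then do this: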

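# ',' and '.' are left out of the character class so that trailing punctuation is not attached to tokens
count = CountVectorizer(lowercase=False, token_pattern=r'[a-zA-Z0-9$&+:;=?@#|<>^*()%!-]+')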

Output:

['#good', '@friend', 'Hello', 'a', 'day', 'good', 'is', 'this']
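
To see why the pattern matters, here is a minimal sketch that approximates the tokenization step with the standard `re` module only. It assumes that CountVectorizer extracts tokens by applying `token_pattern` with `findall`, and that its default pattern is `(?u)\b\w\w+\b` (two or more word characters). The helper `tokenize` and the constant names are hypothetical and used only for illustration; the custom pattern is a variant of the one above with `,` and `.` left out of the character class.

```python
import re

# Default token_pattern of CountVectorizer: two or more word characters,
# so '@', '#' and single-letter words such as 'a' are dropped.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Variant of the answer's pattern (assumption: ',' and '.' are excluded
# so that "@friend," and "day." do not keep their trailing punctuation).
CUSTOM_PATTERN = r"[a-zA-Z0-9$&+:;=?@#|<>^*()%!-]+"

words = "Hello @friend, this is a good day. #good."


def tokenize(text, pattern):
    """Return the sorted, de-duplicated tokens that `pattern` finds in `text`."""
    return sorted(set(re.findall(pattern, text)))


default_tokens = tokenize(words, DEFAULT_PATTERN)
custom_tokens = tokenize(words, CUSTOM_PATTERN)

print(default_tokens)  # ['Hello', 'day', 'friend', 'good', 'is', 'this']
print(custom_tokens)   # ['#good', '@friend', 'Hello', 'a', 'day', 'good', 'is', 'this']
```

If `,` and `.` are kept inside the character class, tokens such as `@friend,` and `#good.` come out with the punctuation attached, so it is worth checking a candidate pattern with `re.findall` on a sample sentence before passing it to CountVectorizer.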
