Text classification using decision trees in Python

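I am new to both Python and machine learning. My implementation is based on the IEEE research paper http://ieeexplore.ieee.org/document/7320414/ (Bug Report, Feature Request, or Simply Praise? On Automatically Classifying App Reviews).

I want to classify text into categories. The texts are user reviews from the Google Play Store or the Apple App Store. The categories used in the research are Bug, Feature, User Experience, and Rating. Given this, I am trying to implement a decision tree using the sklearn package in Python. I came across an example dataset provided by sklearn called 'IRIS', which builds a tree model from features and their values mapped to a target. In that example the data is numeric.

I am trying to classify text rather than numeric data. For example: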

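I liked the upgrade to PDFs very much. However, they are not displaying anymore; fix it and it will be perfect [BUG]
I wish it would notify me when my balance drops below a certain dollar amount [FEATURE]
This app is very helpful in my line of business [Rating]
It is easy to find songs in iTunes and buy them [UserExperience]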

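With these texts and more user reviews in these categories, I want to create a classifier that can be trained on this data and predict the target of any given user review.

So far I have preprocessed the text and created the training data as a list of tuples, each containing the preprocessed data and its target.

My preprocessing steps are: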

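Break multi-sentence reviews into single sentences
Tokenize each sentence into words
Remove stop words from the tokenized sentence
Lemmatize the words in the tokenized sentence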

(['i', 'liked', 'much', 'upgrade', 'pdfs', 'however', 'displaying', 'anymore', 'fix', 'perfect'], "BUG")

Here is what I have so far:

import json
from sklearn import tree
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, RegexpTokenizer

# Define a tokenizer that splits text into words and strips punctuation
tokenizer = RegexpTokenizer(r'\w+')

# This list stores all the tagged training data
tagged_tokenized_comments_corpus = []


# Method: add data to the training set
# Parameter: a tuple of the form (data, label)
def tag_tokenized_comments_corpus(*tuple_data):
    tagged_tokenized_comments_corpus.append(tuple_data)


# Step 1: load all the stop words from the nltk package
stop_words = stopwords.words("english")
stop_words.remove('not')

# Create a temporary list to copy the existing stop words
temp_stop_words = stop_words

for word in temp_stop_words:
    if "n't" in word:
        stop_words.remove(word)

# Load the dataset
files = ["Bug.txt", "Feature.txt", "Rating.txt", "UserExperience.txt"]
d = {"Bug": 0, "Feature": 1, "Rating": 2, "UserExperience": 3}

for file in files:
    input_file = open(file, "r")
    file_text = input_file.read()
    json_content = json.loads(file_text)

    # Step 3: break multi-sentence comments into single sentences
    comments_corpus = []
    for i in range(len(json_content)):
        comments = json_content[i]['comment']
        if len(sent_tokenize(comments)) > 1:
            for comment in sent_tokenize(comments):
                comments_corpus.append(comment)
        else:
            comments_corpus.append(comments)

    # Step 4: tokenize each sentence, remove stop words and lemmatize the comments corpus
    lemmatizer = WordNetLemmatizer()
    tokenized_comments_corpus = []
    for i in range(len(comments_corpus)):
        words = tokenizer.tokenize(comments_corpus[i])
        tokenized_sentence = []
        for w in words:
            if w not in stop_words:
                tokenized_sentence.append(lemmatizer.lemmatize(w.lower()))
        if tokenized_sentence:
            tokenized_comments_corpus.append(tokenized_sentence)
            tag_tokenized_comments_corpus(tokenized_sentence, d[input_file.name.split(".")[0]])

# Step 5: create a dictionary of words from the tokenized comments corpus
unique_words = []
for sentence in tagged_tokenized_comments_corpus:
    for word in sentence[0]:
        unique_words.append(word)
unique_words = set(unique_words)

dictionary = {}
i = 0
for dict_word in unique_words:
    dictionary.update({i, dict_word})
    i = i + 1

train_target = []
train_data = []
for sentence in tagged_tokenized_comments_corpus:
    train_target.append(sentence[0])
    train_data.append(sentence[1])

clf = tree.DecisionTreeClassifier()
clf.fit(train_data, train_target)

test_data = "Beautiful Keep it up.. this far is the most usable app editor.. it makes my photos more beautiful and alive.."
test_words = tokenizer.tokenize(test_data)
test_tokenized_sentence = []
for test_word in test_words:
    if test_word not in stop_words:
        test_tokenized_sentence.append(lemmatizer.lemmatize(test_word.lower()))

# Use the classifier to predict
print("predicting the labels: ")
print(clf.predict(test_tokenized_sentence))

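However, this does not work: an error is thrown at runtime when the algorithm is trained. I was wondering whether I could map the words in the tuples to the dictionary and convert the text to numeric form to train the algorithm, but I am not sure whether that would work.

Can anyone suggest how I can fix this code? Or is there a better way to implement this decision tree?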

Traceback (most recent call last):
  File "C:/Users/venka/Documents/GitHub/RE-18/Test.py", line 87, in <module>
    clf.fit(train_data, train_target)
  File "C:\Users\venka\Anaconda3\lib\site-packages\sklearn\tree\tree.py", line 790, in fit
    X_idx_sorted=X_idx_sorted)
  File "C:\Users\venka\Anaconda3\lib\site-packages\sklearn\tree\tree.py", line 116, in fit
    X = check_array(X, dtype=DTYPE, accept_sparse="csc")
  File "C:\Users\venka\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 441, in check_array
    "if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[ 0.  0.  0. ...,  3.  3.  3.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Answer:

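Decision trees only work when all of your feature vectors are the same length. Personally, I do not know how effective decision trees are for this kind of text analysis, but if you want to give it a try, I would suggest a "one-hot" or "bag of words" style vector.

Basically, tally how many times each word appears in your example and put the counts into a vector that represents the entire corpus. Say that, once you have removed all the stop words, the set of words in the whole corpus is: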

{"Apple", "Banana", "Cherry", "Date", "Eggplant"}

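You can represent each example with a vector the same size as that set, where each value indicates whether the corresponding word is present. In our example this is a vector of length 5, in which the first element is associated with "Apple", the second with "Banana", and so on. You might end up with something like this: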

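bag("Apple Banana Date")
#: [1, 1, 0, 1, 0]
bag("Cherry")
#: [0, 0, 1, 0, 0]
bag("Date Eggplant Banana Banana")
#: [0, 1, 0, 1, 1]
# For this case, I don't know whether giving Banana a value of 2 would improve the results.
# It might or might not. That is something you would have to test.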
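
One way the bag helper used above might be written is sketched below. The names build_vocabulary and bag, the binary flag, and the choice to sort the words alphabetically are all assumptions made for illustration; they are not part of sklearn or nltk.

```python
def build_vocabulary(tokenized_docs):
    """Map each distinct word to a fixed column index (sorted for a stable order)."""
    words = sorted({word for doc in tokenized_docs for word in doc})
    return {word: index for index, word in enumerate(words)}


def bag(tokens, vocabulary, binary=True):
    """Turn a list of tokens into a vector with one slot per vocabulary word."""
    vector = [0] * len(vocabulary)
    for token in tokens:
        index = vocabulary.get(token)
        if index is None:
            continue  # words that are not in the vocabulary are ignored
        # binary=True records presence only; binary=False records the count
        vector[index] = 1 if binary else vector[index] + 1
    return vector


# Hypothetical corpus that yields the five-word vocabulary from the text
corpus = [
    ["Apple", "Banana", "Date"],
    ["Cherry"],
    ["Date", "Eggplant", "Banana", "Banana"],
]
vocabulary = build_vocabulary(corpus)

print(bag("Apple Banana Date".split(), vocabulary))                          # [1, 1, 0, 1, 0]
print(bag("Date Eggplant Banana Banana".split(), vocabulary))                # [0, 1, 0, 1, 1]
print(bag("Date Eggplant Banana Banana".split(), vocabulary, binary=False))  # [0, 2, 0, 1, 1]
```

Passing binary=False gives the count-based variant, which makes it easy to test whether a value of 2 for Banana helps or hurts.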

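This way you have vectors of the same size no matter what the input is, and the decision tree knows where to look for certain outcomes. Say "Banana" corresponds strongly to bug reports: the decision tree will then learn that a 1 in the second element means a bug report is more likely.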
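
As a rough sketch of how this could look with sklearn itself, CountVectorizer can build the fixed-length vectors and DecisionTreeClassifier can be fitted on them. The training sentences and labels below are made up for illustration, and a real training set would need many more reviews per class.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one row per review sentence, one label per row
train_texts = [
    "they are not displaying anymore fix it",
    "the app crashes when I open a pdf",
    "I wish it would notify me when my balance is low",
    "it would be nice to have a dark mode option",
    "this app is very helpful in my business",
    "great app five stars",
    "easy to find songs and buy them",
    "the menus are simple to navigate",
]
train_labels = [
    "Bug", "Bug",
    "Feature", "Feature",
    "Rating", "Rating",
    "UserExperience", "UserExperience",
]

# binary=True records only whether a word is present, as in the bag example
vectorizer = CountVectorizer(binary=True)
X_train = vectorizer.fit_transform(train_texts)  # 2D: (n_reviews, n_vocabulary_words)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, train_labels)

# New text must go through the SAME fitted vectorizer so the columns line up
X_new = vectorizer.transform(["please fix the crashes"])
print(clf.predict(X_new))
```

The important detail is that fit receives a 2D matrix with one row per review and one column per word, which is the shape the ValueError in the question asks for.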

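Of course, your corpus may contain thousands of words. In that case a decision tree is probably not the best tool for the job, unless you first spend some time reducing your features.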
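
One simple way to reduce the features, sketched here under the assumption that rare words carry little signal, is to keep only words that occur in at least a few documents and to cap the vocabulary size. The function name select_vocabulary and its parameters min_docs and max_words are hypothetical.

```python
from collections import Counter


def select_vocabulary(tokenized_docs, min_docs=2, max_words=1000):
    """Keep the most widely used words and map each one to a column index."""
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each word once per document

    frequent = [(word, n) for word, n in doc_freq.items() if n >= min_docs]
    # Most frequent first; ties are broken alphabetically for a reproducible order
    frequent.sort(key=lambda item: (-item[1], item[0]))

    kept = [word for word, _ in frequent[:max_words]]
    return {word: index for index, word in enumerate(kept)}


# Hypothetical preprocessed reviews
docs = [
    ["app", "crash", "pdf"],
    ["app", "crash", "login"],
    ["app", "helpful", "business"],
]
print(select_vocabulary(docs, min_docs=2, max_words=10))  # {'app': 0, 'crash': 1}
```

If you build the vectors with sklearn's CountVectorizer instead, its min_df and max_features parameters serve a similar purpose.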
