Adding phrases to the lexicon in R sentiment analysis

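I am running a sentiment analysis on a set of tweets, and I would now like to know how to add phrases to the positive and negative lexicons.

I have read in the files of phrases I want to test, but when I run the sentiment analysis I get no results.

Reading through the sentiment algorithm, I can see that it matches individual words against the lexicons. Is there a way to scan for both words and phrases?

Here is the code: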

    score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
    {
      require(plyr)
      require(stringr)

      # we got a vector of sentences. plyr will handle a list
      # or a vector as an "l" for us
      # we want a simple array ("a") of scores back, so we use
      # "l" + "a" + "ply" = "laply":
      scores = laply(sentences, function(sentence, pos.words, neg.words) {

        # clean up sentences with R's regex-driven global substitute, gsub():
        sentence = gsub('[[:punct:]]', '', sentence)
        sentence = gsub('[[:cntrl:]]', '', sentence)
        sentence = gsub('\\d+', '', sentence)

        # and convert to lower case:
        sentence = tolower(sentence)

        # split into words. str_split is in the stringr package
        word.list = str_split(sentence, '\\s+')

        # sometimes a list() is one level of hierarchy too much
        words = unlist(word.list)

        # compare our words to the dictionaries of positive & negative terms
        # (use the function arguments, not the global pos / neg)
        pos.matches = match(words, pos.words)
        neg.matches = match(words, neg.words)

        # match() returns the position of the matched term or NA
        # we just want a TRUE/FALSE:
        pos.matches = !is.na(pos.matches)
        neg.matches = !is.na(neg.matches)

        # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
        score = sum(pos.matches) - sum(neg.matches)

        return(score)
      }, pos.words, neg.words, .progress=.progress)

      scores.df = data.frame(score=scores, text=sentences)
      return(scores.df)
    }

    analysis = score.sentiment(Tweets, pos, neg)
    table(analysis$score)

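This is the result I get:

whereas what I want is the standard table this function provides, for example:

    -2 -1  0  1  2
     1  2  3  4  5

Does anyone have any idea how to run this on phrases? Note: the TWEETS file is a set of sentences.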

Answer:

The function score.sentiment seems to work. If I try a very simple setup,

    Tweets = c("this is good", "how bad it is")
    neg = c("bad")
    pos = c("good")
    analysis = score.sentiment(Tweets, pos, neg)
    table(analysis$score)

I get the expected result,

    > table(analysis$score)

    -1  1
     1  1

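How are you feeding the 20 tweets to the method? Judging from the result you posted, i.e. 0 20, I would say the problem is that none of your 20 tweets contains a positive or negative word, although of course you would have noticed if that were the case. If you can provide more details about your list of tweets and your positive and negative words, it will be easier to help you.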
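One quick way to check this is to list which lexicon entries actually occur in each tweet. The sketch below is only an illustration in base R: `lexicon_hits` is a hypothetical helper (not part of the original code), and the sample tweets and lexicon are made up. It cleans and splits each text the same way score.sentiment does and returns the matched entries.

```r
# Hypothetical helper (not from the original post): for each text, return the
# lexicon entries that occur in it as whole, single words.
lexicon_hits <- function(texts, lexicon) {
  lapply(texts, function(text) {
    # same cleaning idea as score.sentiment: drop punctuation, lower-case
    cleaned <- tolower(gsub("[[:punct:]]", "", text))
    words <- unlist(strsplit(cleaned, "\\s+"))
    intersect(words, lexicon)
  })
}

# Made-up sample data, for illustration only
tweets <- c("this is good", "how bad it is", "nothing to see here")
hits <- lexicon_hits(tweets, c("good", "bad"))
print(hits)

# number of tweets without any lexicon match
no_match <- sum(lengths(hits) == 0)
print(no_match)
```

Note that a multi-word entry such as "wrong choice" can never show up here, because every element of `words` is a single token. That is exactly the situation in which every tweet ends up with a score of 0.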

In any case, your function seems to work fine.

Hope this helps.

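Edit after clarification in the comments:

To solve your problem, you actually need to tokenize the sentences into n-grams, where n corresponds to the maximum number of words used in your lists of positive and negative n-grams. You can refer to this Stack Overflow question to see how to do it. For completeness, and because I have tested it myself, here is an example of what you could do. I simplify to bigrams (n=2) and use the following input: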

    Tweets = c("rewarding hard work with raising taxes and VAT. #LabourManifesto",
               "Ed Miliband is offering 'wrong choice' of 'more cuts' in #LabourManifesto")
    pos = c("rewarding hard work")
    neg = c("wrong choice")

You can create a bigram tokenizer like this,

    library(tm)
    library(RWeka)
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))

and test it,

    > BigramTokenizer("rewarding hard work with raising taxes and VAT. #LabourManifesto")
    [1] "rewarding hard"       "hard work"            "work with"
    [4] "with raising"         "raising taxes"        "taxes and"
    [7] "and VAT"              "VAT #LabourManifesto"

Then, in your method, you simply replace this line,

    word.list = str_split(sentence, '\\s+')

with this one

    word.list = BigramTokenizer(sentence)

Of course, it would be better to rename word.list to ngram.list or something similar.

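The result is as expected,

    > table(analysis$score)

    -1  0
     1  1

Just decide on your n-gram size and set it in Weka_control, and you should be fine. Note that with min=2 and max=2 the three-word entry "rewarding hard work" can never match, which is why the first tweet scores 0 here; raising max to 3 should pick it up.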
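If you would rather not depend on RWeka, the same idea can be sketched in base R. The function below is an illustration only, not the tokenizer used above: `make_ngrams` is a hypothetical helper, and the small lexicons are made up. It builds every n-gram from n_min to n_max words (assuming n_min <= n_max), so single words and phrases can be matched in one pass.

```r
# Hypothetical helper (an assumption, not the RWeka tokenizer used above):
# build all n-grams of sizes n_min..n_max from a vector of words.
make_ngrams <- function(words, n_min = 1, n_max = 3) {
  ngrams <- character(0)
  for (n in n_min:n_max) {
    if (length(words) < n) next  # sentence too short for this n
    starts <- seq_len(length(words) - n + 1)
    grams <- vapply(starts,
                    function(i) paste(words[i:(i + n - 1)], collapse = " "),
                    character(1))
    ngrams <- c(ngrams, grams)
  }
  ngrams
}

sentence <- "rewarding hard work with raising taxes"
words <- unlist(strsplit(tolower(sentence), "\\s+"))
grams <- make_ngrams(words, n_min = 1, n_max = 3)

# Made-up lexicons mixing single words and phrases
pos <- c("rewarding hard work", "good")
neg <- c("wrong choice", "bad")

# same scoring rule as score.sentiment: positive hits minus negative hits
score <- sum(grams %in% pos) - sum(grams %in% neg)
print(score)
```

Inside score.sentiment, this would take the place of the str_split() step: clean and split the sentence into words as before, then match the output of make_ngrams(words) against the lexicons instead of the words themselves.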

Hope this helps.
