Computing tf-idf from strings in a CSV file

My test.csv file looks like this (no header row):

    very good, very bad, you are great
    very bad, good restaurent, nice place to visit

I want to split my corpus on commas, so that each comma-separated phrase becomes one term and my final DocumentTermMatrix looks like this:

    docs \ terms   very good   very bad   you are great   good restaurent   nice place to visit
    doc1           tf-idf      tf-idf     tf-idf          0                 0
    doc2           0           tf-idf     0               tf-idf            tf-idf

If I do not load the documents from the CSV file, I can generate the DTM above correctly, like this:

    library(tm)

    docs <- c(D1 = "very good, very bad, you are great",
              D2 = "very bad, good restaurent, nice place to visit")

    dd <- Corpus(VectorSource(docs))

    # Split each document on commas, then join the words of each phrase with "~"
    dd <- tm_map(dd, function(x) {
        PlainTextDocument(
            gsub("\\s+", "~", strsplit(x, ",\\s*")[[1]]),
            id = ID(x)
        )
    })

    inspect(dd)
    # A corpus with 2 text documents
    #
    # The metadata consists of 2 tag-value pairs and a data frame
    # Available tags are:
    #   create_date creator
    # Available variables in the data frame are:
    #   MetaID
    #
    # $D1
    # very~good
    # very~bad
    # you~are~great
    #
    # $D2
    # very~bad
    # good~restaurent
    # nice~place~to~visit

    dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
    as.matrix(dtm)

This produces:

    # Docs good~restaurent nice~place~to~visit very~bad very~good you~are~great
    #   D1       0.0000000           0.0000000        0 0.3333333     0.3333333
    #   D2       0.3333333           0.3333333        0 0.0000000     0.0000000
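The numbers in the matrix above can be checked by hand. The sketch below is base R only and does not call tm. It assumes that weightTfIdf, with its default settings, divides each term count by the document length and multiplies by log2(number of documents / number of documents containing the term). The variable names are made up for illustration.

```r
# Phrase tokens per document, as shown by inspect(dd) above
docs <- list(
  D1 = c("very~good", "very~bad", "you~are~great"),
  D2 = c("very~bad", "good~restaurent", "nice~place~to~visit")
)

terms <- sort(unique(unlist(docs)))

# Raw counts: one row per document, one column per term
counts <- t(sapply(docs, function(tokens) {
  tabulate(match(tokens, terms), nbins = length(terms))
}))
colnames(counts) <- terms

# Term frequency, normalised by the number of tokens in each document
tf <- counts / rowSums(counts)

# Inverse document frequency (assumed form: log2(N / document frequency))
idf <- log2(nrow(counts) / colSums(counts > 0))

# Multiply every column of tf by the idf of its term
tfidf <- sweep(tf, 2, idf, `*`)
print(round(tfidf, 7))
```

Under that assumption, very~bad occurs in both documents, so its idf is log2(2/2) = 0 and its column is all zeros. Every other term occurs in one document only, so its weight there is (1/3) * log2(2/1) = 0.3333333.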

If I load the documents from the CSV file instead, only the first phrase of each document gets joined, as shown below:

    > file_loc <- "testdata.csv"
    > require(tm)
    Loading required package: tm
    > x <- read.csv(file_loc, header = FALSE)
    > x <- data.frame(lapply(x, as.character), stringsAsFactors=FALSE)
    > dd <- Corpus(DataframeSource(x))
    > dd <- tm_map(dd, stripWhitespace)
    > dd <- tm_map(dd, tolower)
    > dd <- tm_map(dd, function(x) {
          PlainTextDocument(
              gsub("\\s+", "~", strsplit(x, ",\\s*")[[1]]),
              id = ID(x)
          )
      })
    > inspect(dd)

Only the first phrase of each document is joined and kept:

    # $D1
    # very~good
    #
    # $D2
    # very~bad

How can I join the words of every phrase and create the DocumentTermMatrix shown above?

Answer:

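You are not reading the data in correctly. I use scan with sep = "\n" instead, so that each line of the file is read as one document string. The following works: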

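    library(tm)

    # Read one whole line per document; the commas stay inside each string
    docs <- scan("testdata.csv", "character", sep = "\n")

    dd <- Corpus(VectorSource(docs))

    # Split each document on commas, then join the words of each phrase with "~"
    dd <- tm_map(dd, function(x) {
        PlainTextDocument(
            gsub("\\s+", "~", strsplit(x, ",\\s*")[[1]]),
            id = ID(x)
        )
    })

    inspect(dd)

    dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
    as.matrix(dtm)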
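To see why the way the file is read matters, here is a small base R sketch that does not use tm. It writes the two lines from the question to a temporary file, reads them back with both read.csv and scan, and applies the same split-and-join step to each result. The helper to_phrases and the variable names are hypothetical. The sketch also assumes that, in the question's pipeline, each data frame row reached the function as a vector of its cells, which I have not verified against tm itself.

```r
# Write the question's two lines to a temporary file
csv_text <- c("very good, very bad, you are great",
              "very bad, good restaurent, nice place to visit")
tmp <- tempfile(fileext = ".csv")
writeLines(csv_text, tmp)

# read.csv() splits each line on commas: one phrase per column
as_table <- read.csv(tmp, header = FALSE, stringsAsFactors = FALSE)

# scan() with sep = "\n" keeps each line as a single string
as_lines <- scan(tmp, "character", sep = "\n", quiet = TRUE)

# Same split-and-join step as in the question (hypothetical helper name)
to_phrases <- function(x) gsub("\\s+", "~", strsplit(x, ",\\s*")[[1]])

# A row of the data frame is already three separate cells, so [[1]]
# keeps only the first one
row1 <- unlist(as_table[1, ], use.names = FALSE)
from_table <- to_phrases(row1)

# A whole line still contains its commas, so all three phrases survive
from_lines <- to_phrases(as_lines[1])

print(from_table)
print(from_lines)
unlink(tmp)
```

In this sketch the [[1]] after strsplit is what drops the remaining phrases once read.csv has already split the row into columns. With one string per line there is only one element to pick, and it holds the entire document.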
