The contents of my test.csv file are as follows (no header):

very good, very bad, you are great
very bad, good restaurent, nice place to visit
I want to split my corpus on commas, so that my final DocumentTermMatrix looks like this:

      terms
docs  very good  very bad  you are great  good restaurent  nice place to visit
doc1  tf-idf     tf-idf    tf-idf         0                0
doc2  0          tf-idf    0              tf-idf           tf-idf
If I do not load the documents from a csv file, I can generate the DTM above correctly, like this:

library(tm)
docs <- c(D1 = "very good, very bad, you are great",
          D2 = "very bad, good restaurent, nice place to visit")
dd <- Corpus(VectorSource(docs))
dd <- tm_map(dd, function(x) {
    PlainTextDocument(
        gsub("\\s+", "~", strsplit(x, ",\\s*")[[1]]),
        id = ID(x)
    )
})
inspect(dd)
# A corpus with 2 text documents
#
# The metadata consists of 2 tag-value pairs and a data frame
# Available tags are:
#   create_date creator
# Available variables in the data frame are:
#   MetaID
#
# $D1
# very~good
# very~bad
# you~are~great
#
# $D2
# very~bad
# good~restaurent
# nice~place~to~visit

dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
as.matrix(dtm)
This produces:

# Docs good~restaurent nice~place~to~visit very~bad very~good you~are~great
# D1         0.0000000           0.0000000        0 0.3333333     0.3333333
# D2         0.3333333           0.3333333        0 0.0000000     0.0000000
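
The 0.3333333 and 0 entries are consistent with a length-normalised term frequency multiplied by log2(N / document frequency), which is, as far as I understand it, what weightTfIdf computes by default. The base-R sketch below reproduces those numbers without tm; the variable names (docs, counts, tfidf) are illustrative only, and it is an approximation of the weighting rather than tm's own implementation.

```r
# Phrase lists for the two sample documents, already split on commas.
docs <- list(
  D1 = c("very good", "very bad", "you are great"),
  D2 = c("very bad", "good restaurent", "nice place to visit")
)

# Raw counts: one row per document, one column per distinct phrase.
terms <- sort(unique(unlist(docs)))
counts <- t(sapply(docs, function(p) table(factor(p, levels = terms))))

# Term frequency normalised by document length.
tf <- counts / rowSums(counts)

# Inverse document frequency: log2(number of docs / docs containing the phrase).
idf <- log2(nrow(counts) / colSums(counts > 0))

# Multiply every column of tf by the idf of its phrase.
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 7)
```

Because "very bad" occurs in both documents, its idf is log2(2/2) = 0, which is why that column is zero for both rows.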
If I load the documents from the csv file, only the first term of each document gets joined, as shown here:

> file_loc <- "testdata.csv"
> require(tm)
Loading required package: tm
> x <- read.csv(file_loc, header = FALSE)
> x <- data.frame(lapply(x, as.character), stringsAsFactors=FALSE)
> dd <- Corpus(DataframeSource(x))
> dd <- tm_map(dd, stripWhitespace)
> dd <- tm_map(dd, tolower)
> dd <- tm_map(dd, function(x) {
      PlainTextDocument(
          gsub("\\s+", "~", strsplit(x, ",\\s*")[[1]]),
          id = ID(x)
      )
  })
> inspect(dd)
Only the first term is joined:

# $D1
# very~good
#
# $D2
# very~bad
How can I join all of the terms and create the DocumentTermMatrix shown above?
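Answer:

You are reading the data incorrectly. read.csv splits each row on the commas, so every phrase ends up in its own column, and the later strsplit(...)[[1]] most likely sees only the first field. I used scan to read each line as a single string instead. The following works: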
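# Read each line of the file as one whole string (one document per line).
docs <- scan("testdata.csv", "character", sep = "\n")
# Build the corpus from docs (the variable that scan just created).
dd <- Corpus(VectorSource(docs))
dd <- tm_map(dd, function(x) {
    PlainTextDocument(
        gsub("\\s+", "~", strsplit(x, ",\\s*")[[1]]),
        id = ID(x)
    )
})
inspect(dd)
dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
as.matrix(dtm)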
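
To see why the reading step matters, here is a small base-R sketch that compares the two approaches on a temporary copy of the sample data. It does not use tm; the temporary file and the names as_csv, as_lines, first_only and all_terms are made up for illustration. It also assumes that the document built from the data frame behaves like a plain vector of fields, which is a guess about how the old DataframeSource passed rows along.

```r
# Write the two sample documents to a temporary file (stand-in for testdata.csv).
tmp <- tempfile(fileext = ".csv")
writeLines(c("very good, very bad, you are great",
             "very bad, good restaurent, nice place to visit"), tmp)

# read.csv splits on commas: 2 rows x 3 columns, one phrase per cell.
as_csv <- read.csv(tmp, header = FALSE, stringsAsFactors = FALSE)
dim(as_csv)

# scan with sep = "\n" keeps each line whole: a character vector of length 2.
as_lines <- scan(tmp, what = "character", sep = "\n", quiet = TRUE)
length(as_lines)

# A row from read.csv is already a vector of three fields, so strsplit
# returns one list element per field and [[1]] keeps only the first one.
row1 <- unlist(as_csv[1, ], use.names = FALSE)
first_only <- gsub("\\s+", "~", strsplit(row1, ",\\s*")[[1]])

# A whole line from scan is a single string, so [[1]] holds every phrase.
all_terms <- gsub("\\s+", "~", strsplit(as_lines[1], ",\\s*")[[1]])

first_only
all_terms
```

If that assumption holds, first_only matches the truncated output in the question, while all_terms contains all three joined phrases for the first document.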