I am new to pyspark. I want to do some machine learning operations on a text file.
from pyspark import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf

sc = SparkContext
spark = SparkSession.builder.appName("ML").getOrCreate()

train_data = spark.read.text("20ng-train-all-terms.txt")
td = train_data.rdd  # transform df to rdd
tr_data = td.map(lambda line: line.split()).map(lambda words: Row(label=words[0], words=words[1:]))

from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)
For my last command, I get the error "AttributeError: 'RDD' object has no attribute '_jdf'".
Can anyone help me? Thanks.
Answer:
You should not use CountVectorizer on an rdd. Instead, you should build the array of words inside the dataframe itself, like this:
train_data = spark.read.text("20ng-train-all-terms.txt")

from pyspark.sql import functions as F
td = train_data.select(F.split("value", " ").alias("words")).select(
    F.col("words")[0].alias("label"),
    F.col("words")
)

from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)
Then it should work fine, and you can call the transform function like this:
vectorizer_transformer.transform(td).show(truncate=False)
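As a side note, what CountVectorizer does under the hood is build a vocabulary from all the words arrays and then map each row's word list to term counts. A minimal pure-Python sketch of that idea (no Spark needed; the function names here are made up for illustration):

```python
from collections import Counter

def fit_vocabulary(docs):
    # collect every distinct token across all documents,
    # analogous to CountVectorizer learning its vocabulary
    vocab = sorted({w for doc in docs for w in doc})
    return {w: i for i, w in enumerate(vocab)}

def to_counts(doc, vocab):
    # map one token list to a dense count vector over the vocabulary,
    # analogous to the transform step producing bag_of_words
    counts = Counter(doc)
    return [counts.get(w, 0) for w in vocab]

docs = [["spark", "ml", "spark"], ["ml", "rdd"]]
vocab = fit_vocabulary(docs)
print(vocab)                      # {'ml': 0, 'rdd': 1, 'spark': 2}
print(to_counts(docs[0], vocab))  # [1, 0, 2]
```

The real CountVectorizer returns sparse vectors and supports options like vocabSize and minDF, but the fit/transform split is the same: fit learns the vocabulary, transform only counts.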
Now, if you want to stick with the old style of converting to an rdd, then you need to change a few lines. Here is your modified, complete (runnable) code:
from pyspark.sql import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf

spark = SparkSession.builder.appName("ML").getOrCreate()

train_data = spark.read.text("20ng-train-all-terms.txt")
td = train_data.rdd  # transform df to rdd
tr_data = td.map(lambda line: line[0].split(" ")).map(lambda words: Row(label=words[0], words=words[1:])).toDF()

from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(tr_data)
But I suggest you stick with the dataframe approach.