String Indexer, CountVectorizer in PySpark on a single row

Hi, I've run into a problem: I have some rows, and each row has two columns that each contain an array of words.

column1, column2
["a", "b", "b", "c"], ["a", "b", "x", "y"]

Basically, I want to count the occurrences of each word across the two columns, ending up with two arrays:

[1, 2, 1, 0, 0], [1, 1, 0, 1, 1]

So "a" appears once in each array, "b" appears twice in column1 and once in column2, "c" appears only in column1, "x" and "y" appear only in column2, and so on.
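To pin down the expected result before involving Spark, here is a minimal plain-Python sketch of the per-row computation described above. The function name `count_across` is hypothetical and used only for illustration; it assumes the word order in the output is "first seen in column1, then new words from column2".

```python
from collections import Counter


def count_across(column1, column2):
    """Return per-word counts for each column over the row's combined vocabulary."""
    # Ordered union of distinct words: column1's words first, then new ones from column2.
    words = list(dict.fromkeys(column1 + column2))
    counts1, counts2 = Counter(column1), Counter(column2)
    # A Counter returns 0 for missing keys, which yields the zero entries.
    return [counts1[w] for w in words], [counts2[w] for w in words]


occ1, occ2 = count_across(["a", "b", "b", "c"], ["a", "b", "x", "y"])
print(occ1)  # [1, 2, 1, 0, 0]
print(occ2)  # [1, 1, 0, 1, 1]
```

This only describes the semantics for a single row; it says nothing about how to run it efficiently over a DataFrame.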

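I looked at the CountVectorizer function in the ml library, but I'm not sure whether it can work row by row, given that the arrays in each column may be very large. It also doesn't seem to handle the 0 values (where a word appears in one column but not in the other).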

Any help would be greatly appreciated.

Answer:

For Spark 2.4+, you can do this with the DataFrame API and the built-in array functions.

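First, use the array_union function to get all the distinct words in each row. Then use the transform function to map over that word array and, for each element, compute its number of occurrences in each column with the size and array_remove functions: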

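from pyspark.sql.functions import array_union, expr

# Assumes an active SparkSession named `spark`.
df = spark.createDataFrame(
    [(["a", "b", "b", "c"], ["a", "b", "x", "y"])],
    ["column1", "column2"],
)

# words: the distinct words seen in either column of the row.
# occ_columnN: for each word, array size minus the size after removing
# that word, which equals the word's count in that column.
df.withColumn("words", array_union("column1", "column2")) \
  .withColumn("occ_column1",
              expr("transform(words, x -> size(column1) - size(array_remove(column1, x)))")) \
  .withColumn("occ_column2",
              expr("transform(words, x -> size(column2) - size(array_remove(column2, x)))")) \
  .drop("words") \
  .show(truncate=False)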

Output:

+------------+------------+---------------+---------------+
|column1     |column2     |occ_column1    |occ_column2    |
+------------+------------+---------------+---------------+
|[a, b, b, c]|[a, b, x, y]|[1, 2, 1, 0, 0]|[1, 1, 0, 1, 1]|
+------------+------------+---------------+---------------+
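Why the size difference gives a count: array_remove drops every element equal to the given value, so the array shrinks by exactly the number of occurrences. The plain-Python sketch below mirrors that expression without Spark. The helpers `array_remove` and `occurrences` here are illustrative stand-ins written for this example, not the Spark functions themselves.

```python
def array_remove(arr, x):
    # Stand-in for Spark's array_remove: drop every element equal to x.
    return [e for e in arr if e != x]


def occurrences(words, column):
    # Mirrors: transform(words, x -> size(column) - size(array_remove(column, x)))
    return [len(column) - len(array_remove(column, x)) for x in words]


column1 = ["a", "b", "b", "c"]
column2 = ["a", "b", "x", "y"]

# Stand-in for array_union: distinct words, column1's first.
words = list(dict.fromkeys(column1 + column2))

print(occurrences(words, column1))  # [1, 2, 1, 0, 0]
print(occurrences(words, column2))  # [1, 1, 0, 1, 1]
```

A word absent from a column leaves the array unchanged after removal, so the difference is 0, which covers the zero entries raised in the question.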
