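I want to get the distinct values of each column in a DataFrame along with their counts, and store them as (k,v) pairs in another DataFrame. Note: my columns are not static; they keep changing. So I cannot hardcode the column names and have to iterate over them instead.
For example, here is my DataFrame: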
+----------------+-----------+------------+
|name            |country    |DOB         |
+----------------+-----------+------------+
| Blaze          |        IND|    19950312|
| Scarlet        |        USA|    19950313|
| Jonas          |        CAD|    19950312|
| Blaze          |        USA|    19950312|
| Jonas          |        CAD|    19950312|
| mark           |        USA|    19950313|
| mark           |        CAD|    19950313|
| Smith          |        USA|    19950313|
| mark           |        UK |    19950313|
| scarlet        |        CAD|    19950313|
+----------------+-----------+------------+
My final result should be a new DataFrame of (k,v) pairs, where k is a distinct value and v is its count.
+----------------+-----------+------------+
|name            |country    |DOB         |
+----------------+-----------+------------+
| (Blaze,2)      | (IND,1)   |(19950312,3)|
| (Scarlet,2)    | (USA,4)   |(19950313,6)|
| (Jonas,3)      | (CAD,4)   |            |
| (mark,3)       | (UK,1)    |            |
| (smith,1)      |           |            |
+----------------+-----------+------------+
Can anyone help me with this? I am using Spark 2.4.0 and Scala 2.11.12.
Note: my columns are dynamic, so I cannot hardcode them and group by each one explicitly.
Answer:
I don't have an exact solution for what you are asking, but I can offer something to get you started.
Create the DataFrame:
scala> val df = Seq(("Blaze ","IND","19950312"),
     |   ("Scarlet","USA","19950313"),
     |   ("Jonas ","CAD","19950312"),
     |   ("Blaze ","USA","19950312"),
     |   ("Jonas ","CAD","19950312"),
     |   ("mark ","USA","19950313"),
     |   ("mark ","CAD","19950313"),
     |   ("Smith ","USA","19950313"),
     |   ("mark ","UK ","19950313"),
     |   ("scarlet","CAD","19950313")).toDF("name", "country","dob")
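Next, count the occurrences of each distinct value, one column at a time. This yields one grouped DataFrame per column: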
scala> val distCount = df.columns.map(c => df.groupBy(c).count)
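Note that the line above runs one separate aggregation per column. As a rough alternative sketch (not part of the original answer), the columns can be unpivoted into (column, value) pairs and counted with a single groupBy. The helper name longCounts, the sample data, and the local SparkSession settings are assumptions for illustration; values are cast to strings so that columns of different types can share one array.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

// Hypothetical helper: returns one row per (column, value) with its count.
def longCounts(df: DataFrame): DataFrame = {
  // One struct per input column: the column name plus the cell value as a string.
  val pairs = df.columns.map { c =>
    struct(lit(c).as("column"), col(c).cast("string").as("value"))
  }
  // explode turns each input row into one output row per column,
  // so a single groupBy covers all columns at once.
  df.select(explode(array(pairs: _*)).as("kv"))
    .groupBy(col("kv.column").as("column"), col("kv.value").as("value"))
    .count()
}

// Assumed local session and made-up sample data, for demonstration only.
val spark = SparkSession.builder().master("local[*]").appName("long-counts").getOrCreate()
import spark.implicits._

val sample = Seq(
  ("Blaze", "IND", "19950312"),
  ("Jonas", "USA", "19950312"),
  ("Blaze", "USA", "19950312")
).toDF("name", "country", "dob")

val counts = longCounts(sample)
counts.show(false)
```

The result is in long format (column, value, count) rather than the side-by-side layout from the question, but it stays distributed and never hardcodes a column name.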
Create a range to iterate over distCount:
scala> val range = Range(0,distCount.size)
range: scala.collection.immutable.Range = Range(0, 1, 2)
Aggregate your data by collecting each per-column result and concatenating its rows into a single string:
scala> val aggVal = range.toList.map(i => distCount(i).collect().mkString).toSeq
aggVal: scala.collection.immutable.Seq[String] = List([Jonas ,2][Smith ,1][Scarlet,1][scarlet,1][mark ,3][Blaze ,2], [CAD,4][USA,4][IND,1][UK ,1], [19950313,6][19950312,4])
Create the DataFrame:
scala> Seq((aggVal(0),aggVal(1),aggVal(2))).toDF("name", "country","dob").show()
+--------------------+--------------------+--------------------+
|                name|             country|                 dob|
+--------------------+--------------------+--------------------+
| [Jonas ,2][Smith...|[CAD,4][USA,4][IN...|[19950313,6][1995...|
+--------------------+--------------------+--------------------+
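The result above packs all pairs of a column into one string. To get closer to the layout asked for in the question, with one (k,v) pair per row, something like the following might work. It is only a sketch: distinctCounts is a hypothetical helper name, the sample data is made up, and every per-column result is collected to the driver, so it assumes the number of distinct values per column is small.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical helper: one "(value,count)" string per distinct value of each
// column, with the columns lined up side by side.
def distinctCounts(df: DataFrame): DataFrame = {
  val spark = df.sparkSession

  // One groupBy per column, iterating over df.columns instead of hardcoding names.
  // Rows are read by position; a null value is rendered as the text "null".
  val perColumn: Seq[Seq[String]] = df.columns.toSeq.map { c =>
    df.groupBy(c).count().collect()
      .map(r => s"(${r.get(0)},${r.getLong(1)})")
      .sorted
      .toSeq
  }

  // Pad shorter columns with empty strings so all columns have the same height,
  // then transpose the column-wise lists into rows.
  val height = if (perColumn.isEmpty) 0 else perColumn.map(_.size).max
  val rows = perColumn.map(_.padTo(height, "")).transpose.map(Row.fromSeq)

  // Every output column is a string column named after the input column.
  val schema = StructType(df.columns.map(c => StructField(c, StringType)))
  spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
}

// Assumed local session and made-up sample data, for demonstration only.
val spark = SparkSession.builder().master("local[*]").appName("distinct-counts").getOrCreate()
import spark.implicits._

val sample = Seq(
  ("Blaze", "IND", "19950312"),
  ("Jonas", "USA", "19950312"),
  ("Blaze", "USA", "19950312")
).toDF("name", "country", "dob")

val result = distinctCounts(sample)
result.show(false)
```

The pairs are stored as plain strings here to keep the schema simple. Values that differ only in case or trailing whitespace (such as "Scarlet" and "scarlet") are still counted separately unless they are normalized first.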
Hope this helps.