Spark ML – KMeans – org.apache.spark.sql.AnalysisException: cannot resolve 'features' given input columns

I am trying to use KMeans from Spark ML to analyze and cluster the Chicago crime dataset. Here is the code snippet:

    case class ChicCase(ID: Long, Case_Number: String, Date: String, Block: String,
        IUCR: String, Primary_Type: String, Description: String,
        Location_description: String, Arrest: Boolean, Domestic: Boolean, Beat: Int,
        District: Int, Ward: Int, Community_Area: Int, FBI_Code: String,
        X_Coordinate: Int, Y_Coordinate: Int, Year: Int, Updated_On: String,
        Latitude: Double, Longitude: Double, Location: String)

    val city = spark.read
      .option("header", true)
      .option("inferSchema", true)
      .csv("/chicago_city/Crimes_2001_to_present_2")
      .as[ChicCase]

    val data = city.drop("ID", "Case_Number", "Date", "Block", "IUCR", "Primary_Type",
      "Description", "Location_description", "Arrest", "Domestic", "FBI_Code",
      "Year", "Location", "Updated_On")

    val kmeans = new KMeans
    kmeans.setK(10).setSeed(1L)
    val model = kmeans.fit(data)

But this throws the following exception:

    org.apache.spark.sql.AnalysisException: cannot resolve '`features`' given input columns: [Ward, Longitude, X_Coordinate, Beat, Latitude, District, Y_Coordinate, Community_Area];
      at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
      at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
      at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
      at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
      at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
      at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:190)
      at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:200)
      at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:204)
      at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
      at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
      at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
      at scala.collection.AbstractTraversable.map(Traversable.scala:104)
      at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:204)
      at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:209)
      at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
      at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:209)
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
      at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
      at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
      at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
      at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
      at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
      at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2589)
      at org.apache.spark.sql.Dataset.select(Dataset.scala:969)
      at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:307)
      ... 90 elided

The data types are all either Int or Double. Where could the problem be?


Answer:

In Spark ML's DataFrame-based API, all feature columns have to be assembled into a single vector column, named features by default, using VectorAssembler. When you fit the model, the estimator looks for a features column; your Dataset has no such column, which is why the exception cannot resolve '`features`' given input columns is thrown:

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.clustering.KMeans

    // The assembler collects all columns of interest into a single "features" column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("Ward", "Longitude", "X_Coordinate", "Beat",
                          "Latitude", "District", "Y_Coordinate", "Community_Area"))
      .setOutputCol("features")

    val data = assembler.transform(city)

    val kmeans = new KMeans()
    val model = kmeans.fit(data)
    model.getK
    // res28: Int = 2

Example here.
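As a follow-up, the assembler and the estimator can also be chained in a single Pipeline, which keeps the feature-assembly step attached to the model. The following is a minimal sketch, assuming Spark 2.x and the city Dataset loaded in the question; it also restores the K = 10 and seed settings from the original snippet rather than the default K = 2 shown above.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler

    // Same assembler as above: gather the numeric columns into "features".
    val assembler = new VectorAssembler()
      .setInputCols(Array("Ward", "Longitude", "X_Coordinate", "Beat",
                          "Latitude", "District", "Y_Coordinate", "Community_Area"))
      .setOutputCol("features")

    // K = 10 and seed = 1L mirror the settings in the question.
    val kmeans = new KMeans()
      .setK(10)
      .setSeed(1L)
      .setFeaturesCol("features")

    // Fitting the pipeline runs the assembler first, then trains KMeans,
    // so transform() on new data applies the identical assembly step.
    val pipeline = new Pipeline().setStages(Array(assembler, kmeans))
    val pipelineModel = pipeline.fit(city)

    // Each row receives a cluster index in the default "prediction" column.
    pipelineModel.transform(city).select("features", "prediction").show(5)

One caveat: depending on the Spark version, VectorAssembler raises an error on null values in its input columns, so it may be necessary to drop rows with missing values in those columns before fitting.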
