I trained a classification model with pyspark.ml.classification.RandomForestClassifier and applied it to a new dataset to generate predictions. Before feeding the dataset to the model I dropped the customer_id column, and now I am not sure how to map customer_id back onto the predictions. Since Spark DataFrames are inherently unordered, I cannot tell which row belongs to which customer.
Answer:
Here is an example from the Spark documentation that uses the pipeline technique for classification. The original schema is preserved, and only the selected columns are used as input features for the learning algorithm (I swapped in a random forest for this example).
Reference => https://spark.apache.org/docs/latest/ml-pipeline.html
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import HashingTF, Tokenizer

# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and rf.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, rf])

# Fit the pipeline to training documents.
model = pipeline.fit(training)

# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents and print columns of interest.
prediction = model.transform(test)

# schema is preserved
prediction.printSchema()

root
 |-- id: long (nullable = true)
 |-- text: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)

# sample row
for i in prediction.take(1):
    print(i)

Row(id=4, text='spark i j k', words=['spark', 'i', 'j', 'k'], features=SparseVector(262144, {20197: 1.0, 24417: 1.0, 227520: 1.0, 234657: 1.0}), rawPrediction=DenseVector([5.0857, 4.9143]), probability=DenseVector([0.5086, 0.4914]), prediction=0.0)
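Because transform() only appends new columns, the id column is still present in the prediction DataFrame, as the schema above shows. A minimal follow-up on the same prediction DataFrame would be to select the id alongside the model output:

# The id column survives model.transform(), so each prediction can be tied back to its row.
prediction.select("id", "prediction", "probability").show()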
Here is an example from the Spark documentation for the VectorAssembler class, where multiple columns are combined into a single feature vector that serves as the input to the learning algorithm.
Reference => https://spark.apache.org/docs/latest/ml-features.html#vectorassembler
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

dataset = spark.createDataFrame(
    [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)],
    ["id", "hour", "mobile", "userFeatures", "clicked"])

assembler = VectorAssembler(
    inputCols=["hour", "mobile", "userFeatures"],
    outputCol="features")

output = assembler.transform(dataset)
print("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")
output.select("features", "clicked").show(truncate=False)

Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'
+-----------------------+-------+
|features               |clicked|
+-----------------------+-------+
|[18.0,1.0,0.0,10.0,0.5]|1.0    |
+-----------------------+-------+
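Applied to your situation, here is a minimal sketch of how the two ideas combine; the column names customer_id, age, income, and label are hypothetical stand-ins for your actual schema. The key point is that customer_id is never dropped: VectorAssembler builds the features vector only from the listed columns, and customer_id is carried through transform() untouched:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler

# Hypothetical training data: customer_id plus two numeric feature columns and a label.
df = spark.createDataFrame(
    [("c001", 34, 52000.0, 1.0),
     ("c002", 21, 31000.0, 0.0)],
    ["customer_id", "age", "income", "label"])

# Only the listed columns become model input; customer_id is left alone.
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=10)
model = Pipeline(stages=[assembler, rf]).fit(df)

# customer_id is still in the output, so each prediction maps back to a customer.
model.transform(df).select("customer_id", "prediction").show()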