Spark MLlib linear regression (linear least squares) gives random results

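I'm new to Spark and machine learning. I have worked through several of the MLlib tutorials successfully, but I can't get this one to work:

I found the sample code here: https://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression

(the LinearRegressionWithSGD section)

Here is the code: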

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.regression.LinearRegressionModel
    import org.apache.spark.mllib.regression.LinearRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors

    // Load and parse the data
    val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }.cache()

    // Building the model
    val numIterations = 100
    val model = LinearRegressionWithSGD.train(parsedData, numIterations)

    // Evaluate model on training examples and compute training error
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
    println("training Mean Squared Error = " + MSE)

    // Save and load model
    model.save(sc, "myModelPath")
    val sameModel = LinearRegressionModel.load(sc, "myModelPath")

(This is exactly what is on the website.)

The result is

training Mean Squared Error = 6.2087803138063045

and

valuesAndPreds.collect

returns

    Array[(Double, Double)] = Array((-0.4307829,-1.8383286021929077), (-0.1625189,-1.4955700806407322), (-0.1625189,-1.118820892849544), (-0.1625189,-1.6134108278724875), (0.3715636,-0.45171266551058276), (0.7654678,-1.861316066986158), (0.8544153,-0.3588282725617985), (1.2669476,-0.5036812148225209), (1.2669476,-1.1534698170911792), (1.2669476,-0.3561392231695041), (1.3480731,-0.7347031705813306), (1.446919,-0.08564658011814863), (1.4701758,-0.656725375080344), (1.4929041,-0.14020483324910105), (1.5581446,-1.9438858658143454), (1.5993876,-0.02181165554398845), (1.6389967,-0.3778677315868635), (1.6956156,-1.1710092824030043), (1.7137979,0.27583044213064634), (1.8000583,0.7812664902440078), (1.8484548,0.94605507153074), (1.8946169,-0.7217282082851512), (1.9242487,-0.24422843221437684),...

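My problem is that the predictions look completely random (and wrong). Since this is an exact copy of the website example, using the same dataset (the training set), I don't know where to start looking. Am I missing something?

Please give me some advice or pointers on where to look; I'm happy to read up and experiment.

Thanks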


Answer:

SGD-based linear regression requires tuning the step size; see http://spark.apache.org/docs/latest/mllib-optimization.html for details.
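For context, the update described on that optimization page can be written roughly as follows. The notation is paraphrased here rather than quoted: $s$ stands for the `stepSize` parameter and $g^{(t)}$ for the (sub)gradient estimate at iteration $t$.

```latex
w^{(t+1)} = w^{(t)} - \gamma_t \, g^{(t)}, \qquad \gamma_t = \frac{s}{\sqrt{t}}
```

So `stepSize` only sets the initial scale of the update, which then shrinks with the iteration count. If $s$ is too large for the scale of the features, the first iterations can overshoot badly, and a limited number of iterations may not be enough to recover.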

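In your example, if you set the step size to 0.1, you get better results (MSE = 0.5). Note that the code below also enables the intercept term with setIntercept(true).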

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.regression.LinearRegressionModel
    import org.apache.spark.mllib.regression.LinearRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors

    // Load and parse the data
    val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }.cache()

    // Build the model
    val regression = new LinearRegressionWithSGD().setIntercept(true)
    regression.optimizer.setStepSize(0.1)
    val model = regression.run(parsedData)

    // Evaluate model on training examples and compute training error
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
    println("training Mean Squared Error = " + MSE)
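To get a feel for why the step size matters so much, here is a minimal plain-Scala sketch that needs no Spark. It is only an illustration under stated assumptions: the data is a hypothetical toy set generated from y = 2x (not lpsa.data), the names `StepSizeDemo`, `fit` and `mse` are made up for this example, and the loop is ordinary full-batch gradient descent with the step scaled as stepSize / sqrt(t). It does not reproduce MLlib's implementation or the MSE values quoted above.

```scala
// Plain Scala, no Spark required. Toy data only: this is NOT lpsa.data.
object StepSizeDemo {
  // Hypothetical one-feature dataset generated from y = 2 * x (no noise).
  val xs: Array[Double] = Array(1.0, 2.0, 3.0, 4.0)
  val ys: Array[Double] = xs.map(_ * 2.0)

  // Mean squared error of the model y = w * x on the toy data.
  def mse(w: Double): Double =
    xs.zip(ys).map { case (x, y) => math.pow(w * x - y, 2) }.sum / xs.length

  // Full-batch gradient descent on the loss 0.5 * mean((w * x - y)^2),
  // with the step at iteration t scaled as stepSize / sqrt(t).
  def fit(stepSize: Double, numIterations: Int): Double = {
    var w = 0.0
    for (t <- 1 to numIterations) {
      val gradient =
        xs.zip(ys).map { case (x, y) => (w * x - y) * x }.sum / xs.length
      w -= stepSize / math.sqrt(t.toDouble) * gradient
    }
    w
  }

  // Compare two step sizes after the same small number of iterations.
  def main(args: Array[String]): Unit =
    for (stepSize <- Seq(1.0, 0.1)) {
      val w = fit(stepSize, numIterations = 10)
      println(s"stepSize = $stepSize, w = $w, MSE = ${mse(w)}")
    }
}
```

On this toy data, ten iterations with a step size of 1.0 overshoot on every update and end far from the true weight of 2, while a step size of 0.1 ends close to it. The same kind of comparison on your real data (trying a few values such as 1.0, 0.1 and 0.01 and comparing the training MSE) is a reasonable way to choose the step size.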

For another example on a more realistic dataset, see

https://github.com/selvinsource/spark-pmml-exporter-validator/blob/master/src/main/resources/datasets/winequalityred_linearregression.md

https://github.com/selvinsource/spark-pmml-exporter-validator/blob/master/src/main/resources/spark_shell_exporter/linearregression_winequalityred.scala
