I want to run a multinomial logistic regression, but I can't set the threshold and thresholds parameters correctly. Consider the following DF:
from pyspark.ml.linalg import DenseVector

test_train_df = (
    sqlc.createDataFrame([(0, DenseVector([-1.0, 1.2, 0.7])),
                          (0, DenseVector([3.1, -2.0, -2.9])),
                          (1, DenseVector([1.0, 0.8, 0.3])),
                          (1, DenseVector([4.2, 1.4, -1.7])),
                          (0, DenseVector([-1.9, 2.5, -2.3])),
                          (2, DenseVector([2.6, -0.2, 0.2])),
                          (1, DenseVector([0.3, -3.4, 1.8])),
                          (2, DenseVector([-1.0, -3.5, 4.7]))],
                         ['label', 'features'])
)
My label has 3 classes, so I have to set thresholds (plural, whose default is None) rather than threshold (singular, whose default is 0.5). I then wrote the following code:
from pyspark.ml import classification as cl

test_logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThresholds([.5, .5, .5])
)
Then I want to fit the model on my DF:
test_logit = test_logit_abst.fit(test_train_df)
but when executing this last command I get an error:
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:

~/anaconda3/lib/python3.6/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    318                 "An error occurred while calling {0}{1}{2}.\n".
--> 319                 format(target_id, ".", name), value)
    320             else:

Py4JJavaError: An error occurred while calling o3769.fit.
: java.lang.IllegalArgumentException: requirement failed: Logistic Regression found inconsistent values for threshold and thresholds. Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.

During handling of the above exception, another exception occurred:

IllegalArgumentException                  Traceback (most recent call last)
<ipython-input-211-8f3443f41b6b> in <module>()
----> 1 test_logit = test_logit_abst.fit(test_train_df)

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/base.py in fit(self, dataset, params)
     62                 return self.copy(params)._fit(dataset)
     63             else:
---> 64                 return self._fit(dataset)
     65         else:
     66             raise ValueError("Params must be either a param map or a list/tuple of param maps, "

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit(self, dataset)
    263
    264     def _fit(self, dataset):
--> 265         java_model = self._fit_java(dataset)
    266         return self._create_model(java_model)
    267

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py in _fit_java(self, dataset)
    260         """
    261         self._transfer_params_to_java()
--> 262         return self._java_obj.fit(dataset._jdf)
    263
    264     def _fit(self, dataset):

~/anaconda3/lib/python3.6/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134
   1135         for temp_arg in temp_args:

~/anaconda3/lib/python3.6/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     77                 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
     78             if s.startswith('java.lang.IllegalArgumentException: '):
---> 79                 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
     80             raise
     81         return deco

IllegalArgumentException: 'requirement failed: Logistic Regression found inconsistent values for threshold and thresholds. Param threshold is set (0.5), indicating binary classification, but Param thresholds is set with length 3. Clear one Param value to fix this problem.'
The error says that threshold is set. This looks strange, as the documentation says that setting thresholds (plural) clears threshold (singular), so the value 0.5 should have been deleted. So, how do I clear threshold, given that no clearThreshold() method exists?
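(Side note: although there is no dedicated clearThreshold() method, PySpark estimators do inherit a generic clear(param) method from Params that un-sets an explicitly set value. A minimal sketch of clearing the param that way, reusing the cl alias introduced above:)

# Sketch: un-setting an explicitly set param via the generic Params.clear()
lr = cl.LogisticRegression().setThreshold(0.7)
lr.isSet(lr.threshold)   # True -- threshold has been explicitly set
lr.clear(lr.threshold)   # clears the param from the param map
lr.isSet(lr.threshold)   # False -- back to the default behaviour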
To achieve that, I tried to clear threshold this way:
logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThresholds([.5, .5, .5])
    .setThreshold(None)
)
This time the fit command works, and I even obtain the model's intercept and coefficients:
test_logit.interceptVector
DenseVector([65.6445, 31.6369, -97.2814])

test_logit.coefficientMatrix
DenseMatrix(3, 3, [-76.4534, -19.4797, -79.4949, 12.3659, 4.642, 4.1057, 64.0876, 14.8377, 75.3892], 1)
But if I try to get thresholds (plural) from test_logit_abst, I get an error:
test_logit_abst.getThresholds()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-214-fc1c8617ce80> in <module>()
----> 1 test_logit_abst.getThresholds()

~/anaconda3/lib/python3.6/site-packages/pyspark/ml/classification.py in getThresholds(self)
    363         if not self.isSet(self.thresholds) and self.isSet(self.threshold):
    364             t = self.getOrDefault(self.threshold)
--> 365             return [1.0-t, t]
    366         else:
    367             return self.getOrDefault(self.thresholds)

TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'
What does this mean?
As a further detail, strangely (and, to me, incomprehensibly), reversing the order of the parameter settings produces the first error I posted above:
logit_abst = (
    cl.LogisticRegression()
    .setFamily('multinomial')
    .setThreshold(None)
    .setThresholds([.5, .5, .5])
)
Why does changing the order of the "set" instructions change the output as well?
Answer:
This is a messy situation indeed...

The short answer is:
- setThresholds (plural) not clearing threshold (singular) seems to be a bug
- for multinomial classification (i.e. number of classes > 2), setThresholds does not work as you would expect (and arguably you don't need it)
- if all you need is to set some "thresholds" at their "default" value of 0.5, you don't have a problem - simply omit any relevant argument or setThresholds statement
- if you really need to apply different decision thresholds to different classes in multinomial classification, you will have to do it manually, by post-processing the respective probabilities, i.e. the probability column in the transformed dataframe (it does work OK though with setThreshold(s) for binary classification) - see the sketch right after this list
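Regarding that last point: the docstring of the thresholds param describes the rule as predicting the class with the largest value p/t, where p is the original probability of a class and t is that class's threshold. Here is a minimal, hypothetical sketch of applying such per-class thresholds by hand to the probability column (the names my_thresholds and thresholded_prediction are my own; mlorModel and mdf refer to the multinomial model and data defined later in this answer):

import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

my_thresholds = [0.1, 0.2, 0.8]   # hypothetical per-class thresholds (all > 0)

def thresholded_prediction(v):
    # pick the class with the largest p/t ratio, i.e. the rule described
    # in the thresholds docstring, applied here manually
    ratios = [p / t for p, t in zip(v.toArray(), my_thresholds)]
    return float(np.argmax(ratios))

predict_udf = F.udf(thresholded_prediction, DoubleType())

out = (mlorModel.transform(mdf)
       .withColumn('my_prediction', predict_udf('probability')))
out.select('probability', 'prediction', 'my_prediction').show(truncate=False)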
And now for the long answer...
Let's start with binary classification, adapting the toy data from the docs:
spark.version
# u'2.2.0'

from pyspark.ml.classification import LogisticRegression
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors

bdf = sc.parallelize([
    Row(label=1.0, features=Vectors.dense(0.0, 5.0)),
    Row(label=0.0, features=Vectors.dense(1.0, 2.0)),
    Row(label=1.0, features=Vectors.dense(2.0, 1.0)),
    Row(label=0.0, features=Vectors.dense(3.0, 3.0))]).toDF()

blor = LogisticRegression(threshold=0.7, thresholds=[0.3, 0.7])
We don't need to set thresholds (plural) here, as threshold=0.7 is enough, but it will prove useful for the setThreshold demonstration below.
blorModel = blor.fit(bdf)  # works OK

blor.getThreshold()
# 0.7
blor.getThresholds()
# [0.3, 0.7]

blorModel.transform(bdf).show(truncate=False)  # transform the training data
Here is the result:
+---------+-----+------------------------------------------+----------------------------------------+----------+
|features |label|rawPrediction                             |probability                             |prediction|
+---------+-----+------------------------------------------+----------------------------------------+----------+
|[0.0,5.0]|1.0  |[-1.138455151184087,1.138455151184087]    |[0.242604109995602,0.757395890004398]   |1.0       |
|[1.0,2.0]|0.0  |[-0.6056346859838877,0.6056346859838877]  |[0.35305562698104337,0.6469443730189567]|0.0       |
|[2.0,1.0]|1.0  |[0.26586039040308496,-0.26586039040308496]|[0.5660763559614698,0.4339236440385302] |0.0       |
|[3.0,3.0]|0.0  |[1.6453673835702176,-1.6453673835702176]  |[0.8382639556951765,0.16173604430482344]|0.0       |
+---------+-----+------------------------------------------+----------------------------------------+----------+
What is the meaning of thresholds=[0.3, 0.7]? The answer lies in the 2nd row, where the prediction is 0.0, despite the probability of class 1.0 being higher (0.65): 0.65 is indeed higher than 0.35, but it is lower than the threshold we have set for this class (0.7), hence it is not classified as such.
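In other words, the binary decision applied here amounts to comparing P(class 1) against its own threshold rather than against P(class 0). A minimal sketch of that reading (mine, not Spark source code):

p1 = 0.6469443730189567   # P(class 1) in the 2nd row above
threshold = 0.7           # the threshold we set for class 1

# class 1 is predicted only if its probability exceeds its threshold
prediction = 1.0 if p1 > threshold else 0.0
print(prediction)  # 0.0, even though p1 > P(class 0) = 0.35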
Let's now attempt a seemingly identical operation, but with setThreshold(s) instead:
blor2 = (LogisticRegression()
         .setThreshold(0.7)
         .setThresholds([0.3, 0.7])
         )  # works OK

blorModel2 = blor2.fit(bdf)
[...]
IllegalArgumentException: u'requirement failed: Logistic Regression getThreshold found inconsistent values for threshold (0.5) and thresholds (equivalent to 0.7)'
Nice, eh? setThresholds (plural) does indeed seem to have cleared the value of threshold (0.7) we set in the previous line, as claimed in the docs, but it seems to have merely restored it to its default value of 0.5...
Omitting .setThreshold(0.7) gives the first error you report yourself (not shown).
Reversing the order of the parameter settings resolves the issue (!!!) and, what's more, leaves both getThreshold (singular) and getThresholds (plural) operational (in contrast with your case):
blor2 = (LogisticRegression()
         .setThresholds([0.3, 0.7])
         .setThreshold(0.7)
         )

blorModel2 = blor2.fit(bdf)  # works OK

blor2.getThreshold()
# 0.7
blor2.getThresholds()
# [0.30000000000000004, 0.7]
Let's move now to the multinomial case; we'll stick again to the example in the docs, with data from the Spark Github repository (they should also be available locally, in your $SPARK_HOME/data/mllib/sample_multiclass_classification_data.txt, but I am working on a Databricks notebook); it is a 3-class case, with labels in {0.0, 1.0, 2.0}.
data_path = "/FileStore/tables/sample_multiclass_classification_data.txt"
mdf = spark.read.format("libsvm").load(data_path)
Similarly to the binary case above, the elements of our thresholds (plural) should sum to 1; let's ask for a threshold of 0.8 for class 2:
mlor = (LogisticRegression()
        .setFamily("multinomial")
        .setThresholds([0, 0.2, 0.8])
        .setThreshold(0.8)
        )
mlorModel = mlor.fit(mdf)  # works OK

mlor.getThreshold()
# 0.8
mlor.getThresholds()
# [0.19999999999999996, 0.8]
Looks OK, but let's ask for a prediction on the (training) dataset:
mlorModel.transform(mdf).show(truncate=False)
I have isolated only a single row - it should be the second-to-last of the full output:
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
|label|features                                            |rawPrediction                                            |probability                                                    |prediction|
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
[...]
|0.0  |(4,[0,1,2,3],[0.111111,-0.333333,0.38983,0.166667]) |[36.67790353804905,-74.71196613173531,38.034062593686244]|[0.20486526556822454,8.619113376801409E-50,0.7951347344317755] |2.0       |
[...]
+-----+----------------------------------------------------+---------------------------------------------------------+---------------------------------------------------------------+----------+
Scrolling to the right, you'll see that despite the probability for class 2.0 here being lower than the threshold we have set (0.8), the row is indeed predicted as 2.0 - in contrast with the binary case demonstrated above...
So, what to do? Simply remove all the threshold-related statements; you don't need them - even setFamily is unnecessary, as the algorithm will detect on its own that you have more than 2 classes. This will give results identical to the above:
mlor = LogisticRegression() # works OK - no family, no threshold(s)
To summarize:
- In both the binary and the multinomial case, what the algorithm actually returns is a vector of probabilities with length equal to the number of classes, with its elements summing to 1.
- In the binary case only, Spark allows you to go further and not naively select the highest-probability class as the prediction, but apply a user-defined threshold instead; this setting might be useful, e.g., in cases with imbalanced data.
- This threshold(s) setting has in fact no effect in the multinomial case, where Spark will always return as prediction the class with the highest probability (a quick check is sketched right after this list).
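A hypothetical way to verify that third point on the transformed dataframe (my own snippet, not from the original post or the docs):

import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# the prediction should always equal the argmax of the probability vector
argmax_udf = F.udf(lambda v: float(np.argmax(v.toArray())), DoubleType())

(mlorModel.transform(mdf)
    .withColumn('argmax', argmax_udf('probability'))
    .filter(F.col('argmax') != F.col('prediction'))
    .count())
# 0 -- no disagreement, despite the thresholds we set above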
Despite the mess in the documentation (about which I have argued elsewhere) and the possibility of some bugs, let me say about (3) that this design choice is not unjustifiable; as has been nicely argued elsewhere (emphasis in the original):
the statistical component of your exercise ends when you output a probability for each class of your new samples. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
Although the above argument was made for the binary case, it fully holds for the multinomial one, too...