pyspark.ml: Type error when computing precision and recall

I am trying to compute precision, recall, and F1 for a classifier using pyspark.ml:

import pandas
from pyspark.mllib.evaluation import MulticlassMetrics

# completePipeline, training and test are defined earlier (not shown)
model = completePipeline.fit(training)
predictions = model.transform(test)

mm = MulticlassMetrics(predictions.select(["label", "prediction"]).rdd)
labels = sorted(predictions.select("prediction").rdd.distinct().map(lambda r: r[0]).collect())

for label in labels:
    print label
    print "Precision = %s" % mm.precision(label=label)
    print "Recall = %s" % mm.recall(label=label)
    print "F1 Score = %s" % mm.fMeasure(label=label)

# One row per label: the label itself plus its three metrics
metrics = pandas.DataFrame(
    [(label, mm.precision(label=label), mm.recall(label=label), mm.fMeasure(label=label)) for label in labels],
    columns=["Label", "Precision", "Recall", "F1"])

The resulting DataFrame predictions has the following schema:

[('features', 'vector'), ('label', 'int'), ('rawPrediction', 'vector'), ('probability', 'vector'), ('prediction', 'double')]

Calling mm.precision raises the following error:

Traceback (most recent call last):
  File "ml_pipeline_factory_test", line 1, in <module>
  File "ml_pipeline_factory_test", line 92, in ml_pipeline_factory_test
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/mllib/evaluation.py", line 240, in precision
    return self.call("precision", float(label))
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/mllib/common.py", line 146, in call
    return callJavaFunc(self._sc, getattr(self._java_model, name), *a)
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/mllib/common.py", line 123, in callJavaFunc
    return _java2py(sc, func(*args))
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/py4j/java_gateway.py", line 1160, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/py4j/protocol.py", line 320, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o371.precision.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 22.0 failed 4 times, most recent failure: Lost task 7.3 in stage 22.0 (TID 153, dhbpdn12.de.t-internal.com, executor 4): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/tmp/conda-e6ac7105-4788-4b4c-9163-ba8763f29ead/real/envs/conda-env/lib/python2.7/site-packages/pyspark/worker.py", line 245, in main
    process()
  File "/tmp/conda-e6ac7105-4788-4b4c-9163-ba8763f29ead/real/envs/conda-env/lib/python2.7/site-packages/pyspark/worker.py", line 240, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/tmp/conda-e6ac7105-4788-4b4c-9163-ba8763f29ead/real/envs/conda-env/lib/python2.7/site-packages/pyspark/serializers.py", line 372, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/sql/session.py", line 677, in prepare
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/sql/types.py", line 1421, in verify
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/sql/types.py", line 1402, in verify_struct
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/sql/types.py", line 1421, in verify
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/sql/types.py", line 1415, in verify_default
  File "/tmp/conda-9a013169-8b21-43cb-bcbe-06fc31523d3e/real/envs/conda-env/lib/python2.7/site-packages/pyspark/sql/types.py", line 1310, in verify_acceptable_types
TypeError: field prediction: DoubleType can not accept object 0 in type <type 'int'>

Answer:

As the error message says:

TypeError: field prediction: DoubleType can not accept object 0 in type <type 'int'>

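Types matter. Although int and float are often interchangeable in Python, they are not in Java. In the schema above, label is an int, while the field being checked is a DoubleType, so the value 0 is rejected.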
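To make the failure concrete, here is a minimal plain-Python sketch of a strict type check. It does not use Spark: `verify_double` is a hypothetical helper that only mimics the kind of check reported at the bottom of the traceback, under the assumption that a double field accepts Python floats and nothing else.

```python
def verify_double(field, value):
    """Hypothetical stand-in for a strict DoubleType check: only floats pass."""
    if not isinstance(value, float):
        raise TypeError(
            "field %s: DoubleType can not accept object %r in type %s"
            % (field, value, type(value))
        )
    return value


# Python itself treats the two values as equal ...
assert 0 == 0.0

# ... but a strict schema check does not: the int is rejected.
try:
    verify_double("prediction", 0)
    rejected = False
    message = ""
except TypeError as exc:
    rejected = True
    message = str(exc)

# An explicit conversion, the plain-Python analogue of a cast to double, passes.
accepted = verify_double("prediction", float(0))
```

The point of the sketch is only that equality in Python (0 == 0.0) says nothing about whether a typed schema will accept the value; an explicit conversion is needed.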

The simplest solution is to cast the label field upstream:

predictions = (predictions
    .withColumn("label", predictions["label"].cast("double")))
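With the cast in place, the per-label calls in the question should go through. As a sanity check on the numbers they return, the usual per-label definitions of precision, recall, and F1 can be sketched in plain Python. This is not Spark's implementation: `per_label_metrics` and the toy data are hypothetical, and returning 0.0 for a zero denominator is an assumption of this sketch. The pairs are written as (prediction, label), which is the order the MulticlassMetrics documentation describes for its input.

```python
def per_label_metrics(pairs, label):
    """Return (precision, recall, f1) for one label from (prediction, label) pairs."""
    tp = sum(1 for p, l in pairs if p == label and l == label)  # true positives
    fp = sum(1 for p, l in pairs if p == label and l != label)  # false positives
    fn = sum(1 for p, l in pairs if p != label and l == label)  # false negatives
    # Assumption of this sketch: an empty denominator yields 0.0
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical toy data: (prediction, label), both already doubles
pairs = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
metrics = {label: per_label_metrics(pairs, label) for label in (0.0, 1.0)}
```

Comparing such a hand computation against mm.precision, mm.recall, and mm.fMeasure on a small sample is one way to confirm that the columns were passed in the intended order.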
