While cross-validating a decision tree with Spark's ML library, calling cv.fit(train_dataset)
raises this error:
pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Invalid initial capacity'
Other than the DataFrame being empty, I haven't found much information on what this could mean, and my DataFrame is not empty. Here is my code:
import pandas as pd
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# abalone.data has no header row, so keep pandas from consuming the first record as one
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data', header=None)
df.columns = ['Sex', 'Length', 'Diameter', 'Height', 'Whole weight', 'Shucked weight', 'Viscera weight', 'Shell weight', 'Rings']
train_dataset = sqlContext.createDataFrame(df)
column_types = train_dataset.dtypes
categoricalCols = []
numericCols = []
for ct in column_types:
    if ct[1] == 'string':
        categoricalCols += [ct[0]]
    else:
        numericCols += [ct[0]]
stages = []
for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    stages += [stringIndexer]
# use a list comprehension rather than map(): on Python 3, map() returns an iterator that cannot be concatenated to a list
assemblerInputs = [c + "Index" for c in categoricalCols] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
labelIndexer = StringIndexer(inputCol='Rings', outputCol='indexedLabel')
stages += [labelIndexer]
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features")
evaluator = MulticlassClassificationEvaluator(labelCol='indexedLabel', predictionCol='prediction', metricName='f1')
paramGrid = (ParamGridBuilder()
             .addGrid(dt.maxDepth, [1, 2, 6])
             .addGrid(dt.maxBins, [20, 40])
             .build())
stages += [dt]
pipeline = Pipeline(stages=stages)
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=1)  # numFolds=1 turns out to be the culprit (see answer below)
cvModel = cv.fit(train_dataset)
train_dataset = cvModel.transform(train_dataset)
I'm running standalone Spark locally. What could be going wrong?
Thanks!
Answer:
So, the problem was setting CrossValidator's numFolds parameter to 1. CrossValidator performs k-fold cross-validation, so it needs at least two folds to form train/validation pairs; with a single fold there is no held-out validation set, and the fold-splitting code fails with this unhelpful error. If I want to tune parameters with a ParamGrid using just one train-test split, apparently I need to use TrainValidationSplit instead.
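For reference, a minimal sketch of that fix, assuming the pipeline, paramGrid, and evaluator objects from the question are reused (the trainRatio value is an illustrative choice, not something from the original post):

from pyspark.ml.tuning import TrainValidationSplit

# TrainValidationSplit evaluates each parameter combination on a single
# random train/validation split instead of k cross-validation folds
tvs = TrainValidationSplit(estimator=pipeline,
                           estimatorParamMaps=paramGrid,
                           evaluator=evaluator,
                           trainRatio=0.8)  # 80% train / 20% validation; assumed ratio
tvsModel = tvs.fit(train_dataset)
train_dataset = tvsModel.transform(train_dataset)

Alternatively, keeping CrossValidator but raising numFolds to 2 or more also makes the error go away; TrainValidationSplit is simply the cheaper option when a single split is enough.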