ImportError: No module named model_selection._validation
I am trying to use the spark-sklearn library to perform a grid search on a Spark cluster. To do so, I run nohup ./spark_python_shell.sh > output.log & at my bash shell to get the Spark cluster going, and this script also submits my Python script via spark-submit --master yarn 'rforest_grid_search.py' (see below):
SPARK_HOME=/u/users/******/spark-2.3.0 \
Q_CORE_LOC=/u/users/******/****** \
ENV=local \
HIVE_HOME=/usr/hdp/current/hive-client \
SPARK2_HOME=/u/users/******/spark-2.3.0 \
HADOOP_CONF_DIR=/etc/hadoop/conf \
HIVE_CONF_DIR=/etc/hive/conf \
HDFS_PREFIX=hdfs:// \
PYTHONPATH=/u/users/******/******/python-lib:/u/users/******/******/python-lib:/u/users/******/pyenv/prod_python_libs/lib/python2.7/site-packages/:$PYTHON_PATH \
YARN_HOME=/usr/hdp/current/hadoop-yarn-client \
SPARK_DIST_CLASSPATH=$(hadoop classpath):$(yarn classpath):/etc/hive/conf/hive-site.xml \
PYSPARK_PYTHON=/usr/bin/python2.7 \
QQQ_LOC=/u/users/******/three-queues \
spark-submit \
--master yarn 'rforest_grid_search.py' \
--executor-memory 10g \
--num-executors 8 \
--executor-cores 10 \
--conf spark.port.maxRetries=80 \
--conf spark.dynamicAllocation.enabled=False \
--conf spark.default.parallelism=6000 \
--conf spark.sql.shuffle.partitions=6000 \
--principal ************************ \
--queue default \
--name lets_get_starting \
--keytab /u/users/******/.******.keytab \
--driver-memory 10g
The rforest_grid_search.py Python script contains the following source code, which tries to hook the grid search up to the Spark cluster:
# Spark configuration
from pyspark import SparkContext, SparkConf
conf = SparkConf()
sc = SparkContext(conf=conf)
print('Spark context:', sc)

# Hyperparameter grid
parameters = {'n_estimators': list(range(150, 200, 25)),
              'criterion': ['gini', 'entropy'],
              'max_depth': list(range(2, 11, 2)),
              'max_features': [i/10. for i in range(10, 16)],
              'class_weight': [{0: 1, 1: i/10.} for i in range(10, 17)],
              'min_samples_split': list(range(2, 7))}

# Run the grid search with the spark_sklearn library
# (X and y, the feature matrix and labels, are assumed to be defined earlier in the script)
from spark_sklearn import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
classifiers_grid = GridSearchCV(sc, estimator=RandomForestClassifier(), param_grid=parameters,
                                scoring='precision', cv=5, n_jobs=-1)
classifiers_grid.fit(X, y)
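As a side note, the grid above expands to quite a few model fits, which is the whole point of distributing it over Spark. A minimal sketch (assuming the parameters dict defined above) that counts them with scikit-learn's ParameterGrid:

# Count the candidate settings in the grid defined above (needs scikit-learn >= 0.18).
from sklearn.model_selection import ParameterGrid
n_candidates = len(ParameterGrid(parameters))   # 2*2*5*6*7*5 = 4200 combinations
print(n_candidates * 5)                         # with cv=5: 21000 individual fits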
When I run the Python script, the following error is thrown at the classifiers_grid.fit(X, y) line:
ImportError: No module named model_selection._validation
or, in somewhat more detail (though not in its entirety, because it is very long):
...
('Spark context:', <SparkContext master=yarn appName=rforest_grid_search.py>)
...
18/10/24 12:43:50 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, oser404637.*****.com, executor 2, partition 2, PROCESS_LOCAL, 42500 bytes)
18/10/24 12:43:50 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, oser404637.*****.com, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/u/applic/data/hdfs2/hadoop/yarn/local/usercache/*****/appcache/application_1539785180345_36939/container_e126_1539785180345_36939_01_000003/pyspark.zip/pyspark/worker.py", line 216, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/u/applic/data/hdfs2/hadoop/yarn/local/usercache/*****/appcache/application_1539785180345_36939/container_e126_1539785180345_36939_01_000003/pyspark.zip/pyspark/worker.py", line 58, in read_command
    command = serializer._read_with_length(file)
  File "/u/applic/data/hdfs2/hadoop/yarn/local/usercache/*****/appcache/application_1539785180345_36939/container_e126_1539785180345_36939_01_000003/pyspark.zip/pyspark/serializers.py", line 170, in _read_with_length
    return self.loads(obj)
  File "/u/applic/data/hdfs2/hadoop/yarn/local/usercache/*****/appcache/application_1539785180345_36939/container_e126_1539785180345_36939_01_000003/pyspark.zip/pyspark/serializers.py", line 562, in loads
    return pickle.loads(obj)
ImportError: No module named model_selection._validation

        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
...
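Since the traceback shows the ImportError being raised while a worker unpickles the task (pickle.loads inside pyspark/serializers.py), it points at the executors' Python environment rather than the driver. A minimal sketch (purely as a sanity check, not part of the original script, and assuming the same SparkContext sc) to see which interpreter and scikit-learn version the executors actually use:

# Probe the executors: which Python runs the tasks, and can it import
# sklearn.model_selection (which only exists in scikit-learn >= 0.18)?
def probe(_):
    import sys
    try:
        import sklearn
        import sklearn.model_selection
        return (sys.executable, getattr(sklearn, '__version__', 'unknown'))
    except ImportError as exc:
        return (sys.executable, 'import failed: %s' % exc)

# Run the probe on several partitions so more than one executor is hit.
print(sc.parallelize(range(8), 8).map(probe).distinct().collect())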
Answer: