I am trying to use scikit-learn's GridSearchCV class to tune the hyperparameters of my logistic regression algorithm.
However, even when running jobs in parallel, GridSearchCV takes days to finish unless you are tuning only a single parameter. I considered using Apache Spark to speed this up, but I have two questions.
- To use Apache Spark, is it necessary to have multiple machines sharing the workload? For example, if you only have one laptop, is there any point in using Apache Spark?
- Is there a simple way to use scikit-learn's GridSearchCV with Apache Spark?
I have read the documentation, but it talks about running parallel workers across the entire machine learning pipeline, whereas I only want it for the parameter tuning.
Imports
import datetime
%matplotlib inline
import pylab
import pandas as pd
import math
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.pylab as pylab
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn import datasets, tree, metrics, model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.linear_model import LogisticRegression, LinearRegression, Perceptron
from sklearn.feature_selection import SelectKBest, chi2, VarianceThreshold, RFE
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB

# locate the local Spark installation and start a SparkContext
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext()

from datetime import datetime as dt
import scipy
import itertools

# load the cleaned Airbnb dataset
ucb_w_reindex = pd.read_csv('clean_airbnb.csv')
ucb = pd.read_csv('clean_airbnb.csv')

# plotting defaults
pylab.rcParams['figure.figsize'] = 15, 10
plt.style.use("fivethirtyeight")
new_style = {'grid': False}
plt.rc('axes', **new_style)
Algorithm hyperparameter tuning
X = ucb.drop('country_destination', axis=1)
y = ucb['country_destination'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42, stratify=y)

knn = KNeighborsClassifier()
parameters = {'leaf_size': range(1, 100),
              'n_neighbors': range(1, 10),
              'weights': ['uniform', 'distance'],
              'algorithm': ['kd_tree', 'ball_tree', 'brute', 'auto']}

# ======== What I want to do in Apache Spark ========= #
%%time
parameters = {'n_neighbors': range(1, 100)}
clf1 = GridSearchCV(estimator=knn, param_grid=parameters, n_jobs=5).fit(X_train, y_train)
best = clf1.best_estimator_
# ==================================================== #
Answer:
You can use a library called spark-sklearn to run a distributed parameter sweep. You're right that you would need either a cluster of machines, or a single machine with multiple CPUs, to get a parallel speedup.
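For reference, a minimal sketch of what that could look like, assuming spark-sklearn is installed (pip install spark-sklearn) and reusing X_train, y_train, and the parameter grid from your question:

from sklearn.neighbors import KNeighborsClassifier
from spark_sklearn import GridSearchCV  # drop-in replacement for sklearn's GridSearchCV
import findspark
findspark.init()
import pyspark

# 'local[*]' runs Spark across all cores of a single machine, so this also
# works on one laptop; pass a cluster master URL instead if you have one.
# (Skip this if a SparkContext named sc already exists, as in your imports.)
sc = pyspark.SparkContext('local[*]')

knn = KNeighborsClassifier()
parameters = {'n_neighbors': range(1, 100)}

# Same interface as scikit-learn's GridSearchCV, except the SparkContext is
# passed as the first argument; each parameter combination is evaluated as a
# separate Spark task.
clf1 = GridSearchCV(sc, knn, param_grid=parameters)
clf1.fit(X_train, y_train)
best = clf1.best_estimator_

Note that spark-sklearn distributes only the parameter search itself: each individual fit still runs plain scikit-learn on one worker, so the training data has to fit in memory on every node.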
Hope this helps,
Roope – Microsoft MMLSpark Team