I am trying to use scikit-learn's GridSearchCV class to tune the hyperparameters of my logistic regression algorithm.
However, even when running jobs in parallel, GridSearchCV takes days to finish unless you are tuning only a single parameter. I considered using Apache Spark to speed this up, but I have two questions.
- To use Apache Spark, is it necessary to have multiple machines sharing the workload? For example, if you only have one laptop, is there any point in using Apache Spark?
- Is there a simple way to use scikit-learn's GridSearchCV with Apache Spark?
I have read the documentation, but it talks about running parallel workers across the entire machine learning pipeline, whereas I only want it for the parameter tuning.
Imports
import datetime
%matplotlib inline
import pylab
import pandas as pd
import math
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.pylab as pylab
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn import datasets, tree, metrics, model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
from sklearn.linear_model import LogisticRegression, LinearRegression, Perceptron
from sklearn.feature_selection import SelectKBest, chi2, VarianceThreshold, RFE
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB

# locate the local Spark installation and start a SparkContext
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext()

from datetime import datetime as dt
import scipy
import itertools

# load the cleaned Airbnb dataset
ucb_w_reindex = pd.read_csv('clean_airbnb.csv')
ucb = pd.read_csv('clean_airbnb.csv')

# plotting defaults
pylab.rcParams['figure.figsize'] = 15, 10
plt.style.use("fivethirtyeight")
new_style = {'grid': False}
plt.rc('axes', **new_style)
Algorithm hyperparameter tuning
X = ucb.drop('country_destination', axis=1)
y = ucb['country_destination'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42, stratify=y)

knn = KNeighborsClassifier()
parameters = {'leaf_size': range(1, 100),
              'n_neighbors': range(1, 10),
              'weights': ['uniform', 'distance'],
              'algorithm': ['kd_tree', 'ball_tree', 'brute', 'auto']}

# ======== What I want to do in Apache Spark ========= #
%%time
parameters = {'n_neighbors': range(1, 100)}
clf1 = GridSearchCV(estimator=knn, param_grid=parameters, n_jobs=5).fit(X_train, y_train)
best = clf1.best_estimator_
# ==================================================== #
Answer:
You can use a library called spark-sklearn to run a distributed parameter sweep. You're right that you would need either a cluster of machines, or a single machine with multiple CPUs, to get a parallel speedup.
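For reference, a minimal sketch of what that could look like, assuming spark-sklearn is installed (pip install spark-sklearn) and reusing X_train, y_train, and the parameter grid from your question:

from sklearn.neighbors import KNeighborsClassifier
from spark_sklearn import GridSearchCV  # drop-in replacement for sklearn's GridSearchCV
import findspark
findspark.init()
import pyspark

# 'local[*]' runs Spark across all cores of a single machine, so this also
# works on one laptop; pass a cluster master URL instead if you have one.
# (Skip this if a SparkContext named sc already exists, as in your imports.)
sc = pyspark.SparkContext('local[*]')

knn = KNeighborsClassifier()
parameters = {'n_neighbors': range(1, 100)}

# Same interface as scikit-learn's GridSearchCV, except the SparkContext is
# passed as the first argument; each parameter combination is evaluated as a
# separate Spark task.
clf1 = GridSearchCV(sc, knn, param_grid=parameters)
clf1.fit(X_train, y_train)
best = clf1.best_estimator_

Note that spark-sklearn distributes only the parameter search itself: each individual fit still runs plain scikit-learn on one worker, so the training data has to fit in memory on every node.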
Hope this helps,
Roope – Microsoft MMLSpark Team