How can I prevent overfitting in a gradient boosting machine?

I am comparing several models (gradient boosting machine, random forest, logistic regression, SVM, multilayer perceptron, and a Keras neural network) on a multi-class classification problem. I have used nested cross-validation and grid search on my models, run on my actual data and on randomised data, to check for overfitting. However, for the gradient boosting machine, no matter how I change my data or the model parameters, it achieves 100% accuracy on the random data every time. Is there something in my code that could be causing this?

Here is my code:

import pandas as pd
import numpy as np
from sklearn import model_selection, preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import KFold, GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.wrappers.scikit_learn import KerasClassifier

# Load and preprocess: drop the gene identifier, keep the first 26 feature
# columns, impute missing values with 0, and scale features to [0, 1].
dataset = pd.read_csv('data.csv')
data = dataset.drop(["gene"], axis=1)
df = data.iloc[:, 0:26]
df = df.fillna(0)
X = MinMaxScaler().fit_transform(df)

# Encode the four class labels as integers.
le = preprocessing.LabelEncoder()
encoded_value = le.fit_transform(["certain", "likely", "possible", "unlikely"])
Y = le.fit_transform(data["category"])

# Oversample the minority classes with SMOTE.
sm = SMOTE(random_state=100)
X_res, y_res = sm.fit_resample(X, Y)

seed = 7

# Models and their hyperparameter grids.
logreg = LogisticRegression(penalty='l1', solver='liblinear', multi_class='auto')
LR_par = {'penalty': ['l1'], 'C': [0.5, 1, 5, 10], 'max_iter': [100, 200, 500, 1000]}

rfc = RandomForestClassifier(n_estimators=500)
param_grid = {"max_depth": [3],
              "max_features": ["auto"],
              "min_samples_split": [2],
              "min_samples_leaf": [1],
              "bootstrap": [False],
              "criterion": ["entropy", "gini"]}

mlp = MLPClassifier(random_state=seed)
parameter_space = {'hidden_layer_sizes': [(50, 50, 50)],
                   'activation': ['relu'],
                   'solver': ['adam'],
                   'max_iter': [10000],
                   'alpha': [0.0001],
                   'learning_rate': ['constant']}

gbm = GradientBoostingClassifier()
param = {"loss": ["deviance"],
         "learning_rate": [0.001],
         "min_samples_split": [2],
         "min_samples_leaf": [1],
         "max_depth": [3],
         "max_features": ["auto"],
         "criterion": ["friedman_mse"],
         "n_estimators": [50]}

svm = SVC(gamma="scale")
tuned_parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 0.25, 0.5, 0.75)}

# Inner CV tunes hyperparameters; outer CV estimates generalisation.
inner_cv = KFold(n_splits=10, shuffle=True, random_state=seed)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=seed)

def baseline_model():
    model = Sequential()
    model.add(Dense(100, input_dim=X_res.shape[1], activation='relu'))  # dense layers perform: output = activation(dot(input, kernel) + bias)
    model.add(Dropout(0.5))
    model.add(Dense(50, activation='relu'))  # 50 hidden units in this layer
    model.add(Dense(4, activation='softmax'))  # one output per class
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

models = []
models.append(('GBM', GridSearchCV(gbm, param, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('RFC', GridSearchCV(rfc, param_grid, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('LR', GridSearchCV(logreg, LR_par, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('SVM', GridSearchCV(svm, tuned_parameters, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('MLP', GridSearchCV(mlp, parameter_space, cv=inner_cv, iid=False, n_jobs=1)))
models.append(('Keras', KerasClassifier(build_fn=baseline_model, epochs=100, batch_size=50, verbose=0)))

results = []
names = []
scoring = 'accuracy'
X_train, X_test, Y_train, Y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=0)

# Nested CV score for each model, then a score on the held-out test split.
for name, model in models:
    nested_cv_results = model_selection.cross_val_score(model, X_res, y_res, cv=outer_cv, scoring=scoring)
    results.append(nested_cv_results)
    names.append(name)
    msg = "Nested CV Accuracy %s: %f (+/- %f )" % (name, nested_cv_results.mean()*100, nested_cv_results.std()*100)
    print(msg)
    model.fit(X_train, Y_train)
    print('Test set accuracy: {:.2f}'.format(model.score(X_test, Y_test)*100), '%')

Output:

Nested CV Accuracy GBM: 90.952381 (+/- 2.776644 )
Test set accuracy: 90.48 %
Nested CV Accuracy RFC: 79.285714 (+/- 5.112122 )
Test set accuracy: 75.00 %
Nested CV Accuracy LR: 91.904762 (+/- 4.416009 )
Test set accuracy: 92.86 %
Nested CV Accuracy SVM: 94.285714 (+/- 3.563483 )
Test set accuracy: 96.43 %
Nested CV Accuracy MLP: 91.428571 (+/- 4.012452 )
Test set accuracy: 92.86 %

Code for the random data:

# Random labels and random features, run through the same models.
ran = np.random.randint(4, size=161)
random = np.random.normal(500, 100, size=(161, 161))
rand = np.column_stack((random, ran))
print(rand.shape)
X1 = rand[:161]
Y1 = rand[:, -1]
print("Random data counts of label '1': {}".format(sum(ran == 1)))
print("Random data counts of label '0': {}".format(sum(ran == 0)))
print("Random data counts of label '2': {}".format(sum(ran == 2)))
print("Random data counts of label '3': {}".format(sum(ran == 3)))

for name, model in models:
    cv_results = model_selection.cross_val_score(model, X1, Y1, cv=outer_cv, scoring=scoring)
    names.append(name)
    msg = "Random data CV %s: %f (+/- %f)" % (name, cv_results.mean()*100, cv_results.std()*100)
    print(msg)

Random data output:

Random data CV GBM: 100.000000 (+/- 0.000000)
Random data CV RFC: 62.941176 (+/- 15.306485)
Random data CV LR: 23.566176 (+/- 6.546699)
Random data CV SVM: 22.352941 (+/- 6.331220)
Random data CV MLP: 23.639706 (+/- 7.371392)
Random data CV Keras: 22.352941 (+/- 8.896451)

This gradient boosting classifier (GBM) is 100% accurate no matter what I try: reducing the number of features, changing the parameters in the grid search (I did try passing in multiple values per parameter, but that can run for hours with no result, so I have set it aside for now), and even switching to binary-class data.

The random forest (RFC) is also high at 62% on the random data; am I doing something wrong?

The data I am using is mostly binary features. As an example, it looks like this (the Category column is what I am predicting):

gene   Tissue   Druggable   Eigenvalue   CADDvalue   Catalogpresence   Category
ACE       1          1           1            0              1         Certain
ABO       1          0           0            0              0         Likely
TP53      1          1           0            0              0         Possible

Any guidance would be much appreciated.


Answer:

In general, there are several parameters you can tune to reduce overfitting. The easiest to understand are min_samples_split and min_samples_leaf: setting these to higher values does not allow the model to memorise how to correctly identify a single sample or a very small group of samples. For a large dataset (~1 million rows), I would set those values to around 50 if not higher. You can grid-search over them to find values that suit your particular data, as in the sketch below.
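As a minimal sketch of such a search, assuming the X_res and y_res built in the question's code (the candidate values and the fixed n_estimators/learning_rate settings here are illustrative assumptions, not tuned recommendations):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Larger split/leaf thresholds stop the model from carving out rules for
# single samples or tiny groups of samples. Candidate values are placeholders.
gbm = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1)
reg_param = {
    "min_samples_split": [2, 10, 50, 100],
    "min_samples_leaf": [1, 5, 25, 50],
}
search = GridSearchCV(gbm, reg_param,
                      cv=KFold(n_splits=10, shuffle=True, random_state=7),
                      scoring='accuracy', n_jobs=1)
search.fit(X_res, y_res)  # X_res, y_res as defined in the question
print(search.best_params_, search.best_score_)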

You can also use subsample to reduce overfitting, as well as max_features. These parameters essentially keep your model from looking at some of the data, which prevents it from memorising it.
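For example, a minimal sketch of a GBM configured this way, again assuming the question's X_res and y_res (the specific fractions are illustrative starting points, not tuned values):

from sklearn.ensemble import GradientBoostingClassifier

# subsample < 1.0 fits each tree on a random fraction of the rows
# (stochastic gradient boosting); a fractional max_features considers only
# a random subset of the columns at each split. All values here are
# assumptions for illustration.
gbm_reg = GradientBoostingClassifier(
    n_estimators=50,
    learning_rate=0.1,
    subsample=0.8,         # each tree sees 80% of the samples
    max_features=0.5,      # each split considers 50% of the features
    min_samples_split=50,  # combined with the advice above
    min_samples_leaf=25,
)
gbm_reg.fit(X_res, y_res)

Subsampling rows in this way is Friedman's stochastic gradient boosting, which is often an effective regulariser on its own.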

