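When I change the order of the columns (the feature order) in a regularized scikit-learn linear model, I get different scores. I have tested this with ElasticNet and Lasso. I am using scikit-learn==0.23.1.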

    import pandas as pd
    import numpy as np
    from sklearn import linear_model
    from sklearn import metrics

    df = pd.DataFrame({
        'col1': [1, 2, 3, 4, 5, 6],
        'col2': [16, 32, 64, 12, 5, 256],
        'col3': [7, 8, 9, 10, 12, 11],
        'out': [40, 5, 60, 7, 9, 100]
    })
    print(df)

    X_df = df[['col1', 'col2', 'col3']]
    y_df = df['out']

    regr = linear_model.ElasticNet(alpha=0.1, random_state=0)
    regr.fit(X_df, y_df)
    y_pred = regr.predict(X_df)
    print("R2:", regr.score(X_df, y_df))
    print("MSE:", metrics.mean_squared_error(y_df, y_pred))

    # change the order to: [col2, col1, col3]
    first_cols = ['col2']
    cols = first_cols.copy()
    for c in X_df.columns:
        if c not in cols:
            cols.append(c)
    X_df = X_df[cols]

    regr.fit(X_df, y_df)
    y_pred = regr.predict(X_df)
    print("\nReorder:")
    print("R2:", regr.score(X_df, y_df))
    print("MSE:", metrics.mean_squared_error(y_df, y_pred))

The output of the code above is:

       col1  col2  col3  out
    0     1    16     7   40
    1     2    32     8    5
    2     3    64     9   60
    3     4    12    10    7
    4     5     5    12    9
    5     6   256    11  100
    R2: 0.8277462579081043
    MSE: 207.13034003933535

    Reorder:
    R2: 0.8277586094134455
    MSE: 207.11548769725997

Why does this happen?

Answer:
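This comes down to the tol parameter. ElasticNet and Lasso are fitted by coordinate descent, which stops once the tolerance is met rather than at the exact optimum. The column order changes the order in which the coefficients are updated, so the two fits stop at slightly different approximations of the same solution.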
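From the documentation:

tol : float, default=1e-4
The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.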
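To illustrate how the tolerance drives the discrepancy, here is a minimal sketch that refits the same data in both column orders for a few tol values and prints the absolute difference in R2. The helper name r2_gap, the use of plain NumPy arrays instead of a DataFrame, and max_iter=100_000 (raised so that tight tolerances have room to converge) are assumptions made for this illustration, not part of the original question.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Same data as in the question, as plain arrays (columns: col1, col2, col3).
X = np.array([[1, 16, 7], [2, 32, 8], [3, 64, 9],
              [4, 12, 10], [5, 5, 12], [6, 256, 11]], dtype=float)
y = np.array([40, 5, 60, 7, 9, 100], dtype=float)
perm = [1, 0, 2]  # reordered columns: col2, col1, col3


def r2_gap(tol):
    """Absolute R2 difference between the two column orders (hypothetical helper)."""
    original = ElasticNet(alpha=0.1, tol=tol, max_iter=100_000).fit(X, y)
    reordered = ElasticNet(alpha=0.1, tol=tol, max_iter=100_000).fit(X[:, perm], y)
    return abs(original.score(X, y) - reordered.score(X[:, perm], y))


gaps = {tol: r2_gap(tol) for tol in (1e-4, 1e-8, 1e-12)}
for tol, gap in gaps.items():
    print(f"tol={tol:g}  |R2 difference|={gap:.3e}")
```

The exact numbers may vary between scikit-learn versions, but the difference should shrink towards floating-point noise as tol gets smaller.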
Just pass tol=1e-12 in both cases and you get the level of precision you are after.

    import pandas as pd
    import numpy as np
    from sklearn import linear_model
    from sklearn import metrics

    df = pd.DataFrame({
        'col1': [1, 2, 3, 4, 5, 6],
        'col2': [16, 32, 64, 12, 5, 256],
        'col3': [7, 8, 9, 10, 12, 11],
        'out': [40, 5, 60, 7, 9, 100]
    })
    # print(df)

    X_df = df[['col1', 'col2', 'col3']]
    y_df = df['out']

    # a tight tolerance makes the solver run until it is very close to the optimum
    regr = linear_model.ElasticNet(alpha=0.1, random_state=0, tol=1e-12)
    regr.fit(X_df, y_df)
    y_pred = regr.predict(X_df)
    print(regr.coef_)
    print("R2:", regr.score(X_df, y_df))
    print("MSE:", metrics.mean_squared_error(y_df, y_pred))

    # change the order to: [col2, col1, col3]
    first_cols = ['col2']
    cols = first_cols.copy()
    for c in X_df.columns:
        if c not in cols:
            cols.append(c)
    X_df = X_df[cols]

    regr = linear_model.ElasticNet(alpha=0.1, random_state=0, tol=1e-12)
    regr.fit(X_df, y_df)
    y_pred = regr.predict(X_df)
    print("\nReorder:")
    print(regr.coef_)
    print("R2:", regr.score(X_df, y_df))
    print("MSE:", metrics.mean_squared_error(y_df, y_pred))


    [-8.92519779  0.42980208  3.59812779]
    R2: 0.8277593357239204
    MSE: 207.11461432908925

    Reorder:
    [ 0.42980208 -8.92519779  3.59812779]
    R2: 0.8277593357240851
    MSE: 207.11461432889112

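The printed coefficients are the same values in a different order. If you would rather check that programmatically than by eye, one possible sketch is to map the reordered coefficients back to the original column order and compare with np.allclose. The variable names, the atol=1e-6 threshold, and max_iter=100_000 are choices made for this illustration only.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Same data as above (columns: col1, col2, col3).
X = np.array([[1, 16, 7], [2, 32, 8], [3, 64, 9],
              [4, 12, 10], [5, 5, 12], [6, 256, 11]], dtype=float)
y = np.array([40, 5, 60, 7, 9, 100], dtype=float)
perm = [1, 0, 2]  # reordered columns: col2, col1, col3

coef_original = ElasticNet(alpha=0.1, tol=1e-12, max_iter=100_000).fit(X, y).coef_
coef_reordered = ElasticNet(alpha=0.1, tol=1e-12, max_iter=100_000).fit(X[:, perm], y).coef_

# coef_reordered[j] belongs to original column perm[j]; scatter it back.
coef_restored = np.empty_like(coef_reordered)
coef_restored[perm] = coef_reordered

same_solution = np.allclose(coef_original, coef_restored, atol=1e-6)
print("Same coefficients up to column order:", same_solution)
```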