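I am developing a machine learning model, and my data is in a DataFrame.
I standardize the data with StandardScaler (zero mean, unit variance):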
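scaler = StandardScaler()
# fit_transform returns a NumPy array; assigning back into the columns keeps df a DataFrame
df[df.columns] = scaler.fit_transform(df)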
I split the dataset into features and target:
X_df = df[X_characteristics_list]
y_df = df[target]
I split the data into training and test sets, then train the model:
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.25)
forest = RandomForestRegressor()
forest.fit(X_train, y_train)
I predict on the test set to check how well the model performs:
y_test_pred = forest.predict(X_test)
mse = mean_squared_error(y_test, y_test_pred)
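But when it comes to real-life use, I need the model to be ready to make predictions.
If I want to predict a single record, for example [100, 20, 34], I can't, because I first need to standardize that record, and transforming it with StandardScaler does not work: it depends on the standard deviation, so I would need the original dataset.
What is a better way to solve this?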
Answer:
See the example below:
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.preprocessing import StandardScaler
# Create our input and output matrices
>>> X, y = make_classification()
# Split into train and test sets... the "test" set stands in for production/unseen/"real-life" data
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)
# What does X_train look like?
>>> X_train
array([[-0.08930702, -2.71113991, -0.93849926, ...,  0.21650905,
         0.68952722,  0.61365789],
       [-0.31143977, -1.87817904,  0.08287492, ..., -0.41332943,
        -0.58967179,  1.7239411 ],
       [-1.62287589,  1.10691318, -0.630556  , ..., -0.35060008,
         1.11270562,  0.08106694],
       ...,
       [-0.59797041,  0.90218081,  0.89983074, ..., -0.54374315,
         1.18534841, -0.03397969],
       [-1.2006559 ,  1.01890955, -1.21617181, ...,  1.76263322,
         1.38280423, -1.0192972 ],
       [ 0.11883425,  1.42952643, -1.23647358, ...,  1.02509208,
        -1.14308885,  0.72096531]])
# Let's scale it
>>> scaler = StandardScaler()
>>> X_train = scaler.fit_transform(X_train)
>>> X_train
array([[ 0.08867642, -1.97950269, -1.1214106 , ...,  0.22075623,
         0.57844552,  0.46487917],
       [-0.10736984, -1.34896243,  0.00808597, ..., -0.37670234,
        -0.6045418 ,  1.57819736],
       [-1.26479555,  0.91071257, -0.78086855, ..., -0.3171979 ,
         0.96979563, -0.06916763],
       ...,
       [-0.36025134,  0.7557329 ,  0.91152449, ..., -0.50041152,
         1.03697478, -0.18452874],
       [-0.89215959,  0.84409499, -1.42847749, ...,  1.68739437,
         1.21957946, -1.17253964],
       [ 0.27237431,  1.15492649, -1.4509284 , ...,  0.98777012,
        -1.116335  ,  0.57247992]])
# Train the model
>>> model = LogisticRegression()
>>> model.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
# Now use the already-fitted StandardScaler object to simply transform
# (*not fit_transform*) the test data
>>> X_test = scaler.transform(X_test)
>>> model.predict(X_test)
array([1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1,
       0, 1, 1, 0, 0, 0, 0])
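To tie this back to the single-record case in the question, here is a minimal sketch. The training data is synthetic and invented for illustration (three numeric features are assumed, to match the record [100, 20, 34]), and the RandomForestRegressor settings are arbitrary. The point it tries to show: a fitted StandardScaler keeps the training mean and standard deviation in its `mean_` and `scale_` attributes, so `transform` can be applied to one new row without the original dataset, as long as the row is passed as a 2-D array.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: 200 rows, 3 features (not from the original post)
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[100.0, 20.0, 30.0], scale=[15.0, 5.0, 8.0], size=(200, 3))
y_train = X_train @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=200)

# Fit the scaler on the training features only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_train_scaled, y_train)

# A single new record must be 2-D: one row, three columns
record = np.array([[100.0, 20.0, 34.0]])

# transform (not fit_transform) reuses the statistics learned from X_train
record_scaled = scaler.transform(record)

# Equivalent manual computation from the stored statistics
manual = (record - scaler.mean_) / scaler.scale_

prediction = forest.predict(record_scaled)
```

Calling `fit_transform` on the single record instead would re-estimate the mean and standard deviation from that one row, which would map every feature to zero and discard the training statistics.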
Note that with joblib or pickle you can save the scaler object and reload it later for "live" scaling.
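As a rough sketch of that save-and-reload step, assuming joblib is installed: the file name `scaler.joblib`, the temporary directory, and the synthetic training data below are all hypothetical choices made for illustration. In practice the fitted model would usually be saved alongside the scaler in the same way.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training features (50 rows, 3 columns), for illustration only
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))

# Training time: fit the scaler and write it to disk
scaler = StandardScaler().fit(X_train)
path = os.path.join(tempfile.mkdtemp(), "scaler.joblib")  # hypothetical location
joblib.dump(scaler, path)

# Prediction time (possibly another process): reload and transform a new record
loaded_scaler = joblib.load(path)
record = np.array([[100.0, 20.0, 34.0]])
scaled_record = loaded_scaler.transform(record)
```

The reloaded object carries the same `mean_` and `scale_` as the original, so it scales new records exactly as the training data was scaled. Files like this should be loaded with the same scikit-learn version that wrote them, and only from trusted sources, since both joblib and pickle can execute arbitrary code on load.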