Why can't I use my feature matrix directly for prediction?

[Solved] Below is how I processed new data and tried to predict on it with my trained model, which failed.

First, the imports:

    import pandas as pd
    from sklearn import preprocessing
    import sklearn.model_selection as ms
    from sklearn import linear_model
    import sklearn.metrics as sklm
    import numpy as np
    import numpy.random as nr
    import matplotlib.pyplot as plt
    import seaborn as sns
    import scipy.stats as ss
    import math
    %matplotlib inline

Loading and preparing the data:

    ##test
    ##prepare test_data
    x_test_data = pd.read_csv('AW_test.csv')
    x_test_data.loc[:,x_test_data.dtypes==object].isnull().sum()
    ##dropnan
    cols_of_interest = ['Title','MiddleName','Suffix','AddressLine2']
    x_test_data.drop(cols_of_interest,axis=1,inplace=True)
    ##dropduplicate
    x_test_data.drop_duplicates(subset = 'CustomerID', keep = 'first', inplace=True)
    print(x_test_data.shape)

Then I converted my categorical features into a one-hot encoded matrix:

    ##change categorical variables to numeric variables
    def encode_string(cat_features):
        enc = preprocessing.LabelEncoder()
        enc.fit(cat_features)
        enc_cat_features = enc.transform(cat_features)
        ohe = preprocessing.OneHotEncoder()
        encoded = ohe.fit(enc_cat_features.reshape(-1,1))
        return encoded.transform(enc_cat_features.reshape(-1,1)).toarray()

    categorical_columns = ['CountryRegionName','Education','Occupation','Gender','MaritalStatus']
    Features = encode_string(x_test_data['CountryRegionName'])
    for col in categorical_columns:
        temp = encode_string(x_test_data[col])
        Features = np.concatenate([Features, temp],axis=1)
    print(Features)
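One thing that may be worth checking at this step: `encode_string` fits a fresh `LabelEncoder` and `OneHotEncoder` on whatever data it receives, so the column layout depends on which categories happen to appear in that data. The sketch below uses made-up country values (not taken from the CSV files in the post) to show how re-fitting on new data can change the width, and how fitting once and reusing the encoder keeps the layout fixed. Whether this applies to `AW_test.csv` is an assumption to verify, not a diagnosis.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy columns; the real values would come from the CSV files.
train_col = np.array(["US", "France", "Canada", "US"]).reshape(-1, 1)
new_col = np.array(["US", "Canada"]).reshape(-1, 1)  # "France" is absent here

# Re-fitting on the new data alone yields fewer columns than training saw.
refit_width = OneHotEncoder().fit(new_col).transform(new_col).toarray().shape[1]

# Fitting once on the training data and reusing the encoder keeps the layout.
shared = OneHotEncoder(handle_unknown="ignore").fit(train_col)
train_encoded = shared.transform(train_col).toarray()
new_encoded = shared.transform(new_col).toarray()

print(refit_width, train_encoded.shape[1], new_encoded.shape[1])
```

If this turns out to matter, the fitted encoder could be pickled next to the scaler and the model so that the prediction script reuses it instead of fitting a new one.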

Next, I appended the remaining numeric features to the matrix:

    ##add numeric variables
    Features = np.concatenate([Features, np.array(x_test_data[['HomeOwnerFlag','NumberCarsOwned','TotalChildren','YearlyIncome']])], axis=1)

Then I scaled the feature matrix:

    ##scale numeric variables
    with open('./lin_reg_scaler.pickle', 'rb') as file:
        scaler = pickle.load(file)
    Features[:,-5:] = scaler.transform(Features[:,-5:])
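A detail worth noting in the scaling step: four numeric columns were appended, but the slice `[:,-5:]` selects five columns, so it also covers the last one-hot column. The training code uses the same slice, so the pickled scaler expects five columns and changing the slice in only one place would break that match. The toy matrix below (hypothetical values and names) only illustrates what the slice selects.

```python
import numpy as np

# Hypothetical toy matrix: 3 one-hot columns followed by 4 numeric columns.
one_hot = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
numeric = np.array([[1.0, 2.0, 0.0, 50000.0], [0.0, 1.0, 3.0, 72000.0]])
features = np.concatenate([one_hot, numeric], axis=1)

n_numeric = numeric.shape[1]
last_five = features[:, -5:]              # the slice width used in the post
numeric_only = features[:, -n_numeric:]   # a slice sized from the numeric block

print(last_five.shape, numeric_only.shape)
# The first column of the 5-wide slice is the last one-hot column.
print(np.array_equal(last_five[:, 0], one_hot[:, -1]))
```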

I loaded the linear regression model that I trained in another file (I can post it if needed):

    # Loading the saved linear regression model pickle
    import pickle
    loaded_model = pickle.load(open('./lin_reg_mod.pickle', 'rb'))

I fed my feature matrix into it directly:

    #predict
    loaded_model.predict(Features)
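To see which columns drive a prediction, one option is to split each prediction into per-column contributions. For a `LinearRegression`, a prediction is the dot product of a row with `coef_` plus `intercept_`, so multiplying the matrix by `coef_` elementwise gives each column's share. The model and data below are toy stand-ins (assumptions) for `loaded_model` and `Features`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy model standing in for the pickled one.
x = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 0.5],
              [0.0, 1.0, 3.0]])
y = np.array([5.0, 4.0, 2.0, 8.0])
model = LinearRegression(fit_intercept=False).fit(x, y)

# Each entry is one column's share of one row's prediction.
contributions = x * model.coef_
row_totals = contributions.sum(axis=1) + model.intercept_

# Largest absolute contribution per column, to spot columns that dominate.
largest_per_column = np.abs(contributions).max(axis=0)
print(largest_per_column)
print(np.allclose(row_totals, model.predict(x)))
```

Running the same two lines on the real `Features` and `loaded_model.coef_` would show whether a handful of columns account for almost all of each prediction.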

However, this is the result I got:

    array([-5.71697209e+12, -4.64634881e+12, -4.64634881e+12, -4.64634881e+12,
           -4.64634881e+12, -4.64634881e+12, -5.71697209e+12, -4.64634881e+12,
           -5.71697209e+12, -4.64634881e+12, -5.71697209e+12, -4.64634881e+12,
           -4.64634881e+12, -4.64634881e+12, -5.71697209e+12, -4.64634881e+12,
           -4.64634881e+12, -5.71697209e+12, -5.71697209e+12, -5.71697209e+12,
           -4.64634881e+12, -4.64634881e+12, -4.64634881e+12, -4.64634881e+12,
           -4.64634881e+12, -5.71697209e+12, -4.64634881e+12, -5.71697209e+12,
           -5.71697209e+12, -4.64634881e+12, -5.71697209e+12, -5.71697209e+12,
           -4.64634881e+12, -5.71697209e+12, -4.64634881e+12, -5.71697209e+12,
           -4.64634881e+12, -4.64634881e+12, -4.64634881e+12, -4.64634881e+12,
           -5.71697209e+12, -5.71697209e+12, -4.64634881e+12, -4.64634881e+12,
           -4.64634881e+12, -4.64634881e+12, -4.64634881e+12, -5.71697209e+12,
           -4.64634881e+12, -4.64634881e+12, -4.64634881e+12, -4.64634881e+12,
           -4.64634881e+12, -4.64634881e+12, -4.64634881e+12, -4.64634881e+12,
           -4.64634881e+12, -5.71697209e+12, -4.64634881e+12, -5.71697209e+12,
           -4.64634881e+12, -4.64634881e+12, -4.64634881e+12, -5.71697209e+12,
           -5.71697209e+12, -5.71697209e+12, -5.71697209e+12, -4.64634881e+12,............

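In my other file, I had already trained my model successfully and tested it on the test data.

This is the result I got in that file when I fed x_test into my model (the result I want):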

[83.75482221 66.31820493 47.22211384 ... 69.65032224 88.45908874  58.45193545]

I don't know what is going on. Can anyone help me?

[Update] Here is the code I used to train my model:

    custs = pd.read_csv('combined_custs.csv')
    custs.dtypes
    ##avemonthspend data
    ams = pd.read_csv('AW_AveMonthSpend.csv')
    ams.drop_duplicates(subset='CustomerID', keep='first', inplace=True)
    ##merge
    combined_custs=custs.merge(ams)
    combined_custs.to_csv('./ams_combined_custs.csv')
    combined_custs.head(20)

    ##change categorical variables to numeric variables
    def encode_string(cat_features):
        enc = preprocessing.LabelEncoder()
        enc.fit(cat_features)
        enc_cat_features = enc.transform(cat_features)
        ohe = preprocessing.OneHotEncoder()
        encoded = ohe.fit(enc_cat_features.reshape(-1,1))
        return encoded.transform(enc_cat_features.reshape(-1,1)).toarray()

    categorical_columns = ['CountryRegionName','Education','Occupation','Gender','MaritalStatus']
    Features = encode_string(combined_custs['CountryRegionName'])
    for col in categorical_columns:
        temp = encode_string(combined_custs[col])
        Features = np.concatenate([Features, temp],axis=1)
    print(Features.shape)
    print(Features[:2,:])

    ##add numeric variables
    Features = np.concatenate([Features, np.array(combined_custs[['HomeOwnerFlag','NumberCarsOwned','TotalChildren','YearlyIncome']])], axis=1)
    print(Features.shape)
    print(Features)

    ##train_test_split
    nr.seed(9988)
    labels = np.array(combined_custs['AveMonthSpend'])
    indx = range(Features.shape[0])
    indx = ms.train_test_split(indx, test_size = 300)
    x_train = Features[indx[0],:]
    y_train = np.ravel(labels[indx[0]])
    x_test = Features[indx[1],:]
    y_test = np.ravel(labels[indx[1]])
    print(x_test.shape)

    ##scale numeric variables
    scaler = preprocessing.StandardScaler().fit(x_train[:,-5:])
    x_train[:,-5:] = scaler.transform(x_train[:,-5:])
    x_test[:,-5:] = scaler.transform(x_test[:,-5:])
    x_train[:2,]
    import pickle
    file = open('./lin_reg_scaler.pickle', 'wb')
    pickle.dump(scaler, file)
    file.close()

    ##define and fit the linear regression model
    lin_mod = linear_model.LinearRegression(fit_intercept=False)
    lin_mod.fit(x_train,y_train)
    print(lin_mod.intercept_)
    print(lin_mod.coef_)
    import pickle
    file = open('./lin_reg_mod.pickle', 'wb')
    pickle.dump(lin_mod, file)
    file.close()
    lin_mod.predict(x_test)

The predictions from my trained model are:

    array([ 78.20673535,  91.11860042,  75.27284767,  63.69507673,
           102.10758616,  74.64252358,  92.84218321,  77.9675721 ,
           102.18989779,  96.98098962,  87.61415378,  39.37006326,
            85.81839618,  78.41392293,  45.49439829,  48.0944897 ,
            36.06024114,  70.03880373, 128.90267485,  54.63235443,
            52.20289729,  82.61123334,  41.58779815,  57.6456416 ,
            46.64014991,  78.38639454,  77.61072157,  94.5899366 ,.....

Answer:
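A possible starting point for debugging, offered as a sketch rather than a confirmed answer: the posted code seeds `Features` with the encoding of `CountryRegionName` and then loops over `categorical_columns`, which also begins with `CountryRegionName`, so that block appears twice. Duplicated columns make the feature matrix rank-deficient, which means the least-squares coefficients are not uniquely determined. The toy one-hot blocks below are hypothetical and only show how to check for this; the post does not establish whether it explains the large predictions.

```python
import numpy as np

# Hypothetical toy one-hot blocks standing in for two categorical columns.
country = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)
gender = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
blocks = {"CountryRegionName": country, "Gender": gender}
categorical_columns = ["CountryRegionName", "Gender"]

# Same pattern as the post: seed with the first column, then loop over all.
features = blocks["CountryRegionName"]
for col in categorical_columns:
    features = np.concatenate([features, blocks[col]], axis=1)

n_cols = features.shape[1]
rank = np.linalg.matrix_rank(features)
print(n_cols, rank)  # rank below n_cols signals redundant columns
```

Seeding with an empty array of shape `(n_rows, 0)` instead of the first column's encoding would avoid the duplicate, though a model trained that way would need to be refit and re-pickled, together with its scaler.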
