I'm building a logistic regression model to predict whether a transaction is valid (1) or invalid (0), with a dataset of only 150 observations. The data is split between the two classes as follows:
- 106 observations are 0 (invalid)
- 44 observations are 1 (valid)
I used two predictors (both numeric). Even though the data is mostly 0s, my classifier predicts 1 for every transaction in the test set, although most of them should be 0. The classifier never outputs 0 for any observation.
Here is my complete code:
# Logistic Regression
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import scipy
from scipy.stats import spearmanr
from pylab import rcParams
import seaborn as sb
import matplotlib.pyplot as plt
import sklearn
from sklearn.preprocessing import scale
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import preprocessing

address = "dummy_csv-150.csv"
trades = pd.read_csv(address)
trades.columns = ['location','app','el','rp','rule1','rule2','rule3','validity','transactions']
trades.head()

trade_data = trades.ix[:,(1,8)].values
trade_data_names = ['app','transactions']

# set dependent/response variable
y = trades.ix[:,7].values

# center around the data mean
X = scale(trade_data)

LogReg = LogisticRegression()
LogReg.fit(X,y)
print(LogReg.score(X,y))
y_pred = LogReg.predict(X)

from sklearn.metrics import classification_report
print(classification_report(y,y_pred))

log_prediction = LogReg.predict_log_proba([[2, 14], [3, 1], [1, 503], [1, 122], [1, 101], [1, 610],
                                           [1, 2120], [3, 85], [3, 91], [2, 167], [2, 553], [2, 144]])
prediction = LogReg.predict([[2, 14], [3, 1], [1, 503], [1, 122], [1, 101], [1, 610],
                             [1, 2120], [3, 85], [3, 91], [2, 167], [2, 553], [2, 144]])
My model is defined as:
LogReg = LogisticRegression()
LogReg.fit(X,y)
where X looks like this:
X = array([[1, 345], [1, 222], [1, 500], [2, 120]]....)
and Y is just a 0 or 1 for each observation.
The standardized X passed to the model looks like this:
[[-1.67177659 0.14396503] [-1.67177659 -0.14538932] [-1.67177659 0.50859856] [-1.67177659 -0.3853417 ] [-1.67177659 -0.43239119] [-1.67177659 0.743846 ] [-1.67177659 4.32195953] [ 0.95657805 -0.46062089] [ 0.95657805 -0.45591594] [ 0.95657805 -0.37828428] [ 0.95657805 -0.52884264] [ 0.95657805 -0.20420118] [ 0.95657805 -0.63705646] [ 0.95657805 -0.65587626] [ 0.95657805 -0.66763863] [-0.35759927 -0.25125067] [-0.35759927 0.60975496] [-0.35759927 -0.33358727] [-0.35759927 -0.20420118] [-0.35759927 1.37195666] [-0.35759927 0.27805607] [-0.35759927 0.09456307] [-0.35759927 0.03810368] [-0.35759927 -0.41121892] [-0.35759927 -0.64411389] [-0.35759927 -0.69586832] [ 0.95657805 -0.57353966] [ 0.95657805 -0.57353966] [ 0.95657805 -0.53825254] [ 0.95657805 -0.53354759] [ 0.95657805 -0.52413769] [ 0.95657805 -0.57589213] [ 0.95657805 0.03810368] [ 0.95657805 -0.66293368] [ 0.95657805 2.86107294] [-1.67177659 0.14396503] [-1.67177659 -0.14538932] [-1.67177659 0.50859856] [-1.67177659 -0.3853417 ] [-1.67177659 -0.43239119] [-1.67177659 0.743846 ] [-1.67177659 4.32195953] [ 0.95657805 -0.46062089] [ 0.95657805 -0.45591594] [ 0.95657805 -0.37828428] [ 0.95657805 -0.52884264] [ 0.95657805 -0.20420118] [ 0.95657805 -0.63705646] [ 0.95657805 -0.65587626] [ 0.95657805 -0.66763863] [-0.35759927 -0.25125067] [-0.35759927 0.60975496] [-0.35759927 -0.33358727] [-0.35759927 -0.20420118] [-0.35759927 1.37195666] [-0.35759927 0.27805607] [-0.35759927 0.09456307] [-0.35759927 0.03810368] [-0.35759927 -0.41121892] [-0.35759927 -0.64411389] [-0.35759927 -0.69586832] [ 0.95657805 -0.57353966] [ 0.95657805 -0.57353966] [ 0.95657805 -0.53825254] [ 0.95657805 -0.53354759] [ 0.95657805 -0.52413769] [ 0.95657805 -0.57589213] [ 0.95657805 0.03810368] [ 0.95657805 -0.66293368] [ 0.95657805 2.86107294] [-1.67177659 0.14396503] [-1.67177659 -0.14538932] [-1.67177659 0.50859856] [-1.67177659 -0.3853417 ] [-1.67177659 -0.43239119] [-1.67177659 0.743846 ] [-1.67177659 4.32195953] [ 0.95657805 -0.46062089] 
[ 0.95657805 -0.45591594] [ 0.95657805 -0.37828428] [ 0.95657805 -0.52884264] [ 0.95657805 -0.20420118] [ 0.95657805 -0.63705646] [ 0.95657805 -0.65587626] [ 0.95657805 -0.66763863] [-0.35759927 -0.25125067] [-0.35759927 0.60975496] [-0.35759927 -0.33358727] [-0.35759927 -0.20420118] [-0.35759927 1.37195666] [-0.35759927 0.27805607] [-0.35759927 0.09456307] [-0.35759927 0.03810368] [-0.35759927 -0.41121892] [-0.35759927 -0.64411389] [-0.35759927 -0.69586832] [ 0.95657805 -0.57353966] [ 0.95657805 -0.57353966] [ 0.95657805 -0.53825254] [ 0.95657805 -0.53354759] [ 0.95657805 -0.52413769] [ 0.95657805 -0.57589213] [ 0.95657805 0.03810368] [ 0.95657805 -0.66293368] [ 0.95657805 2.86107294] [-1.67177659 0.14396503] [-1.67177659 -0.14538932] [-1.67177659 0.50859856] [-1.67177659 -0.3853417 ] [-1.67177659 -0.43239119] [-1.67177659 0.743846 ] [-1.67177659 4.32195953] [ 0.95657805 -0.46062089] [ 0.95657805 -0.45591594] [ 0.95657805 -0.37828428] [ 0.95657805 -0.52884264] [ 0.95657805 -0.20420118] [ 0.95657805 -0.63705646] [ 0.95657805 -0.65587626] [ 0.95657805 -0.66763863] [-0.35759927 -0.25125067] [-0.35759927 0.60975496] [-0.35759927 -0.33358727] [-0.35759927 -0.20420118] [-0.35759927 1.37195666] [-0.35759927 0.27805607] [-0.35759927 0.09456307] [-0.35759927 0.03810368] [-0.35759927 -0.41121892] [-0.35759927 -0.64411389] [-0.35759927 -0.69586832] [ 0.95657805 -0.57353966] [ 0.95657805 -0.57353966] [ 0.95657805 -0.53825254] [ 0.95657805 -0.53354759] [ 0.95657805 -0.52413769] [ 0.95657805 -0.57589213] [ 0.95657805 0.03810368] [ 0.95657805 -0.66293368] [ 0.95657805 2.86107294] [-0.35759927 0.60975496] [-0.35759927 -0.33358727] [-0.35759927 -0.20420118] [-0.35759927 1.37195666] [-0.35759927 0.27805607] [-0.35759927 0.09456307] [-0.35759927 0.03810368]]
and Y looks like this:
[0 0 0 0 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0]
The model's metrics are:
             precision    recall  f1-score   support

          0       0.78      1.00      0.88        98
          1       1.00      0.43      0.60        49

avg / total       0.85      0.81      0.78       147
and the score is 0.80.
When I run model.predict_log_proba(test_data), I get log-probabilities that look like this:
array([[ -1.10164032e+01,  -1.64301095e-05],
       [ -2.06326947e+00,  -1.35863187e-01],
       [            -inf,   0.00000000e+00],
       [            -inf,   0.00000000e+00],
       [            -inf,   0.00000000e+00],
       [            -inf,   0.00000000e+00],
       [            -inf,   0.00000000e+00],
       [            -inf,   0.00000000e+00],
       [            -inf,   0.00000000e+00],
       [            -inf,   0.00000000e+00],
       [            -inf,   0.00000000e+00],
       [            -inf,   0.00000000e+00]])
My test set is below. All but 2 of its rows should be 0, yet every row gets classified as 1. This happens with every test set, even with values the model was trained on.
[2, 14],[3,1], [1, 503],[1, 122],[1, 101],[1, 610],[1, 2120],[3, 85],[3, 91],[2, 167],[2, 553],[2, 144]
I found a similar question here: https://stats.stackexchange.com/questions/168929/logistic-regression-is-predicting-all-1-and-no-0, but in that question the data was mostly 1s, so the model outputting 1s made sense. My situation is the opposite: the training data is mostly 0s, yet somehow my model always outputs 1 for everything, even though 1s are relatively rare. I also tried a random forest classifier to rule out a model-specific issue, but the result was the same. It may be something wrong with my data, but I can't tell what, since it meets all the assumptions.
What could the problem be? The data satisfies all the assumptions of a logistic model (the two predictors are independent, the output is binary, and there are no missing data points).
Answer:
You are not scaling your test data. You scaled your training data correctly, like this:
X = scale(trade_data)
but after training the model, you never did the same to your test data:
log_prediction = LogReg.predict_log_proba([[2, 14], [3, 1], [1, 503], [1, 122],
                                           [1, 101], [1, 610], [1, 2120], [3, 85],
                                           [3, 91], [2, 167], [2, 553], [2, 144]])
Your model's coefficients were learned on standardized inputs, but your test data was never standardized. Feeding in raw values (transaction counts in the hundreds or thousands) makes any positive coefficient blow up the linear term, which is very likely why every prediction comes out as 1.
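To see the effect numerically, compare what a model with coefficients learned on standardized inputs produces for a scaled versus an unscaled version of the same row (the coefficients below are hypothetical, purely for illustration). The raw row drives the sigmoid to exactly 1.0 in float64, which is also why the `predict_log_proba` output above contains `-inf` for log P(class 0):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients, as if learned on standardized inputs
coef = np.array([0.5, 1.2])
intercept = -0.8

raw = np.array([1.0, 503.0])      # raw test row: (app, transactions)
scaled = np.array([-1.67, 0.51])  # roughly what scale() maps it to

p_raw = sigmoid(coef @ raw + intercept)       # linear term ~ 603 -> p underflows to 1.0
p_scaled = sigmoid(coef @ scaled + intercept) # linear term ~ -1.0 -> p ~ 0.26

with np.errstate(divide="ignore"):
    log_p0 = np.log(1.0 - p_raw)  # log P(class 0) = log(0) = -inf

print(p_raw, p_scaled, log_p0)
```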
A general rule is that any transformation you apply to your training set must also be applied to your test set, using the same fitted parameters. Instead of doing this:
X = scale(trade_data)
you should fit a scaler on your training data, like this:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(trade_data)
X = scaler.transform(trade_data)
and then later apply that same scaler to your test data:
scaled_test = scaler.transform(test_x)
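Putting it together, a minimal sketch of the corrected pipeline (the synthetic `trade_data` and `y` below are stand-ins for the question's CSV, just to make the snippet runnable):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the question's (app, transactions) columns and labels
rng = np.random.default_rng(0)
n = 150
trade_data = np.column_stack([rng.integers(1, 4, n),
                              rng.integers(1, 2200, n)]).astype(float)
y = (rng.random(n) < 0.3).astype(int)  # roughly 44/150 positives

# Fit the scaler on the training data only...
scaler = StandardScaler().fit(trade_data)
X = scaler.transform(trade_data)

LogReg = LogisticRegression()
LogReg.fit(X, y)

# ...then reuse the SAME fitted scaler on the test rows before predicting
test_x = np.array([[2, 14], [3, 1], [1, 503], [1, 122], [2, 553], [2, 144]])
scaled_test = scaler.transform(test_x)
prediction = LogReg.predict(scaled_test)
print(prediction)
```

With the test rows on the same footing as the training data, the model can actually produce 0s instead of saturating at 1 for every row.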