We ran into the following problem while training a deep learning model to predict a loan scoring (classified as 0, 1 or 2).
These are the steps:
Step 1: create the new column "Scoring" (the output)
import numpy as np

# Bucket the raw Credit Score into the three classes 0, 1 and 2
conditions = [(df2['Credit Score'] >= 0)    & (df2['Credit Score'] < 1000),
              (df2['Credit Score'] >= 1000) & (df2['Credit Score'] < 6000),
              (df2['Credit Score'] >= 6000) & (df2['Credit Score'] <= 7000)]
choices = [0, 1, 2]
df2['Scoring'] = np.select(conditions, choices)
Step 2: prepare the training data
from sklearn.model_selection import train_test_split

array = df2.values

# Features: column 2 plus columns 5..14 (11 inputs in total)
X = np.vstack((array[:, 2:3].T, array[:, 5:15].T)).T
# Target: the Scoring column, one-hot encoded into T
Y = array[:, 15:]
N = Y.shape[0]
T = np.zeros((N, np.max(Y) + 1))
for i in range(N):
    T[i, Y[i]] = 1

x_train, x_test, y_train, y_test = train_test_split(X, T, test_size=0.2, random_state=42)
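As an aside, the manual loop above does the same job as Keras' to_categorical helper; a minimal sketch (the labels here are made up purely for illustration, assuming the Scoring values are the integers 0, 1 and 2):

import numpy as np
from keras.utils import to_categorical

# Hypothetical integer labels standing in for the Scoring column
Y = np.array([0, 2, 1, 0, 1])
T = to_categorical(Y, num_classes=3)   # same one-hot matrix the loop builds
print(T)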
Step 3: topology
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import SGD, Adamax

model = Sequential()
model.add(Dense(80, input_shape=(11,), activation='tanh'))
model.add(Dropout(0.2))
model.add(Dense(80, activation='tanh'))
model.add(Dropout(0.1))
model.add(Dense(40, activation='relu'))
model.add(Dense(3, activation='softmax'))

epochs = 200
learning_rate = 0.00001
decay_rate = learning_rate / epochs
momentum = 0.002
sgd = SGD(lr=learning_rate, momentum=momentum, decay=decay_rate, nesterov=False)
ad = Adamax(lr=learning_rate)
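The compile step is not shown in the question, but it has to run before fit; given the softmax output and one-hot targets, it was presumably something along these lines (a sketch, not the asker's actual call):

# Assumed compile call (not shown in the question). Softmax output + one-hot
# targets point to categorical cross-entropy; either of the optimizers defined
# above could have been used.
model.compile(optimizer=ad, loss='categorical_crossentropy', metrics=['accuracy'])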
Step 4: training
epochs = 200
batch_size = 16
history = model.fit(x_train, y_train,
                    validation_data=(x_test, y_test),   # overrides validation_split below
                    epochs=epochs,                      # nb_epoch is the old, deprecated name
                    batch_size=batch_size,
                    validation_split=0.1)
print('fit done!')
Metrics
365/365 [==============================] - 0s 60us/sample - loss: 0.0963 - acc: 0.9808
Test set loss: 0.096  accuracy: 0.981
Step 5: prediction
text1 = [1358, 1555, 1, 3, 1741, 8, 0, 1596, 1518, 0, 0]   # scoring 0
text2 = [1454, 1601, 3, 11, 1763, 10, 0, 685, 1044, 0, 0]  # scoring 1
text3 = [1209, 1437, 3, 11, 199, 18, 1, 761, 1333, 1, 0]   # scoring 2

tmp = np.vstack(text1).T
textA = tmp.reshape(1, -1)
tmp = np.vstack(text2).T
textB = tmp.reshape(1, -1)
tmp = np.vstack(text3).T
print(tmp)
textC = tmp.reshape(1, -1)

p = model.predict(textA)
t = p[0]
print(textA, np.argmax(t))
p = model.predict(textB)
t = p[0]
print(textB, np.argmax(t))
p = model.predict(textC)
t = p[0]
print(textC, np.argmax(t))
Problem: the prediction is always the same!!!
[9.9205679e-01 3.8634153e-04 7.5568780e-03] [[1358 1555 1 3 1741 8 0 1596 1518 0 0]] 0 — scoring 0
[0.9862417 0.00205712 0.01170125] [[1454 1601 3 11 1763 10 0 685 1044 0 0]] 0 — scoring 0
[9.9251783e-01 2.5733517e-04 7.2247880e-03] [[1209 1437 3 11 199 18 1 761 1333 1 0]] 0 — scoring 0
What could be the reason for this behavior?
Thanks in advance!
Answer:
Your dataset is heavily imbalanced. A good way to think about it: if always predicting 0 already gets you 98% accuracy, then claiming that something belongs to any other class is very risky (or it has to be really obvious). Any pattern the network finds that separates the minority classes from the majority class (0) has to be very distinctive, because even with a small overlap, the cost of not predicting 0 is too high.
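To confirm the imbalance, it is enough to count how many rows end up in each Scoring class (a quick check using the df2 from the question; nothing here is new data):

# Class distribution of the target. With a heavy imbalance, the majority share
# is roughly the accuracy you get by always predicting that class.
counts = df2['Scoring'].value_counts()
print(counts)
print(counts / counts.sum())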
Consider this example: you have a dataset with two classes, A and B, both following normal distributions. Class A has mean 1 and standard deviation 1; class B has mean 3 and standard deviation 0.1. You have 1,000,000 samples of class A and 20,000 samples of class B, so always predicting A gives you about 98% accuracy. Within a 99% confidence interval, all class B samples will lie between 2.743 and 3.257. In that same range, class A is expected to have about 29,300 samples, so predicting any observation there as class B would get roughly 29,300 class A samples wrong, while predicting everything as A only costs errors on the 20,000 class B samples.
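Those numbers can be reproduced directly from the normal CDF (a quick check with scipy.stats; the exact counts depend on how the 99% interval is rounded):

from scipy.stats import norm

# 99% interval of class B ~ N(3, 0.1)
lo, hi = norm.interval(0.99, loc=3, scale=0.1)
print(lo, hi)                       # roughly 2.742 .. 3.258

# Expected number of class A ~ N(1, 1) samples that fall inside that interval
p_a = norm.cdf(hi, loc=1, scale=1) - norm.cdf(lo, loc=1, scale=1)
print(p_a * 1_000_000)              # on the order of 29,000 samples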
Here is a graphical illustration of that example:
import numpy as np
import matplotlib.pyplot as plt

# Draw A and B
A = np.random.normal(1, 1, 1000000)
B = np.random.normal(3, 0.1, 20000)

# Count the number of observations in A below each value of B
B.sort()
a = A[np.logical_and(A >= B.min(), A <= B.max())]
a = [(a < i).sum() for i in B]

# Plot results
plt.plot(B, np.arange(B.shape[0]), label='Class B')
plt.plot(B, a, label='Class A')
plt.ylabel('Number of samples')
plt.xlabel('Value')
plt.legend()
plt.show()
For techniques to balance the dataset, take a look at this article: https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html
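One remedy from that family of techniques is to weight the loss by class frequency, so that errors on the minority classes cost more; a minimal sketch with Keras' class_weight argument (assuming the x_train/y_train, model, epochs and batch_size from the question are in scope):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Recover integer labels from the one-hot targets and compute balanced weights
y_labels = np.argmax(y_train, axis=1)
weights = compute_class_weight(class_weight='balanced',
                               classes=np.unique(y_labels),
                               y=y_labels)
class_weights = dict(enumerate(weights))

# Same fit call as before, now with per-class weights applied to the loss
history = model.fit(x_train, y_train,
                    validation_data=(x_test, y_test),
                    epochs=epochs, batch_size=batch_size,
                    class_weight=class_weights)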