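I'm trying to predict a binary response variable with a logistic regression classifier, using three binary explanatory variables related to bank history: default, housing, and loan.
I have the following dataset:
Mapping used to convert the text "no"/"yes" to the integers 0/1: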
    convert_to_binary = {'no': 0, 'yes': 1}
    default = bank['default'].map(convert_to_binary)
    housing = bank['housing'].map(convert_to_binary)
    loan = bank['loan'].map(convert_to_binary)
    response = bank['response'].map(convert_to_binary)
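As a quick illustration of what this mapping produces, here is a minimal sketch on a small made-up frame. The `bank` DataFrame below is a hypothetical stand-in (the real dataset is not shown), and `columns` and `encoded` are names introduced only for this example.

```python
import pandas as pd

# Hypothetical stand-in for the real `bank` DataFrame, which is not shown here.
bank = pd.DataFrame({
    "default": ["no", "yes", "no"],
    "housing": ["yes", "yes", "no"],
    "loan": ["no", "no", "yes"],
    "response": ["yes", "no", "no"],
})

convert_to_binary = {"no": 0, "yes": 1}
columns = ["default", "housing", "loan", "response"]

# Apply the same dict to every column; any value not in the dict would become NaN.
encoded = bank[columns].apply(lambda col: col.map(convert_to_binary))
print(encoded)
```

Checking `encoded.isna().sum()` afterwards is a cheap way to catch values other than "no"/"yes" (for example "unknown") before fitting a model.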
I put the three explanatory variables and the response variable into an array:
    data = np.array([np.array(default), np.array(housing), np.array(loan), np.array(response)]).T
    kfold = KFold(n_splits=3)
    scores = []
    for train_index, test_index in kfold.split(data):
        X_train, X_test = data[train_index], data[test_index]
        y_train, y_test = response[train_index], response[test_index]
        model = LogisticRegression().fit(X_train, y_train)
        pred = model.predict(data[test_index])
        results = model.score(X_test, y_test)
        scores.append(results)
    print(np.mean(scores))
My accuracy is always 100%, which I know isn't right. Shouldn't it be somewhere around 50-65%?
Am I doing something wrong?
Answer:
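The split is incorrect: `data` also contains the `response` column, so the model is trained on the very label it is supposed to predict. That is why the accuracy comes out as 100%.
This is the correct way to split, using only the first three columns as features:

    import sklearn.metrics

    # Features: columns 0-2 (default, housing, loan); labels: response only.
    X_train, y_train = data[train_index, :3], response.iloc[train_index]
    X_test, y_test = data[test_index, :3], response.iloc[test_index]
    model = LogisticRegression().fit(X_train, y_train)
    pred = model.predict(X_test)
    acc = sklearn.metrics.accuracy_score(y_test, pred, normalize=True)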
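To see why the target column matters, here is a minimal sketch on synthetic data. The random arrays `X` and `y` are stand-ins for the real bank data (an assumption, since the dataset is not shown), and `leaky`, `leaky_acc`, and `clean_acc` are names made up for this example. It compares cross-validated accuracy when the label is included among the features against accuracy when it is left out.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 600

# Synthetic stand-ins: three binary features and a binary label,
# generated independently so the features carry no real signal.
X = rng.integers(0, 2, size=(n, 3))
y = rng.integers(0, 2, size=n)

kfold = KFold(n_splits=3, shuffle=True, random_state=0)

# Leaky setup: the label is appended as a fourth feature column,
# mirroring the `data` array from the question.
leaky = np.column_stack([X, y])
leaky_acc = cross_val_score(LogisticRegression(), leaky, y, cv=kfold).mean()

# Clean setup: only the three explanatory variables are used as features.
clean_acc = cross_val_score(LogisticRegression(), X, y, cv=kfold).mean()

print(f"with label as a feature: {leaky_acc:.3f}")
print(f"features only:           {clean_acc:.3f}")
```

On this synthetic data the leaky setup should score at or near perfect while the clean one stays close to chance. The real dataset will give a different clean accuracy, but the gap between the two runs is the signature of the leak.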