There is only one feature dimension, yet the results make no sense. The code and data are below. The purpose of the code is to decide whether two sentences are the same.
In effect, the model's final input is: feature [1] gets label 1, and feature [0] gets label 0.
The data is very simple:
sent1 sent2 label
我想听 我想听 1
我想听 我想说 0
我想说 我想说 1
我想说 我想听 0
我想听 我想听 1
我想听 我想说 0
我想说 我想说 1
我想说 我想听 0
我想听 我想听 1
我想听 我想说 0
我想说 我想说 1
我想说 我想听 0
我想听 我想听 1
我想听 我想说 0
我想说 我想说 1
我想说 我想听 0
我想听 我想听 1
我想听 我想说 0
我想说 我想说 1
我想说 我想听 0
import pandas as pd
import xgboost as xgb
from sklearn import metrics
from sklearn.model_selection import train_test_split

d = pd.read_csv("data_small.tsv", sep=" ")

def my_test(sent1, sent2):
    # Feature is [1] iff both sentences contain the same key phrase
    result = [0]
    if "我想说" in sent1 and "我想说" in sent2:
        result[0] = 1
    if "我想听" in sent1 and "我想听" in sent2:
        result[0] = 1
    return result

fea_ = d.apply(lambda row: my_test(row['sent1'], row['sent2']), axis=1).tolist()
labels = d["label"].tolist()
fea = pd.DataFrame(fea_)
for i in range(len(fea_)):
    print(fea_[i], labels[i])
labels = pd.DataFrame(labels)

# train_x_pd_split, valid_x_pd, train_y_pd_split, valid_y_pd = train_test_split(
#     fea, labels, test_size=0.2, random_state=1234)
train_x_pd_split = fea[0:16]
valid_x_pd = fea[16:20]
train_y_pd_split = labels[0:16]
valid_y_pd = labels[16:20]

train_xgb_split = xgb.DMatrix(train_x_pd_split, label=train_y_pd_split)
valid_xgb = xgb.DMatrix(valid_x_pd, label=valid_y_pd)
watch_list = [(train_xgb_split, 'train'), (valid_xgb, 'valid')]
params3 = {
    'seed': 1337,
    'colsample_bytree': 0.48,
    'silent': 1,
    'subsample': 1,
    'eta': 0.05,
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'max_depth': 8,
    'min_child_weight': 20,
    'nthread': 8,
    'tree_method': 'hist',
}
xgb_trained_model = xgb.train(params3, train_xgb_split, 1000, watch_list,
                              early_stopping_rounds=50, verbose_eval=10)
# xgb_trained_model.save_model("predict/model/xgb_model_all")

print("feature importance 0:")
importance = xgb_trained_model.get_fscore()
temp1 = []
temp2 = []
for k in importance:
    temp1.append(k)
    temp2.append(importance[k])
print("-----")
feature_importance_df = pd.DataFrame({
    'column': temp1,
    'importance': temp2,
}).sort_values(by='importance')
# print(feature_importance_df)
feature_sort_list = feature_importance_df["column"].tolist()
feature_importance_list = feature_importance_df["importance"].tolist()
print()
for i, item in enumerate(feature_sort_list):
    print(item, feature_importance_list[i])

train_x_xgb = xgb.DMatrix(train_x_pd_split)
train_predict = xgb_trained_model.predict(train_x_xgb)
print(train_predict)
train_predict_binary = (train_predict >= 0.5) * 1
print("TRAIN DATA SELF")
print('LogLoss: %.4f' % metrics.log_loss(train_y_pd_split, train_predict))
print('AUC: %.4f' % metrics.roc_auc_score(train_y_pd_split, train_predict))
print('ACC: %.4f' % metrics.accuracy_score(train_y_pd_split, train_predict_binary))
print('Recall: %.4f' % metrics.recall_score(train_y_pd_split, train_predict_binary))
print('F1-score: %.4f' % metrics.f1_score(train_y_pd_split, train_predict_binary))
print('Precision: %.4f' % metrics.precision_score(train_y_pd_split, train_predict_binary))
print()

valid_xgb = xgb.DMatrix(valid_x_pd)
valid_predict = xgb_trained_model.predict(valid_xgb)
print(valid_predict)
valid_predict_binary = (valid_predict >= 0.5) * 1
print("TEST DATA PERFORMANCE")
print('LogLoss: %.4f' % metrics.log_loss(valid_y_pd, valid_predict))
print('AUC: %.4f' % metrics.roc_auc_score(valid_y_pd, valid_predict))
print('ACC: %.4f' % metrics.accuracy_score(valid_y_pd, valid_predict_binary))
print('Recall: %.4f' % metrics.recall_score(valid_y_pd, valid_predict_binary))
print('F1-score: %.4f' % metrics.f1_score(valid_y_pd, valid_predict_binary))
print('Precision: %.4f' % metrics.precision_score(valid_y_pd, valid_predict_binary))
But the results show that XGBoost did not fit the data at all:

TRAIN DATA SELF
LogLoss: 0.6931
AUC: 0.5000
ACC: 0.5000
Recall: 1.0000
F1-score: 0.6667
Precision: 0.5000

TEST DATA PERFORMANCE
LogLoss: 0.6931
AUC: 0.5000
ACC: 0.5000
Recall: 1.0000
F1-score: 0.6667
Precision: 0.5000
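One detail worth noticing in those numbers: a logloss of 0.6931 is exactly ln 2, which is what you get when the model outputs a constant 0.5 for every row, i.e. the boosted trees never made a single split. A quick standalone check (not part of the original script):

```python
import math
from sklearn import metrics

# Any labels, constant 0.5 prediction -> logloss = ln(2) ≈ 0.6931
labels = [1, 0] * 8
constant_pred = [0.5] * len(labels)
loss = metrics.log_loss(labels, constant_pred)
print('%.4f' % loss, '%.4f' % math.log(2))  # → 0.6931 0.6931
```

So the question to ask is not "why are the metrics bad" but "why does the booster refuse to split".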
Answer:
I got 100% convergence. Here are the differences between our configurations:

- I set min_child_weight to 0. Setting it to 20 and still expecting XGBoost to find a split point is unreasonable: for binary:logistic, min_child_weight bounds the sum of hessians p(1-p) ≤ 0.25 per row in each child, so your 16 training rows can contribute at most 4 in total, far below 20, and every candidate split is rejected.
- I removed colsample_bytree. You only have one feature, so I don't think column subsampling is a good choice here.