Suppose I have the following data frame:
       Student_Id  Math  Physical  Arts Class Sub_Class
    0        id_1     6         7     9     A         x
    1        id_2     9         7     1     A         y
    2        id_3     3         5     5     C         x
    3        id_4     6         8     9     A         x
    4        id_5     6         7    10     B         z
    5        id_6     9         5    10     B         z
    6        id_7     3         5     6     C         x
    7        id_8     3         4     6     C         x
    8        id_9     6         8     9     A         x
    9       id_10     6         7    10     B         z
    10      id_11     9         5    10     B         z
    11      id_12     3         5     6     C         x
    12      id_13     3         4     6     C         x
I want to use a RandomForestClassifier. First I train it with Class as the target variable and predict Class on the test dataset.
       Student_Id Class Sub_Class predicted_class
    11      id_12     C         x               C
    8        id_9     A         x               A
    3        id_4     A         x               A
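Then, for each class predicted in the test dataset, it should train only on the training rows that belong to that class and predict Sub_Class, adding the groups one at a time.
1) First, it picks class 'C', trains only on class 'C', and predicts Sub_Class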
       Student_Id Class Sub_Class predicted_class predicted_Sub_Class
    11      id_12     C         x               C                   x
2) Next, it picks class 'A', trains only on class 'A', and predicts Sub_Class
       Student_Id Class Sub_Class predicted_class predicted_Sub_Class
    8        id_9     A         x               A                   x
    3        id_4     A         x               A                   y
3) Finally, it combines all of them
       Student_Id Class Sub_Class predicted_class predicted_Sub_Class
    11      id_12     C         x               C                   x
    8        id_9     A         x               A                   x
    3        id_4     A         x               A                   y
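To summarize: I don't want to train and predict Class and Sub_Class independently. I want to predict Class first, then use that prediction to train a model per class group, because I think this will improve the results.
I can't figure out how to do the second part: looping over the classes and training a model on each class to get the Sub_Class.
Sample code so far, without the second part: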
    import pandas as pd
    from sklearn import metrics
    from sklearn.metrics import classification_report
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import accuracy_score
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Create dataframe
    data = [
        ["id_1", 6, 7, 9, "A", "x"],
        ["id_2", 9, 7, 1, "A", "y"],
        ["id_3", 3, 5, 5, "C", "x"],
        ["id_4", 6, 8, 9, "A", "x"],
        ["id_5", 6, 7, 10, "B", "z"],
        ["id_6", 9, 5, 10, "B", "z"],
        ["id_7", 3, 5, 6, "C", "x"],
        ["id_8", 3, 4, 6, "C", "x"],
        ["id_9", 6, 8, 9, "A", "x"],
        ["id_10", 6, 7, 10, "B", "z"],
        ["id_11", 9, 5, 10, "B", "z"],
        ["id_12", 3, 5, 6, "C", "x"],
        ["id_13", 3, 4, 6, "C", "x"],
    ]
    df = pd.DataFrame(data, columns=['Student_Id', 'Math', 'Physical', 'Arts', 'Class', 'Sub_Class'])

    # Split into test and train
    training_data, testing_data = train_test_split(df, test_size=0.2, random_state=25)

    # First predict (classify) the Class--------------------------------------------
    # Create train data (the target is a 1-D Series, which is what fit() expects)
    X_train = training_data[['Math', 'Physical', 'Arts']]
    y_train = training_data['Class']

    # Create test data
    X_test = testing_data[['Math', 'Physical', 'Arts']]
    y_test = testing_data['Class']

    # Random Forest classifier for predicting Class
    rfc = RandomForestClassifier(n_estimators=50).fit(X_train, y_train)
    predictions = rfc.predict(X_test)
    rfc_table = testing_data[['Student_Id', 'Class', 'Sub_Class']]
    rfc_table = rfc_table.assign(predicted_class=predictions)

    # Next train for Sub_Class------------------------------------------------------
Answer:
You can do it like this:
    # Training function: takes a df, writes the predicted sub-class into
    # df['Predicted_subClass'], and returns the trained per-class models
    def train_sub(df):
        # Dictionary of trained models, keyed by class
        models = {}
        # Iterate over every unique class in df
        for i in df['Class'].unique():
            # Index labels of the rows whose Class equals i
            temp_idx = df[df['Class'] == i].index
            train_idx, test_idx = train_test_split(temp_idx, test_size=0.2, random_state=25)
            X_train = df.loc[train_idx, ['Math', 'Physical', 'Arts']]
            y_train = df.loc[train_idx, 'Sub_Class']
            # Held-out rows of this class, available for evaluation
            X_test = df.loc[test_idx, ['Math', 'Physical', 'Arts']]
            y_test = df.loc[test_idx, 'Sub_Class']
            # Train a model to classify the sub-classes within this class
            temp_model = RandomForestClassifier(n_estimators=50).fit(X_train, y_train)
            # Write predictions back into df for all rows of this class
            df.loc[temp_idx, 'Predicted_subClass'] = temp_model.predict(
                df.loc[temp_idx, ['Math', 'Physical', 'Arts']]
            )
            # Store the model in the dictionary
            models[i] = temp_model
        return models

    # Call the function
    models = train_sub(df)

    # View the result
    df
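The function above groups rows by their true Class inside a single df. If you want to route the test rows by their predicted class instead, as the question describes, a minimal sketch could look like the following. The helper name `predict_class_then_sub_class`, the `FEATURES` constant, and the `seed` parameter are made up for illustration and are not part of scikit-learn; the data is the frame from the question.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

FEATURES = ["Math", "Physical", "Arts"]  # assumed feature columns


def predict_class_then_sub_class(train, test, features=FEATURES, seed=0):
    """Hypothetical helper: predict Class, then Sub_Class with a per-class model."""
    # Stage 1: one model for Class, trained on all training rows.
    class_model = RandomForestClassifier(n_estimators=50, random_state=seed)
    class_model.fit(train[features], train["Class"])

    result = test.copy()
    result["predicted_class"] = class_model.predict(test[features])
    result["predicted_Sub_Class"] = None

    # Stage 2: for each predicted class, train only on that class's training rows
    # and predict Sub_Class for the test rows routed to it.
    for cls in result["predicted_class"].unique():
        train_rows = train[train["Class"] == cls]
        test_mask = result["predicted_class"] == cls
        sub_model = RandomForestClassifier(n_estimators=50, random_state=seed)
        sub_model.fit(train_rows[features], train_rows["Sub_Class"])
        result.loc[test_mask, "predicted_Sub_Class"] = sub_model.predict(
            result.loc[test_mask, features]
        )
    return result


data = [
    ["id_1", 6, 7, 9, "A", "x"], ["id_2", 9, 7, 1, "A", "y"],
    ["id_3", 3, 5, 5, "C", "x"], ["id_4", 6, 8, 9, "A", "x"],
    ["id_5", 6, 7, 10, "B", "z"], ["id_6", 9, 5, 10, "B", "z"],
    ["id_7", 3, 5, 6, "C", "x"], ["id_8", 3, 4, 6, "C", "x"],
    ["id_9", 6, 8, 9, "A", "x"], ["id_10", 6, 7, 10, "B", "z"],
    ["id_11", 9, 5, 10, "B", "z"], ["id_12", 3, 5, 6, "C", "x"],
    ["id_13", 3, 4, 6, "C", "x"],
]
df = pd.DataFrame(
    data, columns=["Student_Id", "Math", "Physical", "Arts", "Class", "Sub_Class"]
)
training_data, testing_data = train_test_split(df, test_size=0.2, random_state=25)

result = predict_class_then_sub_class(training_data, testing_data)
print(result[["Student_Id", "Class", "Sub_Class",
              "predicted_class", "predicted_Sub_Class"]])
```

Because the stage-1 model can only predict classes it saw during training, every predicted class has at least one training row for stage 2. Note that a wrong Class prediction cannot be corrected in stage 2: the row is sent to the wrong per-class model, so it is worth comparing this against a single model trained directly on Sub_Class.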