Why does OneVsRestClassifier() score far lower on the same dataset than simply using the multi_class="ovr" parameter?
Fitting and scoring a logistic regression the simple way:
```python
# Load data, assign variables
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

training_data = pd.read_csv("iris.data")
training_data.columns = [
    "sepal_length",
    "sepal_width",
    "petal_length",
    "petal_width",
    "class",
]
feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
label_cols = ["class"]
X = training_data.loc[:, feature_cols]
y = training_data.loc[:, label_cols].values.ravel()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Instantiate and fit the model:
logreg = LogisticRegression(solver="liblinear", multi_class="ovr", random_state=24)
clf = logreg.fit(X_train, y_train)

# See if the model is reasonable.
print("Score: ", clf.score(X_test, y_test))
```
I get a score of 0.92, whereas with OneVsRestClassifier the score is 0.62:
```python
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize

training_data = pd.read_csv("iris.data")
training_data.columns = [
    "sepal_length",
    "sepal_width",
    "petal_length",
    "petal_width",
    "class",
]
feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
label_cols = ["class"]
X = training_data.loc[:, feature_cols]
y = training_data.loc[:, label_cols].values.ravel()

# Transform labels to 0-1-2
le = preprocessing.LabelEncoder()
le.fit(training_data.loc[:, label_cols].values.ravel())
y = le.transform(training_data.loc[:, label_cols].values.ravel())

# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = 3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Instantiate and fit the model:
logreg = OneVsRestClassifier(LogisticRegression(solver="liblinear", random_state=24))
clf = logreg.fit(X_train, y_train)

# See if the model is reasonable.
print("Score: ", clf.score(X_test, y_test))
```
Why does one approach work so much better than the other?
This is what the input data looks like (it is the Iris dataset):
training_data
```
sepal_length  sepal_width  petal_length  petal_width  class
4.9           3.0          1.4           0.2          Iris-setosa
4.7           3.2          1.3           0.2          Iris-setosa
(...)
```
Answer:
Yes, the scores should be the same. The problem is that in the second approach you binarize the output. That changes the form of y, which in turn changes the predictions. Try clf.predict(X_test) to check whether the predictions come out in the expected format.
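A minimal sketch (with a small made-up label array, not your actual data) of how label_binarize reshapes y: the 1-D labels become a 2-D one-hot indicator matrix, so OneVsRestClassifier treats the task as multilabel and scores it with the much stricter subset accuracy.

```python
# Sketch: label_binarize turns 1-D class labels into a 2-D indicator
# matrix, changing how the problem (and its score) is interpreted.
import numpy as np
from sklearn.preprocessing import label_binarize

y = np.array([0, 1, 2, 1])                    # ordinary 1-D class labels
y_bin = label_binarize(y, classes=[0, 1, 2])  # one column per class

print(y.shape)      # (4,)   -> treated as multiclass
print(y_bin.shape)  # (4, 3) -> treated as multilabel
print(y_bin)
```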
To fix your problem, remove this line of code:
y = label_binarize(y, classes=[0, 1, 2])
Then, to get the same score across the different approaches, add a random state when calling train_test_split:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=24)
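With both fixes applied, a hedged end-to-end sketch (using sklearn's bundled iris data as a stand-in for your "iris.data" file) shows OneVsRestClassifier scoring in the same range as the plain logistic regression:

```python
# Sketch: OneVsRestClassifier with ordinary 1-D labels (no label_binarize)
# and a fixed random_state; iris data loaded from sklearn as a stand-in.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=24
)

clf = OneVsRestClassifier(
    LogisticRegression(solver="liblinear", random_state=24)
).fit(X_train, y_train)
print("Score: ", clf.score(X_test, y_test))
```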
To answer your comment:
If you want to use labels binarized with label_binarize(), keep the line
y = label_binarize(y, classes=[0, 1, 2])
and then, after fitting the model, compute y_score like this:
```python
# Instantiate and fit the model:
logreg = OneVsRestClassifier(LogisticRegression(solver="liblinear", random_state=24))
y_score = logreg.fit(X_train, y_train).decision_function(X_test)
```
The difference from your code is the call to .decision_function(X_test), which computes a score for each class. To make sense of the result, display y_score.
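For illustration, a self-contained sketch (again substituting sklearn's bundled iris data for your CSV) of what y_score looks like: a 2-D array with one row per test sample and one column per class.

```python
# Sketch: decision_function of a 3-class OneVsRestClassifier returns
# one score column per class (higher score = more confident).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize

X, y = load_iris(return_X_y=True)
y = label_binarize(y, classes=[0, 1, 2])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=24
)

clf = OneVsRestClassifier(LogisticRegression(solver="liblinear", random_state=24))
y_score = clf.fit(X_train, y_train).decision_function(X_test)
print(y_score.shape)   # one row per test sample, one column per class
print(y_score[:3])     # first three rows of per-class scores
```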
Your data will then be in the right format to continue with the tutorial.