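I have a dataset in which each observation can belong to several labels (multi-label classification).
I trained an SVM classifier on it and it works well. (I am interested in per-class accuracy here, so I applied OneVsRestClassifier to each class separately, as you can see in the code.)
I would like to see the predicted value for each item in the test data. In other words, I want to see which labels the model predicted for each observation in the test sample.
For example, this is the data passed to the model for prediction: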
    ,sentences,ADR,WD,EF,INF,SSI,DI,others
    0,"extreme weight gain, short-term memory loss, hair loss.",1,0,0,0,0,0,0
    1,I am detoxing from Lexapro now.,0,0,0,0,0,0,1
    2,I slowly cut my dosage over several months and took vitamin supplements to help.,0,0,0,0,0,0,1
    3,I am now 10 days completely off and OMG is it rough.,0,0,0,0,0,0,1
    4,"I have flu-like symptoms, dizziness, major mood swings, lots of anxiety, tiredness.",0,1,0,0,0,0,1
    5,I have no idea when this will end.,1,0,0,0,0,0,1
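The model has then predicted labels for these rows, and I would like to see the prediction mapped to each row.
I know this can be done with Label Binarization from the scikit-learn library.
The problem is that the input expected by fit_transform, as explained here, differs from the target data I prepared and passed to the SVM classifier, so I do not know how to solve this.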
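To make the mismatch concrete, here is a minimal sketch. It assumes the binarizer in question is scikit-learn's MultiLabelBinarizer (the question does not name the class) and uses two made-up label sets. fit_transform takes a collection of label names per sample and produces the 0/1 columns that the CSV already contains, while inverse_transform turns such columns back into label names.

```python
from sklearn.preprocessing import MultiLabelBinarizer

categories = ['ADR', 'WD', 'EF', 'INF', 'SSI', 'DI', 'others']

# Passing classes= fixes the column order so it matches the CSV columns.
mlb = MultiLabelBinarizer(classes=categories)

# fit_transform expects one collection of label names per sample
# (these two label sets are illustrative, not taken from the dataset).
label_sets = [('ADR',), ('WD', 'others')]
indicator = mlb.fit_transform(label_sets)
print(indicator)

# inverse_transform maps a 0/1 indicator matrix back to label names.
decoded = mlb.inverse_transform(indicator)
print(decoded)
```

Since the target columns in the CSV are already in indicator form, inverse_transform is likely the more relevant direction for reading predictions back as names.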
Here is my code:
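    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import accuracy_score, f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    # Placeholder: the original definition of stop_words is not shown in the question.
    # None disables stop-word filtering.
    stop_words = None

    df = pd.read_csv("finalupdatedothers.csv")
    categories = ['ADR', 'WD', 'EF', 'INF', 'SSI', 'DI', 'others']

    train, test = train_test_split(df, random_state=42, test_size=0.3, shuffle=True)
    X_train = train.sentences
    X_test = test.sentences

    SVC_pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words=stop_words)),
        ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
    ])

    for category in categories:
        print('... Processing {} '.format(category))
        # Fit a separate binary model on this category's 0/1 column
        SVC_pipeline.fit(X_train, train[category])
        prediction = SVC_pipeline.predict(X_test)
        print('SVM Linear Test accuracy is {} '.format(accuracy_score(test[category], prediction)))
        print('SVM Linear f1 measurement is {} '.format(f1_score(test[category], prediction, average='weighted')))
        print("\n")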
Thank you for your time.
Answer:
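This is what you want. All I did was map prediction, a numpy array whose values are used as indices into the categories list, to the corresponding category names. Note that because each model is fitted on a single 0/1 column, prediction contains only 0s and 1s, so the lookup categories[prediction[i]] returns 'ADR' for 0 and 'WD' for 1 regardless of which category is being processed. Here is the complete code.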
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import accuracy_score, f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    df = pd.read_csv("finalupdatedothers.csv")
    categories = ['ADR', 'WD', 'EF', 'INF', 'SSI', 'DI', 'others']

    train, test = train_test_split(df, random_state=42, test_size=0.3, shuffle=True)
    X_train = train.sentences
    X_test = test.sentences

    SVC_pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words=[])),
        ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
    ])

    for category in categories:
        print('... Processing {} '.format(category))
        # Fit a separate binary model on this category's 0/1 column
        SVC_pipeline.fit(X_train, train[category])
        prediction = SVC_pipeline.predict(X_test)
        # prediction is a 0/1 array for this category; it is used here as an index into categories
        print([{X_test.iloc[i]: categories[prediction[i]]} for i in range(len(prediction))])
        print('SVM Linear Test accuracy is {} '.format(accuracy_score(test[category], prediction)))
        print('SVM Linear f1 measurement is {} '.format(f1_score(test[category], prediction, average='weighted')))
        print("\n")
Here is the sample output:
    ... Processing ADR
    [{'extreme weight gain, short-term memory loss, hair loss.': 'ADR'}, {'I am detoxing from Lexapro now.': 'ADR'}]
    SVM Linear Test accuracy is 0.5
    SVM Linear f1 measurement is 0.3333333333333333

    ... Processing WD
    [{'extreme weight gain, short-term memory loss, hair loss.': 'ADR'}, {'I am detoxing from Lexapro now.': 'ADR'}]
    SVM Linear Test accuracy is 1.0
    SVM Linear f1 measurement is 1.0
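As an alternative to looping over the categories, a single OneVsRestClassifier can be fitted on all label columns at once, so that predict returns one 0/1 row per sentence and each row can be translated into label names. The sketch below is an illustration under stated assumptions, not the code from this answer: it uses the six example sentences from the question as toy data (rows 0 to 3 for training, rows 4 and 5 for testing), which is far too little for meaningful predictions, and the variable names are illustrative. With such a small split, several label columns have no positive training examples, so scikit-learn may emit warnings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

categories = ['ADR', 'WD', 'EF', 'INF', 'SSI', 'DI', 'others']

# Toy data: the six example rows shown in the question.
sentences = [
    "extreme weight gain, short-term memory loss, hair loss.",
    "I am detoxing from Lexapro now.",
    "I slowly cut my dosage over several months and took vitamin supplements to help.",
    "I am now 10 days completely off and OMG is it rough.",
    "I have flu-like symptoms, dizziness, major mood swings, lots of anxiety, tiredness.",
    "I have no idea when this will end.",
]
Y = np.array([
    [1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 1],
])

# Assumed split for illustration: first four rows train, last two test.
X_train, Y_train = sentences[:4], Y[:4]
X_test = sentences[4:]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])

# Fitting on the full indicator matrix trains one binary model per column.
pipeline.fit(X_train, Y_train)

# Y_pred has one row per test sentence and one 0/1 column per category.
Y_pred = pipeline.predict(X_test)

# Translate each 0/1 row into the names of the predicted categories.
predicted_labels = [
    [name for name, flag in zip(categories, row) if flag == 1]
    for row in Y_pred
]

for sentence, labels in zip(X_test, predicted_labels):
    print(sentence, '->', labels)
```

Per-category accuracy can still be computed from this output by comparing each column of Y_pred with the matching column of the true labels.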
Hope this helps.