I am trying to get the most informative features from a text corpus. From the answer to this question, I know the task can be done as follows:
def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
    labelid = list(classifier.classes_).index(classlabel)
    feature_names = vectorizer.get_feature_names()
    topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]
    for coef, feat in topn:
        print classlabel, feat, coef
Then:
most_informative_feature_for_class(tfidf_vect, clf, 5)
With this classifier:
X = tfidf_vect.fit_transform(df['content'].values)
y = df['label'].values

from sklearn import cross_validation
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.33)

clf = SVC(kernel='linear', C=1)
clf.fit(X, y)
prediction = clf.predict(X_test)
The problem is the output of most_informative_feature_for_class:
5 a_base_de_bien bastante
(0, 2451)  -0.210683496368
(0, 3533)  -0.173621065386
(0, 8034)  -0.135543062425
(0, 10346) -0.173621065386
(0, 15231) -0.154148294738
(0, 18261) -0.158890483047
(0, 21083) -0.297476572586
(0, 434)   -0.0596263855375
(0, 446)   -0.0753492277856
(0, 769)   -0.0753492277856
(0, 1118)  -0.0753492277856
(0, 1439)  -0.0753492277856
(0, 1605)  -0.0753492277856
(0, 1755)  -0.0637950312345
(0, 3504)  -0.0753492277856
(0, 3511)  -0.115802483001
(0, 4382)  -0.0668983049212
(0, 5247)  -0.315713152154
(0, 5396)  -0.0753492277856
(0, 5753)  -0.0716096348446
(0, 6507)  -0.130661516772
(0, 7978)  -0.0753492277856
(0, 8296)  -0.144739048504
(0, 8740)  -0.0753492277856
(0, 8906)  -0.0753492277856
:   :
(0, 23282) 0.418623443832
(0, 4100)  0.385906085143
(0, 15735) 0.207958503155
(0, 16620) 0.385906085143
(0, 19974) 0.0936828782325
(0, 20304) 0.385906085143
(0, 21721) 0.385906085143
(0, 22308) 0.301270427482
(0, 14903) 0.314164150621
(0, 16904) 0.0653764031957
(0, 20805) 0.0597723455204
(0, 21878) 0.403750815828
(0, 22582) 0.0226150073272
(0, 6532)  0.525138162099
(0, 6670)  0.525138162099
(0, 10341) 0.525138162099
(0, 13627) 0.278332617058
(0, 1600)  0.326774799211
(0, 2074)  0.310556919237
(0, 5262)  0.176400451433
(0, 6373)  0.290124806858
(0, 8593)  0.290124806858
(0, 12002) 0.282832270298
(0, 15008) 0.290124806858
(0, 19207) 0.326774799211
It is not returning the label and the words. Why is that? How can I print out the words and the labels? Do you think this is happening because I read the data with pandas? I also tried the following, from this question:
def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class."""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label, " ".join(feature_names[j] for j in top10)))

print_top10(tfidf_vect, clf, y)
But I got this error message:
Traceback (most recent call last):
File "/Users/user/PycharmProjects/TESIS_FINAL/Classification/Supervised_learning/Final/experimentos/RBF/SVM_con_rbf.py", line 237, in <module> print_top10(tfidf_vect,clf,5) File "/Users/user/PycharmProjects/TESIS_FINAL/Classification/Supervised_learning/Final/experimentos/RBF/SVM_con_rbf.py", line 231, in print_top10 for i, class_label in enumerate(class_labels):TypeError: 'int' object is not iterable
Do you have any way around this, so I can get the features with the highest coefficient values?
Answer:
To solve this specifically for the linear SVM, we first have to understand the formulation of the SVM in scikit-learn and how it differs from MultinomialNB.

The reason most_informative_feature_for_class works for MultinomialNB is that the output of coef_ is essentially the log probability of features given a class (and hence has size [n_classes, n_features]), owing to the formulation of the naive Bayes problem. But if we check the documentation for the SVM, coef_ is not that simple. Instead, coef_ for a (linear) SVM has shape [n_classes * (n_classes - 1) / 2, n_features], because one binary model is fitted for each possible pair of classes.
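To make that shape difference concrete, here is a minimal sketch on a made-up four-class toy corpus (the documents and labels are invented for illustration, and it assumes the older scikit-learn API used throughout this question, where MultinomialNB still exposes coef_):

# Minimal sketch: compare coef_ shapes for MultinomialNB vs. linear SVC.
# The toy documents/labels are hypothetical, chosen only to give 4 classes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

docs = ['uno dos', 'dos tres', 'tres cuatro', 'cuatro cinco']
labels = ['bs', 'pt', 'es', 'sr']

vec = CountVectorizer()
X = vec.fit_transform(docs)

mnb = MultinomialNB().fit(X, labels)
svc = SVC(kernel='linear', C=1).fit(X, labels)

print mnb.coef_.shape  # (4, n_features): one row per class
print svc.coef_.shape  # (6, n_features): 4 * (4 - 1) / 2 pairwise models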
If we know the particular coefficient we are interested in, we can modify the function as follows:
def most_informative_feature_for_class_svm(vectorizer, classifier, classlabel, n=10):
    labelid = ??  # this is the coefficient we're interested in
    feature_names = vectorizer.get_feature_names()
    svm_coef = classifier.coef_.toarray()
    topn = sorted(zip(svm_coef[labelid], feature_names))[-n:]
    for coef, feat in topn:
        print feat, coef
This will work as intended, printing the top n features for whichever coefficient vector you are after.
As for getting the correct output for a specific class, that will depend on your assumptions and what you aim to output. I would suggest reading through the multiclass section of the SVM documentation to get a feel for what you are after.
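If it helps, below is a sketch of how you might map each row index of the one-vs-one coef_ back to a pair of classes. The pair ordering ((0, 1), (0, 2), ..., (1, 2), ...) is my reading of the multiclass SVM docs, so treat it as an assumption and verify it against your scikit-learn version:

# Sketch: list which pair of classes each row of a one-vs-one coef_
# belongs to, assuming rows follow combinations(classifier.classes_, 2).
from itertools import combinations

def svm_coef_labels(classifier):
    for labelid, (a, b) in enumerate(combinations(classifier.classes_, 2)):
        print labelid, '->', a, 'vs', b

Under that assumption, with the four tags of the example below (sorted by the classifier into ['bs', 'es', 'pt', 'sr']), labelid = 3 would be the 'es' vs 'pt' model, which would be consistent with the Spanish-looking words in the SVM output at the end.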
So, using the train.txt file described in that answer, we can get some kind of output, although in this situation it is not particularly descriptive or helpful to interpret. Hopefully this helps.
import codecs

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

trainfile = 'train.txt'

# Vectorizing data.
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile, 'r', 'utf8'))
tags = ['bs', 'pt', 'es', 'sr']

# Training NB.
mnb = MultinomialNB()
mnb.fit(trainset, tags)

# Training a linear SVM on the same data.
svcc = SVC(kernel='linear', C=1)
svcc.fit(trainset, tags)

def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
    labelid = list(classifier.classes_).index(classlabel)
    feature_names = vectorizer.get_feature_names()
    topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]
    for coef, feat in topn:
        print classlabel, feat, coef

def most_informative_feature_for_class_svm(vectorizer, classifier, n=10):
    labelid = 3  # this is the coefficient we're interested in
    feature_names = vectorizer.get_feature_names()
    svm_coef = classifier.coef_.toarray()
    topn = sorted(zip(svm_coef[labelid], feature_names))[-n:]
    for coef, feat in topn:
        print feat, coef

most_informative_feature_for_class(word_vectorizer, mnb, 'pt')
most_informative_feature_for_class_svm(word_vectorizer, svcc)
Which outputs:
pt teve -4.63472898823
pt tive -4.63472898823
pt todas -4.63472898823
pt vida -4.63472898823
pt de -4.22926388012
pt foi -4.22926388012
pt mais -4.22926388012
pt me -4.22926388012
pt as -3.94158180767
pt que -3.94158180767
no 0.0204081632653
parecer 0.0204081632653
pone 0.0204081632653
por 0.0204081632653
relación 0.0204081632653
una 0.0204081632653
visto 0.0204081632653
ya 0.0204081632653
es 0.0408163265306
lo 0.0408163265306
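As a final aside: if what you actually want is one coefficient vector per class (as with MultinomialNB), note that LinearSVC fits a one-vs-rest model by default, so its coef_ already has shape [n_classes, n_features]. This is a sketch of that alternative, not what was used above, so treat it as an assumption to verify:

# Sketch: LinearSVC is one-vs-rest by default, so coef_ can be indexed
# by class directly, just like the MultinomialNB version of the function.
from sklearn.svm import LinearSVC

lsvc = LinearSVC(C=1)
lsvc.fit(trainset, tags)  # reusing trainset/tags from the example above

def most_informative_feature_for_class_ovr(vectorizer, classifier, classlabel, n=10):
    labelid = list(classifier.classes_).index(classlabel)
    feature_names = vectorizer.get_feature_names()
    topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]
    for coef, feat in topn:
        print classlabel, feat, coef

most_informative_feature_for_class_ovr(word_vectorizer, lsvc, 'pt')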