How do I extract word features from a trained MultinomialNB Pipeline model in scikit-learn?

# Note: the runnable code example is at the end of this question.
# Assume X_train contains cleaned sentence text as input data and y_train holds the class labels.
# parameters stores the parameters to be tried by GridSearchCV.
text_clf_Pipline_MultinomialNB = Pipeline([('vect', CountVectorizer()),
                                           ('tfidf', TfidfTransformer()),
                                           ('clf', MultinomialNB()),
                                           ])
gs_clf = GridSearchCV(text_clf_Pipline_MultinomialNB, parameters, n_jobs=-1)
gs_classifier = gs_clf.fit(X_train, y_train)

Following the sklearn.naive_bayes.MultinomialNB documentation, I can now get feature_log_prob_ from gs_classifier. Here is an example.

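My question is how to get the word that corresponds to each log probability. Both CountVectorizer() and TfidfTransformer() perform feature selection. Where does the GridSearchCV object store the selected word/phrase features, and how can I match them to the probabilities?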

I have inspected the members of gs_classifier but could not find the selected features. Thanks.

Here is a runnable example:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
# sklearn.grid_search was removed in scikit-learn 0.20; newer versions use sklearn.model_selection
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from inspect import getmembers

X_train = ['qwe rtyuiop', 'asd fghj kl', 'zx cv bnm', 'qw erty ui op', 'as df ghj kl', 'zxc vb nm', 'qwe rt yu iop', 'asdfg hj kl', 'zx cvb nm',
           'qwe rt yui op', 'asd fghj kl', 'zx cvb nm', 'qwer tyui op', 'asd fg hjk l', 'zx cv b nm', 'qw ert yu iop', 'as df gh jkl', 'zx cvb nm',
           'qwe rty uiop', 'asd fghj kl', 'zx cvbnm', 'qw erty ui op', 'as df ghj kl', 'zxc vb nm', 'qwe rtyu iop', 'as dfg hj kl', 'zx cvb nm',
           'qwe rt yui op', 'asd fg hj kl', 'zx cvb nm', 'qwer tyuiop', 'asd fghjk l', 'zx cv b nm', 'qw ert yu iop', 'as df gh jkl', 'zx cvb nm']

y_train = ['1', '2', '3', '1', '1', '3', '1', '2', '3',
           '1', '2', '3', '1', '4', '1', '2', '2', '4',
           '1', '2', '3', '1', '1', '3', '1', '2', '3',
           '1', '2', '3', '1', '4', '1', '2', '2', '4']

# Hyperparameter grid, keyed by <pipeline step>__<parameter name>
parameters = {
    'clf__alpha': (1e-1, 1e-2),
    'vect__ngram_range': [(1, 2), (1, 3)],
    'vect__max_df': (0.9, 0.98)
}

text_clf_Pipline_MultinomialNB = Pipeline([('vect', CountVectorizer()),
                                           ('tfidf', TfidfTransformer()),
                                           ('clf', MultinomialNB()),
                                           ])
gs_clf = GridSearchCV(text_clf_Pipline_MultinomialNB, parameters, n_jobs=-1)
gs_classifier = gs_clf.fit(X_train, y_train)

# Reaches the classifier by position in the getmembers() output, which is fragile
nbclf = getmembers(gs_classifier.best_estimator_)[2][1]['named_steps']['clf']
nbclf.feature_log_prob_

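So the question is: how do I get the list of word features in the trained model that corresponds to the log probabilities? Also, for example, which probability in the feature_log_prob_ output corresponds to the word 'qwe' for class '1'?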

Edit after receiving the answer: Andreas's answer works:

gs_classifier.best_estimator_.named_steps['vect'].get_feature_names()

Similarly, there is a better way to index into the GridSearchCV object to get the trained classifier:

nbclf = gs_classifier.best_estimator_.named_steps['clf']

Answer:

Why do you need getmembers? To get the feature names that correspond to feature_log_prob_:

gs_classifier.best_estimator_.named_steps['vect'].get_feature_names()
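To make the matching concrete, here is a minimal sketch of how the feature names could be lined up with feature_log_prob_. It rests on a few assumptions: the tiny corpus, the labels and the variable names (docs, labels, pipe, row, col) are illustrative and not from the question, and the pipeline is fitted directly rather than through GridSearchCV. Newer scikit-learn releases expose get_feature_names_out() in place of get_feature_names(), so the sketch tries the newer name first.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny illustrative corpus (an assumption, not the question's data).
docs = ['qwe rty', 'qwe uio', 'asd fgh', 'asd jkl']
labels = ['1', '1', '2', '2']

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultinomialNB(alpha=0.1))])
pipe.fit(docs, labels)

vect = pipe.named_steps['vect']
clf = pipe.named_steps['clf']

# Newer releases provide get_feature_names_out(); older ones get_feature_names().
if hasattr(vect, 'get_feature_names_out'):
    names = list(vect.get_feature_names_out())
else:
    names = list(vect.get_feature_names())

# Rows of feature_log_prob_ follow clf.classes_; columns follow the vocabulary.
row = list(clf.classes_).index('1')
col = vect.vocabulary_['qwe']  # column index of the term 'qwe'
log_prob = clf.feature_log_prob_[row, col]
print("log P('qwe' | class '1') =", log_prob)

# The three terms with the highest log probability for class '1'.
top = np.argsort(clf.feature_log_prob_[row])[::-1][:3]
for j in top:
    print(names[j], clf.feature_log_prob_[row, j])
```

With the grid search from the question, the same lookups should apply after replacing pipe with gs_classifier.best_estimator_. TfidfTransformer only reweights the count matrix, so the column order set by CountVectorizer carries through to the classifier.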
