I ran Multinomial and Bernoulli Naive Bayes classifiers, plus a linear SVM, over a set of tweets. On a 60/40 split of 1,000 training tweets they score 80%, 80%, and 90% respectively, which is decent.
Each algorithm has parameters that can be tuned, and I'd like to know whether changing them could get better results. My machine-learning knowledge stops at train, test, and predict, so my question is: which parameters could I tune?
Here is the code I used:
    import codecs
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB, BernoulliNB
    from sklearn import svm

    trainfile = 'training_words.txt'
    testfile = 'testing_words.txt'

    word_vectorizer = CountVectorizer(analyzer='word')
    trainset = word_vectorizer.fit_transform(codecs.open(trainfile, 'r', 'utf8'))
    tags = training_labels  # labels loaded elsewhere

    mnb = svm.LinearSVC()  # or any other classifier
    mnb.fit(trainset, tags)

    testset = word_vectorizer.transform(codecs.open(testfile, 'r', 'utf8'))
    results = mnb.predict(testset)
    print results
Answer:
You can tune the model's parameters with grid-search cross-validation, using stratified K-fold splits. Here is example code.
    import codecs
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn import svm
    from sklearn.grid_search import GridSearchCV

    trainfile = 'training_words.txt'
    testfile = 'testing_words.txt'

    word_vectorizer = CountVectorizer(analyzer='word')
    trainset = word_vectorizer.fit_transform(codecs.open(trainfile, 'r', 'utf8'))
    tags = training_labels  # labels loaded elsewhere

    mnb = svm.LinearSVC()  # or any other classifier

    # Check the sklearn docs for the tunable params of your particular
    # estimator; for a linear SVM, C and class_weight are important ones.
    params_space = {'C': np.logspace(-5, 0, 10), 'class_weight': [None, 'auto']}

    # Build a grid-search CV; n_jobs=-1 uses all your processor cores.
    gscv = GridSearchCV(mnb, params_space, cv=10, n_jobs=-1)
    gscv.fit(trainset, tags)

    # Take a look at the best params combination and best score found.
    print gscv.best_params_
    print gscv.best_score_

    testset = word_vectorizer.transform(codecs.open(testfile, 'r', 'utf8'))
    results = gscv.predict(testset)
    print results
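A couple of notes on the snippet above. `sklearn.grid_search` is the old module path; on current scikit-learn the same class lives in `sklearn.model_selection`. Also, because the vectorizer above is fit on the whole training file before cross-validation, each CV fold sees vocabulary from its own held-out tweets; wrapping the vectorizer and classifier in a `Pipeline` avoids that and lets you tune both in one grid. Below is a minimal self-contained sketch of that idea — the toy texts, labels, and parameter choices are made up for illustration, not taken from your data. For classification, `GridSearchCV` uses stratified K-fold splitting by default.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the tweet data (made up, repeated to give CV enough rows).
texts = ["good movie", "great film", "awful movie", "bad film",
         "loved it", "hated it", "fine film", "terrible movie"] * 5
labels = [1, 1, 0, 0, 1, 0, 1, 0] * 5

# Vectorizer and classifier chained, so the vectorizer is re-fit
# inside every CV fold instead of once on all the data.
pipe = Pipeline([
    ("vect", CountVectorizer(analyzer="word")),
    ("clf", LinearSVC()),
])

# Grid keys are "<step>__<param>"; these particular ranges are illustrative.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__C": np.logspace(-3, 1, 5),
}

# Stratified 5-fold grid search over both steps at once.
gscv = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
gscv.fit(texts, labels)
print(gscv.best_params_)
print(gscv.best_score_)
```

To score on a separate test set afterwards, call `gscv.predict(test_texts)` — the best pipeline found is refit on all the training data automatically.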