Multiprocessing exception in Python machine learning with Multinomial Naive Bayes

I am trying to predict tags for a set of documents. Each document can have multiple tags. Below is a sample program I wrote:

import pandas as pd
import pickle
import re
from sklearn.cross_validation import train_test_split
from sklearn.metrics.metrics import classification_report, accuracy_score, confusion_matrix
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB as MNB
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV

def Mytrain():
    pipeline = Pipeline([
        ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True)),
        ('clf', MNB())
    ])
    parameters = {
        'vect__max_df': (0.25, 0.5, 0.6, 0.7, 1.0),
        'vect__ngram_range': ((1, 1), (1, 2), (2, 3), (1, 3), (1, 4), (1, 5)),
        'vect__use_idf': (True, False),
        'clf__fit_prior': (True, False)
    }
    traindf = pickle.load(open("train.pkl", "rb"))
    X, y = traindf['Data'], traindf['Tags'].as_matrix()
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.7)
    gridSearch = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=1, scoring='accuracy')
    gridSearch.fit(Xtrain, ytrain)
    print('best score: %0.3f' % gridSearch.best_score_)
    print('best parameters set:')
    res = open("res.txt", 'w')
    res.write('best parameters set:\n')
    bestParameters = gridSearch.best_estimator_.get_params()
    for paramName in sorted(parameters.keys()):
        print('\t %s: %r' % (paramName, bestParameters[paramName]))
        res.write('\t %s: %r\n' % (paramName, bestParameters[paramName]))
    pickle.dump(bestParameters, open("bestParams.pkl", "wb"))
    predictions = gridSearch.predict(Xtest)
    print('Accuracy:', accuracy_score(ytest, predictions))
    print('Confusion Matrix:', confusion_matrix(ytest, predictions))
    print('Classification Report:', classification_report(ytest, predictions))

Note that the Tags column can hold multiple values. Now I get:

An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (40, 0))
Traceback (most recent call last):
  File "X:\abc\predMNB.py", line 128, in <module>
    MNBdrill(fname,topn)
  File "X:\abc\predMNB.py", line 82, in MNBdrill
    gridSearch.fit(Xtrain, ytrain)
  File "X:\pqr\Anaconda2\lib\site-packages\sklearn\grid_search.py", line 732, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "X:\pqr\Anaconda2\lib\site-packages\sklearn\grid_search.py", line 505, in _fit
    for parameters in parameter_iterable
  File "X:\pqr\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 666, in __call__
    self.retrieve()
  File "X:\pqr\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py", line 549, in retrieve
    raise exception_type(report)
sklearn.externals.joblib.my_exceptions.JoblibMemoryError: JoblibMemoryError

Followed by:

Multiprocessing exception:
...........................................................................
X:\pqr\Anaconda2\lib\site-packages\sklearn\grid_search.py in fit(self=GridSearchCV(cv=None, error_score='raise', ..._func=None, scoring='accuracy', verbose=1), X=14151 text for document having t1,t2,t3,t4 Name: Content, dtype: object, y=array([u't1',u't2',u't3',u't4'], dtype=object))
    727         y : array-like, shape = [n_samples] or [n_samples, n_output], optional
    728             Target relative to X for classification or regression;
    729             None for unsupervised learning.
    730
    731         """
--> 732         return self._fit(X, y, ParameterGrid(self.param_grid))
        self._fit = <bound method GridSearchCV._fit of GridSearchCV(...func=None, scoring='accuracy', verbose=1)>
        X = 14151 text for document having t1,t2,t3,t4 Name: Content, dtype: object
        y = array([u't1',u't2',u't3',u't4'], dtype=object)
        self.param_grid = {'clf__fit_prior': (True, False), 'vect__max_df': (0.25, 0.5, 0.6, 0.7, 1.0), 'vect__ngram_range': ((1, 1), (1, 2), (2, 3), (1, 3), (1, 4), (1, 5)), 'vect__use_idf': (True, False)}
    733
    734
    735 class RandomizedSearchCV(BaseSearchCV):
    736     """Randomized search on hyper parameters
...........................................................................
X:\pqr\Anaconda2\lib\site-packages\sklearn\grid_search.py in _fit(self=GridSearchCV(cv=None, error_score='raise', ..._func=None, scoring='accuracy', verbose=1), X=14151 text for document having t1,t2,t3,t4 Name: Content, dtype: object, y=array([u't1',u't2',u't3',u't4'], dtype=object), parameter_iterable=<sklearn.grid_search.ParameterGrid object>)
    500         )(
    501             delayed(_fit_and_score)(clone(base_estimator), X, y, self.scorer_,
    502                                     train, test, self.verbose, parameters,
    503                                     self.fit_params, return_parameters=True,
    504                                     error_score=self.error_score)
--> 505                 for parameters in parameter_iterable
        parameters = undefined
        parameter_iterable = <sklearn.grid_search.ParameterGrid object>
    506                 for train, test in cv)
    507
    508         # Out is a list of triplet: score, estimator, n_test_samples
    509         n_fits = len(out)
...........................................................................
X:\pqr\Anaconda2\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self=Parallel(n_jobs=3), iterable=<itertools.islice object>)
    661             if pre_dispatch == "all" or n_jobs == 1:
    662                 # The iterable was consumed all at once by the above for loop.
    663                 # No need to wait for async callbacks to trigger to
    664                 # consumption.
    665                 self._iterating = False
--> 666             self.retrieve()
        self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=3)>
    667             # Make sure that we get a last message telling us we are done
    668             elapsed_time = time.time() - self._start_time
    669             self._print('Done %3i out of %3i | elapsed: %s finished',
    670                         (len(self._output),
---------------------------------------------------------------------------
Sub-process traceback:
---------------------------------------------------------------------------
MemoryError

The stack trace continues after this, pointing into other functions with the same problem. I can post the whole thing if needed, but here is what I think is happening.

Note this part:

scoring='accuracy', verbose=1), X=14151 text for document having t1,t2,t3,t4 Name: Content, dtype: object, y=array([u't1',u't2',u't3',u't4'], dtype=object))

Since there are multiple tags per document, could that be causing the problem?
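For context, a multi-label target is normally encoded as one binary indicator column per tag (this is what sklearn's MultiLabelBinarizer produces), rather than as a single comma-joined string that a single-label classifier would treat as one combined class. A minimal pure-Python sketch of that encoding (the function name and sample tags are illustrative, not from the original program):

```python
def binarize_tags(tag_lists):
    """Turn per-document tag sets into binary indicator rows, one column per tag."""
    classes = sorted({t for tags in tag_lists for t in tags})
    rows = [[1 if c in tags else 0 for c in classes] for tags in tag_lists]
    return classes, rows

# Each document carries a *set* of tags, not one combined "t1,t2" string.
docs_tags = [["t1", "t2"], ["t2", "t3", "t4"], ["t1"]]
classes, y = binarize_tags(docs_tags)
print(classes)  # ['t1', 't2', 't3', 't4']
print(y)        # [[1, 1, 0, 0], [0, 1, 1, 1], [1, 0, 0, 0]]
```

A plain MultinomialNB cannot consume such an indicator matrix directly; it would need a multi-label wrapper such as OneVsRestClassifier around it.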

Also, what exactly are the

Multiprocessing exception?

MemoryError?

Please help me resolve this.


Answer:

How much training data do you have?

My best guess is that the only "real" error here is the MemoryError, i.e. you are exhausting all available RAM while trying to train the classifier, and all the other strange errors/tracebacks are just consequences of failed memory allocations.
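To see why the search is so memory-hungry, it helps to count what the grid above actually launches. This sketch reproduces the question's parameter dictionary and counts the combinations; the 3-fold figure assumes the default cv of the old sklearn.grid_search module shown in the traceback:

```python
from itertools import product

# Same grid as in the question's Mytrain().
parameters = {
    'vect__max_df': (0.25, 0.5, 0.6, 0.7, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2), (2, 3), (1, 3), (1, 4), (1, 5)),
    'vect__use_idf': (True, False),
    'clf__fit_prior': (True, False),
}

# 5 * 6 * 2 * 2 parameter combinations...
n_combos = len(list(product(*parameters.values())))
# ...each fitted once per CV fold (default cv=3 in this sklearn version).
n_fits = n_combos * 3
print(n_combos, n_fits)  # 120 360
```

On top of the 360 fits, n_jobs=3 holds three worker copies of the data at once, and ngram_range values up to (1, 5) make each TF-IDF matrix enormous, so RAM pressure adds up quickly.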

Have you checked the available memory while the classifier is training?
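One quick way to check free RAM before launching the search is via POSIX sysconf; this is a Linux-only sketch (on Windows, where the question's paths point, a third-party library such as psutil would be needed instead):

```python
import os

def available_ram_bytes():
    """Available physical memory in bytes, via POSIX sysconf (Linux)."""
    return os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")

free = available_ram_bytes()
print("%.1f GiB available" % (free / 1024.0 ** 3))
```

Comparing that number against the size of the TF-IDF matrix (times the number of parallel workers) makes it obvious whether the grid needs to be trimmed or n_jobs reduced to 1.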

