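I want to vectorize a txt file containing my training corpus so I can use it with a OneClassSVM classifier. To do that, I am using CountVectorizer from the scikit-learn library. Here is my code: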
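# Imports were not shown in the original post; stopwords is assumed to be NLTK's corpus reader.
import numpy as np
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

def file_to_corpse(file_name, stop_words):
    array_file = []
    # Read the corpus, one document per line
    with open(file_name) as fd:
        corp = fd.readlines()
    array_file = np.array(corp)
    # Start from the French stop words and add the caller's extra ones
    stwf = stopwords.words('french')
    for w in stop_words:
        stwf.append(w)
    vectorizer = CountVectorizer(decode_error='replace', stop_words=stwf, min_df=1)
    # fit_transform returns a scipy sparse document-term matrix
    X = vectorizer.fit_transform(array_file)
    return X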
When I run this function on my file (about 206346 lines), I get the following error, which I cannot make sense of:
Traceback (most recent call last):
  File "svm.py", line 93, in <module>
    clf_svm.fit(training_data)
  File "/home/imane/anaconda/lib/python2.7/site-packages/sklearn/svm/classes.py", line 1028, in fit
    super(OneClassSVM, self).fit(X, np.ones(_num_samples(X)), sample_weight=sample_weight,
  File "/home/imane/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 122, in _num_samples
    " a valid collection." % x)
TypeError: Singleton array array(<536172x13800 sparse matrix of type '<type 'numpy.int64'>' with 1952637 stored elements in Compressed Sparse Row format>, dtype=object) cannot be considered a valid collection.
Can anyone help me solve this? I have been stuck on it for a while :).
Answer:
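If you look at the source code of _num_samples (in sklearn/utils/validation.py, the function named in your traceback), you will see that it checks whether the following condition is true, where x is your array: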
if len(x.shape) == 0:
If it is, it raises this exception:
TypeError("Singleton array %r cannot be considered a valid collection." % x)
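To make the condition concrete, here is a minimal sketch of what a "singleton" (0-dimensional) array looks like in NumPy. The helper check_not_singleton is a hypothetical, simplified stand-in for the check quoted above, not the actual scikit-learn function, and Opaque is just a placeholder class for any object NumPy cannot read as a sequence.

```python
import numpy as np


def check_not_singleton(x):
    """Simplified stand-in for the quoted check (hypothetical helper)."""
    if len(x.shape) == 0:
        raise TypeError(
            "Singleton array %r cannot be considered a valid collection." % x
        )
    return x.shape[0]


class Opaque(object):
    """Placeholder for any object NumPy cannot interpret as a sequence."""


# Wrapping such an object gives a 0-d array holding it as a single element.
zero_d = np.array(Opaque())
one_d = np.array([1, 2, 3])

print(zero_d.shape, zero_d.dtype)   # the shape is the empty tuple
print(check_not_singleton(one_d))   # a normal 1-d array passes the check

try:
    check_not_singleton(zero_d)
    raised = False
except TypeError:
    raised = True
print(raised)
```

An array whose shape is the empty tuple has no first axis to count samples along, which is why the check refuses it.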
My suggestion is to check whether array_file, or the value returned by this function, has a shape whose length is greater than 0.
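A rough way to run that check is sketched below. The dtype=object in the traceback suggests that the sparse matrix got wrapped in np.array(...) somewhere before fit() was called; that is only a guess, since the calling code in svm.py is not shown. The helper unwrap_singleton is hypothetical, and the tiny matrix stands in for the real output of file_to_corpse.

```python
import numpy as np
from scipy import sparse


def unwrap_singleton(x):
    """Return the object held by a 0-d object array, else x unchanged.

    Hypothetical helper for diagnosis, not part of scikit-learn.
    """
    if isinstance(x, np.ndarray) and x.ndim == 0 and x.dtype == object:
        return x.item()
    return x


# Small stand-in for the matrix returned by CountVectorizer.fit_transform
X = sparse.csr_matrix(np.array([[1, 0, 2], [0, 3, 0]]))

# Assumed cause: the sparse matrix is wrapped in a NumPy array by mistake
wrapped = np.array(X)
print(type(wrapped), wrapped.shape)

# The sparse matrix itself has a proper 2-d shape
restored = unwrap_singleton(wrapped)
print(type(restored), restored.shape)
```

If the printout for your training_data shows an ndarray with an empty shape, passing the sparse matrix returned by file_to_corpse straight to fit(), without converting it with np.array, should avoid the error.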