I am learning about random forests in scikit-learn and want to use a random forest classifier to classify text from my own dataset. I first vectorized the text with tf-idf and then ran the classification:
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=10)
classifier.fit(X_train, y_train)
prediction = classifier.predict(X_test)
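(For reference, the vectorization step is not shown above; it looks roughly like this. The documents and labels below are placeholders, and the old sklearn.cross_validation import matches the Python 2.7 tracebacks further down.)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cross_validation import train_test_split  # old module path, matching this sklearn/Python 2.7 setup

texts = ['cat on the mat', 'blue red angel', 'hot tin roof', 'have a cat']  # placeholder documents
labels = [0, 1, 1, 0]                                                       # placeholder labels

tfidf_vect = TfidfVectorizer()
X = tfidf_vect.fit_transform(texts)  # scipy sparse matrix of tf-idf features
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=42)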
When I run the classification I get the following error:
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
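As the message suggests, this version of RandomForestClassifier needs a dense array, so the sparse tf-idf matrix would be densified for fit, for example (a sketch, assuming X_train and y_train from above):

classifier.fit(X_train.toarray(), y_train)  # .toarray() turns the sparse matrix into a dense numpy array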
I then applied the .toarray() method to X_train and got the following error:
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]
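This second error is raised inside predict (see the traceback further down), which means predict is still being handed a sparse matrix: len() is ambiguous for scipy sparse matrices. Densifying the test matrix as well avoids it, for example (a sketch, assuming X_test from the vectorization step):

prediction = classifier.predict(X_test.toarray())  # predict also needs a dense array in this sklearn version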
From a previous question I understood that I needed to reduce the dimensionality of the numpy array, so I did the same thing:
from sklearn.decomposition.truncated_svd import TruncatedSVD

pca = TruncatedSVD(n_components=300)
X_reduced_train = pca.fit_transform(X_train)

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=10)
classifier.fit(X_reduced_train, y_train)
prediction = classifier.predict(X_testing)
Then I got this exception:
File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict n_samples = len(X) File "/usr/local/lib/python2.7/site-packages/scipy/sparse/base.py", line 192, in __len__ raise TypeError("sparse matrix length is ambiguous; use getnnz()"TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]
Then I tried the following:
prediction = classifier.predict(X_train.getnnz())
and got this error:
File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict n_samples = len(X)TypeError: object of type 'int' has no len()
This raises two questions: how do I use a random forest correctly for classification, and what is happening with X_train?
Then I tried the following:
df = pd.read_csv('/path/file.csv', header=0, sep=',', names=['id', 'text', 'label'])
X = tfidf_vect.fit_transform(df['text'].values)
y = df['label'].values

from sklearn.decomposition.truncated_svd import TruncatedSVD

pca = TruncatedSVD(n_components=2)
X = pca.fit_transform(X)

a_train, a_test, b_train, b_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=10)
classifier.fit(a_train, b_train)
prediction = classifier.predict(a_test)

from sklearn.metrics.metrics import precision_score, recall_score, confusion_matrix, classification_report

print '\nscore:', classifier.score(a_train, b_test)
print '\nprecision:', precision_score(b_test, prediction)
print '\nrecall:', recall_score(b_test, prediction)
print '\n confussion matrix:\n', confusion_matrix(b_test, prediction)
print '\n clasification report:\n', classification_report(b_test, prediction)
Answer:
It is not clear whether you are passing the same data structure (type and shape) to the classifier's fit method and to its predict method. Random forests take a long time to run with a large number of features, which is why the post you linked suggests reducing the dimensionality.
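A quick way to check is to print the type and shape of both inputs, for example (a sketch using the variable names from your TruncatedSVD snippet):

print type(X_reduced_train), X_reduced_train.shape  # should be a dense numpy array, shape (n_train_samples, n_components)
print type(X_testing), X_testing.shape              # should match in type and in the number of columns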
You should apply the SVD to both the training and the test data, so that the classifier is trained on input with the same shape as the data you want to predict on. Check that the inputs you pass to the fit and predict methods have the same number of features, and that both are arrays rather than sparse matrices.
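One way to do that is to fit the SVD on the training matrix and reuse the fitted transformer on the test matrix, for example (a sketch keeping the variable names from your question):

from sklearn.decomposition.truncated_svd import TruncatedSVD

svd = TruncatedSVD(n_components=300)
X_reduced_train = svd.fit_transform(X_train)  # dense numpy array
X_reduced_test = svd.transform(X_test)        # same number of columns as the training array

classifier.fit(X_reduced_train, y_train)
prediction = classifier.predict(X_reduced_test)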
Updated example, now using a dataframe:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cross_validation import train_test_split
import pandas as pd

tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=False)

df = pd.DataFrame({'text': ['cat on the', 'angel eyes has', 'blue red angel', 'one two blue',
                            'blue whales eat', 'hot tin roof', 'angel eyes has', 'have a cat'],
                   'class': [0, 0, 0, 1, 1, 1, 0, 3]})

X = tfidf_vect.fit_transform(df['text'].values)
y = df['class'].values

from sklearn.decomposition.truncated_svd import TruncatedSVD

pca = TruncatedSVD(n_components=2)
X_reduced = pca.fit_transform(X)  # reduce the whole tf-idf matrix before splitting; the result is a dense array

a_train, a_test, b_train, b_test = train_test_split(X_reduced, y, test_size=0.33, random_state=42)

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=10)
classifier.fit(a_train, b_train)
prediction = classifier.predict(a_test)
Note that the SVD is applied before the dataset is split into training and test sets, so the array passed to the predictor has the same number of features (n) as the array the fit method was called on.
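To evaluate the result you can then reuse the metrics from your last snippet, for example (note that score should be given the test features rather than the training features):

from sklearn.metrics import confusion_matrix, classification_report

print '\nscore:', classifier.score(a_test, b_test)
print '\nconfusion matrix:\n', confusion_matrix(b_test, prediction)
print '\nclassification report:\n', classification_report(b_test, prediction)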