I am learning about random forests in scikit-learn and want to use a random forest classifier to classify text from my own dataset. I first vectorized the text with tf-idf and then ran the classification:
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=10)
classifier.fit(X_train, y_train)
prediction = classifier.predict(X_test)
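(For reference, the vectorization step is not shown above; it looks roughly like this. The documents and labels below are placeholders, and the old sklearn.cross_validation import matches the Python 2.7 tracebacks further down.)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cross_validation import train_test_split  # old module path, matching this sklearn/Python 2.7 setup

texts = ['cat on the mat', 'blue red angel', 'hot tin roof', 'have a cat']  # placeholder documents
labels = [0, 1, 1, 0]                                                       # placeholder labels

tfidf_vect = TfidfVectorizer()
X = tfidf_vect.fit_transform(texts)  # scipy sparse matrix of tf-idf features
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=42)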
When I run the classification I get the following error:
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
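As the message suggests, this version of RandomForestClassifier needs a dense array, so the sparse tf-idf matrix would be densified for fit, for example (a sketch, assuming X_train and y_train from above):

classifier.fit(X_train.toarray(), y_train)  # .toarray() turns the sparse matrix into a dense numpy array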
I then applied the .toarray() method to X_train and got the following error:
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]
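This second error is raised inside predict (see the traceback further down), which means predict is still being handed a sparse matrix: len() is ambiguous for scipy sparse matrices. Densifying the test matrix as well avoids it, for example (a sketch, assuming X_test from the vectorization step):

prediction = classifier.predict(X_test.toarray())  # predict also needs a dense array in this sklearn version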
From a previous question I understood that I needed to reduce the dimensionality of the numpy array, so I did the same thing:
from sklearn.decomposition.truncated_svd import TruncatedSVD

pca = TruncatedSVD(n_components=300)
X_reduced_train = pca.fit_transform(X_train)

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=10)
classifier.fit(X_reduced_train, y_train)
prediction = classifier.predict(X_testing)
Then I got this exception:
File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict n_samples = len(X) File "/usr/local/lib/python2.7/site-packages/scipy/sparse/base.py", line 192, in __len__ raise TypeError("sparse matrix length is ambiguous; use getnnz()"TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]
Then I tried the following:
prediction = classifier.predict(X_train.getnnz())
and got this error:
File "/usr/local/lib/python2.7/site-packages/sklearn/ensemble/forest.py", line 419, in predict n_samples = len(X)TypeError: object of type 'int' has no len()
This raises two questions: how do I use a random forest correctly for classification, and what is happening with X_train?
Then I tried the following:
df = pd.read_csv('/path/file.csv', header=0, sep=',', names=['id', 'text', 'label'])
X = tfidf_vect.fit_transform(df['text'].values)
y = df['label'].values

from sklearn.decomposition.truncated_svd import TruncatedSVD

pca = TruncatedSVD(n_components=2)
X = pca.fit_transform(X)

a_train, a_test, b_train, b_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=10)
classifier.fit(a_train, b_train)
prediction = classifier.predict(a_test)

from sklearn.metrics.metrics import precision_score, recall_score, confusion_matrix, classification_report

print '\nscore:', classifier.score(a_train, b_test)
print '\nprecision:', precision_score(b_test, prediction)
print '\nrecall:', recall_score(b_test, prediction)
print '\n confussion matrix:\n', confusion_matrix(b_test, prediction)
print '\n clasification report:\n', classification_report(b_test, prediction)
Answer:
It is not clear whether you are passing the same data structure (type and shape) to the classifier's fit method and to its predict method. Random forests take a long time to run with a large number of features, which is why the post you linked suggests reducing the dimensionality.
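A quick way to check is to print the type and shape of both inputs, for example (a sketch using the variable names from your TruncatedSVD snippet):

print type(X_reduced_train), X_reduced_train.shape  # should be a dense numpy array, shape (n_train_samples, n_components)
print type(X_testing), X_testing.shape              # should match in type and in the number of columns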
You should apply the SVD to both the training and the test data, so that the classifier is trained on input with the same shape as the data you want to predict on. Check that the inputs you pass to the fit and predict methods have the same number of features, and that both are arrays rather than sparse matrices.
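One way to do that is to fit the SVD on the training matrix and reuse the fitted transformer on the test matrix, for example (a sketch keeping the variable names from your question):

from sklearn.decomposition.truncated_svd import TruncatedSVD

svd = TruncatedSVD(n_components=300)
X_reduced_train = svd.fit_transform(X_train)  # dense numpy array
X_reduced_test = svd.transform(X_test)        # same number of columns as the training array

classifier.fit(X_reduced_train, y_train)
prediction = classifier.predict(X_reduced_test)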
Updated example, now using a dataframe:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cross_validation import train_test_split
import pandas as pd

tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True, sublinear_tf=False)

df = pd.DataFrame({'text': ['cat on the', 'angel eyes has', 'blue red angel', 'one two blue',
                            'blue whales eat', 'hot tin roof', 'angel eyes has', 'have a cat'],
                   'class': [0, 0, 0, 1, 1, 1, 0, 3]})

X = tfidf_vect.fit_transform(df['text'].values)
y = df['class'].values

from sklearn.decomposition.truncated_svd import TruncatedSVD

pca = TruncatedSVD(n_components=2)
X_reduced = pca.fit_transform(X)  # reduce the whole tf-idf matrix before splitting; the result is a dense array

a_train, a_test, b_train, b_test = train_test_split(X_reduced, y, test_size=0.33, random_state=42)

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=10)
classifier.fit(a_train, b_train)
prediction = classifier.predict(a_test)
Note that the SVD is applied before the dataset is split into training and test sets, so the array passed to the predictor has the same number of features (n) as the array the fit method was called on.
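To evaluate the result you can then reuse the metrics from your last snippet, for example (note that score should be given the test features rather than the training features):

from sklearn.metrics import confusion_matrix, classification_report

print '\nscore:', classifier.score(a_test, b_test)
print '\nconfusion matrix:\n', confusion_matrix(b_test, prediction)
print '\nclassification report:\n', classification_report(b_test, prediction)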