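Problem using the random forest algorithm on the 20 newsgroups dataset

I am trying to run a random forest on the 20 newsgroups dataset, but I don't know how to fix the error below. I previously ran SVM and NB on the same dataset and they worked fine.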

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer

    dataset_train=fetch_20newsgroups(subset='train',shuffle=True)
    dataset_test=fetch_20newsgroups(subset='test',shuffle=True)

    vectorizer=CountVectorizer()
    x_train_counts=vectorizer.fit_transform(dataset_train.data)

    from sklearn.feature_extraction.text import TfidfVectorizer
    vectorizer=TfidfVectorizer(stop_words='english',lowercase=True,ngram_range=(1,5))
    x_train_tfidf=vectorizer.fit_transform(dataset_train.data)

    from sklearn.ensemble import RandomForestClassifier
    model=RandomForestClassifier(n_estimators=10)
    model=model.fit(dataset_train.data,dataset_train.target)

This is the error message:

    Traceback (most recent call last):
      File "C:/Users/new_randomforest.py", line 18, in <module>
        model=model.fit(dataset_train.data,dataset_train.target)
      File "C:\Users\forest.py", line 247, in fit
        X = check_array(X, accept_sparse="csc", dtype=DTYPE)
      File "C:\Users\validation.py", line 433, in check_array
        array = np.array(array, dtype=dtype, order=order, copy=copy)
    ValueError: could not convert string to float: "From: [email protected] (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

Answer:

When training the model, you need to pass the training vectors you created (here, x_train_tfidf) instead of the raw text:

    model.fit(x_train_tfidf,dataset_train.target)

Here dataset_train.data is a list of strings, and that is what causes the error.
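As a minimal sketch of the fix, the snippet below runs the same vectorize-then-fit flow on a tiny made-up corpus, so it does not need to download the dataset. The documents, the labels, the names train_docs, train_target and test_docs, and random_state=0 are assumptions for illustration and are not part of the original question.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for dataset_train.data / dataset_train.target.
train_docs = [
    "the engine and the bumper of this sports car",
    "a two door car with a small engine",
    "the team won the hockey game last night",
    "a great goal in the final hockey match",
]
train_target = [0, 0, 1, 1]
test_docs = ["what car is this", "who won the game"]

vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)

# Learn the vocabulary on the training text and turn it into a numeric matrix.
x_train_tfidf = vectorizer.fit_transform(train_docs)

# Reuse the fitted vocabulary for unseen text: transform, not fit_transform.
x_test_tfidf = vectorizer.transform(test_docs)

model = RandomForestClassifier(n_estimators=10, random_state=0)

# Fit on the vectorized features, not on the raw strings.
model.fit(x_train_tfidf, train_target)

predictions = model.predict(x_test_tfidf)
print(predictions)
```

With the real dataset, the same pattern would apply to dataset_test.data: call vectorizer.transform on it before passing the result to model.predict, so that the test matrix has the same columns as the training matrix.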

P.S.: The error message basically means that you are trying to fit the model on strings, which is not allowed. Quoting from the documentation:

fit(X, y, sample_weight=None)

X : array-like or sparse matrix of shape = [n_samples, n_features]

The training input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csc_matrix.
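To see how this requirement relates to the vectorizer output, here is a small sketch that inspects the matrix a TfidfVectorizer produces. The two sample sentences are made up for illustration, and the explicit tocsc().astype(np.float32) call only mimics the conversion the quoted documentation describes; it is not something you need to write yourself.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample documents, used only to inspect the output type.
docs = ["the car has small doors", "the bumper is separate from the body"]

X = TfidfVectorizer().fit_transform(docs)

# The vectorizer returns a numeric sparse matrix of shape (n_samples, n_features),
# which is the kind of input fit(X, y) accepts.
print(sparse.issparse(X))
print(X.shape)
print(X.dtype)

# Rough illustration of the documented internal conversion (assumption: done by hand here).
X_csc = X.tocsc().astype(np.float32)
print(X_csc.format, X_csc.dtype)
```

A plain list of strings has neither a numeric dtype nor a second dimension of features, which is why the conversion to float fails.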
