Using sklearn DecisionTreeClassifier with CountVectorizer and an additional predictor

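I have built a text classification model with sklearn's DecisionTreeClassifier and would like to add another predictor. My data is in a pandas DataFrame with columns labeled 'Impression' (text), 'Volume' (float), and 'Cancer' (label). So far I have used only Impression to predict Cancer, but now I want to use both Impression and Volume.

This is my previous code, which runs without problems: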

X_train, X_test, y_train, y_test = train_test_split(data['Impression'], data['Cancer'], test_size=0.2)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
dt = DecisionTreeClassifier(class_weight='balanced', max_depth=6, min_samples_leaf=3, max_leaf_nodes=20)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

I have tried a few different ways to add the Volume predictor:

1) Call fit_transform on the Impressions only

X_train, X_test, y_train, y_test = train_test_split(data[['Impression', 'Volume']], data['Cancer'], test_size=0.2)
vectorizer = CountVectorizer()
X_train['Impression'] = vectorizer.fit_transform(X_train['Impression'])
X_test = vectorizer.transform(X_test)
dt = DecisionTreeClassifier(class_weight='balanced', max_depth=6, min_samples_leaf=3, max_leaf_nodes=20)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

This throws the following errors:

TypeError: float() argument must be a string or a number, not 'csr_matrix'
...
ValueError: setting an array element with a sequence.

2) Call fit_transform on both Impressions and Volumes. The code is the same as above except for the fit_transform line:

X_train = vectorizer.fit_transform(X_train)

This, of course, throws the following error:

ValueError: Number of labels=1800 does not match number of samples=2
...
X_train.shape
(2, 2)
y_train.shape
(1800,)

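I am fairly sure that approach #1 is the right direction, but I have not found any tutorial or solution that shows how to add a float predictor to this text classification model.

Any help would be greatly appreciated!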

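Answer:

ColumnTransformer() handles this case well. Instead of manually concatenating the output of CountVectorizer with the other columns, set the remainder parameter of the ColumnTransformer to 'passthrough', so that the columns that are not vectorized (here, Volume) are passed through unchanged alongside the text features.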

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from sklearn import set_config

# print_changed_only expects a boolean; display='diagram' needs scikit-learn >= 0.23
set_config(print_changed_only=True, display='diagram')

data = pd.DataFrame({'Impression': ['this is the first text',
                                    'second one goes like this',
                                    'third one is very short',
                                    'This is the final statement'],
                     'Volume': [123, 1, 2, 123],
                     'Cancer': [1, 0, 0, 1]})

X_train, X_test, y_train, y_test = train_test_split(
    data[['Impression', 'Volume']], data['Cancer'], test_size=0.5)

# Vectorize only the 'Impression' column; 'Volume' is passed through as-is
ct = make_column_transformer(
    (CountVectorizer(), 'Impression'), remainder='passthrough')

pipeline = make_pipeline(ct, DecisionTreeClassifier())
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)

With scikit-learn version 0.23.0, the display parameter of set_config also lets you see a visualization of the pipeline object.
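For comparison, the manual concatenation mentioned above can be sketched roughly as follows. This is only an illustration under a few assumptions: it uses scipy.sparse.hstack to append the Volume column to the sparse CountVectorizer output, reuses made-up toy data, and fits on all four rows without a train/test split to keep it short. The variable names (text_features, volume, X) are chosen for this sketch and do not come from the question.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data with the same column names as in the question
data = pd.DataFrame({'Impression': ['this is the first text',
                                    'second one goes like this',
                                    'third one is very short',
                                    'This is the final statement'],
                     'Volume': [123.0, 1.0, 2.0, 123.0],
                     'Cancer': [1, 0, 0, 1]})

# Vectorize the text column only; the result is a sparse matrix
vectorizer = CountVectorizer()
text_features = vectorizer.fit_transform(data['Impression'])

# Turn the numeric column into a sparse (n_samples, 1) matrix
volume = csr_matrix(data[['Volume']].to_numpy(dtype=float))

# Stack the two blocks side by side: word counts first, Volume last
X = hstack([text_features, volume], format='csr')

dt = DecisionTreeClassifier(random_state=0)
dt.fit(X, data['Cancer'])
```

For unseen data, the same two steps would have to be repeated by hand with vectorizer.transform (not fit_transform) before calling dt.predict. That bookkeeping is what the ColumnTransformer pipeline takes care of.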
