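I have a dataset of about 400 rows containing several categorical columns and one column of free-text descriptions, which together serve as the input to my classification model. I plan to use an SVM as the classifier. Since the model cannot accept non-numeric input, I have been converting the input features to numeric data.
I applied a TF-IDF transformation to the description column, which turned the terms into a matrix.
Do I need to convert the categorical features with label encoding and then merge them with the TF-IDF matrix before feeding them to the machine learning model?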
Answer:
Use ColumnTransformer
Apply a different transformation pipeline to the columns of each data type. Here is an example:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# pipeline for text data
# (a single column name as a string, so the vectorizer receives a 1-D column)
text_features = 'text_column'
text_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data
categorical_features = ['cat_col1', 'cat_col2']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# you can add other transformations for other data types

# combine preprocessing with ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_transformer, text_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# add model to be part of pipeline
clf_pipe = Pipeline(steps=[('preprocessor', preprocessor),
                           ("model", SVC())])

# ...
#
# you can just use preprocessor by itself
# X_train = preprocessor.fit_transform(X_train)
# X_test = preprocessor.transform(X_test)
# clf_s = SVC().fit(X_train, y_train)
# clf_s.score(X_test, y_test)
#
# or better, you can use the whole pipeline.
# clf_pipe.fit(X_train, y_train)
# clf_pipe.score(X_test, y_test)
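To make the combination concrete, here is a minimal sketch on a made-up four-row DataFrame. The column names, the toy values, and the choice of a linear kernel are assumptions for illustration only and are not taken from the question's data. It shows that ColumnTransformer places the TF-IDF columns and the one-hot columns side by side in one feature matrix, so no manual merge is needed. One-hot encoding is used for the input features instead of label encoding because label encoding maps categories to arbitrary integers, which an SVM would treat as ordered magnitudes.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Toy data: column names and values are hypothetical.
df = pd.DataFrame({
    "text_column": [
        "red apple fresh fruit",
        "green apple sour fruit",
        "fast car engine",
        "slow car old engine",
    ],
    "cat_col1": ["food", "food", "vehicle", "vehicle"],
    "cat_col2": ["a", "b", "a", "b"],
})
y = [0, 0, 1, 1]

preprocessor = ColumnTransformer(transformers=[
    # A string selector passes a 1-D column, which TfidfVectorizer expects.
    ("text", TfidfVectorizer(), "text_column"),
    # A list selector passes a 2-D frame, which OneHotEncoder expects.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cat_col1", "cat_col2"]),
])

# Fit the preprocessor alone to inspect the combined feature matrix.
X = preprocessor.fit_transform(df)
n_text = len(preprocessor.named_transformers_["text"].vocabulary_)
n_cat = sum(len(c) for c in preprocessor.named_transformers_["cat"].categories_)
print("combined shape:", X.shape)
print("TF-IDF columns:", n_text, "one-hot columns:", n_cat)

# Or fit the whole pipeline, which runs preprocessing and the SVM together.
clf_pipe = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", SVC(kernel="linear")),
])
clf_pipe.fit(df, y)
preds = clf_pipe.predict(df)
print("predictions:", list(preds))
```

The number of columns in the combined matrix equals the TF-IDF vocabulary size plus the total number of distinct categories, which is a quick sanity check to run on the real data as well.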
See the scikit-learn examples for more details.