I have a dataset with different types of variables: binary, categorical, numerical, and text.

       Text                                                Age  Type          Link             Start    Passed  Default
    0  care packag saint luke cathol church wa ...        21.0  organisation  saintlukemclean  <2001.0       0        0
    1  opportun busi group center food support compan...  23.0  organisation  cfanj            <2003.0       0        0
    2  holiday ice rink persh squar depart cultur sit...  98.0  home          culturela        >1975.0       0        0

I used different transformers: one for the categorical variables (OneHotEncoder), one for the numerical variables (SimpleImputer), and one for the text variable (CountVectorizer/TF-IDF):
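
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    categorical_preprocessing = OneHotEncoder(handle_unknown='ignore')
    # categorical_encoder = ('CV', CountVectorizer())

    numeric_preprocessing = Pipeline([
        ('imputer', SimpleImputer(strategy='mean'))
    ])

    # CountVectorizer
    text_preprocessing_cv = Pipeline(steps=[
        ('CV', CountVectorizer())
    ])

    # TF-IDF
    text_preprocessing_tfidf = Pipeline(steps=[
        ('TF-IDF', TfidfVectorizer())
    ])
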
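To transform my features and pass them to a pipeline (with logistic regression, multinomial naive Bayes, random forest, and SVM as classifiers), I combined the transformers as follows: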

    from sklearn.compose import ColumnTransformer

    preprocessing = ColumnTransformer(
        transformers=[
            ('text', text_preprocessing_cv, text_columns),
            ('category', categorical_preprocessing, categorical_columns),
            ('numeric', numeric_preprocessing, numerical_columns)
        ])

However, I get an error at this step:

    from sklearn.linear_model import LogisticRegression

    clf = Pipeline(steps=[('preprocessor', preprocessing),
                          ('classifier', LogisticRegression())])

    clf.fit(X_train, y_train)  # <-- error


    ValueError: Selected columns, ['Age', 'Default'], are not unique in dataframe

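This error may be caused by my oversampling or by the way I preprocess the features. As I understand it, resampling should be applied only to the training set to avoid overfitting, but I am not sure whether I need to account for the different variable types and transformers before or after resampling.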
I would be grateful for any help fixing this error so that the pipeline can run with this preprocessing. Thanks.
The full code, for reference:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.utils import resample

    text_columns = ['Text']
    categorical_columns = ['Type', 'Link', 'Start']
    numerical_columns = ['Age', 'Default']  # can I treat booleans as numeric?

    X = df[categorical_columns + numerical_columns + text_columns]
    y = df['Passed']

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=42)

    # Back to a single dataframe (needed for the resampling technique)
    training_set = pd.concat([X_train, y_train], axis=1)

    passed = training_set[training_set['Passed'] == 1]
    not_passed = training_set[training_set['Passed'] == 0]

    # Oversample the minority class
    oversample = resample(passed,
                          replace=True,
                          n_samples=len(not_passed))

    # Back to a new training set
    oversample_train = pd.concat([not_passed, oversample])

    train_df = oversample_train.copy()  # training set after resampling
    test_df = pd.concat([X_test, y_test], axis=1)

    X_train = train_df.loc[:, train_df.columns != 'Passed']
    y_train = train_df[['Passed']]

    categorical_encoder = OneHotEncoder(handle_unknown='ignore')

    numerical_pipe = Pipeline([
        ('imputer', SimpleImputer(strategy='mean'))
    ])

    # CountVectorizer
    text_transformer_cv = Pipeline(steps=[
        ('cntvec', CountVectorizer())
    ])

    # TF-IDF
    text_preprocessing_tfidf = Pipeline(steps=[
        ('TF-IDF', TfidfVectorizer())
    ])

    preprocessing = ColumnTransformer(
        transformers=[
            ('category', categorical_encoder, categorical_columns),
            # I think this is what causes the error, but I don't know why
            # the categorical columns don't have the same problem
            ('numeric', numerical_pipe, numerical_columns),
            ('text', text_transformer_cv, text_columns)
        ])

    clf = Pipeline(steps=[('preprocessor', preprocessing),
                          ('classifier', LogisticRegression())])

    clf.fit(X_train, y_train)

Answer:
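The problem is in how the single text column is passed. CountVectorizer expects a one-dimensional sequence of documents, but when the column selector is a list, ColumnTransformer hands the transformer a two-dimensional DataFrame; a scalar string selector makes it pass a one-dimensional Series instead. I hope a future version of scikit-learn will accept ['Text',], but until then, pass the column name directly: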

    ...
    text_columns = 'Text'  # instead of ['Text']

    preprocessing = ColumnTransformer(
        transformers=[
            ('text', text_preprocessing_cv, text_columns),
            ('category', categorical_preprocessing, categorical_columns),
            ('numeric', numeric_preprocessing, numerical_columns)
        ],
        remainder='passthrough')

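To see the string selector working end to end, here is a minimal sketch on a tiny invented DataFrame. The rows are made up for illustration and only reuse a few of the question's column names; the step names and the reduced column set are assumptions, not the asker's actual data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data invented for this example (not the asker's dataset).
df = pd.DataFrame({
    "Text": ["care packag church", "busi group food",
             "holiday ice rink", "food support church"],
    "Type": ["organisation", "organisation", "home", "home"],
    "Age": [21.0, 23.0, None, 40.0],
    "Passed": [0, 1, 0, 1],
})
X = df.drop(columns="Passed")
y = df["Passed"]

preprocessing = ColumnTransformer(transformers=[
    # Scalar string selector: the vectorizer receives a 1-D Series of documents.
    ("text", CountVectorizer(), "Text"),
    # List selectors: these transformers receive 2-D DataFrames.
    ("category", OneHotEncoder(handle_unknown="ignore"), ["Type"]),
    ("numeric", SimpleImputer(strategy="mean"), ["Age"]),
])

# One output row per input row; columns are vocabulary + one-hot + numeric.
features = preprocessing.fit_transform(X)
print(features.shape)

clf = Pipeline(steps=[
    ("preprocessor", preprocessing),
    ("classifier", LogisticRegression()),
])
clf.fit(X, y)
predictions = clf.predict(X)
print(predictions)
```

Checking the shape of the `fit_transform` output is a quick sanity test: the number of rows should equal the number of samples. If it does not, one of the transformers is probably receiving its input in the wrong dimensionality.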