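Feature extraction from multiple text columns for a classification problem

How do I correctly extract features from multiple text columns and then apply a classification algorithm? Please guide me, and tell me if I am doing something wrong.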

Sample dataset

Independent variables: Description1, Description2, State, NumericCol1, NumericCol2

Dependent variable: TargetCategory

Code:

########### Feature Extraction for Text Data ###########
from sklearn.feature_extraction.text import TfidfVectorizer

##### Description1 (it can be any word-embedding technique such as CountVectorizer, TF-IDF, word2vec, BERT, etc.)
tfidf = TfidfVectorizer(max_features=500,
                        ngram_range=(1, 3),
                        stop_words="english")
X_Description1 = tfidf.fit_transform(df["Description1"].tolist())

##### Description2 (it can be any word-embedding technique such as CountVectorizer, TF-IDF, word2vec, BERT, etc.)
tfidf = TfidfVectorizer(max_features=500,
                        ngram_range=(1, 3),
                        stop_words="english")
X_Description2 = tfidf.fit_transform(df["Description2"].tolist())

##### State (has 100 unique entries, which is why BinaryEncoder is used)
import category_encoders as ce
binary_encoder = ce.BinaryEncoder(cols=['state'], return_df=True)
X_state = binary_encoder.fit_transform(df["state"])

# Stack the text, categorical and numeric features into one sparse matrix
import scipy.sparse
X = scipy.sparse.hstack((X_Description1,
                         X_Description2,
                         X_state,
                         df[["NumericCol1", "NumericCol2"]].to_numpy())).tocsr()
y = df['TargetCategory']

##### Train/Test Split #####
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=111)

##### Create Model #####
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, classification_report, cohen_kappa_score
from sklearn import metrics

# Baseline random-forest model
# max_features='sqrt' is what 'auto' meant for classifiers; recent scikit-learn no longer accepts 'auto'
rfc = RandomForestClassifier(criterion='gini', n_estimators=1000, verbose=1, n_jobs=-1,
                             class_weight='balanced', max_features='sqrt')
rfcg = rfc.fit(X_train, y_train)  # fit on training data

##### Prediction #####
predictions = rfcg.predict(X_test)
print('Baseline: Accuracy: ', round(accuracy_score(y_test, predictions) * 100, 2))
print('\n Classification Report:\n', classification_report(y_test, predictions))

Answer:

The way to use multiple columns as input in scikit-learn is ColumnTransformer.

Here is an example of how to use it on heterogeneous data.
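As a rough illustration of the ColumnTransformer approach, the sketch below wires the columns from the question into a single pipeline. Several things here are assumptions and not part of the original post: the tiny DataFrame is made up, the column names follow the question's variable list (with `State` capitalized), `OneHotEncoder` stands in for the `BinaryEncoder` used in the question so that only scikit-learn is needed, and the forest is shrunk to 100 trees to keep the example quick.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up toy data that only mirrors the column names in the question.
df = pd.DataFrame({
    "Description1": ["red apple pie", "green apple tart",
                     "fast sports car", "slow family car"] * 5,
    "Description2": ["sweet dessert", "sour dessert",
                     "two door coupe", "five door wagon"] * 5,
    "State": ["CA", "NY", "TX", "CA"] * 5,
    "NumericCol1": [1.0, 2.0, 30.0, 40.0] * 5,
    "NumericCol2": [0.1, 0.2, 0.9, 0.8] * 5,
    "TargetCategory": ["food", "food", "vehicle", "vehicle"] * 5,
})

X = df.drop(columns="TargetCategory")
y = df["TargetCategory"]

preprocess = ColumnTransformer(
    transformers=[
        # Text vectorizers expect a 1-D column, so the column is given as a
        # plain string (not a list), and each text column gets its own vectorizer.
        ("desc1", TfidfVectorizer(max_features=500, ngram_range=(1, 3),
                                  stop_words="english"), "Description1"),
        ("desc2", TfidfVectorizer(max_features=500, ngram_range=(1, 3),
                                  stop_words="english"), "Description2"),
        # Assumption: one-hot encoding replaces the question's BinaryEncoder.
        ("state", OneHotEncoder(handle_unknown="ignore"), ["State"]),
        # Numeric columns are passed through unchanged.
        ("num", "passthrough", ["NumericCol1", "NumericCol2"]),
    ]
)

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                   random_state=111)),
])

# Split the raw DataFrame first; the transformers are then fitted on the
# training rows only when the pipeline is fitted.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=111, stratify=y)

model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions)
```

One design point worth noting: because the vectorizers and the encoder live inside the pipeline, they are fitted on the training split only, whereas calling `fit_transform` on the full DataFrame before splitting lets vocabulary and category information from the test rows leak into the features.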
