Fitting Clustering Output into a Machine Learning Model

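This is purely a machine learning / data science question.

a) Suppose I have a dataset with 20 features, and I decide to use 3 of them for unsupervised clustering. Ideally this yields 3 clusters (A, B, and C).

b) I then add this output (cluster A, B, or C) back into my dataset as a new feature (so there are now 21 features in total).

c) I run a regression model on these 21 features to predict a label value.

I would like to know whether step b) is redundant (since the features it is derived from are already present in the original dataset), whether that is still the case if I use a more powerful model (random forest, XGBoost), and how this can be explained mathematically.

Any comments and suggestions would be greatly appreciated!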

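Answer:

Good idea: just try it and see how it goes. As you guessed, this depends heavily on your dataset and your choice of model. It is hard to predict how adding this kind of feature will behave, as with any other feature engineering. Be aware, though, that in some cases it may not improve performance at all. See the test below using the Iris dataset, where performance actually drops: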

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import load_iris
    from sklearn.svm import SVC
    from sklearn import metrics

    # load data
    iris = load_iris()
    X = iris.data[:, :3]  # only keep three out of the four available features to make it more challenging
    y = iris.target

    # split train / test
    indices = np.random.permutation(len(X))
    N_test = 30
    X_train, y_train = X[indices[:-N_test]], y[indices[:-N_test]]
    # Note: indices[N_test:] selects 120 samples, 90 of which are also in the training set
    # (this matches the total support of 120 in the reports below). Using indices[-N_test:]
    # instead would give a disjoint 30-sample test set.
    X_test, y_test = X[indices[N_test:]], y[indices[N_test:]]

    # compute a clustering method (here KMeans) based on available features in X_train
    kmeans = KMeans(n_clusters=3, random_state=0).fit(X_train)
    new_clustering_feature_train = kmeans.predict(X_train)
    new_clustering_feature_test = kmeans.predict(X_test)

    # create a new input train/test X with this feature added
    X_train_with_clustering_feature = np.column_stack([X_train, new_clustering_feature_train])
    X_test_with_clustering_feature = np.column_stack([X_test, new_clustering_feature_test])
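One caveat about the split above: `indices[N_test:]` selects every sample except the first 30 of the permutation, so the evaluation set has 120 rows and shares 90 of them with the training set. A minimal sketch of a disjoint hold-out, assuming scikit-learn's `train_test_split` is acceptable here (the `random_state` value and the `_alt` variable names are illustrative and not part of the original answer):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, :3]  # same three features as in the answer
y = iris.target

# Hold out 30 samples that are never seen during training.
# stratify=y keeps the class proportions equal in both parts.
X_train_alt, X_test_alt, y_train_alt, y_test_alt = train_test_split(
    X, y, test_size=30, random_state=0, stratify=y
)
print(X_train_alt.shape, X_test_alt.shape)
```

With a split like this the per-class supports in a classification report would sum to 30 rather than 120, so the scores would not be directly comparable with the numbers quoted in the answer.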

Now let's compare two models, one trained only on X_train and the other on X_train_with_clustering_feature:

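    model1 = SVC(kernel='rbf', gamma=0.7, C=1.0).fit(X_train, y_train)
    # Note: classification_report expects (y_true, y_pred). With the arguments in this
    # order, the precision and recall columns are swapped and support counts predictions.
    print(metrics.classification_report(model1.predict(X_test), y_test))

                  precision    recall  f1-score   support

               0       1.00      1.00      1.00        45
               1       0.95      0.97      0.96        38
               2       0.97      0.95      0.96        37

        accuracy                           0.97       120
       macro avg       0.97      0.97      0.97       120
    weighted avg       0.98      0.97      0.97       120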

The other model:

    model2 = SVC(kernel='rbf', gamma=0.7, C=1.0).fit(X_train_with_clustering_feature, y_train)
    print(metrics.classification_report(model2.predict(X_test_with_clustering_feature), y_test))

                  precision    recall  f1-score   support

               0       1.00      1.00      1.00        45
               1       0.87      0.97      0.92        35
               2       0.97      0.88      0.92        40

        accuracy                           0.95       120
       macro avg       0.95      0.95      0.95       120
    weighted avg       0.95      0.95      0.95       120
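A single random split can be noisy, so it may be worth repeating the comparison with cross-validation. The sketch below is one possible way to do that, not the original author's method. It assumes a hypothetical helper class, `ClusterFeatureAppender`, that refits KMeans inside each training fold and appends the cluster membership as one-hot columns rather than as the raw integer label (the labels 0/1/2 are nominal, and an RBF kernel would otherwise treat them as distances). A random forest is included only because the question asks about stronger models; its hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


class ClusterFeatureAppender(TransformerMixin, BaseEstimator):
    """Hypothetical helper: appends one-hot KMeans cluster membership to X."""

    def __init__(self, n_clusters=3, random_state=0):
        self.n_clusters = n_clusters
        self.random_state = random_state

    def fit(self, X, y=None):
        # Fit the clustering on the training portion only.
        self.kmeans_ = KMeans(
            n_clusters=self.n_clusters, n_init=10, random_state=self.random_state
        ).fit(X)
        return self

    def transform(self, X):
        labels = self.kmeans_.predict(X)
        one_hot = np.eye(self.n_clusters)[labels]  # shape (n_samples, n_clusters)
        return np.column_stack([X, one_hot])


X, y = load_iris(return_X_y=True)
X = X[:, :3]  # same three features as in the answer
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "SVC": SVC(kernel="rbf", gamma=0.7, C=1.0),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=0),
}

results = {}
for name, model in models.items():
    # Mean accuracy without and with the appended cluster columns.
    base = cross_val_score(model, X, y, cv=cv).mean()
    augmented = cross_val_score(
        make_pipeline(ClusterFeatureAppender(), model), X, y, cv=cv
    ).mean()
    results[name] = (base, augmented)
    print(f"{name}: without={base:.3f}  with={augmented:.3f}")
```

Because the cluster label is a deterministic function of features the model already sees, it adds no new information; it only changes how that information is represented. Whether that helps, hurts, or makes no difference is an empirical question for each dataset and model, which is the point the answer makes.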
