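I am running K-means clustering on a dataset that includes some categorical features. I have older code that handles non-categorical data, and calling fit followed by predict there works fine.
I am now adapting that working code to a dataset with some categorical features, which therefore needs one-hot encoding. This is where things start to get confusing.
The predict call seems to expect the old, pre-encoding number of columns. After dropping the target column the dataset has 17 columns; after one-hot encoding it has 29.
Here is my code: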
    import pandas as pd
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    from google.colab import drive

    drive.mount('/gdrive')
    #Change current working directory to gdrive
    %cd /gdrive

    #Read files
    inputFileA = r'/gdrive/My Drive/FilenameA.csv'
    trainDataA = pd.read_csv(inputFileA) #creates a dataframe
    print(trainDataA.shape)

    #Extract training and test data
    print("------------------\nShapes before dropping target column")
    print(trainDataA.shape)
    # print(trainDataB.shape)  # trainDataB is not defined in this excerpt
    y_trainA = trainDataA["Revenue"]
    X_trainA = trainDataA.drop(["Revenue"], axis=1) #extracting training data without target column
    print("------------------\nShapes after dropping target column")
    print(X_trainA.shape)

    #categorical features of dataset A
    categoricalFeaturesA = ["Month", "VisitorType", "Weekend"]
    data_processed_A = pd.get_dummies(X_trainA, prefix_sep="__", columns=categoricalFeaturesA)
    print("---------------\nDataset A\n", data_processed_A.head())
    data_processed_A.to_csv(r'/gdrive/My Drive/data_processed_A.csv')

    #K-Means Clustering ========================================================================
    #Default Mode - K=8
    kmeans = KMeans()
    data_processed_A_fit = data_processed_A
    print("===================")
    print("Shape of processed data: \n", data_processed_A_fit.shape)
    data_processed_A_fit.to_csv(r'/gdrive/My Drive/data_processed_A_after_fit.csv')
    kmeans.fit(data_processed_A_fit)
    print("Online shoppers dataset")
    print("\n============\nDataset A labels")
    print(kmeans.labels_)
    print("==============\n\nDataset A Clusters")
    print(kmeans.cluster_centers_)

    #Print Silhouette measure
    print("\nDataset A silhouette_score:", silhouette_score(data_processed_A, kmeans.labels_))
    df_kmeansA = data_processed_A
    print(df_kmeansA.head())
    print(df_kmeansA.dtypes)
    kmeans_predict_trainA = kmeans.predict(df_kmeansA)
The last line raises an error:
    ValueError: Incorrect number of features. Got 30 features, expected 29
So it seems to expect the dataset from before one-hot encoding, but I cannot work out why.
Edit: as requested, here is the output.
    (12330, 18)
    ------------------
    Shapes before dropping target column
    (12330, 18)
    ------------------
    Shapes after dropping target column
    (12330, 17)
    ---------------
    Dataset A
       Administrative  Administrative_Duration  ...  Weekend__False  Weekend__True
    0               0                      0.0  ...               1              0
    1               0                      0.0  ...               1              0
    2               0                      0.0  ...               1              0
    3               0                      0.0  ...               1              0
    4               0                      0.0  ...               0              1

    [5 rows x 29 columns]
    ===================
    Shape of processed data:
     (12330, 29)
    Online shoppers dataset

    ============
    Dataset A labels
    [1 1 1 ... 1 1 1]
    ==============

    Dataset A Clusters
    [[ 3.81805930e+00  1.38862225e+02  9.64959569e-01  6.74071040e+01
       5.82958221e+01  2.41869720e+03  7.61833487e-03  2.22516393e-02
       8.26725184e+00  5.21563342e-02  2.12398922e+00  2.27021563e+00
       3.19204852e+00  3.92318059e+00  3.70619946e-02  1.35444744e-01
       4.71698113e-03  3.77358491e-02  1.95417790e-02  1.04447439e-01
       2.58086253e-01  3.29514825e-01  4.38005391e-02  2.96495957e-02
       6.13207547e-02  1.34770889e-03  9.37331536e-01  7.85040431e-01
       2.14959569e-01]
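Answer:
"It seems to expect the dataset from before one-hot encoding"
It does not. If it did, it would ask for 17 features, not the 29 it actually asks for:
    ValueError: Incorrect number of features. Got 30 features, expected 29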
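So the complaint is about one feature more than expected. Looking closely at your output, print(df_kmeansA.head()) evidently reports [5 rows x 30 columns] and includes a Cluster Number column. Your KMeans, however, was fitted on data_processed_A_fit, whose shape is reported as
    ===================
    Shape of processed data: (12330, 29)
and which has no Cluster Number column.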
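One way to confirm this kind of mismatch is to compare the columns passed to predict against the columns used for fit. The following is only a minimal sketch on made-up data: the frame names fit_frame and predict_frame, the three toy columns, and n_clusters=2 are illustrative assumptions, not taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for the one-hot encoded frame (hypothetical columns a, b, c).
rng = np.random.default_rng(0)
fit_frame = pd.DataFrame(rng.random((20, 3)), columns=["a", "b", "c"])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(fit_frame)

# Simulate a frame that picked up an extra column after fitting.
predict_frame = fit_frame.copy()
predict_frame["Cluster Number"] = kmeans.labels_

# The number of features the model was fitted on is the width of the centers.
expected = kmeans.cluster_centers_.shape[1]
got = predict_frame.shape[1]

# Columns present at predict time that were absent at fit time.
extra = [c for c in predict_frame.columns if c not in fit_frame.columns]
print("expected:", expected, "got:", got, "extra columns:", extra)
```

Here the extra list names the offending column directly, which is usually quicker than reading head() output by eye.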
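This indicates that, although you set data_processed_A_fit = data_processed_A and df_kmeansA = data_processed_A, some code not shown here adds a Cluster Number column to the data_processed_A dataframe, and that is what causes the error. Neither assignment makes a copy: all three names refer to the same dataframe, so a column added through any one of them appears in all of them.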