I am trying to cluster actors with K-means, using the information in the following columns:
Actors             Movies  TvGuest  Awards  Shorts  Special  LiveShows
Robert De Niro        111        2       6       0        0          0
Jack Nicholson         70        2       4       0        5          0
Marlon Brando          64        2       5       0        0         28
Denzel Washington      25        2       3      24        0          0
Katharine Hepburn      90        1       2       0        0          0
Humphrey Bogart       105        2       1       0        0         52
Meryl Streep           27        2       2       5        0          0
Daniel Day-Lewis       90        2       1       0       71         22
Sidney Poitier         63        2       3       0        0          0
Clark Gable            34        2       4       0        3          0
Ingrid Bergman         22        2       2       3        0          4
Tom Hanks              82       11       6      21       11         22
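from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Standardize the data first (data is assumed to hold the numeric columns above)
X = StandardScaler().fit_transform(data)

# Use an elbow plot to look for the best k
sum_of_squared_distances = []
K = range(1, 15)  # note: k cannot exceed the number of rows in X
for k in K:
    k_means = KMeans(n_clusters=k)
    model = k_means.fit(X)
    sum_of_squared_distances.append(k_means.inertia_)
plt.plot(K, sum_of_squared_distances, 'bx-')
plt.show()

# Compute yhat for the chosen k
kmeans = KMeans(n_clusters=3)
model = kmeans.fit(X)
yhat = kmeans.predict(X)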
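I can't figure out how to create a scatter plot by actor.
Edit: if the centroids are also plotted as shown below, is there a way to find which actors are closest to each centroid?
centers = kmeans.cluster_centers_  # kmeans here refers to the one in Eric's answer below
plt.scatter(centers[:,0], centers[:,1], color='purple', marker='*', label='centroid')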
Answer:
K-means clustering with Pandas: scatter plot
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
df = pd.DataFrame(columns=['Actors', 'Movies', 'TvGuest', "Awards", "Shorts"])
df.loc[0] = ["Robert De Niro", 111, 2, 6, 0]
df.loc[1] = ["Jack Nicholson", 70, 2, 4, 0]
df.loc[2] = ["Marlon Brando", 64, 4, 5, 0]
df.loc[3] = ["Denzel Washington", 25, 2, 3, 24]
df.loc[4] = ["Katharine Hepburn", 90, 1, 2, 0]
df.loc[5] = ["Humphrey Bogart", 105, 2, 1, 0]
df.loc[6] = ["Meryl Streep", 27, 3, 2, 5]
df.loc[7] = ["Daniel Day-Lewis", 90, 2, 1, 0]
df.loc[8] = ["Sidney Poitier", 63, 2, 3, 0]
df.loc[9] = ["Clark Gable", 34, 2, 4, 0]
df.loc[10] = ["Ingrid Bergman", 22, 5, 2, 3]
kmeans = KMeans(n_clusters=4)
y = kmeans.fit_predict(df[['Movies', 'TvGuest', 'Awards']])
df['Cluster'] = y
plt.scatter(df.Movies, df.TvGuest, c=df.Cluster, alpha = 0.6)
plt.title('K-means Clustering 2 dimensions and 4 clusters')
plt.show()
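The resulting plot shows Movies against TvGuest, with each point colored by its cluster.
Note that the data points on the 2-D scatter plot are Movies and TvGuest, whereas KMeans was fitted on three variables: Movies, TvGuest, and Awards. You can picture an extra dimension going into the screen that also contributes to cluster membership.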
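To get the scatter plot "by actor" that the question asks for, one option is to write each name next to its point with Matplotlib's annotate. The following is a minimal sketch rather than the original answer's code: it rebuilds the same values in a single DataFrame call, keeps only the three columns used for fitting, and passes n_init=10 and random_state=0 (both assumptions added here) so that repeated runs give the same labels.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans

# Same values as the answer's DataFrame, limited to the three fitted columns.
df = pd.DataFrame(
    [
        ("Robert De Niro", 111, 2, 6),
        ("Jack Nicholson", 70, 2, 4),
        ("Marlon Brando", 64, 4, 5),
        ("Denzel Washington", 25, 2, 3),
        ("Katharine Hepburn", 90, 1, 2),
        ("Humphrey Bogart", 105, 2, 1),
        ("Meryl Streep", 27, 3, 2),
        ("Daniel Day-Lewis", 90, 2, 1),
        ("Sidney Poitier", 63, 2, 3),
        ("Clark Gable", 34, 2, 4),
        ("Ingrid Bergman", 22, 5, 2),
    ],
    columns=["Actors", "Movies", "TvGuest", "Awards"],
)

features = ["Movies", "TvGuest", "Awards"]
# n_init and random_state are assumptions, added only for reproducibility.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
df["Cluster"] = kmeans.fit_predict(df[features])

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(df["Movies"], df["TvGuest"], c=df["Cluster"], alpha=0.6)

# Write each actor's name slightly above and to the right of its point.
for row in df.itertuples():
    ax.annotate(
        row.Actors,
        (row.Movies, row.TvGuest),
        xytext=(4, 4),
        textcoords="offset points",
        fontsize=8,
    )

ax.set_xlabel("Movies")
ax.set_ylabel("TvGuest")
ax.set_title("K-means clusters labeled by actor")
plt.show()
```

Points that look close together in this view can still land in different clusters, because Awards is part of the fit but is not drawn.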
Sources:
https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/
https://towardsdatascience.com/visualizing-clusters-with-pythons-matplolib-35ae03d87489
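On the follow-up in the question's edit (which actors lie closest to each centroid), one possible approach is KMeans.transform, which returns the distance from every sample to every cluster center. The sketch below uses the same actor values as the answer's DataFrame; n_init=10 and random_state=0 are assumptions added for repeatability, and the result table with its column names is purely illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame(
    [
        ("Robert De Niro", 111, 2, 6),
        ("Jack Nicholson", 70, 2, 4),
        ("Marlon Brando", 64, 4, 5),
        ("Denzel Washington", 25, 2, 3),
        ("Katharine Hepburn", 90, 1, 2),
        ("Humphrey Bogart", 105, 2, 1),
        ("Meryl Streep", 27, 3, 2),
        ("Daniel Day-Lewis", 90, 2, 1),
        ("Sidney Poitier", 63, 2, 3),
        ("Clark Gable", 34, 2, 4),
        ("Ingrid Bergman", 22, 5, 2),
    ],
    columns=["Actors", "Movies", "TvGuest", "Awards"],
)

X = df[["Movies", "TvGuest", "Awards"]].to_numpy(dtype=float)

# n_init and random_state are assumptions, added only for reproducibility.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# dist[i, j] is the Euclidean distance from actor i to centroid j,
# measured in the same 3-D feature space that was used for fitting.
dist = kmeans.transform(X)

nearest_idx = dist.argmin(axis=0)   # row index of the closest actor per centroid
nearest_dist = dist.min(axis=0)     # the corresponding distance

# Illustrative summary table: one row per centroid.
closest = pd.DataFrame(
    {
        "Cluster": range(kmeans.n_clusters),
        "ClosestActor": df["Actors"].to_numpy()[nearest_idx],
        "Distance": nearest_dist,
    }
)
print(closest)
```

When plotting centers[:, 0] against centers[:, 1], those two columns are the Movies and TvGuest coordinates of each centroid, because they are the first two fitted features. If the model is fitted on standardized data instead, pass the standardized array to transform so the distances are measured in that same space.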