Finding K-means distances

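I have a database with 13 features and 10 million rows. I want to apply K-means to remove any anomalies. My idea is to apply K-means, create a new column that stores each data point's distance to its cluster center, create another column that stores the mean distance, and delete the whole row whenever a point's distance is greater than the mean. But the code I wrote does not seem to work.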

Dataset sample: https://drive.google.com/open?id=1iB1qjnWQyvoKuN_Pa8Xk4BySzXVTwtUk

    df = pd.read_csv('Final After Simple Filtering.csv',index_col=None,low_memory=True)
    # Dropping columns with low feature importance
    del df['AmbTemp_DegC']
    del df['NacelleOrientation_Deg']
    del df['MeasuredYawError']
    #applying kmeans
    kmeans = KMeans( n_clusters=8)
    clusters= kmeans.fit_predict(df)
    centroids = kmeans.cluster_centers_
    distance1 = kmeans.fit_transform(df)
    distance2 = distance1.mean()
    df['distances']=distance1-distance2
    df = df[df['distances'] >=0]
    del df['distances']
    df.to_csv('/content//drive/My Drive/K TEST.csv', index=False)

Error:

    KeyError                                  Traceback (most recent call last)
    /usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
       2896             try:
    -> 2897                 return self._engine.get_loc(key)
       2898             except KeyError:

    pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
    pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
    pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
    pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

    KeyError: 'distances'

    During handling of the above exception, another exception occurred:

    KeyError                                  Traceback (most recent call last)
    9 frames
    pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
    pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
    pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
    pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

    KeyError: 'distances'

    During handling of the above exception, another exception occurred:

    ValueError                                Traceback (most recent call last)
    /usr/local/lib/python3.6/dist-packages/pandas/core/internals/blocks.py in __init__(self, values, placement, ndim)
        126             raise ValueError(
        127                 "Wrong number of items passed {val}, placement implies "
    --> 128                 "{mgr}".format(val=len(self.values), mgr=len(self.mgr_locs))
        129             )
        130

    ValueError: Wrong number of items passed 8, placement implies 1

Thanks


Answer:

This is a follow-up to my answer to your previous question.

    import pandas as pd
    import seaborn as sns
    from scipy.stats import zscore
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import MinMaxScaler

    # Load the example dataset and drop rows with missing values
    titanic = sns.load_dataset('titanic')
    titanic = titanic.dropna()

    titanic['age'].plot.hist(bins=50, title="Histogram of the age variable")

    # Univariate check: flag ages more than 2.5 standard deviations from the mean
    titanic["age_zscore"] = zscore(titanic["age"])
    titanic["is_outlier"] = titanic["age_zscore"].abs() >= 2.5
    titanic[titanic["is_outlier"]]

    # Bivariate check on age and fare
    ageAndFare = titanic[["age", "fare"]]
    ageAndFare.plot.scatter(x="age", y="fare")

    # Scale both columns to [0, 1] so that neither dominates the distance
    scaler = MinMaxScaler()
    ageAndFare = scaler.fit_transform(ageAndFare)
    ageAndFare = pd.DataFrame(ageAndFare, columns=["age", "fare"])
    ageAndFare.plot.scatter(x="age", y="fare")

    # DBSCAN labels points that belong to no cluster as -1 (noise)
    outlier_detection = DBSCAN(eps=0.5, metric="euclidean", min_samples=3, n_jobs=-1)
    clusters = outlier_detection.fit_predict(ageAndFare)
    print(clusters)

    # Color each point by its cluster label; the colormap is passed by name
    # because cm.get_cmap is deprecated in newer Matplotlib releases
    ageAndFare.plot.scatter(x="age", y="fare", c=clusters, cmap="Accent", colorbar=False)
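The code above colors points by cluster label but stops short of removing anything. A minimal sketch of that last step follows, assuming synthetic data in place of the Titanic columns, two hand-placed outliers, and an `eps` of 0.1 chosen only for this toy data. The names `cloud`, `planted`, `cleaned`, and `outliers` are illustrative and do not come from the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the age/fare columns (an assumption, not real data)
rng = np.random.default_rng(0)
cloud = rng.normal(loc=[30.0, 50.0], scale=[5.0, 10.0], size=(200, 2))
# Two hand-placed points far away from the main cloud
planted = np.array([[95.0, 500.0], [1.0, 480.0]])
data = pd.DataFrame(np.vstack([cloud, planted]), columns=["age", "fare"])

# Scale to [0, 1] so eps means the same thing on both axes
scaled = MinMaxScaler().fit_transform(data)

# eps=0.1 is tuned to this toy data only; real data needs its own value
labels = DBSCAN(eps=0.1, min_samples=3).fit_predict(scaled)

# DBSCAN marks noise points with the label -1
is_noise = labels == -1
outliers = data[is_noise]
cleaned = data[~is_noise]

print(f"removed {len(outliers)} of {len(data)} rows")
```

The boolean mask is applied to the unscaled frame, so the rows that survive keep their original units.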


See this link for all the details.

https://www.mikulskibartosz.name/outlier-detection-with-scikit-learn/

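I had never heard of "Local Outlier Factor" before today. When I googled it, what I found seemed to suggest that it is derived from DBSCAN. In the end, I think my first answer is actually the best way to detect anomalies. DBSCAN is a clustering algorithm that happens to find anomalies, which it treats as "noise". I think DBSCAN's main purpose is clustering rather than anomaly detection. Either way, choosing its hyperparameters well takes some skill. DBSCAN can also be slow on large datasets: it implicitly has to estimate the empirical density around every sample point, which gives a worst-case time complexity that is quadratic in the number of samples.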
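As for the K-means idea in the question, the `ValueError` appears to come from `fit_transform`, which returns one distance per centroid (8 columns for `n_clusters=8`), so the result cannot be assigned to a single `distances` column. Below is a minimal sketch of the intended filter, assuming random stand-in data and hypothetical column names `f0` to `f3`. Note that the question's code keeps the rows at or above the mean, which is the opposite of the stated goal; the sketch keeps the rows at or below it.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Random stand-in for the real feature table (an assumption)
rng = np.random.default_rng(0)
features = pd.DataFrame(
    rng.normal(size=(500, 4)), columns=["f0", "f1", "f2", "f3"]
)

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)
kmeans.fit(features)

# transform() gives the distance to every centroid:
# shape (n_samples, n_clusters), here (500, 8)
all_distances = kmeans.transform(features)

# Reduce to one value per row: the distance to the nearest centroid
nearest = all_distances.min(axis=1)

# Keep rows whose distance does not exceed the mean distance
threshold = nearest.mean()
kept = features[nearest <= threshold]

print(all_distances.shape, nearest.shape, len(kept))
```

Cutting at the mean discards a large share of the rows, so a high percentile of `nearest` may be a gentler threshold. For 10 million rows, scikit-learn's `MiniBatchKMeans` may also be worth trying in place of `KMeans`.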
