Understanding Principal Component Analysis in Machine Learning

I am using part of the iris dataset to better understand PCA.

Here is my code:

    from sklearn.datasets import load_iris
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import decomposition

    dataset = load_iris()
    X = dataset.data[:20,]

    # Note: X is overwritten after every fit, so each later PCA
    # is fitted on the output of the previous transform.
    pca = decomposition.PCA(n_components=4)
    pca.fit(X)
    X = pca.transform(X)
    print(X)
    print()
    print(pca.explained_variance_ratio_)
    print(pca.explained_variance_)
    print(pca.noise_variance_)
    print()
    print(pca.components_)
    print()

    pca = decomposition.PCA(n_components=3)
    pca.fit(X)
    X = pca.transform(X)
    print(X)
    print()
    print(pca.explained_variance_ratio_)
    print(pca.explained_variance_)
    print(pca.noise_variance_)
    print()
    print(pca.components_)
    print()

    pca = decomposition.PCA(n_components=2)
    pca.fit(X)
    X = pca.transform(X)
    print(X)
    print()
    print(pca.explained_variance_ratio_)
    print(pca.explained_variance_)
    print(pca.noise_variance_)
    print()
    print(pca.components_)
    print()

    pca = decomposition.PCA(n_components=1)
    pca.fit(X)
    X = pca.transform(X)
    print(X)
    print()
    print(pca.explained_variance_ratio_)
    print(pca.explained_variance_)
    print(pca.noise_variance_)
    print()
    print(pca.components_)
    print()

Output:

Input data (the 20 rows of X):

| F1  | F2  | F3  | F4  | Label |
| 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| 5.4 | 3.9 | 1.7 | 0.4 | 0 |
| 4.6 | 3.4 | 1.4 | 0.3 | 0 |
| 5.0 | 3.4 | 1.5 | 0.2 | 0 |
| 4.4 | 2.9 | 1.4 | 0.2 | 0 |
| 4.9 | 3.1 | 1.5 | 0.1 | 0 |
| 5.4 | 3.7 | 1.5 | 0.2 | 0 |
| 4.8 | 3.4 | 1.6 | 0.2 | 0 |
| 4.8 | 3.0 | 1.4 | 0.1 | 0 |
| 4.3 | 3.0 | 1.1 | 0.1 | 0 |
| 5.8 | 4.0 | 1.2 | 0.2 | 0 |
| 5.7 | 4.4 | 1.5 | 0.4 | 0 |
| 5.4 | 3.9 | 1.3 | 0.4 | 0 |
| 5.1 | 3.5 | 1.4 | 0.3 | 0 |
| 5.7 | 3.8 | 1.7 | 0.3 | 0 |
| 5.1 | 3.8 | 1.5 | 0.3 | 0 |

Printed output:

    [[ -5.35882132e-02   2.13091549e-02   5.63776995e-02   2.38909674e-02]
     [  4.31102885e-01   2.27802156e-01   7.74776903e-02  -8.56077547e-02]
     [  4.46437821e-01  -6.48981661e-02   7.80252213e-02  -2.16463511e-02]
     [  5.70213598e-01   1.37832371e-02  -1.17201913e-01  -2.27730577e-03]
     [ -4.99837824e-02  -1.06433448e-01   1.11801355e-02   6.42148516e-02]
     [ -5.88493547e-01   1.19234918e-02  -2.42112963e-01  -4.46036896e-02]
     [  3.62588639e-01  -2.42562846e-01  -9.89230051e-02  -3.13366123e-02]
     [  7.83136388e-02   6.27754417e-02  -4.79067754e-02   2.65736478e-02]
     [  8.58395527e-01  -1.49295381e-02  -5.29428852e-02  -4.69710396e-02]
     [  3.65880852e-01   2.20160693e-01  -4.51271386e-03   5.21066893e-02]
     [ -4.13586321e-01   1.11767646e-01   2.13883619e-02   5.54246013e-02]
     [  2.13819922e-01  -2.35008745e-02  -1.97388814e-01   6.95802124e-02]
     [  5.14034854e-01   1.87196747e-01   7.30881295e-02   2.14166399e-02]
     [  8.97493973e-01  -2.33177183e-01   1.99567657e-01   3.71580447e-02]
     [ -8.81108056e-01   4.91145021e-02   3.63511477e-01   3.42164603e-02]
     [ -1.12874867e+00  -2.07254026e-01  -5.20579454e-02   1.83622028e-02]
     [ -5.55989247e-01  -1.36936973e-01   1.21657674e-01  -1.11349149e-01]
     [ -6.47040031e-02   1.68848098e-04   3.14975704e-02  -6.99733273e-02]
     [ -7.24614545e-01   2.84297834e-01  -1.13495890e-01  -1.73834789e-02]
     [ -2.77465322e-01  -1.60606696e-01  -1.07228711e-01   2.82043907e-02]]

    [ 0.87954353  0.06300167  0.05039505  0.00705974]
    [ 0.31612993  0.02264438  0.01811324  0.00253745]
    0.0

    [[-0.71816179 -0.68211748 -0.08126075 -0.1111579 ]
     [ 0.61745716 -0.65996887  0.37215116 -0.21140307]
     [ 0.2926969  -0.15927874 -0.90942659 -0.24880129]
     [-0.131601    0.27163784  0.16686365 -0.93864295]]

    [[ -5.35882132e-02   2.13091549e-02  -5.63776995e-02]
     [  4.31102885e-01   2.27802156e-01  -7.74776903e-02]
     [  4.46437821e-01  -6.48981661e-02  -7.80252213e-02]
     [  5.70213598e-01   1.37832371e-02   1.17201913e-01]
     [ -4.99837824e-02  -1.06433448e-01  -1.11801355e-02]
     [ -5.88493547e-01   1.19234918e-02   2.42112963e-01]
     [  3.62588639e-01  -2.42562846e-01   9.89230051e-02]
     [  7.83136388e-02   6.27754417e-02   4.79067754e-02]
     [  8.58395527e-01  -1.49295381e-02   5.29428852e-02]
     [  3.65880852e-01   2.20160693e-01   4.51271386e-03]
     [ -4.13586321e-01   1.11767646e-01  -2.13883619e-02]
     [  2.13819922e-01  -2.35008745e-02   1.97388814e-01]
     [  5.14034854e-01   1.87196747e-01  -7.30881295e-02]
     [  8.97493973e-01  -2.33177183e-01  -1.99567657e-01]
     [ -8.81108056e-01   4.91145021e-02  -3.63511477e-01]
     [ -1.12874867e+00  -2.07254026e-01   5.20579454e-02]
     [ -5.55989247e-01  -1.36936973e-01  -1.21657674e-01]
     [ -6.47040031e-02   1.68848098e-04  -3.14975704e-02]
     [ -7.24614545e-01   2.84297834e-01   1.13495890e-01]
     [ -2.77465322e-01  -1.60606696e-01   1.07228711e-01]]

    [ 0.87954353  0.06300167  0.05039505]
    [ 0.31612993  0.02264438  0.01811324]
    0.00253744874373

    [[  1.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
     [ -0.00000000e+00   1.00000000e+00  -3.33066907e-15   0.00000000e+00]
     [  0.00000000e+00  -3.10862447e-15  -1.00000000e+00  -3.60822483e-16]]

    [[ -5.35882132e-02   2.13091549e-02]
     [  4.31102885e-01   2.27802156e-01]
     [  4.46437821e-01  -6.48981661e-02]
     [  5.70213598e-01   1.37832371e-02]
     [ -4.99837824e-02  -1.06433448e-01]
     [ -5.88493547e-01   1.19234918e-02]
     [  3.62588639e-01  -2.42562846e-01]
     [  7.83136388e-02   6.27754417e-02]
     [  8.58395527e-01  -1.49295381e-02]
     [  3.65880852e-01   2.20160693e-01]
     [ -4.13586321e-01   1.11767646e-01]
     [  2.13819922e-01  -2.35008745e-02]
     [  5.14034854e-01   1.87196747e-01]
     [  8.97493973e-01  -2.33177183e-01]
     [ -8.81108056e-01   4.91145021e-02]
     [ -1.12874867e+00  -2.07254026e-01]
     [ -5.55989247e-01  -1.36936973e-01]
     [ -6.47040031e-02   1.68848098e-04]
     [ -7.24614545e-01   2.84297834e-01]
     [ -2.77465322e-01  -1.60606696e-01]]

    [ 0.88579703  0.06344961]
    [ 0.31612993  0.02264438]
    0.0181132415475

    [[  1.00000000e+00   0.00000000e+00   0.00000000e+00]
     [ -0.00000000e+00   1.00000000e+00  -5.55111512e-16]]

    [[-0.05358821]
     [ 0.43110288]
     [ 0.44643782]
     [ 0.5702136 ]
     [-0.04998378]
     [-0.58849355]
     [ 0.36258864]
     [ 0.07831364]
     [ 0.85839553]
     [ 0.36588085]
     [-0.41358632]
     [ 0.21381992]
     [ 0.51403485]
     [ 0.89749397]
     [-0.88110806]
     [-1.12874867]
     [-0.55598925]
     [-0.064704  ]
     [-0.72461455]
     [-0.27746532]]

    [ 0.93315793]
    [ 0.31612993]
    0.0226443764968

    [[ 1.  0.]]

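In my dataset, F1 has the highest variance. How is this reflected in the PCA output?

What exactly does "explained variance" mean here? Does it tell me how much the original features affect the variance of the newly computed values?

Why is the noise variance 0 in the first example, which uses 4 components?

What exactly are components_? Are they n-dimensional eigenvectors?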


Answer:

F1 has the highest variance. How is this reflected in the PCA output?

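PCA is a feature transformation technique: it rotates your original data dimensions into a new orthogonal feature space. In that space the principal components (orthonormal eigenvectors of the covariance matrix of your mean-centered data) form the axes. Each component is a linear combination of your original feature dimensions. In pca.components_ every row is one component and every column is one original feature, so with the 4-component PCA fitted on the original X, the code below shows that the leading component PC1 (the one capturing the highest variance in the data) can be written as PC1 = -0.718162*F1 - 0.682117*F2 - 0.081261*F3 - 0.111158*F4, where each F stands for the feature minus its mean. F1 has the largest absolute weight in PC1 (0.718, closely followed by F2 at 0.682), and that is how its high variance shows up in the output.

    import pandas as pd
    # Rows of components_ are the principal components, columns are the original features.
    pd.DataFrame(pca.components_, index=['PC1', 'PC2', 'PC3', 'PC4'], columns=['F1', 'F2', 'F3', 'F4'])
    #            F1        F2        F3        F4
    # PC1 -0.718162 -0.682117 -0.081261 -0.111158
    # PC2  0.617457 -0.659969  0.372151 -0.211403
    # PC3  0.292697 -0.159279 -0.909427 -0.248801
    # PC4 -0.131601  0.271638  0.166864 -0.938643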
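To make the "linear combination" concrete, here is a minimal sketch that refits PCA on the same 20 samples and rebuilds the PC1 scores by hand. The names `X_raw` and `pca4` are illustrative and not part of the original post; the sketch assumes the default `whiten=False`, in which case `transform` amounts to centering followed by a projection onto the rows of `components_`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_raw = load_iris().data[:20]          # same 20 samples as in the question
pca4 = PCA(n_components=4).fit(X_raw)  # illustrative name for the refitted model

# PCA centers the data before projecting, so the weights apply to X - mean.
X_centered = X_raw - pca4.mean_

# Score on PC1 = weighted sum of the centered features, weights from row 0.
pc1_manual = X_centered @ pca4.components_[0]
pc1_sklearn = pca4.transform(X_raw)[:, 0]

scores_match = np.allclose(pc1_manual, pc1_sklearn)
print(scores_match)

# Index of the feature with the largest absolute weight in PC1.
dominant_feature = int(np.argmax(np.abs(pca4.components_[0])))
print(dominant_feature)
```

Taking the absolute value matters because the sign of a component is arbitrary: flipping a whole row of `components_` describes the same axis.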

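What exactly does "explained variance" mean here? Does it tell me how much the original features affect the variance of the newly computed values?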

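The variance explained by each selected component is simply the variance of the corresponding column returned by pca.transform, that is, the variance of the transformed features (the scores), not of the original ones. So it does not measure how strongly each original feature influences the new values; it measures how much of the data's spread lies along each new axis. See the following code, where pca is again the 4-component fit on the original X:

    X_pca = pca.transform(X)
    print(np.var(X_pca, axis=0))
    #[ 0.31612993  0.02264438  0.01811324  0.00253745]
    print(pca.explained_variance_)
    #[ 0.31612993  0.02264438  0.01811324  0.00253745]
    # These numbers come from a scikit-learn release that divided by n.
    # Newer releases divide by n - 1, so pass ddof=1 to np.var to get the same match.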
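The ratio reported in `explained_variance_ratio_` follows from the same idea: each component's variance divided by the total variance of the data. The sketch below checks this, assuming a recent scikit-learn in which `explained_variance_` uses the n - 1 divisor (hence `ddof=1`); `X_raw`, `pca4`, and the other variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_raw = load_iris().data[:20]
pca4 = PCA(n_components=4).fit(X_raw)
scores = pca4.transform(X_raw)

# Variance of each transformed column (ddof=1 assumes a recent scikit-learn).
score_var = np.var(scores, axis=0, ddof=1)

# A rotation preserves total variance, so the sum over the original
# features equals the sum over all four components.
total_var = np.var(X_raw, axis=0, ddof=1).sum()

ratio_manual = score_var / total_var
print(ratio_manual)
print(pca4.explained_variance_ratio_)
```

The ratio itself does not depend on the divisor, since n or n - 1 cancels between numerator and denominator.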

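Why is the noise variance 0 in the first example, which uses 4 components?

Because in the first case we did not reduce the dimensionality at all: we only transformed the feature space into another one and kept all 4 components, excluding none, so no information was lost. noise_variance_ is the average of the eigenvalues belonging to the components that were dropped, which is why each later run in the question reports exactly the smallest variance of the run before it (0.00253745, 0.01811324, 0.02264438).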
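A small sketch of that relationship, fitting both models directly on the raw 20 samples (unlike the question's code, which refits on already transformed data). The names `full`, `reduced`, and `dropped_mean` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_raw = load_iris().data[:20]

full = PCA(n_components=4).fit(X_raw)     # keeps every direction
reduced = PCA(n_components=2).fit(X_raw)  # discards the two weakest directions

# Nothing is discarded, so there is no residual variance.
print(full.noise_variance_)

# Mean of the eigenvalues that were dropped (the 3rd and 4th here).
dropped_mean = full.explained_variance_[2:].mean()
print(dropped_mean, reduced.noise_variance_)
```

Because two directions are dropped at once here, the noise variance is the mean of two eigenvalues rather than a single one as in the question's chained runs.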

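What exactly are components_? Are they n-dimensional eigenvectors?

The components can be thought of as the orthonormal eigenvectors of the covariance matrix of the mean-centered data, one per row of components_ and each with as many entries as there are original features. As the documentation says, though, they are computed in a more numerically stable way using singular value decomposition, in which case they are obtained from the right singular vectors.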
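The equivalence with the eigenvectors can be checked numerically. The sketch below uses `np.cov` and `np.linalg.eigh` as a stand-in for the SVD route that scikit-learn takes internally, and compares absolute values because the sign of each vector is arbitrary. It assumes the eigenvalues are distinct, as they are for this data, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_raw = load_iris().data[:20]
pca4 = PCA(n_components=4).fit(X_raw)

# Covariance matrix of the features (np.cov centers the data itself).
cov = np.cov(X_raw, rowvar=False)

# eigh returns eigenvalues in ascending order; reverse to match PCA's ordering.
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Eigenvectors are columns here but rows in components_; compare up to sign.
same_directions = np.allclose(np.abs(eigvecs.T), np.abs(pca4.components_), atol=1e-6)

# The rows of components_ form an orthonormal set.
rows_orthonormal = np.allclose(pca4.components_ @ pca4.components_.T, np.eye(4))
print(same_directions, rows_orthonormal)
```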
