How can I perform inverse locally linear embedding (LLE) with sklearn or another Python package?
I want to train a classification machine-learning algorithm (SVM, neural network, etc.) on some tabular data X, where y is the target class variable.
The usual procedure is as follows:
Split X and y into X_train, y_train, X_test, y_test. Since I have a large number of parameters (columns), I can reduce their number by applying LLE to X_train, obtaining X_train_lle. The target y is left untransformed. I can then simply train a model on X_train_lle. The problem arises when I want to apply the trained model to the test set. If LLE is performed on X_test together with X_train, data leakage is introduced. Alternatively, if LLE is performed on X_test alone, the resulting X_test_lle may turn out completely different, because the algorithm uses k nearest neighbours. I believe the correct procedure would be to perform an inverse LLE on X_test using the parameters obtained on X_train, and then run the classification model on X_test_lle.
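As a side note on the forward direction: a fitted scikit-learn LocallyLinearEmbedding object exposes a transform method that maps unseen points into the training embedding via barycenter weights, without refitting, which avoids the leakage described above. A minimal sketch on random data (shapes and parameters are illustrative):

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(1)
X = rng.random_sample((300, 20))
X_train, X_test = train_test_split(X, test_size=0.5, random_state=1)

# fit LLE on the training split only
lle = LocallyLinearEmbedding(n_neighbors=15, n_components=5, eigen_solver='dense')
X_train_lle = lle.fit_transform(X_train)

# map the held-out points into the same embedding (no refitting, no leakage)
X_test_lle = lle.transform(X_test)

print(X_train_lle.shape, X_test_lle.shape)  # (150, 5) (150, 5)
```

This out-of-sample extension is not the inverse mapping asked about below, but it covers the "where do the test points fit in the training manifold" part of the problem.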
I checked some references; Section 2.4.1 of this paper deals with inverse LLE: https://arxiv.org/pdf/2011.10925.pdf
How can inverse LLE be performed in Python (preferably with sklearn)?
Here is a code example:
import numpy as np
from sklearn import preprocessing
from sklearn import svm, datasets, model_selection
from sklearn.manifold import LocallyLinearEmbedding

### Generating dummy data ###
n_row = 10000  # these numbers are much bigger for the real problem
n_col = 50
X = np.random.random((n_row, n_col))
y = np.random.randint(5, size=n_row)  # five different classes labeled from 0 to 4

### Preprocessing ###
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.5, random_state=1)
# standardization: fit StandardScaler on X_train, then scale X_train and X_test
scaler = preprocessing.StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

### Here is the part with LLE ###
# we reduce the parameter space to 10 with 15 nearest neighbours
embedding = LocallyLinearEmbedding(n_neighbors=15, n_components=10, method='modified', eigen_solver='dense')
X_train_lle = embedding.fit_transform(X_train)

### Here is the training part ###
# we want to apply SVM to the transformed data X_train_lle
clf = svm.SVC(kernel='linear')  # linear kernel
# train the model using the training sets
clf.fit(X_train_lle, y_train)

# Here should go the code to do inverse LLE on X_test,
# i.e. where do the values of X_test fit in the manifold X_train_lle
### after the previous part of the code is successfully solved by the stackoverflow community :)
# predict the response for the test dataset
y_pred = clf.predict(X_test_lle)
Answer:
The inverse transform can be done with the method from the paper you mention (Ghojogh et al. (2020), Section 2.4.1), as well as from other papers (e.g. Franz et al. (2014), Section 4.1). The basic idea is to find the k nearest neighbours in the embedded space and express each point as a linear combination of its neighbours there. The obtained weights are then kept, and each point is expressed as the same weighted combination of its k nearest neighbours in the original space. Obviously, the same number of neighbours should be used as in the original forward LLE.
Using the barycenter_kneighbors_graph function, the code looks like this:
from sklearn.manifold._locally_linear import barycenter_kneighbors_graph

# calculate the weights for expressing each point in the embedded space
# as a linear combination of its neighbors
W = barycenter_kneighbors_graph(Y, n_neighbors=k, reg=1e-3)

# reconstruct the data points in the high-dimensional space from their
# neighbors, using the weights calculated in the embedded space
X_reconstructed = W @ X
where Y is the result of the original LLE embedding (X_train_lle in your code snippet), X is the original data matrix, and k is the number of nearest neighbours.
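Putting it together, here is a self-contained sketch of the reconstruction step on dummy data (the data, shapes, and parameters are illustrative, not taken from the question):

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.manifold._locally_linear import barycenter_kneighbors_graph

rng = np.random.RandomState(0)
X = rng.random_sample((200, 10))  # 200 points in a 10-dimensional space
k = 15

# forward LLE: embed into 3 dimensions
emb = LocallyLinearEmbedding(n_neighbors=k, n_components=3, eigen_solver='dense')
Y = emb.fit_transform(X)

# barycenter weights of each embedded point w.r.t. its k neighbours in Y;
# W is a sparse (200, 200) matrix whose rows sum to 1
W = barycenter_kneighbors_graph(Y, n_neighbors=k, reg=1e-3)

# apply those weights to the original-space points to reconstruct them
X_reconstructed = W @ X
print(X_reconstructed.shape)  # (200, 10)
```

Note that barycenter_kneighbors_graph lives in the private module sklearn.manifold._locally_linear, so the import may break across sklearn versions.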