Mean-centering a set of columns in a pandas DataFrame more efficiently while keeping the column names

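I have a DataFrame with about 370 columns. I'm testing a series of hypotheses, each of which requires fitting a cubic regression model to a subset of those columns. I plan to use statsmodels to model the data.

Part of the polynomial regression process involves mean-centering the variables (subtracting a feature's mean from every case of that feature).

I can do this in three lines of code, but that seems inefficient given that I need to repeat the process for six hypotheses. Note that I need coefficient-level data from the statsmodels output, so I have to keep the column names.

Here is a peek at the data. This is the subset of columns needed for one of the hypothesis tests.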

      i  we  you  shehe  they  ipron
0  0.51   0    0   0.26  0.00   1.02
1  1.24   0    0   0.00  0.00   1.66
2  0.00   0    0   0.00  0.72   1.45
3  0.00   0    0   0.00  0.00   0.53

Here is the code that mean-centers the data and keeps the column names.

import pandas as pd
from sklearn import preprocessing

# create df of features for hypothesis, from full dataframe
h2 = df[['i', 'we', 'you', 'shehe', 'they', 'ipron']]

# center the variables (pass booleans: the string 'False' is truthy and would still scale)
x_centered = preprocessing.scale(h2, with_mean=True, with_std=False)

# convert back into a pandas DataFrame, keeping the column names and the index
x_centered_df = pd.DataFrame(x_centered, columns=h2.columns, index=h2.index)

Any suggestions on how to make this more efficient or faster would be great!

Answer:

df.apply(lambda x: x-x.mean())

%timeit df.apply(lambda x: x-x.mean())
1000 loops, best of 3: 2.09 ms per loop

df.subtract(df.mean())

%timeit df.subtract(df.mean())
1000 loops, best of 3: 902 µs per loop

Both yield the following result:

        i  we  you  shehe  they  ipron
0  0.0725   0    0  0.195 -0.18 -0.145
1  0.8025   0    0 -0.065 -0.18  0.495
2 -0.4375   0    0 -0.065  0.54  0.285
3 -0.4375   0    0 -0.065 -0.18 -0.635
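
Since the question repeats this step for six hypotheses, the subtraction can be wrapped in a small helper and applied to each column subset. The following is only a sketch: the function name center_columns, the hypotheses dictionary, and the "h3" subset are hypothetical names made up for illustration, and the sample values are the four rows shown in the question.

```python
import pandas as pd


def center_columns(df, columns):
    """Return a mean-centered copy of the selected columns.

    Column names and the index are preserved because the subtraction
    is done on the DataFrame itself rather than on a NumPy array.
    """
    subset = df[columns]
    return subset - subset.mean()


# Sample rows from the question (stand-in for the full ~370-column frame)
df = pd.DataFrame({
    "i": [0.51, 1.24, 0.00, 0.00],
    "we": [0, 0, 0, 0],
    "you": [0, 0, 0, 0],
    "shehe": [0.26, 0.00, 0.00, 0.00],
    "they": [0.00, 0.00, 0.72, 0.00],
    "ipron": [1.02, 1.66, 1.45, 0.53],
})

# Hypothetical mapping from hypothesis name to its column subset
hypotheses = {
    "h2": ["i", "we", "you", "shehe", "they", "ipron"],
    "h3": ["i", "they"],  # assumed subset, for illustration only
}

# One centered DataFrame per hypothesis, each keeping its column names
centered = {name: center_columns(df, cols) for name, cols in hypotheses.items()}

print(centered["h2"])
```

Each entry of centered is an ordinary DataFrame, so the column names are still available to label coefficients when the data is passed on to a model.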
