Feature engineering with multi-row calculations in Python using Pandas

I have CSV data in the following format:

    +-----------------+--------+-------------+
    | reservation_num |  rate  | guest_name  |
    +-----------------+--------+-------------+
    | B874576         | 169.95 | Bob Smith   |
    | H786234         | 258.95 | Jane Doe    |
    | H786234         | 258.95 | John Doe    |
    | F987354         | 385.95 | David Jones |
    | N097897         | 449.95 | Mark Davis  |
    | H567349         | 482.95 | Larry Stein |
    | N097897         | 449.95 | Sue Miller  |
    +-----------------+--------+-------------+

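I want to add a feature (column) named 'rate_per_person' to the DataFrame. It is computed by dividing the rate for a given reservation number by the total number of guests associated with that reservation number.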

Here is my code:

    # Importing Libraries
    import pandas as pd

    # Importing the Dataset
    ds = pd.read_csv('hotels.csv')

    for index, row in ds.iterrows():
        row['rate_per_person'] = row['rate'] / ds[row['reservation_num']].count

And the error message:

    Traceback (most recent call last):
      File "<ipython-input-3-0668a3165e76>", line 2, in <module>
        row['rate_per_person'] = row['rate'] / ds[row['reservation_num']].count
      File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/frame.py", line 2062, in __getitem__
        return self._getitem_column(key)
      File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/frame.py", line 2069, in _getitem_column
        return self._get_item_cache(key)
      File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/generic.py", line 1534, in _get_item_cache
        values = self._data.get(item)
      File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 3590, in get
        loc = self.items.get_loc(item)
      File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2395, in get_loc
        return self._engine.get_loc(self._maybe_cast_indexer(key))
      File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)
      File "pandas/_libs/index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)
      File "pandas/_libs/hashtable_class_helper.pxi", line 1207, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20405)
      File "pandas/_libs/hashtable_class_helper.pxi", line 1215, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20359)
    KeyError: 'B874576'

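From the error message, it is clear that the ds[row['reservation_num']].count part of the code is the problem. However, I am not sure how to programmatically get the number of guests for each reservation so that I can create the new feature.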


Answer:

Option 1
pd.Series.value_counts and map

    df.rate / df.reservation_num.map(df.reservation_num.value_counts())

    0    169.950
    1    129.475
    2    129.475
    3    385.950
    4    224.975
    5    482.950
    6    224.975
    dtype: float64
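The expression above only returns a Series. A minimal sketch of how it could be stored as the requested column is shown below. The DataFrame is rebuilt by hand from the sample rows in the question so the snippet runs without `hotels.csv`, and the intermediate name `guests_per_reservation` is introduced here purely for illustration.

```python
import pandas as pd

# Sample rows copied from the question; in practice this would come from
# pd.read_csv('hotels.csv').
df = pd.DataFrame({
    'reservation_num': ['B874576', 'H786234', 'H786234', 'F987354',
                        'N097897', 'H567349', 'N097897'],
    'rate': [169.95, 258.95, 258.95, 385.95, 449.95, 482.95, 449.95],
    'guest_name': ['Bob Smith', 'Jane Doe', 'John Doe', 'David Jones',
                   'Mark Davis', 'Larry Stein', 'Sue Miller'],
})

# Number of rows (guests) per reservation, indexed by reservation number.
guests_per_reservation = df['reservation_num'].value_counts()

# Look up each row's guest count by its reservation number, then divide.
df['rate_per_person'] = df['rate'] / df['reservation_num'].map(guests_per_reservation)

print(df)
```

This also shows why the original loop failed: `ds[...]` with a string looks up a column by that name, and there is no column called 'B874576'.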

Option 2
groupby, transform, and size

    df.rate / df.groupby('reservation_num').rate.transform('size')

    0    169.950
    1    129.475
    2    129.475
    3    385.950
    4    224.975
    5    482.950
    6    224.975
    dtype: float64
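To see why `transform('size')` can be divided into `df.rate` directly, it may help to compare it with a plain `size()`. The sketch below uses a shortened, hand-built version of the sample data; the variable names `per_group` and `per_row` are illustrative only.

```python
import pandas as pd

# A shortened version of the sample data from the question.
df = pd.DataFrame({
    'reservation_num': ['B874576', 'H786234', 'H786234', 'N097897', 'N097897'],
    'rate': [169.95, 258.95, 258.95, 449.95, 449.95],
})

grouped = df.groupby('reservation_num')['rate']

# size() returns one value per group, indexed by reservation number.
per_group = grouped.size()

# transform('size') broadcasts each group's size back onto every original
# row, so the result shares df's index and aligns row by row.
per_row = grouped.transform('size')

df['rate_per_person'] = df['rate'] / per_row
print(per_group)
print(per_row)
```

Because `per_row` has the same index as `df`, no explicit merge or map step is needed.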

Option 3
np.unique and np.bincount

    u, f = np.unique(df.reservation_num.values, return_inverse=True)
    df.rate / np.bincount(f)[f]

    0    169.950
    1    129.475
    2    129.475
    3    385.950
    4    224.975
    5    482.950
    6    224.975
    dtype: float64
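The NumPy version is compact, so here is a small step-by-step sketch of what each array holds. The input array is made up for illustration; the names `u` and `f` follow the answer, while `counts` and `per_row` are added here.

```python
import numpy as np

# Hypothetical reservation numbers, one entry per guest.
reservations = np.array(['B874576', 'H786234', 'H786234',
                         'N097897', 'N097897', 'H786234'])

# u: the sorted unique values.
# f: for each original element, the position of its value in u.
u, f = np.unique(reservations, return_inverse=True)

# bincount counts how often each integer code occurs, i.e. guests per
# unique reservation, in the same order as u.
counts = np.bincount(f)

# Indexing counts with f expands the per-reservation counts back to one
# value per original row.
per_row = counts[f]

print(u)        # ['B874576' 'H786234' 'N097897']
print(f)        # [0 1 1 2 2 1]
print(counts)   # [1 3 2]
print(per_row)  # [1 3 3 2 2 3]
```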

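Option 3.5
np.unique sorts, so it does not scale as well as pd.factorize. For my purposes here, the two do the same job. So I use a function that picks whichever is likely to perform better, based on an empirical threshold on the array length. It is numbered 3.5 because it is essentially the same answer as Option 3.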

    def factor(a):
        if len(a) > 10000:
            return pd.factorize(a)[0]
        else:
            return np.unique(a, return_inverse=True)[1]

    def count(a):
        f = factor(a)
        return np.bincount(f)[f]

    df.rate / count(df.reservation_num.values)

    0    169.950
    1    129.475
    2    129.475
    3    385.950
    4    224.975
    5    482.950
    6    224.975
    dtype: float64
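The two branches of `factor` do not return identical codes, which is worth seeing once. In the sketch below (with a made-up input array), `pd.factorize` numbers values in order of first appearance while `np.unique` numbers them in sorted order, yet the per-row counts come out the same, which is all that `count` relies on.

```python
import numpy as np
import pandas as pd

# Hypothetical reservation numbers, deliberately not in sorted order.
a = np.array(['N097897', 'B874576', 'N097897', 'H786234'], dtype=object)

# factorize: codes follow order of first appearance, no sorting needed.
codes_factorize = pd.factorize(a)[0]

# unique: codes follow the sorted order of the distinct values.
codes_unique = np.unique(a, return_inverse=True)[1]

# The integer labels differ, but the per-row occurrence counts agree.
counts_factorize = np.bincount(codes_factorize)[codes_factorize]
counts_unique = np.bincount(codes_unique)[codes_unique]

print(codes_factorize)   # [0 1 0 2]
print(codes_unique)      # [2 0 2 1]
print(counts_factorize)  # [2 1 2 1]
print(counts_unique)     # [2 1 2 1]
```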

Timing

    %timeit df.rate / df.reservation_num.map(df.reservation_num.value_counts())
    %timeit df.rate / df.groupby('reservation_num').rate.transform('size')

    1000 loops, best of 3: 650 µs per loop
    1000 loops, best of 3: 768 µs per loop

    %%timeit
    u, f = np.unique(df.reservation_num.values, return_inverse=True)
    df.rate / np.bincount(f)[f]

    10000 loops, best of 3: 131 µs per loop
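The timings above use IPython's `%timeit` and were presumably taken on the small sample frame, so the ranking may not hold for larger inputs. A rough sketch for repeating the comparison outside IPython is given below. The synthetic data, its size, the `R` prefix on the IDs, and the names `df_big`, `option_1`, `option_2`, and `option_3` are all assumptions made for this example, and no particular result is implied.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (an assumption for this sketch): 50,000 guests spread
# over up to 10,000 reservations with random rates.
rng = np.random.default_rng(0)
n_rows = 50_000
ids = rng.integers(0, 10_000, size=n_rows)
df_big = pd.DataFrame({
    'reservation_num': ['R' + str(i) for i in ids],
    'rate': rng.uniform(100, 500, size=n_rows),
})


def option_1(df):
    # value_counts + map
    return df.rate / df.reservation_num.map(df.reservation_num.value_counts())


def option_2(df):
    # groupby + transform('size')
    return df.rate / df.groupby('reservation_num').rate.transform('size')


def option_3(df):
    # np.unique + np.bincount
    u, f = np.unique(df.reservation_num.values, return_inverse=True)
    return df.rate / np.bincount(f)[f]


for fn in (option_1, option_2, option_3):
    # Best of 3 repeats, 3 calls each, reported as time per call.
    best = min(timeit.repeat(lambda: fn(df_big), number=3, repeat=3)) / 3
    print(f'{fn.__name__}: {best * 1e3:.2f} ms per call')
```

Taking the minimum over repeats mirrors the "best of 3" figure that `%timeit` reported in the output above.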
