Feature engineering with multi-row calculations in Python using Pandas

I have CSV data in the following format:

    +-----------------+--------+-------------+
    | reservation_num |  rate  | guest_name  |
    +-----------------+--------+-------------+
    | B874576         | 169.95 | Bob Smith   |
    | H786234         | 258.95 | Jane Doe    |
    | H786234         | 258.95 | John Doe    |
    | F987354         | 385.95 | David Jones |
    | N097897         | 449.95 | Mark Davis  |
    | H567349         | 482.95 | Larry Stein |
    | N097897         | 449.95 | Sue Miller  |
    +-----------------+--------+-------------+

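I would like to add a feature (column) named 'rate_per_person' to the DataFrame. It is computed by dividing the rate for a given reservation number by the total number of guests associated with that reservation number.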

Here is my code:

    # Importing Libraries
    import pandas as pd

    # Importing the Dataset
    ds = pd.read_csv('hotels.csv')

    for index, row in ds.iterrows():
        row['rate_per_person'] = row['rate'] / ds[row['reservation_num']].count

And the error message:

    Traceback (most recent call last):
      File "<ipython-input-3-0668a3165e76>", line 2, in <module>
        row['rate_per_person'] = row['rate'] / ds[row['reservation_num']].count
      File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/frame.py", line 2062, in __getitem__
        return self._getitem_column(key)
      File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/frame.py", line 2069, in _getitem_column
        return self._get_item_cache(key)
      File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/generic.py", line 1534, in _get_item_cache
        values = self._data.get(item)
      File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 3590, in get
        loc = self.items.get_loc(item)
      File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2395, in get_loc
        return self._engine.get_loc(self._maybe_cast_indexer(key))
      File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)
      File "pandas/_libs/index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)
      File "pandas/_libs/hashtable_class_helper.pxi", line 1207, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20405)
      File "pandas/_libs/hashtable_class_helper.pxi", line 1215, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20359)
    KeyError: 'B874576'

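From the error message, it is clear that the ds[row['reservation_num']].count part of the code is the problem. However, I am not sure how to programmatically obtain the number of guests for each reservation so that I can create the new feature.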

Answer:
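As background for the options that follow, here is a minimal sketch of why the lookup in the question raises KeyError. Indexing a DataFrame with a single label, as in ds['B874576'], selects a column with that name, not the rows holding that value. The sketch rebuilds the sample data inline instead of reading hotels.csv, and the variable names lookup_failed and guests_on_h786234 are illustrative only.

```python
import pandas as pd

# Sample data from the question, built inline instead of read from hotels.csv.
ds = pd.DataFrame(
    {
        "reservation_num": ["B874576", "H786234", "H786234", "F987354",
                            "N097897", "H567349", "N097897"],
        "rate": [169.95, 258.95, 258.95, 385.95, 449.95, 482.95, 449.95],
        "guest_name": ["Bob Smith", "Jane Doe", "John Doe", "David Jones",
                       "Mark Davis", "Larry Stein", "Sue Miller"],
    }
)

# ds[label] looks for a COLUMN called 'B874576'; none exists, hence KeyError.
try:
    ds["B874576"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Selecting ROWS by value needs a boolean mask; summing it counts the matches.
guests_on_h786234 = int((ds["reservation_num"] == "H786234").sum())

print(lookup_failed, guests_on_h786234)
```

Two further details in the original loop would also need attention even with a correct lookup: count is a method and has to be called as count(), and the row yielded by iterrows is generally a copy, so assigning to it is not a reliable way to add a column to ds.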

Option 1
pd.Series.value_counts and map

    df.rate / df.reservation_num.map(df.reservation_num.value_counts())

    0    169.950
    1    129.475
    2    129.475
    3    385.950
    4    224.975
    5    482.950
    6    224.975
    dtype: float64
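The expression above only returns a Series. A minimal end-to-end sketch of storing it as the requested column might look like the following. Note that the answer's snippets use df where the question uses ds. The data is built inline rather than loaded from hotels.csv, and guest_counts is a name introduced here for readability.

```python
import pandas as pd

# Sample data from the question (guest_name omitted, it is not needed here).
df = pd.DataFrame(
    {
        "reservation_num": ["B874576", "H786234", "H786234", "F987354",
                            "N097897", "H567349", "N097897"],
        "rate": [169.95, 258.95, 258.95, 385.95, 449.95, 482.95, 449.95],
    }
)

# value_counts gives one count per distinct reservation number.
guest_counts = df["reservation_num"].value_counts()

# map looks up each row's reservation number in guest_counts,
# producing a per-row guest count aligned with df's index.
df["rate_per_person"] = df["rate"] / df["reservation_num"].map(guest_counts)

print(df)
```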

Option 2
groupby, transform and size

    df.rate / df.groupby('reservation_num').rate.transform('size')

    0    169.950
    1    129.475
    2    129.475
    3    385.950
    4    224.975
    5    482.950
    6    224.975
    dtype: float64
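What makes transform suitable here is that it broadcasts each group's result back onto the original rows, whereas a plain groupby size returns one value per group. The sketch below contrasts the two on the sample data. The names per_reservation and per_row are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "reservation_num": ["B874576", "H786234", "H786234", "F987354",
                            "N097897", "H567349", "N097897"],
        "rate": [169.95, 258.95, 258.95, 385.95, 449.95, 482.95, 449.95],
    }
)

grouped = df.groupby("reservation_num")

# One entry per distinct reservation, indexed by reservation number.
per_reservation = grouped.size()

# One entry per original row, indexed like df, so it can divide df.rate directly.
per_row = grouped["rate"].transform("size")

print(len(per_reservation), len(per_row))
print(per_row.tolist())
```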

Option 3
np.unique and np.bincount

    import numpy as np

    # f holds an integer code per row; np.bincount(f) counts rows per code
    u, f = np.unique(df.reservation_num.values, return_inverse=True)
    df.rate / np.bincount(f)[f]

    0    169.950
    1    129.475
    2    129.475
    3    385.950
    4    224.975
    5    482.950
    6    224.975
    dtype: float64

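Option 3.5
np.unique sorts its input, so it does not scale as well as pd.factorize. For this use case the two are interchangeable, so I use a function that chooses between them based on an empirical threshold on the array length. It is numbered 3.5 because it is essentially the same answer as Option 3.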

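    import numpy as np
    import pandas as pd

    def factor(a):
        # Integer code per element; the threshold of 10000 is empirical
        if len(a) > 10000:
            return pd.factorize(a)[0]
        else:
            return np.unique(a, return_inverse=True)[1]

    def count(a):
        # Number of occurrences of each element, aligned with a
        f = factor(a)
        return np.bincount(f)[f]

    df.rate / count(df.reservation_num.values)

    0    169.950
    1    129.475
    2    129.475
    3    385.950
    4    224.975
    5    482.950
    6    224.975
    dtype: float64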
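To illustrate why the two factorizers are interchangeable for counting, the sketch below applies both to the sample reservation numbers. np.unique assigns codes in sorted order and pd.factorize in order of first appearance, so the codes differ, but the per-row counts from np.bincount come out the same. The array a and the names codes_sorted and codes_seen are illustrative.

```python
import numpy as np
import pandas as pd

# Reservation numbers from the question as a plain object array.
a = np.array(
    ["B874576", "H786234", "H786234", "F987354",
     "N097897", "H567349", "N097897"],
    dtype=object,
)

# Codes follow the sorted order of the distinct values.
codes_sorted = np.unique(a, return_inverse=True)[1]

# Codes follow the order in which values first appear.
codes_seen = pd.factorize(a)[0]

# Either set of codes yields the same per-row occurrence count.
counts_sorted = np.bincount(codes_sorted)[codes_sorted]
counts_seen = np.bincount(codes_seen)[codes_seen]

print(codes_sorted.tolist())
print(codes_seen.tolist())
print(counts_sorted.tolist())
```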

Timing

    %timeit df.rate / df.reservation_num.map(df.reservation_num.value_counts())
    %timeit df.rate / df.groupby('reservation_num').rate.transform('size')

    1000 loops, best of 3: 650 µs per loop
    1000 loops, best of 3: 768 µs per loop

    %%timeit
    u, f = np.unique(df.reservation_num.values, return_inverse=True)
    df.rate / np.bincount(f)[f]

    10000 loops, best of 3: 131 µs per loop
