I have CSV data in the following format:
    +-----------------+--------+-------------+
    | reservation_num | rate   | guest_name  |
    +-----------------+--------+-------------+
    | B874576         | 169.95 | Bob Smith   |
    | H786234         | 258.95 | Jane Doe    |
    | H786234         | 258.95 | John Doe    |
    | F987354         | 385.95 | David Jones |
    | N097897         | 449.95 | Mark Davis  |
    | H567349         | 482.95 | Larry Stein |
    | N097897         | 449.95 | Sue Miller  |
    +-----------------+--------+-------------+
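I want to add a feature (column) named 'rate_per_person' to the DataFrame. It should be computed by dividing the rate for a given reservation number by the total number of guests associated with that reservation number.

Here is my code: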
    # Importing libraries
    import pandas as pd

    # Importing the dataset
    ds = pd.read_csv('hotels.csv')

    for index, row in ds.iterrows():
        row['rate_per_person'] = row['rate'] / ds[row['reservation_num']].count
And the error message:
    Traceback (most recent call last):
      File "<ipython-input-3-0668a3165e76>", line 2, in <module>
        row['rate_per_person'] = row['rate'] / ds[row['reservation_num']].count
      File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/frame.py", line 2062, in __getitem__
        return self._getitem_column(key)
      File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/frame.py", line 2069, in _getitem_column
        return self._get_item_cache(key)
      File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/generic.py", line 1534, in _get_item_cache
        values = self._data.get(item)
      File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 3590, in get
        loc = self.items.get_loc(item)
      File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2395, in get_loc
        return self._engine.get_loc(self._maybe_cast_indexer(key))
      File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)
      File "pandas/_libs/index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)
      File "pandas/_libs/hashtable_class_helper.pxi", line 1207, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20405)
      File "pandas/_libs/hashtable_class_helper.pxi", line 1215, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20359)
    KeyError: 'B874576'
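From the error message, it is clear that the ds[row['reservation_num']].count part of the code is the problem. However, I am not sure how to programmatically get the number of guests for each reservation so that I can create the new feature.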
Answer:
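For context on the KeyError in the question: indexing a DataFrame with a single label, as in ds[label], selects a column, so the reservation number 'B874576' is looked up as a column name and is not found. Two further issues would surface once that is fixed: count is a method and has to be called, and assigning to the row yielded by iterrows does not reliably write back to the DataFrame. The sketch below reproduces the lookup failure on an in-memory copy of the sample rows (built here as an assumption, instead of reading hotels.csv) and shows one way to count the guests for a single reservation.

```python
import pandas as pd

# In-memory copy of the sample rows from the question (stand-in for hotels.csv).
ds = pd.DataFrame({
    "reservation_num": ["B874576", "H786234", "H786234", "F987354",
                        "N097897", "H567349", "N097897"],
    "rate": [169.95, 258.95, 258.95, 385.95, 449.95, 482.95, 449.95],
    "guest_name": ["Bob Smith", "Jane Doe", "John Doe", "David Jones",
                   "Mark Davis", "Larry Stein", "Sue Miller"],
})

# ds[label] looks for a COLUMN named label, so a reservation number fails.
try:
    ds["B874576"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# Without parentheses, .count is only a reference to the method, not a number.
count_is_method = callable(ds["rate"].count)

# Counting the rows for one reservation with a boolean mask instead.
guests_for_h786234 = int((ds["reservation_num"] == "H786234").sum())
print(raised_key_error, count_is_method, guests_for_h786234)
```

The options that follow avoid the row-by-row loop entirely and compute the guest count for every row at once.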
Option 1: pd.Series.value_counts and map
    df.rate / df.reservation_num.map(df.reservation_num.value_counts())

    0    169.950
    1    129.475
    2    129.475
    3    385.950
    4    224.975
    5    482.950
    6    224.975
    dtype: float64
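The expression above only returns a Series. Here is a minimal end-to-end sketch that stores the result as the requested rate_per_person column. It assumes the sample rows are built in memory rather than read from hotels.csv, and that the name df used in this answer refers to the same data as ds in the question.

```python
import pandas as pd

# Sample rows from the question, built in memory for a self-contained example.
df = pd.DataFrame({
    "reservation_num": ["B874576", "H786234", "H786234", "F987354",
                        "N097897", "H567349", "N097897"],
    "rate": [169.95, 258.95, 258.95, 385.95, 449.95, 482.95, 449.95],
    "guest_name": ["Bob Smith", "Jane Doe", "John Doe", "David Jones",
                   "Mark Davis", "Larry Stein", "Sue Miller"],
})

# value_counts gives one count per distinct reservation number ...
counts = df["reservation_num"].value_counts()

# ... and map looks that count up for every row, keeping the original row order.
df["rate_per_person"] = df["rate"] / df["reservation_num"].map(counts)
print(df)
```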
Option 2: groupby, transform, and size
    df.rate / df.groupby('reservation_num').rate.transform('size')

    0    169.950
    1    129.475
    2    129.475
    3    385.950
    4    224.975
    5    482.950
    6    224.975
    dtype: float64
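As a sketch of what transform('size') contributes here: it returns the size of each row's group, aligned to the original index, so it can be kept as its own column before dividing. The column name n_guests is a hypothetical name chosen for illustration, and the data is again built in memory.

```python
import pandas as pd

df = pd.DataFrame({
    "reservation_num": ["B874576", "H786234", "H786234", "F987354",
                        "N097897", "H567349", "N097897"],
    "rate": [169.95, 258.95, 258.95, 385.95, 449.95, 482.95, 449.95],
})

# Size of the group each row belongs to, one value per original row.
# "n_guests" is an illustrative column name, not one from the question.
df["n_guests"] = df.groupby("reservation_num")["rate"].transform("size")

df["rate_per_person"] = df["rate"] / df["n_guests"]
print(df)
```

Keeping the guest count as a column can be convenient if it is useful as a feature in its own right.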
Option 3: np.unique and np.bincount
    u, f = np.unique(df.reservation_num.values, return_inverse=True)
    df.rate / np.bincount(f)[f]

    0    169.950
    1    129.475
    2    129.475
    3    385.950
    4    224.975
    5    482.950
    6    224.975
    dtype: float64
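To make the two NumPy steps easier to follow, this sketch prints the intermediate arrays for the sample reservation numbers. The variable names per_group and per_row are illustrative only.

```python
import numpy as np

reservations = np.array(["B874576", "H786234", "H786234", "F987354",
                         "N097897", "H567349", "N097897"])

# u holds the sorted distinct values; f holds, for each row, the position
# of that row's value inside u (an integer code per row).
u, f = np.unique(reservations, return_inverse=True)

# bincount counts how often each code occurs: one count per distinct value.
per_group = np.bincount(f)

# Indexing the counts with the codes broadcasts them back to one count per row.
per_row = per_group[f]

print(u)          # ['B874576' 'F987354' 'H567349' 'H786234' 'N097897']
print(f)          # [0 3 3 1 4 2 4]
print(per_group)  # [1 1 1 2 2]
print(per_row)    # [1 2 2 1 2 1 2]
```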
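Option 3.5
np.unique sorts, so it does not scale as well as pd.factorize. In the context where I use them, the two do the same job. I therefore use a function that picks between them based on an empirically chosen threshold on the array length, to decide which one is likely to perform better. It is numbered 3.5 because it is essentially the same answer as Option 3.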
    def factor(a):
        if len(a) > 10000:
            return pd.factorize(a)[0]
        else:
            return np.unique(a, return_inverse=True)[1]

    def count(a):
        f = factor(a)
        return np.bincount(f)[f]

    df.rate / count(df.reservation_num.values)

    0    169.950
    1    129.475
    2    129.475
    3    385.950
    4    224.975
    5    482.950
    6    224.975
    dtype: float64
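The two functions can be swapped because only the grouping matters, not the particular integer codes. As a small check on the sample values: pd.factorize numbers the values in order of first appearance while np.unique numbers them in sorted order, yet the per-row counts come out the same.

```python
import numpy as np
import pandas as pd

reservations = np.array(["B874576", "H786234", "H786234", "F987354",
                         "N097897", "H567349", "N097897"], dtype=object)

# Codes in order of first appearance (no sorting involved).
codes_factorize = pd.factorize(reservations)[0]

# Codes following the sorted order of the distinct values.
codes_unique = np.unique(reservations, return_inverse=True)[1]

print(codes_factorize)  # [0 1 1 2 3 4 3]
print(codes_unique)     # [0 3 3 1 4 2 4]

# Different codes, identical per-row group sizes.
counts_factorize = np.bincount(codes_factorize)[codes_factorize]
counts_unique = np.bincount(codes_unique)[codes_unique]
print(counts_factorize)  # [1 2 2 1 2 1 2]
print(counts_unique)     # [1 2 2 1 2 1 2]
```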
Timing
    %timeit df.rate / df.reservation_num.map(df.reservation_num.value_counts())
    %timeit df.rate / df.groupby('reservation_num').rate.transform('size')

    1000 loops, best of 3: 650 µs per loop
    1000 loops, best of 3: 768 µs per loop

    %%timeit
    u, f = np.unique(df.reservation_num.values, return_inverse=True)
    df.rate / np.bincount(f)[f]

    10000 loops, best of 3: 131 µs per loop
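The figures above were taken on the seven-row sample, and the relative ranking may shift with data size, pandas version, and hardware. Below is a rough sketch for repeating the comparison outside IPython with the standard timeit module. The synthetic data, the row count, and the function names are all assumptions made for illustration, and no particular result is implied.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (an assumption): 100,000 rows over up to 20,000 reservations.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "reservation_num": rng.integers(0, 20_000, size=n).astype(str),
    "rate": rng.uniform(100, 500, size=n),
})

def via_map(df):
    counts = df["reservation_num"].value_counts()
    return df["rate"] / df["reservation_num"].map(counts)

def via_transform(df):
    return df["rate"] / df.groupby("reservation_num")["rate"].transform("size")

def via_bincount(df):
    _, f = np.unique(df["reservation_num"].to_numpy(), return_inverse=True)
    return df["rate"] / np.bincount(f)[f]

# Best of 3 repeats, 3 calls per repeat, reported as seconds per call.
timings = {
    fn.__name__: min(timeit.repeat(lambda: fn(df), number=3, repeat=3)) / 3
    for fn in (via_map, via_transform, via_bincount)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```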