Feature engineering with multi-row calculations in Python using Pandas

I have CSV data in the following format:

+-----------------+--------+-------------+
| reservation_num |  rate  | guest_name  |
+-----------------+--------+-------------+
| B874576         | 169.95 | Bob Smith   |
| H786234         | 258.95 | Jane Doe    |
| H786234         | 258.95 | John Doe    |
| F987354         | 385.95 | David Jones |
| N097897         | 449.95 | Mark Davis  |
| H567349         | 482.95 | Larry Stein |
| N097897         | 449.95 | Sue Miller  |
+-----------------+--------+-------------+

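I want to add a feature (column) named 'rate_per_person' to the DataFrame. It is calculated by dividing the rate for a given reservation number by the total number of guests associated with that reservation number.

Here is my code: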

# Importing Libraries
import pandas as pd

# Importing the Dataset
ds = pd.read_csv('hotels.csv')

for index, row in ds.iterrows():
    row['rate_per_person'] = row['rate'] / ds[row['reservation_num']].count

And the error message:

Traceback (most recent call last):
  File "<ipython-input-3-0668a3165e76>", line 2, in <module>
    row['rate_per_person'] = row['rate'] / ds[row['reservation_num']].count
  File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/frame.py", line 2062, in __getitem__
    return self._getitem_column(key)
  File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/frame.py", line 2069, in _getitem_column
    return self._get_item_cache(key)
  File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/generic.py", line 1534, in _get_item_cache
    values = self._data.get(item)
  File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 3590, in get
    loc = self.items.get_loc(item)
  File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2395, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)
  File "pandas/_libs/index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)
  File "pandas/_libs/hashtable_class_helper.pxi", line 1207, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20405)
  File "pandas/_libs/hashtable_class_helper.pxi", line 1215, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20359)
KeyError: 'B874576'

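From the error message, it is clear that the ds[row['reservation_num']].count part of the code is the problem. However, I am not sure how to programmatically get the number of guests for each reservation so that I can create the new feature.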


Answer:
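
For context on the KeyError in the question: indexing a DataFrame with a string looks up a column label, so a reservation number is not found, and a row yielded by iterrows() is a separate Series, so writing to it does not add a column to the frame. A minimal sketch, assuming a few rows built inline in place of hotels.csv (the variable names raised_key_error, guests_h786234, and column_added are illustrative only):

```python
import pandas as pd

# Hypothetical inline copy of a few rows, standing in for hotels.csv.
ds = pd.DataFrame({
    "reservation_num": ["B874576", "H786234", "H786234"],
    "rate": [169.95, 258.95, 258.95],
})

# ds[<string>] looks up a column label, so a reservation number fails.
try:
    ds["B874576"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# Rows are selected with a boolean mask instead; summing it counts matches.
guests_h786234 = (ds["reservation_num"] == "H786234").sum()

# Writing to the row yielded by iterrows() does not add a column to ds.
for index, row in ds.iterrows():
    row["rate_per_person"] = row["rate"]
column_added = "rate_per_person" in ds.columns
```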

Option 1
pd.Series.value_counts, map

df.rate / df.reservation_num.map(df.reservation_num.value_counts())

0    169.950
1    129.475
2    129.475
3    385.950
4    224.975
5    482.950
6    224.975
dtype: float64
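
The expression above returns a Series; to get the column the question asks for, it can be assigned back to the frame. A minimal end-to-end sketch, assuming the sample rows are built inline rather than read from hotels.csv (the intermediate name counts is illustrative):

```python
import pandas as pd

# Sample rows from the question, built inline instead of reading hotels.csv.
df = pd.DataFrame({
    "reservation_num": ["B874576", "H786234", "H786234", "F987354",
                        "N097897", "H567349", "N097897"],
    "rate": [169.95, 258.95, 258.95, 385.95, 449.95, 482.95, 449.95],
    "guest_name": ["Bob Smith", "Jane Doe", "John Doe", "David Jones",
                   "Mark Davis", "Larry Stein", "Sue Miller"],
})

# Number of rows (guests) per reservation, indexed by reservation number.
counts = df["reservation_num"].value_counts()

# Map each row's reservation number to its guest count, then divide.
df["rate_per_person"] = df["rate"] / df["reservation_num"].map(counts)
```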

Option 2
groupby, transform, size

df.rate / df.groupby('reservation_num').rate.transform('size')

0    169.950
1    129.475
2    129.475
3    385.950
4    224.975
5    482.950
6    224.975
dtype: float64
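
transform('size') returns one value per original row, namely the size of that row's group, which is why it can be divided into df.rate directly. A small sketch that keeps the group size as its own column, assuming inline sample data; the column name guests is an illustrative choice, not part of the question:

```python
import pandas as pd

# Sample reservation numbers and rates from the question, built inline.
df = pd.DataFrame({
    "reservation_num": ["B874576", "H786234", "H786234", "F987354",
                        "N097897", "H567349", "N097897"],
    "rate": [169.95, 258.95, 258.95, 385.95, 449.95, 482.95, 449.95],
})

# One group size per row, aligned with the original index.
df["guests"] = df.groupby("reservation_num")["rate"].transform("size")

# The per-person rate then follows from two aligned columns.
df["rate_per_person"] = df["rate"] / df["guests"]
```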

Option 3
np.unique, np.bincount

import numpy as np

u, f = np.unique(df.reservation_num.values, return_inverse=True)
df.rate / np.bincount(f)[f]

0    169.950
1    129.475
2    129.475
3    385.950
4    224.975
5    482.950
6    224.975
dtype: float64
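
To see why np.bincount(f)[f] yields a per-row guest count, it helps to look at the intermediate arrays. A short sketch on the question's reservation numbers, assuming they are held in a plain NumPy array (the names per_group and per_row are illustrative):

```python
import numpy as np

reservation_num = np.array(["B874576", "H786234", "H786234", "F987354",
                            "N097897", "H567349", "N097897"])

# u holds the sorted unique values; f gives, for each row, its position in u.
u, f = np.unique(reservation_num, return_inverse=True)
# u -> ['B874576' 'F987354' 'H567349' 'H786234' 'N097897']
# f -> [0 3 3 1 4 2 4]

# bincount counts how often each integer code occurs: one count per unique value.
per_group = np.bincount(f)   # -> [1 1 1 2 2]

# Indexing the counts with f broadcasts them back to one count per row.
per_row = per_group[f]       # -> [1 2 2 1 2 1 2]
```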

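Option 3.5
np.unique sorts, so it does not scale as well as pd.factorize. For my purposes here, the two do the same job. So I use a function that picks between them using an empirical threshold on the array length to decide which one performs better. It is numbered 3.5 because it is essentially the same answer as 3.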

def factor(a):
    if len(a) > 10000:
        return pd.factorize(a)[0]
    else:
        return np.unique(a, return_inverse=True)[1]

def count(a):
    f = factor(a)
    return np.bincount(f)[f]

df.rate / count(df.reservation_num.values)

0    169.950
1    129.475
2    129.475
3    385.950
4    224.975
5    482.950
6    224.975
dtype: float64
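
The two factorizers are interchangeable here because only the grouping matters, not the integer codes themselves. A small check on the question's reservation numbers, assuming a plain NumPy array as input (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

reservation_num = np.array(["B874576", "H786234", "H786234", "F987354",
                            "N097897", "H567349", "N097897"])

# pd.factorize assigns codes in order of first appearance (no sorting).
codes_factorize = pd.factorize(reservation_num)[0]   # -> [0 1 1 2 3 4 3]

# np.unique assigns codes by position in the sorted unique values.
codes_unique = np.unique(reservation_num, return_inverse=True)[1]  # -> [0 3 3 1 4 2 4]

# The codes differ, but the per-row counts derived from them agree.
counts_factorize = np.bincount(codes_factorize)[codes_factorize]
counts_unique = np.bincount(codes_unique)[codes_unique]
same_counts = np.array_equal(counts_factorize, counts_unique)
```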

Timing

%timeit df.rate / df.reservation_num.map(df.reservation_num.value_counts())
%timeit df.rate / df.groupby('reservation_num').rate.transform('size')

1000 loops, best of 3: 650 µs per loop
1000 loops, best of 3: 768 µs per loop

%%timeit
u, f = np.unique(df.reservation_num.values, return_inverse=True)
df.rate / np.bincount(f)[f]

10000 loops, best of 3: 131 µs per loop
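
The timings above are for the 7-row sample, so the ranking may change on larger inputs. A sketch for repeating the comparison outside IPython with the standard timeit module, assuming synthetic data (100,000 rows with made-up integer reservation ids; the function names with_map, with_transform, and with_bincount are illustrative):

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: made-up integer reservation ids and random rates.
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "reservation_num": rng.integers(0, 20_000, size=100_000),
    "rate": rng.uniform(100, 500, size=100_000),
})

def with_map(d):
    return d.rate / d.reservation_num.map(d.reservation_num.value_counts())

def with_transform(d):
    return d.rate / d.groupby("reservation_num").rate.transform("size")

def with_bincount(d):
    u, f = np.unique(d.reservation_num.values, return_inverse=True)
    return d.rate / np.bincount(f)[f]

# Best of 3 repeats, 3 calls each, reported as seconds per call.
timings = {}
for fn in (with_map, with_transform, with_bincount):
    best = min(timeit.repeat(lambda: fn(big), number=3, repeat=3)) / 3
    timings[fn.__name__] = best
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```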
