How can I use Python to fill null values in a dataset based on the values of two other columns?

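I have a Titanic dataset. It contains several attributes; the ones I mainly work with are 1. Age, 2. Embarked (the port where the passenger boarded; there are three ports: S, Q and C), and 3. Survived (0 means did not survive, 1 means survived).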

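I am filtering out useless data. Next I need to fill the null values in the Age column. To do this, I counted the passengers who survived and who did not survive for each port of embarkation (S, Q and C).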

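I calculated the mean age of the survivors and of the non-survivors for each of the ports S, Q and C. But now I don't know how to fill these six values (three for survivors and three for non-survivors, one per port) into the Age column of the original Titanic data. If I simply use titanic.Age.fillna('one of the six values'), it fills every null age with that single value, which is not what I want.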

After thinking about it for a while, I tried the following.

    titanic[titanic.Survived==1][titanic.Embarked=='S'].Age.fillna(SurvivedS.Age.mean(),inplace=True)
    titanic[titanic.Survived==1][titanic.Embarked=='Q'].Age.fillna(SurvivedQ.Age.mean(),inplace=True)
    titanic[titanic.Survived==1][titanic.Embarked=='C'].Age.fillna(SurvivedC.Age.mean(),inplace=True)
    titanic[titanic.Survived==0][titanic.Embarked=='S'].Age.fillna(DidntSurvivedS.Age.mean(),inplace=True)
    titanic[titanic.Survived==0][titanic.Embarked=='Q'].Age.fillna(DidntSurvivedQ.Age.mean(),inplace=True)
    titanic[titanic.Survived==0][titanic.Embarked=='C'].Age.fillna(DidntSurvivedC.Age.mean(),inplace=True)

This doesn't raise an error, but it still doesn't work. Any suggestions?


Answer:

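I think you need groupby with transform, combined with fillna and mean. (apply was the usual choice on older pandas, but from pandas 2.0 on it prepends the group keys to the result's index, so the result can no longer be assigned straight back to the column.) The column names are lowercase here because they follow seaborn's copy of the dataset; use Age, Survived and Embarked for your own frame.

    # Fill each missing age with the mean age of its (survived, embarked) group
    titanic['age'] = (titanic.groupby(['survived','embarked'])['age']
                             .transform(lambda x: x.fillna(x.mean())))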
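A variation on the same idea, sketched on a tiny made-up frame rather than the real dataset: compute each row's group mean with transform('mean') and hand that Series to fillna, so only the missing ages are touched and no Python-level lambda is needed. The frame `df`, its values and the name `group_mean` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame (not the real Titanic data); lowercase column names assumed.
df = pd.DataFrame({
    "survived": [0, 0, 0, 1, 1, 1],
    "embarked": ["S", "S", "S", "C", "C", "C"],
    "age":      [20.0, 40.0, np.nan, 10.0, np.nan, 30.0],
})

# Mean age of each (survived, embarked) group, broadcast back to every row.
group_mean = df.groupby(["survived", "embarked"])["age"].transform("mean")

# Fill only the missing ages; ages that are already present stay as they are.
df["age"] = df["age"].fillna(group_mean)

print(df)
```

Because fillna aligns on the index, each missing age picks up the mean of its own group rather than one global value.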

    import seaborn as sns

    titanic = sns.load_dataset('titanic')

    # Check the NaN values in the age column
    print (titanic[titanic['age'].isnull()].head(10))

        survived  pclass     sex  age  sibsp  parch      fare embarked   class  \
    5          0       3    male  NaN      0      0    8.4583        Q   Third
    17         1       2    male  NaN      0      0   13.0000        S  Second
    19         1       3  female  NaN      0      0    7.2250        C   Third
    26         0       3    male  NaN      0      0    7.2250        C   Third
    28         1       3  female  NaN      0      0    7.8792        Q   Third
    29         0       3    male  NaN      0      0    7.8958        S   Third
    31         1       1  female  NaN      1      0  146.5208        C   First
    32         1       3  female  NaN      0      0    7.7500        Q   Third
    36         1       3    male  NaN      0      0    7.2292        C   Third
    42         0       3    male  NaN      0      0    7.8958        C   Third

          who  adult_male deck  embark_town alive  alone
    5     man        True  NaN   Queenstown    no   True
    17    man        True  NaN  Southampton   yes   True
    19  woman       False  NaN    Cherbourg   yes   True
    26    man        True  NaN    Cherbourg    no   True
    28  woman       False  NaN   Queenstown   yes   True
    29    man        True  NaN  Southampton    no   True
    31  woman       False    B    Cherbourg   yes  False
    32  woman       False  NaN   Queenstown   yes   True
    36    man        True  NaN    Cherbourg   yes   True
    42    man        True  NaN    Cherbourg    no   True

    # Remember which rows had a missing age before filling
    idx = titanic[titanic['age'].isnull()].index
    titanic['age'] = (titanic.groupby(['survived','embarked'])['age']
                             .transform(lambda x: x.fillna(x.mean())))

    # Check that the values have been replaced
    print (titanic.loc[idx].head(10))

        survived  pclass     sex        age  sibsp  parch      fare embarked  \
    5          0       3    male  30.325000      0      0    8.4583        Q
    17         1       2    male  28.113184      0      0   13.0000        S
    19         1       3  female  28.973671      0      0    7.2250        C
    26         0       3    male  33.666667      0      0    7.2250        C
    28         1       3  female  22.500000      0      0    7.8792        Q
    29         0       3    male  30.203966      0      0    7.8958        S
    31         1       1  female  28.973671      1      0  146.5208        C
    32         1       3  female  22.500000      0      0    7.7500        Q
    36         1       3    male  28.973671      0      0    7.2292        C
    42         0       3    male  33.666667      0      0    7.8958        C

         class    who  adult_male deck  embark_town alive  alone
    5    Third    man        True  NaN   Queenstown    no   True
    17  Second    man        True  NaN  Southampton   yes   True
    19   Third  woman       False  NaN    Cherbourg   yes   True
    26   Third    man        True  NaN    Cherbourg    no   True
    28   Third  woman       False  NaN   Queenstown   yes   True
    29   Third    man        True  NaN  Southampton    no   True
    31   First  woman       False    B    Cherbourg   yes  False
    32   Third  woman       False  NaN   Queenstown   yes   True
    36   Third    man        True  NaN    Cherbourg   yes   True
    42   Third    man        True  NaN    Cherbourg    no   True

    # Check the group means
    print (titanic.groupby(['survived','embarked'])['age'].mean())

    survived  embarked
    0         C           33.666667
              Q           30.325000
              S           30.203966
    1         C           28.973671
              Q           22.500000
              S           28.113184
    Name: age, dtype: float64
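On why the attempt in the question runs without error yet changes nothing: titanic[cond1][cond2] is chained indexing, which returns a new object, so fillna(..., inplace=True) fills a temporary copy that is then thrown away. If you would rather fill one group at a time, a sketch along these lines writes through .loc with a single combined mask instead. The frame and its values are made up, the column names follow the question's capitalization, and `in_group` and `group_mean` are names chosen only for this example.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the question's frame (capitalized column names).
titanic = pd.DataFrame({
    "Survived": [1, 1, 1, 0, 0, 0],
    "Embarked": ["S", "S", "S", "Q", "Q", "Q"],
    "Age":      [22.0, np.nan, 30.0, 40.0, 50.0, np.nan],
})

for survived in (0, 1):
    for port in ("S", "Q", "C"):
        # One combined boolean mask instead of two chained [] lookups.
        in_group = (titanic["Survived"] == survived) & (titanic["Embarked"] == port)
        group_mean = titanic.loc[in_group, "Age"].mean()
        # Assign through .loc so the original frame itself is modified.
        titanic.loc[in_group & titanic["Age"].isna(), "Age"] = group_mean

print(titanic)
```

The loop does the same job as the groupby version above; it is just more explicit about which rows each mean is written to.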
