How can I use Python to fill null values in a dataset column based on the values of two other columns?

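I have a Titanic dataset. It contains many attributes; the ones I mainly work with are 1. Age, 2. Embarked (the port where the passenger boarded; there are three ports: S, Q and C), and 3. Survived (0 means did not survive, 1 means survived).

I am filtering out useless data. Next I need to fill the null values in the Age column. To do that, I counted how many passengers survived and did not survive for each port of embarkation (S, Q and C).

I then computed the mean age of the survivors and of the non-survivors for each of the ports S, Q and C. But now I don't know how to fill these six values (three for survivors and three for non-survivors, one per port) into the Age column of the original Titanic data. If I simply use titanic.Age.fillna('one of the six values'), it fills every missing age with that single value, which is not what I want.

After thinking about it for a while, I tried the following.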

    titanic[titanic.Survived==1][titanic.Embarked=='S'].Age.fillna(SurvivedS.Age.mean(),inplace=True)
    titanic[titanic.Survived==1][titanic.Embarked=='Q'].Age.fillna(SurvivedQ.Age.mean(),inplace=True)
    titanic[titanic.Survived==1][titanic.Embarked=='C'].Age.fillna(SurvivedC.Age.mean(),inplace=True)
    titanic[titanic.Survived==0][titanic.Embarked=='S'].Age.fillna(DidntSurvivedS.Age.mean(),inplace=True)
    titanic[titanic.Survived==0][titanic.Embarked=='Q'].Age.fillna(DidntSurvivedQ.Age.mean(),inplace=True)
    titanic[titanic.Survived==0][titanic.Embarked=='C'].Age.fillna(DidntSurvivedC.Age.mean(),inplace=True)

This raises no error, but it still does not work. Any suggestions?


Answer:

I think you need groupby with apply, combined with fillna and mean:

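    # group_keys=False keeps the original row index, so the result
    # aligns with the DataFrame on assignment (matters on recent pandas versions)
    titanic['age'] = (titanic.groupby(['survived','embarked'], group_keys=False)['age']
                             .apply(lambda x: x.fillna(x.mean())))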
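As for why the attempt in the question runs without error yet changes nothing: an expression such as titanic[mask1][mask2] builds a new, filtered DataFrame, so a fill applied to it never reaches titanic itself. The sketch below illustrates this on a tiny made-up frame (the values are hypothetical, not real Titanic rows) and then shows a .loc-based assignment for a single group. To stay clear of version-specific warnings, it reproduces the pattern without inplace=True.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "Survived": [1, 1, 0, 0],
    "Embarked": ["S", "S", "S", "S"],
    "Age": [20.0, np.nan, 40.0, np.nan],
})

# Filtering with boolean masks returns a new object. Filling that
# object leaves the original df untouched.
subset = df[df.Survived == 1]
subset = subset[subset.Embarked == "S"]
subset = subset.assign(Age=subset.Age.fillna(subset.Age.mean()))
missing_after_chained = int(df.Age.isna().sum())  # df still has its NaNs

# Writing through .loc on the original frame does modify it.
in_group = (df.Survived == 1) & (df.Embarked == "S")
df.loc[in_group & df.Age.isna(), "Age"] = df.loc[in_group, "Age"].mean()
missing_after_loc = int(df.Age.isna().sum())
```

Repeating the .loc assignment for all six groups would work, but a groupby handles every combination in one step.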

    import seaborn as sns

    titanic = sns.load_dataset('titanic')

    # check the NaN values in the age column
    print(titanic[titanic['age'].isnull()].head(10))

        survived  pclass     sex  age  sibsp  parch      fare embarked   class  \
    5          0       3    male  NaN      0      0    8.4583        Q   Third
    17         1       2    male  NaN      0      0   13.0000        S  Second
    19         1       3  female  NaN      0      0    7.2250        C   Third
    26         0       3    male  NaN      0      0    7.2250        C   Third
    28         1       3  female  NaN      0      0    7.8792        Q   Third
    29         0       3    male  NaN      0      0    7.8958        S   Third
    31         1       1  female  NaN      1      0  146.5208        C   First
    32         1       3  female  NaN      0      0    7.7500        Q   Third
    36         1       3    male  NaN      0      0    7.2292        C   Third
    42         0       3    male  NaN      0      0    7.8958        C   Third

          who  adult_male deck  embark_town alive  alone
    5     man        True  NaN   Queenstown    no   True
    17    man        True  NaN  Southampton   yes   True
    19  woman       False  NaN    Cherbourg   yes   True
    26    man        True  NaN    Cherbourg    no   True
    28  woman       False  NaN   Queenstown   yes   True
    29    man        True  NaN  Southampton    no   True
    31  woman       False    B    Cherbourg   yes  False
    32  woman       False  NaN   Queenstown   yes   True
    36    man        True  NaN    Cherbourg   yes   True
    42    man        True  NaN    Cherbourg    no   True

    # remember which rows had a missing age
    idx = titanic[titanic['age'].isnull()].index

    # group_keys=False keeps the original row index, so the result
    # aligns with the DataFrame on assignment (matters on recent pandas versions)
    titanic['age'] = (titanic.groupby(['survived','embarked'], group_keys=False)['age']
                             .apply(lambda x: x.fillna(x.mean())))

    # check that the values have been replaced
    print(titanic.loc[idx].head(10))

        survived  pclass     sex        age  sibsp  parch      fare embarked  \
    5          0       3    male  30.325000      0      0    8.4583        Q
    17         1       2    male  28.113184      0      0   13.0000        S
    19         1       3  female  28.973671      0      0    7.2250        C
    26         0       3    male  33.666667      0      0    7.2250        C
    28         1       3  female  22.500000      0      0    7.8792        Q
    29         0       3    male  30.203966      0      0    7.8958        S
    31         1       1  female  28.973671      1      0  146.5208        C
    32         1       3  female  22.500000      0      0    7.7500        Q
    36         1       3    male  28.973671      0      0    7.2292        C
    42         0       3    male  33.666667      0      0    7.8958        C

         class    who  adult_male deck  embark_town alive  alone
    5    Third    man        True  NaN   Queenstown    no   True
    17  Second    man        True  NaN  Southampton   yes   True
    19   Third  woman       False  NaN    Cherbourg   yes   True
    26   Third    man        True  NaN    Cherbourg    no   True
    28   Third  woman       False  NaN   Queenstown   yes   True
    29   Third    man        True  NaN  Southampton    no   True
    31   First  woman       False    B    Cherbourg   yes  False
    32   Third  woman       False  NaN   Queenstown   yes   True
    36   Third    man        True  NaN    Cherbourg   yes   True
    42   Third    man        True  NaN    Cherbourg    no   True

    # check the group means
    print(titanic.groupby(['survived','embarked'])['age'].mean())

    survived  embarked
    0         C           33.666667
              Q           30.325000
              S           30.203966
    1         C           28.973671
              Q           22.500000
              S           28.113184
    Name: age, dtype: float64
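One caveat with assigning a groupby result back to the whole column: by default groupby leaves out rows whose key is missing, so a row with no embarked value belongs to no group and its age can come back as NaN even if it was known. A possible alternative, sketched here on a small made-up frame (the values are hypothetical), is to compute the per-group mean with transform('mean') and pass it to fillna, which only touches the ages that were missing in the first place.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; the last row has no embarked value.
toy = pd.DataFrame({
    "survived": [0, 0, 0, 1, 1, 1, 1],
    "embarked": ["S", "S", "S", "C", "C", "C", np.nan],
    "age": [30.0, np.nan, 40.0, 20.0, 24.0, np.nan, 50.0],
})

# Per-row mean age of the (survived, embarked) group each row belongs to.
group_mean = toy.groupby(["survived", "embarked"])["age"].transform("mean")

# fillna only replaces missing ages; known ages are left as they are,
# including the row whose embarked value is missing.
toy["age"] = toy["age"].fillna(group_mean)
```

On the real data the same two lines would apply with titanic in place of toy. A missing age in a row that also lacks an embarked value would stay NaN and would need a separate fallback, for example the overall mean.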
