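I have a Titanic dataset. It has several attributes, and I am mainly working with these three:
1. Age
2. Embarked (the port where the passenger boarded; there are three ports: S, Q and C)
3. Survived (0 means did not survive, 1 means survived)
I am filtering out the data I don't need, and then I need to fill the null values in the Age column. To do that, I counted how many passengers survived and did not survive for each port of embarkation (S, Q and C).
I then calculated the mean age of the survivors and of the non-survivors for each of S, Q and C. That gives six values: three for survivors and three for non-survivors. Now I don't know how to fill these six values into the original Titanic Age column. If I simply call titanic.Age.fillna(<one of the six values>), every null age is filled with that single value, which is not what I want.
After thinking about it for a while, I tried the following.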
    titanic[titanic.Survived==1][titanic.Embarked=='S'].Age.fillna(SurvivedS.Age.mean(),inplace=True)
    titanic[titanic.Survived==1][titanic.Embarked=='Q'].Age.fillna(SurvivedQ.Age.mean(),inplace=True)
    titanic[titanic.Survived==1][titanic.Embarked=='C'].Age.fillna(SurvivedC.Age.mean(),inplace=True)
    titanic[titanic.Survived==0][titanic.Embarked=='S'].Age.fillna(DidntSurvivedS.Age.mean(),inplace=True)
    titanic[titanic.Survived==0][titanic.Embarked=='Q'].Age.fillna(DidntSurvivedQ.Age.mean(),inplace=True)
    titanic[titanic.Survived==0][titanic.Embarked=='C'].Age.fillna(DidntSurvivedC.Age.mean(),inplace=True)
This raises no error, but it still doesn't work. Any suggestions?

Answer:
I think you need groupby and apply, combined with fillna and mean:
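    # group_keys=False keeps the original row index on the result,
    # so it aligns with the DataFrame when assigned back (needed on pandas 2.x)
    titanic['age'] = (
        titanic.groupby(['survived', 'embarked'], group_keys=False)['age']
               .apply(lambda x: x.fillna(x.mean()))
    )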
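An equivalent way to write the same idea uses transform, which always returns a result aligned to the original rows. The following is only a minimal sketch on a made-up toy frame (the values and the `age_filled` column name are illustrative assumptions, not part of the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data; the values are invented for illustration.
toy = pd.DataFrame({
    "survived": [0, 0, 0, 1, 1, 1],
    "embarked": ["S", "S", "S", "C", "C", "C"],
    "age": [20.0, 40.0, np.nan, 10.0, np.nan, 30.0],
})

# For every row, the mean age of that row's (survived, embarked) group.
group_mean = toy.groupby(["survived", "embarked"])["age"].transform("mean")

# Only missing ages are replaced; existing ages are left untouched.
toy["age_filled"] = toy["age"].fillna(group_mean)
print(toy)
```

Writing to a separate column here just makes it easy to compare before and after; assigning back to `age` works the same way.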
    import seaborn as sns

    titanic = sns.load_dataset('titanic')
    # Check for NaN values in the age column
    print(titanic[titanic['age'].isnull()].head(10))

        survived  pclass     sex  age  sibsp  parch      fare  embarked   class  \
    5          0       3    male  NaN      0      0    8.4583         Q   Third
    17         1       2    male  NaN      0      0   13.0000         S  Second
    19         1       3  female  NaN      0      0    7.2250         C   Third
    26         0       3    male  NaN      0      0    7.2250         C   Third
    28         1       3  female  NaN      0      0    7.8792         Q   Third
    29         0       3    male  NaN      0      0    7.8958         S   Third
    31         1       1  female  NaN      1      0  146.5208         C   First
    32         1       3  female  NaN      0      0    7.7500         Q   Third
    36         1       3    male  NaN      0      0    7.2292         C   Third
    42         0       3    male  NaN      0      0    7.8958         C   Third

          who  adult_male  deck  embark_town  alive  alone
    5     man        True   NaN   Queenstown     no   True
    17    man        True   NaN  Southampton    yes   True
    19  woman       False   NaN    Cherbourg    yes   True
    26    man        True   NaN    Cherbourg     no   True
    28  woman       False   NaN   Queenstown    yes   True
    29    man        True   NaN  Southampton     no   True
    31  woman       False     B    Cherbourg    yes  False
    32  woman       False   NaN   Queenstown    yes   True
    36    man        True   NaN    Cherbourg    yes   True
    42    man        True   NaN    Cherbourg     no   True
    # Remember which rows had a missing age before filling
    idx = titanic[titanic['age'].isnull()].index
    titanic['age'] = (
        titanic.groupby(['survived', 'embarked'], group_keys=False)['age']
               .apply(lambda x: x.fillna(x.mean()))
    )
    # Check that the values have been replaced
    print(titanic.loc[idx].head(10))

        survived  pclass     sex        age  sibsp  parch      fare  embarked  \
    5          0       3    male  30.325000      0      0    8.4583         Q
    17         1       2    male  28.113184      0      0   13.0000         S
    19         1       3  female  28.973671      0      0    7.2250         C
    26         0       3    male  33.666667      0      0    7.2250         C
    28         1       3  female  22.500000      0      0    7.8792         Q
    29         0       3    male  30.203966      0      0    7.8958         S
    31         1       1  female  28.973671      1      0  146.5208         C
    32         1       3  female  22.500000      0      0    7.7500         Q
    36         1       3    male  28.973671      0      0    7.2292         C
    42         0       3    male  33.666667      0      0    7.8958         C

         class    who  adult_male  deck  embark_town  alive  alone
    5    Third    man        True   NaN   Queenstown     no   True
    17  Second    man        True   NaN  Southampton    yes   True
    19   Third  woman       False   NaN    Cherbourg    yes   True
    26   Third    man        True   NaN    Cherbourg     no   True
    28   Third  woman       False   NaN   Queenstown    yes   True
    29   Third    man        True   NaN  Southampton     no   True
    31   First  woman       False     B    Cherbourg    yes  False
    32   Third  woman       False   NaN   Queenstown    yes   True
    36   Third    man        True   NaN    Cherbourg    yes   True
    42   Third    man        True   NaN    Cherbourg     no   True
    # Check the group means
    print(titanic.groupby(['survived','embarked'])['age'].mean())

    survived  embarked
    0         C           33.666667
              Q           30.325000
              S           30.203966
    1         C           28.973671
              Q           22.500000
              S           28.113184
    Name: age, dtype: float64
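As for why the attempt in the question ran without error but changed nothing: an expression like titanic[mask1][mask2] builds a new filtered object, so fillna(..., inplace=True) fills that temporary copy and never reaches the original DataFrame. If you prefer to keep the explicit per-group approach, one option is to combine the conditions into a single mask and write back through .loc. This is only a sketch on a made-up toy frame that uses the question's column names:

```python
import numpy as np
import pandas as pd

# Toy frame with the question's column names; the values are invented.
toy = pd.DataFrame({
    "Survived": [1, 1, 1, 0, 0],
    "Embarked": ["S", "S", "Q", "S", "S"],
    "Age": [20.0, np.nan, np.nan, 40.0, np.nan],
})

for survived in (0, 1):
    for port in ("S", "Q", "C"):
        # One combined boolean mask instead of two chained selections.
        in_group = (toy["Survived"] == survived) & (toy["Embarked"] == port)
        group_mean = toy.loc[in_group, "Age"].mean()
        # Assign through .loc so the original frame itself is modified.
        toy.loc[in_group & toy["Age"].isna(), "Age"] = group_mean

print(toy)
```

A group with no known ages at all (the single Survived=1, Embarked=Q row in this toy frame) has no mean to use, so its missing value stays NaN.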