Suppose I have the following DataFrame:
   Survived  Pclass     Sex   Age     Fare
0         0       3    male  22.0   7.2500
1         1       1  female  38.0  71.2833
2         1       3  female  26.0   7.9250
3         1       1  female  35.0  53.1000
4         0       3    male  35.0   8.0500
I use get_dummies() to create dummy variables. The code and output are as follows:
one_hot = pd.get_dummies(dataset, columns = ['Sex'])
This returns:
   Survived  Pclass  Age     Fare  Sex_female  Sex_male
0         0       3   22   7.2500           0         1
1         1       1   38  71.2833           1         0
2         1       3   26   7.9250           1         0
3         1       1   35  53.1000           1         0
4         0       3   35   8.0500           0         1
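What I want is a single Sex column with values 0 or 1, not two columns.
Interestingly, when I use get_dummies() on a different DataFrame, it works exactly as I expect.
For the following DataFrame: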
  Category  Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham  Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup final...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...
Using the code:
one_hot = pd.get_dummies(dataset, columns = ['Category'])
It returns:
   Message                                            ...  Category_spam
0  Go until jurong point, crazy.. Available only ...  ...              0
1  Ok lar... Joking wif u oni...                      ...              0
2  Free entry in 2 a wkly comp to win FA Cup fina...  ...              1
3  U dun say so early hor... U c already then say...  ...              0
4  Nah I don't think he goes to usf, he lives aro...  ...              0
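- Why does get_dummies() behave differently on these two DataFrames?
- How can I make sure I get the second kind of output every time?
Answer:
Here are a few ways to do this: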
# Using scikit-learn
# Note: LabelEncoder assigns codes in sorted label order, so female -> 0 and male -> 1
from sklearn.preprocessing import LabelEncoder
lbl = LabelEncoder()
df['Sex_encoded'] = lbl.fit_transform(df['Sex'])

# Using pandas only (the output below matches this mapping)
df['Sex_encoded'] = df['Sex'].map({'male': 0, 'female': 1})

   Survived  Pclass     Sex   Age     Fare  Sex_encoded
0         0       3    male  22.0   7.2500            0
1         1       1  female  38.0  71.2833            1
2         1       3  female  26.0   7.9250            1
3         1       1  female  35.0  53.1000            1
4         0       3    male  35.0   8.0500            0
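
If you would rather stay with get_dummies(), one more option is its drop_first parameter, which drops the first level of each encoded column and leaves a single indicator column. The sketch below is only an illustration: the frames `titanic` and `sms` are hypothetical stand-ins rebuilt from the values in the question, the message text is placeholder data, and dtype=int is passed because recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical stand-ins for the two DataFrames in the question.
titanic = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})
sms = pd.DataFrame({
    "Category": ["ham", "ham", "spam", "ham", "ham"],
    "Message": ["msg 0", "msg 1", "msg 2", "msg 3", "msg 4"],  # placeholder text
})

# Without drop_first, a two-level column yields two dummy columns.
full = pd.get_dummies(sms, columns=["Category"], dtype=int)
full_columns = list(full.columns)

# drop_first=True drops the first level in sorted order ("female" here),
# leaving one 0/1 column named Sex_male.
single = pd.get_dummies(titanic, columns=["Sex"], drop_first=True, dtype=int)
single_columns = list(single.columns)
sex_male = single["Sex_male"].tolist()

print(full_columns)
print(single_columns)
print(sex_male)
```

As for why the two outputs look different: the `...` in the second output is most likely pandas truncating a wide display, so a Category_ham column is probably present but hidden. Checking `one_hot.columns` on the real data would confirm this. Under that reading, get_dummies() behaved the same way on both DataFrames.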