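Hi, I'm working on a machine learning project where the dataset contains both numeric and text (categorical) values. I managed to convert the text values to numbers with sklearn's LabelEncoder(), but I can't get all of the required values into the "X" and "y" variables. Here is my code: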
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn import preprocessing
from sklearn.metrics import accuracy_score

data = pd.read_csv('data-set.csv')

num_val = preprocessing.LabelEncoder()
gender = num_val.fit_transform(list(data['gender']))
ever_married = num_val.fit_transform(list(data['ever_married']))
work_type = num_val.fit_transform(list(data['work_type']))
Residence_type = num_val.fit_transform(list(data['Residence_type']))
smoking_status = num_val.fit_transform(list(data['smoking_status']))

predict = "stroke"

X = list(zip(gender, ever_married, work_type, Residence_type, smoking_status))
y = data['stroke']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

model = SVC()
model.fit(X_train, y_train)
pred = model.predict(X_test)
acc = accuracy_score(y_test, pred)
print(acc)
The dataset I'm using is here
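How can I put all of the values from the dataset, both the converted ones and the numeric ones that were left unchanged, into the "X" variable and the other variables? Please help.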
Answer:
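Use Pandas' apply function (the example below passes it a helper named transform) together with the code you already have, but run it on the list of columns you want to convert in the original dataframe (data). Next, drop the target column (stroke in this particular dataset) from the dataframe to create the X variable. You also need to fill the NaN values in bmi with a value that makes sense for your analysis; otherwise the fit function will raise a ValueError.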
...
data = pd.read_csv('healthcare-dataset-stroke-data.csv')
print(data.head())

def transform(series):
    # Encode one column of labels as integers
    num_val = preprocessing.LabelEncoder()
    np_array = num_val.fit_transform(list(series))
    return pd.Series(np_array)

t_list = ["gender", "ever_married", "work_type", "Residence_type", "smoking_status"]
data[t_list] = data[t_list].apply(transform)
print(data.head())

predict = "stroke"

X = data.drop(columns=['stroke'])
# Fill the NaN values in "bmi" with a value that makes sense for the analysis
X = X.fillna(X.median())
y = data['stroke']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
...
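The snippet above elides its imports and needs the CSV file to run. As a rough, self-contained sketch of the same steps, the version below builds a small made-up dataframe instead. The rows, the reduced column set, the helper name encode_column, and random_state=0 are all assumptions for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Hypothetical stand-in for the CSV: a few of the real column names, made-up rows.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "gender": rng.choice(["Male", "Female"], size=n),
    "age": rng.integers(20, 85, size=n).astype(float),
    "work_type": rng.choice(["Private", "Self-employed", "Govt_job"], size=n),
    "bmi": rng.normal(28.0, 5.0, size=n),
    "stroke": rng.integers(0, 2, size=n),
})
data.loc[::10, "bmi"] = np.nan  # inject some missing values, as in the real bmi column


def encode_column(series):
    # Hypothetical helper: integer-encode one column, keeping the original index.
    return pd.Series(LabelEncoder().fit_transform(series), index=series.index)


# Encode only the text columns; numeric columns stay as they are.
cat_cols = ["gender", "work_type"]
data[cat_cols] = data[cat_cols].apply(encode_column)

# Every column except the target becomes a feature.
X = data.drop(columns=["stroke"])
X = X.fillna(X.median())  # the median is one possible fill value, not the only one
y = data["stroke"]

# random_state is fixed here only so the sketch is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

model = SVC()
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(X.shape, acc)
```

Note that in the answer's code X still contains the id column, since only stroke is dropped. If id is just a row identifier, it may be worth dropping it as well, for example with data.drop(columns=['id', 'stroke']).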
Original dataframe
      id  gender   age  ...      work_type  Residence_type  avg_glucose_level   bmi   smoking_status  stroke
0   9046    Male  67.0  ...        Private           Urban             228.69  36.6  formerly smoked       1
1  51676  Female  61.0  ...  Self-employed           Rural             202.21   NaN     never smoked       1
2  31112    Male  80.0  ...        Private           Rural             105.92  32.5     never smoked       1
3  60182  Female  49.0  ...        Private           Urban             171.23  34.4           smokes       1
4   1665  Female  79.0  ...  Self-employed           Rural             174.12  24.0     never smoked       1
Transformed dataframe
      id  gender   age  ...  work_type  Residence_type  avg_glucose_level   bmi  smoking_status  stroke
0   9046       1  67.0  ...          2               1             228.69  36.6               1       1
1  51676       0  61.0  ...          3               0             202.21   NaN               2       1
2  31112       1  80.0  ...          2               0             105.92  32.5               2       1
3  60182       0  49.0  ...          2               1             171.23  34.4               3       1
4   1665       0  79.0  ...          3               0             174.12  24.0               2       1
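The transformed dataframe only shows integer codes, so it can help to check which code stands for which label. The sketch below does this for the five gender values printed above. Keeping the fitted encoder around is an assumption on my part: the transform helper in the answer creates a new LabelEncoder per column and discards it.

```python
from sklearn.preprocessing import LabelEncoder

# The gender values from the five rows printed above.
gender = ["Male", "Female", "Male", "Female", "Female"]

encoder = LabelEncoder()
codes = encoder.fit_transform(gender)

# classes_ holds the sorted unique labels; the label at position i is encoded as i.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels.
print(list(encoder.inverse_transform(codes)))
```

Because the labels are sorted before codes are assigned, the codes depend on every distinct value in the column, not only on the rows shown by head().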