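I am trying to build a model that predicts the probability of an athlete winning a medal. I have a data frame like the one shown below:
Here is what I have done so far:
# Clean the data frame
# Replace NaN with the mean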
df['Height'].fillna(value=df['Height'].mean(), inplace=True)
df['Weight'].fillna(value=df['Weight'].mean(), inplace=True)
# Change the type to integer
df.Height = df.Height.astype(int)
df.Weight = df.Weight.astype(int)
# Target variable
y= df["Medal"]
# Male = 0, female = 1
df['Sex'] = df['Sex'].apply(lambda x: 1 if str(x) != 'M' else 0)
# Predictor features
feature_names = ["Age", "Sex", "Height", "Weight"]
X= df[feature_names]
# Regressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
regressor = DecisionTreeRegressor(random_state=0)
cross_val_score(regressor, X, y, cv=10)
But when I run the code, it raises an error:
warnings.warn("Estimator fit failed. The score on this train-test"
C:\Users\miss_\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py:610: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
  File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 593, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 1247, in fit
    super().fit(
  File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 156, in fit
    X, y = self._validate_data(X, y,
  File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\base.py", line 430, in _validate_data
    X = check_array(X, **check_X_params)
  File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 663, in check_array
    _assert_all_finite(array,
  File "C:\Users\miss_\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 103, in _assert_all_finite
    raise ValueError(
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
It also returns an array like this: array[NaN, NaN, NaN...]
My X looks like this:
Age Sex Height Weight
0 24.0 1 180 80
1 23.0 1 170 60
2 24.0 1 175 70
3 34.0 1 175 70
4 21.0 1 185 82
... ... ... ... ...
271111 29.0 1 179 89
271112 27.0 1 176 59
271113 27.0 1 176 59
271114 30.0 1 185 96
271115 34.0 1 185 96
And my y:
0 0
1 0
2 0
3 1
4 0
..
271111 0
271112 0
271113 0
271114 0
271115 0
Name: Medal, Length: 271116, dtype: int64
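Answer:
You have already filled the missing values for "Height" and "Weight", but not for "Age". It is the only feature left untouched, so any NaN remaining in it ends up in X and triggers the ValueError. You should do the same for the "Age" feature.
First, locate the missing values in that column:
>>> df.loc[df['Age'].isna(), ['ID', 'Name', 'Age']]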
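To see at a glance which features still contain NaN, and how many, a per-column count can help. The following is only a minimal sketch: the small data frame is made up for illustration and stands in for the real `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data frame (values are made up for illustration)
df = pd.DataFrame({
    "Age": [24.0, np.nan, 21.0, np.nan],
    "Sex": [0, 0, 1, 1],
    "Height": [180, 170, 175, 185],
    "Weight": [80, 60, 70, 82],
})

feature_names = ["Age", "Sex", "Height", "Weight"]

# Number of missing values in each feature column
missing_counts = df[feature_names].isna().sum()
print(missing_counts)

# Fraction of rows where Age is missing
missing_share = df["Age"].isna().mean()
print(missing_share)
```

The fraction gives a rough idea of whether you are in the "few missing values" case or not.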
If you only have a few missing values, you can fill them with the mean:
>>> df['Age'] = df['Age'].fillna(df['Age'].mean())
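As an alternative to filling the column by hand, the mean imputation can be placed inside a scikit-learn pipeline, so that the mean is computed on the training part of each fold only. This is a sketch under assumptions, not the original poster's setup: the data below is randomly generated, and only the column names are taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (random values, for illustration only)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Age": rng.integers(18, 40, n).astype(float),
    "Sex": rng.integers(0, 2, n),
    "Height": rng.integers(150, 200, n),
    "Weight": rng.integers(50, 100, n),
})
# Knock out every tenth Age value to mimic the missing data
X.loc[X.index % 10 == 0, "Age"] = np.nan
y = pd.Series(rng.integers(0, 2, n), name="Medal")

# Impute with the column mean, then fit the tree; both steps run per fold
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    DecisionTreeRegressor(random_state=0),
)
scores = cross_val_score(model, X, y, cv=10)
print(scores)
```

With the imputer in front of the regressor, the scores are finite numbers instead of NaN, even though X itself still contains missing values.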
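But if you have many missing values, filling them with the global mean may not be a good idea. "Age" may be influenced by "Country", "Sport", "Year", or even "Season" (winter or summer). In fact, the same holds for Height/Weight: the average height in volleyball is probably different from that in archery…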
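One way to act on this is to fill each gap with the mean of its own group rather than the global mean. The sketch below groups by "Sport" only and uses made-up values; which columns to group by in the real data is a judgment call.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values): two sports with clearly different heights
df = pd.DataFrame({
    "Sport": ["Volleyball", "Volleyball", "Volleyball",
              "Archery", "Archery", "Archery"],
    "Height": [200.0, np.nan, 190.0, 170.0, 174.0, np.nan],
})

# Mean height within each sport, broadcast back onto the original rows
sport_mean = df.groupby("Sport")["Height"].transform("mean")

# Fill each gap with its sport's mean; fall back to the global mean
# for any sport that has no observed heights at all
df["Height"] = df["Height"].fillna(sport_mean).fillna(df["Height"].mean())
print(df)
```

Here the missing volleyball height becomes 195.0 and the missing archery height becomes 172.0, whereas the global mean would have assigned 183.5 to both.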