When I use logistic regression in my Titanic model, PyCharm reports that I must pass a DataFrame containing only boolean values:
Traceback (most recent call last):
  File "C:/Users/security/Downloads/AP/Titanic-Kaggle/TItanic-Kaggle.py", line 29, in <module>
    predictions = logReg.predict(test[test_data])
  File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\pandas\core\frame.py", line 2914, in __getitem__
    return self._getitem_frame(key)
  File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\pandas\core\frame.py", line 3009, in _getitem_frame
    raise ValueError('Must pass DataFrame with boolean values only')
ValueError: Must pass DataFrame with boolean values only
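I don't understand why this happens, because I used exactly the same features when training the model, and it worked fine then. Here is my code (please ignore the code duplication; I will fix that later):
Specifically, Python has a problem with this snippet: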
test_data = test[['Pclass', 'Sex', 'Relatives', 'Fare', 'Age', 'Embarked', 'HasCabin']]
...
predictions = logReg.predict(test[test_data])
Update
I have changed the predictions variable to the following:
predictions = logReg.predict(test_data)
Now my stack trace looks like this:
Traceback (most recent call last):
  File "C:/Users/security/Downloads/AP/Titanic-Kaggle/TItanic-Kaggle.py", line 29, in <module>
    predictions = logReg.predict(test_data)
  File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\sklearn\linear_model\base.py", line 281, in predict
    scores = self.decision_function(X)
  File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\sklearn\linear_model\base.py", line 257, in decision_function
    X = check_array(X, accept_sparse='csr')
  File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\sklearn\utils\validation.py", line 573, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "C:\Users\security\Anaconda3\envs\TItanic-Kaggle.py\lib\site-packages\sklearn\utils\validation.py", line 56, in _assert_all_finite
    raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
This suggests that the feature selection/engineering was not applied properly to my test data.
Answer:
Predicting with x_validate works without any problem. Try it:
>>> predictions = logReg.predict(x_validate)
So something must be wrong with test_data. Get some information about the two data frames and compare them:
>>> x_validate.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 197 entries, 495 to 45
Data columns (total 7 columns):
Pclass       197 non-null int64
Sex          197 non-null int64
Relatives    197 non-null int64
Fare         197 non-null float64
Age          197 non-null float64
Embarked     197 non-null int64
HasCabin     197 non-null int64
dtypes: float64(2), int64(5)
memory usage: 12.3 KB
>>> test_data.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
Pclass       418 non-null int64
Sex          418 non-null int64
Relatives    418 non-null int64
Fare         417 non-null float64
Age          418 non-null float64
Embarked     418 non-null int64
HasCabin     418 non-null int64
dtypes: float64(2), int64(5)
memory usage: 22.9 KB
It looks like there is one NaN value here (417 non-null values out of 418 entries):
Fare 417 non-null float64
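
One possible way to deal with this is to count the missing values per column and then fill the gap in Fare before calling predict. The sketch below is only an illustration: the small train and test_data frames are made-up stand-ins for the real Titanic data, the features list is shortened to two columns, and filling with the training median is an assumed choice (the original post does not say how the missing fare should be handled).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the real frames; the values are invented for illustration.
train = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2],
    "Fare": [71.3, 7.9, 13.0, 8.1, 53.1, 26.0],
    "Survived": [1, 0, 1, 0, 1, 0],
})
test_data = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Fare": [7.8, np.nan, 21.0],  # one missing fare, as in the real test set
})

features = ["Pclass", "Fare"]  # shortened feature list (assumption)
logReg = LogisticRegression().fit(train[features], train["Survived"])

# Count missing values per column to confirm where the NaN is.
nan_counts = test_data.isna().sum()
print(nan_counts)

# Fill the missing fare with the median fare of the training data
# (an assumed strategy; dropping the row or using the mean are alternatives).
fare_median = train["Fare"].median()
test_filled = test_data.copy()
test_filled["Fare"] = test_filled["Fare"].fillna(fare_median)

# With no NaN left, predict no longer raises the ValueError.
predictions = logReg.predict(test_filled[features])
print(predictions)
```

Using a statistic from the training data rather than from the test data keeps the preprocessing consistent with what the model saw during fitting.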