I have two CSV files (a training set and a test set). Several columns contain NaN values (status, hedge_value, indicator_code, portfolio_id, desk_id, office_id).
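I started by replacing the NaN values in each of those columns with some large placeholder value. Then I applied LabelEncoding to get rid of the text data and convert it to numeric data. Now, when I try to apply OneHotEncoding to the categorical data, I get an error. I tried passing the inputs to the OneHotEncoding constructor one column at a time, but I get the same error for every column.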
Basically, my end goal is to predict the return value, but I am stuck on this problem in the data preprocessing step. How can I solve it?
I am using Python 3.6 with Pandas and Sklearn for data processing.
Code
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

test_data = pd.read_csv('test.csv')
train_data = pd.read_csv('train.csv')

# Replace NaN values
train_data['status']=train_data['status'].fillna(2.0)
train_data['hedge_value']=train_data['hedge_value'].fillna(2.0)
train_data['indicator_code']=train_data['indicator_code'].fillna(2.0)
train_data['portfolio_id']=train_data['portfolio_id'].fillna('PF99999999')
train_data['desk_id']=train_data['desk_id'].fillna('DSK99999999')
train_data['office_id']=train_data['office_id'].fillna('OFF99999999')

x_train = train_data.iloc[:, :-1].values
y_train = train_data.iloc[:, 17].values

# =============================================================================
# from sklearn.preprocessing import Imputer
# imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
# imputer.fit(x_train[:, 15:17])
# x_train[:, 15:17] = imputer.fit_transform(x_train[:, 15:17])
#
# imputer.fit(x_train[:, 12:13])
# x_train[:, 12:13] = imputer.fit_transform(x_train[:, 12:13])
# =============================================================================

# Encode categorical data, i.e. text data. Computation only works on numbers,
# so text such as country names or purchase status causes trouble.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
x_train[:, 0] = labelencoder_X.fit_transform(x_train[:, 0])
x_train[:, 1] = labelencoder_X.fit_transform(x_train[:, 1])
x_train[:, 2] = labelencoder_X.fit_transform(x_train[:, 2])
x_train[:, 3] = labelencoder_X.fit_transform(x_train[:, 3])
x_train[:, 6] = labelencoder_X.fit_transform(x_train[:, 6])
x_train[:, 8] = labelencoder_X.fit_transform(x_train[:, 8])
x_train[:, 14] = labelencoder_X.fit_transform(x_train[:, 14])

# =============================================================================
# import numpy as np
# x_train[:, 3] = x_train[:, 3].reshape(x_train[:, 3].size,1)
# x_train[:, 3] = x_train[:, 3].astype(np.float64, copy=False)
# np.isnan(x_train[:, 3]).any()
# =============================================================================

# =============================================================================
# from sklearn.preprocessing import StandardScaler
# sc_X = StandardScaler
# x_train = sc_X.fit_transform(x_train)
# =============================================================================

onehotencoder = OneHotEncoder(categorical_features=[0,1,2,3,6,8,14])
x_train = onehotencoder.fit_transform(x_train).toarray() # Replace country names with one-hot encoding
Error
Traceback (most recent call last):
  File "<ipython-input-4-4992bf3d00b8>", line 58, in <module>
    x_train = onehotencoder.fit_transform(x_train).toarray() # Replace country names with one-hot encoding
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 2019, in fit_transform
    self.categorical_features, copy=True)
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/data.py", line 1809, in _transform_selected
    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 453, in check_array
    _assert_all_finite(array)
  File "/Users/parthapratimneog/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 44, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Answer:
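After posting the question, I went through the dataset again and found another column containing NaN. I can't believe I wasted so much time on this when I could have used a Pandas function to get the list of columns that contain NaN. Using the code below, I found that I had in fact missed three columns. I had been searching for NaN by eye when I could simply have used this function. After handling these newly found NaN values, the code worked properly.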
pd.isnull(train_data).sum() > 0
Result
portfolio_id      False
desk_id           False
office_id         False
pf_category       False
start_date        False
sold               True
country_code      False
euribor_rate      False
currency          False
libor_rate         True
bought             True
creation_date     False
indicator_code    False
sell_date         False
type              False
hedge_value       False
status            False
return            False
dtype: bool
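
The same check can be narrowed down so that it returns only the names of the affected columns, which is easier to read than a full True/False listing. The sketch below is only an illustration: the small DataFrame is a made-up stand-in for train.csv, and the fill values (a "MISSING" label for text, the column median for numbers) are assumptions, not necessarily what I used on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the training data; the real train.csv is not shown here.
train_data = pd.DataFrame({
    "desk_id": ["DSK1", None, "DSK2", "DSK1"],
    "sold": [100.0, np.nan, 250.0, 80.0],
    "bought": [90.0, 120.0, np.nan, 75.0],
    "currency": ["USD", "EUR", "USD", "CHF"],
})

# Count missing values per column and keep only the columns that have any.
nan_counts = train_data.isnull().sum()
nan_columns = nan_counts[nan_counts > 0].index.tolist()
print(nan_columns)

# Fill every affected column. The fill values are assumptions for this sketch:
# the column median for numeric columns, a placeholder label for everything else.
for column in nan_columns:
    if pd.api.types.is_numeric_dtype(train_data[column]):
        train_data[column] = train_data[column].fillna(train_data[column].median())
    else:
        train_data[column] = train_data[column].fillna("MISSING")

# Confirm that nothing is left before handing the data to an encoder.
remaining_nans = int(train_data.isnull().sum().sum())
print(remaining_nans)
```

Running the count again after filling is a cheap way to confirm that no NaN is left before the data goes to LabelEncoder or OneHotEncoder.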