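I am working on a loan prediction practice problem and trying to fill in the missing values in the data. I got the data from here, and I am following this tutorial to work through the problem.
You can find the complete code I am using (the file is named model.py) and the data on GitHub, here.
The DataFrame looks like this: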
    df[['Loan_ID', 'Self_Employed', 'Education', 'LoanAmount']].head(10)
    Out:
        Loan_ID Self_Employed     Education  LoanAmount
    0  LP001002            No      Graduate         NaN
    1  LP001003            No      Graduate       128.0
    2  LP001005           Yes      Graduate        66.0
    3  LP001006            No  Not Graduate       120.0
    4  LP001008            No      Graduate       141.0
    5  LP001011           Yes      Graduate       267.0
    6  LP001013            No  Not Graduate        95.0
    7  LP001014            No      Graduate       158.0
    8  LP001018            No      Graduate       168.0
    9  LP001020            No      Graduate       349.0
After executing the last line of the following code (it corresponds to line 60 in model.py):
    import numpy as np
    import pandas as pd

    url = 'https://raw.githubusercontent.com/Aniruddh-SK/Loan-Prediction-Problem/master/train.csv'
    df = pd.read_csv(url)
    df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)
    df['Self_Employed'].fillna('No', inplace=True)

    table = df.pivot_table(values='LoanAmount', index='Self_Employed', columns='Education', aggfunc=np.median)

    # Define function to return value of this pivot_table
    def fage(x):
        return table.loc[x['Self_Employed'], x['Education']]

    # Replace missing values
    df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)
I get this error:
    ValueError                                Traceback (most recent call last)
    <ipython-input-40-5146e49c2460> in <module>()
    ----> 1 df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)

    /usr/local/lib/python2.7/dist-packages/pandas/core/series.pyc in fillna(self, value, method, axis, inplace, limit, downcast, **kwargs)
       2368                                           axis=axis, inplace=inplace,
       2369                                           limit=limit, downcast=downcast,
    -> 2370                                           **kwargs)
       2371
       2372     @Appender(generic._shared_docs['shift'] % _shared_doc_kwargs)

    /usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in fillna(self, value, method, axis, inplace, limit, downcast)
       3264             else:
       3265                 raise ValueError("invalid fill value with a %s" %
    -> 3266                                  type(value))
       3267
       3268         new_data = self._data.fillna(value=value, limit=limit,

    ValueError: invalid fill value with a <class 'pandas.core.frame.DataFrame'>
How can I fill in the missing values without getting this error?
Answer:
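It seems the tutorial's author wanted to replace each NaN with the matching value from the pivot table (table).
But to do that, the data first has to be aligned: unstack turns table into a Series, and set_index gives the rows of df the same keys.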
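As a minimal sketch of that alignment idea, assuming a small made-up DataFrame in place of train.csv (its values are hypothetical, not taken from the dataset): unstack turns the pivot table into a Series keyed by (Education, Self_Employed), and set_index puts the same keys on the rows so that fillna can match them. The string 'median' is used for aggfunc here instead of np.median; both compute the group median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv (values are made up).
df = pd.DataFrame({
    'Self_Employed': ['No', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes'],
    'Education': ['Graduate', 'Graduate', 'Graduate', 'Graduate',
                  'Not Graduate', 'Not Graduate', 'Not Graduate', 'Not Graduate'],
    'LoanAmount': [np.nan, 128.0, 140.0, 66.0, 120.0, 95.0, np.nan, 100.0],
})

# Median LoanAmount per (Self_Employed, Education) group; NaN values are skipped.
table = df.pivot_table(values='LoanAmount', index='Self_Employed',
                       columns='Education', aggfunc='median')

# unstack() returns a Series whose MultiIndex is (Education, Self_Employed):
# the column level comes first, the row level second.
medians = table.unstack()

# Give every row the same two-level key, in the same level order,
# so that fillna can look up the group median for each NaN row.
keyed = df.set_index(['Education', 'Self_Employed'])
keyed['LoanAmount'] = keyed['LoanAmount'].fillna(medians)

# Restore the key columns and the default integer index.
df = keyed.reset_index()
print(df)
```

The mean fill is deliberately left out of this sketch, so the two NaN rows still exist when the group medians are looked up.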
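First, remove the step that replaces NaN with the mean, because once it has run there is no NaN left in LoanAmount for the pivot table values to fill: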
    url = 'https://raw.githubusercontent.com/Aniruddh-SK/Loan-Prediction-Problem/master/train.csv'
    df = pd.read_csv(url)  # Read the dataset into a DataFrame using pandas
    # df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)
    df['Self_Employed'].fillna('No', inplace=True)
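The effect of the commented-out line can be checked on a made-up Series (the values below are hypothetical): filling with the mean leaves nothing missing, so a later selection of the NaN rows comes back empty.

```python
import numpy as np
import pandas as pd

# Hypothetical LoanAmount values with two missing entries.
loan = pd.Series([np.nan, 128.0, 66.0, np.nan], name='LoanAmount')

# Fill every NaN with the column mean, as the removed line did.
filled = loan.fillna(loan.mean())

print(loan.isnull().sum())            # missing values before the mean fill
print(filled.isnull().sum())          # missing values after the mean fill
print(filled[filled.isnull()].empty)  # True when no NaN rows are left to select
```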