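When I was getting started with Kaggle, I worked through a guided exercise that predicts who survived the Titanic disaster and who did not.
I followed all the steps as instructed.
My final code cell looks like this: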
from sklearn.ensemble import RandomForestClassifier

y = train_data['Survived']
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(train_data[features])
model = RandomForestClassifier(n_estimators=1, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")
Running it produces the following error:
ValueError                                Traceback (most recent call last)
<ipython-input-24-7d2fc2ea2973> in <module>
     11
     12
---> 13 output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
     14 output.to_csv('my_submission.csv', index=False)
     15 print("Your submission was successfully saved!")

/opt/conda/lib/python3.7/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    433             )
    434         elif isinstance(data, dict):
--> 435             mgr = init_dict(data, index, columns, dtype=dtype)
    436         elif isinstance(data, ma.MaskedArray):
    437             import numpy.ma.mrecords as mrecords

/opt/conda/lib/python3.7/site-packages/pandas/core/internals/construction.py in init_dict(data, index, columns, dtype)
    252             arr if not is_datetime64tz_dtype(arr) else arr.copy() for arr in arrays
    253         ]
--> 254     return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    255
    256

/opt/conda/lib/python3.7/site-packages/pandas/core/internals/construction.py in arrays_to_mgr(arrays, arr_names, index, columns, dtype)
     62     # figure out the index, if necessary
     63     if index is None:
---> 64         index = extract_index(arrays)
     65     else:
     66         index = ensure_index(index)

/opt/conda/lib/python3.7/site-packages/pandas/core/internals/construction.py in extract_index(data)
    376                         f"length {len(index)}"
    377                     )
--> 378                 raise ValueError(msg)
    379             else:
    380                 index = ibase.default_index(lengths[0])

ValueError: array length 891 does not match index length 418
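However, I can't work out what my mistake is. Can anyone help? Thank you.
Answer:
You are building the X_test DataFrame incorrectly: it is created from train_data instead of test_data. As a result, predictions has 891 entries (one per row of train_data) while test_data.PassengerId has only 418, so the lengths do not match when the output DataFrame is created.
Change that line to the following and it will work: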
X_test = pd.get_dummies(test_data[features])
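To see the mismatch in isolation, here is a minimal sketch using tiny made-up frames in place of the real Kaggle data (the rows and the reduced feature list are assumptions for illustration only). It reproduces the ValueError with predictions sized to the training set, then builds X_test from test_data. The reindex step is an extra precaution that is not part of the original fix: it keeps the test columns identical to the training columns in case a category is missing from the test set.

```python
import pandas as pd

# Hypothetical stand-ins for the Kaggle frames (made-up rows, fewer features).
train_data = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Survived": [0, 1, 1, 0],
})
test_data = pd.DataFrame({
    "PassengerId": [5, 6],
    "Pclass": [2, 3],
    "Sex": ["female", "female"],
})
features = ["Pclass", "Sex"]

# The bug: predictions computed on train-sized input have one entry per
# training row, so they cannot be paired with test_data.PassengerId.
wrong_predictions = [0] * len(train_data)
mismatch_raised = False
try:
    pd.DataFrame({"PassengerId": test_data.PassengerId,
                  "Survived": wrong_predictions})
except ValueError:
    mismatch_raised = True

# The fix: encode the test set, not the training set.
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

# Optional precaution (an addition, not in the original answer): give X_test
# the same columns as X, zero-filling dummy columns absent from the test set.
X_test = X_test.reindex(columns=X.columns, fill_value=0)

# Cheap sanity check before predicting and writing the submission file.
assert len(X_test) == len(test_data)
```

A length check like the last line, placed just before the output DataFrame is built, would have pointed at X_test straight away rather than at the pandas internals shown in the traceback.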