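I'd like to understand how to carry out a simple prediction task. I'm experimenting with this dataset, which is also available here in a different format. The dataset describes student performance in certain courses, and I'd like to vectorize only some of its columns rather than use all of the data (just to learn how this works). So I tried the following, using DictVectorizer: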
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
training_data = pd.read_csv('/Users/user/Downloads/student/student-mat.csv')
dict_vect = DictVectorizer(sparse=False)
training_matrix = dict_vect.fit_transform(training_data['G1','G2','sex','school','age'])
training_matrix.toarray()
Then I want to pass in another row of feature data, like this:
testing_data = pd.read_csv('/Users/user/Downloads/student/student-mat_test.csv')
test_matrix = dict_vect.transform(testing_data['G1','G2','sex','school','age'])
The problem is that I get the following traceback:
/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/bin/python2.7 school_2.py
Traceback (most recent call last):
File "/Users/user/PycharmProjects/PAN-pruebas/escuela_2.py", line 14, in <module>
X = dict_vect.fit_transform(df['sex','age','address','G1','G2'].values)
File "school_2.py", line 1787, in __getitem__
return self._getitem_column(key)
File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 1794, in _getitem_column
return self._get_item_cache(key)
File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 1079, in _get_item_cache
values = self._data.get(item)
File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 2843, in get
loc = self.items.get_loc(item)
File "/usr/local/lib/python2.7/site-packages/pandas/core/index.py", line 1437, in get_loc
return self._engine.get_loc(_values_from_object(key))
File "pandas/index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas/index.c:3824)
File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3704)
File "pandas/hashtable.pyx", line 697, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12349)
File "pandas/hashtable.pyx", line 705, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12300)
KeyError: ('sex', 'age', 'address', 'G1', 'G2')
Process finished with exit code 1
Does anyone know how to correctly vectorize the training and test data, and how to display both matrices with .toarray()?
Update
>>>print training_data.info()
/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/user/PycharmProjects/PAN-pruebas/escuela_3.py
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 396 entries, (school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, guardian, traveltime, studytime, failures, schoolsup, famsup, paid, activities, nursery, higher, internet, romantic, famrel, freetime, goout, Dalc, Walc, health, absences) to (MS, M, 19, U, LE3, T, 1, 1, other, at_home, course, father, 1, 1, 0, no, no, no, no, yes, yes, yes, no, 3, 2, 3, 3, 3, 5, 5)
Data columns (total 3 columns):
id 396 non-null object
content 396 non-null object
label 396 non-null object
dtypes: object(3)
memory usage: 22.7+ KB
None
Process finished with exit code 0
Answer:
You need to pass a list:
test_matrix = dict_vect.transform(testing_data[['G1','G2','sex','school','age']])
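You were trying to index your DataFrame with this key:
('G1','G2','sex','school','age')
That is why you get a KeyError: pandas treats the comma-separated labels as one tuple key, and no single column has that name. To select multiple columns, pass a list of column names using double brackets, df[[col_list]].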
Example:
In [43]:df = pd.DataFrame(columns=['a','b'])
df
Out[43]:
Empty DataFrame
Columns: [a, b]
Index: []
In [44]:df['a','b']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-44-33332c7e7227> in <module>()
----> 1 df['a','b']
......
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12349)()
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12300)()
KeyError: ('a', 'b')
But this works:
In [45]:df[['a','b']]
Out[45]:
Empty DataFrame
Columns: [a, b]
Index: []
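One more thing to watch for once the column selection is fixed: DictVectorizer works on an iterable of dict-like records rather than on a DataFrame directly, so the selected columns probably need converting first. Below is a minimal sketch of the whole flow. The two small DataFrames are made-up stand-ins for the CSV files (the values are hypothetical, not taken from the dataset), and `cols`, `train_records` and `test_records` are names introduced only for illustration.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical stand-in data; in the question these come from pd.read_csv(...).
training_data = pd.DataFrame({
    'G1': [5, 12, 15],
    'G2': [6, 11, 14],
    'sex': ['F', 'M', 'F'],
    'school': ['GP', 'GP', 'MS'],
    'age': [18, 17, 16],
    'address': ['U', 'R', 'U'],  # extra column that is left out of the features
})
testing_data = pd.DataFrame({
    'G1': [10], 'G2': [9], 'sex': ['M'],
    'school': ['MS'], 'age': [19], 'address': ['R'],
})

# A list of column names, used with double brackets below.
cols = ['G1', 'G2', 'sex', 'school', 'age']

# Turn each row into a dict such as {'G1': 5, 'G2': 6, 'sex': 'F', ...}.
train_records = training_data[cols].to_dict(orient='records')
test_records = testing_data[cols].to_dict(orient='records')

dict_vect = DictVectorizer(sparse=False)
training_matrix = dict_vect.fit_transform(train_records)  # learns the feature names
test_matrix = dict_vect.transform(test_records)           # reuses the same features

print(dict_vect.feature_names_)
print(training_matrix)
print(test_matrix)
```

String columns are one-hot encoded (for example sex=F and sex=M) while numeric columns are passed through unchanged. With sparse=False the result is already a NumPy array, so calling .toarray() on it is unnecessary and would fail; .toarray() only applies when the vectorizer is left at its default sparse=True.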