使用scikit learn的DictVectorizer向量化特定列时遇到问题?

我想了解如何进行一个简单的预测任务,我正在使用这个数据集进行尝试,另外这个数据集也以不同的格式在这里。这个数据集是关于学生在某些课程中的表现,我希望能够向量化数据集中的某些列,以便不使用所有数据(只是为了学习其工作原理)。所以我尝试了以下方法,使用DictVectorizer

import pandas as pd
from sklearn.feature_extraction import DictVectorizer
training_data = pd.read_csv('/Users/user/Downloads/student/student-mat.csv')
dict_vect = DictVectorizer(sparse=False)
training_matrix = dict_vect.fit_transform(training_data['G1','G2','sex','school','age'])
training_matrix.toarray()

然后我想传递另一行特征数据,如下所示:

testing_data = pd.read_csv('/Users/user/Downloads/student/student-mat_test.csv')
test_matrix = dict_vect.transform(testing_data['G1','G2','sex','school','age'])

这样做的问题是,我得到了以下错误跟踪:

/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/bin/python2.7 school_2.py
Traceback (most recent call last):
  File "/Users/user/PycharmProjects/PAN-pruebas/escuela_2.py", line 14, in <module>
    X = dict_vect.fit_transform(df['sex','age','address','G1','G2'].values)
  File "school_2.py", line 1787, in __getitem__
    return self._getitem_column(key)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 1794, in _getitem_column
    return self._get_item_cache(key)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 1079, in _get_item_cache
    values = self._data.get(item)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 2843, in get
    loc = self.items.get_loc(item)
  File "/usr/local/lib/python2.7/site-packages/pandas/core/index.py", line 1437, in get_loc
    return self._engine.get_loc(_values_from_object(key))
  File "pandas/index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas/index.c:3824)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3704)
  File "pandas/hashtable.pyx", line 697, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12349)
  File "pandas/hashtable.pyx", line 705, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12300)
KeyError: ('sex', 'age', 'address', 'G1', 'G2')
Process finished with exit code 1

有谁知道如何正确地向量化训练和测试数据?并且显示两个矩阵使用.toarray()

更新

>>>print training_data.info()
/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/user/PycharmProjects/PAN-pruebas/escuela_3.py
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 396 entries, (school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, guardian, traveltime, studytime, failures, schoolsup, famsup, paid, activities, nursery, higher, internet, romantic, famrel, freetime, goout, Dalc, Walc, health, absences) to (MS, M, 19, U, LE3, T, 1, 1, other, at_home, course, father, 1, 1, 0, no, no, no, no, yes, yes, yes, no, 3, 2, 3, 3, 3, 5, 5)
Data columns (total 3 columns):
id         396 non-null object
content    396 non-null object
label      396 non-null object
dtypes: object(3)
memory usage: 22.7+ KB
None
Process finished with exit code 0

回答:

你需要传递一个列表:

test_matrix = dict_vect.transform(testing_data[['G1','G2','sex','school','age']])

你之前尝试使用这些键来索引你的DataFrame:

['G1','G2','sex','school','age']

这就是为什么你会得到KeyError,因为没有一个单独的列名是上述形式的。要选择多个列,你需要传递一个列名列表,并使用双重下标[[col_list]]

示例:

In [43]:df = pd.DataFrame(columns=['a','b'])
df
Out[43]:
Empty DataFrame
Columns: [a, b]
Index: []
In [44]:df['a','b']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-44-33332c7e7227> in <module>()
----> 1 df['a','b']
......
    pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12349)()
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12300)()
KeyError: ('a', 'b')

但这样是可以的:

In [45]:df[['a','b']]
Out[45]:
Empty DataFrame
Columns: [a, b]
Index: []

Related Posts

使用LSTM在Python中预测未来值

这段代码可以预测指定股票的当前日期之前的值,但不能预测…

如何在gensim的word2vec模型中查找双词组的相似性

我有一个word2vec模型,假设我使用的是googl…

dask_xgboost.predict 可以工作但无法显示 – 数据必须是一维的

我试图使用 XGBoost 创建模型。 看起来我成功地…

ML Tuning – Cross Validation in Spark

我在https://spark.apache.org/…

如何在React JS中使用fetch从REST API获取预测

我正在开发一个应用程序,其中Flask REST AP…

如何分析ML.NET中多类分类预测得分数组?

我在ML.NET中创建了一个多类分类项目。该项目可以对…

发表回复

您的邮箱地址不会被公开。 必填项已用 * 标注