How to use feature vectors containing both text and numbers with sklearn

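I have just started using sklearn and want to classify products. The products appear in order lines and have attributes such as a description, a price, a manufacturer, and an order quantity. Some of these attributes are text and others are numbers (integers or floats). I want to use these attributes to predict whether a product needs maintenance. The products we buy can be engines, pumps, and so on, but also nuts, hoses, filters, and so on. So far I have made one prediction based on price and quantity, and separate predictions based on the description or the manufacturer. Now I want to combine these predictions, but I don't know how. I have looked at the Pipeline and FeatureUnion pages, but I find them confusing. Does anyone have a simple example of how to predict on data that has both text and numeric columns?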

This is what I have so far:

    order_lines.head(5)

       Part No  Part Description                       Quantity  Price/Base  Supplier Name                       Purch UoM  Category
    0  1112165  Duikwerkzaamheden                      1.0       750.00      Duik & Bergingsbedrijf Europa B.V.  pcs        0
    1  1112165  Duikwerkzaamheden bij de helling       1.0       500.00      Duik & Bergingsbedrijf Europa B.V.  pcs        0
    2  1070285  Inspectie boegschroef, dd. 26-03-2012  1.0       0.01        Duik & Bergingsbedrijf Europa B.V.  pcs        0
    3  1037024  Spare parts Albanie Acc. List          1.0       3809.16     Lastechniek Europa B.V.             -          0
    4  1037025  M_PO:441.35/BW_INV:0                   1.0       0.00        Exalto                              pcs        0

    category_column = order_lines['Category']
    order_lines = order_lines[['Part Description', 'Quantity', 'Price/Base', 'Supplier Name', 'Purch UoM']]

    # note: sklearn.cross_validation was removed in scikit-learn 0.20;
    # newer versions provide train_test_split in sklearn.model_selection
    from sklearn.cross_validation import train_test_split
    features_train, features_test, target_train, target_test = train_test_split(
        order_lines, category_column, test_size=0.20)

    from sklearn.base import TransformerMixin, BaseEstimator

    class FeatureTypeSelector(TransformerMixin, BaseEstimator):
        FEATURE_TYPES = {
            'price and quantity': [
                'Price/Base',
                'Quantity',
            ],
            'description, supplier, uom': [
                'Part Description',
                'Supplier Name',
                'Purch UoM',
            ],
        }

        def __init__(self, feature_type):
            self.columns = self.FEATURE_TYPES[feature_type]

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            return X[self.columns]

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_union, make_pipeline
    from sklearn.preprocessing import RobustScaler

    preprocessor = make_union(
        make_pipeline(
            FeatureTypeSelector('price and quantity'),
            RobustScaler(),
        ),
        make_pipeline(
            FeatureTypeSelector('description, supplier, uom'),
            CountVectorizer(),
        ),
    )

    preprocessor.fit_transform(features_train)

Then I get this error:

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-51-f8b0db33462a> in <module>()
    ----> 1 preprocessor.fit_transform(features_train)

    C:\Anaconda3\lib\site-packages\sklearn\pipeline.py in fit_transform(self, X, y, **fit_params)
        500         self._update_transformer_list(transformers)
        501         if any(sparse.issparse(f) for f in Xs):
    --> 502             Xs = sparse.hstack(Xs).tocsr()
        503         else:
        504             Xs = np.hstack(Xs)

    C:\Anaconda3\lib\site-packages\scipy\sparse\construct.py in hstack(blocks, format, dtype)
        462
        463     """
    --> 464     return bmat([blocks], format=format, dtype=dtype)
        465
        466

    C:\Anaconda3\lib\site-packages\scipy\sparse\construct.py in bmat(blocks, format, dtype)
        579                 else:
        580                     if brow_lengths[i] != A.shape[0]:
    --> 581                         raise ValueError('blocks[%d,:] has incompatible row dimensions' % i)
        582
        583                 if bcol_lengths[j] == 0:

    ValueError: blocks[0,:] has incompatible row dimensions

Answer:

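I would advise against making predictions on the different feature types separately and then combining them. You are better off using the FeatureUnion you mentioned, which lets you build a separate preprocessing pipeline for each feature type. A construct I often use is the following…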

Let's define a toy dataset to experiment with:

    import pandas as pd

    # create a pandas dataframe that contains your features
    X = pd.DataFrame({'quantity': [13, 7, 42, 11],
                      'item_name': ['nut', 'bolt', 'bolt', 'chair'],
                      'item_type': ['hardware', 'hardware', 'hardware', 'furniture'],
                      'item_price': [1.95, 4.95, 2.79, 19.95]})

    # create corresponding target (this is often just one of the dataframe columns)
    y = pd.Series([0, 1, 1, 0], index=X.index)

I use Pipeline and FeatureUnion (or rather their simpler shortcuts, make_pipeline and make_union) to tie everything together:

    from sklearn.pipeline import make_union, make_pipeline
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.preprocessing import RobustScaler
    from sklearn.linear_model import LogisticRegression

    # FeatureTypeSelector and RowToDictTransformer are custom transformers
    # (their definitions are given further down in this answer)

    # create your preprocessor that handles different feature types separately
    preprocessor = make_union(
        make_pipeline(
            FeatureTypeSelector('continuous'),
            RobustScaler(),
        ),
        make_pipeline(
            FeatureTypeSelector('categorical'),
            RowToDictTransformer(),
            DictVectorizer(sparse=False),  # set sparse=True if you get MemoryError
        ),
    )

    # example use of your combined preprocessor
    preprocessor.fit_transform(X)

    # choose some estimator
    estimator = LogisticRegression()

    # your prediction model can be created as follows
    model = make_pipeline(preprocessor, estimator)

    # and training is done as follows
    model.fit(X, y)

    # predict (preferably not on training data X)
    model.predict(X)
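For readers unsure how the shortcuts relate to the classes: as far as I know, make_union and make_pipeline mainly save you from naming the steps, deriving each name from the lowercased class name. The sketch below compares make_union with an explicitly named FeatureUnion on a small numeric array; the array and the choice of scalers are illustrative assumptions, not part of the answer.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion, make_union
from sklearn.preprocessing import RobustScaler, StandardScaler

# illustrative numeric data (quantity and price from the toy dataset)
X_num = np.array([[13, 1.95], [7, 4.95], [42, 2.79], [11, 19.95]])

# shortcut: step names are generated automatically
short = make_union(RobustScaler(), StandardScaler())

# explicit form: you choose the step names yourself
explicit = FeatureUnion([
    ("robustscaler", RobustScaler()),
    ("standardscaler", StandardScaler()),
])

short_names = [name for name, _ in short.transformer_list]
print(short_names)

# both variants stack the two scaled copies side by side
out_short = short.fit_transform(X_num)
out_explicit = explicit.fit_transform(X_num)
print(out_short.shape, np.allclose(out_short, out_explicit))
```

The explicit form is mostly useful when you want readable step names, for example when addressing nested parameters in a grid search.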

In the pipeline above, FeatureTypeSelector and RowToDictTransformer are custom transformers of my own, defined as follows:

    from sklearn.base import TransformerMixin, BaseEstimator


    class FeatureTypeSelector(TransformerMixin, BaseEstimator):
        """ Selects a subset of features based on their type """

        FEATURE_TYPES = {
            'categorical': [
                'item_name',
                'item_type',
            ],
            'continuous': [
                'quantity',
                'item_price',
            ]
        }

        def __init__(self, feature_type):
            # keep the constructor argument as an attribute so that
            # get_params() and clone() work as scikit-learn expects
            self.feature_type = feature_type

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            # look up the column names for this feature type and select them
            return X[self.FEATURE_TYPES[self.feature_type]]


    class RowToDictTransformer(TransformerMixin, BaseEstimator):
        """ Prepare dataframe for DictVectorizer """

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            # yield one pandas Series (a mapping of column name to value) per row
            return (row[1] for row in X.iterrows())
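As for the ValueError in the question, a likely cause is that CountVectorizer expects an iterable of strings, whereas iterating over a DataFrame yields its column labels. The text branch then produces one row per column instead of one row per order line, and the two blocks cannot be stacked. The sketch below reproduces this on a few rows adapted from the question (the quantities are varied for illustration) and then gives each text column its own CountVectorizer. `ColumnSelector` is a hypothetical helper introduced here; it is not part of scikit-learn or of the answer above.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import RobustScaler

# a few rows adapted from the question (quantities varied for illustration)
order_lines = pd.DataFrame({
    "Part Description": [
        "Duikwerkzaamheden",
        "Duikwerkzaamheden bij de helling",
        "Inspectie boegschroef, dd. 26-03-2012",
        "Spare parts Albanie Acc. List",
    ],
    "Quantity": [1.0, 2.0, 1.0, 5.0],
    "Price/Base": [750.00, 500.00, 0.01, 3809.16],
    "Supplier Name": [
        "Duik & Bergingsbedrijf Europa B.V.",
        "Duik & Bergingsbedrijf Europa B.V.",
        "Duik & Bergingsbedrijf Europa B.V.",
        "Lastechniek Europa B.V.",
    ],
})

# the problem: a two-column DataFrame is read as two "documents",
# namely the column labels themselves
text_frame = order_lines[["Part Description", "Supplier Name"]]
wrong = CountVectorizer().fit_transform(text_frame)
print(wrong.shape)  # first dimension is the number of columns, not rows


class ColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: a string selects a Series, a list a DataFrame."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]


# one vectorizer per text column, each fed a 1-D Series of strings
preprocessor = make_union(
    make_pipeline(ColumnSelector(["Price/Base", "Quantity"]), RobustScaler()),
    make_pipeline(ColumnSelector("Part Description"), CountVectorizer()),
    make_pipeline(ColumnSelector("Supplier Name"), CountVectorizer()),
)

features = preprocessor.fit_transform(order_lines)
print(features.shape)  # first dimension now matches the number of order lines
```

Short categorical text such as the unit of measure may fit the DictVectorizer branch of the answer better than a bag-of-words model; that choice is left open here.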

I hope this example makes it clearer how to do a feature union.

-Kris
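A note for readers on scikit-learn 0.20 or later: ColumnTransformer in sklearn.compose can route columns to transformers without custom selector classes. Below is a sketch of the toy model rebuilt that way. It swaps DictVectorizer for OneHotEncoder, so treat it as an alternative to the construct in the answer and not as the same pipeline; the step names are arbitrary choices.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# same toy data as in the answer
X = pd.DataFrame({'quantity': [13, 7, 42, 11],
                  'item_name': ['nut', 'bolt', 'bolt', 'chair'],
                  'item_type': ['hardware', 'hardware', 'hardware', 'furniture'],
                  'item_price': [1.95, 4.95, 2.79, 19.95]})
y = pd.Series([0, 1, 1, 0], index=X.index)

# each entry is (step name, transformer, columns it should receive)
preprocessor = ColumnTransformer([
    ("continuous", RobustScaler(), ["quantity", "item_price"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"),
     ["item_name", "item_type"]),
])

# 2 scaled columns + 3 item_name categories + 2 item_type categories
transformed = preprocessor.fit_transform(X)
print(transformed.shape)

model = make_pipeline(preprocessor, LogisticRegression())
model.fit(X, y)

# predicting on the training data only to show the call
predictions = model.predict(X)
print(len(predictions))
```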
