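I am just getting started with sklearn and want to classify products. The products appear on order lines and have attributes such as a description, price, manufacturer, and order quantity. Some of these attributes are text and others are numeric (integers or floats). I want to use these attributes to predict whether a product needs maintenance. The products we buy can be engines, pumps, and so on, but also nuts, hoses, filters, and the like. So far I have made one prediction based on price and quantity, and separate predictions based on the description or the manufacturer. Now I want to combine these predictions, but I don't know how. I have looked at the Pipeline and FeatureUnion pages, but they are confusing to me. Does anyone have a simple example of how to predict on data that has both text and numeric columns?

This is what I have now: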
    order_lines.head(5)

       Part No                       Part Description  Quantity  Price/Base                       Supplier Name  Purch UoM  Category
    0  1112165                      Duikwerkzaamheden       1.0      750.00  Duik & Bergingsbedrijf Europa B.V.        pcs         0
    1  1112165       Duikwerkzaamheden bij de helling       1.0      500.00  Duik & Bergingsbedrijf Europa B.V.        pcs         0
    2  1070285  Inspectie boegschroef, dd. 26-03-2012       1.0        0.01  Duik & Bergingsbedrijf Europa B.V.        pcs         0
    3  1037024          Spare parts Albanie Acc. List       1.0     3809.16             Lastechniek Europa B.V.          -         0
    4  1037025                   M_PO:441.35/BW_INV:0       1.0        0.00                              Exalto        pcs         0

    category_column = order_lines['Category']
    order_lines = order_lines[['Part Description', 'Quantity', 'Price/Base', 'Supplier Name', 'Purch UoM']]

    from sklearn.cross_validation import train_test_split
    features_train, features_test, target_train, target_test = train_test_split(order_lines, category_column, test_size=0.20)

    from sklearn.base import TransformerMixin, BaseEstimator

    class FeatureTypeSelector(TransformerMixin, BaseEstimator):
        FEATURE_TYPES = {
            'price and quantity': [
                'Price/Base',
                'Quantity',
            ],
            'description, supplier, uom': [
                'Part Description',
                'Supplier Name',
                'Purch UoM',
            ],
        }
        def __init__(self, feature_type):
            self.columns = self.FEATURE_TYPES[feature_type]
        def fit(self, X, y=None):
            return self
        def transform(self, X):
            return X[self.columns]

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_union, make_pipeline
    from sklearn.preprocessing import RobustScaler

    preprocessor = make_union(
        make_pipeline(
            FeatureTypeSelector('price and quantity'),
            RobustScaler(),
        ),
        make_pipeline(
            FeatureTypeSelector('description, supplier, uom'),
            CountVectorizer(),
        ),
    )
    preprocessor.fit_transform(features_train)
Then I get this error:
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-51-f8b0db33462a> in <module>()
    ----> 1 preprocessor.fit_transform(features_train)

    C:\Anaconda3\lib\site-packages\sklearn\pipeline.py in fit_transform(self, X, y, **fit_params)
        500         self._update_transformer_list(transformers)
        501         if any(sparse.issparse(f) for f in Xs):
    --> 502             Xs = sparse.hstack(Xs).tocsr()
        503         else:
        504             Xs = np.hstack(Xs)

    C:\Anaconda3\lib\site-packages\scipy\sparse\construct.py in hstack(blocks, format, dtype)
        462
        463     """
    --> 464     return bmat([blocks], format=format, dtype=dtype)
        465
        466

    C:\Anaconda3\lib\site-packages\scipy\sparse\construct.py in bmat(blocks, format, dtype)
        579                 else:
        580                     if brow_lengths[i] != A.shape[0]:
    --> 581                         raise ValueError('blocks[%d,:] has incompatible row dimensions' % i)
        582
        583             if bcol_lengths[j] == 0:

    ValueError: blocks[0,:] has incompatible row dimensions
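Answer:

I would advise against making predictions on the different feature types separately and then combining them. You are better off using the FeatureUnion you mentioned, which lets you create a separate preprocessing pipeline for each feature type. A construct I often use is the following…

Let's define a toy example dataset to play around with: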
    import pandas as pd

    # create a pandas dataframe that contains your features
    X = pd.DataFrame({'quantity': [13, 7, 42, 11],
                      'item_name': ['nut', 'bolt', 'bolt', 'chair'],
                      'item_type': ['hardware', 'hardware', 'hardware', 'furniture'],
                      'item_price': [1.95, 4.95, 2.79, 19.95]})

    # create corresponding target (this is often just one of the dataframe columns)
    y = pd.Series([0, 1, 1, 0], index=X.index)
I use Pipeline and FeatureUnion (or rather their simpler shortcuts make_pipeline and make_union) to tie everything together:
    from sklearn.pipeline import make_union, make_pipeline
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.preprocessing import RobustScaler
    from sklearn.linear_model import LogisticRegression

    # create your preprocessor that handles different feature types separately
    preprocessor = make_union(
        make_pipeline(
            FeatureTypeSelector('continuous'),
            RobustScaler(),
        ),
        make_pipeline(
            FeatureTypeSelector('categorical'),
            RowToDictTransformer(),
            DictVectorizer(sparse=False),  # set sparse=True if you get MemoryError
        ),
    )

    # example use of your combined preprocessor
    preprocessor.fit_transform(X)

    # choose some estimator
    estimator = LogisticRegression()

    # your prediction model can be created as follows
    model = make_pipeline(preprocessor, estimator)

    # and training is done as follows
    model.fit(X, y)

    # predict (preferably not on training data X)
    model.predict(X)
Here, I defined my own custom transformers FeatureTypeSelector and RowToDictTransformer as follows:
    from sklearn.base import TransformerMixin, BaseEstimator


    class FeatureTypeSelector(TransformerMixin, BaseEstimator):
        """ Selects a subset of features based on their type """

        FEATURE_TYPES = {
            'categorical': [
                'item_name',
                'item_type',
            ],
            'continuous': [
                'quantity',
                'item_price',
            ]
        }

        def __init__(self, feature_type):
            # keep the constructor argument under its own name so that
            # get_params() and clone() inherited from BaseEstimator keep working
            self.feature_type = feature_type

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            return X[self.FEATURE_TYPES[self.feature_type]]


    class RowToDictTransformer(TransformerMixin, BaseEstimator):
        """ Prepare dataframe for DictVectorizer """

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            # one dict per row; unlike a generator, a list can be iterated more than once
            return X.to_dict(orient='records')
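As for the ValueError in the question: a likely cause (my reading of the traceback, not something I have run against the original data) is that CountVectorizer is handed a whole DataFrame. Iterating over a DataFrame yields its column names, so the vectorizer sees one "document" per column instead of one per row, and its output cannot be stacked next to the scaled numeric block. Below is a minimal sketch on the toy columns from above that vectorizes one text column at a time. The helper `select_column` is a made-up name for illustration, and using FunctionTransformer for column selection is just one of several ways to do it.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, RobustScaler

X = pd.DataFrame({
    'quantity': [13, 7, 42, 11],
    'item_name': ['nut', 'bolt', 'bolt', 'chair'],
    'item_type': ['hardware', 'hardware', 'hardware', 'furniture'],
    'item_price': [1.95, 4.95, 2.79, 19.95],
})

# Passing a two-column DataFrame: the vectorizer iterates over the column
# names, so it returns one row per column rather than one row per sample.
wrong = CountVectorizer().fit_transform(X[['item_name', 'item_type']])
print(wrong.shape[0], 'rows from the vectorizer vs', len(X), 'samples')


def select_column(name):
    """Hypothetical helper: a transformer returning one column as a 1-D Series."""
    return FunctionTransformer(lambda df: df[name])


preprocessor = make_union(
    # numeric block: select both numeric columns, then scale them
    make_pipeline(
        FunctionTransformer(lambda df: df[['quantity', 'item_price']]),
        RobustScaler(),
    ),
    # text blocks: each vectorizer gets exactly one column of strings
    make_pipeline(select_column('item_name'), CountVectorizer()),
    make_pipeline(select_column('item_type'), CountVectorizer()),
)

features = preprocessor.fit_transform(X)
print(features.shape)
```

Every branch of the union now returns one row per sample, so the blocks can be stacked side by side. On real free-text descriptions you would probably want TfidfVectorizer or some vocabulary pruning, but that is a separate tuning question.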
Hopefully this example makes it a bit clearer how to do a feature union.
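P.S. If you are on scikit-learn 0.20 or later, ColumnTransformer covers the same ground without custom selector classes. It is a different tool from the make_union construct above, so treat the following as a rough sketch on the same toy data and not as a drop-in replacement. The step names ('numeric', 'name_text', 'type_onehot') are arbitrary labels I picked.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

X = pd.DataFrame({
    'quantity': [13, 7, 42, 11],
    'item_name': ['nut', 'bolt', 'bolt', 'chair'],
    'item_type': ['hardware', 'hardware', 'hardware', 'furniture'],
    'item_price': [1.95, 4.95, 2.79, 19.95],
})
y = pd.Series([0, 1, 1, 0], index=X.index)

preprocessor = ColumnTransformer([
    # a list of column names hands a 2-D frame to the transformer
    ('numeric', RobustScaler(), ['quantity', 'item_price']),
    # a single string hands over a 1-D column, which is what CountVectorizer expects
    ('name_text', CountVectorizer(), 'item_name'),
    # categorical column encoded as indicator variables
    ('type_onehot', OneHotEncoder(handle_unknown='ignore'), ['item_type']),
])

model = make_pipeline(preprocessor, LogisticRegression())
model.fit(X, y)

# predicting on the training data here only to show the call
predictions = model.predict(X)
print(predictions)
```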
-Kris