ValueError (extra rows) when submitting pipeline.predict to the grader

When I try to submit my Pipeline to the grader, I get the ValueError below. I am not sure where I am supposed to drop 12,500 rows of data.

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,2].shape[0] == 13892, expected 1544.

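My task is to build a model that combines the business features of nursing homes with their cycle 1 survey results and the time elapsed between the cycle 1 and cycle 2 surveys, in order to predict the cycle 2 total score.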

This is the code I am using to accomplish that task.

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import FeatureUnion, Pipeline

# Custom transformer that computes the difference between the survey 1 and survey 2 dates
class TimedeltaTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, t1_col, t2_col):
        self.t1_col = t1_col
        self.t2_col = t2_col

    def fit(self, X, y=None):
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        self.col_1 = X[self.t1_col].apply(pd.to_datetime)
        self.col_2 = X[self.t2_col].apply(pd.to_datetime)
        return self

    def transform(self, X):
        difference_list = []
        difference = self.col_1 - self.col_2
        for obj in difference:
            difference_list.append(obj.total_seconds())
        return np.array(difference_list).reshape(-1,1)

# Create the TimedeltaTransformer object
cycle_1_date = 'CYCLE_1_SURVEY_DATE'
cycle_2_date = 'CYCLE_2_SURVEY_DATE'
time_feature = TimedeltaTransformer(cycle_1_date, cycle_2_date)

# Use the custom column-select transformer to extract the cycle 1 features
cycle_1_cols = ['CYCLE_1_DEFS', 'CYCLE_1_NFROMDEFS', 'CYCLE_1_NFROMCOMP',
                'CYCLE_1_DEFS_SCORE', 'CYCLE_1_NUMREVIS',
                'CYCLE_1_REVISIT_SCORE', 'CYCLE_1_TOTAL_SCORE']
cycle_1_features = Pipeline([
    ('cst2', ColumnSelectTransformer(cycle_1_cols)),
    ])

# Create my survey_model Pipeline object.
# The pipeline is a two-step process: first a FeatureUnion that transforms
# and combines the business features, the cycle 1 features, and the time
# feature; then the transformed features are fit to a RandomForestRegressor.
survey_model = Pipeline([
    ('features', FeatureUnion([
        ('business', business_features),
        ('survey', cycle_1_features),
        ('time', time_feature),
    ])),
    ('forest', RandomForestRegressor()),
])

# Fitting my pipeline produces no error
survey_model.fit(data, cycle_2_score.astype(int))

# Passing the predict function to the grader raises the ValueError
grader.score.ml__survey_model(survey_model.predict)

The fitted pipeline looks like this:

Pipeline(memory=None,
         steps=[('features',
                 FeatureUnion(n_jobs=None,
                              transformer_list=[('business',
                                                 FeatureUnion(n_jobs=None,
                                                              transformer_list=[('simple',
                                                                                 Pipeline(memory=None,
                                                                                          steps=[('cst',
                                                                                                  ColumnSelectTransformer(columns=['BEDCERT',
                                                                                                                                   'RESTOT',
                                                                                                                                   'INHOSP',
                                                                                                                                   'CCRC_FACIL',
                                                                                                                                   'SFF',
                                                                                                                                   'CHOW_LAST_12MOS',
                                                                                                                                   'SPRINKLER_STATUS',
                                                                                                                                   'EXP_TOTAL',
                                                                                                                                   'ADJ_TOTAL'])),
                                                                                                 ('imputer',
                                                                                                  SimpleImpute...
                              transformer_weights=None, verbose=False)),
                ('forest',
                 RandomForestRegressor(bootstrap=True, criterion='mse',
                                       max_depth=None, max_features='auto',
                                       max_leaf_nodes=None,
                                       min_impurity_decrease=0.0,
                                       min_impurity_split=None,
                                       min_samples_leaf=1, min_samples_split=2,
                                       min_weight_fraction_leaf=0.0,
                                       n_estimators=10, n_jobs=None,
                                       oob_score=False, random_state=None,
                                       verbose=0, warm_start=False))],
         verbose=False)

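Some additional context: I am building this model so that I can pass its predict method to a custom grader for a project. The grader passes a list of dicts, not a DataFrame, to my estimator's predict or predict_proba method. This means the model must be able to handle both data types. For that reason I need to supply a custom ColumnSelectTransformer in place of scikit-learn's own ColumnTransformer.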
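To illustrate the point about the two input types, here is a minimal sketch showing that `pd.DataFrame` accepts a list of dicts directly, so a single `isinstance` check is enough to cover both cases. The helper name `to_frame` and the sample records are made up for illustration; only the column names come from the question.

```python
import pandas as pd


def to_frame(X):
    """Return X unchanged if it is already a DataFrame, otherwise build one."""
    if isinstance(X, pd.DataFrame):
        return X
    return pd.DataFrame(X)


# Hypothetical records in the shape the grader is described as sending.
records = [
    {"BEDCERT": 120, "RESTOT": 98.0, "OWNERSHIP": "For profit"},
    {"BEDCERT": 60, "RESTOT": 55.0, "OWNERSHIP": "Non profit"},
]

frame = to_frame(records)
# Column selection now works the same way for both input types.
selected = frame[["BEDCERT", "RESTOT"]].values
print(selected.shape)
```

Any transformer that indexes columns by name needs this conversion inside `transform`, because that is the method the pipeline calls at prediction time.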

Below is the additional code for the business features and the ColumnSelectTransformer.

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder

# Custom transformer that selects columns from a DataFrame and returns an array
class ColumnSelectTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        return X[self.columns].values

simple_features = Pipeline([
    ('cst', ColumnSelectTransformer(simple_cols)),
    ('imputer', SimpleImputer(strategy='mean')),
])

owner_onehot = Pipeline([
    ('cst', ColumnSelectTransformer(['OWNERSHIP'])),
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder()),
])

cert_onehot = Pipeline([
    ('cst', ColumnSelectTransformer(['CERTIFICATION'])),
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder()),
])

categorical_features = FeatureUnion([
    ('owner_onehot', owner_onehot),
    ('cert_onehot', cert_onehot),
])

business_features = FeatureUnion([
    ('simple', simple_features),
    ('categorical', categorical_features)
])

Finally, here is the full error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-165-790ca6139493> in <module>()
----> 1 grader.score.ml__survey_model(survey_model.predict)

/opt/conda/lib/python3.7/site-packages/static_grader/grader.py in func(*args, **kw)
     92   def __getattr__(self, method):
     93     def func(*args, **kw):
---> 94       return self(method, *args, **kw)
     95     return func
     96 

/opt/conda/lib/python3.7/site-packages/static_grader/grader.py in __call__(self, question_name, func)
     88       return
     89     test_cases = json.loads(resp.text)
---> 90     test_cases_grading(question_name, func, test_cases)
     91 
     92   def __getattr__(self, method):

/opt/conda/lib/python3.7/site-packages/static_grader/grader.py in test_cases_grading(question_name, func, test_cases)
     40   for test_case in test_cases:
     41     if inspect.isroutine(func):
---> 42       sub_res = func(*test_case['args'], **test_case['kwargs'])
     43     elif not test_case['args'] and not test_case['kwargs']:
     44       sub_res = func

/opt/conda/lib/python3.7/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    114 
    115         # lambda, but not partial, allows help() to work with update_wrapper
--> 116         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    117         # update the docstring of the returned function
    118         update_wrapper(out, self.fn)

/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
    419         Xt = X
    420         for _, name, transform in self._iter(with_final=False):
--> 421             Xt = transform.transform(Xt)
    422         return self.steps[-1][-1].predict(Xt, **predict_params)
    423 

/opt/conda/lib/python3.7/site-packages/sklearn/pipeline.py in transform(self, X)
    963             return np.zeros((X.shape[0], 0))
    964         if any(sparse.issparse(f) for f in Xs):
--> 965             Xs = sparse.hstack(Xs).tocsr()
    966         else:
    967             Xs = np.hstack(Xs)

/opt/conda/lib/python3.7/site-packages/scipy/sparse/construct.py in hstack(blocks, format, dtype)
    463 
    464     """
--> 465     return bmat([blocks], format=format, dtype=dtype)
    466 
    467 

/opt/conda/lib/python3.7/site-packages/scipy/sparse/construct.py in bmat(blocks, format, dtype)
    584                                                     exp=brow_lengths[i],
    585                                                     got=A.shape[0]))
--> 586                     raise ValueError(msg)
    587 
    588                 if bcol_lengths[j] == 0:

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,2].shape[0] == 13892, expected 1544.
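The message says that the third block handed to hstack (index 2, which corresponds to the 'time' entry of the FeatureUnion) has 13892 rows while the first block has 1544. One way to find such a block is to transform the same input with each step separately and compare row counts. The sketch below does that with two toy stand-ins; the class names `SelectColumn` and `ReplayFitData` and the helper `block_row_counts` are hypothetical, and plain duck-typed classes are used instead of the scikit-learn base classes to keep it short.

```python
import pandas as pd


class SelectColumn:
    """Toy well-behaved step: emits one output row per input row."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return pd.DataFrame(X)[[self.column]].to_numpy()


class ReplayFitData(SelectColumn):
    """Toy misbehaving step: transform ignores X and returns fit-time data."""

    def fit(self, X, y=None):
        self.cached_ = pd.DataFrame(X)[[self.column]].to_numpy()
        return self

    def transform(self, X):
        return self.cached_


def block_row_counts(named_transformers, X):
    """Map each transformer name to the number of rows it emits for X."""
    return {name: t.transform(X).shape[0] for name, t in named_transformers}


train = pd.DataFrame({"a": range(5), "b": range(5)})
test = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]  # list of dicts, like the grader input

steps = [("good", SelectColumn("a")), ("bad", ReplayFitData("b"))]
for _, transformer in steps:
    transformer.fit(train)

counts = block_row_counts(steps, test)
# Any step whose row count differs from the input length is the culprit.
mismatched = [name for name, n in counts.items() if n != len(test)]
print(counts, mismatched)
```

With the real model, the same comparison could presumably be run over `survey_model.named_steps['features'].transformer_list`, passing a small slice of the training data that is shorter than the full training set.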

Answer:

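Fixing my TimedeltaTransformer helped. The original version parsed and stored the two date columns in fit, so transform ignored its argument and always returned the differences for the training data (13,892 rows) instead of the rows the grader passed in (1,544). The version below does nothing in fit and computes the difference from whatever X is passed to transform.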

class TimedeltaTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, t1_col, t2_col):
        self.t1_col = t1_col
        self.t2_col = t2_col

    def fit(self, X, y=None):
        # Nothing to learn; all work happens in transform
        return self

    def transform(self, X):
        # The grader passes a list of dicts, so convert to a DataFrame first
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        # Compute the difference from the data passed in, not from fit-time data
        timedelta_series = (pd.to_datetime(X[self.t1_col]) - pd.to_datetime(X[self.t2_col]))
        array_list = []
        for x in timedelta_series:
            array_list.append(x.total_seconds())
        return np.array(array_list).reshape(-1,1)
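As a possible simplification that is not part of the original answer, the loop can be replaced with the pandas `.dt.total_seconds()` accessor. The sketch below uses a hypothetical class name, `VectorizedTimedeltaTransformer`, and made-up dates, and checks that the output row count follows the input to transform rather than the data seen in fit.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class VectorizedTimedeltaTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical variant of the answer's transformer without the Python loop."""

    def __init__(self, t1_col, t2_col):
        self.t1_col = t1_col
        self.t2_col = t2_col

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        delta = pd.to_datetime(X[self.t1_col]) - pd.to_datetime(X[self.t2_col])
        # Series of timedeltas -> float seconds -> column vector
        return delta.dt.total_seconds().to_numpy().reshape(-1, 1)


# Made-up data: three training rows, one row at prediction time.
train = pd.DataFrame({
    "CYCLE_1_SURVEY_DATE": ["2017-01-01", "2017-02-01", "2017-03-01"],
    "CYCLE_2_SURVEY_DATE": ["2018-01-01", "2018-02-01", "2018-03-01"],
})
new_rows = [{"CYCLE_1_SURVEY_DATE": "2017-01-02",
             "CYCLE_2_SURVEY_DATE": "2017-01-01"}]

transformer = VectorizedTimedeltaTransformer(
    "CYCLE_1_SURVEY_DATE", "CYCLE_2_SURVEY_DATE").fit(train)
out = transformer.transform(new_rows)
print(out.shape)  # one row in, one row out
```

Because fit stores nothing, the transformer can be fitted on the training frame and then applied to a list of dicts of any length.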
