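Is there a difference between the fit methods of GridSearchCV and Pipeline in sklearn?

Maybe this is just a bug, or maybe I am really stupid. I (or, more accurately, a colleague) wrapped a Keras model, together with some Keras transformations, so that we can use the Keras model with the sklearn library.

When I call fit on the Pipeline, it works fine: it runs and returns a working model instance. When I use GridSearchCV, however, the transformations do not seem to be applied correctly and I get the following error: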

    InvalidArgumentError (see above for traceback): indices[11,2] = 26048 is not in [0, 10001)
         [[Node: embedding_4/Gather = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](embedding_4/embeddings/read, embedding_4/Cast)]]

The code looks roughly like this:

    # Imports assumed for Keras 2.x and scikit-learn 0.19 (not shown in the original post)
    import numpy as np
    from keras.layers import Dense, Embedding, LSTM
    from keras.models import Sequential
    from keras.preprocessing.sequence import pad_sequences
    from keras.preprocessing.text import Tokenizer
    from keras.wrappers.scikit_learn import KerasClassifier
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.pipeline import Pipeline

    vocab_size = 10001

    class TextsToSequences(Tokenizer, BaseEstimator, TransformerMixin):
        # All Tokenizer arguments (including num_words) arrive through **kwargs
        def __init__(self, **kwargs):
            super().__init__(**kwargs)

        def fit(self, X, y=None):
            print('fitting the text')
            print(self.document_count)
            self.fit_on_texts(X)
            return self

        def transform(self, X, y=None):
            print('transforming the text')
            r = np.array(self.texts_to_sequences(X))
            print(r)
            print(self.document_count)
            return r

    class Padder(BaseEstimator, TransformerMixin):
        def __init__(self, maxlen=500):
            self.maxlen = maxlen
            self.max_index = None

        def fit(self, X, y=None):
            #self.max_index = pad_sequences(X, maxlen=self.maxlen).max()
            return self

        def transform(self, X, y=None):
            print('pad the text')
            X = pad_sequences(X, maxlen=self.maxlen, padding='post')
            #X[X > self.max_index] = 0
            print(X)
            return X

    maxlen = 15

    def makeLstmModel():
        model = Sequential()
        # The embedding accepts word indices in [0, 10001)
        model.add(Embedding(10001, 100, input_length=15))
        model.add(LSTM(35, dropout=0.2, recurrent_dropout=0.2))
        model.add(Dense(16, activation='sigmoid'))
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
        model.summary()
        return model

    lstmmodel = KerasClassifier(build_fn=makeLstmModel, epochs=5, batch_size=1000, verbose=42)

    pipeline = [
        ('seq', TextsToSequences(num_words=vocab_size)),
        ('pad', Padder(maxlen)),
        ('clf', lstmmodel)
    ]
    textClassifier = Pipeline(pipeline)

    # Setup parameters
    parameters = {}  # Some params to use in gridsearch

    # numberOfFolds, x_train and y_train are defined elsewhere (not shown in the post)
    skf = StratifiedKFold(n_splits=numberOfFolds, shuffle=True, random_state=1)
    gscv = GridSearchCV(textClassifier, parameters, cv=skf, iid=False, n_jobs=1, verbose=50)
    gscv.fit(x_train, y_train)

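The code above fails with the InvalidArgumentError, but when I call fit on the Pipeline directly, it works.

Is there a difference between the fit() methods of GridSearchCV and Pipeline? Am I really stupid, or is this just a bug?

By the way, I am currently forced to use sklearn 0.19.1.

Answer: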

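After hours of thinking and debugging, I came to the following conclusions:

Pipeline.fit() keeps the arguments that were passed to a step through **kwargs.

GridSearchCV.fit() does not keep the arguments that were passed to a step through **kwargs.

I tested this on sklearn 0.19.1.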

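My problem was that the bag of words is built with the Keras Tokenizer, whose num_words parameter limits the bag to a maximum number of words. My colleague's setup relies on that limit so that the number of words matches the input dimension of the LSTM model. Because num_words was never actually set, the bag of words was always larger than the input dimension.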
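The error message reports exactly this mismatch: index 26048 was fed to an embedding that only accepts indices in [0, 10001). A minimal sketch of a check that could be run on the padded sequences before they reach the model is shown below. The helper name out_of_range_indices and the sample data are hypothetical and not part of the original code.

```python
def out_of_range_indices(sequences, input_dim):
    """Return (row, column, value) for every index outside [0, input_dim)."""
    return [
        (row, col, value)
        for row, seq in enumerate(sequences)
        for col, value in enumerate(seq)
        if not 0 <= value < input_dim
    ]


# Hypothetical padded sequences; 26048 mimics the index from the error message.
padded = [[5, 17, 0], [3, 26048, 9]]

bad = out_of_range_indices(padded, input_dim=10001)
print(bad)  # [(1, 1, 26048)]
```

An empty result means every index fits the embedding. Any entry in the list points at a word index that the tokenizer produced but the model cannot look up.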

num_words was passed to the Tokenizer as a **kwargs argument.

    class TextsToSequences(Tokenizer, BaseEstimator, TransformerMixin):
        def __init__(self, **kwargs):
            super().__init__(**kwargs)

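GridSearchCV.fit() does not pass these arguments through to the estimators it actually fits. The solution is to declare num_words as an explicit parameter of __init__.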

    class TextsToSequences(Tokenizer, BaseEstimator, TransformerMixin):
        def __init__(self, num_words=8000, **kwargs):
            super().__init__(num_words, **kwargs)

After making this change, GridSearchCV.fit() worked as expected.
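A likely explanation for the difference is that GridSearchCV fits clones of the estimator rather than the instance it was given. sklearn.base.clone rebuilds an estimator from get_params(), and get_params() only reports arguments that are named in the __init__ signature, so a value passed through **kwargs is not carried over. The sketch below reproduces this without Keras. The classes KwargsOnly and ExplicitParam are hypothetical stand-ins for TextsToSequences, and the example assumes a reasonably recent scikit-learn.

```python
from sklearn.base import BaseEstimator, clone


class KwargsOnly(BaseEstimator):
    """Hypothetical stand-in: num_words arrives only through **kwargs."""

    def __init__(self, **kwargs):
        self.num_words = kwargs.get("num_words")


class ExplicitParam(BaseEstimator):
    """Hypothetical stand-in: num_words is a named __init__ argument."""

    def __init__(self, num_words=None):
        self.num_words = num_words


hidden = KwargsOnly(num_words=10001)
explicit = ExplicitParam(num_words=10001)

# get_params() only reports arguments named in the __init__ signature.
print(hidden.get_params())    # {}
print(explicit.get_params())  # {'num_words': 10001}

# clone() rebuilds the estimator from get_params(), so the **kwargs value is lost.
print(clone(hidden).num_words)    # None
print(clone(explicit).num_words)  # 10001
```

Calling fit on the Pipeline directly uses the original instances, which still hold num_words, and that would explain why only the GridSearchCV run produced out-of-range indices.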
