TensorFlow 2.0 Hugging Face Transformers, TFBertForSequenceClassification: unexpected output dimensions at inference

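Summary:

I want to fine-tune BERT on a custom dataset for sentence classification. I followed a few examples, such as this one, which was very helpful. I have also looked at this gist.

My problem is that when I run inference on some samples, the output has different dimensions from what I expect.

When I run inference on 23 samples, I get a tuple containing a numpy array of shape (1472, 42), where 42 is my number of classes. I expected shape (23, 42).

Code and other details:

I run inference on the trained model with Keras like this: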

preds = model.predict(features)

where features has been tokenized and converted to a dataset:

test_examples = []
for sample, ground_truth in tests:
    test_examples.append(InputExample(text=sample, category_index=ground_truth))

features = convert_examples_to_tf_dataset(test_examples, tokenizer)

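where sample can be e.g. "A test sentence I want classified" and ground_truth can be e.g. 12, which is the encoded label. Since I am running inference, the ground truth I supply should of course not matter.

The convert_examples_to_tf_dataset function looks as follows (I found it in this gist):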

# Imports this snippet relies on (InputFeatures is assumed to be the transformers class)
from typing import List, Tuple

import tensorflow as tf
from transformers import InputFeatures


def convert_examples_to_tf_dataset(
    examples: List[Tuple[str, int]],
    tokenizer,
    max_length=64,
):
    """
    Loads data into a tf.data.Dataset for fine-tuning a given model.

    Args:
        examples: List of tuples representing the examples to be fed
        tokenizer: Instance of a tokenizer that will tokenize the examples
        max_length: Maximum string length

    Returns:
        a ``tf.data.Dataset`` containing the condensed features of the provided sentences
    """
    features = []  # -> will hold the InputFeatures to be converted later

    for e in examples:
        # The documentation for this method is very thorough, so take a look at it
        input_dict = tokenizer.encode_plus(
            e.text,
            add_special_tokens=True,
            max_length=max_length,  # truncates if len(s) > max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            pad_to_max_length=True,  # pads to the right by default
        )

        # input_ids = token indices in the tokenizer's internal dictionary
        # token_type_ids = binary mask identifying the different sequences in the model
        # attention_mask = binary mask marking the padded positions so the model does not attend to them
        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],
            input_dict["token_type_ids"], input_dict['attention_mask'])

        features.append(
            InputFeatures(
                input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.category_index
            )
        )

    def gen():
        for f in features:
            yield (
                {
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                },
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape([]),
        ),
    )


with tf.device('/cpu:0'):
    train_data = convert_examples_to_tf_dataset(train_examples, tokenizer)
    train_data = train_data.shuffle(buffer_size=len(train_examples), reshuffle_each_iteration=True) \
                           .batch(BATCH_SIZE) \
                           .repeat(-1)

    val_data = convert_examples_to_tf_dataset(val_examples, tokenizer)
    val_data = val_data.shuffle(buffer_size=len(val_examples), reshuffle_each_iteration=True) \
                           .batch(BATCH_SIZE) \
                           .repeat(-1)

It works as I expect; running print(list(features.as_numpy_iterator())[1]) gives the following:

({'input_ids': array([  101, 11639, 19962, 23288, 13264, 35372, 10410,   102,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0], dtype=int32),
  'attention_mask': array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int32),
  'token_type_ids': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
      dtype=int32)}, 6705)

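So far everything looks as I would expect. The tokenizer also appears to be working: three arrays of length 64 (matching the max length I set) and an integer label.

The model was trained as follows: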

config = BertConfig.from_pretrained(
    'bert-base-multilingual-cased',
    num_labels=len(label_encoder.classes_),
    output_hidden_states=False,
    output_attentions=False
)
model = TFBertForSequenceClassification.from_pretrained('bert-base-multilingual-cased', config=config)

# train_data is then a tf.data.Dataset that can be passed to model.fit()
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-05, epsilon=1e-08)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy(name='accuracy')
model.compile(optimizer=optimizer,
              loss=loss,
              metrics=[metric])
model.summary()

history = model.fit(train_data,
                    epochs=EPOCHS,
                    steps_per_epoch=train_steps,
                    validation_data=val_data,
                    validation_steps=val_steps,
                    shuffle=True,
                    )
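The values of train_steps and val_steps are not shown above. Because both datasets end in .repeat(-1) and therefore never run out, Keras relies on these two numbers to know where an epoch ends. A minimal sketch of one common way to derive them, assuming one epoch should cover every example once; the helper name and the dataset sizes below are hypothetical:

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of batches needed to cover every example once (last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


# Hypothetical sizes, for illustration only; substitute len(train_examples) etc.
BATCH_SIZE = 32
train_steps = steps_per_epoch(1000, BATCH_SIZE)
val_steps = steps_per_epoch(200, BATCH_SIZE)

print(train_steps, val_steps)  # 32 7
```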

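Results

The problem now is that when I run the prediction preds = model.predict(features), the output dimensions do not match what the documentation says: logits (Numpy array or tf.Tensor of shape (batch_size, config.num_labels)):. What I get is a tuple containing a numpy array of shape (1472, 42).

42 makes sense, since that is my number of classes. I sent 23 test samples, and 23 x 64 = 1472. 64 is my maximum sentence length, so that number looks familiar. Is this output incorrect? How can I convert it into an actual class prediction for each input sample? I get 1472 predictions where I expect 23.

Please let me know if I can provide more details that would help solve this.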

Answer:

I found the problem: if you get unexpected dimensions when using TensorFlow datasets (tf.data.Dataset), it may be because .batch was never called.
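To see why a missing .batch call matters, here is a minimal sketch comparing element shapes before and after batching. It assumes TensorFlow 2.x and uses an all-zero tensor as a hypothetical stand-in for the 23 tokenized test sentences; the batch size of 8 is arbitrary. Keras treats every element of a tf.data.Dataset as one batch, so an unbatched element of shape (64,) is presumably read as 64 separate inputs, which would explain the 23 x 64 = 1472 rows.

```python
import tensorflow as tf

# Hypothetical stand-in for the tokenized test set: 23 examples, 64 token ids each.
NUM_EXAMPLES, MAX_LENGTH, BATCH_SIZE = 23, 64, 8
input_ids = tf.zeros((NUM_EXAMPLES, MAX_LENGTH), dtype=tf.int32)

unbatched = tf.data.Dataset.from_tensor_slices({"input_ids": input_ids})
batched = unbatched.batch(BATCH_SIZE)

# Each unbatched element is a single example with no batch axis: [64].
unbatched_shape = unbatched.element_spec["input_ids"].shape.as_list()

# After .batch(), every element gains a leading batch axis: [None, 64].
batched_shape = batched.element_spec["input_ids"].shape.as_list()

# Concrete shapes of the batches actually produced (the last one is partial).
batch_shapes = [tuple(b["input_ids"].shape.as_list()) for b in batched]

print(unbatched_shape)
print(batched_shape)
print(batch_shapes)
```

With batched input, predict should return one row of logits per sentence, which for the data in the question means shape (23, 42).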

So in my case, given:

features = convert_examples_to_tf_dataset(test_examples, tokenizer)

adding:

features = features.batch(BATCH_SIZE)

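made it work as I expected. So this was not an issue with TFBertForSequenceClassification; my input was simply wrong. I would also like to add a reference to this answer, which led me to the problem.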
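Once the logits have the expected shape, they still need to be turned into class predictions, which was the second half of the question. A minimal sketch, assuming preds is a tuple whose first element holds the logits, as described in the question. The random logits below are placeholders rather than real model output, and the label_encoder mentioned in the last comment is assumed to be a scikit-learn LabelEncoder:

```python
import numpy as np

NUM_EXAMPLES, NUM_LABELS = 23, 42

# Placeholder for the output of model.predict(features): a tuple holding the logits.
rng = np.random.default_rng(seed=0)
preds = (rng.normal(size=(NUM_EXAMPLES, NUM_LABELS)),)

logits = preds[0]

# The index of the largest logit in each row is the predicted class id.
predicted_ids = np.argmax(logits, axis=1)

# Optional: a numerically stable softmax turns logits into per-class probabilities.
shifted = logits - logits.max(axis=1, keepdims=True)
probabilities = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(predicted_ids.shape)   # (23,)
print(probabilities.shape)   # (23, 42)

# Assumption: with a scikit-learn LabelEncoder, the original labels could be
# recovered via label_encoder.inverse_transform(predicted_ids).
```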
