Is there a step-by-step explanation of how to fine-tune a HuggingFace BERT model for text classification?

Answer:

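Fine-tuning approaches

There are several approaches to fine-tuning BERT for a target task.

1. Further pre-train the base BERT model
2. Add custom classification layer(s) on top of the trainable base BERT model
3. Add custom classification layer(s) on top of the non-trainable (frozen) base BERT model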

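Note that, as described in the original paper, the base BERT model was pre-trained on only two tasks.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

3.1 Pre-training BERT ...we pre-train BERT using two unsupervised tasks

Task #1: Masked LM

Task #2: Next Sentence Prediction (NSP)

Hence, the base BERT model is like a half-baked product that can be fully baked for the target domain (first approach). Alternatively, we can use it as part of our own custom model, with the base model either trainable (second approach) or not trainable (third approach).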

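First approach

How to Fine-Tune BERT for Text Classification? demonstrates the first approach, further pre-training, and points out that the learning rate is the key to avoiding catastrophic forgetting, in which pre-trained knowledge is erased while new knowledge is being learned.

We find that a lower learning rate, such as 2e-5, is necessary to make BERT overcome the catastrophic forgetting problem. With an aggressive learning rate of 4e-4, the training set fails to converge.

This is probably why the BERT paper used 5e-5, 4e-5, 3e-5, and 2e-5 for fine-tuning.

We use a batch size of 32 and fine-tune for 3 epochs over the data for all GLUE tasks. For each task, we selected the best fine-tuning learning rate (among 5e-5, 4e-5, 3e-5, and 2e-5) on the Dev set.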
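The selection procedure in the quote amounts to a small grid search over learning rates. A minimal sketch, assuming a scoring callable that fine-tunes with a given learning rate and returns a dev-set metric; `select_learning_rate` and `placeholder_dev_score` are hypothetical names, and the placeholder is not a measurement:

```python
def select_learning_rate(candidates, dev_score):
    """Return the candidate with the highest dev-set score, plus all scores."""
    scores = {lr: dev_score(lr) for lr in candidates}
    return max(scores, key=scores.get), scores


CANDIDATES = (5e-5, 4e-5, 3e-5, 2e-5)  # learning rates listed in the BERT paper quote


def placeholder_dev_score(learning_rate):
    # Stand-in only: a real version would fine-tune for 3 epochs with this
    # learning rate and return the metric measured on the dev set.
    return -abs(learning_rate - 3e-5)


best, scores = select_learning_rate(CANDIDATES, placeholder_dev_score)
print(best)
```

In practice the placeholder would be replaced by a function that trains and evaluates the model once per candidate.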

Note that pre-training of the base model itself used a higher learning rate.

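bert-base-uncased - pretraining

The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer used is Adam with a learning rate of 1e-4, β1=0.9 and β2=0.999, a weight decay of 0.01, learning rate warmup for 10,000 steps and linear decay of the learning rate after.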
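To make the quoted schedule concrete, here is a rough sketch of "warmup for 10,000 steps, then linear decay". The model card does not state the shape of the warmup or the final value, so a linear warmup from 0 and a decay that reaches 0 at the last step are assumptions; `learning_rate_at` is a hypothetical helper.

```python
PEAK_LR = 1e-4            # learning rate from the model card
WARMUP_STEPS = 10_000     # warmup length from the model card
TOTAL_STEPS = 1_000_000   # "one million steps"


def learning_rate_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay (assumed to hit 0 at TOTAL_STEPS)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 5_000, 10_000, 505_000, 1_000_000):
    print(step, learning_rate_at(step))
```

Under these assumptions the peak of 1e-4 is only reached at step 10,000, and the rate is already down to half of that midway through the decay phase.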

The first approach is described below as part of the third approach.

FYI: TFDistilBertModel is the bare base model, and its layer is named distilbert.

    Model: "tf_distil_bert_model_1"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    distilbert (TFDistilBertMain multiple                  66362880
    =================================================================
    Total params: 66,362,880
    Trainable params: 66,362,880
    Non-trainable params: 0

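Second approach

Huggingface takes the second approach, as shown in Fine-tuning with native PyTorch/TensorFlow, where TFDistilBertForSequenceClassification adds the custom classification layer classifier on top of the trainable base distilbert model. Here too, a small learning rate is required to avoid catastrophic forgetting.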

    import tensorflow as tf
    from transformers import TFDistilBertForSequenceClassification

    model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

    optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
    model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
    # train_dataset is a tf.data.Dataset; it is batched here, so batch_size is not passed to fit()
    model.fit(train_dataset.shuffle(1000).batch(16), epochs=3)

    Model: "tf_distil_bert_for_sequence_classification_2"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    distilbert (TFDistilBertMain multiple                  66362880
    _________________________________________________________________
    pre_classifier (Dense)       multiple                  590592
    _________________________________________________________________
    classifier (Dense)           multiple                  1538
    _________________________________________________________________
    dropout_59 (Dropout)         multiple                  0
    =================================================================
    Total params: 66,955,010
    Trainable params: 66,955,010  <--- All parameters are trainable
    Non-trainable params: 0

Implementation of the second approach

    import pandas as pd
    import tensorflow as tf
    from sklearn.model_selection import train_test_split
    from transformers import (
        DistilBertTokenizerFast,
        TFDistilBertForSequenceClassification,
    )

    DATA_COLUMN = 'text'
    LABEL_COLUMN = 'category_index'
    MAX_SEQUENCE_LENGTH = 512
    LEARNING_RATE = 5e-5
    BATCH_SIZE = 16
    NUM_EPOCHS = 3

    # --------------------------------------------------------------------------------
    # Tokenizer
    # --------------------------------------------------------------------------------
    tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

    def tokenize(sentences, max_length=MAX_SEQUENCE_LENGTH, padding='max_length'):
        """Tokenize using the Huggingface tokenizer
        Args:
            sentences: String or list of string to tokenize
            max_length: Sequences are truncated or padded to this length
            padding: Padding method ['do_not_pad'|'longest'|'max_length']
        """
        return tokenizer(
            sentences,
            truncation=True,
            padding=padding,
            max_length=max_length,
            return_tensors="tf"
        )

    # --------------------------------------------------------------------------------
    # Load data
    # --------------------------------------------------------------------------------
    raw_train = pd.read_csv("./train.csv")
    # NUM_LABELS was undefined in this snippet; defined as in the third approach's configuration.
    NUM_LABELS = len(raw_train[LABEL_COLUMN].unique())
    train_data, validation_data, train_label, validation_label = train_test_split(
        raw_train[DATA_COLUMN].tolist(),
        raw_train[LABEL_COLUMN].tolist(),
        test_size=.2,
        shuffle=True
    )

    # --------------------------------------------------------------------------------
    # Prepare TF dataset
    # --------------------------------------------------------------------------------
    train_dataset = tf.data.Dataset.from_tensor_slices((
        dict(tokenize(train_data)),  # Convert BatchEncoding instance to dictionary
        train_label
    )).shuffle(1000).batch(BATCH_SIZE).prefetch(1)
    validation_dataset = tf.data.Dataset.from_tensor_slices((
        dict(tokenize(validation_data)),
        validation_label
    )).batch(BATCH_SIZE).prefetch(1)

    # --------------------------------------------------------------------------------
    # Training
    # --------------------------------------------------------------------------------
    model = TFDistilBertForSequenceClassification.from_pretrained(
        'distilbert-base-uncased',
        num_labels=NUM_LABELS
    )
    optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
    model.compile(
        optimizer=optimizer,
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
    # The datasets are already batched, so batch_size is not passed to fit().
    model.fit(
        x=train_dataset,
        y=None,
        validation_data=validation_dataset,
        epochs=NUM_EPOCHS,
    )

Third approach

Basics

Note that the images were taken from A Visual Guide to Using BERT for the First Time and modified.

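Tokenizer

The tokenizer generates an instance of BatchEncoding, which can be used like a Python dictionary and passed as the input to the BERT model.

BatchEncoding

Holds the output of the encode_plus() and batch_encode() methods (tokens, attention_masks, etc).
This class is derived from a python dictionary and can be used as a dictionary. In addition, this class exposes utility methods to map from word/character space to token space.

Parameters

data (dict) - Dictionary of lists/arrays/tensors returned by the encode/batch_encode methods ('input_ids', 'attention_mask', etc.).

The data attribute of the class holds the generated tokens, with input_ids and attention_mask elements.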

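input_ids

Input IDs

The input ids are often the only required parameters to be passed to the model as input. They are token indices, numerical representations of tokens building the sequences that will be used as input by the model.

attention_mask

Attention mask

This argument indicates to the model which tokens should be attended to, and which should not.

If the attention_mask is 0, the token id is ignored. For instance, if a sequence is padded to adjust the sequence length, the padded tokens should be ignored, hence their attention_mask is 0.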
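The relationship between padding and the mask can be sketched without the tokenizer. This is an illustration only: `pad_with_mask` is a hypothetical helper, the ids 0, 101 and 102 are the [PAD], [CLS] and [SEP] ids of the bert-base-uncased vocabulary, and the ids in between are arbitrary.

```python
PAD_ID = 0  # [PAD] id in the bert-base-uncased vocabulary


def pad_with_mask(token_ids, max_length):
    """Truncate or right-pad one sequence and return (input_ids, attention_mask).

    Truncation here is a plain slice; the real tokenizer also keeps the
    trailing [SEP] when it truncates.
    """
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)               # 1 = real token, attend to it
    padding = max_length - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding  # 0 = padding, ignore


input_ids, attention_mask = pad_with_mask([101, 7592, 2088, 102], max_length=8)
print(input_ids)
print(attention_mask)
```

The Huggingface tokenizer produces both lists itself when called with padding='max_length'; the sketch only shows how the two line up position by position.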

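Special tokens

BertTokenizer adds special tokens, enclosing a sequence with [CLS] and [SEP]. [CLS] stands for classification and [SEP] separates sequences. For question answering or paraphrase tasks, [SEP] separates the two sentences to compare.

BertTokenizer

cls_token (str, optional, defaults to "[CLS]")
The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.

sep_token (str, optional, defaults to "[SEP]")
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.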
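The layout these two tokens produce can be sketched in a few lines. `add_special_tokens` is a hypothetical helper written for illustration, not the tokenizer's own method, and the segment ids follow the sentence A / sentence B convention of the BERT paper.

```python
CLS, SEP = "[CLS]", "[SEP]"


def add_special_tokens(tokens_a, tokens_b=None):
    """Return (tokens, segment_ids) laid out as [CLS] A [SEP] or [CLS] A [SEP] B [SEP]."""
    tokens = [CLS] + list(tokens_a) + [SEP]
    segment_ids = [0] * len(tokens)          # sentence A, including [CLS] and its [SEP]
    if tokens_b is not None:
        tokens_b = list(tokens_b)
        tokens += tokens_b + [SEP]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B and the final [SEP]
    return tokens, segment_ids


single, _ = add_special_tokens(["hello", "world"])
pair, segments = add_special_tokens(["how", "are", "you"], ["fine", "thanks"])
print(single)
print(pair)
print(segments)
```

For single-sentence classification only the first form is needed, with [CLS] always at position 0.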

A Visual Guide to Using BERT for the First Time shows the tokenization process.

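[CLS]

The embedding vector for [CLS] in the output of the base model's final layer represents the classification that has been learned by the base model. Hence, feed the embedding vector of the [CLS] token into the classification layer added on top of the base model.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B.

The model structure is illustrated in the figure below.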

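Vector size

In the model distilbert-base-uncased, each token is embedded into a vector of size 768. The shape of the output from the base model is (batch_size, max_sequence_length, embedding_vector_size=768). This matches the BERT paper's description of the BERT/BASE model, on which distilbert-base-uncased is based.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT/BASE (L=12, H=768, A=12, Total Parameters=110M) and BERT/LARGE (L=24, H=1024, A=16, Total Parameters=340M).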

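Base model - TFDistilBertModel

Hugging Face Transformers: Fine-tuning DistilBERT for Binary Classification Tasks

TFDistilBertModel class to instantiate the base DistilBERT model without any specific head on top (as opposed to other classes such as TFDistilBertForSequenceClassification that do have an added classification head).

We do not want any task-specific head attached because we simply want the pre-trained weights of the base model to provide a general understanding of the English language, and it will be our job to add our own classification head during the fine-tuning process in order to help the model distinguish between toxic and non-toxic comments.

TFDistilBertModel generates an instance of TFBaseModelOutput whose last_hidden_state parameter is the output from the model's last layer.

    TFBaseModelOutput([(
        'last_hidden_state',
        <tf.Tensor: shape=(batch_size, sequence_length, 768), dtype=float32, numpy=array([[[...]]], dtype=float32)>
    )])

TFBaseModelOutput

Parameters

last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) - Sequence of hidden-states at the output of the last layer of the model.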
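Since [CLS] is always the first token, its embedding is the vector at position 0 of each sequence in last_hidden_state. A small pure-Python sketch with a toy tensor (hidden size 4 instead of 768; `cls_embeddings` is a hypothetical helper) shows what the slice `[:, 0, :]` selects:

```python
def cls_embeddings(last_hidden_state):
    """Pick the vector at position 0 of every sequence: the analogue of [:, 0, :]."""
    return [sequence[0] for sequence in last_hidden_state]


# Toy values with batch_size=2, sequence_length=3, hidden_size=4.
last_hidden_state = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]
cls = cls_embeddings(last_hidden_state)
print(len(cls), len(cls[0]))  # batch_size, hidden_size
```

The result has shape (batch_size, hidden_size), which is the input a classification head on top of the base model would receive.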

Implementation

Python modules

    import datetime  # needed for TIMESTAMP in the configuration

    import pandas as pd
    import tensorflow as tf
    from sklearn.model_selection import train_test_split
    from transformers import (
        DistilBertTokenizerFast,
        TFDistilBertModel,
    )

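Configuration

    TIMESTAMP = datetime.datetime.now().strftime("%Y%b%d%H%M").upper()
    DATA_COLUMN = 'text'
    LABEL_COLUMN = 'category_index'

    MAX_SEQUENCE_LENGTH = 512   # Max length allowed for BERT is 512.
    # raw_train must already be loaded with pd.read_csv before this line runs.
    NUM_LABELS = len(raw_train[LABEL_COLUMN].unique())

    MODEL_NAME = 'distilbert-base-uncased'
    NUM_BASE_MODEL_OUTPUT = 768

    # Flag to freeze the base model
    FREEZE_BASE = True

    # Flag to add custom classification heads
    USE_CUSTOM_HEAD = True
    if not USE_CUSTOM_HEAD:
        # Make the base model trainable when no classification head exists.
        FREEZE_BASE = False

    BATCH_SIZE = 16
    NUM_EPOCHS = 3  # used by model.fit; value taken from the second approach
    LEARNING_RATE = 1e-2 if FREEZE_BASE else 5e-5
    L2 = 0.01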

Tokenizer

    tokenizer = DistilBertTokenizerFast.from_pretrained(MODEL_NAME)

    def tokenize(sentences, max_length=MAX_SEQUENCE_LENGTH, padding='max_length'):
        """Tokenize using the Huggingface tokenizer
        Args:
            sentences: String or list of string to tokenize
            max_length: Sequences are truncated or padded to this length
            padding: Padding method ['do_not_pad'|'longest'|'max_length']
        """
        return tokenizer(
            sentences,
            truncation=True,
            padding=padding,
            max_length=max_length,
            return_tensors="tf"
        )

Input layer

The base model expects input_ids and attention_mask, each of shape (max_sequence_length,). Generate a Keras tensor for each of them with an Input layer.

    # Inputs for token indices and attention masks
    input_ids = tf.keras.layers.Input(shape=(MAX_SEQUENCE_LENGTH,), dtype=tf.int32, name='input_ids')
    attention_mask = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32, name='attention_mask')

Base model layer

Generate the output from the base model. The base model generates TFBaseModelOutput. Feed the embedding of [CLS] to the next layer.

    base = TFDistilBertModel.from_pretrained(
        MODEL_NAME,
        num_labels=NUM_LABELS
    )

    # Freeze the base model weights.
    if FREEZE_BASE:
        for layer in base.layers:
            layer.trainable = False
        base.summary()

    # [CLS] embedding is last_hidden_state[:, 0, :]
    output = base([input_ids, attention_mask]).last_hidden_state[:, 0, :]

Classification layers

    if USE_CUSTOM_HEAD:
        # --------------------------------------------------------------------------------
        # Classification layer 01
        # --------------------------------------------------------------------------------
        output = tf.keras.layers.Dropout(
            rate=0.15,
            name="01_dropout",
        )(output)

        output = tf.keras.layers.Dense(
            units=NUM_BASE_MODEL_OUTPUT,
            kernel_initializer='glorot_uniform',
            activation=None,
            name="01_dense_relu_no_regularizer",
        )(output)
        output = tf.keras.layers.BatchNormalization(
            name="01_bn"
        )(output)
        output = tf.keras.layers.Activation(
            "relu",
            name="01_relu"
        )(output)

        # --------------------------------------------------------------------------------
        # Classification layer 02
        # --------------------------------------------------------------------------------
        output = tf.keras.layers.Dense(
            units=NUM_BASE_MODEL_OUTPUT,
            kernel_initializer='glorot_uniform',
            activation=None,
            name="02_dense_relu_no_regularizer",
        )(output)
        output = tf.keras.layers.BatchNormalization(
            name="02_bn"
        )(output)
        output = tf.keras.layers.Activation(
            "relu",
            name="02_relu"
        )(output)

Softmax layer

    output = tf.keras.layers.Dense(
        units=NUM_LABELS,
        kernel_initializer='glorot_uniform',
        kernel_regularizer=tf.keras.regularizers.l2(l2=L2),
        activation='softmax',
        name="softmax"
    )(output)
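Because this layer applies softmax, its outputs are probabilities rather than raw logits, so a sparse categorical cross-entropy loss paired with it would be configured with from_logits=False. The pure-Python sketch below, with hypothetical scores and hand-written helpers standing in for the Keras loss, shows that both conventions give the same value as long as softmax is applied exactly once:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def sparse_categorical_crossentropy(outputs, label, from_logits):
    """Negative log-probability of the integer class `label`."""
    probabilities = softmax(outputs) if from_logits else outputs
    return -math.log(probabilities[label])


logits = [2.0, -1.0]  # hypothetical raw scores for two classes
label = 0
loss_from_logits = sparse_categorical_crossentropy(logits, label, from_logits=True)
loss_from_probs = sparse_categorical_crossentropy(softmax(logits), label, from_logits=False)
print(loss_from_logits, loss_from_probs)
```

Setting from_logits=True on outputs that already went through softmax would apply it twice and distort the loss.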

Final custom model

    name = f"{TIMESTAMP}_{MODEL_NAME.upper()}"
    model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=output, name=name)
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
        optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
        metrics=['accuracy']
    )
    model.summary()
    ---
    Layer (type)                    Output Shape         Param #     Connected to
    ==================================================================================================
    input_ids (InputLayer)          [(None, 256)]        0
    __________________________________________________________________________________________________
    attention_mask (InputLayer)     [(None, 256)]        0
    __________________________________________________________________________________________________
    tf_distil_bert_model (TFDistilB TFBaseModelOutput(la 66362880    input_ids[0][0]
                                                                     attention_mask[0][0]
    __________________________________________________________________________________________________
    tf.__operators__.getitem_1 (Sli (None, 768)          0           tf_distil_bert_model[1][0]
    __________________________________________________________________________________________________
    01_dropout (Dropout)            (None, 768)          0           tf.__operators__.getitem_1[0][0]
    __________________________________________________________________________________________________
    01_dense_relu_no_regularizer (D (None, 768)          590592      01_dropout[0][0]
    __________________________________________________________________________________________________
    01_bn (BatchNormalization)      (None, 768)          3072        01_dense_relu_no_regularizer[0][0
    __________________________________________________________________________________________________
    01_relu (Activation)            (None, 768)          0           01_bn[0][0]
    __________________________________________________________________________________________________
    02_dense_relu_no_regularizer (D (None, 768)          590592      01_relu[0][0]
    __________________________________________________________________________________________________
    02_bn (BatchNormalization)      (None, 768)          3072        02_dense_relu_no_regularizer[0][0
    __________________________________________________________________________________________________
    02_relu (Activation)            (None, 768)          0           02_bn[0][0]
    __________________________________________________________________________________________________
    softmax (Dense)                 (None, 2)            1538        02_relu[0][0]
    ==================================================================================================
    Total params: 67,551,746
    Trainable params: 1,185,794
    Non-trainable params: 66,365,952   <--- Base BERT model is frozen
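The trainable and non-trainable totals in the summary can be re-derived by hand. The sketch assumes the standard Keras parameter counts (a Dense layer has inputs × units weights plus units biases; BatchNormalization has trainable gamma and beta plus non-trainable moving mean and variance); the helper names are hypothetical.

```python
def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


def batch_norm_params(features):
    """Return (trainable, non_trainable): gamma/beta vs. moving mean/variance."""
    return 2 * features, 2 * features


HIDDEN = 768
NUM_LABELS = 2            # matches the (None, 2) softmax output in the summary
BASE_PARAMS = 66_362_880  # frozen DistilBERT base

bn_trainable, bn_frozen = batch_norm_params(HIDDEN)
trainable = (
    2 * dense_params(HIDDEN, HIDDEN)      # 01_dense and 02_dense
    + 2 * bn_trainable                    # 01_bn and 02_bn
    + dense_params(HIDDEN, NUM_LABELS)    # softmax layer
)
non_trainable = BASE_PARAMS + 2 * bn_frozen
print(trainable, non_trainable, trainable + non_trainable)
```

The non-trainable count is slightly larger than the base model alone because the batch-normalization moving statistics are not trained either.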

Data allocation

    # --------------------------------------------------------------------------------
    # Split data into training and validation
    # --------------------------------------------------------------------------------
    raw_train = pd.read_csv("./train.csv")
    train_data, validation_data, train_label, validation_label = train_test_split(
        raw_train[DATA_COLUMN].tolist(),
        raw_train[LABEL_COLUMN].tolist(),
        test_size=.2,
        shuffle=True
    )

    # X = dict(tokenize(train_data))
    # Y = tf.convert_to_tensor(train_label)
    X = tf.data.Dataset.from_tensor_slices((
        dict(tokenize(train_data)),  # Convert BatchEncoding instance to dictionary
        train_label
    )).batch(BATCH_SIZE).prefetch(1)

    V = tf.data.Dataset.from_tensor_slices((
        dict(tokenize(validation_data)),  # Convert BatchEncoding instance to dictionary
        validation_label
    )).batch(BATCH_SIZE).prefetch(1)
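For readers unfamiliar with what the split and the batching do to the data, here is a dependency-free sketch of the same two steps on toy samples. It is not scikit-learn's or TensorFlow's implementation; `split_train_validation` and `batches` are hypothetical helpers.

```python
import random


def split_train_validation(samples, labels, validation_fraction=0.2, seed=0):
    """Shuffle indices once, then hold out the last `validation_fraction` of them."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    cut = len(indices) - int(len(indices) * validation_fraction)

    def pick(items, chosen):
        return [items[i] for i in chosen]

    return (pick(samples, indices[:cut]), pick(samples, indices[cut:]),
            pick(labels, indices[:cut]), pick(labels, indices[cut:]))


def batches(items, batch_size):
    """Yield consecutive chunks; the last chunk may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"comment {i}" for i in range(50)]  # toy data
labels = [i % 2 for i in range(50)]
train_x, val_x, train_y, val_y = split_train_validation(texts, labels)
print(len(train_x), len(val_x), len(list(batches(train_x, 16))))
```

Texts and labels are shuffled with the same index order, so each sample keeps its label after the split.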

Training

    # --------------------------------------------------------------------------------
    # Train the model
    # https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit
    # Input data x can be a dict mapping input names to the corresponding array/tensors,
    # if the model has named inputs. Beware of the "names". y should be consistent with x
    # (you cannot have Numpy inputs and tensor targets, or inversely).
    # --------------------------------------------------------------------------------
    history = model.fit(
        x=X,    # tf.data.Dataset of (feature dictionary, label) batches
        # y=Y,
        y=None,
        epochs=NUM_EPOCHS,
        # batch_size is not passed: X and V are already batched datasets
        validation_data=V,
    )

To implement the first approach, change the configuration as below.

    USE_CUSTOM_HEAD = False

FREEZE_BASE is then changed to False and LEARNING_RATE to 5e-5, which runs further pre-training on the base BERT model.
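The way the two flags interact can be checked in isolation. The function below is a hypothetical restatement of the configuration logic (it is not part of the original code), using the same learning rates:

```python
def resolve_config(use_custom_head: bool, freeze_base: bool = True):
    """Mirror the configuration logic: no custom head forces a trainable base."""
    if not use_custom_head:
        freeze_base = False
    learning_rate = 1e-2 if freeze_base else 5e-5
    return freeze_base, learning_rate


print(resolve_config(use_custom_head=True))   # frozen base with custom head (third approach)
print(resolve_config(use_custom_head=False))  # trainable base, no custom head (first approach)
```

The larger learning rate only applies when the base is frozen, so the small rate that guards against catastrophic forgetting is used whenever the pre-trained weights can change.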

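Saving model

For the third approach, saving the model causes problems. The save_pretrained method of the Huggingface Model cannot be used because the model is not a direct subclass of Huggingface PreTrainedModel.

Keras save_model causes an error with the default save_traces=True, or causes a different error with save_traces=False when loading the model with Keras load_model.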

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-71-01d66991d115> in <module>()
    ----> 1 tf.keras.models.load_model(MODEL_DIRECTORY)

    11 frames
    /usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/saving/saved_model/load.py in _unable_to_call_layer_due_to_serialization_issue(layer, *unused_args, **unused_kwargs)
        865       'recorded when the object is called, and used when saving. To manually '
        866       'specify the input shape/dtype, decorate the call function with '
    --> 867       '`@tf.function(input_signature=...)`.'.format(layer.name, type(layer)))
        868
        869

    ValueError: Cannot call custom layer tf_distil_bert_model of type <class 'tensorflow.python.keras.saving.saved_model.load.TFDistilBertModel'>, because the call function was not serialized to the SavedModel.Please try one of the following methods to fix this issue:

    (1) Implement `get_config` and `from_config` in the layer/model class, and pass the object to the `custom_objects` argument when loading the model. For more details, see: https://www.tensorflow.org/guide/keras/save_and_serialize

    (2) Ensure that the subclassed model or layer overwrites `call` and not `__call__`. The input shape and dtype will be automatically recorded when the object is called, and used when saving. To manually specify the input shape/dtype, decorate the call function with `@tf.function(input_signature=...)`.

As far as I tested, only Keras Model save_weights worked.

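Experiments

As far as I tested with the Toxic Comment Classification Challenge, the first approach gave better recall (identifying truly toxic comments and truly non-toxic comments). The code can be accessed as below. Please provide corrections or suggestions if anything is wrong.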
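For reference, recall for one class is the share of samples truly in that class that the model also predicts as that class. A minimal sketch with made-up labels (these are not results from the experiment; `recall` is a hypothetical helper):

```python
def recall(y_true, y_pred, positive):
    """Fraction of samples whose true class is `positive` that were predicted as `positive`."""
    relevant = [pred for true, pred in zip(y_true, y_pred) if true == positive]
    if not relevant:
        raise ValueError(f"no samples with true label {positive!r}")
    return sum(pred == positive for pred in relevant) / len(relevant)


# Toy labels only: 1 = toxic, 0 = non-toxic.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
print(recall(y_true, y_pred, positive=1))  # recall on toxic comments
print(recall(y_true, y_pred, positive=0))  # recall on non-toxic comments
```

Computing it once per class, as here, corresponds to the two cases named above: truly toxic and truly non-toxic comments.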

Code for the first and third approaches
