I have a small dataset for sentiment analysis. The classifier will be a simple KNN, but I want to get the word embeddings from the Bert model in the transformers library. Note that I have only just discovered this library and am still learning it.
So, by working through online examples, I am trying to understand the dimensions returned from the model.
Example:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokens = tokenizer.encode(["Hello, my dog is cute", "He is really nice"])
print(tokens)

tokens = tokenizer.encode("Hello, my dog is cute", "He is really nice")
print(tokens)

tokens = tokenizer.encode(["Hello, my dog is cute"])
print(tokens)

tokens = tokenizer.encode("Hello, my dog is cute")
print(tokens)
The output is as follows:
[101, 100, 100, 102]
[101, 7592, 1010, 2026, 3899, 2003, 10140, 102, 2002, 2003, 2428, 3835, 102]
[101, 100, 102]
[101, 7592, 1010, 2026, 3899, 2003, 10140, 102]
I can't seem to find documentation for encode(), and I don't know why it returns something different when the input is passed as a list. How does that work?
Also, is there a way to pass a token id and get back the actual word, so I can troubleshoot the above?
Thanks in advance
Answer:
You can call tokenizer.convert_ids_to_tokens() to get the actual tokens for the ids:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

tokens = []
tokens.append(tokenizer.encode(["Hello, my dog is cute", "He is really nice"]))
tokens.append(tokenizer.encode("Hello, my dog is cute", "He is really nice"))
tokens.append(tokenizer.encode(["Hello, my dog is cute"]))
tokens.append(tokenizer.encode("Hello, my dog is cute"))

for t in tokens:
    print(tokenizer.convert_ids_to_tokens(t))
Output:
['[CLS]', '[UNK]', '[UNK]', '[SEP]']
['[CLS]', 'hello', ',', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'is', 'really', 'nice', '[SEP]']
['[CLS]', '[UNK]', '[SEP]']
['[CLS]', 'hello', ',', 'my', 'dog', 'is', 'cute', '[SEP]']
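As a side note (not part of the original answer): tokenizer.decode() goes the other way and turns a list of ids back into a readable string, which can also help with this kind of troubleshooting. A minimal sketch, reusing the tokenizer from the snippet above:

ids = tokenizer.encode("Hello, my dog is cute")
# decode() joins the ids back into a string; special tokens are kept by default
print(tokenizer.decode(ids))                            # e.g. "[CLS] hello, my dog is cute [SEP]"
print(tokenizer.decode(ids, skip_special_tokens=True))  # e.g. "hello, my dog is cute"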
As you can see from the convert_ids_to_tokens output, each input was tokenized and special tokens were added according to your model (bert). The encode function did not handle your lists properly; whether that is a bug or intended behaviour depends on how you look at it, since there is a dedicated method for batching, batch_encode_plus:
tokenizer.batch_encode_plus(["Hello, my dog is cute", "He is really nice"], return_token_type_ids=False, return_attention_mask=False)
Output:
{'input_ids': [[101, 7592, 1010, 2026, 3899, 2003, 10140, 102], [101, 2002, 2003, 2428, 3835, 102]]}
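A related, optional step (not in the original answer): the two encoded sequences have different lengths, so before turning the batch into a tensor for the model you usually ask the tokenizer to pad. A minimal sketch, assuming a transformers version that supports padding=True and return_tensors:

batch = tokenizer.batch_encode_plus(
    ["Hello, my dog is cute", "He is really nice"],
    padding=True,          # pad the shorter sequence up to the longest one in the batch
    return_tensors="pt",   # return PyTorch tensors instead of Python lists
)
print(batch["input_ids"].shape)  # should be torch.Size([2, 8]) for these two sentences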
I'm not sure why the encode method is undocumented, but it may be that huggingface wants us to use the call method directly (i.e. call the tokenizer object itself):
tokens = []
tokens.append(tokenizer(["Hello, my dog is cute", "He is really nice"], return_token_type_ids=False, return_attention_mask=False))
tokens.append(tokenizer("Hello, my dog is cute", "He is really nice", return_token_type_ids=False, return_attention_mask=False))
tokens.append(tokenizer(["Hello, my dog is cute"], return_token_type_ids=False, return_attention_mask=False))
tokens.append(tokenizer("Hello, my dog is cute", return_token_type_ids=False, return_attention_mask=False))

print(tokens)
Output:
[{'input_ids': [[101, 7592, 1010, 2026, 3899, 2003, 10140, 102], [101, 2002, 2003, 2428, 3835, 102]]}, {'input_ids': [101, 7592, 1010, 2026, 3899, 2003, 10140, 102, 2002, 2003, 2428, 3835, 102]}, {'input_ids': [[101, 7592, 1010, 2026, 3899, 2003, 10140, 102]]}, {'input_ids': [101, 7592, 1010, 2026, 3899, 2003, 10140, 102]}]
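Finally, since the stated goal is word embeddings for a KNN classifier, here is a minimal, non-authoritative sketch of how the tokenizer output is typically fed into BertModel to get sentence vectors. The mean pooling over tokens is just one common choice, not something prescribed by the library, and the sketch assumes a recent transformers version where the model returns an output object with last_hidden_state:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

sentences = ["Hello, my dog is cute", "He is really nice"]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

# Shape (batch_size, seq_len, hidden_size) -> here (2, 8, 768)
token_embeddings = outputs.last_hidden_state

# Mask-aware mean pooling: one fixed-size vector per sentence, ready for KNN
mask = batch["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)  # torch.Size([2, 768])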