How to use BERT and Elmo embeddings with sklearn

I built a text classifier with sklearn's Tf-Idf, and now I would like to replace Tf-Idf with BERT and Elmo embeddings.

How can I do that?

I get the BERT embeddings with the following code:

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# init embedding
embedding = TransformerWordEmbeddings('bert-base-uncased')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding.embed(sentence)
And this is my current Tf-Idf based classifier:

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

column_trans = ColumnTransformer([
    ('tfidf', TfidfVectorizer(), 'text'),
    ('number_scaler', MinMaxScaler(), ['number'])
])

# Initialize data
data = [
    ['This process, however, afforded me no means of.', 20, 1],
    ['another long description', 21, 1],
    ['It never once occurred to me that the fumbling', 19, 0],
    ['How lovely is spring As we looked from Windsor', 18, 0]
]

# Create DataFrame
df = pd.DataFrame(data, columns=['text', 'number', 'target'])

X = column_trans.fit_transform(df)
X = X.toarray()
y = df.loc[:, "target"].values

# Perform classification
classifier = LogisticRegression(random_state=0)
classifier.fit(X, y)

Answer:

Sklearn offers the possibility to create custom data transformers (unrelated to the machine-learning "transformer" models).
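As a minimal sketch of what such a custom transformer looks like (the class name and the length-based "feature" below are my own toy illustration, not from the answer), all it needs is a `fit` that returns `self` and a `transform` that returns a 2-D array:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical toy transformer: maps each text to a single feature,
# its character length. Real embedding transformers have the same shape:
# fit() learns nothing, transform() returns one row of features per text.
class TextLengthTransformer(TransformerMixin, BaseEstimator):

    def fit(self, X, y=None):
        # Stateless: nothing to learn from the data
        return self

    def transform(self, X):
        # One row per input text, one column holding its length
        return np.array([[len(text)] for text in X])

lengths = TextLengthTransformer().fit_transform(['abc', 'hello'])
```

Because it inherits from `TransformerMixin` and `BaseEstimator`, it gets `fit_transform` for free and can be dropped into a `Pipeline` or `ColumnTransformer` like any built-in transformer.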

I implemented a custom sklearn data transformer that uses the flair library you are already using. Note that I used TransformerDocumentEmbeddings instead of TransformerWordEmbeddings. There is also a version that works with the transformers library directly.

I have added an SO question that discusses which transformer layer makes more sense to use, here.

I am not familiar with Elmo, though I found this, which uses tensorflow. You should be able to modify the code I shared to make Elmo work.

import torch
import numpy as np
from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings
from sklearn.base import BaseEstimator, TransformerMixin

class FlairTransformerEmbedding(TransformerMixin, BaseEstimator):

    def __init__(self, model_name='bert-base-uncased', batch_size=None, layers=None):
        # From https://lvngd.com/blog/spacy-word-vectors-as-features-in-scikit-learn/
        # For pickling reason you should not load models in __init__
        self.model_name = model_name
        self.model_kw_args = {'batch_size': batch_size, 'layers': layers}
        self.model_kw_args = {k: v for k, v in self.model_kw_args.items()
                              if v is not None}

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        model = TransformerDocumentEmbeddings(
                self.model_name, fine_tune=False,
                **self.model_kw_args)
        sentences = [Sentence(text) for text in X]
        embedded = model.embed(sentences)
        embedded = [e.get_embedding().reshape(1, -1) for e in embedded]
        return np.array(torch.cat(embedded).cpu())

And here is the version that works with the transformers library directly:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from transformers import AutoTokenizer, AutoModel
from more_itertools import chunked

class TransformerEmbedding(TransformerMixin, BaseEstimator):

    def __init__(self, model_name='bert-base-uncased', batch_size=1, layer=-1):
        # From https://lvngd.com/blog/spacy-word-vectors-as-features-in-scikit-learn/
        # For pickling reason you should not load models in __init__
        self.model_name = model_name
        self.layer = layer
        self.batch_size = batch_size

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        model = AutoModel.from_pretrained(self.model_name)
        res = []
        for batch in chunked(X, self.batch_size):
            encoded_input = tokenizer.batch_encode_plus(
                batch, return_tensors='pt', padding=True, truncation=True)
            output = model(**encoded_input)
            embed = output.last_hidden_state[:, self.layer].detach().numpy()
            res.append(embed)
        return np.concatenate(res)

In your case, replace your column transformer with this one:

column_trans = ColumnTransformer([
    ('embedding', FlairTransformerEmbedding(), 'text'),
    ('number_scaler', MinMaxScaler(), ['number'])
])
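To see the end-to-end wiring without downloading a BERT model, any transformer with the same fit/transform interface can stand in for the embedding. The sketch below uses a deterministic fake embedding (my own stand-in, not a real model) in place of `FlairTransformerEmbedding`; note that with dense embeddings the `ColumnTransformer` output is already a dense array, so the `.toarray()` call from the Tf-Idf version is no longer needed:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

# Toy stand-in with the same interface as FlairTransformerEmbedding:
# it "embeds" each text as `dim` pseudo-random features, seeded
# deterministically from the text's characters.
class FakeEmbedding(TransformerMixin, BaseEstimator):

    def __init__(self, dim=8):
        self.dim = dim

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        rows = []
        for text in X:
            seed = sum(ord(c) for c in text)  # stable per-text seed
            rng = np.random.default_rng(seed)
            rows.append(rng.normal(size=self.dim))
        return np.array(rows)

df = pd.DataFrame({
    'text': ['first document', 'second document'],
    'number': [20, 21],
    'target': [1, 0],
})

column_trans = ColumnTransformer([
    ('embedding', FakeEmbedding(), 'text'),
    ('number_scaler', MinMaxScaler(), ['number'])
])

# Dense (2, 9) array: 8 embedding dims + 1 scaled number column
X = column_trans.fit_transform(df)
y = df['target'].values

classifier = LogisticRegression(random_state=0)
classifier.fit(X, y)
```

Swapping `FakeEmbedding()` back for `FlairTransformerEmbedding()` (or `TransformerEmbedding()`) leaves the rest of the pipeline unchanged, which is the point of wrapping the embedding model in a sklearn transformer.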
