How can an embedding layer learned in a Keras ANN be used as an input feature for an XGBoost model?

I am trying to reduce the dimensionality of a categorical feature by extracting an embedding layer from a neural network and using it as an input feature in a separate XGBoost model.

The embedding layer has dimensions (number of unique categories + 1, chosen output size). How can it be concatenated with the continuous variables in the original training data, whose dimensions are (number of observations, number of features)?

Below is a reproducible example of regression with a neural network, in which a categorical feature is encoded as a learned embedding layer. The example is closely adapted from: http://machinelearningmechanic.com/keras/2018/03/09/keras-regression-with-categorical-variable-embeddings-md.html#Define-the-input-layers

At the end, I print the embedding layer and its shape. How can this layer be merged with the continuous features in the original training data (X_train_continuous)? If the number of rows equaled the number of categories, and we knew the order in which the categories are represented in the embedding layer, the embedding array could perhaps be joined to the training observations by category. But in fact the number of rows equals the number of categories + 1 (in the code: len(values) + 1).

# Imports and helper functions
import numpy as np
import pandas as pd
import keras
from keras.layers import Input, Embedding, Dense
from keras.models import Model
from keras.callbacks import Callback
import matplotlib.pyplot as plt

# Bayesian Methods for Hackers style sheet
plt.style.use('bmh')

np.random.seed(1234567890)


class PeriodicLogger(Callback):
    """
    A helper callback class that only prints the losses once in 'display' epochs
    """

    def __init__(self, display=100):
        self.display = display

    def on_train_begin(self, logs={}):
        self.epochs = 0

    def on_epoch_end(self, batch, logs={}):
        self.epochs += 1
        if self.epochs % self.display == 0:
            print("Epoch: %d - loss: %f - val_loss: %f" % (
                self.epochs, logs['loss'], logs['val_loss']))


periodic_logger_250 = PeriodicLogger(250)

# Define the mapping and a function that computes the house price for each
# example
per_meter_mapping = {
    'Mercaz': 500,
    'Old North': 350,
    'Florentine': 230
}

per_room_additional_price = {
    'Mercaz': 15. * 10 ** 4,
    'Old North': 8. * 10 ** 4,
    'Florentine': 5. * 10 ** 4
}


def house_price_func(row):
    """
    house_price_func is the function f(a,s,n).
    :param row: dict (contains the keys: ['area', 'size', 'n_rooms'])
    :return: float
    """
    area, size, n_rooms = row['area'], row['size'], row['n_rooms']
    return size * per_meter_mapping[area] + n_rooms * \
        per_room_additional_price[area]


# Create toy data
AREAS = ['Mercaz', 'Old North', 'Florentine']


def create_samples(n_samples):
    """
    Helper method that creates dataset DataFrames
    Note that the np.random.choice call only determines the number of rooms
    and the size of the house (the price, which we calculate later, is
    deterministic)
    :param n_samples: int (number of samples for each area (suburb))
    :return: pd.DataFrame
    """
    samples = []
    for n_rooms in np.random.choice(range(1, 6), n_samples):
        samples += [(area, int(np.random.normal(25, 5)), n_rooms)
                    for area in AREAS]
    return pd.DataFrame(samples, columns=['area', 'size', 'n_rooms'])


# Create the train and validation sets
train = create_samples(n_samples=1000)
val = create_samples(n_samples=100)

# Calculate the prices for each set
train['price'] = train.apply(house_price_func, axis=1)
val['price'] = val.apply(house_price_func, axis=1)

# Define the features and the y vectors
continuous_cols = ['size', 'n_rooms']
categorical_cols = ['area']
y_col = ['price']

X_train_continuous = train[continuous_cols]
X_train_categorical = train[categorical_cols]
y_train = train[y_col]

X_val_continuous = val[continuous_cols]
X_val_categorical = val[categorical_cols]
y_val = val[y_col]

# Normalization
# Normalize both train and validation sets to have zero mean and a std. of 1,
# using the train set mean and std.
# This gives each feature an equal initial importance and speeds up training.
train_mean = X_train_continuous.mean(axis=0)
train_std = X_train_continuous.std(axis=0)

X_train_continuous = X_train_continuous - train_mean
X_train_continuous /= train_std

X_val_continuous = X_val_continuous - train_mean
X_val_continuous /= train_std


# Build a model using a categorical variable
# First let's define a helper class for the categorical variable
class EmbeddingMapping():
    """
    Helper class for handling categorical variables.
    An instance of this class should be defined for each categorical variable
    we want to use.
    """

    def __init__(self, series):
        # Get a list of unique values
        values = series.unique().tolist()

        # Set a dictionary mapping from values to integer value
        # In our example this will be {'Mercaz': 1, 'Old North': 2,
        # 'Florentine': 3}
        self.embedding_dict = {value: int_value + 1 for int_value, value in
                               enumerate(values)}

        # num_values will be used as the input_dim when defining the
        # embedding layer. It will also be returned for unseen values.
        self.num_values = len(values) + 1

    def get_mapping(self, value):
        # If the value was seen in the training set, return its integer mapping
        if value in self.embedding_dict:
            return self.embedding_dict[value]
        # Else, return the same integer for all unseen values
        else:
            return self.num_values


# Create an embedding column for the train/validation sets
area_mapping = EmbeddingMapping(X_train_categorical['area'])

X_train_categorical = \
    X_train_categorical.assign(area_mapping=X_train_categorical['area']
                               .apply(area_mapping.get_mapping))
X_val_categorical = \
    X_val_categorical.assign(area_mapping=X_val_categorical['area']
                             .apply(area_mapping.get_mapping))

# Define the input layers
# Define the embedding input
area_input = Input(shape=(1,), dtype='int32')

# Decide to what vector size we want to map our 'area' variable.
# We'll use 2 here because we only have three areas.
embeddings_output = 2

# Let's define the embedding layer and flatten it
area_embeddings = Embedding(output_dim=embeddings_output,
                            input_dim=area_mapping.num_values,
                            input_length=1,
                            name="embedding_layer")(area_input)
area_embeddings = keras.layers.Reshape((embeddings_output,))(area_embeddings)

# Define the continuous variables input (just like before)
continuous_input = Input(shape=(X_train_continuous.shape[1],))

# Concatenate the continuous and embedding inputs
all_input = keras.layers.concatenate([continuous_input, area_embeddings])

# To merge them together we use the Keras functional API.
# Define a simple model with 2 hidden layers, with 25 neurons each.
units = 25
dense1 = Dense(units=units, activation='relu')(all_input)
dense2 = Dense(units, activation='relu')(dense1)
predictions = Dense(1)(dense2)

# Note: we use the input object 'area_input', not 'area_embeddings'
model = Model(inputs=[continuous_input, area_input], outputs=predictions)

# Let's train the model
epochs = 100  # to train properly, use 10000
model.compile(loss='mse',
              optimizer=keras.optimizers.Adam(lr=.8, beta_1=0.9,
                                              beta_2=0.999, decay=1e-03,
                                              amsgrad=True))

# Note: the continuous and categorical columns are passed in the same order
# as defined in all_input
history = model.fit([X_train_continuous, X_train_categorical['area_mapping']],
                    y_train, epochs=epochs, batch_size=128,
                    callbacks=[periodic_logger_250], verbose=0,
                    validation_data=([X_val_continuous,
                                      X_val_categorical['area_mapping']],
                                     y_val))

# Observe the embedding layer's weights
embeddings_output = model.get_layer('embedding_layer').get_weights()[0]
print(f'Embedding layer:\n{embeddings_output}')
print(f'Embedding layer shape: {embeddings_output.shape}')

Answer:

First, a point of terminology in this post: an "embedding" is the representation of a particular input sample. It is the vector output by the layer. The "weights" are the matrix that is stored inside the layer and trained.
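To make the distinction concrete, here is a minimal sketch using the names from the question's code (model, area_mapping, X_train_categorical), run after model.fit: the weight matrix has one row per integer index, and the embedding of an observation is simply the row selected by that observation's integer code. The extra row the question asks about is index 0, which this mapping never assigns to a seen category.

# The trained weight matrix: shape (area_mapping.num_values, 2).
# Row 0 is unused here, because seen categories are mapped to 1..3.
weights = model.get_layer('embedding_layer').get_weights()[0]

# Per-observation embeddings: select rows by each observation's integer code.
codes = X_train_categorical['area_mapping'].values   # shape: (n_observations,)
observation_embeddings = weights[codes]              # shape: (n_observations, 2)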

In Keras, the Model class is a subclass of Layer. You can use any model as a layer inside a larger model.
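As a minimal, self-contained sketch of this Model-as-layer idea (all names and sizes here are hypothetical, not taken from the question's code):

from keras.layers import Input, Embedding, Reshape, Dense
from keras.models import Model

# A model containing only the embedding step
code_input = Input(shape=(1,), dtype='int32')
code_vector = Reshape((2,))(Embedding(input_dim=4, output_dim=2)(code_input))
sub_model = Model(code_input, code_vector)

# Because Model subclasses Layer, sub_model can be called like a layer
outer_input = Input(shape=(1,), dtype='int32')
hidden = Dense(8, activation='relu')(sub_model(outer_input))
outer_model = Model(outer_input, Dense(1)(hidden))

# Training outer_model updates the shared embedding weights, so
# sub_model.predict() afterwards returns the learned embeddings.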

You can build a model containing only the embedding layer, then use it as a layer when constructing the rest of the model. After training, you can call .predict() on that "sub-model". You can also save the sub-model to a JSON file and reload it later.
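Applied to the question's code, a minimal sketch might look like this (assuming model.fit has already run and that xgboost is installed; the XGBRegressor defaults are placeholders, not a tuned configuration):

import xgboost as xgb

# Sub-model mapping an integer area code to its learned embedding.
# It reuses the already-trained layers between area_input and area_embeddings.
embedding_model = Model(inputs=area_input, outputs=area_embeddings)

# Embeddings for every training observation: shape (n_observations, 2)
train_embeddings = embedding_model.predict(
    X_train_categorical['area_mapping'].values)

# Concatenate with the continuous features: shape (n_observations, 4)
X_train_xgb = np.hstack([X_train_continuous.values, train_embeddings])

# Fit a separate XGBoost model on the combined features
xgb_model = xgb.XGBRegressor()
xgb_model.fit(X_train_xgb, y_train.values.ravel())

# to_json() stores only the architecture; use save_weights() for the weights
with open('embedding_model.json', 'w') as f:
    f.write(embedding_model.to_json())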

This is the standard technique for building a model that emits its internal embeddings.
