I am trying to reduce the dimensionality of a categorical feature by extracting an embedding layer from a neural network and using it as an input feature in a separate XGBoost model.
The embedding layer has dimensions (number of unique categories + 1, chosen output size). How can it be concatenated with the continuous variables in the original training data, which has dimensions (number of observations, number of features)?
Below is a reproducible example of regression with a neural network, in which a categorical feature is encoded as a learned embedding layer. The example is closely adapted from: http://machinelearningmechanic.com/keras/2018/03/09/keras-regression-with-categorical-variable-embeddings-md.html#Define-the-input-layers
At the end, I print the embedding layer and its shape. How can this layer be merged with the continuous features in the original training data (X_train_continuous)? If the number of rows equaled the number of categories, and we knew the order in which categories are represented in the embedding layer, the embedding array could perhaps be joined to the training observations by category. But the number of rows is actually the number of categories + 1 (in the code: len(values) + 1).
# Imports and helper functions
import numpy as np
import pandas as pd
import keras
from keras.layers import Input, Embedding, Dense
from keras.models import Model
from keras.callbacks import Callback
import matplotlib.pyplot as plt

# Bayesian Methods for Hackers style sheet
plt.style.use('bmh')

np.random.seed(1234567890)


class PeriodicLogger(Callback):
    """
    A helper callback class that only prints the losses once every 'display'
    epochs
    """

    def __init__(self, display=100):
        self.display = display

    def on_train_begin(self, logs={}):
        self.epochs = 0

    def on_epoch_end(self, epoch, logs={}):
        self.epochs += 1
        if self.epochs % self.display == 0:
            print("Epoch: %d - loss: %f - val_loss: %f" % (
                self.epochs, logs['loss'], logs['val_loss']))


periodic_logger_250 = PeriodicLogger(250)

# Define the mapping and a function that computes the house price for each
# example
per_meter_mapping = {
    'Mercaz': 500,
    'Old North': 350,
    'Florentine': 230
}

per_room_additional_price = {
    'Mercaz': 15. * 10 ** 4,
    'Old North': 8. * 10 ** 4,
    'Florentine': 5. * 10 ** 4
}


def house_price_func(row):
    """
    house_price_func is the function f(a, s, n).

    :param row: dict (contains the keys: ['area', 'size', 'n_rooms'])
    :return: float
    """
    area, size, n_rooms = row['area'], row['size'], row['n_rooms']
    return size * per_meter_mapping[area] + \
        n_rooms * per_room_additional_price[area]


# Create toy data
AREAS = ['Mercaz', 'Old North', 'Florentine']


def create_samples(n_samples):
    """
    Helper method that creates dataset DataFrames
    Note that the np.random.choice call only determines the number of rooms
    and the size of the house (the price, which we calculate later, is
    deterministic)

    :param n_samples: int (number of samples for each area (suburb))
    :return: pd.DataFrame
    """
    samples = []
    for n_rooms in np.random.choice(range(1, 6), n_samples):
        samples += [(area, int(np.random.normal(25, 5)), n_rooms)
                    for area in AREAS]
    return pd.DataFrame(samples, columns=['area', 'size', 'n_rooms'])


# Create the train and validation sets
train = create_samples(n_samples=1000)
val = create_samples(n_samples=100)

# Calculate the prices for each set
train['price'] = train.apply(house_price_func, axis=1)
val['price'] = val.apply(house_price_func, axis=1)

# Define the features and the y vectors
continuous_cols = ['size', 'n_rooms']
categorical_cols = ['area']
y_col = ['price']

X_train_continuous = train[continuous_cols]
X_train_categorical = train[categorical_cols]
y_train = train[y_col]

X_val_continuous = val[continuous_cols]
X_val_categorical = val[categorical_cols]
y_val = val[y_col]

# Normalization
# Normalize both train and validation sets to zero mean and unit std using
# the train set mean and std. This gives each feature an equal initial
# importance and speeds up training.
train_mean = X_train_continuous.mean(axis=0)
train_std = X_train_continuous.std(axis=0)

X_train_continuous = X_train_continuous - train_mean
X_train_continuous /= train_std
X_val_continuous = X_val_continuous - train_mean
X_val_continuous /= train_std


# Build a model using a categorical variable
# First let's define a helper class for the categorical variable
class EmbeddingMapping():
    """
    Helper class for handling categorical variables.
    An instance of this class should be defined for each categorical variable
    we want to use.
    """

    def __init__(self, series):
        # Get a list of unique values
        values = series.unique().tolist()

        # Set a dictionary mapping from values to integers
        # In our example this will be {'Mercaz': 1, 'Old North': 2,
        # 'Florentine': 3}
        self.embedding_dict = {value: int_value + 1
                               for int_value, value in enumerate(values)}

        # num_values will be used as the input_dim when defining the
        # embedding layer. It will also be returned for unseen values.
        self.num_values = len(values) + 1

    def get_mapping(self, value):
        # If the value was seen in the training set, return its integer mapping
        if value in self.embedding_dict:
            return self.embedding_dict[value]
        # Else, return the same integer for all unseen values
        else:
            return self.num_values


# Create an embedding column for the train/validation sets
area_mapping = EmbeddingMapping(X_train_categorical['area'])
X_train_categorical = \
    X_train_categorical.assign(area_mapping=X_train_categorical['area']
                               .apply(area_mapping.get_mapping))
X_val_categorical = \
    X_val_categorical.assign(area_mapping=X_val_categorical['area']
                             .apply(area_mapping.get_mapping))

# Define the input layers
# Define the embedding input
area_input = Input(shape=(1,), dtype='int32')

# Decide what vector size we want to map our 'area' variable to.
# I'll use 2 here because we only have three areas.
embeddings_output = 2

# Define the embedding layer and flatten it
area_embeddings = Embedding(output_dim=embeddings_output,
                            input_dim=area_mapping.num_values,
                            input_length=1,
                            name="embedding_layer")(area_input)
area_embeddings = keras.layers.Reshape((embeddings_output,))(area_embeddings)

# Define the continuous variables input
continuous_input = Input(shape=(X_train_continuous.shape[1],))

# Concatenate the continuous and embedding inputs using the functional API
all_input = keras.layers.concatenate([continuous_input, area_embeddings])

# Define a simple model with 2 hidden layers of 25 neurons each
units = 25
dense1 = Dense(units=units, activation='relu')(all_input)
dense2 = Dense(units, activation='relu')(dense1)
predictions = Dense(1)(dense2)

# Note: the model takes the input object 'area_input', not 'area_embeddings'
model = Model(inputs=[continuous_input, area_input], outputs=predictions)

# Let's train the model
epochs = 100  # to train properly, use 10000
model.compile(loss='mse',
              optimizer=keras.optimizers.Adam(lr=.8, beta_1=0.9,
                                              beta_2=0.999, decay=1e-03,
                                              amsgrad=True))

# Note: the continuous and categorical columns are passed in the same order
# as the inputs defined in the Model above
history = model.fit([X_train_continuous, X_train_categorical['area_mapping']],
                    y_train, epochs=epochs, batch_size=128,
                    callbacks=[periodic_logger_250], verbose=0,
                    validation_data=([X_val_continuous,
                                      X_val_categorical['area_mapping']],
                                     y_val))

# Observe the embedding layer's weight matrix
embedding_weights = model.get_layer('embedding_layer').get_weights()[0]
print(f'Embedding layer:\n{embedding_weights}')
print(f'Embedding layer shape: {embedding_weights.shape}')
Answer:
First, a point of terminology about this post: an "embedding" is the representation of a particular input sample; it is the vector output by the layer. The "weights" are the matrix that is stored and trained inside the layer.
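In terms of the code in the question, here is a minimal sketch of the distinction, reusing the names defined above:

# The weights are the matrix stored inside the layer: one row per integer index
weights = model.get_layer('embedding_layer').get_weights()[0]
# weights.shape == (area_mapping.num_values, embeddings_output)

# An embedding is the representation of one input sample: the weight row
# selected by that sample's integer category index
first_index = X_train_categorical['area_mapping'].iloc[0]
first_embedding = weights[first_index]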
In Keras, the Model class is a subclass of Layer, so any model can be used as a layer inside a larger model.
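For example, here is a minimal, self-contained sketch (independent of the question's code) showing a Model called on a tensor exactly as if it were a Layer:

from keras.layers import Input, Dense
from keras.models import Model

# An inner model with its own input and output...
inner_input = Input(shape=(4,))
inner_output = Dense(3, activation='relu')(inner_input)
inner_model = Model(inner_input, inner_output)

# ...used as a layer inside a larger model
outer_input = Input(shape=(4,))
hidden = inner_model(outer_input)  # the Model behaves as a Layer here
outer_output = Dense(1)(hidden)
outer_model = Model(outer_input, outer_output)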
You can create a model containing only the embedding layer and use it as a layer when building the rest of your model. After training, you can call .predict() on that "sub-model". You can also save the sub-model to a JSON file and reload it later.
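With the question's code, the functional graph already exists, so the sub-model can be defined directly over it. A sketch reusing the names from the example above (the file names are placeholders):

# A sub-model that maps an integer category index to its learned embedding.
# It shares the trained weights with 'model' above.
embedding_submodel = Model(inputs=area_input, outputs=area_embeddings)

# One embedding row per training observation, in the same row order as
# X_train_continuous
train_embeddings = embedding_submodel.predict(
    X_train_categorical['area_mapping'])
# train_embeddings.shape == (len(X_train_categorical), embeddings_output)

# to_json() stores only the architecture; the weights go in a separate file
with open('embedding_submodel.json', 'w') as f:
    f.write(embedding_submodel.to_json())
embedding_submodel.save_weights('embedding_submodel_weights.h5')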
This is the standard technique for building a model that emits its internal embeddings.
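It also resolves the (number of categories + 1) row-count issue raised in the question: predicting on the per-observation index column yields one embedding per observation, so the result aligns row for row with X_train_continuous. Here is a sketch of the hand-off to XGBoost, assuming the xgboost package is installed and leaving XGBRegressor at its defaults:

import numpy as np
import xgboost as xgb

# Row-aligned concatenation:
# shape (n_observations, n_continuous_features + embeddings_output)
X_train_xgb = np.hstack([X_train_continuous.values, train_embeddings])

booster = xgb.XGBRegressor()
booster.fit(X_train_xgb, y_train.values.ravel())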