How to build a language model using LSTM that assigns a probability of occurrence to a given sentence

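Currently, I am using a trigram model to do this. It assigns a probability of occurrence to a given sentence, but it is limited to a context of only two words. An LSTM can do more. So how do I build an LSTM model that assigns a probability of occurrence to a given sentence?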

Answer:

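I just coded a very simple example showing how one might compute the probability of occurrence of a sentence with an LSTM model. The full code can be found here.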

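Suppose we want to predict the probability of occurrence of a sentence for the following dataset (this rhyme was published in Mother Goose's Melody in London around 1765):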

# Data
data = ["Two little dicky birds",
        "Sat on a wall,",
        "One called Peter,",
        "One called Paul.",
        "Fly away, Peter,",
        "Fly away, Paul!",
        "Come back, Peter,",
        "Come back, Paul."]

First, let's use keras.preprocessing.text.Tokenizer to create a vocabulary and tokenize the sentences:

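from keras.preprocessing.text import Tokenizer

# Preprocess data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data)                 # build the vocabulary
vocab = tokenizer.word_index                 # word -> integer id (ids start at 1)
seqs = tokenizer.texts_to_sequences(data)    # each sentence as a list of ids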

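Our model takes a sequence of words as input (the context) and outputs the conditional probability distribution over every word in the vocabulary given that context. To this end, we prepare the training data by padding the sequences and sliding windows over them: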

import numpy as np
from keras.preprocessing.sequence import pad_sequences

def prepare_sentence(seq, maxlen):
    # Pads seq and slides windows
    x = []
    y = []
    for i, w in enumerate(seq):
        x_padded = pad_sequences([seq[:i]],
                                 maxlen=maxlen - 1,
                                 padding='pre')[0]  # Pads before each sequence
        x.append(x_padded)
        y.append(w)
    return x, y

# Pad sequences and slide windows
maxlen = max([len(seq) for seq in seqs])
x = []
y = []
for seq in seqs:
    x_windows, y_windows = prepare_sentence(seq, maxlen)
    x += x_windows
    y += y_windows
x = np.array(x)
y = np.array(y) - 1  # The word <PAD> does not constitute a class
y = np.eye(len(vocab))[y]  # One hot encoding

I decided to slide windows separately for each verse, but this could be done differently.
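To make the windowing step concrete, here is a minimal sketch that reproduces the pad-and-slide idea with plain NumPy instead of Keras. The function name `windows` and the token ids are made up for illustration, and the sketch assumes the sentence is no longer than `maxlen` (so no truncation is needed).

```python
import numpy as np

def windows(seq, maxlen):
    """Return (contexts, targets) for one tokenized sentence.

    Each context is the prefix seq[:i], left-padded with 0 (the <PAD> id)
    to length maxlen - 1; the matching target is the word seq[i].
    Assumes len(seq) <= maxlen.
    """
    contexts, targets = [], []
    for i, w in enumerate(seq):
        prefix = list(seq[:i])
        padded = [0] * (maxlen - 1 - len(prefix)) + prefix
        contexts.append(padded)
        targets.append(w)
    return np.array(contexts), np.array(targets)

# Hypothetical token ids for a three-word sentence
contexts, targets = windows([3, 4, 5], maxlen=4)
for c, t in zip(contexts, targets):
    print(c, "->", t)
```

Each row pairs a left-padded history with the word the model should predict next, so a sentence of n words yields n training examples.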

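Next, we define and train a simple LSTM model with Keras. The model consists of an embedding layer, an LSTM layer, and a dense layer with a softmax activation (which uses the output at the last timestep of the LSTM to produce the probability of each word in the vocabulary given the context):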

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# Define model
model = Sequential()
model.add(Embedding(input_dim=len(vocab) + 1,  # vocabulary size, plus one
                                               # extra element for the <PAD> word
                    output_dim=5,  # size of embeddings
                    input_length=maxlen - 1))  # length of the padded sequences
model.add(LSTM(10))
model.add(Dense(len(vocab), activation='softmax'))
model.compile('rmsprop', 'categorical_crossentropy')

# Train network
model.fit(x, y, epochs=1000)

The joint probability P(w_1, ..., w_n) of occurrence of a sentence w_1 ... w_n can be computed using the chain rule of conditional probability:

P(w_1, ..., w_n)=P(w_1)*P(w_2|w_1)*...*P(w_n|w_{n-1}, ..., w_1)

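where each of these conditional probabilities is given by the LSTM model. Note that they can be very small, so it is sensible to work in log space to avoid numerical instability. Putting it all together: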

# Compute probability of occurrence of a sentence
sentence = "One called Peter,"
tok = tokenizer.texts_to_sequences([sentence])[0]
x_test, y_test = prepare_sentence(tok, maxlen)
x_test = np.array(x_test)
y_test = np.array(y_test) - 1  # The word <PAD> does not constitute a class
p_pred = model.predict(x_test)  # array of conditional probabilities
vocab_inv = {v: k for k, v in vocab.items()}

# Compute product
# Efficient version: np.exp(np.sum(np.log(np.diag(p_pred[:, y_test]))))
log_p_sentence = 0
for i, prob in enumerate(p_pred):
    word = vocab_inv[y_test[i] + 1]  # Index 0 from vocab is reserved for <PAD>
    history = ' '.join([vocab_inv[w] for w in x_test[i, :] if w != 0])
    prob_word = prob[y_test[i]]
    log_p_sentence += np.log(prob_word)
    print('P(w={}|h={})={}'.format(word, history, prob_word))
print('Prob. sentence: {}'.format(np.exp(log_p_sentence)))

NOTE: This is a very small toy dataset, so the model is probably overfitting.
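As a small illustration of why log space helps, the sketch below uses a made-up `p_pred` matrix (not real model output) to show the vectorized selection of each word's conditional probability, and then contrasts a naive product of many small probabilities with the equivalent sum of logs. All numbers here are hypothetical.

```python
import numpy as np

# Hypothetical model output: 3 timesteps over a 4-word vocabulary
p_pred = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.2, 0.5, 0.2, 0.1],
                   [0.1, 0.1, 0.2, 0.6]])
y_test = np.array([0, 1, 3])  # class index of the true word at each step

# Select P(w_i | history_i) from row i without building a full matrix
word_probs = p_pred[np.arange(len(y_test)), y_test]
log_p_sentence = np.sum(np.log(word_probs))
p_sentence = np.exp(log_p_sentence)
print(word_probs, p_sentence)

# With many small factors the direct product underflows to zero,
# while the sum of logs stays finite.
many = np.full(400, 0.1)
naive = np.prod(many)
log_sum = np.sum(np.log(many))
print(naive, log_sum)
```

The row-wise indexing picks the same entries as `np.diag(p_pred[:, y_test])` but avoids building the intermediate n-by-n matrix. For long sentences it may be preferable to report the log-probability itself rather than exponentiating it.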

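Update 29 October 2022: For larger datasets, you may run out of memory if you process the whole dataset at once. In that case, I suggest using generators to train your model. Please see this gist for a modified version that uses a data generator.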
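The gist itself is not reproduced here, so the following is only a rough sketch of what a data generator for this setup could look like, not the author's version. The function name `window_batches`, its parameters, and the example ids are assumptions. It builds the padded windows and one-hot targets one batch at a time instead of materializing the whole dataset, and assumes every sentence is no longer than `maxlen`.

```python
import numpy as np

def window_batches(seqs, maxlen, vocab_size, batch_size=32):
    """Lazily yield (x, y) batches of padded contexts and one-hot targets.

    seqs: tokenized sentences with word ids starting at 1 (0 is <PAD>).
    Assumes len(seq) <= maxlen for every sentence.
    """
    xs, ys = [], []
    for seq in seqs:
        for i, w in enumerate(seq):
            prefix = list(seq[:i])
            xs.append([0] * (maxlen - 1 - len(prefix)) + prefix)
            ys.append(w - 1)  # shift ids because <PAD> is not a class
            if len(xs) == batch_size:
                yield np.array(xs), np.eye(vocab_size)[ys]
                xs, ys = [], []
    if xs:  # final, possibly smaller batch
        yield np.array(xs), np.eye(vocab_size)[ys]

# Hypothetical tiny corpus: two sentences over a 3-word vocabulary
example_seqs = [[1, 2, 3], [2, 1]]
batches = list(window_batches(example_seqs, maxlen=3, vocab_size=3, batch_size=2))
for xb, yb in batches:
    print(xb.shape, yb.shape)
```

This generator makes a single pass over the data. To train for several epochs with `model.fit`, it would likely need to be wrapped so that it loops indefinitely and combined with `steps_per_epoch`; check the Keras documentation for the version in use.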
