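Creating a model in Keras with one-hot encoding

I am working on a sentence classification problem and trying to solve it with Keras. The vocabulary contains 36 unique words.

In this case the full vocabulary is [W1, W2, W3, ..., W36].

So if I have a sentence whose words are [W1 W2 W6 W7 W9] and I encode it, I get a numpy array like the following: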

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]

Its shape is (5, 36).
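For reference, an encoding of this kind can be sketched with numpy roughly as follows. The names `vocab`, `word_to_index` and `one_hot_sentence` are made up for illustration, and the column order (W1 in column 0, W36 in column 35) is an assumption, so the rows will not necessarily match the array printed above.

```python
import numpy as np

# Hypothetical vocabulary W1..W36, mapped to column indices 0..35 (assumed order).
vocab = [f"W{i}" for i in range(1, 37)]
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot_sentence(words):
    """Return an array of shape (len(words), len(vocab)) with a single 1 per row."""
    indices = [word_to_index[w] for w in words]
    # Row i of the identity matrix is the one-hot vector for index i.
    return np.eye(len(vocab), dtype=int)[indices]


encoded = one_hot_sentence(["W1", "W2", "W6", "W7", "W9"])
print(encoded.shape)  # (5, 36)
```

Each sentence yields a different number of rows, which is why the arrays cannot be stacked into one batch without padding.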

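This is where I get stuck. I have generated 20,000 numpy arrays of varying shape (N, 36), where N is the number of words in the sentence. So I have 20,000 sentences for training and 100 for testing, and every sentence is labeled with a one-hot encoding of shape (1, 36).

I have x_train, x_test, y_train and y_test.

x_test and y_test have dimensions (1, 36).

Could someone guide me on what to do next?

I have written some code, shown below: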

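from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

model = Sequential()
model.add(Dense(512, input_shape=(??????)))  # this is the part I do not know how to fill in
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])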

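Any help would be greatly appreciated.

Update and response to @putonspectacles

Thank you very much for taking the time and effort to write such a detailed answer. I tried your code and made a few small changes that I believe were needed to get it to run. Please see the modified version below: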

import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, LSTM

num_classes = 5
max_words = 20
sentences = ["The cat is in the house",
             "The green boy",
             "computer programs are not alive while the children are"]
labels = np.random.randint(0, num_classes, 3)
y = to_categorical(labels, num_classes=num_classes)
words = set(w for sent in sentences for w in sent.split())
word_map = {w: i + 1 for (i, w) in enumerate(words)}
# Changed the line below: the inner loop now iterates over sent.split() instead of sent
sent_ints = [[word_map[w] for w in sent.split()] for sent in sentences]
vocab_size = len(words)
print(vocab_size)
# Changed the line below: the outer loop now iterates over sent_ints instead of sentences
X = np.array([to_categorical(pad_sequences((sent,), max_words), vocab_size + 1)
              for sent in sent_ints])
print(X)
print(y)
model = Sequential()
model.add(Dense(512, input_shape=(max_words, vocab_size + 1)))
model.add(LSTM(128))
model.add(Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(X, y)

Without these changes the code does not run. When I run the code above, I get the correct embeddings, as shown below:

[[[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
   [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
   [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
   [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
   [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
   [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
   [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
   [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
   [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
   [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
   [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
   [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
   [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
   [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
   [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
   [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
   [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
   [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
   [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
   [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]

But I get the error "Error when checking input: expected dense_44_input to have 3 dimensions, but got array with shape (3, 1, 20, 16)".

When I change the input shape to model.add(Dense(512, input_shape=(None, max_words, vocab_size + 1)))

I get the error "Input 0 is incompatible with layer lstm_27: expected ndim=3, found ndim=4".

I am struggling to resolve this. It would be great if you could give me some guidance.

I have accepted this answer because it addresses the goal of embedding the words. Thanks again.

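Answer:

Good, you have cleaned up the question. You want to classify sentences. I assume you are saying you want to do better than a bag-of-words encoding: you want the order of the words to matter.

So we will pick a new model, an RNN (the LSTM variant). This model effectively sums over the importance of each word (in order) and so builds the sentence representation best suited to the task.

But we need to handle preprocessing a little differently. For efficiency (so that we can process sentences in batches rather than one at a time), we want every sentence to have the same number of words. So we pick a maximum word count, say 20, pad shorter sentences up to that length, and truncate sentences longer than 20 words.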
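The padding and truncation step can be sketched in plain Python. `pad_or_truncate` is a hypothetical helper written for illustration; it assumes padding and truncation both happen at the front of the sequence, which to my knowledge is the default behaviour of Keras `pad_sequences`, and it reserves 0 as the padding id.

```python
def pad_or_truncate(token_ids, max_words, pad_id=0):
    """Return exactly max_words ids: left-pad short inputs, keep the tail of long ones."""
    if len(token_ids) >= max_words:
        # Too long: drop ids from the front so the last max_words remain.
        return token_ids[len(token_ids) - max_words:]
    # Too short: fill the front with the padding id.
    return [pad_id] * (max_words - len(token_ids)) + token_ids


print(pad_or_truncate([1, 3, 5], 6))            # [0, 0, 0, 1, 3, 5]
print(pad_or_truncate(list(range(1, 9)), 6))    # [3, 4, 5, 6, 7, 8]
```

Because every sentence now has the same length, a batch of them can be stored in a single 2-D integer array.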

Keras will help us with this. We will encode each word as an integer.

import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, Dense, LSTM

num_classes = 5
max_words = 20
sentences = ["The cat is in the house",
             "The green boy",
             "computer programs are not alive while the children are"]
# Random labels, only so that the example has something to fit.
labels = np.random.randint(0, num_classes, 3)
y = to_categorical(labels, num_classes=num_classes)
words = set(w for sent in sentences for w in sent.split())
# Word ids start at 1; 0 is reserved for padding.
word_map = {w: i + 1 for (i, w) in enumerate(words)}
# Bug: `for w in sent` iterates over characters, not words.
# The edit at the end of this answer gives the corrected line.
sent_ints = [[word_map[w] for w in sent] for sent in sentences]
vocab_size = len(words)

So "the green boy" might now be [1, 3, 5]. Next we pad and one-hot encode:

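# Pad to the maximum word count and encode with len(words) + 1 classes;
# the + 1 is because 0 is reserved as the padding token.
# All sentences are padded in one call so that X is 3-D. Padding one sentence
# at a time, as in pad_sequences((sent,), max_words), keeps an extra axis of
# length 1 per sentence and yields shape (3, 1, 20, 16) instead.
X = to_categorical(pad_sequences(sent_ints, maxlen=max_words),
                   num_classes=vocab_size + 1)
print(X.shape)  # (3, 20, 16)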
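The LSTM expects input of shape (samples, timesteps, features), here (3, 20, 16). As a quick shape check, the numpy-only sketch below uses zero arrays as stand-ins for the real one-hot data (an assumption made to keep it free of Keras). It shows how per-sentence arrays that each carry a leading axis of length 1 stack into a 4-D array, and how dropping that axis gives the 3-D shape.

```python
import numpy as np

max_words, vocab_size = 20, 15

# Stand-ins for one-hot data: a sentence padded on its own keeps a leading
# axis of length 1, i.e. shape (1, max_words, vocab_size + 1).
per_sentence = [np.zeros((1, max_words, vocab_size + 1)) for _ in range(3)]

stacked = np.array(per_sentence)
print(stacked.shape)  # (3, 1, 20, 16)

# Dropping the singleton axis gives (samples, timesteps, features).
X_3d = np.squeeze(stacked, axis=1)
print(X_3d.shape)  # (3, 20, 16)
```

A 4-D array like `stacked` is what produces errors such as "expected ndim=3, found ndim=4"; adding `None` to `input_shape` does not help, because the data itself has one axis too many.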

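Now for the model: we add a Dense layer that turns the one-hot encoded words into dense vectors. Then we use an LSTM to turn the word vectors of a sentence into a single dense sentence vector. Finally, a softmax activation produces a probability distribution over the classes.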

model = Sequential()
model.add(Dense(512, input_shape=(max_words, vocab_size + 1)))
model.add(LSTM(128))
model.add(Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

This should compile. You can then go on to training.

model.fit(X,y)
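To see why a Dense layer on one-hot input behaves like a table of word vectors, here is a small numpy sketch. The sizes, the random matrix `W` (standing in for the Dense kernel) and the sample `word_ids` are all made up for illustration, and the bias term that a Keras Dense layer adds by default is ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 16, 4                  # small illustrative sizes
W = rng.normal(size=(vocab_size, embed_dim))   # stand-in for the Dense kernel

word_ids = np.array([0, 0, 3, 15, 4])          # a padded sentence as integer ids
one_hot = np.eye(vocab_size)[word_ids]         # shape (5, 16)

dense_out = one_hot @ W    # what a linear, bias-free Dense layer computes
lookup = W[word_ids]       # selecting rows of W directly by word id

print(dense_out.shape)                  # (5, 4)
print(np.allclose(dense_out, lookup))   # True
```

Multiplying a one-hot row by `W` just selects one row of `W`, which is essentially what the Keras `Embedding` layer does with integer input, without materializing the one-hot array.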

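Edit:

This line:

# We need to split the sentence into words; right now it reads each letter.
# Note the sent.split() in the corrected version.
sent_ints = [[word_map[w] for w in sent] for sent in sentences]

should be changed to: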

sent_ints = [[word_map[w] for w in sent.split()] for sent in sentences]
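The difference is easy to check in plain Python. In this sketch, `sorted()` is added only to make the word numbering reproducible (a set has no fixed order), and the names `by_letter` and `by_word` are illustrative.

```python
sentences = ["The cat is in the house", "The green boy"]

# sorted() makes the numbering reproducible; ids start at 1, 0 is for padding.
words = sorted(set(w for sent in sentences for w in sent.split()))
word_map = {w: i + 1 for i, w in enumerate(words)}

by_letter = list(sentences[1])     # iterating a string yields single characters
by_word = sentences[1].split()     # ['The', 'green', 'boy']

try:
    [word_map[w] for w in by_letter]
except KeyError as err:
    # 'T' is a character, not a vocabulary word, so the lookup fails.
    print("lookup by letter fails:", err)

print([word_map[w] for w in by_word])  # [1, 4, 2]
```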
