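LSTM model reaches 1.0 validation accuracy after the first epoch?

I am using an LSTM to generate news headlines. It should predict the next character based on the preceding characters in the sequence. I have a file with over a million news headlines, but to speed things up I randomly selected 100,000 of them for analysis.

When I try to train my model, it reaches 1.0 validation accuracy and 0.9986 training accuracy in the very first epoch. This cannot be right. I don't think it is a lack of data, since 90,000 training examples should be plenty, and it looks like more than ordinary overfitting. Each epoch also seems to take a long time to train (about 2.5 minutes), but I have never used an LSTM before, so I don't know what training time to expect. What is causing my model to behave like this?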

"""
Import Libraries Section
"""
import csv
import numpy as np
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense
import datetime
import matplotlib.pyplot as plt

"""
Load Data Section
"""
headlinesFull = []
with open("abcnews-date-text.csv", "r") as csv_file:
    csv_reader = csv.DictReader(csv_file, delimiter=',')
    for lines in csv_reader:
        headlinesFull.append(lines['headline_text'])

"""
Pretreat Data Section
"""
# shuffle and select 100000 headlines
np.random.shuffle(headlinesFull)
headlines = headlinesFull[:100000]

# add spaces to ensure each headline is the same length as the longest headline
max_len = max(map(len, headlines))
headlines = [i + " "*(max_len-len(i)) for i in headlines]

# integer encode sequences of characters
# create the tokenizer
t = Tokenizer(char_level=True)
# fit the tokenizer on the headlines
t.fit_on_texts(headlines)
sequences = t.texts_to_sequences(headlines)

# vocabulary size
vocab_size = len(t.word_index) + 1

# separate into input and output
sequences = np.array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_len = X.shape[1]

# split data for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

"""
Define Model Section
"""
# define model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_len))
model.add(LSTM(100, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dropout(0.2))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

"""
Train Model Section
"""
# fit model
model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=128, epochs=1)

Train on 90000 samples, validate on 10000 samples
Epoch 1/1
90000/90000 [==============================] - 161s 2ms/step - loss: 0.0493 - acc: 0.9986 - val_loss: 2.3842e-07 - val_acc: 1.0000

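Answer:

From reading the code, this is what I can infer:

You pad each headline with spaces so that it matches the length of the longest headline: headlines = [i + " "*(max_len-len(i)) for i in headlines]
The headlines are converted to sequences, and the input/output split happens only after every headline has been padded to the maximum length.
As a result, for most samples the output (the last character of the sequence) is the same padding character, a space. The model only has to learn to predict a space every time, which is why the accuracy is so high even after the first epoch.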
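The effect described above can be checked without training anything. The following is a minimal sketch in plain Python; the three sample headlines are made up for illustration and stand in for the real dataset, and `targets` and `space_share` are names introduced here, not part of the original code.

```python
from collections import Counter

# Hypothetical sample headlines standing in for the real dataset.
headlines = [
    "rain helps dampen bushfires",
    "council to review budget",
    "police probe fatal crash on highway",
]

max_len = max(map(len, headlines))

# Right-pad with spaces, exactly as in the question's code.
padded = [h + " " * (max_len - len(h)) for h in headlines]

# The question takes the last position of each padded sequence as the target.
targets = [h[-1] for h in padded]

counts = Counter(targets)
space_share = counts[" "] / len(targets)

print(counts)
print(f"share of targets that are padding: {space_share:.2f}")
```

Only the longest headline keeps a real character as its target; every shorter one ends in a space. Running the same count on the full 100,000 headlines should show whether the space class dominates the labels.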

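Solution:

You can add the padding at the start of each headline instead of at the end, so that the last position always holds a real character.

headlines = [" "*(max_len-len(i)) + i for i in headlines]

Alternatively, split each headline into X and y first, so that y is the headline's last real character, and only then pad the end of each input.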
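The second option could look roughly like the sketch below. It works on raw strings before tokenization; `split_then_pad` is a hypothetical helper name, and skipping empty headlines is an assumption added here so that every sample has a target.

```python
def split_then_pad(headlines, pad_char=" "):
    """Split off each headline's last real character as the target,
    then right-pad the remaining inputs to a common length."""
    # Assumption: empty headlines carry no target, so they are skipped.
    non_empty = [h for h in headlines if h]

    inputs = [h[:-1] for h in non_empty]   # everything except the last character
    targets = [h[-1] for h in non_empty]   # the last real character

    width = max(map(len, inputs))
    inputs = [x + pad_char * (width - len(x)) for x in inputs]
    return inputs, targets


# Hypothetical sample headlines standing in for the real dataset.
sample = [
    "rain helps dampen bushfires",
    "council to review budget",
    "police probe fatal crash on highway",
]

X_text, y_text = split_then_pad(sample)
print(y_text)
```

The padded inputs and the single-character targets can then be passed through the same tokenizer as before. Compared with padding at the start, this keeps the real characters at the beginning of each input and the padding directly before the prediction step, so the two options are not equivalent from the LSTM's point of view.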
