Validation accuracy is much lower than training accuracy

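I am doing multimodal sentiment analysis on the MOSI dataset, and for now I am training a model on the text modality only. For the text I use 300-dimensional GloVe embeddings. My total vocabulary size is 2173, and the padded sequence length is 30. My target array looks like [0,0,0,0,0,0,1], where the leftmost position means highly negative and the rightmost means highly positive.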

I split the dataset like this:

    X_train, X_test, y_train, y_test = train_test_split(WDatasetX, y7, test_size=0.20, random_state=42)

My tokenization process is:

    from keras.preprocessing.text import Tokenizer

    MAX_NB_WORDS = 3000
    tokenizer = Tokenizer(num_words=MAX_NB_WORDS, oov_token="OOV")
    tokenizer.fit_on_texts(Text_X_Train)
    tokenized_X_train = tokenizer.texts_to_sequences(Text_X_Train)
    tokenized_X_test = tokenizer.texts_to_sequences(Text_X_Test)

My embedding matrix:

    vocab_size = len(tokenizer.word_index) + 1
    emb_mean = 0

    def embedding_matrix_filteration():
        all_embs = np.stack(list(embeddings_index.values()))
        print(all_embs.shape)
        emb_mean, emb_std = np.mean(all_embs), np.std(all_embs)
        print(emb_mean)
        # Gives a matrix of the specified size filled with values from a Gaussian distribution
        embedding_matrix = np.random.normal(emb_mean, emb_std, (vocab_size, embed_dim))
        print(embedding_matrix.shape)
        print("length of word2id:", len(word2id))
        embeddedCount = 0
        not_found = []
        for word, idx in tokenizer.word_index.items():
            embedding_vector = embeddings_index.get(word.lower())
            if word == ' ':
                embedding_vector = np.zeros_like(emb_mean)
            if embedding_vector is not None:
                embedding_matrix[idx] = embedding_vector
                embeddedCount += 1
            else:
                print(word)
                print("$$$")
        # Words common between the GloVe vectors and the word set
        print('total embedded:', embeddedCount, 'common words')
        print("length of word2id:", len(word2id))
        print(len(embedding_matrix))
        return embedding_matrix

    emb = embedding_matrix_filteration()

Model architecture:

Embedding layer:

    embedding_layer = Embedding(
        vocab_size,
        300,
        weights=[emb],
        trainable=False,
        input_length=sequence_length)

My model:

    from keras import regularizers, layers
    from keras.models import Sequential
    from keras.layers import Bidirectional, Dense, Dropout

    model = Sequential()
    model.add(embedding_layer)
    model.add(Bidirectional(layers.LSTM(512, return_sequences=True)))
    model.add(Bidirectional(layers.LSTM(512, return_sequences=True)))
    model.add(Bidirectional(layers.LSTM(256, return_sequences=True)))
    model.add(Bidirectional(layers.LSTM(256)))  # kernel_regularizer=regularizers.l2(0.001)
    model.add(Dense(128, activation='relu'))
    # model.add(Dropout(0.2))
    model.add(Dense(128, activation='relu'))
    # model.add(Dropout(0.2))
    model.add(Dense(7, activation='softmax'))

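For some reason, even when my training accuracy reaches 80%, my validation accuracy stays very low. I have tried different regularization techniques, optimizers, and loss functions, but the result is the same. I don't know why.

Please help me!!

Edit: the total number of tokens is 2719, and the total number of sentences (test and training sets combined) is 2183.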

Compile:

    model.compile(optimizer='adam',
                  loss='mean_squared_error',
                  metrics=['accuracy'])

Updated statistics:

I have reduced the label size from 7 to 3, i.e. [0,1,0] -> positive, neutral, negative.

    model = Sequential()
    model.add(embedding_layer)
    model.add(Bidirectional(layers.LSTM(16, activation='relu')))
    model.add(Dropout(0.2))
    model.add(Dense(3, activation='softmax'))

Compile:

    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.00005),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

Graph: [image not preserved]

Training: [image not preserved]

But the loss is still high. In addition, I have already stratified the dataset.

Answer:

Here are some suggestions:

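Use categorical_crossentropy instead of mean_squared_error. For classification this helps a lot (the latter can work, but the former works better).
Are all your labels mutually exclusive? If so, use softmax + categorical_crossentropy; otherwise (e.g. labels that look like [1,0,0,0,0,0,1]), use sigmoid + binary_crossentropy.
Reduce the size of the model to begin with, and add Dropout() only if the overfitting persists. Use a single LSTM layer.
Reduce the number of units (even with only one LSTM layer, 64/128 is probably enough).
You can use a bidirectional LSTM (I would even try bidirectional GRUs, since they are simpler, and see how they perform).
Make sure you do a stratified split, so that every class is guaranteed to appear in both the training set and the validation set, in consistent proportions.
Start with a small learning rate (0.0001/0.00005).
Establish a realistic target/baseline. If you have little data, and especially since you are using only the text part of a multimodal dataset with 7 different classes, you may not be able to reach a very high accuracy.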
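To make the first two suggestions concrete, here is a minimal standard-library sketch (not the asker's code, and not the Keras implementation). It checks whether a label is one-hot, i.e. whether the classes are mutually exclusive, and compares the two losses on hypothetical softmax outputs. The helper names and the example probabilities are assumptions made up for illustration.

```python
import math

def is_one_hot(label):
    """True if the label has exactly one 1 and zeros elsewhere (mutually exclusive classes)."""
    return sorted(label) == [0] * (len(label) - 1) + [1]

def mean_squared_error(target, probs):
    """Average squared difference between the one-hot target and the predicted probabilities."""
    return sum((t - p) ** 2 for t, p in zip(target, probs)) / len(target)

def categorical_crossentropy(target, probs, eps=1e-12):
    """Negative log-probability of the true class; only that class contributes for a one-hot target."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))

# Hypothetical 3-class example: the true class is the last one.
target = [0, 0, 1]
mildly_wrong = [0.40, 0.20, 0.40]
confidently_wrong = [0.90, 0.05, 0.05]

print("one-hot target:", is_one_hot(target))
print("multi-label example:", is_one_hot([1, 0, 0, 0, 0, 0, 1]))

for name, probs in [("mildly wrong", mildly_wrong),
                    ("confidently wrong", confidently_wrong)]:
    mse = mean_squared_error(target, probs)
    cce = categorical_crossentropy(target, probs)
    print(f"{name}: mse={mse:.3f} cross-entropy={cce:.3f}")
```

For a one-hot target with K classes, the mean squared error can never exceed 2/K, while the cross-entropy keeps growing as the probability assigned to the true class approaches zero. Confidently wrong predictions are therefore penalized much more strongly, which is one reason cross-entropy tends to train classifiers better.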

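Keep in mind that, to get reasonable final results in your case, you need a data-centric approach rather than a model-centric one. Whatever improvements you make to the model, if the data is scarce and not comprehensive, you will not be able to get very good results.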
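On the data side, two of the suggestions above (a stratified split and a baseline) can be sketched with the standard library alone. This is an illustrative sketch, not the asker's pipeline: it assumes labels are integer class ids rather than one-hot arrays, the function names are hypothetical, and the label counts are made up rather than taken from MOSI. With scikit-learn, passing the class ids to the `stratify` argument of `train_test_split` serves the same purpose.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) so each class keeps roughly the same share in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction from every class, and at least one example.
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == majority) / len(test_labels)

# Made-up, imbalanced 3-class labels (0=negative, 1=neutral, 2=positive).
labels = [0] * 50 + [1] * 10 + [2] * 40
train_idx, test_idx = stratified_split(labels)
train_labels = [labels[i] for i in train_idx]
test_labels = [labels[i] for i in test_idx]

print("train class counts:", Counter(train_labels))
print("test class counts:", Counter(test_labels))
print("majority baseline:", majority_baseline_accuracy(train_labels, test_labels))
```

A validation accuracy close to the majority baseline suggests the model has learned little beyond the class frequencies, which is a more useful reference point than the training accuracy.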
