I have a dataset of 50,000 examples: reviews and their sentiment (positive or negative).
I allocated 90% of the data to the training set and the remaining 10% to the test set.
My question is: if I run 5 epochs on this training set, shouldn't each epoch load 9,000 examples rather than 1,407?
```python
import numpy as np
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn import preprocessing

# Train/test split
test_sample_size = int(0.1 * len(preprocessed_reviews))  # 10% of the data as the validation set

# For sentiment labels
sentiment = [1 if x == 'positive' else 0 for x in sentiment]

# Split the data into training and test sets
X_test, X_train = (np.array(preprocessed_reviews[:test_sample_size]),
                   np.array(preprocessed_reviews[test_sample_size:]))
y_test, y_train = (np.array(sentiment[:test_sample_size]),
                   np.array(sentiment[test_sample_size:]))

tokenizer = Tokenizer(oov_token='<OOV>')  # For out-of-vocabulary words
tokenizer.fit_on_texts(X_train)
vocab_count = len(tokenizer.word_index) + 1  # +1 for padding

training_sequences = tokenizer.texts_to_sequences(X_train)  # tokenizer.word_index shows the indices
training_padded = pad_sequences(training_sequences, padding='post')  # Pad sequences with zeros
training_normal = preprocessing.normalize(training_padded)  # Normalize the data

testing_sequences = tokenizer.texts_to_sequences(X_test)
testing_padded = pad_sequences(testing_sequences, padding='post')
testing_normal = preprocessing.normalize(testing_padded)

input_length = len(training_normal[0])  # Length of every sequence

# Build the model
model = keras.models.Sequential()
model.add(keras.layers.Embedding(input_dim=vocab_count, output_dim=2,
                                 input_length=input_length))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(63, activation='relu'))  # Hidden layer
model.add(keras.layers.Dense(16, activation='relu'))  # Hidden layer
model.add(keras.layers.Dense(1, activation='sigmoid'))  # Output layer

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(training_normal, y_train, epochs=5)
```
Output:
```
Epoch 1/5
1407/1407 [==============================] - 9s 7ms/step - loss: 0.6932 - accuracy: 0.4992
Epoch 2/5
1407/1407 [==============================] - 9s 6ms/step - loss: 0.6932 - accuracy: 0.5030
Epoch 3/5
1407/1407 [==============================] - 9s 6ms/step - loss: 0.6932 - accuracy: 0.4987
Epoch 4/5
1407/1407 [==============================] - 9s 6ms/step - loss: 0.6932 - accuracy: 0.5024
Epoch 5/5
1407/1407 [==============================] - 9s 6ms/step - loss: 0.6932 - accuracy: 0.5020
```
Sorry, I'm still fairly new to TensorFlow — I hope someone can help me out!
Answer:
With roughly 50,000 data points and a 90/10 train/test split, about 45,000 samples go to training and the remaining 5,000 to testing. When you call `fit`, Keras uses a default `batch_size` of 32 (you can change it to 64, 128, etc.). The number 1407 in the progress bar therefore counts batches, not samples: the model performs 1407 forward/backward steps to complete one full epoch, since ceil(45,000 / 32) = 1407.
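A quick sanity check of that arithmetic (assuming the default `batch_size=32` and a 90/10 split of 50,000 samples):

```python
import math

# Training-set size after a 90/10 split of 50,000 reviews
train_size = int(0.9 * 50_000)  # 45,000

# Keras' model.fit defaults to batch_size=32; the progress bar counts
# batches (gradient steps) per epoch, not individual samples.
steps_per_epoch = math.ceil(train_size / 32)
print(steps_per_epoch)  # 1407

# Doubling the batch size roughly halves the step count shown per epoch.
print(math.ceil(train_size / 64))  # 704
```

So every epoch still passes over all ~45,000 training samples — it just does so in 1407 batches of 32.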