How many epochs are needed to train a model with an LSTM?

I am using an Actor-Critic TD3 model with an LSTM in my AI. For each training run, I create a batch of sequential data and train my AI on it.

Could an expert help me understand whether this kind of AI also needs to be trained over epochs? And in general, how many epochs would one run with this code? Since I already create many batches within a single training step, does it even make sense to use epochs here?

Here is the training-step code:

def train(
    self,
    replay_buffer,
    iterations,
    batch_size=50,
    discount=0.99,
    tau=0.005,
    policy_noise=0.2,
    noise_clip=0.5,
    policy_freq=2,
):
    b_state = torch.Tensor([])
    b_next_state = torch.Tensor([])
    b_done = torch.Tensor([])
    b_reward = torch.Tensor([])
    b_action = torch.Tensor([])

    for it in range(iterations):
        # print('it: ', it, ' iterations: ', iterations)

        # Step 4: We sample a batch of transitions (s, s', a, r) from the memory
        (batch_states, batch_next_states, batch_actions,
         batch_rewards, batch_dones) = replay_buffer.sample(batch_size)
        batch_states = batch_states.astype(float)
        batch_next_states = batch_next_states.astype(float)
        batch_actions = batch_actions.astype(float)
        batch_rewards = batch_rewards.astype(float)
        batch_dones = batch_dones.astype(float)

        state = torch.from_numpy(batch_states)
        next_state = torch.from_numpy(batch_next_states)
        action = torch.from_numpy(batch_actions)
        reward = torch.from_numpy(batch_rewards)
        done = torch.from_numpy(batch_dones)

        b_size = 1
        seq_len = state.shape[0]
        batch = b_size
        input_size = state_dim

        state = torch.reshape(state, (1, seq_len, state_dim))
        next_state = torch.reshape(next_state, (1, seq_len, state_dim))
        done = torch.reshape(done, (1, seq_len, 1))
        reward = torch.reshape(reward, (1, seq_len, 1))
        action = torch.reshape(action, (1, seq_len, action_dim))

        b_state = torch.cat((b_state, state), dim=0)
        b_next_state = torch.cat((b_next_state, next_state), dim=0)
        b_done = torch.cat((b_done, done), dim=0)
        b_reward = torch.cat((b_reward, reward), dim=0)
        b_action = torch.cat((b_action, action), dim=0)

        # state = torch.reshape(state, (seq_len, 1, state_dim))
        # next_state = torch.reshape(next_state, (seq_len, 1, state_dim))
        # done = torch.reshape(done, (seq_len, 1, 1))
        # reward = torch.reshape(reward, (seq_len, 1, 1))
        # action = torch.reshape(action, (seq_len, 1, action_dim))

        # b_state = torch.cat((b_state, state), dim=1)
        # b_next_state = torch.cat((b_next_state, next_state), dim=1)
        # b_done = torch.cat((b_done, done), dim=1)
        # b_reward = torch.cat((b_reward, reward), dim=1)
        # b_action = torch.cat((b_action, action), dim=1)

    print("dim state:", b_state.shape)

    # for h and c, shape is (num_layers * num_directions, batch, hidden_size)
    ha0 = torch.zeros(lstm_layers, b_state.shape[0], state_dim)
    ca0 = torch.zeros(lstm_layers, b_state.shape[0], state_dim)
    hc0 = torch.zeros(lstm_layers, b_state.shape[0], state_dim + action_dim)
    cc0 = torch.zeros(lstm_layers, b_state.shape[0], state_dim + action_dim)

    # Step 5: From the next state s', the Actor target plays the next action a'
    b_next_action = self.actor_target(b_next_state, (ha0, ca0))
    b_next_action = b_next_action[0]

    # Step 6: We add Gaussian noise to this next action a' and clamp it in a
    # range of values supported by the environment
    noise = torch.Tensor(b_next_action).data.normal_(0, policy_noise)
    noise = noise.clamp(-noise_clip, noise_clip)
    b_next_action = (b_next_action + noise).clamp(-self.max_action, self.max_action)

    # Step 7: The two Critic targets each take the couple (s', a') as input and
    # return two Q-values Qt1(s', a') and Qt2(s', a') as outputs
    result = self.critic_target(b_next_state, b_next_action, (hc0, cc0))
    target_Q1 = result[0]
    target_Q2 = result[1]

    # Step 8: We keep the minimum of these two Q-values: min(Qt1, Qt2)
    target_Q = torch.min(target_Q1, target_Q2).double()

    # Step 9: We get the final target of the two Critic models:
    # Qt = r + gamma * min(Qt1, Qt2), where gamma is the discount factor
    target_Q = b_reward + (1 - b_done) * discount * target_Q

    # Step 10: The two Critic models each take the couple (s, a) as input and
    # return two Q-values Q1(s, a) and Q2(s, a) as outputs
    b_action_reshape = torch.reshape(b_action, b_next_action.shape)
    result = self.critic(b_state, b_action_reshape, (hc0, cc0))
    current_Q1 = result[0]
    current_Q2 = result[1]

    # Step 11: We compute the loss coming from the two Critic models:
    # Critic Loss = MSE_Loss(Q1(s, a), Qt) + MSE_Loss(Q2(s, a), Qt)
    critic_loss = F.mse_loss(current_Q1, target_Q) + F.mse_loss(current_Q2, target_Q)

    # Step 12: We backpropagate this Critic loss and update the parameters of
    # the two Critic models with an SGD optimizer
    self.critic_optimizer.zero_grad()
    critic_loss.backward()
    self.critic_optimizer.step()

    # Step 13: Once every two iterations, we update our Actor model by performing
    # gradient ascent on the output of the first Critic model
    out = self.actor(b_state, (ha0, ca0))
    out = out[0]
    (actor_loss, hx, cx) = self.critic.Q1(b_state, out, (hc0, cc0))
    actor_loss = -1 * actor_loss.mean()
    self.actor_optimizer.zero_grad()
    actor_loss.backward()
    self.actor_optimizer.step()

    # Step 14: Still once every two iterations, we update the weights of the
    # Actor target by polyak averaging
    for (param, target_param) in zip(self.actor.parameters(),
                                     self.actor_target.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)

    # Step 15: Still once every two iterations, we update the weights of the
    # Critic target by polyak averaging
    for (param, target_param) in zip(self.critic.parameters(),
                                     self.critic_target.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
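
A side note on the hidden-state initialisation in the code above: for PyTorch's nn.LSTM, h_0 and c_0 always have shape (num_layers * num_directions, batch, hidden_size), even when the input itself is passed batch-first. A minimal, self-contained sketch of that convention (the sizes below are placeholders for illustration, not the ones from the model in the question):

import torch
import torch.nn as nn

state_dim, hidden_dim, lstm_layers = 4, 8, 1   # placeholder sizes for illustration
batch, seq_len = 3, 50

# batch_first=True means the input is (batch, seq_len, input_size),
# matching the (1, seq_len, state_dim) reshape used in train() above.
lstm = nn.LSTM(input_size=state_dim, hidden_size=hidden_dim,
               num_layers=lstm_layers, batch_first=True)

x = torch.randn(batch, seq_len, state_dim)

# Regardless of batch_first, h0/c0 keep the shape
# (num_layers * num_directions, batch, hidden_size).
h0 = torch.zeros(lstm_layers, batch, hidden_dim)
c0 = torch.zeros(lstm_layers, batch, hidden_dim)

out, (hn, cn) = lstm(x, (h0, c0))
print(out.shape)   # torch.Size([3, 50, 8]) -> (batch, seq_len, hidden_size)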

Answer:

First of all, I would say that reinforcement learning usually requires a lot of training. How much depends heavily on the complexity of the problem you are trying to solve, and a complex model (for example, one using an LSTM) can make things worse. When training an agent to play Atari games, you may need as many as a million episodes (depending on the game and the method used).

Regarding epochs: if you mean repeatedly training on a particular episode (or set of episodes), then it depends on whether you are using an on-policy or an off-policy method (for off-policy methods this amounts to experience replay, and the term "epoch" is not really appropriate). Actor-Critic methods are usually on-policy, which means they need fresh data at every training stage; once an episode has been used for training, it should not be used again. For the distinction between on-policy and off-policy, I suggest looking at Sutton's book for more information.
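
For what it's worth, TD3 itself is an off-policy algorithm that samples from a replay buffer, so for the code in the question the practical knob is not an epoch count over a fixed dataset but the number of gradient updates performed per amount of collected experience. Below is a rough sketch of a common outer loop, assuming a Gym-style env, an agent exposing a select_action method plus the train method from the question, and a replay_buffer with an add method; these names are placeholders and not part of the original code:

max_episodes = 1000              # total episodes to collect; problem-dependent

for episode in range(max_episodes):
    state, done, episode_steps = env.reset(), False, 0
    while not done:
        action = agent.select_action(state)              # act with exploration noise
        next_state, reward, done, _ = env.step(action)
        replay_buffer.add((state, next_state, action, reward, float(done)))
        state = next_state
        episode_steps += 1

    # Off-policy: old transitions stay in the buffer and may be re-sampled later,
    # so instead of counting epochs you choose how many gradient iterations to run
    # per episode -- commonly one per environment step collected.
    agent.train(replay_buffer, iterations=episode_steps, batch_size=50)

With a loop like this, "how long to train" is governed by max_episodes and by the ratio of gradient updates to environment steps, rather than by an epoch count.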
