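How to change this code to use a Q-table for reinforcement learning

I am learning about Q-tables and tried a simple version that used only a one-dimensional array to move forward and backward. Now I am trying four-directional movement, but I am stuck on how to control the agent.

I have random movement working, and it eventually finds the goal. But I want it to learn how to reach the goal rather than stumble on it by chance. How do I add Q-learning to this code?

Here is my complete code so far: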

import numpy as np
import random
import math

world = np.zeros((5, 5))
print(world)

# Make sure the goal can never be at 0, i.e. the start point
goal_x = random.randint(1, 4)
goal_y = random.randint(1, 4)
goal = (goal_x, goal_y)
print(goal)
world[goal] = 1
print(world)

LEFT = 0
RIGHT = 1
UP = 2
DOWN = 3

map_range_min = 0
map_range_max = 4  # largest valid index on the 5x5 grid


class Agent:
    def __init__(self, current_position, my_goal, world):
        self.current_position = current_position
        self.last_postion = current_position
        self.visited_positions = []
        self.goal = my_goal
        self.last_reward = 0
        self.totalReward = 0
        self.q_table = world

    # Update the total reward by the reward
    def updateReward(self, extra_reward):
        # This will either increase or decrease the total reward for the episode
        x = (self.goal[0] - self.current_position[0]) ** 2
        y = (self.goal[1] - self.current_position[1]) ** 2
        dist = math.sqrt(x + y)
        complet_reward = dist + extra_reward
        self.totalReward += complet_reward

    def validate_move(self):
        valid_move_set = []
        # Check for x ranges
        if map_range_min < self.current_position[0] < map_range_max:
            valid_move_set.append(LEFT)
            valid_move_set.append(RIGHT)
        elif map_range_min == self.current_position[0]:
            valid_move_set.append(RIGHT)
        else:
            valid_move_set.append(LEFT)
        # Check for y ranges
        if map_range_min < self.current_position[1] < map_range_max:
            valid_move_set.append(UP)
            valid_move_set.append(DOWN)
        elif map_range_min == self.current_position[1]:
            valid_move_set.append(DOWN)
        else:
            valid_move_set.append(UP)
        return valid_move_set

    # Make the agent move
    def move_right(self):
        self.last_postion = self.current_position
        x = self.current_position[0]
        x += 1
        y = self.current_position[1]
        return (x, y)

    def move_left(self):
        self.last_postion = self.current_position
        x = self.current_position[0]
        x -= 1
        y = self.current_position[1]
        return (x, y)

    def move_down(self):
        self.last_postion = self.current_position
        x = self.current_position[0]
        y = self.current_position[1]
        y += 1
        return (x, y)

    def move_up(self):
        self.last_postion = self.current_position
        x = self.current_position[0]
        y = self.current_position[1]
        y -= 1
        return (x, y)

    def move_agent(self):
        # Pick one of the currently valid moves uniformly at random
        move_set = self.validate_move()
        randChoice = random.randint(0, len(move_set) - 1)
        move = move_set[randChoice]
        if move == UP:
            return self.move_up()
        elif move == DOWN:
            return self.move_down()
        elif move == RIGHT:
            return self.move_right()
        else:
            return self.move_left()

    # Update the rewards
    # Return False to end the episode
    def checkPosition(self):
        if self.current_position == self.goal:
            print("Found Goal")
            self.updateReward(10)
            return False
        else:
            # Choose a new direction
            self.current_position = self.move_agent()
            self.visited_positions.append(self.current_position)
            # Currently get nothing for not reaching the goal
            self.updateReward(0)
            return True


gus = Agent((0, 0), goal, world)
play = gus.checkPosition()
while play:
    play = gus.checkPosition()
print(gus.totalReward)

Answer:

Based on your code sample, I have a few suggestions:

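Separate the environment from the agent. The environment needs a method of the form new_state, reward = env.step(old_state, action). This method defines how an action transforms the old state into a new state. It is best to encode states and actions as simple integers. I strongly recommend writing unit tests for this method.
The agent then needs an equivalent method, action = agent.policy(state, reward). As a first step, hand-code an agent that does what you think is right. For example, it might simply try to head toward the goal position.
Consider whether the state representation is Markovian. If you could solve the problem better by remembering all previously visited states, then the state does not have the Markov property. Ideally, the state representation should be as compact as possible (the smallest set that is still Markovian).
Once this structure is in place, you can look at actually learning a Q-table. One possible approach (easy to understand, though not necessarily efficient) is Monte Carlo with either exploring starts or an epsilon-soft greedy policy. A good RL book should give pseudocode for both variants.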
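
To make the suggestions above concrete, here is a minimal sketch of one way they could fit together. It is not the asker's code and makes several assumptions: states are integers 0..24 (row-major on the 5x5 grid), the environment is a plain function step() that also takes the goal and returns a done flag, moves off the grid leave the agent in place, and the reward is -1 per step and +10 at the goal. The names to_state, handcoded_policy, epsilon_greedy, train and greedy_path are hypothetical, as are the hyperparameter values. The learning part uses every-visit Monte Carlo control with an epsilon-greedy policy, which is one form of the epsilon-soft variant mentioned above.

```python
import random

SIZE = 5  # 5x5 grid, as in the question
LEFT, RIGHT, UP, DOWN = 0, 1, 2, 3
N_STATES, N_ACTIONS = SIZE * SIZE, 4
# (dx, dy) per action; y grows downward, matching the question's move_down
DELTAS = {LEFT: (-1, 0), RIGHT: (1, 0), UP: (0, -1), DOWN: (0, 1)}


def to_state(x, y):
    """Encode a grid cell as a single integer (row-major)."""
    return y * SIZE + x


def step(state, action, goal_state):
    """Environment: map (old state, action) to (new state, reward, done)."""
    x, y = state % SIZE, state // SIZE
    dx, dy = DELTAS[action]
    # Assumption: a move that would leave the grid keeps the agent in place.
    x = min(max(x + dx, 0), SIZE - 1)
    y = min(max(y + dy, 0), SIZE - 1)
    new_state = to_state(x, y)
    done = new_state == goal_state
    # Assumed reward scheme: -1 per step, +10 on reaching the goal.
    return new_state, (10 if done else -1), done


def handcoded_policy(state, goal_state):
    """Hand-written baseline agent: close the x gap first, then the y gap."""
    x, y = state % SIZE, state // SIZE
    gx, gy = goal_state % SIZE, goal_state // SIZE
    if x != gx:
        return RIGHT if gx > x else LEFT
    return DOWN if gy > y else UP


def epsilon_greedy(q_table, state, epsilon):
    """Learned agent: explore with probability epsilon, else exploit."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    row = q_table[state]
    return row.index(max(row))


def train(goal_state, episodes=2000, epsilon=0.1, gamma=0.9, max_steps=200):
    """Every-visit Monte Carlo control with an epsilon-greedy policy."""
    q_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    counts = [[0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        state, episode = to_state(0, 0), []
        for _ in range(max_steps):
            action = epsilon_greedy(q_table, state, epsilon)
            new_state, reward, done = step(state, action, goal_state)
            episode.append((state, action, reward))
            state = new_state
            if done:
                break
        # Walk the episode backwards, accumulating the discounted return,
        # and move each Q-value toward the running average of its returns.
        ret = 0.0
        for s, a, r in reversed(episode):
            ret = r + gamma * ret
            counts[s][a] += 1
            q_table[s][a] += (ret - q_table[s][a]) / counts[s][a]
    return q_table


def greedy_path(q_table, goal_state, max_steps=50):
    """Follow the Q-table greedily from the start cell; return visited states."""
    state = to_state(0, 0)
    path = [state]
    for _ in range(max_steps):
        action = epsilon_greedy(q_table, state, 0.0)
        state, _, done = step(state, action, goal_state)
        path.append(state)
        if done:
            break
    return path


if __name__ == "__main__":
    goal = to_state(random.randint(1, 4), random.randint(1, 4))
    q = train(goal)
    print("goal state:", goal)
    print("greedy path:", greedy_path(q, goal))
```

Because step() is deterministic and takes plain integers, it is easy to unit test, and handcoded_policy gives a baseline to compare the learned Q-table against. Passing the goal into step() keeps the state Markovian only as long as the goal stays fixed for the whole training run.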

When you feel confident, head over to OpenAI Gym https://www.gymlibrary.dev/ for a more detailed class structure. There are some hints on creating your own environment here: https://www.gymlibrary.dev/content/environment_creation/
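
As a rough idea of what that class structure looks like, here is a plain-Python stand-in that mimics the reset()/step() shape of a Gym environment without importing Gym. The class name GridWorldEnv, the fixed default goal, the reward values and the max_steps cutoff are all assumptions made for illustration. The exact return tuples differ between Gym versions; this sketch assumes the five-value step() form (observation, reward, terminated, truncated, info), so check the linked documentation for the version you install.

```python
class GridWorldEnv:
    """Plain-Python stand-in with a Gym-like reset()/step() interface."""

    # (dx, dy) for LEFT, RIGHT, UP, DOWN, numbered 0..3 as in the question
    DELTAS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    def __init__(self, size=5, goal=(3, 3), max_steps=100):
        self.size = size
        self.goal = goal  # kept fixed so the position alone is a Markov state
        self.max_steps = max_steps
        self.pos = (0, 0)
        self.steps = 0

    def _obs(self):
        """Encode the agent position as a single integer."""
        x, y = self.pos
        return y * self.size + x

    def reset(self):
        """Start a new episode at (0, 0); return (observation, info)."""
        self.pos = (0, 0)
        self.steps = 0
        return self._obs(), {}

    def step(self, action):
        """Apply one action; moves off the grid leave the agent in place."""
        dx, dy = self.DELTAS[action]
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        self.steps += 1
        terminated = self.pos == self.goal
        truncated = not terminated and self.steps >= self.max_steps
        reward = 10 if terminated else -1  # assumed reward scheme
        return self._obs(), reward, terminated, truncated, {}


if __name__ == "__main__":
    import random

    env = GridWorldEnv()
    obs, info = env.reset()
    total, finished = 0, False
    while not finished:
        obs, reward, terminated, truncated, info = env.step(random.randrange(4))
        total += reward
        finished = terminated or truncated
    print("episode return with random actions:", total)
```

A real Gym environment would additionally subclass the library's Env class and declare its observation and action spaces, as described at the environment creation link above.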
