I'm learning about Q-tables and tried a simple version that only used a one-dimensional array to move forward and backward. Now I'm trying four-directional movement and I'm stuck on how to control the character.
I already have random movement implemented, and it does eventually find the goal. But I want it to learn how to reach the goal rather than stumbling onto it by chance. How would I add Q-learning to this code?
Here is my complete code so far:
```python
import numpy as np
import random
import math

world = np.zeros((5, 5))
print(world)

# Make sure that it can never be 0, i.e. the start point
goal_x = random.randint(1, 4)
goal_y = random.randint(1, 4)
goal = (goal_x, goal_y)
print(goal)
world[goal] = 1
print(world)

LEFT = 0
RIGHT = 1
UP = 2
DOWN = 3
map_range_min = 0
map_range_max = 5


class Agent:
    def __init__(self, current_position, my_goal, world):
        self.current_position = current_position
        self.last_postion = current_position
        self.visited_positions = []
        self.goal = my_goal
        self.last_reward = 0
        self.totalReward = 0
        self.q_table = world

    # Update the total reward by the reward
    def updateReward(self, extra_reward):
        # This will either increase or decrease the total reward for the episode
        x = (self.goal[0] - self.current_position[0]) ** 2
        y = (self.goal[1] - self.current_position[1]) ** 2
        dist = math.sqrt(x + y)
        complet_reward = dist + extra_reward
        self.totalReward += complet_reward

    def validate_move(self):
        valid_move_set = []
        # Check for x ranges
        if map_range_min < self.current_position[0] < map_range_max:
            valid_move_set.append(LEFT)
            valid_move_set.append(RIGHT)
        elif map_range_min == self.current_position[0]:
            valid_move_set.append(RIGHT)
        else:
            valid_move_set.append(LEFT)
        # Check for y ranges
        if map_range_min < self.current_position[1] < map_range_max:
            valid_move_set.append(UP)
            valid_move_set.append(DOWN)
        elif map_range_min == self.current_position[1]:
            valid_move_set.append(DOWN)
        else:
            valid_move_set.append(UP)
        return valid_move_set

    # Make the agent move
    def move_right(self):
        self.last_postion = self.current_position
        x = self.current_position[0]
        x += 1
        y = self.current_position[1]
        return (x, y)

    def move_left(self):
        self.last_postion = self.current_position
        x = self.current_position[0]
        x -= 1
        y = self.current_position[1]
        return (x, y)

    def move_down(self):
        self.last_postion = self.current_position
        x = self.current_position[0]
        y = self.current_position[1]
        y += 1
        return (x, y)

    def move_up(self):
        self.last_postion = self.current_position
        x = self.current_position[0]
        y = self.current_position[1]
        y -= 1
        return (x, y)

    def move_agent(self):
        move_set = self.validate_move()
        randChoice = random.randint(0, len(move_set) - 1)
        move = move_set[randChoice]
        if move == UP:
            return self.move_up()
        elif move == DOWN:
            return self.move_down()
        elif move == RIGHT:
            return self.move_right()
        else:
            return self.move_left()

    # Update the rewards
    # Return True to keep the episode running, False once the goal is found
    def checkPosition(self):
        if self.current_position == self.goal:
            print("Found Goal")
            self.updateReward(10)
            return False
        else:
            # Choose a new direction
            self.current_position = self.move_agent()
            self.visited_positions.append(self.current_position)
            # Currently get nothing for not reaching the goal
            self.updateReward(0)
            return True


gus = Agent((0, 0), goal, world)
play = gus.checkPosition()
while play:
    play = gus.checkPosition()
print(gus.totalReward)
```
Answer:
Based on your code example, I have a few suggestions:
- Separate the environment from the agent. The environment needs a method of the form `new_state, reward = env.step(old_state, action)` that describes how an action transforms the old state into the new state. It is best to encode states and actions as simple integers. I strongly recommend writing unit tests for this method (a rough sketch follows this list).
- The agent then needs an equivalent method, `action = agent.policy(state, reward)`. As a first step, hand-code an agent that does what you already believe is right; for example, it might simply head toward the goal position (see the second sketch below).
- Consider whether your state representation is Markov. If remembering all previously visited states would let you solve the problem better, then the state does not have the Markov property. Ideally the state representation should be as compact as possible (the smallest set that is still Markov); in your grid world, the agent's current cell already qualifies.
- Once this structure is in place, you can think about actually learning the Q-table. One possible approach (easy to understand, though not necessarily efficient) is Monte Carlo, combined with exploring starts or an epsilon-soft greedy policy. A good reinforcement learning book will give pseudocode for both variants; a minimal epsilon-soft sketch also follows below.
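To make the first point concrete, here is a minimal sketch of what such an environment could look like for a 5x5 grid like yours. The class name `GridEnvironment`, the integer state encoding, and the reward values (+10 at the goal, -1 per step) are assumptions I made for illustration, not a fixed recipe:

```python
LEFT, RIGHT, UP, DOWN = 0, 1, 2, 3
GRID_SIZE = 5

def encode(x, y):
    """Pack an (x, y) cell into a single integer state."""
    return x * GRID_SIZE + y

def decode(state):
    """Unpack an integer state back into an (x, y) cell."""
    return divmod(state, GRID_SIZE)

class GridEnvironment:
    def __init__(self, goal_state):
        self.goal_state = goal_state  # integer-encoded goal cell

    def step(self, old_state, action):
        """Apply an action to a state and return (new_state, reward)."""
        x, y = decode(old_state)
        if action == LEFT:
            x = max(x - 1, 0)
        elif action == RIGHT:
            x = min(x + 1, GRID_SIZE - 1)
        elif action == UP:
            y = max(y - 1, 0)
        elif action == DOWN:
            y = min(y + 1, GRID_SIZE - 1)
        new_state = encode(x, y)
        # Assumed reward scheme: +10 on the goal, -1 for every other step.
        reward = 10 if new_state == self.goal_state else -1
        return new_state, reward

# The kind of unit test I would write for step():
env = GridEnvironment(goal_state=encode(3, 2))
assert env.step(encode(0, 0), LEFT) == (encode(0, 0), -1)   # walls clamp moves
assert env.step(encode(2, 2), RIGHT) == (encode(3, 2), 10)  # stepping onto the goal
```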
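For the second point, a hand-coded agent that ignores the reward and just walks toward the goal could look like this. It reuses `encode`, `decode`, the action constants and `env` from the sketch above; the class name and the episode loop are again only illustrative:

```python
class HandCodedAgent:
    def __init__(self, goal_state):
        self.goal_state = goal_state

    def policy(self, state, reward):
        """Pick an action that moves one step closer to the goal."""
        x, y = decode(state)
        gx, gy = decode(self.goal_state)
        if x < gx:
            return RIGHT
        if x > gx:
            return LEFT
        if y < gy:
            return DOWN
        return UP

# One episode with the hand-coded agent against the environment above.
agent = HandCodedAgent(goal_state=env.goal_state)
state, reward = encode(0, 0), 0
while state != agent.goal_state:
    action = agent.policy(state, reward)
    state, reward = env.step(state, action)
print("Reached the goal, last reward:", reward)
```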
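Once those two pieces work, learning the Q-table with Monte Carlo and an epsilon-soft (epsilon-greedy) policy could be sketched roughly as below, again reusing the environment and helpers above. This is the every-visit variant, and the hyperparameters (`EPSILON`, `GAMMA`, the step cap and episode count) are arbitrary assumptions; see a textbook such as Sutton & Barto for the first-visit and exploring-starts versions:

```python
import random
from collections import defaultdict

N_STATES = GRID_SIZE * GRID_SIZE
ACTIONS = [LEFT, RIGHT, UP, DOWN]
EPSILON = 0.1    # probability of taking a random exploratory action
GAMMA = 0.9      # discount factor
MAX_STEPS = 200  # safety cap so an unlucky episode cannot run forever

q_table = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]
returns_count = defaultdict(int)  # (state, action) -> number of returns averaged so far

def epsilon_soft_action(state):
    """Mostly greedy with respect to the Q-table, random with probability EPSILON."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table[state][a])

def run_episode(env, start_state):
    """Follow the epsilon-soft policy and record (state, action, reward) triples."""
    trajectory, state = [], start_state
    for _ in range(MAX_STEPS):
        action = epsilon_soft_action(state)
        next_state, reward = env.step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
        if state == env.goal_state:
            break
    return trajectory

def update_q(trajectory):
    """Every-visit Monte Carlo update: keep a running average of observed returns."""
    g = 0.0
    for state, action, reward in reversed(trajectory):
        g = reward + GAMMA * g
        returns_count[(state, action)] += 1
        n = returns_count[(state, action)]
        q_table[state][action] += (g - q_table[state][action]) / n

for _ in range(500):
    update_q(run_episode(env, start_state=encode(0, 0)))
```

The greedy policy you finally read out of `q_table` (the argmax over actions in each state) is what would replace the random `move_agent` in your current code.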
Once you feel confident with this, head over to OpenAI Gym https://www.gymlibrary.dev/ for ideas on a more detailed class structure. Here are some hints on creating your own environments: https://www.gymlibrary.dev/content/environment_creation/