I am trying to train an agent with Q-learning to solve a maze problem.
I created the environment with the following code:
import gym
import gym_maze
import numpy as np

env = gym.make("maze-v0")
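As a quick sanity check on the environment (the values in the comments are just what I would expect from maze-v0: four discrete actions and an [x, y] observation; they may differ on another install):

print(env.action_space)       # e.g. Discrete(4)
print(env.observation_space)  # a 2-D Box holding the robot's [x, y] cell coordinates
print(env.reset())            # the start cell, e.g. [0 0]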
Since the state comes as [x, y] coordinates and I want a two-dimensional Q-table, I created a dictionary that maps each state to an index:
states_dic = {}
count = 0
for i in range(5):
    for j in range(5):
        states_dic[i, j] = count
        count += 1
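With a 5x5 maze this enumerates the 25 cells row by row, for example:

print(states_dic[0, 0])  # 0
print(states_dic[0, 4])  # 4
print(states_dic[4, 4])  # 24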
Then I created the Q-table:
n_actions = env.action_space.n

# Initialize the Q-table to 0
Q_table = np.zeros((len(states_dic), n_actions))
print(Q_table)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
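So the table has one row per maze cell and one column per action:

print(Q_table.shape)  # (25, 4)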
Some variables:
# Number of episodes we will run
n_episodes = 10000
# Maximum number of iterations per episode
max_iter_episode = 100
# Initialize the exploration probability to 1
exploration_proba = 1
# Exponential decay rate for the exploration probability
exploration_decreasing_decay = 0.001
# Minimum exploration probability
min_exploration_proba = 0.01
# Discount factor
gamma = 0.99
# Learning rate
lr = 0.1

rewards_per_episode = list()
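To get a feel for the exploration schedule before training, here is a small sketch of how exploration_proba decays across episodes (values rounded, not part of the training loop):

for e in (0, 1000, 2302, 4605, 9999):
    p = max(min_exploration_proba, np.exp(-exploration_decreasing_decay * e))
    print(e, round(p, 3))
# 0 1.0, 1000 0.368, 2302 0.1, 4605 0.01, 9999 0.01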
But when I try to run the Q-learning algorithm, I get the error mentioned in the title.
# We iterate over episodes
for e in range(n_episodes):
    # Initialize the first state of the episode
    current_state = env.reset()
    done = False

    # Sum the rewards the agent gets from the environment
    total_episode_reward = 0

    for i in range(max_iter_episode):
        if np.random.uniform(0, 1) < exploration_proba:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q_table[current_state, :])

        next_state, reward, done, _ = env.step(action)

        current_coordinate_x = int(current_state[0])
        current_coordinate_y = int(current_state[1])
        next_coordinate_x = int(next_state[0])
        next_coordinate_y = int(next_state[1])

        # Update the Q-table using the Q-learning iteration
        current_Q_table_coordinates = states_dic[current_coordinate_x, current_coordinate_y]
        next_Q_table_coordinates = states_dic[next_coordinate_x, next_coordinate_y]
        Q_table[current_Q_table_coordinates, action] = (1 - lr) * Q_table[current_Q_table_coordinates, action] + lr * (reward + gamma * max(Q_table[next_Q_table_coordinates, :]))

        total_episode_reward = total_episode_reward + reward
        # If the episode is finished, we leave the for loop
        if done:
            break
        current_state = next_state

    # We update the exploration probability using an exponential decay formula
    exploration_proba = max(min_exploration_proba,
                            np.exp(-exploration_decreasing_decay * e))
    rewards_per_episode.append(total_episode_reward)
Update:
Sharing the full error traceback:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-74e6fe3c1212> in <module>()
     25         # The environment runs the chosen action and returns
     26         # the next state, a reward and True if the episode is over.
---> 27         next_state, reward, done, _ = env.step(action)
     28 
     29         #### #### #### ####

/Users/x/anaconda3/envs/y/lib/python3.6/site-packages/gym/wrappers/time_limit.py in step(self, action)
     14     def step(self, action):
     15         assert self._elapsed_steps is not None, "Cannot call env.step() before calling reset()"
---> 16         observation, reward, done, info = self.env.step(action)
     17         self._elapsed_steps += 1
     18         if self._elapsed_steps >= self._max_episode_steps:

/Users/x/anaconda3/envs/y/lib/python3.6/site-packages/gym_maze-0.4-py3.6.egg/gym_maze/envs/maze_env.py in step(self, action)
     75             self.maze_view.move_robot(self.ACTION[action])
     76         else:
---> 77             self.maze_view.move_robot(action)
     78 
     79         if np.array_equal(self.maze_view.robot, self.maze_view.goal):

/Users/x/anaconda3/envs/y/lib/python3.6/site-packages/gym_maze-0.4-py3.6.egg/gym_maze/envs/maze_view_2d.py in move_robot(self, dir)
     93         if dir not in self.__maze.COMPASS.keys():
     94             raise ValueError("dir cannot be %s. The only valid dirs are %s."
---> 95                              % (str(dir), str(self.__maze.COMPASS.keys())))
     96 
     97         if self.__maze.is_open(self.__robot, dir):

ValueError: dir cannot be 1. The only valid dirs are dict_keys(['N', 'E', 'S', 'W']).
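Looking at the trace, the else branch at line 77 of maze_env.py is the one being taken, so the action apparently fails the environment's integer check before being handed to move_robot. A minimal sketch of the type issue (the isinstance(action, int)-style guard is my assumption; that line is not shown in the trace):

import numpy as np

action = np.argmax(np.zeros(4))
print(type(action))                  # typically <class 'numpy.int64'>
print(isinstance(action, int))       # False on Python 3
print(isinstance(int(action), int))  # True, hence the int(...) cast in the fix below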
Second update: fixed, thanks to some debugging by @Alexander L. Hayes.
# We iterate over episodes
for e in range(n_episodes):
    # Initialize the first state of the episode
    current_state = env.reset()
    done = False

    # Sum the rewards the agent gets from the environment
    total_episode_reward = 0

    for i in range(max_iter_episode):
        current_coordinate_x = int(current_state[0])
        current_coordinate_y = int(current_state[1])
        current_Q_table_coordinates = states_dic[current_coordinate_x, current_coordinate_y]

        if np.random.uniform(0, 1) < exploration_proba:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q_table[current_Q_table_coordinates]))

        next_state, reward, done, _ = env.step(action)

        next_coordinate_x = int(next_state[0])
        next_coordinate_y = int(next_state[1])

        # Update our Q-table using the Q-learning iteration
        next_Q_table_coordinates = states_dic[next_coordinate_x, next_coordinate_y]
        Q_table[current_Q_table_coordinates, action] = (1 - lr) * Q_table[current_Q_table_coordinates, action] + lr * (reward + gamma * max(Q_table[next_Q_table_coordinates, :]))

        total_episode_reward = total_episode_reward + reward
        # If the episode is finished, we leave the for loop
        if done:
            break
        current_state = next_state

    # We update the exploration probability using an exponential decay formula
    exploration_proba = max(min_exploration_proba,
                            np.exp(-exploration_decreasing_decay * e))
    rewards_per_episode.append(total_episode_reward)
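Besides casting the argmax result to a plain Python int, the fixed loop also looks up the Q-table row through states_dic before choosing the greedy action, instead of indexing Q_table with the raw [x, y] state. For completeness, a minimal greedy-rollout sketch I use to eyeball the learned policy afterwards (the render() call assumes the gym_maze viewer can open a window):

state = env.reset()
for _ in range(max_iter_episode):
    idx = states_dic[int(state[0]), int(state[1])]
    action = int(np.argmax(Q_table[idx]))      # plain Python int, as in the fix
    state, reward, done, _ = env.step(action)
    env.render()
    if done:
        break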
Answer: