I've spent the last few days working on an MCTS-based AI. I tried applying it to tic-tac-toe, the simplest game I could think of, but for some reason my AI keeps making bad decisions. I've tried changing the UCB1 exploration constant, the number of iterations per search, and even the scores assigned to wins, losses and draws (trying to make draws more attractive, since this AI only ever plays second and its goal is to draw at worst, otherwise win). The code currently looks like this:
```python
import random
import math
import copy


class tree:
    def __init__(self, board):
        self.board = board
        self.visits = 0
        self.score = 0
        self.children = []


class mcts:
    def search(self, mx, player):
        root = tree(mx)
        for i in range(1200):
            leaf = mcts.expand(self, root.board, player, root)
            result = mcts.rollout(self, leaf)
            mcts.backpropagate(self, leaf, root, result)
        return mcts.best_child(self, root).board

    def expand(self, mx, player, root):
        plays = mcts.generate_states(self, mx, player)  # all possible plays
        if root.visits == 0:
            for j in plays:
                root.children.append(j)  # create child_nodes in case they havent been created yet
        for j in root.children:
            if j.visits == 0:
                return j  # first iterations of the loop
        for j in plays:
            if mcts.final(self, j.board, player):
                return j
        return mcts.best_child(self, root)  # choose the one with most potential

    def rollout(self, leaf):
        mx = leaf.board
        aux = 1
        while mcts.final(self, mx, "O") != True:
            if aux == 1:  # "X" playing
                possible_states = []
                possible_nodes = mcts.generate_states(self, mx, "X")
                for i in possible_nodes:
                    possible_states.append(i.board)
                if len(possible_states) == 1:
                    mx = possible_states[0]
                else:
                    choice = random.randrange(0, len(possible_states) - 1)
                    mx = possible_states[choice]
                if mcts.final(self, mx, "X"):  # The play by "X" finished the game
                    break
            elif aux == 0:  # "O" playing
                possible_states = []
                possible_nodes = mcts.generate_states(self, mx, "O")
                for i in possible_nodes:
                    possible_states.append(i.board)
                if len(possible_states) == 1:
                    mx = possible_states[0]
                else:
                    choice = random.randrange(0, len(possible_states) - 1)
                    mx = possible_states[choice]
            aux += 1
            aux = aux % 2
        if mcts.final(self, mx, "X"):
            for i in range(len(mx)):
                for k in range(len(mx[i])):
                    if mx[i][k] == "-":
                        return -1  # loss
            return 0  # tie
        elif mcts.final(self, mx, "O"):
            for i in range(len(mx)):
                for k in range(len(mx[i])):
                    if mx[i][k] == "-":
                        return 1  # win

    def backpropagate(self, leaf, root, result):  # updating our prospects stats
        leaf.score += result
        leaf.visits += 1
        root.visits += 1

    def generate_states(self, mx, player):
        possible_states = []  # generate child_nodes
        for i in range(len(mx)):
            for k in range(len(mx[i])):
                if mx[i][k] == "-":
                    option = copy.deepcopy(mx)
                    option[i][k] = player
                    child_node = tree(option)
                    possible_states.append(child_node)
        return possible_states

    def final(self, mx, player):  # check if game is won
        possible_draw = True
        win = False
        for i in mx:  # lines
            if i == [player, player, player]:
                win = True
                possible_draw = False
        if mx[0][0] == player:  # diagonals
            if mx[1][1] == player:
                if mx[2][2] == player:
                    win = True
                    possible_draw = False
        if mx[0][2] == player:
            if mx[1][1] == player:
                if mx[2][0] == player:
                    win = True
                    possible_draw = False
        for i in range(3):  # columns
            if mx[0][i] == player and mx[1][i] == player and mx[2][i] == player:
                win = True
                possible_draw = False
        for i in range(3):
            for k in range(3):
                if mx[i][k] == "-":
                    possible_draw = False
        if possible_draw:
            return possible_draw
        return win

    def calculate_score(self, score, child_visits, parent_visits, c):  # UCB1
        return score / child_visits + c * math.sqrt(math.log(parent_visits) / child_visits)

    def best_child(self, root):  # returns most promising node
        treshold = -1 * 10**6
        for j in root.children:
            potential = mcts.calculate_score(self, j.score, j.visits, root.visits, 2)
            if potential > treshold:
                win_choice = j
                treshold = potential
        return win_choice

# todo: the AI takes too long for each play, optimize that by finding the optimal approach in the rollout phase
```
First of all, the purpose of this AI is to return a modified matrix representing the best decision it can make in the current situation. I'm starting to suspect that the MCTS algorithm itself is behind these bad decisions, because there may be a mistake somewhere in my implementation. As far as I understand it, the code performs the following steps (a compact sketch of the same structure follows the list):
- Check whether the root node already has children; if it does, pick the most promising child.
- Run a random simulation and save the result.
- Update the leaf's score and visit count, as well as the root's visit count.
- Repeat, in my case for 1200 iterations.
- Return the best possible move (a matrix, i.e. a child node).
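To make the question concrete, here is the structure I have in mind for those steps, written as a self-contained sketch. The names `Node`, `winner`, `moves` and `random_playout` are just illustrative stand-ins for my `tree` class, `final` and `generate_states`, and, like my code, the tree is only one level deep, so the children are just the AI's own immediate moves:

```python
import math
import random
import copy

class Node:
    def __init__(self, board):
        self.board = board   # 3x3 list of "X"/"O"/"-" strings, same as above
        self.visits = 0
        self.score = 0

def winner(board):
    # Return "X" or "O" if someone has three in a row, otherwise None
    lines = board + [list(c) for c in zip(*board)]
    lines += [[board[i][i] for i in range(3)], [board[i][2 - i] for i in range(3)]]
    for line in lines:
        if line[0] != "-" and line.count(line[0]) == 3:
            return line[0]
    return None

def moves(board, player):
    # All boards reachable by placing one mark of `player`
    out = []
    for i in range(3):
        for k in range(3):
            if board[i][k] == "-":
                nxt = copy.deepcopy(board)
                nxt[i][k] = player
                out.append(nxt)
    return out

def random_playout(board, to_move):
    # Play uniformly random moves until someone wins or the board is full
    while winner(board) is None:
        options = moves(board, to_move)
        if not options:
            return 0                              # board full, no winner: draw
        board = random.choice(options)
        to_move = "X" if to_move == "O" else "O"
    return 1 if winner(board) == "O" else -1      # +1: the AI ("O") won, -1: it lost

def search(board, iterations=1200, c=1.41):
    root = Node(board)
    children = [Node(b) for b in moves(board, "O")]   # tree is one level deep, as in my code
    for _ in range(iterations):
        # Selection: try every child once, then take the one with the best UCB1 value
        untried = [ch for ch in children if ch.visits == 0]
        if untried:
            child = untried[0]
        else:
            child = max(children, key=lambda ch: ch.score / ch.visits
                        + c * math.sqrt(math.log(root.visits) / ch.visits))
        # Simulation: random playout, with "X" to move after the AI's move
        result = random_playout(child.board, "X")
        # Backpropagation: update the chosen child and the root
        child.visits += 1
        child.score += result
        root.visits += 1
    # The move actually played: the child with the best average score
    return max(children, key=lambda ch: ch.score / ch.visits).board
```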
Why doesn't it work? Why does it pick bad moves instead of the optimal ones? Is the algorithm implemented incorrectly?
Answer:
My mistake was that during the expansion phase I was selecting the node with the most visits, when I should have been selecting the most promising node according to the UCB1 formula. I also made some mistakes in a few of the if statements, which meant that losses were never being counted correctly.
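In terms of the code above, the selection part of `expand` ends up looking roughly like this (a sketch of the idea, meant to replace the method in the `mcts` class, not the exact final code):

```python
def expand(self, mx, player, root):
    # Create the root's children once
    if not root.children:
        root.children = mcts.generate_states(self, mx, player)
    # Simulate every child at least once before trusting its statistics
    for child in root.children:
        if child.visits == 0:
            return child
    # Then select by UCB1 (calculate_score), not by raw visit count
    return max(root.children,
               key=lambda ch: mcts.calculate_score(self, ch.score, ch.visits, root.visits, 2))
```

Separating the "who won" check from the "is the board full" check in the result evaluation also makes it much harder for a loss to accidentally be scored as a tie.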