I'm a beginner in reinforcement learning, trying to implement a policy gradient method with TensorFlow to solve OpenAI Gym's CartPole task. However, my code runs extremely slowly: the first episode runs at an acceptable speed, but from the second episode onward it becomes very slow. Why is this happening, and how can I fix it?
My code is below:
import tensorflow as tf
import numpy as np
import gym

env = gym.make('CartPole-v0')

class Policy:
    def __init__(self):
        self.input_layer_fake = tf.placeholder(tf.float32, [4, 1])
        self.input_layer = tf.reshape(self.input_layer_fake, [1, 4])
        self.dense1 = tf.layers.dense(inputs=self.input_layer, units=4, activation=tf.nn.relu)
        self.logits = tf.layers.dense(inputs=self.dense1, units=2, activation=tf.nn.relu)

    def predict(self, inputObservation):
        sess = tf.InteractiveSession()
        tf.global_variables_initializer().run()
        x = tf.reshape(inputObservation, [4, 1]).eval()
        return (sess.run(self.logits, feed_dict={self.input_layer_fake: x}))

    def train(self, features_array, labels_array):
        for i in range(np.shape(features_array)[0]):
            print("train")
            print(i)
            sess1 = tf.InteractiveSession()
            tf.global_variables_initializer().run()
            self.cross_entropy = tf.reduce_mean(
                tf.nn.softmax_cross_entropy_with_logits(labels=labels_array[i], logits=self.logits))
            self.train_step = tf.train.GradientDescentOptimizer(0.5).minimize(self.cross_entropy)
            y = tf.reshape(features_array[i], [4, 1]).eval()
            sess1.run(self.train_step, feed_dict={self.input_layer_fake: y})

agent = Policy()
train_array = []
features_array = []
labels_array = []

main_sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

for i_episode in range(100):
    observation = env.reset()
    for t in range(200):
        prevObservation = observation
        env.render()
        if np.random.uniform(0, 1) < 0.2:
            action = env.action_space.sample()
        else:
            action = np.argmax(agent.predict((prevObservation)))
        observation, reward, done, info = env.step(action)
        add_in = np.random.uniform(0, 1)
        if add_in < 0.5:
            features_array.append(prevObservation)
            sarPreprocessed = agent.predict(prevObservation)
            sarPreprocessed[0][action] = reward
            labels_array.append(sarPreprocessed)
        if done:
            break
    agent.train(features_array, labels_array)
    features_array = []
    labels_array = []
Any help would be greatly appreciated.
Answer:
It has been a while since I last tried to implement a policy gradient method, but as I recall, the problem was the loop inside the train function. Because I iterated over every element of features_array, and the array itself kept growing in length (features_array was never reset to []), the program got slower and slower. Instead, I should have trained in "batches" and periodically cleared features_array.
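A rough sketch of what I mean, assuming the same 4-input / 2-output network as in the question (names such as observations_ph and labels_ph are just illustrative): build the graph and create the session once, reuse them for every prediction and training call, and clear the buffers after each batch update so nothing grows without bound.

import tensorflow as tf
import numpy as np

# Build the graph ONCE, outside all loops.
observations_ph = tf.placeholder(tf.float32, [None, 4])   # a whole batch of observations
labels_ph = tf.placeholder(tf.float32, [None, 2])          # one target row per observation
dense1 = tf.layers.dense(inputs=observations_ph, units=4, activation=tf.nn.relu)
logits = tf.layers.dense(inputs=dense1, units=2)
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels_ph, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

# Create ONE session and initialize ONCE, then reuse it everywhere.
sess = tf.Session()
sess.run(tf.global_variables_initializer())

def predict(observation):
    # No new session, no re-initialization, no new graph nodes per call.
    return sess.run(logits, feed_dict={observations_ph: observation.reshape(1, 4)})

features_array, labels_array = [], []
# ... inside the episode loop, append to features_array / labels_array as before ...
# After each episode, train on the whole batch with a single run call:
if features_array:
    sess.run(train_step, feed_dict={observations_ph: np.vstack(features_array),
                                    labels_ph: np.vstack(labels_array)})
    features_array, labels_array = [], []   # clear the buffers so they never keep growing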
I have implemented a cleaner version of the vanilla policy gradient algorithm here: https://github.com/Ashboy64/rl-reimplementations/blob/master/Reimplementations/Vanilla-Policy-Gradient/vanilla_pg.py
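For orientation, the core idea of the vanilla policy gradient (REINFORCE) objective is to weight the log-probability of each action actually taken by the return that followed it. A rough sketch in the same TF1 style (this is not code from the linked repository; the placeholder names, the action count of 2, and the Adam learning rate are illustrative assumptions, and logits refers to the policy network from the sketch above):

actions_ph = tf.placeholder(tf.int32, [None])       # actions actually taken in the batch
returns_ph = tf.placeholder(tf.float32, [None])     # return observed after each action
log_probs = tf.nn.log_softmax(logits)                # log-probabilities over the 2 actions
taken_log_probs = tf.reduce_sum(tf.one_hot(actions_ph, 2) * log_probs, axis=1)
pg_loss = -tf.reduce_mean(taken_log_probs * returns_ph)   # minimize negative expected return
pg_train_step = tf.train.AdamOptimizer(1e-2).minimize(pg_loss)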
An implementation of an improved, better-performing algorithm (still based on policy gradients) called PPO (Proximal Policy Optimization) can be found here: https://github.com/Ashboy64/rl-reimplementations/tree/master/Reimplementations/PPO