Trying out PFRL - atari - ML Over the Horizon

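Introduction

Up to the [previous post], I learned the basics of using PFRL and ran an
experiment on the openai-gym pendulum problem.
This time I run an experiment in an atari environment.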

Experiment

I used the following PFRL example as a reference.
https://github.com/pfnet/pfrl/blob/master/examples/atari/reproduction/rainbow/train_rainbow.py

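The experiment was run on Google Colaboratory.

atari environment

The ALE-based atari environments are widely used as benchmarks for reinforcement learning agents.
openai-gym also provides a wrapper for them, so they are easy to use.
(On Google Colaboratory they come preinstalled.)

For this post I chose the SpaceInvaders environment from the atari games.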

Environment setup

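import pfrl
from pfrl import utils
from pfrl.wrappers import atari_wrappers

# These settings are not shown in the original post; the values are assumptions.
SEED = 0
train_seed = SEED
test_seed = 2 ** 31 - 1 - SEED
MAX_FRAMES = 30 * 60 * 60  # cap on the number of frames per episode
EVAL_EPSILON = 0.0  # probability of a random action during evaluation
MONITOR = False  # set to True to record videos
out_dir = "results"

def make_env(env_name, test):
    # Use different random seeds for train and test envs
    env_seed = test_seed if test else train_seed
    env = atari_wrappers.wrap_deepmind(
        atari_wrappers.make_atari(env_name, max_frames=MAX_FRAMES),
        episode_life=not test,
        clip_rewards=not test,
    )
    env.seed(int(env_seed))
    if test:
        # Randomize actions like epsilon-greedy in evaluation as well
        env = pfrl.wrappers.RandomizeAction(env, EVAL_EPSILON)
    if MONITOR:
        env = pfrl.wrappers.Monitor(
            env, out_dir, mode="evaluation" if test else "training"
        )
    return env

# Set the initial seed
utils.set_random_seed(SEED)

# Set up the environments
env_name = "SpaceInvadersNoFrameskip-v4"
env = make_env(env_name, test=False)
eval_env = make_env(env_name, test=True)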

atari_wrappers.wrap_deepmind is a wrapper that applies the settings DeepMind used
in its atari experiments (reward clipping and so on).
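
As a rough illustration of the reward clipping mentioned above, the sketch below maps each raw reward to its sign, which is the usual DeepMind-style clipping. The function name `clip_reward` and the sample rewards are hypothetical; this is not the wrapper's own code.

```python
def clip_reward(reward: float) -> float:
    """Map a raw score change to -1.0, 0.0, or +1.0 (its sign)."""
    if reward > 0:
        return 1.0
    if reward < 0:
        return -1.0
    return 0.0


# Made-up raw score changes, for illustration only.
raw_rewards = [0, 5, 30, 200, -1]
clipped_rewards = [clip_reward(r) for r in raw_rewards]
print(clipped_rewards)
```

Because make_env passes clip_rewards=not test, clipping applies only to the training environment, so evaluation scores stay on the raw game scale.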

Agent (Rainbow)

This time the agent is trained with Rainbow¹.
Rainbow is an algorithm that applies a number of improvements to DQN².

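DQN is a Q-Learning algorithm built mainly from the following elements.

The action-value function is approximated by a neural net in the form Q_a(s)
Training is stabilized by a modified target for the TD-error computation (a separate target network)
Past transitions are reused through Experience Replay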
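
To make the TD target concrete, here is a minimal sketch of the one-step DQN target, assuming the next-state Q-values come from a periodically synchronized target network. The function name `dqn_target` and all numbers are hypothetical.

```python
from typing import Sequence


def dqn_target(reward: float, next_q_target: Sequence[float],
               gamma: float, done: bool) -> float:
    """One-step TD target: r + gamma * max_a Q_target(s', a)."""
    if done:
        # Do not bootstrap from a terminal state.
        return reward
    return reward + gamma * max(next_q_target)


# Hypothetical target-network Q-values at the next state s'.
next_q = [1.0, 2.5, 0.5]
y = dqn_target(reward=1.0, next_q_target=next_q, gamma=0.9, done=False)

# 2.0 stands in for the online network's current estimate Q(s, a).
td_error = y - 2.0
```

The online network is trained to shrink `td_error`, while the target network is held fixed between synchronizations so the target does not move at every update.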

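Rainbow adds the following improvements to DQN.

Double DQN to stabilize the target
An improved Experience Replay (Prioritized Replay)
A modified network architecture (Dueling Network)
Faster learning through Multi-step Learning
Q-function output as a probability distribution (Distributional RL)
Improved exploration (Noisy Net)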
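
As one example from the list above, the Dueling Network idea can be sketched in its scalar (non-distributional) form: the network predicts a state value V(s) and per-action advantages A(s, a), which are then combined into Q-values. The function name `dueling_q_values` and the inputs are hypothetical.

```python
from typing import List, Sequence


def dueling_q_values(state_value: float,
                     advantages: Sequence[float]) -> List[float]:
    """Combine V(s) and A(s, a) as Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


q_values = dueling_q_values(state_value=2.0, advantages=[1.0, 0.0, -1.0])
print(q_values)
```

Subtracting the mean advantage keeps the decomposition identifiable: the average of the resulting Q-values equals V(s).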

Q-function

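from pfrl import nn as pnn
from pfrl.q_functions import DistributionalDuelingDQN

obs = env.observation_space
n_actions = env.action_space.n
n_atoms = 51  # number of atoms in the return distribution
v_max = 10
v_min = -10
noisy_net_sigma = 0.5

q_func = DistributionalDuelingDQN(n_actions, n_atoms, v_min, v_max)

# Replace the Linear layers with factorized noisy layers (Noisy Net)
pnn.to_factorized_noisy(q_func, sigma_scale=noisy_net_sigma)

print(obs, n_actions)
q_func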

Because Distributional DQN is used, the network outputs, for each action,
a histogram over the range -10 to 10 split into 51 points (atoms).
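
The following sketch shows how such a histogram can be turned back into a scalar Q-value, assuming the 51 atoms are evenly spaced between v_min and v_max. The helper `expected_q` and the example distribution are hypothetical.

```python
n_atoms = 51
v_min, v_max = -10.0, 10.0

# Evenly spaced support points ("atoms") z_0 ... z_50 over [v_min, v_max].
delta_z = (v_max - v_min) / (n_atoms - 1)
atoms = [v_min + i * delta_z for i in range(n_atoms)]


def expected_q(probs):
    """Scalar Q-value as the expectation of the categorical return distribution."""
    return sum(p * z for p, z in zip(probs, atoms))


# Hypothetical distribution for one action: all probability mass on one atom.
probs = [0.0] * n_atoms
probs[30] = 1.0
q = expected_q(probs)
```

Greedy action selection then compares these expected values across actions.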

In addition, to_factorized_noisy converts the Linear modules in the network
into FactorizedNoisyLinear modules.
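
For intuition, a noisy linear layer with factorized Gaussian noise can be sketched in plain Python as below. This follows the general formulation from the Noisy Net literature rather than PFRL's own implementation, and every name in it (`scaled_noise`, `noisy_linear`, the weights) is hypothetical.

```python
import math
import random
from typing import List, Sequence


def scaled_noise(rng: random.Random, size: int) -> List[float]:
    """Draw standard Gaussian noise and apply f(x) = sign(x) * sqrt(|x|)."""
    samples = (rng.gauss(0.0, 1.0) for _ in range(size))
    return [math.copysign(math.sqrt(abs(x)), x) for x in samples]


def noisy_linear(x: Sequence[float],
                 mu_w: Sequence[Sequence[float]],
                 sigma_w: Sequence[Sequence[float]],
                 mu_b: Sequence[float],
                 sigma_b: Sequence[float],
                 rng: random.Random) -> List[float]:
    """y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b)."""
    eps_in = scaled_noise(rng, len(x))
    eps_out = scaled_noise(rng, len(mu_b))
    y = []
    for i in range(len(mu_b)):
        # Factorized noise: eps_w[i][j] = eps_out[i] * eps_in[j], eps_b[i] = eps_out[i].
        weighted = sum(
            (mu_w[i][j] + sigma_w[i][j] * eps_out[i] * eps_in[j]) * x[j]
            for j in range(len(x))
        )
        y.append(weighted + mu_b[i] + sigma_b[i] * eps_out[i])
    return y


rng = random.Random(0)
x = [1.0, 2.0]
mu_w = [[0.5, -0.5], [1.0, 1.0]]
mu_b = [0.0, 0.1]
zeros_w = [[0.0, 0.0], [0.0, 0.0]]
zeros_b = [0.0, 0.0]

# With sigma = 0 the layer reduces to an ordinary linear layer.
plain = noisy_linear(x, mu_w, zeros_w, mu_b, zeros_b, rng)
# With sigma > 0 the output is perturbed by the sampled noise.
noisy = noisy_linear(x, mu_w, [[0.5, 0.5], [0.5, 0.5]], mu_b, [0.5, 0.5], rng)
```

Since the sigma parameters are learned, the network can adjust how much randomness it injects, which is what lets the noise stand in for an explicit exploration schedule.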

Exploration algorithm, optimization, Replay

import torch
from pfrl import explorers, replay_buffers

# Exploration is handled by Noisy Net, so act greedily
explorer = explorers.Greedy()

# Optimizer
lr0 = 6.25e-5
eps = 1.5e-4
opt = torch.optim.Adam(q_func.parameters(), lr=lr0, eps=eps)

# Experience Replay
steps = 2 * 10 ** 6
update_interval = 4
num_step_return = 3
betasteps = steps / update_interval

rbuf = replay_buffers.PrioritizedReplayBuffer(
    10 ** 5,
    alpha=0.5,
    beta0=0.4,
    betasteps=betasteps,
    num_steps=num_step_return,
)

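Epsilon-greedy is the usual exploration algorithm for DQN, but in Rainbow
the Noisy Net takes care of exploration, so plain Greedy is used.

The setting for multi-step learning is made on the ReplayBuffer.
(num_steps=3 here, so the target is computed from the state 3 steps ahead.)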
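
The multi-step target can be sketched as follows: the discounted sum of the next n rewards plus a discounted bootstrap value from the state n steps ahead. The function name `n_step_target` and the numbers are hypothetical.

```python
from typing import Sequence


def n_step_target(rewards: Sequence[float], bootstrap_value: float,
                  gamma: float) -> float:
    """Return sum_k gamma^k * r_{t+k} + gamma^n * bootstrap_value, n = len(rewards)."""
    n = len(rewards)
    discounted = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return discounted + (gamma ** n) * bootstrap_value


# n = 3: three observed rewards, then bootstrap from the state 3 steps ahead.
target = n_step_target(rewards=[1.0, 0.0, 1.0], bootstrap_value=2.0, gamma=0.9)
```

With a single reward this reduces to the ordinary one-step target, so n = 1 recovers standard DQN.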

Building the agent

import numpy as np
import torch
from pfrl import agents

gamma = 0.9
replay_start_size = 5 * 10 ** 4
target_update_interval = 3 * 10 ** 4
clip_delta = True
gpu = 0 if torch.cuda.is_available() else -1
print("GPU : ", gpu)

def phi(x):
    # Convert uint8 frames to float32 scaled to [0, 1]
    return np.asarray(x, dtype=np.float32) / 255

agent = agents.CategoricalDoubleDQN(
    q_func,
    opt,
    rbuf,
    gpu=gpu,
    gamma=gamma,
    explorer=explorer,
    replay_start_size=replay_start_size,
    target_update_interval=target_update_interval,
    clip_delta=clip_delta,
    update_interval=update_interval,
    batch_accumulator="sum",
    phi=phi,
)

Training

import time

from pfrl import experiments


class Hook:
    """Step hook that prints the elapsed time every `period` steps."""

    def __init__(self, period=2000):
        self.t0 = time.time()
        self.period = period

    def __call__(self, env, agent, t):
        if t % self.period == 0:
            t1 = time.time()
            dt = t1 - self.t0
            print("{} : elps {:.3f}".format(t, dt))
            self.t0 = t1


checkpoint_frequency = 2 * 10 ** 5
eval_n_runs = 10
eval_interval = 5 * 10 ** 4

experiments.train_agent_with_evaluation(
    agent=agent,
    env=env,
    steps=steps,
    eval_n_steps=None,
    checkpoint_freq=checkpoint_frequency,
    eval_n_episodes=eval_n_runs,
    eval_interval=eval_interval,
    outdir=out_dir,
    save_best_so_far_agent=False,
    eval_env=eval_env,
    step_hooks=(Hook(),),
)

I wrote the Hook class to monitor the elapsed time during training.

Results

import os

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv(os.path.join(out_dir, "scores.txt"), sep="\t")
cols = ["mean", "median", "average_loss"]
fig, axes = plt.subplots(figsize=(20, 6), ncols=3)
for ax, col in zip(axes, cols):
    df.set_index("steps").plot(y=col, ax=ax)

Plotting the mean return, the median return, and the loss gives the figure below.

[Figure: mean, median, and average_loss plotted against steps]

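Although it oscillates, the return appears to be trending gradually upward.
This run trained for only 2 x 10⁶ frames, but according to the Rainbow paper
it takes on the order of 10⁸ frames for SpaceInvaders training to converge,
so a full experiment on Colab looks difficult.
(Training took about 6 hours this time.)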
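
A back-of-the-envelope estimate shows why, assuming wall-clock time scales linearly with the number of frames (an assumption, not a measurement):

```python
frames_trained = 2 * 10 ** 6   # frames used in this run
hours_spent = 6                # approximate wall-clock time of this run
frames_to_converge = 10 ** 8   # rough figure attributed to the Rainbow paper

# Assumption: training time grows linearly with the number of frames.
scale = frames_to_converge / frames_trained
estimated_hours = hours_spent * scale
estimated_days = estimated_hours / 24
print(f"{estimated_hours:.0f} hours (~{estimated_days:.1f} days)")
```

Under that assumption, a full-length run would take roughly 300 hours, far beyond a single Colab session.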
