
Removed pytorch_rl dependency on OpenAI baselines to make install easier

Maxime Chevalier-Boisvert, 7 years ago
parent
commit
2fdde6eb6b

+ 3 - 12
README.md

@@ -36,15 +36,6 @@ cd pytorch_rl
 # PyTorch
 conda install pytorch torchvision -c soumith
 
-# Dependencies needed by OpenAI baselines
-sudo apt install libopenmpi-dev zlib1g-dev cmake
-
-# OpenAI baselines
-git clone https://github.com/openai/baselines.git
-cd baselines
-pip3 install -e .
-cd ..
-
 # Other requirements
 pip3 install -r requirements.txt
 ```
@@ -67,16 +58,16 @@ The environment being run can be selected with the `--env-name` option, eg:
 ```
 
 Basic reinforcement learning code is provided in the `pytorch_rl` subdirectory.
-You can perform training using the ACKTR algorithm with:
+You can perform training using the A2C algorithm with:
 
 ```
-python3 pytorch_rl/main.py --env-name MiniGrid-Empty-6x6-v0 --no-vis --num-processes 32 --algo acktr
+python3 pytorch_rl/main.py --env-name MiniGrid-Empty-6x6-v0 --no-vis --num-processes 48 --algo a2c
 ```
 
 You can view the result of training using the `enjoy.py` script:
 
 ```
-python3 pytorch_rl/enjoy.py --env-name MiniGrid-Empty-6x6-v0 --load-dir ./trained_models/acktr
+python3 pytorch_rl/enjoy.py --env-name MiniGrid-Empty-6x6-v0 --load-dir ./trained_models/a2c
 ```
 
 ## Design

+ 1 - 0
gym_minigrid/envs/gotodoor.py

@@ -100,6 +100,7 @@ class GoToDoorEnv(MiniGridEnv):
         if action == self.actions.wait:
             if (ax == tx and abs(ay - ty) == 1) or (ay == ty and abs(ax - tx) == 1):
                 reward = 1
+            done = True
 
         obs = self._observation(obs)
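
The added `done = True` makes every `wait` action terminate the episode, whether or not the reward condition is met. A minimal, hypothetical sketch of the resulting logic, using the variable names from the hunk above (`ax`/`ay` is the agent position, `tx`/`ty` the target door):

```python
def wait_action_outcome(ax, ay, tx, ty):
    """Illustrative only: reward/termination when the agent takes `wait`."""
    done = True   # the episode now always ends on `wait`
    reward = 0
    # Reward only if the agent is directly adjacent to the target door
    if (ax == tx and abs(ay - ty) == 1) or (ay == ty and abs(ax - tx) == 1):
        reward = 1
    return reward, done
```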
 

+ 0 - 153
pytorch_rl/README.md

@@ -1,153 +0,0 @@
-# pytorch-a2c-ppo-acktr
-
-## Update 10/06/2017: added enjoy.py and a link to pretrained models!
-## Update 09/27/2017: now supports both Atari and MuJoCo/Roboschool!
-
-This is a PyTorch implementation of
-* Advantage Actor Critic (A2C), a synchronous deterministic version of [A3C](https://arxiv.org/pdf/1602.01783v1.pdf)
-* Proximal Policy Optimization [PPO](https://arxiv.org/pdf/1707.06347.pdf)
-* Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation [ACKTR](https://arxiv.org/abs/1708.05144)
-
-Also see the OpenAI posts: [A2C/ACKTR](https://blog.openai.com/baselines-acktr-a2c/) and [PPO](https://blog.openai.com/openai-baselines-ppo/) for more information.
-
-This implementation is inspired by the OpenAI baselines for [A2C](https://github.com/openai/baselines/tree/master/baselines/a2c), [ACKTR](https://github.com/openai/baselines/tree/master/baselines/acktr) and [PPO](https://github.com/openai/baselines/tree/master/baselines/ppo1). It uses the same hyper parameters and the model since they were well tuned for Atari games.
-
-## Supported (and tested) environments (via [OpenAI Gym](https://gym.openai.com))
-* [Atari Learning Environment](https://github.com/mgbellemare/Arcade-Learning-Environment)
-* [MuJoCo](http://mujoco.org)
-* [PyBullet](http://pybullet.org) (including Racecar, Minitaur and Kuka)
-
-I highly recommend PyBullet as a free open source alternative to MuJoCo for continuous control tasks.
-
-All environments are operated using exactly the same Gym interface. See their documentations for a comprehensive list.
-
-## Requirements
-
-* Python 3 (it might work with Python 2, but I didn't test it)
-* [PyTorch](http://pytorch.org/)
-* [Visdom](https://github.com/facebookresearch/visdom)
-* [OpenAI baselines](https://github.com/openai/baselines)
-
-In order to install requirements, follow:
-
-```bash
-# PyTorch
-conda install pytorch torchvision -c soumith
-
-# Baselines for Atari preprocessing
-git clone https://github.com/openai/baselines.git
-cd baselines
-pip install -e .
-
-# Other requirements
-pip install -r requirements.txt
-```
-
-## Contributions
-
-Contributions are very welcome. If you know how to make this code better, don't hesitate to send a pull request. Also see a todo list below.
-
-Also I'm searching for volunteers to run all experiments on Atari and MuJoCo (with multiple random seeds).
-
-## Disclaimer
-
-It's extremely difficult to reproduce results for Reinforcement Learning methods. See ["Deep Reinforcement Learning that Matters"](https://arxiv.org/abs/1709.06560) for more information. I tried to reproduce OpenAI results as closely as possible. However, majors differences in performance can be caused even by minor differences in TensorFlow and PyTorch libraries.
-
-### TODO
-* Improve this README file. Rearrange images.
-* Improve performance of KFAC, see kfac.py for more information
-* Run evaluation for all games and algorithms
-
-## Training
-
-Start a `Visdom` server with `python -m visdom.server`, it will serve `http://localhost:8097/` by default.
-
-### Atari
-#### A2C
-
-```bash
-python main.py --env-name "PongNoFrameskip-v4"
-```
-
-#### PPO
-
-```bash
-python main.py --env-name "PongNoFrameskip-v4" --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --num-processes 8 --num-steps 128 --num-mini-batch 4 --vis-interval 1 --log-interval 1
-```
-
-#### ACKTR
-
-```bash
-python main.py --env-name "PongNoFrameskip-v4" --algo acktr --num-processes 32 --num-steps 20
-```
-
-### MuJoCo
-#### A2C
-
-```bash
-python main.py --env-name "Reacher-v1" --num-stack 1 --num-frames 1000000
-```
-
-#### PPO
-
-```bash
-python main.py --env-name "Reacher-v1" --algo ppo --use-gae --vis-interval 1  --log-interval 1 --num-stack 1 --num-steps 2048 --num-processes 1 --lr 3e-4 --entropy-coef 0 --ppo-epoch 10 --num-mini-batch 32 --gamma 0.99 --tau 0.95 --num-frames 1000000
-```
-
-#### ACKTR
-
-ACKTR requires some modifications to be made specifically for MuJoCo. But at the moment, I want to keep this code as unified as possible. Thus, I'm going for better ways to integrate it into the codebase.
-
-## Enjoy
-
-Load a pretrained model from [my Google Drive](https://drive.google.com/open?id=0Bw49qC_cgohKS3k2OWpyMWdzYkk).
-
-Also pretrained models for other games are available on request. Send me an email or create an issue, and I will upload it.
-
-Disclaimer: I might have used different hyper-parameters to train these models.
-
-### Atari
-
-```bash
-python enjoy.py --load-dir trained_models/a2c --env-name "PongNoFrameskip-v4" --num-stack 4
-```
-
-### MuJoCo
-
-```bash
-python enjoy.py --load-dir trained_models/ppo --env-name "Reacher-v1" --num-stack 1
-```
-
-## Results
-
-### A2C
-
-![BreakoutNoFrameskip-v4](imgs/a2c_breakout.png)
-
-![SeaquestNoFrameskip-v4](imgs/a2c_seaquest.png)
-
-![QbertNoFrameskip-v4](imgs/a2c_qbert.png)
-
-![beamriderNoFrameskip-v4](imgs/a2c_beamrider.png)
-
-### PPO
-
-
-![BreakoutNoFrameskip-v4](imgs/ppo_halfcheetah.png)
-
-![SeaquestNoFrameskip-v4](imgs/ppo_hopper.png)
-
-![QbertNoFrameskip-v4](imgs/ppo_reacher.png)
-
-![beamriderNoFrameskip-v4](imgs/ppo_walker.png)
-
-
-### ACKTR
-
-![BreakoutNoFrameskip-v4](imgs/acktr_breakout.png)
-
-![SeaquestNoFrameskip-v4](imgs/acktr_seaquest.png)
-
-![QbertNoFrameskip-v4](imgs/acktr_qbert.png)
-
-![beamriderNoFrameskip-v4](imgs/acktr_beamrider.png)

+ 11 - 44
pytorch_rl/enjoy.py

@@ -7,12 +7,10 @@ import time
 import numpy as np
 import torch
 from torch.autograd import Variable
-from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
-from baselines.common.vec_env.vec_normalize import VecNormalize
+from vec_env.dummy_vec_env import DummyVecEnv
 
 from envs import make_env
 
-
 parser = argparse.ArgumentParser(description='RL')
 parser.add_argument('--seed', type=int, default=1,
                     help='random seed (default: 1)')
@@ -26,29 +24,12 @@ parser.add_argument('--load-dir', default='./trained_models/',
                     help='directory to save agent logs (default: ./trained_models/)')
 args = parser.parse_args()
 
-
 env = make_env(args.env_name, args.seed, 0, None)
 env = DummyVecEnv([env])
 
-actor_critic, ob_rms = \
-            torch.load(os.path.join(args.load_dir, args.env_name + ".pt"))
-
+actor_critic, ob_rms = torch.load(os.path.join(args.load_dir, args.env_name + ".pt"))
 
-if len(env.observation_space.shape) == 1:
-    env = VecNormalize(env, ret=False)
-    env.ob_rms = ob_rms
-
-    # An ugly hack to remove updates
-    def _obfilt(self, obs):
-        if self.ob_rms:
-            obs = np.clip((obs - self.ob_rms.mean) / np.sqrt(self.ob_rms.var + self.epsilon), -self.clipob, self.clipob)
-            return obs
-        else:
-            return obs
-    env._obfilt = types.MethodType(_obfilt, env)
-    render_func = env.venv.envs[0].render
-else:
-    render_func = env.envs[0].render
+render_func = env.envs[0].render
 
 obs_shape = env.observation_space.shape
 obs_shape = (obs_shape[0] * args.num_stack, *obs_shape[1:])
@@ -56,7 +37,6 @@ current_obs = torch.zeros(1, *obs_shape)
 states = torch.zeros(1, actor_critic.state_size)
 masks = torch.zeros(1, 1)
 
-
 def update_current_obs(obs):
     shape_dim0 = env.observation_space.shape[0]
     obs = torch.from_numpy(obs).float()
@@ -64,27 +44,21 @@ def update_current_obs(obs):
         current_obs[:, :-shape_dim0] = current_obs[:, shape_dim0:]
     current_obs[:, -shape_dim0:] = obs
 
-
 render_func('human')
 obs = env.reset()
 update_current_obs(obs)
 
-if args.env_name.find('Bullet') > -1:
-    import pybullet as p
-
-    torsoId = -1
-    for i in range(p.getNumBodies()):
-        if (p.getBodyInfo(i)[0].decode() == "torso"):
-            torsoId = i
-
 while True:
-    value, action, _, states = actor_critic.act(Variable(current_obs, volatile=True),
-                                                Variable(states, volatile=True),
-                                                Variable(masks, volatile=True),
-                                                deterministic=True)
+    value, action, _, states = actor_critic.act(
+        Variable(current_obs, volatile=True),
+        Variable(states, volatile=True),
+        Variable(masks, volatile=True),
+        deterministic=True
+    )
     states = states.data
     cpu_actions = action.data.squeeze(1).cpu().numpy()
-    # Obser reward and next obs
+
+    # Observe the reward and the next observation
     obs, reward, done, _ = env.step(cpu_actions)
 
     time.sleep(0.05)
@@ -97,13 +71,6 @@ while True:
         current_obs *= masks
     update_current_obs(obs)
 
-    if args.env_name.find('Bullet') > -1:
-        if torsoId > -1:
-            distance = 5
-            yaw = 0
-            humanPos, humanOrn = p.getBasePositionAndOrientation(torsoId)
-            p.resetDebugVisualizerCamera(distance, yaw, -20, humanPos)
-
     renderer = render_func('human')
 
     if not renderer.window:
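
With the `VecNormalize` path and the PyBullet camera handling gone, the setup at the top of `enjoy.py` reduces to the lines sketched below. This is only a paraphrase of the diff above, assuming (as the call site suggests) that `make_env` returns a zero-argument thunk that builds the environment:

```python
import os
import torch

from vec_env.dummy_vec_env import DummyVecEnv
from envs import make_env

# One environment, wrapped in the vendored DummyVecEnv
env = DummyVecEnv([make_env('MiniGrid-Empty-6x6-v0', 1, 0, None)])

# Checkpoints are still saved as an (actor_critic, ob_rms) tuple; ob_rms is now
# unused because observations are no longer normalized at load time.
actor_critic, ob_rms = torch.load(os.path.join('./trained_models/a2c',
                                               'MiniGrid-Empty-6x6-v0' + '.pt'))

# Without the VecNormalize wrapper in between, the raw env is reached directly
render_func = env.envs[0].render
```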

BIN  pytorch_rl/imgs/a2c_beamrider.png
BIN  pytorch_rl/imgs/a2c_breakout.png
BIN  pytorch_rl/imgs/a2c_qbert.png
BIN  pytorch_rl/imgs/a2c_seaquest.png
BIN  pytorch_rl/imgs/acktr_beamrider.png
BIN  pytorch_rl/imgs/acktr_breakout.png
BIN  pytorch_rl/imgs/acktr_qbert.png
BIN  pytorch_rl/imgs/acktr_seaquest.png
BIN  pytorch_rl/imgs/ppo_halfcheetah.png
BIN  pytorch_rl/imgs/ppo_hopper.png
BIN  pytorch_rl/imgs/ppo_reacher.png
BIN  pytorch_rl/imgs/ppo_walker.png

+ 2 - 3
pytorch_rl/main.py

@@ -14,9 +14,8 @@ import torch.optim as optim
 from torch.autograd import Variable
 
 from arguments import get_args
-from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
-from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
-from baselines.common.vec_env.vec_normalize import VecNormalize
+from vec_env.dummy_vec_env import DummyVecEnv
+from vec_env.subproc_vec_env import SubprocVecEnv
 from envs import make_env
 from kfac import KFACOptimizer
 from model import RecMLPPolicy, MLPPolicy, CNNPolicy
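
The hunk only shows the import swap; how `main.py` builds its vectorized environments is not visible here. The sketch below is an assumption about that wiring (argument order of `make_env` copied from `enjoy.py`, `num_processes` from the README example), included to show that the call sites need no change once the imports point at the vendored `vec_env`:

```python
from vec_env.dummy_vec_env import DummyVecEnv
from vec_env.subproc_vec_env import SubprocVecEnv
from envs import make_env

num_processes = 48
env_fns = [make_env('MiniGrid-Empty-6x6-v0', 1, rank, None)
           for rank in range(num_processes)]

# Presumed pattern: worker subprocesses for parallel rollouts, DummyVecEnv otherwise
envs = SubprocVecEnv(env_fns) if num_processes > 1 else DummyVecEnv(env_fns)
```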

+ 0 - 2
pytorch_rl/requirements.txt

@@ -1,4 +1,2 @@
 gym
 matplotlib
-pybullet
-opencv-python

+ 21 - 0
pytorch_rl/vec_env/LICENSE

@@ -0,0 +1,21 @@
+The MIT License
+
+Copyright (c) 2017 OpenAI (http://openai.com)
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.

+ 100 - 0
pytorch_rl/vec_env/__init__.py

@@ -0,0 +1,100 @@
+from abc import ABC, abstractmethod
+import logging
+
+class VecEnv(ABC):
+    """
+    An abstract asynchronous, vectorized environment.
+    """
+
+    def __init__(self, num_envs, observation_space, action_space):
+        self.num_envs = num_envs
+        self.observation_space = observation_space
+        self.action_space = action_space
+
+    @abstractmethod
+    def reset(self):
+        """
+        Reset all the environments and return an array of
+        observations.
+
+        If step_async is still doing work, that work will
+        be cancelled and step_wait() should not be called
+        until step_async() is invoked again.
+        """
+        pass
+
+    @abstractmethod
+    def step_async(self, actions):
+        """
+        Tell all the environments to start taking a step
+        with the given actions.
+        Call step_wait() to get the results of the step.
+
+        You should not call this if a step_async run is
+        already pending.
+        """
+        pass
+
+    @abstractmethod
+    def step_wait(self):
+        """
+        Wait for the step taken with step_async().
+
+        Returns (obs, rews, dones, infos):
+         - obs: an array of observations
+         - rews: an array of rewards
+         - dones: an array of "episode done" booleans
+         - infos: an array of info objects
+        """
+        pass
+
+    @abstractmethod
+    def close(self):
+        """
+        Clean up the environments' resources.
+        """
+        pass
+
+    def step(self, actions):
+        self.step_async(actions)
+        return self.step_wait()
+
+    def render(self):
+        # baselines' logger is no longer vendored; use the standard library so
+        # calling render() on an unsupported env warns instead of raising a NameError
+        logging.warning('Render not defined for %s' % self)
+
+class VecEnvWrapper(VecEnv):
+    def __init__(self, venv, observation_space=None, action_space=None):
+        self.venv = venv
+        VecEnv.__init__(self,
+            num_envs=venv.num_envs,
+            observation_space=observation_space or venv.observation_space,
+            action_space=action_space or venv.action_space)
+
+    def step_async(self, actions):
+        self.venv.step_async(actions)
+
+    @abstractmethod
+    def reset(self):
+        pass
+
+    @abstractmethod
+    def step_wait(self):
+        pass
+
+    def close(self):
+        return self.venv.close()
+
+    def render(self):
+        self.venv.render()
+
+class CloudpickleWrapper(object):
+    """
+    Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle)
+    """
+    def __init__(self, x):
+        self.x = x
+    def __getstate__(self):
+        import cloudpickle
+        return cloudpickle.dumps(self.x)
+    def __setstate__(self, ob):
+        import pickle
+        self.x = pickle.loads(ob)
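
`VecEnvWrapper` is the base that downstream wrappers such as `vec_frame_stack.py` build on: `step_async` is forwarded to the wrapped env and only `reset` and `step_wait` need to be overridden. A minimal, hypothetical wrapper that rescales rewards shows the pattern (assuming `vec_env/` is importable, as it is when running from the `pytorch_rl` directory):

```python
import numpy as np
from vec_env import VecEnvWrapper

class VecRewardScale(VecEnvWrapper):
    """Hypothetical example: multiply every reward by a constant factor."""

    def __init__(self, venv, scale=0.1):
        VecEnvWrapper.__init__(self, venv)  # inherits venv's observation/action spaces
        self.scale = scale

    def reset(self):
        return self.venv.reset()

    def step_wait(self):
        obs, rews, dones, infos = self.venv.step_wait()
        return obs, np.asarray(rews) * self.scale, dones, infos
```

Such a wrapper slots in the same way `VecFrameStack` does, e.g. `VecRewardScale(SubprocVecEnv(env_fns))`.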

+ 31 - 0
pytorch_rl/vec_env/dummy_vec_env.py

@@ -0,0 +1,31 @@
+import numpy as np
+from . import VecEnv
+
+class DummyVecEnv(VecEnv):
+    def __init__(self, env_fns):
+        self.envs = [fn() for fn in env_fns]
+        env = self.envs[0]        
+        VecEnv.__init__(self, len(env_fns), env.observation_space, env.action_space)
+        self.ts = np.zeros(len(self.envs), dtype='int')        
+        self.actions = None
+
+    def step_async(self, actions):
+        self.actions = actions
+
+    def step_wait(self):
+        results = [env.step(a) for (a,env) in zip(self.actions, self.envs)]
+        obs, rews, dones, infos = map(np.array, zip(*results))
+        self.ts += 1
+        for (i, done) in enumerate(dones):
+            if done: 
+                obs[i] = self.envs[i].reset()
+                self.ts[i] = 0
+        self.actions = None
+        return np.array(obs), np.array(rews), np.array(dones), infos
+
+    def reset(self):        
+        results = [env.reset() for env in self.envs]
+        return np.array(results)
+
+    def close(self):
+        return
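
A quick, hypothetical usage sketch with plain Gym environments, showing the two properties callers rely on: `step` takes one action per sub-environment, and any sub-environment that finishes an episode is reset automatically inside `step_wait`:

```python
import gym
import numpy as np
from vec_env.dummy_vec_env import DummyVecEnv

# Pass thunks, not instances: DummyVecEnv calls each fn() itself
envs = DummyVecEnv([lambda: gym.make('CartPole-v0') for _ in range(4)])

obs = envs.reset()  # shape (4, 4) for CartPole
actions = np.array([envs.action_space.sample() for _ in range(4)])
obs, rews, dones, infos = envs.step(actions)
# Wherever dones[i] is True, obs[i] is already the first observation of a fresh episode
```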

+ 82 - 0
pytorch_rl/vec_env/subproc_vec_env.py

@@ -0,0 +1,82 @@
+import numpy as np
+from multiprocessing import Process, Pipe
+from . import VecEnv, CloudpickleWrapper
+
+def worker(remote, parent_remote, env_fn_wrapper):
+    parent_remote.close()
+    env = env_fn_wrapper.x()
+    while True:
+        cmd, data = remote.recv()
+        if cmd == 'step':
+            ob, reward, done, info = env.step(data)
+            if done:
+                ob = env.reset()
+            remote.send((ob, reward, done, info))
+        elif cmd == 'reset':
+            ob = env.reset()
+            remote.send(ob)
+        elif cmd == 'reset_task':
+            ob = env.reset_task()
+            remote.send(ob)
+        elif cmd == 'close':
+            remote.close()
+            break
+        elif cmd == 'get_spaces':
+            remote.send((env.observation_space, env.action_space))
+        else:
+            raise NotImplementedError
+
+
+class SubprocVecEnv(VecEnv):
+    def __init__(self, env_fns, spaces=None):
+        """
+        envs: list of gym environments to run in subprocesses
+        """
+        self.waiting = False
+        self.closed = False
+        nenvs = len(env_fns)
+        self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
+        self.ps = [Process(target=worker, args=(work_remote, remote, CloudpickleWrapper(env_fn)))
+            for (work_remote, remote, env_fn) in zip(self.work_remotes, self.remotes, env_fns)]
+        for p in self.ps:
+            p.daemon = True # if the main process crashes, we should not cause things to hang
+            p.start()
+        for remote in self.work_remotes:
+            remote.close()
+
+        self.remotes[0].send(('get_spaces', None))
+        observation_space, action_space = self.remotes[0].recv()
+        VecEnv.__init__(self, len(env_fns), observation_space, action_space)
+
+    def step_async(self, actions):
+        for remote, action in zip(self.remotes, actions):
+            remote.send(('step', action))
+        self.waiting = True
+
+    def step_wait(self):
+        results = [remote.recv() for remote in self.remotes]
+        self.waiting = False
+        obs, rews, dones, infos = zip(*results)
+        return np.stack(obs), np.stack(rews), np.stack(dones), infos
+
+    def reset(self):
+        for remote in self.remotes:
+            remote.send(('reset', None))
+        return np.stack([remote.recv() for remote in self.remotes])
+
+    def reset_task(self):
+        for remote in self.remotes:
+            remote.send(('reset_task', None))
+        return np.stack([remote.recv() for remote in self.remotes])
+
+    def close(self):
+        if self.closed:
+            return
+        if self.waiting:
+            for remote in self.remotes:
+                remote.recv()
+        for remote in self.remotes:
+            remote.send(('close', None))
+        for p in self.ps:
+            p.join()
+        self.closed = True
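
Usage mirrors `DummyVecEnv`, except every environment runs in its own process. The sketch below is hypothetical; the `__main__` guard matters because `multiprocessing` re-imports the module on spawn-based platforms:

```python
import gym
import numpy as np
from vec_env.subproc_vec_env import SubprocVecEnv

def make_cartpole():
    return gym.make('CartPole-v0')

if __name__ == '__main__':
    envs = SubprocVecEnv([make_cartpole for _ in range(4)])  # four worker processes
    obs = envs.reset()
    for _ in range(10):
        actions = np.array([envs.action_space.sample() for _ in range(4)])
        obs, rews, dones, infos = envs.step(actions)  # workers auto-reset finished episodes
    envs.close()  # sends 'close' to every worker and joins the processes
```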

+ 38 - 0
pytorch_rl/vec_env/vec_frame_stack.py

@@ -0,0 +1,38 @@
+from . import VecEnvWrapper
+import numpy as np
+from gym import spaces
+
+class VecFrameStack(VecEnvWrapper):
+    """
+    Vectorized environment base class
+    """
+    def __init__(self, venv, nstack):
+        self.venv = venv
+        self.nstack = nstack
+        wos = venv.observation_space # wrapped ob space
+        low = np.repeat(wos.low, self.nstack, axis=-1)
+        high = np.repeat(wos.high, self.nstack, axis=-1)
+        self.stackedobs = np.zeros((venv.num_envs,)+low.shape, low.dtype)
+        observation_space = spaces.Box(low=low, high=high, dtype=venv.observation_space.dtype)
+        VecEnvWrapper.__init__(self, venv, observation_space=observation_space)
+
+    def step_wait(self):
+        obs, rews, news, infos = self.venv.step_wait()
+        self.stackedobs = np.roll(self.stackedobs, shift=-1, axis=-1)
+        for (i, new) in enumerate(news):
+            if new:
+                self.stackedobs[i] = 0
+        self.stackedobs[..., -obs.shape[-1]:] = obs
+        return self.stackedobs, rews, news, infos
+
+    def reset(self):
+        """
+        Reset all environments
+        """
+        obs = self.venv.reset()
+        self.stackedobs[...] = 0
+        self.stackedobs[..., -obs.shape[-1]:] = obs
+        return self.stackedobs
+
+    def close(self):
+        self.venv.close()
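
A hypothetical usage sketch of the wrapper: it stacks the last `nstack` frames of a channel-last image environment along the final axis, which is what the `np.roll` plus slice-assignment above implements. The Atari id is only for illustration and needs `gym[atari]` installed:

```python
import gym
from vec_env.dummy_vec_env import DummyVecEnv
from vec_env.vec_frame_stack import VecFrameStack

venv = DummyVecEnv([lambda: gym.make('BreakoutNoFrameskip-v4')])
stacked = VecFrameStack(venv, nstack=4)

obs = stacked.reset()
print(obs.shape)  # (1, 210, 160, 12): four stacked RGB frames along the channel axis
```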