
Removed pytorch_rl dependency on OpenAI baselines to make install easier

Maxime Chevalier-Boisvert, 7 years ago
parent
commit
2fdde6eb6b

+ 3 - 12
README.md

@@ -36,15 +36,6 @@ cd pytorch_rl
 # PyTorch
 conda install pytorch torchvision -c soumith
 
-# Dependencies needed by OpenAI baselines
-sudo apt install libopenmpi-dev zlib1g-dev cmake
-
-# OpenAI baselines
-git clone https://github.com/openai/baselines.git
-cd baselines
-pip3 install -e .
-cd ..
-
 # Other requirements
 pip3 install -r requirements.txt
 ```
@@ -67,16 +58,16 @@ The environment being run can be selected with the `--env-name` option, eg:
 ```
 
 Basic reinforcement learning code is provided in the `pytorch_rl` subdirectory.
-You can perform training using the ACKTR algorithm with:
+You can perform training using the A2C algorithm with:
 
 ```
-python3 pytorch_rl/main.py --env-name MiniGrid-Empty-6x6-v0 --no-vis --num-processes 32 --algo acktr
+python3 pytorch_rl/main.py --env-name MiniGrid-Empty-6x6-v0 --no-vis --num-processes 48 --algo a2c
 ```
 
 You can view the result of training using the `enjoy.py` script:
 
 ```
-python3 pytorch_rl/enjoy.py --env-name MiniGrid-Empty-6x6-v0 --load-dir ./trained_models/acktr
+python3 pytorch_rl/enjoy.py --env-name MiniGrid-Empty-6x6-v0 --load-dir ./trained_models/a2c
 ```
 
 ## Design

+ 1 - 0
gym_minigrid/envs/gotodoor.py

@@ -100,6 +100,7 @@ class GoToDoorEnv(MiniGridEnv):
         if action == self.actions.wait:
             if (ax == tx and abs(ay - ty) == 1) or (ay == ty and abs(ax - tx) == 1):
                 reward = 1
+            done = True
 
         obs = self._observation(obs)
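
The added `done = True` makes every `wait` action terminate the episode, whether or not the reward condition is met. A minimal, hypothetical sketch of the resulting logic, using the variable names from the hunk above (`ax`/`ay` is the agent position, `tx`/`ty` the target door):

```python
def wait_action_outcome(ax, ay, tx, ty):
    """Illustrative only: reward/termination when the agent takes `wait`."""
    done = True   # the episode now always ends on `wait`
    reward = 0
    # Reward only if the agent is directly adjacent to the target door
    if (ax == tx and abs(ay - ty) == 1) or (ay == ty and abs(ax - tx) == 1):
        reward = 1
    return reward, done
```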
 

+ 0 - 153
pytorch_rl/README.md

@@ -1,153 +0,0 @@
-# pytorch-a2c-ppo-acktr
-
-## Update 10/06/2017: added enjoy.py and a link to pretrained models!
-## Update 09/27/2017: now supports both Atari and MuJoCo/Roboschool!
-
-This is a PyTorch implementation of
-* Advantage Actor Critic (A2C), a synchronous deterministic version of [A3C](https://arxiv.org/pdf/1602.01783v1.pdf)
-* Proximal Policy Optimization [PPO](https://arxiv.org/pdf/1707.06347.pdf)
-* Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation [ACKTR](https://arxiv.org/abs/1708.05144)
-
-Also see the OpenAI posts: [A2C/ACKTR](https://blog.openai.com/baselines-acktr-a2c/) and [PPO](https://blog.openai.com/openai-baselines-ppo/) for more information.
-
-This implementation is inspired by the OpenAI baselines for [A2C](https://github.com/openai/baselines/tree/master/baselines/a2c), [ACKTR](https://github.com/openai/baselines/tree/master/baselines/acktr) and [PPO](https://github.com/openai/baselines/tree/master/baselines/ppo1). It uses the same hyper parameters and the model since they were well tuned for Atari games.
-
-## Supported (and tested) environments (via [OpenAI Gym](https://gym.openai.com))
-* [Atari Learning Environment](https://github.com/mgbellemare/Arcade-Learning-Environment)
-* [MuJoCo](http://mujoco.org)
-* [PyBullet](http://pybullet.org) (including Racecar, Minitaur and Kuka)
-
-I highly recommend PyBullet as a free open source alternative to MuJoCo for continuous control tasks.
-
-All environments are operated using exactly the same Gym interface. See their documentations for a comprehensive list.
-
-## Requirements
-
-* Python 3 (it might work with Python 2, but I didn't test it)
-* [PyTorch](http://pytorch.org/)
-* [Visdom](https://github.com/facebookresearch/visdom)
-* [OpenAI baselines](https://github.com/openai/baselines)
-
-In order to install requirements, follow:
-
-```bash
-# PyTorch
-conda install pytorch torchvision -c soumith
-
-# Baselines for Atari preprocessing
-git clone https://github.com/openai/baselines.git
-cd baselines
-pip install -e .
-
-# Other requirements
-pip install -r requirements.txt
-```
-
-## Contributions
-
-Contributions are very welcome. If you know how to make this code better, don't hesitate to send a pull request. Also see a todo list below.
-
-Also I'm searching for volunteers to run all experiments on Atari and MuJoCo (with multiple random seeds).
-
-## Disclaimer
-
-It's extremely difficult to reproduce results for Reinforcement Learning methods. See ["Deep Reinforcement Learning that Matters"](https://arxiv.org/abs/1709.06560) for more information. I tried to reproduce OpenAI results as closely as possible. However, majors differences in performance can be caused even by minor differences in TensorFlow and PyTorch libraries.
-
-### TODO
-* Improve this README file. Rearrange images.
-* Improve performance of KFAC, see kfac.py for more information
-* Run evaluation for all games and algorithms
-
-## Training
-
-Start a `Visdom` server with `python -m visdom.server`, it will serve `http://localhost:8097/` by default.
-
-### Atari
-#### A2C
-
-```bash
-python main.py --env-name "PongNoFrameskip-v4"
-```
-
-#### PPO
-
-```bash
-python main.py --env-name "PongNoFrameskip-v4" --algo ppo --use-gae --lr 2.5e-4 --clip-param 0.1 --num-processes 8 --num-steps 128 --num-mini-batch 4 --vis-interval 1 --log-interval 1
-```
-
-#### ACKTR
-
-```bash
-python main.py --env-name "PongNoFrameskip-v4" --algo acktr --num-processes 32 --num-steps 20
-```
-
-### MuJoCo
-#### A2C
-
-```bash
-python main.py --env-name "Reacher-v1" --num-stack 1 --num-frames 1000000
-```
-
-#### PPO
-
-```bash
-python main.py --env-name "Reacher-v1" --algo ppo --use-gae --vis-interval 1  --log-interval 1 --num-stack 1 --num-steps 2048 --num-processes 1 --lr 3e-4 --entropy-coef 0 --ppo-epoch 10 --num-mini-batch 32 --gamma 0.99 --tau 0.95 --num-frames 1000000
-```
-
-#### ACKTR
-
-ACKTR requires some modifications to be made specifically for MuJoCo. But at the moment, I want to keep this code as unified as possible. Thus, I'm going for better ways to integrate it into the codebase.
-
-## Enjoy
-
-Load a pretrained model from [my Google Drive](https://drive.google.com/open?id=0Bw49qC_cgohKS3k2OWpyMWdzYkk).
-
-Also pretrained models for other games are available on request. Send me an email or create an issue, and I will upload it.
-
-Disclaimer: I might have used different hyper-parameters to train these models.
-
-### Atari
-
-```bash
-python enjoy.py --load-dir trained_models/a2c --env-name "PongNoFrameskip-v4" --num-stack 4
-```
-
-### MuJoCo
-
-```bash
-python enjoy.py --load-dir trained_models/ppo --env-name "Reacher-v1" --num-stack 1
-```
-
-## Results
-
-### A2C
-
-![BreakoutNoFrameskip-v4](imgs/a2c_breakout.png)
-
-![SeaquestNoFrameskip-v4](imgs/a2c_seaquest.png)
-
-![QbertNoFrameskip-v4](imgs/a2c_qbert.png)
-
-![beamriderNoFrameskip-v4](imgs/a2c_beamrider.png)
-
-### PPO
-
-
-![BreakoutNoFrameskip-v4](imgs/ppo_halfcheetah.png)
-
-![SeaquestNoFrameskip-v4](imgs/ppo_hopper.png)
-
-![QbertNoFrameskip-v4](imgs/ppo_reacher.png)
-
-![beamriderNoFrameskip-v4](imgs/ppo_walker.png)
-
-
-### ACKTR
-
-![BreakoutNoFrameskip-v4](imgs/acktr_breakout.png)
-
-![SeaquestNoFrameskip-v4](imgs/acktr_seaquest.png)
-
-![QbertNoFrameskip-v4](imgs/acktr_qbert.png)
-
-![beamriderNoFrameskip-v4](imgs/acktr_beamrider.png)

+ 11 - 44
pytorch_rl/enjoy.py

@@ -7,12 +7,10 @@ import time
 import numpy as np
 import torch
 from torch.autograd import Variable
-from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
-from baselines.common.vec_env.vec_normalize import VecNormalize
+from vec_env.dummy_vec_env import DummyVecEnv
 
 from envs import make_env
 
-
 parser = argparse.ArgumentParser(description='RL')
 parser.add_argument('--seed', type=int, default=1,
                     help='random seed (default: 1)')
@@ -26,29 +24,12 @@ parser.add_argument('--load-dir', default='./trained_models/',
                     help='directory to save agent logs (default: ./trained_models/)')
 args = parser.parse_args()
 
-
 env = make_env(args.env_name, args.seed, 0, None)
 env = DummyVecEnv([env])
 
-actor_critic, ob_rms = \
-            torch.load(os.path.join(args.load_dir, args.env_name + ".pt"))
-
+actor_critic, ob_rms = torch.load(os.path.join(args.load_dir, args.env_name + ".pt"))
 
-if len(env.observation_space.shape) == 1:
-    env = VecNormalize(env, ret=False)
-    env.ob_rms = ob_rms
-
-    # An ugly hack to remove updates
-    def _obfilt(self, obs):
-        if self.ob_rms:
-            obs = np.clip((obs - self.ob_rms.mean) / np.sqrt(self.ob_rms.var + self.epsilon), -self.clipob, self.clipob)
-            return obs
-        else:
-            return obs
-    env._obfilt = types.MethodType(_obfilt, env)
-    render_func = env.venv.envs[0].render
-else:
-    render_func = env.envs[0].render
+render_func = env.envs[0].render
 
 obs_shape = env.observation_space.shape
 obs_shape = (obs_shape[0] * args.num_stack, *obs_shape[1:])
@@ -56,7 +37,6 @@ current_obs = torch.zeros(1, *obs_shape)
 states = torch.zeros(1, actor_critic.state_size)
 masks = torch.zeros(1, 1)
 
-
 def update_current_obs(obs):
     shape_dim0 = env.observation_space.shape[0]
     obs = torch.from_numpy(obs).float()
@@ -64,27 +44,21 @@ def update_current_obs(obs):
         current_obs[:, :-shape_dim0] = current_obs[:, shape_dim0:]
     current_obs[:, -shape_dim0:] = obs
 
-
 render_func('human')
 obs = env.reset()
 update_current_obs(obs)
 
-if args.env_name.find('Bullet') > -1:
-    import pybullet as p
-
-    torsoId = -1
-    for i in range(p.getNumBodies()):
-        if (p.getBodyInfo(i)[0].decode() == "torso"):
-            torsoId = i
-
 while True:
-    value, action, _, states = actor_critic.act(Variable(current_obs, volatile=True),
-                                                Variable(states, volatile=True),
-                                                Variable(masks, volatile=True),
-                                                deterministic=True)
+    value, action, _, states = actor_critic.act(
+        Variable(current_obs, volatile=True),
+        Variable(states, volatile=True),
+        Variable(masks, volatile=True),
+        deterministic=True
+    )
     states = states.data
     cpu_actions = action.data.squeeze(1).cpu().numpy()
-    # Obser reward and next obs
+
+    # Observe the reward and the next observation
     obs, reward, done, _ = env.step(cpu_actions)
 
     time.sleep(0.05)
@@ -97,13 +71,6 @@ while True:
         current_obs *= masks
     update_current_obs(obs)
 
-    if args.env_name.find('Bullet') > -1:
-        if torsoId > -1:
-            distance = 5
-            yaw = 0
-            humanPos, humanOrn = p.getBasePositionAndOrientation(torsoId)
-            p.resetDebugVisualizerCamera(distance, yaw, -20, humanPos)
-
     renderer = render_func('human')
 
     if not renderer.window:
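
With the `VecNormalize` path and the PyBullet camera handling gone, the setup at the top of `enjoy.py` reduces to the lines sketched below. This is only a paraphrase of the diff above, assuming (as the call site suggests) that `make_env` returns a zero-argument thunk that builds the environment:

```python
import os
import torch

from vec_env.dummy_vec_env import DummyVecEnv
from envs import make_env

# One environment, wrapped in the vendored DummyVecEnv
env = DummyVecEnv([make_env('MiniGrid-Empty-6x6-v0', 1, 0, None)])

# Checkpoints are still saved as an (actor_critic, ob_rms) tuple; ob_rms is now
# unused because observations are no longer normalized at load time.
actor_critic, ob_rms = torch.load(os.path.join('./trained_models/a2c',
                                               'MiniGrid-Empty-6x6-v0' + '.pt'))

# Without the VecNormalize wrapper in between, the raw env is reached directly
render_func = env.envs[0].render
```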

BIN  pytorch_rl/imgs/a2c_beamrider.png
BIN  pytorch_rl/imgs/a2c_breakout.png
BIN  pytorch_rl/imgs/a2c_qbert.png
BIN  pytorch_rl/imgs/a2c_seaquest.png
BIN  pytorch_rl/imgs/acktr_beamrider.png
BIN  pytorch_rl/imgs/acktr_breakout.png
BIN  pytorch_rl/imgs/acktr_qbert.png
BIN  pytorch_rl/imgs/acktr_seaquest.png
BIN  pytorch_rl/imgs/ppo_halfcheetah.png
BIN  pytorch_rl/imgs/ppo_hopper.png
BIN  pytorch_rl/imgs/ppo_reacher.png
BIN  pytorch_rl/imgs/ppo_walker.png

+ 2 - 3
pytorch_rl/main.py

@@ -14,9 +14,8 @@ import torch.optim as optim
 from torch.autograd import Variable
 
 from arguments import get_args
-from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
-from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
-from baselines.common.vec_env.vec_normalize import VecNormalize
+from vec_env.dummy_vec_env import DummyVecEnv
+from vec_env.subproc_vec_env import SubprocVecEnv
 from envs import make_env
 from kfac import KFACOptimizer
 from model import RecMLPPolicy, MLPPolicy, CNNPolicy
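
The hunk only shows the import swap; how `main.py` builds its vectorized environments is not visible here. The sketch below is an assumption about that wiring (argument order of `make_env` copied from `enjoy.py`, `num_processes` from the README example), included to show that the call sites need no change once the imports point at the vendored `vec_env`:

```python
from vec_env.dummy_vec_env import DummyVecEnv
from vec_env.subproc_vec_env import SubprocVecEnv
from envs import make_env

num_processes = 48
env_fns = [make_env('MiniGrid-Empty-6x6-v0', 1, rank, None)
           for rank in range(num_processes)]

# Presumed pattern: worker subprocesses for parallel rollouts, DummyVecEnv otherwise
envs = SubprocVecEnv(env_fns) if num_processes > 1 else DummyVecEnv(env_fns)
```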

+ 0 - 2
pytorch_rl/requirements.txt

@@ -1,4 +1,2 @@
 gym
 matplotlib
-pybullet
-opencv-python

+ 21 - 0
pytorch_rl/vec_env/LICENSE

@@ -0,0 +1,21 @@
+The MIT License
+
+Copyright (c) 2017 OpenAI (http://openai.com)
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.

+ 100 - 0
pytorch_rl/vec_env/__init__.py

@@ -0,0 +1,100 @@
+from abc import ABC, abstractmethod
+import logging
+
+class VecEnv(ABC):
+    """
+    An abstract asynchronous, vectorized environment.
+    """
+
+    def __init__(self, num_envs, observation_space, action_space):
+        self.num_envs = num_envs
+        self.observation_space = observation_space
+        self.action_space = action_space
+
+    @abstractmethod
+    def reset(self):
+        """
+        Reset all the environments and return an array of
+        observations.
+
+        If step_async is still doing work, that work will
+        be cancelled and step_wait() should not be called
+        until step_async() is invoked again.
+        """
+        pass
+
+    @abstractmethod
+    def step_async(self, actions):
+        """
+        Tell all the environments to start taking a step
+        with the given actions.
+        Call step_wait() to get the results of the step.
+
+        You should not call this if a step_async run is
+        already pending.
+        """
+        pass
+
+    @abstractmethod
+    def step_wait(self):
+        """
+        Wait for the step taken with step_async().
+
+        Returns (obs, rews, dones, infos):
+         - obs: an array of observations
+         - rews: an array of rewards
+         - dones: an array of "episode done" booleans
+         - infos: an array of info objects
+        """
+        pass
+
+    @abstractmethod
+    def close(self):
+        """
+        Clean up the environments' resources.
+        """
+        pass
+
+    def step(self, actions):
+        self.step_async(actions)
+        return self.step_wait()
+
+    def render(self):
+        # baselines' logger is no longer vendored; use the standard library so
+        # calling render() on an unsupported env warns instead of raising a NameError
+        logging.warning('Render not defined for %s' % self)
+
+class VecEnvWrapper(VecEnv):
+    def __init__(self, venv, observation_space=None, action_space=None):
+        self.venv = venv
+        VecEnv.__init__(self,
+            num_envs=venv.num_envs,
+            observation_space=observation_space or venv.observation_space,
+            action_space=action_space or venv.action_space)
+
+    def step_async(self, actions):
+        self.venv.step_async(actions)
+
+    @abstractmethod
+    def reset(self):
+        pass
+
+    @abstractmethod
+    def step_wait(self):
+        pass
+
+    def close(self):
+        return self.venv.close()
+
+    def render(self):
+        self.venv.render()
+
+class CloudpickleWrapper(object):
+    """
+    Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle)
+    """
+    def __init__(self, x):
+        self.x = x
+    def __getstate__(self):
+        import cloudpickle
+        return cloudpickle.dumps(self.x)
+    def __setstate__(self, ob):
+        import pickle
+        self.x = pickle.loads(ob)
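
`VecEnvWrapper` is the base that downstream wrappers such as `vec_frame_stack.py` build on: `step_async` is forwarded to the wrapped env and only `reset` and `step_wait` need to be overridden. A minimal, hypothetical wrapper that rescales rewards shows the pattern (assuming `vec_env/` is importable, as it is when running from the `pytorch_rl` directory):

```python
import numpy as np
from vec_env import VecEnvWrapper

class VecRewardScale(VecEnvWrapper):
    """Hypothetical example: multiply every reward by a constant factor."""

    def __init__(self, venv, scale=0.1):
        VecEnvWrapper.__init__(self, venv)  # inherits venv's observation/action spaces
        self.scale = scale

    def reset(self):
        return self.venv.reset()

    def step_wait(self):
        obs, rews, dones, infos = self.venv.step_wait()
        return obs, np.asarray(rews) * self.scale, dones, infos
```

Such a wrapper slots in the same way `VecFrameStack` does, e.g. `VecRewardScale(SubprocVecEnv(env_fns))`.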

+ 31 - 0
pytorch_rl/vec_env/dummy_vec_env.py

@@ -0,0 +1,31 @@
+import numpy as np
+from . import VecEnv
+
+class DummyVecEnv(VecEnv):
+    def __init__(self, env_fns):
+        self.envs = [fn() for fn in env_fns]
+        env = self.envs[0]        
+        VecEnv.__init__(self, len(env_fns), env.observation_space, env.action_space)
+        self.ts = np.zeros(len(self.envs), dtype='int')        
+        self.actions = None
+
+    def step_async(self, actions):
+        self.actions = actions
+
+    def step_wait(self):
+        results = [env.step(a) for (a,env) in zip(self.actions, self.envs)]
+        obs, rews, dones, infos = map(np.array, zip(*results))
+        self.ts += 1
+        for (i, done) in enumerate(dones):
+            if done: 
+                obs[i] = self.envs[i].reset()
+                self.ts[i] = 0
+        self.actions = None
+        return np.array(obs), np.array(rews), np.array(dones), infos
+
+    def reset(self):        
+        results = [env.reset() for env in self.envs]
+        return np.array(results)
+
+    def close(self):
+        return
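
A quick, hypothetical usage sketch with plain Gym environments, showing the two properties callers rely on: `step` takes one action per sub-environment, and any sub-environment that finishes an episode is reset automatically inside `step_wait`:

```python
import gym
import numpy as np
from vec_env.dummy_vec_env import DummyVecEnv

# Pass thunks, not instances: DummyVecEnv calls each fn() itself
envs = DummyVecEnv([lambda: gym.make('CartPole-v0') for _ in range(4)])

obs = envs.reset()  # shape (4, 4) for CartPole
actions = np.array([envs.action_space.sample() for _ in range(4)])
obs, rews, dones, infos = envs.step(actions)
# Wherever dones[i] is True, obs[i] is already the first observation of a fresh episode
```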

+ 82 - 0
pytorch_rl/vec_env/subproc_vec_env.py

@@ -0,0 +1,82 @@
+import numpy as np
+from multiprocessing import Process, Pipe
+from . import VecEnv, CloudpickleWrapper
+
+def worker(remote, parent_remote, env_fn_wrapper):
+    parent_remote.close()
+    env = env_fn_wrapper.x()
+    while True:
+        cmd, data = remote.recv()
+        if cmd == 'step':
+            ob, reward, done, info = env.step(data)
+            if done:
+                ob = env.reset()
+            remote.send((ob, reward, done, info))
+        elif cmd == 'reset':
+            ob = env.reset()
+            remote.send(ob)
+        elif cmd == 'reset_task':
+            ob = env.reset_task()
+            remote.send(ob)
+        elif cmd == 'close':
+            remote.close()
+            break
+        elif cmd == 'get_spaces':
+            remote.send((env.observation_space, env.action_space))
+        else:
+            raise NotImplementedError
+
+
+class SubprocVecEnv(VecEnv):
+    def __init__(self, env_fns, spaces=None):
+        """
+        envs: list of gym environments to run in subprocesses
+        """
+        self.waiting = False
+        self.closed = False
+        nenvs = len(env_fns)
+        self.remotes, self.work_remotes = zip(*[Pipe() for _ in range(nenvs)])
+        self.ps = [Process(target=worker, args=(work_remote, remote, CloudpickleWrapper(env_fn)))
+            for (work_remote, remote, env_fn) in zip(self.work_remotes, self.remotes, env_fns)]
+        for p in self.ps:
+            p.daemon = True # if the main process crashes, we should not cause things to hang
+            p.start()
+        for remote in self.work_remotes:
+            remote.close()
+
+        self.remotes[0].send(('get_spaces', None))
+        observation_space, action_space = self.remotes[0].recv()
+        VecEnv.__init__(self, len(env_fns), observation_space, action_space)
+
+    def step_async(self, actions):
+        for remote, action in zip(self.remotes, actions):
+            remote.send(('step', action))
+        self.waiting = True
+
+    def step_wait(self):
+        results = [remote.recv() for remote in self.remotes]
+        self.waiting = False
+        obs, rews, dones, infos = zip(*results)
+        return np.stack(obs), np.stack(rews), np.stack(dones), infos
+
+    def reset(self):
+        for remote in self.remotes:
+            remote.send(('reset', None))
+        return np.stack([remote.recv() for remote in self.remotes])
+
+    def reset_task(self):
+        for remote in self.remotes:
+            remote.send(('reset_task', None))
+        return np.stack([remote.recv() for remote in self.remotes])
+
+    def close(self):
+        if self.closed:
+            return
+        if self.waiting:
+            for remote in self.remotes:
+                remote.recv()
+        for remote in self.remotes:
+            remote.send(('close', None))
+        for p in self.ps:
+            p.join()
+        self.closed = True
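
Usage mirrors `DummyVecEnv`, except every environment runs in its own process. The sketch below is hypothetical; the `__main__` guard matters because `multiprocessing` re-imports the module on spawn-based platforms:

```python
import gym
import numpy as np
from vec_env.subproc_vec_env import SubprocVecEnv

def make_cartpole():
    return gym.make('CartPole-v0')

if __name__ == '__main__':
    envs = SubprocVecEnv([make_cartpole for _ in range(4)])  # four worker processes
    obs = envs.reset()
    for _ in range(10):
        actions = np.array([envs.action_space.sample() for _ in range(4)])
        obs, rews, dones, infos = envs.step(actions)  # workers auto-reset finished episodes
    envs.close()  # sends 'close' to every worker and joins the processes
```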

+ 38 - 0
pytorch_rl/vec_env/vec_frame_stack.py

@@ -0,0 +1,38 @@
+from . import VecEnvWrapper
+import numpy as np
+from gym import spaces
+
+class VecFrameStack(VecEnvWrapper):
+    """
+    Vectorized environment base class
+    """
+    def __init__(self, venv, nstack):
+        self.venv = venv
+        self.nstack = nstack
+        wos = venv.observation_space # wrapped ob space
+        low = np.repeat(wos.low, self.nstack, axis=-1)
+        high = np.repeat(wos.high, self.nstack, axis=-1)
+        self.stackedobs = np.zeros((venv.num_envs,)+low.shape, low.dtype)
+        observation_space = spaces.Box(low=low, high=high, dtype=venv.observation_space.dtype)
+        VecEnvWrapper.__init__(self, venv, observation_space=observation_space)
+
+    def step_wait(self):
+        obs, rews, news, infos = self.venv.step_wait()
+        self.stackedobs = np.roll(self.stackedobs, shift=-1, axis=-1)
+        for (i, new) in enumerate(news):
+            if new:
+                self.stackedobs[i] = 0
+        self.stackedobs[..., -obs.shape[-1]:] = obs
+        return self.stackedobs, rews, news, infos
+
+    def reset(self):
+        """
+        Reset all environments
+        """
+        obs = self.venv.reset()
+        self.stackedobs[...] = 0
+        self.stackedobs[..., -obs.shape[-1]:] = obs
+        return self.stackedobs
+
+    def close(self):
+        self.venv.close()
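
A hypothetical usage sketch of the wrapper: it stacks the last `nstack` frames of a channel-last image environment along the final axis, which is what the `np.roll` plus slice-assignment above implements. The Atari id is only for illustration and needs `gym[atari]` installed:

```python
import gym
from vec_env.dummy_vec_env import DummyVecEnv
from vec_env.vec_frame_stack import VecFrameStack

venv = DummyVecEnv([lambda: gym.make('BreakoutNoFrameskip-v4')])
stacked = VecFrameStack(venv, nstack=4)

obs = stacked.reset()
print(obs.shape)  # (1, 210, 160, 12): four stacked RGB frames along the channel axis
```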