Navigation

Deep Reinforcement Learning Nanodegree Project 1

Project description

For this project, the task is to train an agent to navigate a large, square world, collecting yellow bananas while avoiding blue ones. A reward of +1 is provided for collecting a yellow banana, and a reward of -1 for collecting a blue banana. Thus, the goal is to collect as many yellow bananas as possible while avoiding blue bananas.

  • The state space is 37-dimensional and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction.

  • The action space is 4-dimensional. The four discrete actions correspond to:

    • 0 - move forward
    • 1 - move backward
    • 2 - move left
    • 3 - move right
  • Solution criterion: the environment is considered solved when the agent gets an average score of +13 over 100 consecutive episodes (see the sketch below).
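
As a concrete illustration of the solution criterion, the snippet below keeps a sliding window of the last 100 episode scores and stops once their mean reaches +13. The run_episode stub is a hypothetical placeholder, not part of the project code, where the real training loop would go.

  from collections import deque
  import numpy as np

  def run_episode():
      # Placeholder for one training episode; the real loop lives in trainer.py.
      return np.random.uniform(0, 20)

  scores_window = deque(maxlen=100)          # scores of the last 100 episodes
  for episode in range(1, 2001):
      scores_window.append(run_episode())
      if len(scores_window) == 100 and np.mean(scores_window) >= 13.0:
          print(f"Solved in {episode} episodes "
                f"(average score {np.mean(scores_window):.2f})")
          break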

Getting started

Configuration

PC configuration used for this project:

  • OS: Mac OS 10.14 Mojave
  • i7-8800H, 32GB, Radeon Pro 560X 4GB

Structure

All project files are stored in the /src folder:

  • main.py - main file where the program execution starts.
  • agent.py - agent class implementation.
  • unity_env.py - Unity Environment wrapper (borrowed from here and modified).
  • trainer.py - trainer (interface between agent and environment) implementation.
  • replay_buffer.py - memory replay buffer implementation.
  • sum_tree.py - sum tree implementation for memory replay buffer.
  • q_network.py - neural network implementations (PyTorch).

All project settings are stored in a JSON file, settings.json, which is divided into 4 sections (a loading sketch follows the list):

  • general_params - general, module agnostic parameters: mode (train or test), number of episodes, seed.
  • agent_params - agent parameters: epsilon, gamma, learning rate, etc. This section also includes neural network configuration settings and memory replay buffer parameters.
  • trainer_params - trainer parameters: epsilon decay, learning rate decay, etc. These control any changes to the agent's learning parameters; the agent itself cannot change them.
  • env_params - environment parameters: path, number of agents, etc.
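
As a sketch of how such a configuration might be consumed, the snippet below loads settings.json and reads the four sections described above. The individual key names in the comments and the "mode" example are assumptions based on the descriptions, not verified against the actual file.

  import json

  with open("settings.json") as f:
      settings = json.load(f)

  # The four top-level sections described above.
  general_params = settings["general_params"]   # e.g. mode, number of episodes, seed
  agent_params   = settings["agent_params"]     # e.g. epsilon, gamma, learning rate
  trainer_params = settings["trainer_params"]   # e.g. epsilon decay, learning rate decay
  env_params     = settings["env_params"]       # e.g. path, number of agents

  mode = general_params.get("mode", "train")    # the "mode" key name is an assumption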

Environment setup

  • For a detailed Python environment setup (PyTorch, the ML-Agents toolkit, and a few more Python packages), please follow these steps: link

  • Download the pre-built Unity Environment:

  • Open settings.json and specify the relative path to the application file in "path" inside "env_params" (a minimal usage sketch follows).
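
Once the path is set, the environment can be created and driven for one episode roughly as below. This is a minimal sketch: the wrapper class name UnityEnv and the gym-style reset/step/close interface are assumptions about unity_env.py, not confirmed details.

  import json
  import random
  from unity_env import UnityEnv                  # class name is an assumption

  with open("settings.json") as f:
      env_params = json.load(f)["env_params"]

  env = UnityEnv(env_params["path"])              # "path" set in the step above
  state = env.reset()
  done, score = False, 0.0
  while not done:
      action = random.randrange(4)                # one of the four discrete actions
      state, reward, done, _ = env.step(action)   # gym-style (s, r, done, info) is an assumption
      score += reward
  env.close()
  print("Episode score:", score)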

Implementation details

The following four algorithms were implemented and tested on this environment.

DQN

Idea. Use a neural network to approximate the Q-value function as a state -> action-value mapping, minimising the following loss:

  L(\theta) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right]

Neural network architecture:

  Layer     (in, out)             Activation
  Layer 1   (state_size, 64)      ReLU
  Layer 2   (64, 128)             ReLU
  Layer 3   (128, 128)            ReLU
  Layer 4   (128, 32)             ReLU
  Layer 5   (32, action_size)     -
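
A straightforward PyTorch sketch of this architecture is shown below. Class and attribute names are illustrative; the actual implementation lives in q_network.py.

  import torch.nn as nn
  import torch.nn.functional as F

  class QNetwork(nn.Module):
      """Feed-forward Q-network matching the table above."""
      def __init__(self, state_size, action_size):
          super().__init__()
          self.fc1 = nn.Linear(state_size, 64)
          self.fc2 = nn.Linear(64, 128)
          self.fc3 = nn.Linear(128, 128)
          self.fc4 = nn.Linear(128, 32)
          self.fc5 = nn.Linear(32, action_size)

      def forward(self, state):
          x = F.relu(self.fc1(state))
          x = F.relu(self.fc2(x))
          x = F.relu(self.fc3(x))
          x = F.relu(self.fc4(x))
          return self.fc5(x)        # raw Q-values, no activation on the last layer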

Double DQN (DDQN)

Idea. As in DQN, use a neural network to approximate the Q-value function, but decouple action selection from action evaluation: the local (online) network selects the greedy next action, while the target network evaluates it. The following loss is minimised:

  L(\theta) = \mathbb{E}\left[ \left( r + \gamma\, Q\!\left(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-\right) - Q(s, a; \theta) \right)^2 \right]

The implementation of both DQN and DDQN can be found in agent.py:

  # Element-wise MSE loss, one value per transition (newer PyTorch spells this reduction='none')
  loss_fn = nn.MSELoss(reduce=False)

  if self.__learning_mode['DQN']:
      # DQN target: the target network both selects and evaluates the best next action
      Q_target_next = self.__qnetwork_target.forward(next_states).max(1)[0].unsqueeze(1).detach()
  else:
      # DDQN target: the local network selects the action, the target network evaluates it
      Q_target_next = self.__qnetwork_target.forward(next_states). \
          gather(1, self.__qnetwork_local.forward(next_states).max(1)[1].unsqueeze(1)).detach()

  # TD targets; (1 - dones) removes the bootstrap term for terminal transitions
  targets = rewards + self.gamma * Q_target_next * (1 - dones)
  # Q-values predicted by the local network for the actions that were actually taken
  outputs = self.__qnetwork_local.forward(states).gather(1, actions)
  loss = loss_fn(outputs, targets)
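
The element-wise loss presumably stays unreduced so that per-sample weights, such as the importance-sampling weights produced by a prioritised replay buffer, can be applied before averaging. A minimal sketch of that final reduction step with placeholder tensors follows; the weights tensor and its shape are assumptions, not taken from agent.py.

  import torch

  batch_size = 64
  loss = torch.rand(batch_size, 1, requires_grad=True)   # element-wise losses, as produced above
  weights = torch.rand(batch_size, 1)                    # importance-sampling weights (assumed shape)

  weighted_loss = (weights * loss).mean()                # weight each sample, then reduce to a scalar
  weighted_loss.backward()                               # gradients flow back as in the real training step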

Dueling DDQN

Idea. Use a neural network for Q-value function approximation. However, in this case the network produces both V(s), the state-value function, and A(s, a), the advantage function. The last layer then aggregates them to form Q(s, a).

The dueling architecture is implemented in q_network.py; the aggregation step is shown below:

  ...
  # 'action' here is the output of the shared hidden layers (the feature tensor)
  value = self.V(action)                   # state-value stream V(s)
  af = self.AL(action)                     # advantage stream A(s, a)
  action = value + (af - torch.mean(af))   # aggregate: Q(s, a) = V(s) + (A(s, a) - mean(A))
  return action
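
For context, here is a self-contained sketch of a dueling head with separate value and advantage streams. Layer sizes and names are illustrative and not taken from q_network.py.

  import torch.nn as nn
  import torch.nn.functional as F

  class DuelingQNetwork(nn.Module):
      def __init__(self, state_size, action_size, hidden=64):
          super().__init__()
          self.feature = nn.Linear(state_size, hidden)   # shared trunk
          self.V = nn.Linear(hidden, 1)                  # state-value stream V(s)
          self.A = nn.Linear(hidden, action_size)        # advantage stream A(s, a)

      def forward(self, state):
          x = F.relu(self.feature(state))
          value = self.V(x)                              # shape (batch, 1)
          adv = self.A(x)                                # shape (batch, action_size)
          # Aggregate: Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))
          return value + (adv - adv.mean(dim=1, keepdim=True))

Note that in this sketch the advantage mean is taken over the action dimension for each sample, which is the aggregation used in the original dueling paper, whereas the excerpt above takes torch.mean over the whole tensor.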

Based on experimental evidence from the original paper (link), this should improve both the stability and the learning speed of the agent.

Prioritised Replay

Idea. For memory replay, the agent collects tuples of (state, reward, next_state, action, done) and reuses them for future learning. In the case of prioritised replay, the agent assigns a priority to each tuple corresponding to its contribution to learning. These tuples are then sampled according to their priorities, leading to more efficient learning.

The most challenging part here is finding a data structure that supports fast updates and sampling. The solution was found here, where a sum tree was suggested as a good choice: it allows updating and sampling data in logarithmic time, O(log N). The implementation was taken from the same source and adjusted for this particular use case (see sum_tree.py).
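
To make the idea concrete, below is a minimal sum-tree sketch, independent of the version in sum_tree.py: leaves hold priorities, internal nodes hold the sum of their children, so both updating a priority and sampling proportionally to priority walk one root-to-leaf path in O(log N). In the real buffer each leaf would also be associated with a stored transition.

  import random

  class SumTree:
      """Minimal sum tree: leaves hold priorities, parents hold sums of their children."""
      def __init__(self, capacity):
          self.capacity = capacity
          self.tree = [0.0] * (2 * capacity)      # 1-based binary tree layout
          self.write = 0                          # next leaf slot to overwrite

      def add(self, priority):
          self.update(self.write, priority)
          self.write = (self.write + 1) % self.capacity

      def update(self, index, priority):
          i = index + self.capacity               # leaf position in the tree array
          self.tree[i] = priority
          i //= 2
          while i >= 1:                           # propagate the change up to the root
              self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
              i //= 2

      def sample(self):
          """Return a leaf index drawn proportionally to its priority."""
          mass = random.uniform(0, self.tree[1])  # tree[1] holds the total priority
          i = 1
          while i < self.capacity:                # descend, consuming priority mass
              left = 2 * i
              if mass <= self.tree[left]:
                  i = left
              else:
                  mass -= self.tree[left]
                  i = left + 1
          return i - self.capacity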

Additional modifications

One more modification that seemed to work very well was slowly reducing the learning rate during agent training. It led to a higher average score (see the next section, where results are discussed). To control the learning rate decay, one more parameter, learning_rate_decay, was introduced (see settings.json).
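
A hedged sketch of how such a per-episode decay can be applied with a standard PyTorch scheduler is shown below; the actual mechanism in trainer.py may differ, and the numeric values here are illustrative rather than taken from settings.json.

  import torch
  import torch.optim as optim

  model = torch.nn.Linear(37, 4)                       # stand-in for the Q-network
  optimizer = optim.Adam(model.parameters(), lr=5e-4)  # initial learning rate (illustrative)

  learning_rate_decay = 0.995                          # multiplicative factor (illustrative)
  scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=learning_rate_decay)

  for episode in range(10):
      # ... run the episode and the usual optimisation steps ...
      scheduler.step()                                 # shrink the learning rate once per episode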

Results

The best run data can be found in:

  • the results folder contains a log file with detailed agent results after each episode, as well as the hyperparameters used.
  • the models folder contains the checkpoint file.

Algorithms comparison

The following graph shows the average reward (over 100 episodes) for the different algorithms discussed above (and their combinations):

Comparison

The parameters for these runs remained the same and can be found in settings.json.

Several observations can be made:

  • The prioritised replay buffer does not affect algorithm performance for this specific environment. Moreover, in some runs I observed slightly worse performance (by less than 10%). A replay buffer with uniform sampling seems to be enough for this task.
  • Double DDQN collects reward the fastest, reaching 13 points (the solid black line on the graph) in about 200-250 episodes. This can be tuned further by adjusting the values of epsilon, tau (the target network update time constant) and the learning rate.
  • All algorithms reach an average score of over 16 points in fewer than 1800 episodes, which satisfies the solution criterion.

Another interesting effect can be seen from the graph below:

The effect of decreasing learning rate

I made two separate runs with the same parameters, but in the first case I kept the learning rate fixed, whereas in the second case I decreased it after each episode by about 20%. The graph clearly shows how decreasing the learning rate helps the agent collect more reward. A possible explanation is that lowering the learning rate leads to smaller changes in the Q-network at later steps, which helps convergence.

Visualisation of agent's behaviour

The following gif shows a single agent run in which it manages to collect 20 bananas:

Navigation

Possible improvements

  • Use Noisy Nets instead of an epsilon-greedy policy. This modification shows improvements over conventional epsilon-greedy exploration and reduces the number of hyperparameters to tune (see the sketch after this list).
  • Reward shaping. When the agent is sufficiently trained, getting the 2nd banana after the 1st is expected and almost always happens, whereas getting the 21st banana after the 20th is an achievement. Therefore, the agent could receive a higher reward for improving its result relative to the expected one.
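
For the Noisy Nets idea in the first bullet, a minimal sketch of a factorised noisy linear layer in the spirit of Fortunato et al. is shown below. The initialisation constants follow the paper's conventions but are not tuned for this project; everything here is illustrative rather than an existing part of the codebase.

  import math
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class NoisyLinear(nn.Module):
      """Linear layer with learnable, factorised Gaussian noise on weights and biases."""
      def __init__(self, in_features, out_features, sigma_init=0.5):
          super().__init__()
          self.in_features, self.out_features = in_features, out_features
          self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
          self.weight_sigma = nn.Parameter(torch.empty(out_features, in_features))
          self.bias_mu = nn.Parameter(torch.empty(out_features))
          self.bias_sigma = nn.Parameter(torch.empty(out_features))
          self.register_buffer("weight_eps", torch.zeros(out_features, in_features))
          self.register_buffer("bias_eps", torch.zeros(out_features))
          bound = 1 / math.sqrt(in_features)
          self.weight_mu.data.uniform_(-bound, bound)
          self.bias_mu.data.uniform_(-bound, bound)
          self.weight_sigma.data.fill_(sigma_init / math.sqrt(in_features))
          self.bias_sigma.data.fill_(sigma_init / math.sqrt(in_features))
          self.reset_noise()

      @staticmethod
      def _scaled_noise(size):
          x = torch.randn(size)
          return x.sign() * x.abs().sqrt()

      def reset_noise(self):
          eps_in = self._scaled_noise(self.in_features)
          eps_out = self._scaled_noise(self.out_features)
          self.weight_eps.copy_(torch.outer(eps_out, eps_in))   # factorised noise
          self.bias_eps.copy_(eps_out)

      def forward(self, x):
          weight = self.weight_mu + self.weight_sigma * self.weight_eps
          bias = self.bias_mu + self.bias_sigma * self.bias_eps
          return F.linear(x, weight, bias)

Such layers would replace the last nn.Linear layers of the Q-network, with reset_noise() called after each learning step, so exploration comes from the learned noise instead of epsilon.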