DRL - MADDPG Algorithm - Tennis Collaboration Continuous Control

Udacity Deep Reinforcement Learning Nanodegree Program - Tennis Collaboration Continuous Control

Observations:

  • To run the project, just execute the main.py file.
  • There is also an .ipynb file for Jupyter Notebook execution.
  • If you are not using a Windows environment, you will need to download the "Tennis" environment build that corresponds to your OS. Mail me if you need more details about the environment executable.
  • The checkpoint.pth file contains trained weights that have already reached the expected average score.

Requirements:

  • tensorflow: 1.7.1
  • Pillow: 4.2.1
  • matplotlib
  • numpy: 1.11.0
  • pytest: 3.2.2
  • docopt
  • pyyaml
  • protobuf: 3.5.2
  • grpcio: 1.11.0
  • torch: 0.4.1
  • pandas
  • scipy
  • ipykernel
  • jupyter: 5.6.0

The problem:

  • The task solved here is a collaborative continuous control problem in which two agents must play "tennis" cooperatively: the longer the rally lasts, the higher the reward both agents earn.
  • It is a continuous problem because each action takes a continuous value that the agents must output directly, rather than just picking the option with the biggest value (as in discrete tasks, where the agent only has to say which action it wants to execute).
  • In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of +0.1. If an agent lets a ball hit the ground or hits the ball out of bounds, it receives a reward of -0.01. Thus, the goal of each agent is to keep the ball in play.
  • The task is episodic, and in order to solve the environment, the agents must get an average score of +0.5 (over 100 consecutive episodes, after taking the maximum over both agents), as illustrated in the sketch below.
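
As a rough illustration of that criterion (this is not code from this repository; the episode_scores structure below is hypothetical), the check can be written like this:

```python
import numpy as np
from collections import deque

def is_solved(episode_scores, window=100, target=0.5):
    """episode_scores: one array per episode holding each agent's total reward."""
    # The score of an episode is the maximum over the two agents.
    max_scores = [np.max(scores) for scores in episode_scores]
    recent = deque(max_scores, maxlen=window)  # keep only the last `window` episodes
    # Solved once the moving average of those maxima reaches the target.
    return len(recent) == window and np.mean(recent) >= target
```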

The solution:

  • For this problem, I used my own implementation of the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm (the DDPG code provided by Udacity was used as a reference).
  • The challenge here was finding the best way to share experiences between the agents. In the first version of the algorithm, I had two actors, two critics, and a shared memory buffer. I don't know exactly why that idea didn't work as I expected, so after some more research I changed the implementation to separate the critic from the actors: one critic shared by both actors, with each actor keeping its own memory buffer. This worked, but then I realized that the task is the same for both agents (just seen from a different perspective), so I could share not only the critic but also the experiences collected by all the actors. I made this change and, after hyperparameter tuning, the solution reached the current results (see the sketch after this list).
  • I also tested some noise variations, like increasing the noise range and reducing the noise over time, but I didn't get better results, so I have removed them for now.
  • I've noticed that convergence is still a bit unstable, and in the future I plan to improve the neural network structure and check whether I can get faster convergence for this task.
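
As a minimal sketch of that final design, this is roughly what "one shared critic, one shared replay buffer, several actors" looks like in PyTorch. The names (SharedReplayBuffer, learn, etc.) are illustrative and not the actual code from main.py:

```python
import random
from collections import deque, namedtuple

import torch
import torch.nn.functional as F

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class SharedReplayBuffer:
    """Single buffer filled by both agents; both agents sample from it."""
    def __init__(self, size=int(1e6)):
        self.memory = deque(maxlen=size)

    def add(self, *args):
        self.memory.append(Experience(*args))

    def sample(self, batch_size=128):
        batch = random.sample(self.memory, batch_size)
        to_tensor = lambda xs: torch.tensor(xs, dtype=torch.float32)
        return (to_tensor([e.state for e in batch]),
                to_tensor([e.action for e in batch]),
                to_tensor([e.reward for e in batch]).unsqueeze(-1),
                to_tensor([e.next_state for e in batch]),
                to_tensor([e.done for e in batch]).unsqueeze(-1))

def learn(actor, actor_target, critic, critic_target,
          actor_opt, critic_opt, buffer, gamma=0.99, tau=2e-1):
    """One DDPG-style update; the same shared critic and buffer serve every actor."""
    states, actions, rewards, next_states, dones = buffer.sample()

    # Critic: minimise the TD error against the target networks.
    with torch.no_grad():
        next_actions = actor_target(next_states)
        q_targets = rewards + gamma * critic_target(next_states, next_actions) * (1 - dones)
    critic_loss = F.mse_loss(critic(states, actions), q_targets)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: maximise the shared critic's evaluation of its own actions.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft-update the target networks.
    for target, local in ((critic_target, critic), (actor_target, actor)):
        for t_param, l_param in zip(target.parameters(), local.parameters()):
            t_param.data.copy_(tau * l_param.data + (1.0 - tau) * t_param.data)
```

In this setup, learn would be called once per actor, always passing the same shared critic, critic_target, and buffer.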

The hyperparameters:

  • The hyperparameter configuration is in the main.py file.

  • If you want, you can change the model configuration in the model.py file.

  • The noise configuration is in the noise.py file, and the values are fixed in the method signature (a sketch of this noise process appears at the end of this section).

  • The current hyperparameter configuration is:

    • Learning Rate:
      • Actors: 1e-4
      • Critic: 3e-4
    • Batch Size: 128
    • Replay Buffer: 1e6
    • Gamma: 0.99
    • Tau: 2e-1
      • Ornstein-Uhlenbeck noise parameters: theta = 0.15, sigma = 0.2
  • For the neural models (a PyTorch sketch follows this list):

    • Actor

      • Hidden: (input, 512) - ReLU
      • Hidden: (512, 256) - ReLU
      • Output: (256, 2) - Tanh
    • Critic

      • Hidden: (input, 512) - ReLU
      • Hidden: (512 + action_size, 256) - ReLU
      • Output: (256, 1) - Linear
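
For reference, a minimal PyTorch sketch of networks with exactly those layer sizes (class and attribute names here are illustrative; the real definitions live in model.py):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """state -> 512 -> 256 -> 2 continuous actions in [-1, 1]."""
    def __init__(self, state_size, action_size=2):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 512)
        self.fc2 = nn.Linear(512, 256)
        self.out = nn.Linear(256, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.out(x))

class Critic(nn.Module):
    """(state, action) -> scalar Q-value; the action joins at the second layer,
    which matches the (512 + action_size, 256) hidden layer listed above."""
    def __init__(self, state_size, action_size=2):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 512)
        self.fc2 = nn.Linear(512 + action_size, 256)
        self.out = nn.Linear(256, 1)

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(torch.cat([x, action], dim=1)))
        return self.out(x)
```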
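
And a minimal sketch of the Ornstein-Uhlenbeck noise process with the theta and sigma values listed above (the project's actual implementation lives in noise.py; this version is only illustrative):

```python
import numpy as np

class OUNoise:
    """Temporally correlated exploration noise added to the actors' actions."""
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.rng = np.random.RandomState(seed)
        self.reset()

    def reset(self):
        # Start each episode from the long-run mean.
        self.state = self.mu.copy()

    def sample(self):
        # dx = theta * (mu - x) + sigma * N(0, 1)
        dx = self.theta * (self.mu - self.state) + self.sigma * self.rng.standard_normal(self.mu.shape)
        self.state = self.state + dx
        return self.state
```

Typically, a sample is added to each actor's action and the result is clipped to [-1, 1] before being sent to the environment.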