Multi-agent deep reinforcement learning research project
The TD3 algorithm uses two critic networks and takes the smaller of the two value estimates when computing the target. To prevent overestimation errors from propagating through the policy, the policy network is updated only after a set number of timesteps, while the value networks are updated at every timestep. This lowers the variance of the policy updates, leading to more stable and efficient training and ultimately a better-quality policy. In this implementation, the actor network is updated every 2 timesteps. The target policy is also smoothed by adding clipped random noise to the target action and averaging over mini-batches, which reduces the variance that would otherwise let the policy exploit narrow peaks in the value estimate.
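The three mechanisms above can be sketched in a few lines. This is a minimal numpy illustration, not the project's actual PyTorch code; the function names (`td3_target`, `smoothed_target_action`) and the hyperparameter values other than the delay of 2 are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def td3_target(q1_next, q2_next, rewards, dones, gamma=0.99):
    """Clipped double-Q target: take the smaller of the two critic
    estimates for the next state to curb overestimation."""
    min_q = np.minimum(q1_next, q2_next)
    return rewards + gamma * (1.0 - dones) * min_q

def smoothed_target_action(actor_out, noise_std=0.2, noise_clip=0.5,
                           max_action=1.0):
    """Target policy smoothing: add clipped Gaussian noise to the
    target actor's action before the critics evaluate it."""
    noise = np.clip(rng.normal(0.0, noise_std, size=actor_out.shape),
                    -noise_clip, noise_clip)
    return np.clip(actor_out + noise, -max_action, max_action)

# Delayed policy updates: the actor (and target networks) update only
# every `policy_delay` critic updates -- every 2 steps, as in the text.
policy_delay = 2
update_steps = [step for step in range(1, 7) if step % policy_delay == 0]
```

The critics are trained against `td3_target` at every step; the actor update happens only on the steps listed in `update_steps`.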
- Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with Double Q-Learning. In Thirtieth AAAI Conference on Artificial Intelligence (2016).
In order to reduce overestimation bias, this method estimates the current Q value using a separate value function. - Hasselt, H. V. Double Q-Learning. In Advances in Neural Information Processing Systems (2010), 2613–2621.
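A tabular sketch may help make the idea concrete: two independent tables are kept, one selects the greedy next action and the other evaluates it, so a single table's random overestimates are not self-reinforcing. The function name and hyperparameters here are illustrative, not from the cited paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

def double_q_update(qa, qb, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Double Q-learning step: a coin flip picks which
    table is updated; the other table evaluates the greedy action,
    decoupling action selection from value evaluation."""
    if rng.random() < 0.5:
        best = int(np.argmax(qa[s_next]))       # QA selects
        target = r + gamma * qb[s_next, best]   # QB evaluates
        qa[s, a] += alpha * (target - qa[s, a])
    else:
        best = int(np.argmax(qb[s_next]))       # QB selects
        target = r + gamma * qa[s_next, best]   # QA evaluates
        qb[s, a] += alpha * (target - qb[s, a])
```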
In actor-critic methods the policy is updated very slowly, making bias a concern because errors accumulate over many updates. This paper extends Double Q-Learning to clipped double Q-learning, which takes the smaller value of the two critic networks (the safer choice). Even though this promotes underestimation, that is not a concern, because underestimated values do not propagate through the policy update. - Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477 (2018).
Original citation for the PyTorch implementation of Twin Delayed Deep Deterministic Policy Gradients (TD3); source code.
- Add to Overleaf summaries
- Upload to shared articles
- Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. arXiv preprint arXiv:1511.05952 (2015).
Prioritized experience replay: see Overleaf article summary.
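As a quick reminder of the core idea from the Schaul et al. paper, the sketch below implements proportional prioritized sampling, where a transition's sampling probability is proportional to its (absolute) TD error raised to a power alpha. The class name, the list-based storage (the paper uses a sum-tree), and the omission of importance-sampling weights are simplifications for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

class PrioritizedReplay:
    """Minimal proportional prioritized replay: sampling probability
    is p_i**alpha / sum_j p_j**alpha, where p_i is the magnitude of
    the transition's last TD error."""
    def __init__(self, capacity, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data, self.prios = [], []

    def add(self, transition, td_error=1.0):
        # Evict the oldest transition once the buffer is full.
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.prios.pop(0)
        self.data.append(transition)
        self.prios.append(abs(td_error) + 1e-6)  # avoid zero probability

    def sample(self, batch_size):
        p = np.asarray(self.prios) ** self.alpha
        p /= p.sum()
        idx = rng.choice(len(self.data), size=batch_size, p=p)
        return [self.data[i] for i in idx], idx
```

A production version would use a sum-tree for O(log n) sampling and correct the sampling bias with importance weights, as in the full paper.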
- TD3 Algorithm Code from Towards Data Science implementation of Addressing function approximation error in actor-critic methods.
- OpenAI Gym, Replay Buffer and Priority Replay Buffer
- ROS Robotics by Example: Baxter reference for ROS, including joint angles,... [download the book](https://drive.google.com/open?id=11UpOH1fZd1qhXr9i8tEyVa1g4NVmL-me)
- TD3 Implementation Used for TD3 algorithm implementation.