
Reinforcement-Learning-using-OpenAI-Gym

Reinforcement learning algorithms (SARSA, Q-Learning, DQN) for classical control and MuJoCo environments, implemented and tested with OpenAI Gym.

SARSA Cart Pole

SARSA (State-Action-Reward-State-Action) is a simple on-policy reinforcement learning algorithm in which the agent learns while following its current (epsilon-greedy) policy: the same policy generates both the action taken in the current state and the action evaluated in the next state.
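
The core of the method is the SARSA backup. A minimal sketch, assuming a tabular Q array indexed by a discretized state and an action, with illustrative names for the learning rate (alpha) and discount factor (gamma):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One SARSA backup. The next action a_next is chosen by the same
    epsilon-greedy policy the agent is following (on-policy)."""
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])
```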

Implemented SARSA for the Cart Pole problem, a classical control environment provided by OpenAI Gym.

Problem Goal:
The Cart Pole environment has a 4-dimensional state at every time step,

[the position of the cart on the horizontal axis,
the cart’s velocity along that axis,
the pole’s angular position relative to the cart,
the pole’s angular velocity]

and there are 2 actions the cart can take [push the cart to the left, push the cart to the right].

The goal is to balance the pole on the cart for as long as possible by taking an appropriate action at every timestep.
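
For reference, the state and action spaces can be inspected directly from Gym (a sketch assuming the classic `CartPole-v0` environment id and the classic Gym API):

```python
import gym

env = gym.make("CartPole-v0")
print(env.observation_space)  # Box(4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)       # Discrete(2): push left, push right
```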

Implementation

  • Discretized the 4 continuous state variables into [2, 2, 8, 4] buckets respectively, clipping each variable to a fixed range of values (see the sketch after this list).
  • Used a decaying exploration rate so that random exploration decreases towards the end of training.
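
A minimal sketch of the discretization and the decaying exploration rate; the clipping ranges and the decay schedule below are illustrative assumptions, not necessarily the exact values used in this repo:

```python
import numpy as np

# Buckets per state variable: cart position, cart velocity,
# pole angle, pole angular velocity.
N_BUCKETS = (2, 2, 8, 4)
# Clipping range for each variable (illustrative values).
LOWS  = np.array([-2.4, -3.0, -0.21, -2.0])
HIGHS = np.array([ 2.4,  3.0,  0.21,  2.0])

def discretize(obs):
    """Map a continuous 4-d observation to a tuple of bucket indices."""
    ratios = (np.clip(obs, LOWS, HIGHS) - LOWS) / (HIGHS - LOWS)
    return tuple((ratios * (np.array(N_BUCKETS) - 1)).round().astype(int))

def epsilon(episode, min_eps=0.01):
    """Exploration rate that decays towards the end of training."""
    return max(min_eps, min(1.0, 1.0 - np.log10((episode + 1) / 25)))
```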

Q-Learning (SARSAMAX) Cart Pole

Q-Learning is a simple off-policy reinforcement learning algorithm in which the agent learns the optimal policy while behaving according to its current (epsilon-greedy) policy: actions are generated from the current state by that behaviour policy, but the update bootstraps from the action with the maximum Q-value in the next state, which is why it is also called SARSAMAX.
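
The only change relative to SARSA is the backup target: instead of the Q-value of the action the policy will actually take next, the update bootstraps from the maximum Q-value in the next state. A minimal sketch with the same illustrative names as the SARSA snippet above:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One Q-learning (SARSAMAX) backup: bootstrap from the greedy
    next action, regardless of what the behaviour policy does."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])
```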

Implemented Q-learning for the Cart Pole problem, a classical control environment provided by OpenAI Gym.

Implementation

  • Discretized the 4 continuous state variables into [2, 2, 8, 4] buckets respectively, clipping each variable to a fixed range of values, as in the SARSA implementation above.
  • Used a decaying exploration rate so that random exploration decreases towards the end of training (the full training loop is sketched after this list).
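
Putting the pieces together, a sketch of the tabular training loop, reusing the `discretize`, `epsilon`, and `q_learning_update` helpers sketched above and assuming the classic Gym API (`reset` returning only the observation, `step` returning four values); the episode count is illustrative:

```python
import numpy as np
import gym

env = gym.make("CartPole-v0")
Q = np.zeros(N_BUCKETS + (env.action_space.n,))  # tabular Q-values

for episode in range(2000):
    s = discretize(env.reset())
    done = False
    while not done:
        # Epsilon-greedy action selection under the decaying schedule.
        if np.random.random() < epsilon(episode):
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        obs, r, done, _ = env.step(a)
        s_next = discretize(obs)
        q_learning_update(Q, s, a, r, s_next)
        s = s_next
```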

Results:
The training curves suggest that, for both algorithms, the cart learns to balance the pole for the required number of timesteps almost consistently within roughly 2000 episodes.

SARSA Mountain Car

Problem Goal:
The Mountain Car environment has a 2-dimensional state at every time step,

[the position of the car,
the car’s velocity]

and there are 3 actions the car can take [push left, no push, push right].

The goal is to drive the car up the hill to the goal position by taking an appropriate action at every timestep.

Implementation

  • Discretized the 2 continuous state variables into [20, 20] buckets respectively, clipping each variable to a fixed range of values.
  • Used a decaying exploration rate so that random exploration decreases towards the end of training.
  • Used a gradually increasing learning rate: as the exploration rate decays, the action-value estimates become more reliable, so later episodes are weighted more heavily (see the sketch after this list).
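
A minimal sketch of the Mountain Car discretization and the two schedules; the exact bucket edges, decay, and learning-rate range are illustrative assumptions:

```python
import numpy as np

N_BUCKETS = (20, 20)              # car position, car velocity
LOWS  = np.array([-1.2, -0.07])   # MountainCar-v0 observation bounds
HIGHS = np.array([ 0.6,  0.07])

def discretize(obs):
    """Map a continuous 2-d observation to a tuple of bucket indices."""
    ratios = (np.clip(obs, LOWS, HIGHS) - LOWS) / (HIGHS - LOWS)
    return tuple((ratios * (np.array(N_BUCKETS) - 1)).round().astype(int))

def epsilon(episode, n_episodes=3000, min_eps=0.01):
    """Exploration decays towards the end of training."""
    return max(min_eps, 1.0 - episode / n_episodes)

def alpha(episode, n_episodes=3000, lo=0.05, hi=0.3):
    """Learning rate grows as exploration shrinks, so the later, more
    exploitative episodes contribute more to the value estimates."""
    return lo + (hi - lo) * episode / n_episodes
```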

Q-Learning (SARSAMAX) Mountain Car

Implementation

  • Discretized the 2 continuous state variables into [20, 20] buckets respectively, clipping each variable to a fixed range of values, as in the SARSA implementation above.
  • Used a decaying exploration rate so that random exploration decreases towards the end of training.
  • Used a gradually increasing learning rate, for the same reason as above: as exploration decays, the estimates become more reliable and later episodes are weighted more heavily.

Results:
The training curves suggest that, for both algorithms, the car learns to reach the goal almost consistently within roughly 3000 episodes.

SARSA Mountain Car with Backward View (Eligibility Traces)

Implementation

  • Discretized the 2 continuous state variables into [65, 65] buckets respectively, clipping each variable to a fixed range of values.
  • Used eligibility traces, tuning the value of lambda (see the sketch after this list).
  • Used a decaying exploration rate so that random exploration decreases towards the end of training.
  • Used a gradually increasing learning rate: as exploration decays, the estimates become more reliable and later episodes are weighted more heavily.
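
A minimal sketch of the backward-view SARSA(lambda) backup with accumulating traces; the trace array `E` has the same shape as `Q`, and the names and the trace-decay value are illustrative:

```python
import numpy as np

def sarsa_lambda_update(Q, E, s, a, r, s_next, a_next,
                        alpha=0.1, gamma=0.99, lam=0.9):
    """Backward-view SARSA(lambda): the TD error is broadcast to every
    recently visited state-action pair through the eligibility trace E."""
    td_error = r + gamma * Q[s_next][a_next] - Q[s][a]
    E[s][a] += 1.0             # bump the trace for the current pair
    Q += alpha * td_error * E  # update all traced state-action pairs
    E *= gamma * lam           # decay every trace
```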

Deep Q-Learning Cart Pole

Implementation

  • Created two deep networks (an online network and a target network) that take the state as input and output a Q-value for each action.
  • Used experience replay: transitions are stored in a memory buffer, mini-batches are sampled from it off-line, and the network is trained to reduce the MSE between predicted and target Q-values.
  • Used a decaying exploration rate so that random exploration decreases towards the end of training.
  • Specifically for this problem, gave a reward of -10 whenever the episode terminates (done is True) before reaching 200 timesteps, so the agent learns to avoid early termination and keeps balancing the pole on the cart (see the sketch after this list).
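
A minimal sketch of the pieces described above: the online/target network pair, the replay buffer with MSE training on sampled batches, the decaying exploration rate, and the -10 reward on early termination. Keras is an assumption for the network library, all layer sizes and hyperparameters are illustrative, and the classic Gym API is assumed:

```python
import random
from collections import deque

import gym
import numpy as np
from tensorflow import keras

def build_net(n_inputs, n_actions):
    """Small MLP: state in, one Q-value per action out."""
    model = keras.Sequential([
        keras.layers.Dense(24, activation="relu", input_shape=(n_inputs,)),
        keras.layers.Dense(24, activation="relu"),
        keras.layers.Dense(n_actions, activation="linear"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")
    return model

env = gym.make("CartPole-v0")
n_inputs, n_actions = env.observation_space.shape[0], env.action_space.n
online = build_net(n_inputs, n_actions)
target = build_net(n_inputs, n_actions)
target.set_weights(online.get_weights())

memory = deque(maxlen=50_000)     # experience replay buffer
gamma, batch_size = 0.99, 32

def replay():
    """Sample a mini-batch from memory and fit the online net on MSE targets."""
    if len(memory) < batch_size:
        return
    batch = random.sample(memory, batch_size)
    states = np.array([b[0] for b in batch])
    next_states = np.array([b[3] for b in batch])
    q = online.predict(states, verbose=0)
    q_next = target.predict(next_states, verbose=0)
    for i, (s, a, r, s2, done) in enumerate(batch):
        q[i, a] = r if done else r + gamma * np.max(q_next[i])
    online.fit(states, q, epochs=1, verbose=0)

for episode in range(500):
    eps = max(0.01, 0.995 ** episode)          # decaying exploration rate
    s = env.reset()
    for t in range(200):
        if np.random.random() < eps:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(online.predict(s[None], verbose=0)))
        s2, r, done, _ = env.step(a)
        if done and t < 199:
            r = -10.0                          # penalize dropping the pole early
        memory.append((s, a, r, s2, done))
        replay()
        s = s2
        if done:
            break
    target.set_weights(online.get_weights())   # sync the target network
```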
