addy1997/RL-Algorithms

Table of Contents

Algorithm


Theory

SARSA, or State-Action-Reward-State-Action, is an on-policy TD(0) control method in reinforcement learning. It follows the Generalised Policy Iteration (GPI) strategy: as the policy π is made greedy with respect to the state-action value function, the state-action value function in turn moves towards the optimal value function. Our aim is to estimate Qπ(s, a) for the current policy π and all state-action (s, a) pairs.

  • We learn the state-action value function Q(s, a) rather than the state-value function V(s).

  • Qπ(s, a) is the estimate for the current behaviour policy π over all state-action pairs (s, a).

  • Initialise a suitable starting state S (S should not be a terminal state).

  • Choose an action A using an ε-greedy (or ε-soft) policy.

  • Take action A, record the reward R and the next state S′, and choose the next action A′ from S′ using the same policy.

  • Update the function: Q(S, A) ← Q(S, A) + α[R + γ Q(S′, A′) − Q(S, A)]

  • This loop runs until a terminal state is reached, where Q(S′, A′) = 0 (see the sketch below).
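
The loop above can be written as a short tabular SARSA sketch. This is a minimal illustration, not the repository's implementation: the Gymnasium-style `env` (with `reset()`/`step()`), `n_states`, `n_actions`, and the hyperparameter values are all assumptions.

```python
import numpy as np

# Hypothetical tabular SARSA sketch (illustrative, not the repository's code).

def epsilon_greedy(Q, state, n_actions, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))           # Q(s, a) table, initialised to zero
    for _ in range(episodes):
        state, _ = env.reset()                    # start from a non-terminal state
        action = epsilon_greedy(Q, state, n_actions, epsilon)
        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            next_action = epsilon_greedy(Q, next_state, n_actions, epsilon)
            # SARSA target: bootstrap from the action A' the policy actually chose;
            # at a terminal state the bootstrap term is zero.
            target = reward + (0.0 if done else gamma * Q[next_state, next_action])
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    return Q
```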

SARSA update rule

Q(S, A) ← Q(S, A) + α[R + γ Q(S′, A′) − Q(S, A)]

Q-learning, similar to SARSA, is a TD(0) control method, but it is off-policy. Both algorithms aim to estimate an action-value function Q(s, a) for all the state-action pairs involved in the task: SARSA estimates Qπ(s, a) for the behaviour policy π, while Q-learning directly estimates the optimal action-value function.

Q-learning Algorithm

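As a companion to the SARSA sketch above, here is a minimal tabular Q-learning sketch under the same assumptions (Gymnasium-style `env`, integer states and actions, and the hypothetical `epsilon_greedy` helper defined earlier); it is illustrative, not the repository's implementation.

```python
import numpy as np

# Hypothetical tabular Q-learning sketch (illustrative, not the repository's code).
# Reuses the epsilon_greedy helper from the SARSA sketch above.

def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # The behaviour policy is still epsilon-greedy (it explores),
            # but the update target below uses the greedy action (off-policy).
            action = epsilon_greedy(Q, state, n_actions, epsilon)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```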

Q-learning vs SARSA

The only difference is how the next action A′ used in the update is chosen. In SARSA, A′ is selected by the same behaviour policy π (e.g. ε-greedy) that selected A. In Q-learning, the update target uses the greedy action in the next state, i.e. the maximum of Q(S′, a) over actions, so no random action enters the update target. Hence the update involves more exploitation than exploration.
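
To make the contrast concrete, the snippet below isolates just the two update targets; the table and transition values are made up purely for illustration.

```python
import numpy as np

# Tiny made-up example contrasting the two bootstrap targets.
Q = np.zeros((5, 2))                 # hypothetical 5-state, 2-action table
next_state, next_action = 2, 0       # A' as sampled by the epsilon-greedy policy
reward, gamma = 1.0, 0.99

# SARSA (on-policy): bootstrap from the action A' the behaviour policy chose.
sarsa_target = reward + gamma * Q[next_state, next_action]

# Q-learning (off-policy): bootstrap from the greedy action in S',
# regardless of which action the behaviour policy goes on to take.
q_learning_target = reward + gamma * np.max(Q[next_state])
```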

Q-learning update rule

Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]

Algorithm
