Deep Reinforcement Learning for Banana Collecting -- Report

This project was one of the requirements for completing the Deep Reinforcement Learning Nanodegree (DRLND) course at Udacity.com. In short, a learning agent is trained to collect yellow bananas (reward of +1) and to avoid blue bananas (reward of -1 if collected). Please refer to README.md for more details about the environment and installation.

Implementation

The implementation was adapted from the course materials for the coding exercise on deep Q learning. The environment was created using Unity's ML-Agents toolkit, but the learning algorithms were implemented in Python 3.6 (specifically 3.6.6). The code was run in a Jupyter notebook.

Q-learning is a reinforcement learning technique based on learning a policy which directs the actions taken by an agent given the state of the environment. Q is the function that estimates the expected cumulative reward for taking a given action in a given state, and these estimates direct the learning of the agent. For deep Q learning, Q is represented by a neural network.

As stated in the README file, the goal was to obtain a score of at least 13 averaged over 100 episodes. This was easily achieved with minimal modification of the code from the coding exercise to train the agent in the Unity ML-Agents environment rather than an environment from the OpenAI Gym. I attempted to improve on this by incorporating variations of deep Q learning that we learned in the course.

Fixed Q-Targets vs Soft Updating

During Q learning, an error is computed as the difference between the current Q value and the target Q value in order to update the weights. However, the true target Q value is not known and must be estimated. One of the problems with deep Q learning is that the same parameters are used to compute both the current Q value and the estimated target value. Since the two are correlated and the target changes with every update, the learning becomes oscillatory. We learned about the method of fixed Q-targets for reducing this problem.

This required defining 2 networks: qnetwork_local and qnetwork_target. The target network computes the estimated targets used for updating qnetwork_local. The targets are kept fixed by only copying the weights of qnetwork_local into qnetwork_target once every fixed number of steps rather than during every learning step. The number of steps in this interval can be adjusted as a hyperparameter.

Although we learned about the fixed Q-targets method for reducing this problem, the solution code for the exercise actually used a "soft update" method for gradually blending the local network into the target network. Basically it is a weighted average, strongly weighted toward the original target network so that changes occur gradually. The weight for the local network is tau, and the weight for the target network is thus 1 - tau. The current code enables the user to either choose the soft update method or provide a target_update_interval, which indicates the number of steps between completely replacing the target network with the new local network.
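
As a rough sketch of the two update modes (PyTorch-style; only qnetwork_local, qnetwork_target, tau, and target_update_interval are names from this project, the helper names are illustrative):

```python
def soft_update(qnetwork_local, qnetwork_target, tau):
    # Soft update: theta_target <- tau * theta_local + (1 - tau) * theta_target,
    # so the target network drifts slowly toward the local network.
    for target_param, local_param in zip(qnetwork_target.parameters(),
                                         qnetwork_local.parameters()):
        target_param.data.copy_(tau * local_param.data +
                                (1.0 - tau) * target_param.data)

def hard_update(qnetwork_local, qnetwork_target):
    # Fixed Q-targets: copy the local weights into the target network wholesale,
    # called only once every target_update_interval learning steps.
    qnetwork_target.load_state_dict(qnetwork_local.state_dict())
```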

Other Options for Reducing Oscillations

Limiting Parameters

I implemented options based on other suggestions for possibly reducing oscillations. These are based on limiting the range of the quantities which determine the change in weights:

  1. Limiting the gradients to between -1 and 1.

  2. Limiting the reward to between -1 and 1.

  3. Limiting the magnitude of the temporal-difference (TD) error to 1 or less.

For the current environment, the reward is +1 for each yellow banana and -1 for each blue banana, so reward clipping is of minimal use, but I implemented it anyway for future use with other environments. I looked at the actual rewards returned by the environment. Occasionally I did see rewards of -2 or +2, indicating that more than one banana can sometimes be picked up following a single action. However, this was uncommon.

I did not see any obvious benefit for these options during my brief testing but would like to test them in a more formal manner. I did not investigate the actual range of gradients and errors experienced during training of this environment.
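
For illustration, a hedged sketch of where the three clips would sit inside a single learning step (PyTorch; only qnetwork_local and qnetwork_target are names from the project, everything else here is an assumption, and tensors are assumed to have shape [batch, 1]):

```python
import torch

def clipped_learning_step(qnetwork_local, qnetwork_target, optimizer,
                          states, actions, rewards, next_states, dones, gamma):
    # Option 2: limit the sampled rewards to [-1, 1].
    rewards = torch.clamp(rewards, -1.0, 1.0)

    # Standard DQN target: r + gamma * max_a' Q_target(s', a') for non-terminal s'.
    q_next = qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)
    q_targets = rewards + gamma * q_next * (1 - dones)
    q_expected = qnetwork_local(states).gather(1, actions)

    # Option 3: limit the magnitude of the TD error to 1 or less.
    td_errors = torch.clamp(q_targets - q_expected, -1.0, 1.0)
    loss = (td_errors ** 2).mean()

    optimizer.zero_grad()
    loss.backward()
    # Option 1: clip the gradient norm to 1 before the optimizer step.
    torch.nn.utils.clip_grad_norm_(qnetwork_local.parameters(), 1.0)
    optimizer.step()
```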

Double Q Learning

In basic Q learning, the same Q function is used both to find the action with the maximum Q value in the next state and to compute that action's value for the target. Especially in a noisy environment, this can result in the Q values being overestimated and a slowing of the learning process. One partial remedy for this problem is to have 2 independent networks, one for selecting the maximal action and one for computing its Q value. Since we already have 2 networks, qnetwork_local and qnetwork_target, these can be used, although they are not entirely independent: qnetwork_local determines the action with the maximum Q value for the next state, and qnetwork_target is then used to compute that action's Q value.
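
In code, the only change from the standard target computation is which network selects the arg-max action (a minimal sketch; tensor shapes and all names other than qnetwork_local and qnetwork_target are assumptions):

```python
def double_dqn_targets(qnetwork_local, qnetwork_target, rewards, next_states, dones, gamma):
    # qnetwork_local picks the greedy action for each next state ...
    best_actions = qnetwork_local(next_states).detach().argmax(1, keepdim=True)
    # ... and qnetwork_target evaluates the Q value of that chosen action.
    q_next = qnetwork_target(next_states).detach().gather(1, best_actions)
    return rewards + gamma * q_next * (1 - dones)
```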

Dueling Q Networks

For dueling Q networks, the network is split into 2 streams, a value function and an advantage function. The Q value, Q(s, a), is a metric of the value of choosing a particular action a in a particular state s. The value function, V(s), is a metric of how good it is to be in a particular state s. The advantage function, A(s, a), is a metric of the advantage of, or expected improvement to the current policy provided by, a particular action. The Q value is then defined as V(s) + A(s, a) minus the mean of A(s, a) over the actions. This method is intended to improve the performance of the learning agent. More details can be found in the following reference:

Wang et al., Dueling Network Architectures for Deep Reinforcement Learning, https://arxiv.org/pdf/1511.06581.pdf
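
A minimal PyTorch sketch of a dueling head, assuming the sizes of this environment (37 state values, 4 actions); the class and layer names and the single shared layer are illustrative rather than the exact project architecture:

```python
import torch.nn as nn
import torch.nn.functional as F

class DuelingQNetwork(nn.Module):
    def __init__(self, state_size=37, action_size=4, hidden=64):
        super().__init__()
        self.fc_shared = nn.Linear(state_size, hidden)       # shared trunk
        self.fc_value = nn.Linear(hidden, 1)                  # V(s) stream
        self.fc_advantage = nn.Linear(hidden, action_size)    # A(s, a) stream

    def forward(self, state):
        x = F.relu(self.fc_shared(state))
        value = self.fc_value(x)
        advantage = self.fc_advantage(x)
        # Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')
        return value + advantage - advantage.mean(dim=1, keepdim=True)
```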

Experience Replay vs Prioritized Experience Replay

The coding exercise included an implementation of experience replay, which stores prior states, actions, rewards, and next states. Training is performed on a random sampling of the stored items. The basic version samples the items uniformly.

In prioritized experience replay, each item is weighted by a function of its error. Since the agent may be able to learn more from the actions which result in a relatively larger error, those with larger errors are weighted more during sampling of the replay memory buffer.
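
A sketch of how the sampling probabilities and importance-sampling weights can be computed from the stored TD errors, following Schaul et al. (numpy; the alpha, beta, and eps values shown are common defaults and are not necessarily those used in this project):

```python
import numpy as np

def prioritized_sample(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-5):
    # Priority p_i = (|TD error| + eps)^alpha; eps keeps zero-error items sampleable.
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    indices = np.random.choice(len(td_errors), batch_size, p=probs)
    # Importance-sampling weights w_i = (N * P(i))^(-beta), normalized for stability.
    weights = (len(td_errors) * probs[indices]) ** (-beta)
    weights /= weights.max()
    return indices, weights
```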

I also varied the size of the memory buffer from 1e3 to 1e6. Zhang and Sutton (A Deeper Look at Experience Replay, https://arxiv.org/abs/1712.01275) suggest that large buffer sizes may hurt learning performance. However, I achieved the best final results with 1e6. The smaller buffer sizes seemed to learn faster in the beginning but leveled off at a lower value.

I also varied the interval from 4 to 16 for updating the neural network with training data sampled from the buffer. I decided to keep it at 4 because this was the value used in the coding exercise and I did not see any definitive improvement with higher values.

Evolving Learning Parameters

As is evident, there are many parameters that can be modified to affect the learning process. In addition, the parameters can be set to change with time, since the optimal values may not be static. For example, the original solution code from the coding exercise included an evolving probability, epsilon, that the agent will choose an action at random. At the start of training, the agent has not learned anything, so a random action should be taken unless there is already some prior knowledge. As the agent learns, epsilon should decrease. I modified the original code to allow epsilon at time step k to evolve as follows:

epsilon(k + 1) = epsilon_final + (1 - epsilon_rate) * (epsilon(k) - epsilon_final)

The following were my final choices for values but I tried different values for the rate:

epsilon_initial = 1.0,
epsilon_final = 0.01,
epsilon_rate = 0.005

Gamma is the discount factor in Q learning, indicating the preference for current rewards over potential future rewards. Francois-Lavet et al. (How to Discount Deep Reinforcement Learning: Towards New Dynamic Strategies, https://arxiv.org/abs/1512.02011) recommend gradually increasing gamma as learning progresses. I used a variation of their method and allowed gamma to increase according to:

gamma(k + 1) = gamma_final + (1 - gamma_rate) * (gamma(k) - gamma_final)

The following were my final choices for values but I tried different values for the rate:

gamma_initial = 0.95,
gamma_final = 0.99,
gamma_rate = 0.01

Schaul et al. (Prioritized Experience Replay, https://arxiv.org/abs/1511.05952) also varied beta, the exponent in their importance-sampling weights for prioritized experience replay. I thus also allowed beta to increase as follows:

beta(k + 1) = 1 + (1 - beta_rate) * (beta(k) - 1)

The following were my final choices for values but I tried different values for the rate:

beta_initial = 0.4,
beta_rate = 0.005

Since I had difficulties with the scores leveling off at 16 to 18, depending on my parameters, I also tried allowing tau, the weighting factor for soft updating, to gradually increase:

tau(k + 1) = tau_final + (1 - tau_rate) * (tau(k) - tau_final)

I set tau_final to what I thought would be a fairly high value but made the rate small, so it never reached this value during the number of episodes that I tried. I did it this way since I do not have any intuition for what values would be ideal:

tau_initial = 0.001,
tau_final = 0.1,
tau_rate = 0.0001
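
All four schedules share the same exponential form, x(k + 1) = x_final + (1 - rate) * (x(k) - x_final), so they can be expressed with a single helper (a sketch; the function name is hypothetical):

```python
def anneal(value, final_value, rate):
    # Each call closes a fraction `rate` of the remaining gap to final_value,
    # whether the value starts above (epsilon) or below (gamma, beta, tau) it.
    return final_value + (1.0 - rate) * (value - final_value)

# Applied repeatedly during training, for example:
# epsilon = anneal(epsilon, epsilon_final, epsilon_rate)  # decays 1.0 -> 0.01
# gamma   = anneal(gamma, gamma_final, gamma_rate)        # grows 0.95 -> 0.99
# beta    = anneal(beta, 1.0, beta_rate)                  # grows 0.4 -> 1.0
# tau     = anneal(tau, tau_final, tau_rate)              # grows 0.001 -> 0.1
```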

Neural Networks

The basic neural network, which I obtained from the coding exercise, has the following structure:

  1. a fully connected layer taking the 37-dimensional state vector as input and producing 64 outputs
  2. Rectified Linear Unit (ReLU)
  3. a fully connected layer (64 in, 64 out)
  4. ReLU
  5. a fully connected layer (64 in, the number of possible actions (4) out)

I also tried a version of dueling Q networks with tanh activation functions but it is currently not an option in the implementation of the learning agent.

Versions with batch normalization have the normalization performed after each ReLU.

In the versions for dueling Q networks, the value and advantage streams share the first fully connected layer and ReLU but then branch apart into their own fully connected layers.

The learning utilizes the Adam optimizer.
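
Roughly, the basic network and its optimizer can be sketched as follows (PyTorch; the class name and the learning rate shown are assumptions, the latter taken from the run parameters listed later):

```python
import torch.nn as nn
import torch.optim as optim

class QNetwork(nn.Module):
    # 37 state inputs -> 64 -> ReLU -> 64 -> ReLU -> 4 action values
    def __init__(self, state_size=37, action_size=4, hidden=64):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(state_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_size),
        )

    def forward(self, state):
        return self.model(state)

qnetwork_local = QNetwork()
qnetwork_target = QNetwork()
optimizer = optim.Adam(qnetwork_local.parameters(), lr=0.0005)
```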

Learning Algorithm

Agent parameters

The following parameters determine the learning agent:

    state_size (int): Number of parameters in the environment state
    action_size (int): Number of different actions
    seed (int): random seed
    learning_rate (float): initial learning rate
    batch_normalize (boolean): Flag for using batch normalization in the neural network
    error_clipping (boolean): Flag for limiting the TD error to between -1 and 1 
    reward_clipping (boolean): Flag for limiting the reward to between -1 and 1
    gradient_clipping (boolean): Flag for clipping the norm of the gradient to 1
    target_update_interval (int): Number of steps between full updates of the target network (used when not soft updating)
    double_dqn (boolean): Flag for using double Q learning
    dueling_dqn (boolean): Flag for using dueling Q networks
    prioritized_replay (boolean): Flag for using prioritized replay memory sampling

Training parameters

These parameters adjust the learning:

agent (Agent): the learning agent
n_episodes (int): the number of episodes for training
epsilon_initial (float): initial value for the probability that a random action will be chosen
epsilon_final (float): final epsilon
epsilon_rate (float): rate metric for decreasing epsilon
gamma_initial (float): initial gamma discount factor for the Q learning
gamma_rate (float): rate metric for increasing gamma up to the final value (1 - 1 / state_size)
beta_initial (float): initial exponent for determining the importance-sampling weights for prioritized experience replay
beta_rate (float): rate metric for increasing beta to 1

The training is performed by looping through multiple episodes for the environment.

For each episode

  1. Get an action
  2. Send the action to the environment
  3. Get the next state
  4. Get the reward
  5. Add the state, action, reward, and next state to the experience replay buffer. Add the error as well for prioritized experience replay
  6. Every 4 time steps in the environment, randomly sample the experience replay buffer and perform a learning step (I also tried intervals of 8 and 16). If soft updating is used, the target network weights are blended in at every learning step. If fixed Q targets are used, the target network weights are only updated at the separate interval specified in the parameters.
  7. The reward is added to the score
  8. Update the state to the next state and loop back to step 1.

This is implemented in section 3 of the Jupyter notebook, Navigation_SNH.ipynb, as dqn. Running dqn returns the scores from all of the episodes.
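
A hedged sketch of one such episode, following the numbered steps above (the env_info field names assume the ML-Agents version used in the course, and the agent.act / agent.step method names and max_t value are assumptions):

```python
def run_episode(agent, env, brain_name, epsilon, max_t=1000):
    env_info = env.reset(train_mode=True)[brain_name]
    state = env_info.vector_observations[0]
    score = 0
    for t in range(max_t):
        action = agent.act(state, epsilon)                    # 1. get an action
        env_info = env.step(action)[brain_name]               # 2. send it to the environment
        next_state = env_info.vector_observations[0]          # 3. get the next state
        reward = env_info.rewards[0]                          # 4. get the reward
        done = env_info.local_done[0]
        agent.step(state, action, reward, next_state, done)   # 5-6. store, and learn every 4 steps
        score += reward                                       # 7. add the reward to the score
        state = next_state                                    # 8. move to the next state
        if done:
            break
    return score
```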

Results and Plots of Rewards

As I stated, I had very little difficulty achieving an average score of 13 but struggled to get much better. My best average scores were just over 18, although individual scores were sometimes in the mid twenties. There was a lot of variability, and individual single-digit scores pulled down the averages. I did not perform a systematic exploration of all the parameters, although I varied all of them to some degree. After multiple runs, there was no obvious improvement obtained by limiting parameters (gradients, errors, etc.) or with batch normalization, so I set these flags to false. I decided on soft updating rather than fixed Q targets because of better initial results. Double DQN and dueling Q networks appeared to improve the learning obtained by the agent. Prioritized experience replay sometimes seemed helpful when I tried the same parameters with and without it. However, my results were not always consistent, and any improvement was at most minimal. During learning, the initial improvement in scores was variable and easier to influence than the final scores. I noticed that faster initial improvement did not equate to a better steady-state final score. The number of episodes needed to achieve an average score of 13 varied from 300 to 1200.

The run plotted below used the following parameters:

agent = Agent(state_size=state_size, action_size=action_size, seed=0, batch_normalize=False, error_clipping=False, reward_clipping=False, gradient_clipping=False, double_dqn=True, prioritized_replay=True, dueling_dqn=True, learning_rate=0.0005)

scores = dqn(agent, n_episodes=5000, epsilon_initial=1.0, epsilon_final=0.01, epsilon_rate=0.005, gamma_initial=0.90, gamma_final=0.99, gamma_rate=0.01, beta_initial=0.4, beta_rate=0.005, tau_initial=0.001, tau_final=0.1, tau_rate=0.00001)

In addition, hard-coded parameters in dqn_agent.py which are not changeable via function parameters include the buffer size of 1e6 for experience replay and the interval of 4 time steps for updating the neural network with data sampled from the buffer.

[Plot of the score for each episode over the 5000 training episodes]

The minimum required score of 13 was surpassed at 600 episodes (average score of 13.78 for episodes 501 to 600 during training). The average score over 100 episodes with no further training, after training for 5000 episodes, was 18.62. The checkpoints for this run were renamed to checkpoint13_buffer1e6_every4_tau.pth and checkpoint_final_buffer1e6_every4_tau.pth.

Ideas for Future Work

The code requires optimization, especially the prioritized replay memory implementation.

Although I attempted an implementation of all of the options described, it is still unclear whether I implemented them correctly, since I was not able to achieve significant improvement. This could be related to a coding error or possibly the nature of this specific environment. More formal testing is necessary.

Other techniques to implement that were mentioned in the course include multi-step bootstrap targets (e.g. A3C), distributional DQN, and noisy DQN, basically the other methods included in the Rainbow algorithm.