Genetic Breakout

Overview

Breakout is a classic game created by Atari in 1976. The goal of the game is to use a Pong-like ball and paddle to hit bricks at the top of the screen and break them. Breaking a brick awards a point. Once all bricks are broken, the game advances to the next screen. In this version of the game, each new screen adds an additional row of bricks to increase the difficulty. Once 15 levels are completed, the game ends.

The objective of this project is to use a genetic algorithm to train a neural network (the agent) to achieve a perfect score for the game (1800 points). I will use a reinforcement learning approach that "rewards" the best agents by preferring them as parents when the next generation is created.

The Genetic Algorithm

A genetic algorithm will be used in this implementation of reinforcement learning. Genetic algorithms typically have four main subroutines:

  1. Determine Fitness of Agents
  2. Selection
  3. Crossover
  4. Mutation
These steps ensure that the agent pool stays diverse, so new approaches are tried and the agents do not become stuck in local optima.
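A minimal sketch of how these four subroutines might fit together in a generational loop (the function names here are illustrative, not the repository's actual API):

```python
import numpy as np

def evolve(population, generations, fitness_fn):
    """Illustrative generational loop tying the four subroutines together."""
    for _ in range(generations):
        # 1. Determine the fitness of every agent
        fitness = np.array([fitness_fn(agent) for agent in population])

        # 2-4. Build the next generation from selected parents
        next_population = []
        for _ in range(len(population)):
            parent_a, parent_b = select(population, fitness)  # selection
            child = crossover(parent_a, parent_b)             # crossover
            mutate(child)                                     # mutation
            next_population.append(child)
        population = next_population
    return population
```

The `select`, `crossover`, and `mutate` helpers are sketched in the sections below.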

Determine Fitness of Agents

Fitness is a measure of how well an agent performs a desired task. If the fitness function does not accurately reward the desired behavior, agents may learn slowly or fail to progress at all. To determine fitness, all agents in a generation play the game, and each agent's fitness is the score it achieves. If an agent goes too long without breaking a brick, it loses a life, ensuring there is no incentive for agents to enter infinite loops.
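A sketch of what such a fitness evaluation could look like; `BreakoutGame`, its methods, and the idle-frame limit are hypothetical stand-ins for the game interface in this repository:

```python
def play_and_score(agent, max_idle_frames=600):
    """Play one headless game and return the score as the agent's fitness."""
    game = BreakoutGame(headless=True)       # hypothetical game interface
    idle_frames = 0
    while not game.over:
        action = agent.act(game.state())     # forward pass, see the Layers section
        broke_brick = game.step(action)
        idle_frames = 0 if broke_brick else idle_frames + 1
        if idle_frames > max_idle_frames:    # stuck too long without breaking a brick:
            game.lose_life()                 # lose a life so stalling is never rewarded
            idle_frames = 0
    return game.score
```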

Selection

Once fitness is determined, agents are selected stochastically, weighted by their fitness. Agents with higher fitness are chosen more often, and the same agent can be selected more than once.
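With NumPy, this weighted, with-replacement selection can be written compactly. A possible version, assuming fitness is already a NumPy array:

```python
import numpy as np

def select(population, fitness):
    """Pick two parents at random, weighted by fitness, with replacement."""
    probabilities = fitness / fitness.sum()
    # replace=True means the same agent can be drawn for both parent slots
    i, j = np.random.choice(len(population), size=2, replace=True, p=probabilities)
    return population[i], population[j]
```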

Crossover

Once two agents (parents) are selected, crossover occurs. Each parameter (weight or bias) has a 50% chance of being taken from either parent. The resulting new agent (child) therefore has roughly half of its parameters from one parent and roughly half from the other.
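A sketch of this uniform crossover, assuming each agent is represented as a list of NumPy weight and bias arrays (the actual representation in the repository may differ):

```python
import numpy as np

def crossover(parent_a, parent_b):
    """Build a child by taking each weight/bias from either parent with 50% probability."""
    child = []
    for layer_a, layer_b in zip(parent_a, parent_b):
        mask = np.random.rand(*layer_a.shape) < 0.5   # True -> take the value from parent A
        child.append(np.where(mask, layer_a, layer_b))
    return child
```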

Mutation

The new agent (child) then undergoes mutation. A small portion of the weights and biases is chosen at random and slightly perturbed.
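One way to express this, using the mutation rate and scale listed under Hyper-Parameters; the use of Gaussian noise is an assumption, and the repository may perturb parameters differently:

```python
import numpy as np

MUTATION_RATE = 0.25   # fraction of parameters altered (see Hyper-Parameters)
MUTATION_SCALE = 0.10  # size of the random nudge applied to each chosen parameter

def mutate(child):
    """Randomly perturb a small portion of the child's weights and biases in place."""
    for layer in child:
        mask = np.random.rand(*layer.shape) < MUTATION_RATE
        layer += mask * np.random.randn(*layer.shape) * MUTATION_SCALE
```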

Implementation

Libraries

  • NumPy: Vectorized calculations speed up hypothesis processing, greatly reducing training time
  • Pickle: Serialize and save the agents for later use
  • Pandas: Export log data to data frames
  • Pygame: Create a basic Breakout-like game
  • concurrent.futures: Enable multi-processing so multiple agents can play at the same time during training

Layers

Input layer consists of 25 nodes:

  • Paddle X Position
  • Ball X Position
  • Ball Y Position
  • Ball X Vector
  • Ball Y Vector
  • Count of active blocks in each row (20 values)

Hidden layer has 16 nodes using ReLU activations.
Output layer has 3 nodes with Softmax activations:

  • Move Paddle Left
  • Move Paddle Right
  • Do Nothing

For each clock cycle (frame) of the game, the input values are propagated through the agent network. The node with the highest activation in the output layer is used as the control input for the game.
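A sketch of that forward pass with NumPy; the array shapes follow the layer sizes above, but the parameter names and the ordering of the three output actions are assumptions:

```python
import numpy as np

def act(inputs, w1, b1, w2, b2):
    """Forward pass: 25 inputs -> 16 ReLU hidden nodes -> 3 softmax outputs."""
    hidden = np.maximum(0, inputs @ w1 + b1)   # (25,) @ (25, 16) -> (16,)
    logits = hidden @ w2 + b2                  # (16,) @ (16, 3)  -> (3,)
    exp = np.exp(logits - logits.max())        # numerically stable softmax
    probs = exp / exp.sum()
    return int(np.argmax(probs))               # e.g. 0: left, 1: right, 2: do nothing
```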

Training

256 agents are tested for fitness in each generation. To speed up training, multiple agents play the game in parallel by leveraging multi-processing.
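A sketch of how this parallel evaluation might look with concurrent.futures, reusing the hypothetical play_and_score routine from the fitness section:

```python
from concurrent.futures import ProcessPoolExecutor

def evaluate_generation(agents):
    """Run one headless game per agent, spread across multiple processes."""
    # Agents must be picklable to cross process boundaries (hence the use of Pickle).
    # On platforms that spawn workers, call this under an `if __name__ == "__main__":` guard.
    with ProcessPoolExecutor() as executor:
        scores = list(executor.map(play_and_score, agents))
    return scores
```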

Because many games run simultaneously, rendering the graphics would tax system resources. To further speed up training, the games are run in headless mode with graphics and the interface disabled so that CPU resources can be used more efficiently.

Hyper-Parameters

  • Fitness Function: Score^2. Squaring the score causes agents that perform slightly better to have a much larger probability of being selected than their close competitors, which helps keep the agent pool healthy (see the short illustration after this list)
  • Mutation Rate: 25% of the weights and biases will be altered on an agent during the mutation step
  • Mutation Scale: 0.10 will cause a relatively small change to the weights and biases that are chosen randomly for mutation
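As a quick illustration of the effect of squaring (the scores here are made up, not training data):

```python
import numpy as np

scores = np.array([100.0, 110.0])          # two agents, 10% apart in raw score
linear = scores / scores.sum()             # ~[0.48, 0.52] selection probability
squared = scores**2 / (scores**2).sum()    # ~[0.45, 0.55] -- the gap roughly doubles
```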

Results

Optimization was slow for the first 44 generations while the agents learned how to clear the first screen. Once that hurdle was overcome, they were able to generalize to later levels and the rate of improvement spiked.

Top agent scores fluctuate during training, but the mean score of the population continues to increase. There is a breakthrough around generation 60, and the agents reach a perfect score at generation 72.

Trained Agent Playback

The video below shows playback of the trained agent successfully completing level 15 and achieving the maximum score of 1800 without losing any lives.

Trained.Agent.mp4

Playback is not sped up; the framerate is bottlenecked mainly by the hypothesis calculation time of the agent.

Conclusions

The goal of the project was achieved! An agent capable of scoring a perfect game was trained in 72 generations; however, the agent does not play optimally. There are long periods without scoring any points while it hunts for the last few bricks when the screen is almost clear. There is room for improvement.

Possible Enhancements

  • Alter the fitness function: The current fitness function is based only on score. The time limit imposed between scoring events indirectly discourages wasting time, but a bonus for fast screen clears would likely improve performance
  • More inputs: The agent's knowledge of the game state is limited. It only knows the number of active blocks in each row; it does not know which columns the blocks are in. Increasing the number of inputs would likely help the network, but would also greatly increase training time since there would be many more weights and biases
  • More hidden layers or nodes: A single 16-node hidden layer is quite shallow. Adding complexity may allow the agents to learn more sophisticated functions; however, this would also greatly increase training time
  • Memory: The agent is only aware of the current state of the game and has no knowledge of previous states or actions it has taken. Providing short-term memory so the agent has context for what it was doing in previous frames may also help it make better decisions

File Descriptions

  • Breakout.py: This is the human-playable game. The agents make it look easy; give it a go yourself!
  • Train.py: This is the main training script. It will train a new batch of agents. Results will be printed to the console and logged
  • Play Agent.py: This will launch the game under control of one of the trained agents. Graphics are enabled so you can watch it play
  • Training Results.ipynb: Outputs the log to a graph so you can have a visual reference of training performance. Can be accessed while training is in progress