
Monte Carlo Tree Search

We implemented Monte Carlo Tree Search (MCTS) from scratch and successfully applied it to the game of Tic-Tac-Toe.

We ran tests with different amounts of time allocated to build the search tree (the thinking time). We also tested several symmetric and asymmetric reward designs.

Summary of the steps of the MCTS model (a code sketch follows the list):

  • At each turn:
    • create a tree whose root corresponds to the current game state; each node is associated with a state of the game
    • then, for a time budget T, repeat the following steps:
      • select a leaf by UCB exploration
      • expand the tree by creating a child node from this leaf
      • evaluate the new state by rollout simulations
      • backpropagate from the leaf to the root, updating the visit count and reward at each node along the path
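Here is a minimal Python sketch of this per-turn loop. It is not the repository's actual code: the game-state methods (`legal_moves`, `apply`) and the helpers `uct_score`, `rollout` and `reward_for` are assumed names; the first two are sketched after the next list, and `reward_for` stands in for whichever reward design is in use.

```python
import time

class Node:
    """One node of the search tree; each node is associated with a game state."""
    def __init__(self, state, parent=None, move=None):
        self.state = state                    # hypothetical game-state object
        self.parent = parent
        self.move = move                      # move leading from the parent here
        self.children = []
        self.untried = state.legal_moves()    # assumed game-state API
        self.visits = 0
        self.reward = 0.0

def mcts_decide(root_state, thinking_time):
    """Build a search tree for `thinking_time` seconds, then pick a move."""
    root = Node(root_state)
    deadline = time.time() + thinking_time
    while time.time() < deadline:
        # 1. selection: descend by UCT while the node is fully expanded
        node = root
        while not node.untried and node.children:
            node = max(node.children, key=uct_score)
        # 2. expansion: create one child node from an untried move
        if node.untried:
            move = node.untried.pop()
            node = Node(node.state.apply(move), parent=node, move=move)
            node.parent.children.append(node)
        # 3. evaluation: rollout simulation from the new state
        outcome = rollout(node.state)
        # 4. backpropagation: update visit count and reward up to the root
        while node is not None:
            node.visits += 1
            node.reward += reward_for(node, outcome)
            node = node.parent
    # Play the move of the most visited child of the root.
    return max(root.children, key=lambda c: c.visits).move
```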

In our MCTS we used the following features:

  • node selection uses the UCB for trees (UCT) formula with c = 1.4:

    UCT(i) = w_i / n_i + c * sqrt( ln(N) / n_i )

    where w_i is the total reward of node i, n_i is its visit count, and N is the visit count of its parent

  • rollout simulations are performed using a uniform random policy
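In Python, these two features could look as follows; this is a sketch with the same assumed game-state API as above (`is_terminal` and `outcome` are also assumed names):

```python
import math
import random

def uct_score(node, c=1.4):
    """UCT value of a node: average reward plus exploration bonus."""
    if node.visits == 0:
        return float('inf')       # always try unvisited children first
    exploit = node.reward / node.visits                                  # w_i / n_i
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)  # c * sqrt(ln(N) / n_i)
    return exploit + explore

def rollout(state):
    """Play the game to the end with a uniform random policy."""
    while not state.is_terminal():
        move = random.choice(state.legal_moves())
        state = state.apply(move)
    return state.outcome()        # e.g. the winning player, or a draw
```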

Results

Here is a demo of two machines using MCTS playing against each other. The allocated thinking time is 0.1 s. Reward designs are symmetric, with reward(win)=3 and reward(draw)=1.

Here is a demo of two machines using MCTS playing against each other. The allocated thinking time is 0.3 s. Reward designs are symmetric, with reward(win)=3 and reward(draw)=1.

In the next figure we show the distribution of outcomes over 100 games, varying the allocated thinking time from 0.01 s to 0.2 s. Reward designs are symmetric, with reward(win)=3 and reward(draw)=1. Player 1 is the player who moves first, while player 2 is the player who moves second.

In the next figure we show the distribution of outcomes over 100 games, varying the allocated thinking time from 0.01 s to 0.2 s. Reward designs are asymmetric, with reward_1(win)=3 and reward_1(draw)=1 for player 1 (who moves first) and reward_2(win)=1 and reward_2(draw)=1 for player 2 (who moves second).

In the next figure we show the distribution of outcomes over 100 games, varying the allocated thinking time from 0.01 s to 0.2 s. Reward designs are symmetric, with reward(win)=1 and reward(draw)=0. Player 1 is the player who moves first, while player 2 is the player who moves second.

In the next figure we show the distribution of outcomes over 100 games, varying the allocated thinking time from 0.01 s to 0.2 s. Reward designs are asymmetric, with reward_1(win)=1 and reward_1(draw)=0 for player 1 (who moves first) and reward_2(win)=1 and reward_2(draw)=1 for player 2 (who moves second).
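For reference, the four reward designs used above can be summarized as per-player tables. The sketch below is purely illustrative; the names are assumptions rather than the repository's actual configuration format, and the loss reward of 0 is also an assumption.

```python
# Symmetric designs: both players share the same reward table.
SYMMETRIC_3_1 = {'win': 3, 'draw': 1, 'loss': 0}   # reward(win)=3, reward(draw)=1
SYMMETRIC_1_0 = {'win': 1, 'draw': 0, 'loss': 0}   # reward(win)=1, reward(draw)=0

# Asymmetric designs: each player optimizes its own table.
ASYMMETRIC_A = {
    'player_1': {'win': 3, 'draw': 1, 'loss': 0},  # player 1 favors wins over draws
    'player_2': {'win': 1, 'draw': 1, 'loss': 0},  # player 2 values a draw as much as a win
}
ASYMMETRIC_B = {
    'player_1': {'win': 1, 'draw': 0, 'loss': 0},  # player 1 rewards only wins
    'player_2': {'win': 1, 'draw': 1, 'loss': 0},
}
```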

Conclusion

From the previous plots we can make several observations:

  • The more thinking time, the higher the chance that the game ends in a draw.

  • The more thinking time, the lower the chance that player 2 wins.

  • The performance of a reward design depends on the thinking time.

  • Aggressive play (rewarding only wins, with reward(draw)=0) is optimal neither for player 1 nor for player 2.

  • Playing too safe (valuing a draw as much as a win) is not optimal for player 2; it is more effective to aim for victory, confirming the popular adage that "attack is the best defense".

  • The reward design with reward(win)=3 and reward(draw)=1 seems to produce solid play for both player 1 and player 2.
