google-deepmind/diplomacy

This directory contains code to let agents from Learning to Play No Press Diplomacy with Best Response Policy Iteration (Anthony et al 2020) play Diplomacy.

The code provided here, paired with a Diplomacy environment and adjudicator, can be used to evaluate our agents, and generate game trajectories.

A Diplomacy environment/adjudicator is required to play games; the specification for this module is given by the protocol in environment/diplomacy_state.py.

This README describes the required observations and action space, and the tests that confirm the environment and agent are working correctly.

Implementation Details

Action Space

In Diplomacy, each turn a player must choose actions for each of their units.

A unit-action always has an order type (such as move or support) and a source area (where the unit currently is), and usually has a target area (e.g. the destination of a move). Support-move and convoy order types have a third area: the location of the unit receiving support or being convoyed.

The unit-actions are represented by a 64-bit integer. Bits 0-31 represent ORDER|ORDERED AREA|TARGET AREA|THIRD AREA (each of these takes up to 8 bits). Bits 32-47 are always 0. Bits 48-63 record the action's index into POSSIBLE_ACTIONS.
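
For illustration, here is a minimal decoding sketch. It assumes each field occupies a full byte and that the ORDER|ORDERED AREA|TARGET AREA|THIRD AREA notation is read most-significant-byte first within bits 0-31; the authoritative bit offsets are the constants in environment/action_utils.py.

```python
# A sketch under assumed bit offsets; see environment/action_utils.py for
# the real constants.

def decode_action(action: int):
  """Splits a 64-bit unit-action into its fields (assumed layout)."""
  order = (action >> 24) & 0xFF         # order type, e.g. move or support
  ordered_area = (action >> 16) & 0xFF  # area of the ordered unit
  target_area = (action >> 8) & 0xFF    # e.g. destination of a move
  third_area = action & 0xFF            # support-move / convoy third area
  index = (action >> 48) & 0xFFFF       # index into POSSIBLE_ACTIONS
  return order, ordered_area, target_area, third_area, index
```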

The different order codes are constants that can be found in environment/action_utils.py.

The 8-bit representation of the areas in the action are as follows:

  • The first 7 bits identify the province. The ids of each province are given by calling province_order.province_name_to_id()

  • The last bit is a coast flag identifying which coast of a bi-coastal province is being referred to: it is 1 for the South Coast area, and 0 for the main area, for single-coasted provinces, and for the North/East coast of a bi-coastal province.

(Note: elsewhere in the code areas are represented as a (province_id, coast_id) tuple, where coast_id is 0 for the main area and 1 or 2 for the two coasts, or as a single area_id from 0 to 80.)
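
A minimal sketch of this byte layout, assuming "first 7 bits" means the high bits of the byte and "last bit" the low bit; environment/action_utils.py has the real masks, and province_order.province_name_to_id() remains the authoritative source of province ids.

```python
def split_area_byte(area_byte: int):
  """Splits an 8-bit action area into (province_id, south_coast_flag)."""
  province_id = area_byte >> 1  # 7-bit province id (assumed high bits)
  coast_flag = area_byte & 1    # 1 only for a South Coast area
  return province_id, coast_flag

# Elsewhere, the same area may appear as a (province_id, coast_id) tuple
# (coast_id 0 for the main area, 1 or 2 for the coasts) or as a single
# area_id in [0, 80].
```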

Bits 0-31 make the meaning of an action easy to calculate. The file environment/action_utils.py includes several functions for parsing unit-actions. The file environment/human_readable_actions.py converts the integer actions into a human-readable format.

The indexing part of the action representation is used to convert between the one-hot output of a neural network and the interpretable action representation.

Not all syntactically correct unit-actions are possible in Diplomacy; for instance, Army Paris Move to Berlin is never legal because Berlin is not adjacent to Paris. The list of actions in environment/action_list.py contains all actions that could ever be legal in a game of Diplomacy. This list allows the full 64-bit action to be recovered from the action's index.

The file environment/mila_actions.py contains functions to convert between the action format used by this codebase (hereafter DM actions) and the action format used by Paquette et al. (MILA actions).

These mappings are not one-to-one, for a few reasons:

  • MILA actions do not distinguish between disbanding a unit in a retreats phase and disbanding during the builds phase; DM actions do.

  • MILA actions specify the unit type (fleet/army), and the coast it occupies, when referring to units on the board. DM actions specify these details only for build actions; in all other circumstances the province uniquely identifies the unit given the context of the board state.

  • Paquette et al. disallowed long convoys, and some convoy orders that are always irrelevant to the adjudication.

For converting from MILA actions to DM actions, the function mila_action_to_action gives a one-to-one conversion by taking the current season (an environment/observation_utils.Season) as additional context.

When converting from DM actions to MILA actions, the function action_to_mila_actions returns a list of up to 6 possible MILA actions. Given a state, at most one of these actions is legal; which one can be inferred by checking the game state.
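
A hedged usage sketch of these helpers follows. Here game_state.is_legal is a hypothetical stand-in for whatever legality check the user's adjudicator provides (it is not part of this codebase), and season is an environment/observation_utils.Season.

```python
from diplomacy.environment import mila_actions


def dm_to_legal_mila(dm_action, game_state):
  """Returns the single legal MILA string for a DM action, if any."""
  candidates = mila_actions.action_to_mila_actions(dm_action)  # up to 6
  legal = [a for a in candidates if game_state.is_legal(a)]    # at most 1
  return legal[0] if legal else None


def mila_to_dm(mila_action, season):
  """One-to-one in the other direction, given the current season."""
  return mila_actions.mila_action_to_action(mila_action, season)
```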

Observations

The observation format is defined in observation_utils.Observation. It is a named tuple of:

season: One of observation_utils.Season

board: An array of shape (observation_utils.NUM_AREAS, utils.PROVINCE_VECTOR_LENGTH). The areas are ordered by their AreaID as given by province_order.province_name_to_id(province_order.MapMDF.BICOASTAL_MAP). The vector representing a single area is, in order:

  • 3 flags representing the presence of an army, a fleet or an empty province respectively
  • 7 flags representing the owner of the unit, plus an 8th that is true if there is no such unit
  • 1 flag representing whether a unit can be built in the province
  • 1 flag representing whether a unit can be removed from the province
  • 3 flags representing the existence of a dislodged army or fleet, or no dislodged unit
  • 7 flags representing the owner of the dislodged unit, plus an 8th that is true if there is no such unit
  • 3 flags representing whether the area is a land, sea or coast area of a bicoastal province. These are mutually exclusive: a land area is any area an army can occupy, which includes e.g. StP but does not include StP/NC or StP/SC.
  • 7 flags representing the owner of the supply centre in the province, plus an 8th representing an unowned supply centre. The 8th flag is false if there is no SC in the area
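
As a decoding aid, here is a sketch of named slices over one area vector. The offsets are inferred from the flag ordering above (3+8+1+1+3+8+3+8 = 35 flags, which should match utils.PROVINCE_VECTOR_LENGTH); they are assumptions, not constants exported by the codebase. The power ordering is the alphabetical one given below.

```python
# Assumed offsets inferred from the flag ordering above; not codebase constants.
POWERS = ('Austria', 'England', 'France', 'Germany', 'Italy', 'Russia', 'Turkey')

UNIT_TYPE = slice(0, 3)          # army / fleet / empty province
UNIT_OWNER = slice(3, 11)        # 7 powers + "no unit" flag
BUILDABLE = 11                   # a unit can be built here
REMOVABLE = 12                   # a unit can be removed from here
DISLODGED_TYPE = slice(13, 16)   # dislodged army / fleet / none
DISLODGED_OWNER = slice(16, 24)  # 7 powers + "no dislodged unit" flag
AREA_TYPE = slice(24, 27)        # land / sea / coast of a bicoastal province
SC_OWNER = slice(27, 35)         # 7 powers + "unowned supply centre" flag


def unit_owner(area_vector):
  """Returns the owning power's name, or None if the area has no unit."""
  flags = area_vector[UNIT_OWNER]
  for i, name in enumerate(POWERS):
    if flags[i]:
      return name
  return None  # the 8th flag is set: no unit here
```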

build_numbers: In build phases, this is a vector of length 7 saying how many units each player may build (positive values) or must remove (negative values). It is the number of units the player can actually build: for example, if a player has 2 fewer units than owned supply centres but only 1 unoccupied home supply centre, then they can only build 1 unit, so the build number is 1.

In non-build phases, the removal counts (negative values) from the previous build phase are retained, but the build counts (positive values) are zeroed out. (This was a bug in the observations; it should be reproduced because the agents were trained on such observations.)
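
The build-number rule above can be summarised in a few lines; in this sketch the three counts are assumed to come from the user's adjudicator.

```python
def build_number(owned_scs: int, units: int, unoccupied_home_scs: int) -> int:
  if owned_scs >= units:
    # Builds are capped by the number of unoccupied home supply centres.
    return min(owned_scs - units, unoccupied_home_scs)
  return owned_scs - units  # negative: number of units that must be removed

# The example above: 2 fewer units than owned SCs, 1 free home SC -> build 1.
assert build_number(owned_scs=5, units=3, unoccupied_home_scs=1) == 1
```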

last_actions: A list of the actions submitted in the last phase of the game. They are in the same order as they were passed to the previous call of the step method, but flattened into a single list.

For the build_numbers, last_actions, and one-hot flags of unit and supply centre owners, the powers are ordered alphabetically: Austria, England, France, Germany, Italy, Russia, Turkey.

Run network test

You can make sure this code runs successfully by using the run.sh script provided. The script will set up a fresh virtual environment, download the appropriate libraries, and then run our tests/network_test.py (see below).

You can also do these steps manually using the following commands:

Setup

To set up a python3 virtual environment with the required dependencies, use the following commands, or simply run run.sh.

cd ..
python3 -m venv dip_env
source dip_env/bin/activate
pip3 install --upgrade pip
pip3 install -r diplomacy/requirements.txt

Running a basic smoke test

Use the following command to run basic tests and make sure you have all the required dependencies. See the Tests section below for a more detailed explanation of the tests we provide.

python3 -m diplomacy.tests.network_test

Tests

We provide two test files:

  • tests/network_test.py contains smoke tests that will fail if the network does not produce the correct output shape or format, or is unable to perform a dummy parameter update.

  • tests/observation_test.py tests that the network plays Diplomacy as expected given the parameters we provide, and checks that the user's Diplomacy environment and adjudicator produce the same observations and trajectories as our internal implementation. See below for the steps to run this test.

Running observation_test.py

tests/observation_test.py contains a template test class. To run this test, write a new test class that inherits from ObservationTest. The steps to do this are:

  1. Create a new test class that inherits from ObservationTest (usually in a new file) and add a call to absltest.main() in that file.
  2. Implement the abstract methods of ObservationTest. These are get_parameter_provider, get_reference_observations, get_reference_legal_actions, get_reference_step_outputs, and get_actions_outputs. These methods load the network parameters and the test data files linked below; suggested implementations are included in the comments on ObservationTest.
  3. Add an implementation of the environment.diplomacy_state.DiplomacyState abstract class. The implementation will usually be a wrapper around the user's own Diplomacy adjudicator that converts to the agent's expected action and observation formats. The Observations and Action Space sections of this README document the required behaviour of the diplomacy state, and describe several utilities intended to help with the implementation.
  4. Implement the abstract method ObservationTest.get_diplomacy_state with a call to your implementation of a DiplomacyState. A skeleton for these steps is sketched below.
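
The skeleton below is a sketch, not a definitive recipe: the file paths are placeholders, MyDiplomacyState stands for the user's own wrapper from step 3, and the loading code follows the suggested implementations in the comments on ObservationTest (adjust to the actual file formats).

```python
import dill
from absl.testing import absltest

from diplomacy.environment import diplomacy_state
from diplomacy.network import parameter_provider
from diplomacy.tests import observation_test


class MyDiplomacyState:
  """Placeholder for the user's DiplomacyState wrapper (step 3)."""


def _load(path):
  """Loads one of the reference trajectory files linked below."""
  with open(path, 'rb') as f:
    return dill.load(f)


class MyObservationTest(observation_test.ObservationTest):

  def get_parameter_provider(self):
    with open('path/to/sl_params.npz', 'rb') as f:
      return parameter_provider.ParameterProvider(f)

  def get_reference_observations(self):
    return _load('path/to/observations.npz')

  def get_reference_legal_actions(self):
    return _load('path/to/legal_actions.npz')

  def get_reference_step_outputs(self):
    return _load('path/to/step_outputs.npz')

  def get_actions_outputs(self):
    return _load('path/to/actions_outputs.npz')

  def get_diplomacy_state(self) -> diplomacy_state.DiplomacyState:
    return MyDiplomacyState()


if __name__ == '__main__':
  absltest.main()
```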

If the implementation of the DiplomacyState is incorrect, both test methods, test_fixed_play and test_network_play, will fail. If the DiplomacyState implementation is correct but the network is not behaving correctly, then test_fixed_play will pass and only test_network_play will fail.

Running the trained agents

Once both ObservationTest test methods pass, code similar to the first lines of the method test_network_play can be written to load the trained networks as a network.network_policy.Policy. The Policy has an actions method that produces actions. In order to behave correctly, the actions method must be called every turn of the game, in order, starting from Spring 1901. If phases are missed, the agent will not be able to construct the network input correctly, as it depends on the observations from several consecutive phases.
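
The loop below sketches this, assuming the DiplomacyState protocol in environment/diplomacy_state.py (is_terminal / observation / legal_actions / step) and the Policy.actions interface used in tests/observation_test.py; check those files for the exact signatures. network_handler and MyDiplomacyState are placeholders for the user's own setup.

```python
from diplomacy.network import network_policy

network_handler = ...  # built from the downloaded parameters, as in the
                       # first lines of test_network_play
policy = network_policy.Policy(
    network_handler=network_handler,
    num_players=7,
    temperature=0.1)

state = MyDiplomacyState()  # the user's DiplomacyState implementation
slots = list(range(7))      # here a single policy plays all seven powers

while not state.is_terminal():
  # actions() must be called every phase, in order, from Spring 1901 onwards,
  # because the network input spans several consecutive phases.
  actions, _ = policy.actions(slots, state.observation(), state.legal_actions())
  state.step(actions)
```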

Download parameters and test trajectories

We provide network parameters for the SL and FPPI-2 training schemes (see Learning to Play No Press Diplomacy with Best Response Policy Iteration (Anthony et al 2020)).

We further provide trajectories generated with the SL parameters and our internal Diplomacy environment and adjudicator. This lets users verify that the network plays Diplomacy as expected, and that their environment and adjudicator match the behaviour of our internal ones, using the tests described above.

| Type | Description | Link |
| --- | --- | --- |
| Parameters | Supervised Imitation Learning (SL) | download |
| Parameters | Fictitious Play Policy Iteration 2 (FPPI-2) | download |
| Trajectory | Observations | download |
| Trajectory | Legal Actions | download |
| Trajectory | Step Outputs | download |
| Trajectory | Action Outputs | download |

Citing

Please cite Learning to Play No Press Diplomacy with Best Response Policy Iteration (Anthony et al 2020):

@misc{anthony2020learning,
  title={Learning to Play No-Press Diplomacy with Best Response Policy Iteration},
  author={Thomas Anthony and Tom Eccles and Andrea Tacchetti and János Kramár
    and Ian Gemp and Thomas C. Hudson and Nicolas Porcel and Marc Lanctot and
    Julien Pérolat and Richard Everett and Roman Werpachowski and Satinder Singh
    and Thore Graepel and Yoram Bachrach},
  year={2020},
  eprint={2006.04635},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

Disclaimer

This is not an official Google product.
