Baechi: Fast Device Placement on Machine Learning Graphs (SoCC 2020)

Install dependencies

Install dependencies with Anaconda

$ conda install -y python=3.6 numpy=1.16 tensorflow-gpu=1.12 bazel=0.20.0 \
      networkx future matplotlib cvxopt scikit-learn

Mosek

$ pip install -f https://download.mosek.com/stable/wheel/index.html Mosek==8.1.82

Baechi uses MOSEK as the LP solver for SCT. MOSEK provides free personal academic licenses, which you can request at https://www.mosek.com/products/academic-licenses. Place the license file (mosek.lic) in the $HOME/mosek directory.
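Before running anything that depends on MOSEK, it can help to confirm that the license file is in the expected place. The following is a minimal sketch, assuming the $HOME/mosek location described above; the helper name `find_mosek_license` is hypothetical and is not part of Baechi or MOSEK.

```python
from pathlib import Path


def find_mosek_license(home=None):
    """Return the path to mosek.lic under <home>/mosek, or None if missing.

    Hypothetical helper for illustration; not part of Baechi or MOSEK.
    `home` defaults to the current user's home directory.
    """
    base = Path(home) if home is not None else Path.home()
    candidate = base / "mosek" / "mosek.lic"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    path = find_mosek_license()
    if path is None:
        print("mosek.lic not found under $HOME/mosek")
    else:
        print("Found MOSEK license at {}".format(path))
```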

Example usage

This example generates a placement for a 4-layer GNMT v2 model with a batch size of 128, a maximum sequence length of 40, and a vocabulary size of 30000.

Build a Python program to place operators of an ML model.

$ bazel build :train

Generate profiles.

$ ./bazel-bin/train \
    --costgen \
    --cost_path=/tmp/cost.pkl \
    --optimizer=adam \
    --batch_size=128 \
    --model_name=gnmt_v2 \
    --vocab_size=30000 \
    --max_seq_length=40 \
    --rnn_unit_type=lstm \
    --rnn_units=512 \
    --num_layers=4 \
    --encoder_type=gnmt \
    --num_gpus=4 \
    --residual \
    --colocate_grads_with_ops \
    --only_forward

This generates profiles of the forward pass and stores them at /tmp/cost.pkl.

Generate a communication cost function between GPUs using linear regression.

$ bazel build //utils:communication_benchmark
$ ./bazel-bin/utils/communication_benchmark

This runs a benchmark that transfers tensors of various sizes between GPUs. By default, the benchmark transfers tensors from GPU:0 to GPU:1 with tensor sizes in the range [2⁰, 2²⁹]. When the benchmark finishes, it prints the fitted communication cost function. Pass its two coefficients, comma-separated, as the value of the --comm_cost_coeffs argument for the placement.
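The regression step can be illustrated with an ordinary least-squares line fit. The sketch below is not the code in utils/communication_benchmark; it uses NumPy's `polyfit` on synthetic (made-up) timings, and the function name `fit_comm_cost` is a hypothetical name chosen for this example.

```python
import numpy as np


def fit_comm_cost(sizes, times):
    """Fit times ~= slope * sizes + intercept by least squares.

    Hypothetical helper for illustration; returns (slope, intercept).
    """
    x = np.asarray(sizes, dtype=np.float64)
    y = np.asarray(times, dtype=np.float64)
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(slope), float(intercept)


if __name__ == "__main__":
    # Tensor sizes 2^0 .. 2^29, matching the benchmark's default range.
    sizes = 2.0 ** np.arange(30)
    # Synthetic timings drawn from a known line plus small noise. Real values
    # would come from measured GPU-to-GPU transfers.
    rng = np.random.RandomState(0)
    times = 0.0001754 * sizes + 134.0 + rng.normal(0.0, 1.0, size=sizes.shape)
    slope, intercept = fit_comm_cost(sizes, times)
    print("Communication cost function: {:.7f} x + {:.0f}".format(slope, intercept))
```

A single line fit keeps the cost model to two numbers (a per-unit-size slope and a fixed intercept), which is the shape the --comm_cost_coeffs argument expects.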

An example output would be the following.

...
Communication cost function: 0.0001754 x + 134
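If the benchmark output is captured in a script, the two coefficients can be extracted and turned into the flag value automatically. This is a minimal sketch, assuming the output keeps the "a x + b" format shown above; `comm_cost_flag` is a hypothetical helper, not something the repository provides.

```python
import re

# Matches a line such as "Communication cost function: 0.0001754 x + 134".
_COST_RE = re.compile(
    r"Communication cost function:\s*([-+0-9.eE]+)\s*x\s*\+\s*([-+0-9.eE]+)")


def comm_cost_flag(output):
    """Turn the benchmark's final line into a --comm_cost_coeffs flag string.

    Hypothetical helper; the coefficient text is copied verbatim so no
    precision is lost by reformatting.
    """
    match = _COST_RE.search(output)
    if match is None:
        raise ValueError("no communication cost function found in output")
    slope, intercept = match.group(1), match.group(2)
    return "--comm_cost_coeffs={},{}".format(slope, intercept)


if __name__ == "__main__":
    sample = "...\nCommunication cost function: 0.0001754 x + 134"
    print(comm_cost_flag(sample))  # prints --comm_cost_coeffs=0.0001754,134
```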

Place operators of GNMT v2 and measure average step times.

$ ./bazel-bin/train \
    --cost_path=/tmp/cost.pkl \
    --optimizer=adam \
    --batch_size=128 \
    --model_name=gnmt_v2 \
    --vocab_size=30000 \
    --max_seq_length=40 \
    --rnn_unit_type=lstm \
    --rnn_units=512 \
    --num_layers=4 \
    --encoder_type=gnmt \
    --num_gpus=4 \
    --residual \
    --colocate_grads_with_ops \
    --only_forward \
    --placement_method=m_etf \
    --placer_type=fusion \
    --grouper=coplace \
    --comm_cost_coeffs=0.0001754,134 \
    --memory_fraction=1.0

This places the GNMT v2 operators with m-ETF, using only the forward-pass operators. Once placement finishes, it measures and prints the average step time of the resulting placement.
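To give a feel for what an earliest-task-first placer with a memory limit does, here is a heavily simplified sketch. It is not Baechi's m-ETF implementation: the function `etf_place`, its arguments, and the toy graph are all assumptions made for illustration, and details such as operator grouping, fusion, and Baechi's actual memory accounting are left out.

```python
def etf_place(compute, memory, edges, num_devices, capacity, comm_cost):
    """Greedy earliest-task-first placement with a per-device memory limit.

    compute: dict op -> compute time; memory: dict op -> memory needed;
    edges: list of (src, dst, tensor_size); comm_cost: callable on tensor size.
    Returns (placement, finish) dicts keyed by op. Illustrative only.
    """
    preds = {op: [] for op in compute}
    for src, dst, size in edges:
        preds[dst].append((src, size))

    device_free = [0.0] * num_devices    # time at which each device is idle
    mem_left = [capacity] * num_devices  # remaining memory per device
    placement, finish = {}, {}

    while len(placement) < len(compute):
        best = None  # (start_time, op, device)
        for op in compute:
            # Skip ops already placed or with an unplaced predecessor.
            if op in placement or any(p not in placement for p, _ in preds[op]):
                continue
            for dev in range(num_devices):
                if memory[op] > mem_left[dev]:
                    continue  # op does not fit on this device
                # Inputs from other devices arrive after a transfer delay.
                ready = max(
                    (finish[p] + (0.0 if placement[p] == dev else comm_cost(size))
                     for p, size in preds[op]),
                    default=0.0)
                start = max(ready, device_free[dev])
                if best is None or start < best[0]:
                    best = (start, op, dev)
        if best is None:
            raise RuntimeError("no ready op fits on any device")
        start, op, dev = best
        placement[op] = dev
        finish[op] = start + compute[op]
        device_free[dev] = finish[op]
        mem_left[dev] -= memory[op]
    return placement, finish


if __name__ == "__main__":
    # Toy three-op chain; the cost coefficients reuse the example output above.
    cost = lambda size: 0.0001754 * size + 134
    compute = {"a": 50.0, "b": 50.0, "c": 50.0}
    memory = {"a": 1, "b": 1, "c": 1}
    edges = [("a", "b", 1024), ("b", "c", 1024)]
    print(etf_place(compute, memory, edges, num_devices=2, capacity=2,
                    comm_cost=cost))
```

The greedy rule picks, at every step, the (operator, device) pair that can start soonest, so operators stay together when transfers are expensive and only spill to another device when memory runs out or a device is busy.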

Docker image

A Docker image with all dependencies installed is available.

$ docker pull beomyeol/baechi
$ docker run -it --rm --gpus all beomyeol/baechi /bin/bash

This gives you direct access to the container with all GPUs enabled. You can follow the example usage within the container.

License

University of Illinois/NCSA Open Source License
