
The parent malt2 repo. Contains dstorm/orm module for malt2. Clone this recursively to get all required sub-modules.





MALT-2: Distributed Data-Parallel Learning for Torch

Please refer to our paper that describes this distributed data-parallel learning framework.


MALT-2 is a distributed data-parallel machine learning system for Torch.

MALT-2 is an ML parallelization framework that can parallelize any existing ML application. The system is designed to be simple to use and easy to extend, while maintaining efficiency and state-of-the-art accuracy.

  • Easy to add to existing code: the general-purpose interface requires only changing the optimization type to dstsgd (distributed SGD).
  • Support for multi-machine, multi-GPU training with CUDA implementations for distributed parameter averaging.
  • Includes C++ and Lua interfaces to extend existing code. Support for Torch and NEC MiLDE.
  • Easily extend your existing Torch code with minimal changes.
  • Explore existing distributed GPU apps over ResNets and large language models.
  • Various optimizations, such as sparse-reduce and NOTIFY_ACK, to accelerate distributed model training.

Building MALT with Torch



Install Torch, MPI, Boost and CUDA (if using GPU).

Follow the Torch, CUDA and Boost websites to install the respective packages. For Open MPI, follow the instructions below to install MPI with CUDA support.

tar xfj openmpi-2.1.2.tar.bz2
cd openmpi-2.1.2; mkdir build; cd build
../configure --prefix=$HOME/usr --enable-mpi-cxx --enable-shared --enable-mpi-thread-multiple --enable-mpi-ext=affinity,cuda --with-cuda=/usr/local/cuda
make -j 8 all
make install

Note: The same instructions apply to openmpi-3.0.0.tar.bz2, except that --enable-mpi-thread-multiple must be removed from the configure line.
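
Before building MALT-2 on top of this installation, it may be worth confirming that the resulting Open MPI is actually CUDA-aware. The sketch below is not part of MALT-2. It assumes that `ompi_info` from the new installation is on `PATH` and relies on the `mpi_built_with_cuda_support` parameter that Open MPI reports; the helper name `mpi_has_cuda` is hypothetical.

```shell
# Reads `ompi_info --parsable --all` output on stdin and succeeds only if the
# build reports CUDA support. mpi_has_cuda is a hypothetical helper name.
mpi_has_cuda() {
    grep -q 'mpi_built_with_cuda_support:value:true'
}

# Run the check only when ompi_info is available on PATH.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info --parsable --all | mpi_has_cuda; then
        echo "Open MPI was built with CUDA support"
    else
        echo "Open MPI was NOT built with CUDA support"
    fi
fi
```

If the check reports no CUDA support, the `--with-cuda` path passed to configure is a likely place to look first.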

  • Check out the latest version of MALT-2 from GitHub:
git clone --recursive

Setup the environment variables

Source your torch/cuda/MKL environment:

On some machines, you might need something like the following (MKL is optional):

source [torch-dir]/install/bin/torch-activate
source /opt/intel/mkl/bin/intel64/ intel64

If using modules, you can try:

module load icc cuda80 luajit
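
Once the environment is sourced, a quick check that the tools used by the build steps are reachable can save a failed build. This is a minimal sketch, not something shipped with MALT-2: `missing_tools` is a hypothetical helper, and the list of commands is an assumption based on the steps in this README (drop `nvcc` for a CPU-only build).

```shell
# Print each required command that is not found on PATH, one per line.
# missing_tools is a hypothetical helper, not part of MALT-2.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Commands the build and test steps in this README invoke (assumed list).
missing=$(missing_tools th luarocks mpirun nvcc)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
fi
```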

To build everything including dstorm, orm and torch, just type from the top-level directory:


Component-wise build

To build component-wise (not required if using make above):

To build the dstorm directory, run:

cd dstorm
./ GPU test

You should get a SUCCESS as the output. Check the log files to ensure the build is successful.

The general format is:

./ <type> 

where <type> selects the build variant, for example MPI (liborm + mpi) or GPU (liborm + mpi + gpu). As a side effect, the build creates ../dstorm-env.{mk|cmake} environment files so that the Lua capabilities can match the libdstorm compile options.

Build the orm

cd orm
./ GPU

Build the Torch packages. With the Torch environment set up, install the malt-2 and dstoptim (distributed optimization) packages:

cd dstorm/src/torch
rm -rf build && VERBOSE=7 luarocks make malt-2-scm-1.rockspec >& mk.log && echo YAY #build and install the malt-2 package
cd dstoptim
rm -rf build && VERBOSE=7 luarocks make dstoptim-scm-1.rockspec >&mk.log && echo YAY # build the dstoptim package


  • A very basic test is to run th and then try, by hand,
require "malt2"

Run a quick test.

  • With MPI, you need to run via mpirun, for example:
mpirun -np 2 `which th` `pwd -P`/test.lua mpi 2>&1 | tee test-mpi.log
  • With GPU:
mpirun -np 2 `which th` `pwd -P`/test.lua gpu 2>&1 | tee test-GPU-gpu.log
  • NEW: a WITH_GPU build can also run with the MPI transport:
mpirun -np 2 `which th` `pwd -P`/test.lua mpi 2>&1 | tee test-GPU-mpi.log

The default transport is the "highest" one built into libdstorm2: GPU > MPI > SHM

mpirun -np 2 `which th` `pwd -P`/test.lua 2>&1 | tee test-best.log
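
The priority rule for the default transport can be illustrated with a few lines of shell. This is only a sketch of the ordering GPU > MPI > SHM, not the selection code inside libdstorm2; the function name `best_transport` and the lowercase spelling `shm` are assumptions made for the example.

```shell
# Print the highest-priority transport among those given, using the order
# GPU > MPI > SHM. best_transport is a hypothetical name for illustration.
best_transport() {
    for candidate in gpu mpi shm; do
        for built in "$@"; do
            if [ "$built" = "$candidate" ]; then
                echo "$candidate"
                return 0
            fi
        done
    done
    return 1   # none of the known transports was listed
}

best_transport shm mpi        # prints "mpi"
best_transport mpi gpu shm    # prints "gpu"
```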

Running over multiple GPUs.

  • MPI only sees the hostname. By default, on every host, the MPI processes enumerate the GPUs and start running on them. The only way to change this and run on other GPUs in a round-robin fashion is to change the enumeration for every rank using CUDA_VISIBLE_DEVICES. An example script is provided in the top-level directory.

  • To run:

mpirun -np 2 ./ `which th` `pwd`/test.lua

This script assigns available GPUs in a round-robin fashion. Since MPI requires visibility of all other GPUs to correctly access shared memory, this script only changes the enumeration order and does not restrict visibility.
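
The idea of reordering rather than restricting can be sketched as a small wrapper. This is not the script shipped with MALT-2; it is a hypothetical illustration that assumes Open MPI (which sets `OMPI_COMM_WORLD_LOCAL_RANK` for each process on a host) and counts devices with `nvidia-smi -L`. The `NUM_GPUS` override and the function name `rotated_gpu_list` are made up for the example.

```shell
#!/bin/sh
# Hypothetical wrapper: rotate the GPU enumeration order per local MPI rank
# so each rank sees a different "device 0", without hiding any device.

# Print a comma-separated device list that starts at (rank mod ngpus).
rotated_gpu_list() {
    rank=$1; ngpus=$2; list=""; i=0
    while [ "$i" -lt "$ngpus" ]; do
        list="${list}$(( (rank + i) % ngpus )),"
        i=$((i + 1))
    done
    echo "${list%,}"   # drop the trailing comma
}

# Only act when a command to launch is given on the command line.
if [ "$#" -gt 0 ]; then
    # Rank of this process among the processes on the same host (Open MPI).
    local_rank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
    # NUM_GPUS is an assumed override; otherwise count `nvidia-smi -L` lines.
    ngpus=${NUM_GPUS:-$(nvidia-smi -L | wc -l | tr -d ' ')}
    CUDA_VISIBLE_DEVICES=$(rotated_gpu_list "$local_rank" "$ngpus")
    export CUDA_VISIBLE_DEVICES
    exec "$@"
fi
```

With four GPUs, local rank 1 would get `CUDA_VISIBLE_DEVICES=1,2,3,0`: every device remains visible, but the rank's default device becomes physical GPU 1.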


Now we can run simple Torch demos such as distributed linear regression or ImageNet.

Clone the tutorials repo:

git clone

Run the individual tutorials as described in the README in each sub-directory. A general launch script is provided, along with an additional script that distributes MPI processes over different GPUs.

