
Asynchronous I/O Support for CosmoFlow

This repository improves CosmoFlow by adding support for asynchronous data read operations. CosmoFlow is a parallel deep learning application developed for studying data generated from cosmological N-body dark matter simulations. The CosmoFlow source code is available on both GitHub and MLPerf. The programs in this repository update the CosmoFlow source code by incorporating the LBANN model and parallelizing it using Horovod. The training data files are available at NERSC.

To reduce the cost of reading training data from files and thus improve end-to-end training time, this repository adds an asynchronous I/O module based on the Python multiprocessing package, which allows file reads to overlap with the model-training computation on the GPUs.
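Below is a minimal sketch of that idea, not the actual module in this repository: a background reader process prefetches samples from files into a bounded queue while the main process trains on data that has already arrived, so read time is hidden behind GPU computation. The names reader_proc, read_samples, and train_on are placeholders used only for illustration.

    # Minimal sketch of overlapping file reads with training via multiprocessing.
    # read_samples() and train_on() are placeholders for the real HDF5 read and
    # the real training step; the actual module in this repository differs.
    import multiprocessing as mp
    import time

    def read_samples(path):
        time.sleep(0.1)        # stands in for reading one training file
        return path

    def train_on(samples):
        time.sleep(0.1)        # stands in for one round of training on the GPU

    def reader_proc(file_list, queue):
        # Producer: read files in the background and hand samples to the trainer.
        for path in file_list:
            queue.put(read_samples(path))   # blocks when the buffer is full
        queue.put(None)                     # sentinel: no more data

    def train(file_list, buffer_size=2):
        queue = mp.Queue(maxsize=buffer_size)   # bounded buffer between I/O and compute
        reader = mp.Process(target=reader_proc, args=(file_list, queue))
        reader.start()
        while True:
            samples = queue.get()               # prefetched in the background
            if samples is None:
                break
            train_on(samples)                   # overlaps with the next file read
        reader.join()

    if __name__ == "__main__":
        train([f"file_{i}" for i in range(8)])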

Software Requirements

  • TensorFlow > 2.0.0 (2.2.0 is recommended)
  • Horovod > 0.16
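
A quick way to check an environment against these requirements (assuming both packages are installed in the active Python environment):

    # Print the installed versions to compare against the requirements above.
    import tensorflow as tf
    import horovod

    print("TensorFlow:", tf.__version__)    # expect > 2.0.0 (2.2.0 recommended)
    print("Horovod:", horovod.__version__)  # expect > 0.16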

Run Instructions on Cori at NERSC

  1. Clone the source code.

    git clone https://github.com/swblaster/tf2-cosmoflow
    
  2. Customize the run-time parameters by modifying the file paths in ./test.yaml.

    • frameCnt: the number of samples in each file.
    • numPar: the number of parameters to be predicted.
    • sourceDir/prj: the top directory of the data files.
    • subDir: the sub-directory under sourceDir/prj, where the actual files are located.
    • splitIdx/train: the indices of the training files.
    • splitIdx/test: the indices of the test files.

    Below is an example test.yaml file; a sketch of how these fields can be combined into data file paths is given at the end of this section.

    frameCnt: 128
    numPar: 4
    parNames: [Omega_m, sigma_8, N_spec, H_0]
    sourceDir: {
      prj: /global/cscratch1/sd/slz839/cosmoflow_c1/,
    }
    subDir: multiScale_tryG/
    splitIdx:
      test: [100, 101, 102, 103, 104, 105, 106, 107]
      train: [20, 21, 22, 23, 24, 25, 26, 27,
              30, 31, 32, 33, 34, 35, 36, 37,
              40, 41, 42, 43, 44, 45, 46, 47,
              50, 51, 52, 53, 54, 55, 56, 57,
              60, 61, 62, 63, 64, 65, 66, 67,
              70, 71, 72, 73, 74, 75, 76, 77,
              80, 81, 82, 83, 84, 85, 86, 87,
              90, 91, 92, 93, 94, 95, 96, 97]
    
  3. Command-line Options

    • --epochs: the number of epochs for training.
    • --batch_size: the local batch size (the batch size for each process).
    • --overlap: (0:off / 1:on) disable/enable the I/O overlap feature.
    • --checkpoint: (0:off / 1:on) disable/enable the checkpointing.
    • --buffer_size: the I/O buffer size with respect to the number of samples.
    • --file_shuffle: (0:off / 1:on) disable/enable shuffling of the training files.
    • --record_acc: (0:off / 1:on) disable/enable the accuracy recording.
    • --config: the file path for input data configuration.
    • --evaluate: (0:off / 1:on) disable/enable evaluation of the trained model.
    • --async_io: (0:off / 1:on) disable/enable the asynchronous I/O feature.

    A sketch of how these flags might be parsed is given at the end of this section.
  4. Start Training. Parallel training jobs can be submitted to Cori's batch queue using a script file; an example is given in ./sbatch.sh. The file ./myjob.lsf is an example script for running on Summit at OLCF. Below is an example python command that can be used in the job script file.

    python3 main.py --epochs=3 \
                    --batch_size=4 \
                    --overlap=1 \
                    --checkpoint=0 \
                    --buffer_size=128 \
                    --file_shuffle=1 \
                    --record_acc=0 \
                    --config="test.yaml" \
                    --evaluate=0 \
                    --async_io=1
    
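As a reference for step 2, the sketch below shows one way the test.yaml fields could be combined into lists of data file paths. The per-index file-name pattern used here is an assumption for illustration only; see the data loader in this repository for the actual naming scheme.

    # Hypothetical sketch of resolving test.yaml into training/test file lists.
    # The file-name pattern "data_<index>.hdf5" is an assumption for illustration.
    import os
    import yaml

    with open("test.yaml") as f:
        cfg = yaml.safe_load(f)

    data_dir = os.path.join(cfg["sourceDir"]["prj"], cfg["subDir"])
    train_files = [os.path.join(data_dir, f"data_{i}.hdf5") for i in cfg["splitIdx"]["train"]]
    test_files = [os.path.join(data_dir, f"data_{i}.hdf5") for i in cfg["splitIdx"]["test"]]

    # Each file holds frameCnt samples, each labeled with the numPar parameters
    # listed in parNames.
    print(len(train_files), "training files and", len(test_files), "test files")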
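
The command-line flags from step 3 could be defined along the following lines. This is only a sketch whose defaults are taken from the example command above, not the actual argument parser in main.py.

    # Sketch of an argparse definition matching the options listed in step 3.
    # Defaults mirror the example command above and are assumptions only.
    import argparse

    parser = argparse.ArgumentParser(description="CosmoFlow training with asynchronous I/O")
    parser.add_argument("--epochs", type=int, default=3, help="number of training epochs")
    parser.add_argument("--batch_size", type=int, default=4, help="local (per-process) batch size")
    parser.add_argument("--overlap", type=int, default=1, help="0/1: disable/enable I/O overlap")
    parser.add_argument("--checkpoint", type=int, default=0, help="0/1: disable/enable checkpointing")
    parser.add_argument("--buffer_size", type=int, default=128, help="I/O buffer size in samples")
    parser.add_argument("--file_shuffle", type=int, default=1, help="0/1: disable/enable file shuffling")
    parser.add_argument("--record_acc", type=int, default=0, help="0/1: disable/enable accuracy recording")
    parser.add_argument("--config", type=str, default="test.yaml", help="input data configuration file")
    parser.add_argument("--evaluate", type=int, default=0, help="0/1: disable/enable evaluation of the trained model")
    parser.add_argument("--async_io", type=int, default=1, help="0/1: disable/enable asynchronous I/O")
    args = parser.parse_args()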

Publication

Development team

Questions/Comments

Project Funding Supports

This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through Advanced Computing (SciDAC) program. This project is a joint work of Northwestern University and Lawrence Berkeley National Laboratory supported by the RAPIDS Institute. This work is also supported in part by DOE awards DE-SC0014330 and DE-SC0019358.
