
Asynchronous I/O Support for CosmoFlow

This repository improves CosmoFlow by adding support for asynchronous data read operations. CosmoFlow is a parallel deep learning application developed for studying data generated from cosmological N-body dark matter simulations. The CosmoFlow source code is available on both GitHub and MLPerf. The programs in this repository update the CosmoFlow source code by incorporating the LBANN model and parallelizing it using Horovod. The training data files are available at NERSC.

To reduce the cost of reading training data from files and thus improve end-to-end training time, this repository adds an asynchronous I/O module based on the Python multiprocessing package, which allows file reads to overlap with the model-training computation on the GPUs.
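Below is a minimal sketch of that idea, not the actual module in this repository: a background reader process prefetches samples from files into a bounded queue while the main process trains on data that has already arrived, so read time is hidden behind GPU computation. The names reader_proc, read_samples, and train_on are placeholders used only for illustration.

    # Minimal sketch of overlapping file reads with training via multiprocessing.
    # read_samples() and train_on() are placeholders for the real HDF5 read and
    # the real training step; the actual module in this repository differs.
    import multiprocessing as mp
    import time

    def read_samples(path):
        time.sleep(0.1)        # stands in for reading one training file
        return path

    def train_on(samples):
        time.sleep(0.1)        # stands in for one round of training on the GPU

    def reader_proc(file_list, queue):
        # Producer: read files in the background and hand samples to the trainer.
        for path in file_list:
            queue.put(read_samples(path))   # blocks when the buffer is full
        queue.put(None)                     # sentinel: no more data

    def train(file_list, buffer_size=2):
        queue = mp.Queue(maxsize=buffer_size)   # bounded buffer between I/O and compute
        reader = mp.Process(target=reader_proc, args=(file_list, queue))
        reader.start()
        while True:
            samples = queue.get()               # prefetched in the background
            if samples is None:
                break
            train_on(samples)                   # overlaps with the next file read
        reader.join()

    if __name__ == "__main__":
        train([f"file_{i}" for i in range(8)])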

Software Requirements

  • TensorFlow > 2.0.0 (2.2.0 is recommended)
  • Horovod > 0.16
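
A quick way to check an environment against these requirements (assuming both packages are installed in the active Python environment):

    # Print the installed versions to compare against the requirements above.
    import tensorflow as tf
    import horovod

    print("TensorFlow:", tf.__version__)    # expect > 2.0.0 (2.2.0 recommended)
    print("Horovod:", horovod.__version__)  # expect > 0.16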

Run Instructions on Cori at NERSC

  1. Clone the source code.

    git clone https://github.com/swblaster/tf2-cosmoflow
    
  2. Customize the run-time parameters by modifying the file paths in ./test.yaml.

    • frameCnt: the number of samples in each file.
    • numPar: the number of parameters to be predicted.
    • sourceDir/prj: the top directory of the data files.
    • subDir: the sub-directory under sourceDir/prj, where the actual files are located.
    • splitIdx/train: the indices of the training files.
    • splitIdx/test: the indices of the test files.

    Below is an example test.yaml file; a sketch of how these fields can be combined into data file paths is given at the end of this section.

    frameCnt: 128
    numPar: 4
    parNames: [Omega_m, sigma_8, N_spec, H_0]
    sourceDir: {
      prj: /global/cscratch1/sd/slz839/cosmoflow_c1/,
    }
    subDir: multiScale_tryG/
    splitIdx:
      test: [100, 101, 102, 103, 104, 105, 106, 107]
      train: [20, 21, 22, 23, 24, 25, 26, 27,
              30, 31, 32, 33, 34, 35, 36, 37,
              40, 41, 42, 43, 44, 45, 46, 47,
              50, 51, 52, 53, 54, 55, 56, 57,
              60, 61, 62, 63, 64, 65, 66, 67,
              70, 71, 72, 73, 74, 75, 76, 77,
              80, 81, 82, 83, 84, 85, 86, 87,
              90, 91, 92, 93, 94, 95, 96, 97]
    
  3. Command-line Options

    • --epochs: the number of epochs for training.
    • --batch_size: the local batch size (the batch size for each process).
    • --overlap: (0:off / 1:on) disable/enable the I/O overlap feature.
    • --checkpoint: (0:off / 1:on) disable/enable the checkpointing.
    • --buffer_size: the I/O buffer size with respect to the number of samples.
    • --file_shuffle: (0:off / 1:on) disable/enable shuffling of the training files.
    • --record_acc: (0:off / 1:on) disable/enable the accuracy recording.
    • --config: the file path for input data configuration.
    • --evaluate: (0:off / 1:on) disable/enable evaluation of the trained model.
    • --async_io: (0:off / 1:on) disable/enable the asynchronous I/O feature.

    A sketch of how these flags might be parsed is given at the end of this section.
  4. Start Training. Parallel training jobs can be submitted to Cori's batch queue using a script file; an example is given in ./sbatch.sh. The file ./myjob.lsf is an example script for running on Summit at OLCF. Below is an example python command that can be used in the job script file.

    python3 main.py --epochs=3 \
                    --batch_size=4 \
                    --overlap=1 \
                    --checkpoint=0 \
                    --buffer_size=128 \
                    --file_shuffle=1 \
                    --record_acc=0 \
                    --config="test.yaml" \
                    --evaluate=0 \
                    --async_io=1
    
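As a reference for step 2, the sketch below shows one way the test.yaml fields could be combined into lists of data file paths. The per-index file-name pattern used here is an assumption for illustration only; see the data loader in this repository for the actual naming scheme.

    # Hypothetical sketch of resolving test.yaml into training/test file lists.
    # The file-name pattern "data_<index>.hdf5" is an assumption for illustration.
    import os
    import yaml

    with open("test.yaml") as f:
        cfg = yaml.safe_load(f)

    data_dir = os.path.join(cfg["sourceDir"]["prj"], cfg["subDir"])
    train_files = [os.path.join(data_dir, f"data_{i}.hdf5") for i in cfg["splitIdx"]["train"]]
    test_files = [os.path.join(data_dir, f"data_{i}.hdf5") for i in cfg["splitIdx"]["test"]]

    # Each file holds frameCnt samples, each labeled with the numPar parameters
    # listed in parNames.
    print(len(train_files), "training files and", len(test_files), "test files")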
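
The command-line flags from step 3 could be defined along the following lines. This is only a sketch whose defaults are taken from the example command above, not the actual argument parser in main.py.

    # Sketch of an argparse definition matching the options listed in step 3.
    # Defaults mirror the example command above and are assumptions only.
    import argparse

    parser = argparse.ArgumentParser(description="CosmoFlow training with asynchronous I/O")
    parser.add_argument("--epochs", type=int, default=3, help="number of training epochs")
    parser.add_argument("--batch_size", type=int, default=4, help="local (per-process) batch size")
    parser.add_argument("--overlap", type=int, default=1, help="0/1: disable/enable I/O overlap")
    parser.add_argument("--checkpoint", type=int, default=0, help="0/1: disable/enable checkpointing")
    parser.add_argument("--buffer_size", type=int, default=128, help="I/O buffer size in samples")
    parser.add_argument("--file_shuffle", type=int, default=1, help="0/1: disable/enable file shuffling")
    parser.add_argument("--record_acc", type=int, default=0, help="0/1: disable/enable accuracy recording")
    parser.add_argument("--config", type=str, default="test.yaml", help="input data configuration file")
    parser.add_argument("--evaluate", type=int, default=0, help="0/1: disable/enable evaluation of the trained model")
    parser.add_argument("--async_io", type=int, default=1, help="0/1: disable/enable asynchronous I/O")
    args = parser.parse_args()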

Publication

Development team

Questions/Comments

Project Funding Supports

This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through Advanced Computing (SciDAC) program. This project is a joint work of Northwestern University and Lawrence Berkeley National Laboratory supported by the RAPIDS Institute. This work is also supported in part by DOE awards DE-SC0014330 and DE-SC0019358.
