PyTorch Template Using DistributedDataParallel

This is a seed project for distributed PyTorch training, built so that you can customize your network quickly.

Overview

Here is an overview of what this template can do; most of these features can be customized through the configuration file.

Basic Functions

checkpoint/resume training
progress bar (using tqdm)
progress logs (using logging)
progress visualization (using tensorboard)
finetune (partial network parameters training)
learning rate scheduler
random seed (reproducibility)

Features

distributed training using DistributedDataParallel
base class for extensibility
.json configuration file for most parameter tuning
support multiple networks/losses/metrics definition
debug mode for fast test 🌟

Usage

You Need to Know

The cuDNN settings used for training are shown below. The default (faster) branch can reduce the reproducibility of your results, so keep this in mind to avoid unexpected behavior.

 torch.backends.cudnn.enabled = True
 # speed-reproducibility tradeoff https://pytorch.org/docs/stable/notes/randomness.html
 if seed >= 0 and gl_seed >= 0:  # slower, more reproducible
     torch.backends.cudnn.deterministic = True
     torch.backends.cudnn.benchmark = False
 else:  # faster, less reproducible, default setting
     torch.backends.cudnn.deterministic = False
     torch.backends.cudnn.benchmark = True
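
A minimal sketch of how the rule above could be wrapped in one helper. The names `cudnn_flags` and `set_seed`, and the choice to add `gl_seed` to `seed`, are assumptions for illustration rather than the template's actual code. The PyTorch calls are only made when `torch` can be imported.

```python
import random


def cudnn_flags(seed, gl_seed):
    """Return the cuDNN flags implied by the two seeds (pure function)."""
    reproducible = seed >= 0 and gl_seed >= 0
    return {"deterministic": reproducible, "benchmark": not reproducible}


def set_seed(seed, gl_seed=0):
    """Seed the RNGs (when both seeds are non-negative) and apply the flags."""
    flags = cudnn_flags(seed, gl_seed)
    reproducible = flags["deterministic"]
    if reproducible:
        random.seed(seed + gl_seed)  # assumption: the two seeds are summed
    try:
        import torch
    except ImportError:
        return flags  # nothing else to configure without PyTorch
    if reproducible:
        torch.manual_seed(seed + gl_seed)
        torch.cuda.manual_seed_all(seed + gl_seed)
    torch.backends.cudnn.enabled = True
    torch.backends.cudnn.deterministic = flags["deterministic"]
    torch.backends.cudnn.benchmark = flags["benchmark"]
    return flags
```

Passing a negative seed selects the faster, less reproducible branch.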

The project lets you plug in custom classes/functions and parameters through the configuration file. You can define datasets, losses, networks, etc. using a specific format. Take the network as an example:

// import Network() class from models.network.py file with args
"which_networks": [
    {
        "name": ["models.network", "Network"],
        "args": {"init_type": "kaiming"}
    }
],

// import multiple Networks from default file with args
"which_networks": [
    {"name": "Network1", "args": {"init_type": "kaiming"}},
    {"name": "Network2", "args": {"init_type": "kaiming"}}
],

// import multiple Networks from default file without args
"which_networks": [
    "Network1", // equivalent to {"name": "Network1", "args": {}}
    "Network2"
]

// more details can be found in the More details parts and in the init_objs function in praser.py
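
The name formats above can be resolved with a dynamic import. The following is a minimal sketch under assumptions: `normalize_entry` and `init_obj` are hypothetical helpers, not the template's `init_objs`, and the usage lines rely on standard-library classes as stand-ins for `models.network`.

```python
import importlib


def normalize_entry(entry):
    """Turn a bare string into the equivalent {"name": ..., "args": {}} form."""
    if isinstance(entry, str):
        return {"name": entry, "args": {}}
    return {"name": entry["name"], "args": dict(entry.get("args", {}))}


def init_obj(entry, default_module, **extra_args):
    """Import the class/function named by the entry and call it with its args."""
    entry = normalize_entry(entry)
    name = entry["name"]
    if isinstance(name, (list, tuple)):
        module_name, attr_name = name  # ["file (module) name", "class name"]
    else:
        module_name, attr_name = default_module, name  # class in default file
    target = getattr(importlib.import_module(module_name), attr_name)
    return target(**entry["args"], **extra_args)


# Usage with stdlib stand-ins for a real network module:
half = init_obj(
    {"name": ["fractions", "Fraction"], "args": {"numerator": 1, "denominator": 2}},
    default_module="collections",
)
empty = init_obj("OrderedDict", default_module="collections")
```

Because the args dict is unpacked as keyword arguments, a misspelled key in the configuration raises a `TypeError` at construction time instead of being silently ignored.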

Start

Run run.py with your settings.

python run.py

More options can be found in run.py and config/base.json.
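
The configuration snippets in this README carry `//` comments, which Python's standard `json` module does not accept. Below is a minimal sketch of one way to load such text; `strip_line_comments` and `load_config_text` are hypothetical names and not necessarily how the template's own parser works.

```python
import json


def strip_line_comments(text):
    """Remove // comments that appear outside of JSON string literals."""
    out = []
    in_string = False
    escaped = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Skip to the end of the line; the newline itself is kept.
            while i < len(text) and text[i] != "\n":
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def load_config_text(text):
    """Parse JSON text that may contain // comments."""
    return json.loads(strip_line_comments(text))


sample = '{\n  "url": "https://pytorch.org", // kept: inside a string\n  "n": 1\n}'
config = load_config_text(sample)
```

Tracking string state matters because a naive split on `//` would also cut URLs and paths. Trailing commas are not handled by this sketch.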

Customize Dataset

The dataset part decides which data is fed into the network. You can define the dataset with the following steps:

Put your dataset under the data folder. See dataset.py in this folder as an example.
Edit the [datasets][train|test] part in config/base.json to import and initialize the dataset.

"datasets": { // train or test
    "train": {
        "which_dataset": { // import designated dataset using args
            "name": ["data.dataset", "Dataset"],
            "args": { // args to init dataset
                "data_root": "/data/jlw/datasets/comofod"
            }
        },
        "dataloader": {
            "validation_split": 0.1, // percent or number
            "args": { // args to init dataloader
                "batch_size": 2, // batch size in every gpu
                "num_workers": 4,
                "shuffle": true,
                "pin_memory": true,
                "drop_last": true
            }
        }
    }
}
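
The "percent or number" convention for `validation_split` can be read as: a float is a fraction of the dataset, an int is an absolute sample count. A minimal sketch of that interpretation, where `split_lengths` is a hypothetical helper and the exact rule used by the template is an assumption:

```python
def split_lengths(data_len, validation_split):
    """Return (train_len, valid_len) for a 'percent or number' split value."""
    if isinstance(validation_split, bool):
        raise TypeError("validation_split must be an int or a float")
    if isinstance(validation_split, int):
        valid_len = validation_split  # absolute number of samples
    else:
        valid_len = int(data_len * validation_split)  # fraction of the dataset
    if not 0 <= valid_len < data_len:
        raise ValueError("validation split leaves no training data")
    return data_len - valid_len, valid_len


train_len, valid_len = split_lengths(1000, 0.25)
```

The two lengths could then be handed to a splitting utility such as `torch.utils.data.random_split`.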

More details

You can import a dataset from a new file. The name key can be either a list giving the file (module) name and the class/function name, or a single string giving the class name in the default file (data.dataset.py). For example:

"name": ["data.dataset", "Dataset"], // import Dataset() class from data.dataset.py
"name": "Dataset", // import Dataset() class from default file

You can control and record more parameters through the configuration file. Taking data_root as an example, you only need to add it to the args dict and edit the corresponding class to accept this value:

"args":{ // args to init dataset
    "data_root": "your data path"
}

import torch.utils.data as data

class Dataset(data.Dataset):
    def __init__(self, data_root, phase='train', image_size=[256, 256], loader=pil_loader):
        imgs = make_dataset(data_root)  # data_root value comes from the configuration file
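
To make the flow from the args dict to the constructor concrete, here is a minimal sketch of a map-style dataset written without PyTorch so it runs anywhere. `FolderDataset`, this version of `make_dataset`, and the extension list are illustrative assumptions, not the template's implementation.

```python
import os

IMG_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp")  # assumed extension list


def make_dataset(data_root):
    """Collect image paths under data_root, sorted for a stable order."""
    paths = []
    for folder, _, files in sorted(os.walk(data_root)):
        for fname in sorted(files):
            if fname.lower().endswith(IMG_EXTENSIONS):
                paths.append(os.path.join(folder, fname))
    return paths


class FolderDataset:
    """Map-style dataset: indexable via __getitem__, sized via __len__."""

    def __init__(self, data_root, phase="train", image_size=(256, 256)):
        self.imgs = make_dataset(data_root)  # data_root comes from "args"
        self.phase = phase
        self.image_size = tuple(image_size)

    def __len__(self):
        return len(self.imgs)

    def __getitem__(self, index):
        # A real dataset would load and transform the image here.
        return {"path": self.imgs[index]}


def build_dataset(args):
    """Unpack the "args" dict from the configuration as keyword arguments."""
    return FolderDataset(**args)
```

Any key added to the args dict must match a constructor parameter, which is why the class has to be edited alongside the configuration.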

Customize Network

The network part defines the structure of your learning network. You can define your network with the following steps:

Put your network under the models folder. See network.py in this folder as an example.
Edit the [model][which_networks] part in config/base.json to import and initialize your networks; this entry is a list.

"which_networks": [ // import designated list of networks using args
    {
        "name": "Network",
        "args": { // args to init network
            "init_type": "kaiming" 
        }
    }
],

More details

You can import networks from a new file. The name key can be either a list giving the file (module) name and the class/function name, or a single string giving the class name in the default file (models.network.py). For example:

"name": ["models.network", "Network"], // import Network() class from models.network.py
"name": "Network", // import Network() class from default file

You can control and record more parameters through the configuration file. Taking init_type as an example, you only need to add it to the args dict and edit the corresponding class to accept this value:

"args": { // args to init network
    "init_type": "kaiming" 
}

import torch.nn as nn

class BaseNetwork(nn.Module):
    def __init__(self, init_type='kaiming', gain=0.02):
        super(BaseNetwork, self).__init__()  # init_type value comes from the configuration file

class Network(BaseNetwork):
    def __init__(self, in_channels=3, **kwargs):
        super(Network, self).__init__(**kwargs)  # get init_type value and pass it to the base network
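
The `**kwargs` forwarding above can be tried in isolation. This sketch uses plain Python classes instead of `nn.Module` (an assumption made so that it runs without PyTorch) and stores the values only to make the flow visible.

```python
class BaseNetwork:
    def __init__(self, init_type="kaiming", gain=0.02):
        # Values that are not given in the configuration keep these defaults.
        self.init_type = init_type
        self.gain = gain


class Network(BaseNetwork):
    def __init__(self, in_channels=3, **kwargs):
        # Everything the subclass does not consume is forwarded to the base.
        super().__init__(**kwargs)
        self.in_channels = in_channels


args = {"init_type": "kaiming"}  # the "args" dict from the configuration
net = Network(**args)
custom = Network(in_channels=1, gain=0.1)
```

A key that neither class accepts reaches `BaseNetwork.__init__` and raises a `TypeError`, which surfaces typos in the configuration early.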

You can import multiple networks. List them in the configuration file and use them in your model.

"which_networks": [
    {"name": "Network1", "args": {}},
    {"name": "Network2", "args": {}}
],

Customize Model(Trainer)

The model part defines your training process, including optimizers, losses, process control, etc. You can define your model with the following steps:

Put your model under the models folder. See model.py in this folder as an example.
Edit the [model][which_model] part in config/base.json to import and initialize your model.

"which_model": { // import designated model(trainer) using args
    "name": ["models.model", "Model"],
    "args": { // args to init model
    } 
},

More details

You can import a model from a new file. The name key can be either a list giving the file (module) name and the class/function name, or a single string giving the class name in the default file (models.model.py). For example:

"name": ["models.model", "Model"], // import Model() class / function (not recommended) from models.model.py (default is [models.model.py])
"name": "Model", // import Model() class from default file

You can control and record more parameters through the configuration file. Please refer to the More details parts above.

Losses and Metrics

Losses and metrics are defined in the configuration file. You can also control and record more parameters through the configuration file; please refer to the More details parts above.

"which_metrics": ["mae"], 
"which_losses": ["mse_loss"]

After the above steps, you need to rewrite several functions in base_model.py/model.py for your network and dataset.
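
The which_metrics and which_losses entries above refer to functions by name. A minimal sketch of such a lookup, using pure-Python versions of mean absolute error and mean squared error on lists of floats; the `REGISTRY` dict and `lookup` helper are hypothetical and only illustrate the idea.

```python
def mse_loss(output, target):
    """Mean squared error over two equal-length sequences."""
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(output)


def mae(output, target):
    """Mean absolute error over two equal-length sequences."""
    return sum(abs(o - t) for o, t in zip(output, target)) / len(output)


REGISTRY = {"mse_loss": mse_loss, "mae": mae}  # hypothetical name table


def lookup(names):
    """Map configured names to callables, failing loudly on unknown names."""
    missing = [name for name in names if name not in REGISTRY]
    if missing:
        raise KeyError("unknown loss/metric names: %s" % missing)
    return [REGISTRY[name] for name in names]


metrics = lookup(["mae"])
losses = lookup(["mse_loss"])
```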

Init step

See the __init__() function for an example.

Training/validation step

See the train_step()/val_step() functions for an example.
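
As a rough picture of how these hooks might fit together, the sketch below drives train_step() every epoch and val_step() / checkpointing at configurable intervals. The `BaseModel.train` loop, the `n_epoch` parameter, and `CountingModel` are assumptions for illustration; the template's real base class may differ.

```python
class BaseModel:
    def __init__(self, n_epoch, val_epoch=1, save_checkpoint_epoch=1):
        self.n_epoch = n_epoch
        self.val_epoch = val_epoch
        self.save_checkpoint_epoch = save_checkpoint_epoch
        self.epoch = 0

    def train_step(self):
        raise NotImplementedError("rewrite train_step() for your network")

    def val_step(self):
        raise NotImplementedError("rewrite val_step() for your network")

    def save_everything(self):
        pass  # checkpoint hook, overridden by subclasses

    def train(self):
        while self.epoch < self.n_epoch:
            self.epoch += 1
            self.train_step()
            if self.epoch % self.val_epoch == 0:
                self.val_step()
            if self.epoch % self.save_checkpoint_epoch == 0:
                self.save_everything()


class CountingModel(BaseModel):
    """Toy subclass that only records which hook ran in which epoch."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.calls = []

    def train_step(self):
        self.calls.append(("train", self.epoch))

    def val_step(self):
        self.calls.append(("val", self.epoch))

    def save_everything(self):
        self.calls.append(("save", self.epoch))


model = CountingModel(n_epoch=4, val_epoch=2, save_checkpoint_epoch=4)
model.train()
```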

Checkpoint/Resume training

See the save_everything()/load_everything() functions for an example.
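
The general shape of checkpointing and resuming can be sketched with the standard library alone. Here `pickle` stands in for `torch.save`/`torch.load`, and `save_checkpoint`/`load_checkpoint` are hypothetical names rather than the template's functions.

```python
import os
import pickle


def save_checkpoint(path, epoch, iteration, state):
    """Write the training position and state, replacing any older file."""
    payload = {"epoch": epoch, "iter": iteration, "state": state}
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(payload, f)
    # Rename last so an interrupted write never leaves a half-written file.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved payload, or a fresh starting point if none exists."""
    if not os.path.exists(path):
        return {"epoch": 0, "iter": 0, "state": None}
    with open(path, "rb") as f:
        return pickle.load(f)
```

Storing the epoch and iteration next to the weights is what lets a resumed run continue its schedule and logging from the right position.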

Debug mode

Sometimes you want to run through the whole process quickly to make sure the project works end to end; debug mode exists for this purpose.

This mode reduces the dataset size to speed up the training process. Just run the file with the -d option and edit the debug dict in the configuration file.

python run.py -d

"debug": { // args in debug mode, which will replace args in train
    "val_epoch": 1,
    "save_checkpoint_epoch": 1,
    "log_iter": 30,
    "data_len": 50 // percent or number, changes the size of the dataloader to debug_split.
}
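
The "replace args in train" behavior amounts to a dictionary override. A minimal sketch, where `apply_debug` and the `n_epoch` key are illustrative assumptions:

```python
def apply_debug(train_opt, debug_opt, debug=False):
    """Return the training options, overridden by debug options in debug mode."""
    merged = dict(train_opt)  # copy so the original options stay untouched
    if debug:
        merged.update(debug_opt)
    return merged


train_opt = {"n_epoch": 100, "val_epoch": 5, "save_checkpoint_epoch": 10, "log_iter": 1000}
debug_opt = {"val_epoch": 1, "save_checkpoint_epoch": 1, "log_iter": 30, "data_len": 50}

normal = apply_debug(train_opt, debug_opt, debug=False)
debugging = apply_debug(train_opt, debug_opt, debug=True)
```

Keys that appear only in the training options, such as the hypothetical `n_epoch` here, pass through unchanged.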

Customize More

You can choose the random seed and the experiment path in the configuration file. We will add more useful basic functions with related instructions. Contributions that extend customization or improve the code are welcome.

Todo

Here are some basic functions or examples that this repository is ready to implement:

Acknowledge

We have benefited a lot from the following projects:

https://github.com/Janspiry/Image-Super-Resolution-via-Iterative-Refinement

https://github.com/researchmm/PEN-Net-for-Inpainting

https://github.com/tczhangzhi/pytorch-distributed

https://github.com/victoresque/pytorch-template
