
A Study of Gradient Variance in Deep Learning (original raw repository)

(Here is a lazy way of open-sourcing all the code for a multi-year project.)

This repository contains implementations of a collection of ideas on improving and analyzing optimization methods. It includes the experiments for the following paper: A Study of Gradient Variance in Deep Learning, F. Faghri, D. Duvenaud, D. J. Fleet, J. Ba, arXiv:2007.04532.

Features

The following ideas may be of interest to different researchers.

All experiments and notes are in Jupyter notebooks

See notebooks/figures*.ipynb for the full record of what was tried, what failed, what worked, and all the figures. These notebooks are the main reason I am open-sourcing this code the lazy way: everything is there if someone wants to inspect it further.

Grid run

See grid_run and cluster.py for lightweight grid-search code. An example of using the code is here.
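For a rough sense of the idea, here is a minimal sketch of expanding a hyperparameter grid into command lines; this is not the actual grid_run/cluster.py API, and the `make_jobs` helper and the flag names are illustrative only.

```python
# Minimal sketch of a grid search; `make_jobs` and the flags are hypothetical,
# not the actual grid_run/cluster.py interface.
import itertools

def make_jobs(grid):
    """Expand a dict of lists into one flag string per combination."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield " ".join(f"--{k} {v}" for k, v in zip(keys, values))

grid = {"lr": [0.1, 0.01], "batch_size": [32, 128], "optim": ["sgd", "adam"]}
for flags in make_jobs(grid):
    print(f"python main.py {flags}")  # or submit each line to a cluster queue
```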

Abstraction of Iterative Optimization Methods

(For a cleaner implementation of this abstraction, see the FOptim repository.)

To accurately measure gradient statistics, we need to make sure that logging does not interfere with the internal operations of an optimizer. For example, the sampling of the data should not be affected by the sampling used for estimating statistics. Also, some optimizers, such as K-FAC, have periodic operations that should not be performed while measuring statistics. Our abstraction facilitates this.

Here we describe an abstraction of the following implemented optimizers: SGD, SGD+Momentum, Adam, K-FAC, and the variance reduction methods SVRG and our proposed GC sampler/optimizer. An optimization method has one major operation, the step function, which is executed repeatedly and relies on a direction and a step size/learning rate. It can also include frequent or infrequent operations.
Frequent updates are as cheap as a single gradient calculation, and their cost becomes negligible if done every 10-100 iterations. Infrequent "snapshots" have a cost comparable to a full pass over the training set and can be amortized if done only a few times during the entire training. K-FAC is an example of an optimizer that has both types of updates. See the NTK branch for that implementation code.
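The sketch below illustrates this abstraction; the class and method names are illustrative (the repository's actual interfaces may differ), and PyTorch tensors are assumed for the parameters, but it captures the split into a per-iteration step, an optimizer-specific direction, and frequent/infrequent updates.

```python
# Illustrative sketch of the optimizer abstraction; names are hypothetical.
import torch

class IterativeOptimizer:
    """Base class mirroring the prose above, not the repo's exact interface."""
    def __init__(self, params, lr, frequent_every=100, snapshot_every=5000):
        self.params = list(params)   # PyTorch tensors assumed
        self.lr = lr
        self.frequent_every = frequent_every
        self.snapshot_every = snapshot_every
        self.t = 0

    def grad(self):
        """Return the optimizer-specific step direction."""
        raise NotImplementedError

    def frequent_update(self):
        """Cheap periodic update, about the cost of one gradient computation."""

    def snapshot(self):
        """Expensive periodic update, about the cost of a full training pass."""

    def step(self):
        if self.t % self.snapshot_every == 0:
            self.snapshot()
        if self.t % self.frequent_every == 0:
            self.frequent_update()
        with torch.no_grad():
            for p, d in zip(self.params, self.grad()):
                p.add_(d, alpha=-self.lr)  # move along the proposed direction
        self.t += 1
```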

The important part of this abstraction is that every optimizer has to provide a grad() function that returns the step direction specific to that optimizer.
For example, in K-FAC the proposed direction is the preconditioned gradient (code).
In SVRG it is the current mini-batch gradient after adding the full-batch control variate and subtracting the same mini-batch's gradient at the snapshot parameters (code).
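As a concrete but hedged example, the SVRG direction can be written as the small function below; `minibatch_grad` is a hypothetical helper that evaluates the same mini-batch at whichever parameters it is given, and `mu` is the full-batch gradient stored at the last snapshot.

```python
# Hedged sketch of the SVRG step direction; `minibatch_grad` is hypothetical.
def svrg_direction(params, snap_params, mu, minibatch_grad):
    """Return the SVRG direction as a list of gradient tensors."""
    g_cur = minibatch_grad(params)       # gradient at the current parameters
    g_old = minibatch_grad(snap_params)  # SAME mini-batch, snapshot parameters
    # g_cur - g_old + mu stays unbiased because E[g_old] = mu,
    # and has lower variance when g_cur and g_old are correlated.
    return [gc - go + m for gc, go, m in zip(g_cur, g_old, mu)]
```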

This abstraction allows us to measure statistics of the step direction, which is what the paper above studies. For that, we have the grad_estim function, which lets us evaluate the gradient multiple times during training without affecting the internal operations of the optimizer. The statistics are measured by calls to grad_estim inside the get_Ege_var function, which is called by the log_var function from the main training loop along with other logging calls (code).
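Schematically, that call chain composes as below. The function names follow the prose, but the signatures are assumptions: `grad_estim` is taken to return the step direction as a list of tensors without advancing optimizer state, and `writer` is assumed to be a TensorBoard-style SummaryWriter.

```python
# Schematic of the measurement path; signatures are illustrative.
import torch

def get_Ege_var(optimizer, num_samples=10):
    """Estimate the mean and total variance of the step direction from
    repeated grad_estim calls (which must not advance optimizer state)."""
    samples = [torch.cat([g.flatten() for g in optimizer.grad_estim()])
               for _ in range(num_samples)]
    G = torch.stack(samples)          # shape: (num_samples, dim)
    mean_g = G.mean(dim=0)
    total_var = G.var(dim=0).sum()    # sum of per-coordinate variances
    return mean_g, total_var

def log_var(optimizer, writer, step):
    mean_g, total_var = get_Ege_var(optimizer)
    writer.add_scalar("grad/total_variance", total_var.item(), step)
    writer.add_scalar("grad/mean_norm", mean_g.norm().item(), step)
```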

Gradient Clustering (GC or Gluster in this repo)

Gradient Clustering is an efficient method for clustering gradients during training. Combined with stratified sampling, it yields low-variance unbiased gradient estimators. The non-weighted implementation is also a tool for inspecting the gradients of a model and understanding its decisions.

The main classes are in the gluster directory and the gluster estimator.
We have implemented both an online and a full-batch version of this gradient estimator. Stratified sampling is implemented here and relies on a slightly modified dataset wrapper that returns sample indices (code); a sketch of such a wrapper follows.
This requirement for indices is why applying the idea to infinite data streams and large datasets is challenging, but there are solutions that we have implemented separately and will include in another branch.
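A minimal version of such an index-returning wrapper, sketched here as the common PyTorch pattern rather than copied from the repo, looks like this:

```python
# Sketch of a dataset wrapper that also returns each sample's index,
# so per-example gradients can be matched to cluster assignments.
from torch.utils.data import Dataset

class IndexedDataset(Dataset):
    """Wrap a dataset so each item also carries its index."""
    def __init__(self, base):
        self.base = base

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        x, y = self.base[idx]
        return x, y, idx  # the extra index enables stratified (re)sampling
```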

Zero gradients and Ad-hoc sampling

We implemented many ad-hoc ideas for importance sampling of data points according to their loss value or gradient norm. All of these ideas work on MNIST but fail on large datasets. For those ideas see here, here, and here.
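For intuition, a typical loss-based scheme of this kind draws examples with probability proportional to their last-seen loss and reweights gradients to stay unbiased. The sketch below is illustrative, not the repository's exact code, and `sample_batch` is a hypothetical helper.

```python
# Sketch of loss-proportional importance sampling with unbiased reweighting;
# all names here are illustrative.
import torch

def sample_batch(losses, batch_size):
    """losses: per-example losses from an earlier pass (1-D, non-negative)."""
    probs = losses / losses.sum()   # p_i proportional to loss_i
    idx = torch.multinomial(probs, batch_size, replacement=True)
    # Weighting each gradient by 1 / (N * p_i) keeps the estimator unbiased.
    weights = 1.0 / (len(losses) * probs[idx])
    return idx, weights
```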

Dependencies

We recommend using Anaconda to install the required packages.

Reference

If you found this code useful, please cite the following paper:

@misc{faghri2020study,
    title={A Study of Gradient Variance in Deep Learning},
    author={Fartash Faghri and David Duvenaud and David J. Fleet and Jimmy Ba},
    year={2020},
    eprint={2007.04532},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

License

Apache License 2.0
