
# ML tricks

Notes on ML tricks that help training converge faster and reach better minima.

On the Convergence of Adam and Adagrad: the main practical takeaway is that increasing the exponential decay factor β2 is as critical as decreasing the learning rate for converging to a critical point. The analysis also highlights a link between Adagrad and a finite-horizon version of Adam: for a fixed horizon N, taking α = 1/√N and β2 = 1 − 1/N gives Adam the same convergence bound as Adagrad.
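
A minimal PyTorch sketch of that rule, assuming the total number of optimizer steps N is known in advance (the model and the value of N below are placeholders):

```python
import math
import torch

model = torch.nn.Linear(10, 1)   # placeholder model
N = 100_000                      # planned number of optimizer steps (the horizon)

lr = 1.0 / math.sqrt(N)          # alpha = 1 / sqrt(N)
beta2 = 1.0 - 1.0 / N            # beta2 = 1 - 1/N

optimizer = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, beta2))
```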

ReZero: the idea is simple, ReZero initializes each layer to perform the identity operation. For each layer, we introduce a residual connection for the input signal x and one trainable parameter \alpha that modulates the non-trivial transformation F(x) of the layer: x_{i+1} = x_i + \alpha_i F(x_i), with \alpha_i initialized to zero.
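
A minimal PyTorch sketch of such a block (the wrapped sublayer is just an illustrative choice):

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """Residual block gated by a zero-initialized scalar: x + alpha * F(x)."""

    def __init__(self, layer: nn.Module):
        super().__init__()
        self.layer = layer                          # F, the non-trivial transformation
        self.alpha = nn.Parameter(torch.zeros(1))   # alpha = 0, so the block starts as the identity

    def forward(self, x):
        return x + self.alpha * self.layer(x)

# Example: wrap an illustrative feed-forward sublayer.
block = ReZeroBlock(nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64)))
y = block(torch.randn(8, 64))
```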

Statistical Adaptive Stochastic Gradient Methods: automatically finds a good learning rate. !! Needs to be tested and compared to the good old Adam optimizer.

Evolving Normalization-Activation Layers: EvoNorm-S0 combines the normalisation layer and the activation function: \frac{x \, \sigma(v_1 x)}{\sqrt{std_{h,w,c/g}^2(x) + \epsilon}} \cdot \gamma + \beta, where v_1, \gamma and \beta are learnable weights and the standard deviation is computed over height, width and channel groups.
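
A minimal PyTorch sketch of that layer for NCHW feature maps (the group count, epsilon and initializations are assumptions):

```python
import torch
import torch.nn as nn

class EvoNormS0(nn.Module):
    """Sketch of EvoNorm-S0: x * sigmoid(v * x) / group_std(x) * gamma + beta."""

    def __init__(self, channels: int, groups: int = 32, eps: float = 1e-5):
        super().__init__()
        self.groups = groups
        self.eps = eps
        self.v = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def group_std(self, x):
        n, c, h, w = x.shape
        xg = x.view(n, self.groups, c // self.groups, h, w)
        var = xg.var(dim=(2, 3, 4), keepdim=True)      # variance over (c/g, h, w)
        std = torch.sqrt(var + self.eps).expand_as(xg)
        return std.reshape(n, c, h, w)

    def forward(self, x):
        return x * torch.sigmoid(self.v * x) / self.group_std(x) * self.gamma + self.beta

# Usage on a dummy feature map
layer = EvoNormS0(channels=64, groups=8)
out = layer(torch.randn(2, 64, 16, 16))
```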

Do We Need Zero Training Loss After Achieving Zero Training Error?: keep the training loss from falling below a small "flooding" level even after the training error reaches zero. This makes the NN random-walk around that level and settle in "flat" minima.
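
The paper's trick is the flooding loss \tilde{J}(\theta) = |J(\theta) - b| + b; a minimal sketch (the flood level b is a hyperparameter, the value below is only illustrative):

```python
import torch

def flooded_loss(loss: torch.Tensor, b: float = 0.02) -> torch.Tensor:
    """Flooding: keep the training loss around the flood level b.

    When loss > b this behaves like the original loss; when loss < b the
    gradient direction flips, pushing the loss back up toward b.
    """
    return (loss - b).abs() + b

# Usage inside a training step:
# loss = criterion(model(x), y)
# flooded_loss(loss, b=0.02).backward()
```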

The large learning rate phase of deep learning - the catapult mechanism: a way to choose the initial learning rate. Compute the curvature of the loss at initialization and use it to define the critical learning rate (this holds for SGD on large-width ReLU networks with MSE loss).
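
One way to estimate that curvature is power iteration on the loss Hessian at initialization; a sketch assuming the critical learning rate scales like 2 / \lambda_{max} (the model, batch and iteration count are placeholders):

```python
import torch

def top_hessian_eigenvalue(loss, params, iters: int = 20):
    """Estimate the largest Hessian eigenvalue of `loss` w.r.t. `params`
    by power iteration on Hessian-vector products."""
    params = [p for p in params if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(iters):
        norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
        v = [vi / norm for vi in v]
        # Hessian-vector product Hv via a second backward pass
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        eig = sum((hvi * vi).sum() for hvi, vi in zip(hv, v)).item()
        v = [hvi.detach() for hvi in hv]
    return eig

# Usage at initialization (model, x, y are placeholders):
# loss = torch.nn.functional.mse_loss(model(x), y)
# lam = top_hessian_eigenvalue(loss, model.parameters())
# lr_critical = 2.0 / lam   # assumed scaling for the MSE / wide-network setting
```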

## Empirical scaling formulas

A constructive prediction of the generalization error across scales

Scaling Laws for Neural Language Models

## Weight standardization

Micro-batch training (1 datum per batch) can pose problems for training. Weight standardization used together with Group Normalization solves those problems (the statistics of the weights are computed over c_in x w_k x h_k).
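
A minimal PyTorch sketch of a weight-standardized convolution paired with GroupNorm (the epsilon and the example block sizes are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d with weight standardization: each filter's weights are
    standardized over (c_in, k_h, k_w) before the convolution."""

    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# Typically paired with GroupNorm instead of BatchNorm:
block = nn.Sequential(WSConv2d(64, 128, kernel_size=3, padding=1),
                      nn.GroupNorm(32, 128), nn.ReLU())
out = block(torch.randn(1, 64, 32, 32))
```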

## Engineering

Data echoing to alleviate input data pipeline latency: reuse (echo) each batch several times so the accelerator does not sit idle waiting for the input pipeline.
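
A minimal sketch of batch-level echoing (where in the pipeline you echo and the echo factor are tuning choices):

```python
def echo_batches(loader, echo_factor: int = 2):
    """Repeat each batch `echo_factor` times so the accelerator keeps
    working while the input pipeline catches up."""
    for batch in loader:
        for _ in range(echo_factor):
            yield batch

# Usage (loader and echo_factor are placeholders; echoing later in the
# pipeline reuses more work but yields more correlated updates):
# for x, y in echo_batches(train_loader, echo_factor=2):
#     train_step(x, y)
```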