Notes on ML tricks for ensuring faster convergence to a global minimum.
On the Convergence of Adam and Adagrad. A main practical takeaway: increasing the exponential decay factor β2 is as critical as decreasing the learning rate α for converging to a critical point. The analysis also highlights a link between Adagrad and a finite-horizon version of Adam: for a fixed number of iterations N, taking α = 1/√N and β2 = 1 − 1/N gives Adam the same convergence bound as Adagrad.
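A minimal sketch of applying that hyper-parameter choice with PyTorch's torch.optim.Adam; the iteration budget N, the placeholder model, and the β1 = 0.9 default are my assumptions, not part of the paper's statement:

```python
import math
import torch

# Hypothetical training budget: N optimizer steps known in advance.
N = 100_000
alpha = 1.0 / math.sqrt(N)      # learning rate alpha = 1/sqrt(N)
beta2 = 1.0 - 1.0 / N           # exponential decay factor beta2 = 1 - 1/N

model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=alpha, betas=(0.9, beta2))
```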
ReZero. The idea is simple: ReZero initializes each layer to perform the identity operation. For each layer, we introduce a residual connection for the input signal x and one trainable parameter α (initialized to zero) that modulates the non-trivial transformation F(x) of the layer: x_{i+1} = x_i + α_i F(x_i).
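A minimal PyTorch sketch of a ReZero residual block; the class name and the feed-forward sub-layer used as F are placeholders of mine, the ReZero-specific part is only the scalar α initialized to zero:

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """x_{i+1} = x_i + alpha_i * F(x_i), with alpha_i initialized to 0."""

    def __init__(self, dim: int):
        super().__init__()
        # Placeholder transformation F; any sub-layer (attention, conv, MLP) works.
        self.F = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        # Single trainable scalar, zero at init, so the block starts as the identity.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return x + self.alpha * self.F(x)
```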
Statistical Adaptive Stochastic Gradient Methods Automatically find a good learning rate. !! Need to be tested and compared to the good old Adam optimizer
Evolving Normalization-Activation Layers. EvoNorm-S0 combines the normalisation layer and the activation function: \frac{x \, \sigma(v_1 x)}{\sqrt{\mathrm{std}^2_{h,w,c/g}(x) + \epsilon}} \gamma + \beta, where the standard deviation is computed over height, width and the channels of each group, and v_1, \gamma and \beta are learnable weights.
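A hedged PyTorch sketch of that layer for NCHW tensors; the class and argument names and the initialization values are my choices, and it assumes the channel count is divisible by the group count:

```python
import torch
import torch.nn as nn

class EvoNormS0(nn.Module):
    """EvoNorm-S0 sketch: x * sigmoid(v1 * x) / group_std(x) * gamma + beta."""

    def __init__(self, channels: int, groups: int = 32, eps: float = 1e-5):
        super().__init__()
        self.groups = groups
        self.eps = eps
        # Per-channel learnable weights v1, gamma, beta (placeholder initializations).
        self.v1 = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def group_std(self, x):
        n, c, h, w = x.shape
        g = x.reshape(n, self.groups, c // self.groups, h, w)
        # Standard deviation over (h, w, c/g) within each group.
        std = torch.sqrt(g.var(dim=(2, 3, 4), keepdim=True) + self.eps)
        return std.expand_as(g).reshape(n, c, h, w)

    def forward(self, x):
        return x * torch.sigmoid(self.v1 * x) / self.group_std(x) * self.gamma + self.beta
```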
Do We Need Zero Training Loss After Achieving Zero Training Error? Flooding: keep the training loss from reaching 0 even after the training error reaches 0, by making the loss oscillate around a small flood level b (below b, the update becomes gradient ascent). This leads the NN to random-walk and reach "flat" minima.
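A minimal sketch of the flooding objective, |J(θ) − b| + b, assuming a cross-entropy base loss; the flood level b = 0.02 is a hypothetical value:

```python
import torch
import torch.nn.functional as F

def flooded_loss(logits, targets, flood_level: float = 0.02):
    """Flooding: keep the training loss around the flood level b instead of 0.

    When the plain loss drops below b, the sign of the gradient flips
    (gradient ascent), so the loss oscillates around b.
    """
    loss = F.cross_entropy(logits, targets)
    return (loss - flood_level).abs() + flood_level
```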
The large learning rate phase of deep learning: the catapult mechanism. A way to choose your initial learning rate: compute the curvature at initialization and use it to define the critical learning rate, roughly 2 divided by the top curvature eigenvalue. (This holds for SGD on large-width ReLU networks with MSE loss.)
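A hedged sketch of that recipe; using the loss Hessian as the curvature proxy and the helper name are my assumptions. It estimates the top eigenvalue at initialization by power iteration on Hessian-vector products, from which a learning rate near 2/λ_max can be set:

```python
import torch

def top_curvature_eigenvalue(loss, parameters, iters: int = 20):
    """Estimate the largest eigenvalue of the loss Hessian at the current
    parameters by power iteration on Hessian-vector products."""
    params = [p for p in parameters if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(iters):
        norm = torch.sqrt(sum((u * u).sum() for u in v))
        v = [u / norm for u in v]
        # Hessian-vector product: differentiate the gradient in direction v.
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        eig = sum((h * u).sum() for h, u in zip(hv, v)).item()  # Rayleigh quotient
        v = [h.detach() for h in hv]
    return eig

# Usage sketch (at initialization, MSE loss):
#   lam = top_curvature_eigenvalue(loss, model.parameters())
#   lr = 2.0 / lam   # critical learning rate; the catapult regime sits above it
```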
A constructive prediction of the generalization error across scales
Scaling Laws for Neural Language Models
Micro-batch training (1 datum per batch) can pose problems for training. Weight Standardization used with Group Normalization solves those problems: each convolutional filter's weights are standardized, with the statistics computed over c_in x w_k x h_k.
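A hedged sketch of a weight-standardized Conv2d (the class name and epsilon are mine); it is typically followed by nn.GroupNorm:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d with Weight Standardization: each output filter's weights are
    normalized to zero mean and unit variance over (c_in, w_k, h_k) before use."""

    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```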
Data echoing: re-use (echo) batches that are already loaded while the input pipeline catches up, to alleviate your input data pipeline latency.
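A minimal sketch of batch-level data echoing as a generator wrapped around an existing data loader; the function name and the echo factor of 2 are hypothetical choices:

```python
def echo_batches(loader, echo_factor: int = 2):
    """Yield each batch `echo_factor` times so the accelerator keeps training
    while the (slower) input pipeline produces the next batch."""
    for batch in loader:
        for _ in range(echo_factor):
            yield batch

# Usage sketch:
#   for x, y in echo_batches(train_loader, echo_factor=2):
#       train_step(x, y)
```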