
Welcome to the rethinking-sparse-learning wiki!

TODO

Nov 23rd to Nov 30th:

  • Hyperparam tuning (see the Optuna sketch below)
     - [x]  Alpha, Delta T
     - [x]  Optuna: 15 trials, 3 jobs in parallel
     - [x]  Maximise val_accuracy
     - [x]  Use a single DB, different study names
     - [x]  Plots should be of test accuracy

     - [x]  Learning Rate
         - [x]  Plot for each sparsity across the four $(\alpha, \Delta T)$ settings:
             - [x]  $(\alpha, \Delta T) = (0.3, 100), (0.4,200), (0.4, 500), (0.5,750)$
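
A minimal sketch of the Optuna setup described above (15 trials, 3 parallel jobs, maximising val_accuracy, one shared DB with a separate study name per configuration). The search ranges, the study/DB names, and `train_and_evaluate` are illustrative placeholders, not the project's actual code:

```python
import optuna


def train_and_evaluate(alpha: float, delta_t: int, lr: float) -> float:
    # Placeholder: plug in the project's RigL training loop here and
    # return the final validation accuracy for this trial.
    return 0.0


def objective(trial: optuna.Trial) -> float:
    # Illustrative search space for alpha, Delta T and learning rate.
    alpha = trial.suggest_float("alpha", 0.1, 0.6)
    delta_t = trial.suggest_int("delta_t", 50, 1000, step=50)
    lr = trial.suggest_float("lr", 1e-3, 1e-1, log=True)
    return train_and_evaluate(alpha, delta_t, lr)


# Single shared DB; one study name per (dataset, sparsity) setting.
study = optuna.create_study(
    study_name="cifar10_rigl_density_0.1",  # hypothetical naming scheme
    storage="sqlite:///hyperparams.db",
    direction="maximize",  # maximise val_accuracy
    load_if_exists=True,
)
study.optimize(objective, n_trials=15, n_jobs=3)
```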
    
Dec 1st to 14th:
    • CIFAR10

      • Plots
      • Hyperparam plots
    • Mini-Imagenet

      • Dataloader (see the loader sketch after this list)
      • Which runs?
      • Dense
        • Do we need linear warmup & fancy tricks?
    • Extensions

      • Distributions: evaluate ERK vs Uniform in terms of computation (see the ERK density sketch after this list)

      • Dynamic Structured Sparsity

      • Effect of gradient accumulation

      • Effect of redistribution

        • Can ERK be a proxy? i.e., avoid redistribution and use ERK instead.
        • Need to show that redistribution gives no gains for ERK,
        • and some gains for Random.

        Experiments:

        • RigL Random
        • RigL Random with gradient re-distribution
        • RigL Random with momentum re-distribution
        • RigL Random with the final static distribution found above
        • RigL ERK
        • RigL ERK with re-distribution

        Questions: Is the effect of redistribution to find a better power-law distribution? Is the found distribution even power-law?

      • Ablation CAM: how do sparse nets see?
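
For the Mini-Imagenet dataloader item above, a minimal loader sketch assuming an ImageFolder-style layout (one sub-directory per class under a train/ split); the directory path, the 84x84 crop size, the ImageNet normalisation statistics, and the batch size are all assumptions rather than the project's settings:

```python
import torch
from torchvision import datasets, transforms

# Standard augmentation pipeline at Mini-ImageNet's usual 84x84 resolution.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(84),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Assumed layout: data/mini-imagenet/train/<class>/*.jpg
train_set = datasets.ImageFolder("data/mini-imagenet/train", transform=train_tf)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=128, shuffle=True, num_workers=4, pin_memory=True
)
```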
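
For the distributions item above, a hedged sketch of the ERK (Erdős–Rényi-Kernel) rule as used in RigL/SNFS: each layer's density is proportional to the sum of its weight dimensions divided by their product, rescaled to meet the global density, with any layer that would exceed density 1 kept fully dense and the rest rescaled. The layer shapes and the 10% global density in the example are illustrative only:

```python
import numpy as np


def erk_densities(shapes, global_density):
    """shapes: list of weight shapes, e.g. (out, in) for linear or (out, in, kh, kw) for conv."""
    n_params = np.array([np.prod(s) for s in shapes], dtype=float)
    raw = np.array([np.sum(s) / np.prod(s) for s in shapes])  # Erdős–Rényi(-Kernel) factor
    dense = np.zeros(len(shapes), dtype=bool)  # layers forced to density 1

    while True:
        # Parameter budget left for the still-sparse layers.
        budget = global_density * n_params.sum() - n_params[dense].sum()
        eps = budget / (raw[~dense] * n_params[~dense]).sum()
        densities = np.where(dense, 1.0, eps * raw)
        if (densities <= 1.0).all():
            return densities
        dense |= densities > 1.0  # cap overflowing layers at fully dense, then rescale


# Example (hypothetical shapes): a small conv net at 10% global density.
shapes = [(32, 3, 3, 3), (64, 32, 3, 3), (128, 64, 3, 3), (10, 128)]
print(erk_densities(shapes, global_density=0.1))
```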
