ashkan-mokarian/d2l-pytorch

How to use

  • If !pip install d2l fails with an error, run !pip install d2l==1.0.0a0 instead

Very important parts for a quick look

  • Generalization in DNNs: Weight regularization - weight decay (not very powerful); Early stopping (essential for noisy datasets); Dropout (adding noise during training to make the problem harder for the network); see the first sketch after this list.

  • Look at the implementation of batch norm: it contains a good example of coding style. The core functionality of batch norm is separated from the module definition into a standalone function that holds the algorithm and the math, while the module handles the bookkeeping: parameters, learning rate, moving averages, etc. (a condensed version is the second sketch below).
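
Regarding the generalization bullet above, a minimal sketch (my own, not the book's exact code) of two of the regularizers in PyTorch: weight decay through the optimizer's weight_decay argument, and dropout through nn.Dropout, which is only active in training mode.

```python
import torch
from torch import nn

# Toy network for illustration; nn.Dropout zeroes activations at random during
# training, and weight_decay adds an L2 penalty on the weights in each update.
net = nn.Sequential(nn.Linear(20, 256), nn.ReLU(),
                    nn.Dropout(p=0.5),
                    nn.Linear(256, 10))
optimizer = torch.optim.SGD(net.parameters(), lr=0.1, weight_decay=1e-4)

X = torch.rand(8, 20)
net.train()
print(net(X).shape)   # dropout active
net.eval()
print(net(X).shape)   # dropout disabled at evaluation time
```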

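Regarding the batch norm bullet, a condensed sketch of that pattern (my own simplification to fully connected inputs, not the book's exact code): the math lives in a plain function, while the nn.Module only does the bookkeeping (learnable parameters, running statistics, train/eval mode).

```python
import torch
from torch import nn

def batch_norm(X, gamma, beta, moving_mean, moving_var, eps, momentum, training):
    # Pure function: normalization math only, no parameter bookkeeping.
    if not training:
        X_hat = (X - moving_mean) / torch.sqrt(moving_var + eps)
    else:
        mean = X.mean(dim=0)
        var = ((X - mean) ** 2).mean(dim=0)
        X_hat = (X - mean) / torch.sqrt(var + eps)
        moving_mean = (1.0 - momentum) * moving_mean + momentum * mean
        moving_var = (1.0 - momentum) * moving_var + momentum * var
    return gamma * X_hat + beta, moving_mean.detach(), moving_var.detach()

class BatchNorm(nn.Module):
    # Module: holds parameters and running statistics, delegates the math.
    def __init__(self, num_features):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))
        self.register_buffer('moving_mean', torch.zeros(num_features))
        self.register_buffer('moving_var', torch.ones(num_features))

    def forward(self, X):
        Y, self.moving_mean, self.moving_var = batch_norm(
            X, self.gamma, self.beta, self.moving_mean, self.moving_var,
            eps=1e-5, momentum=0.1, training=self.training)
        return Y

layer = BatchNorm(4)
print(layer(torch.rand(8, 4)).shape)   # torch.Size([8, 4])
```
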
Skipped

  • About Autograd and the chain rule: check this. I don't understand it yet, but a very good resource is linked if you want to spend some time on it.

  • Read more about the central limit theorem: why does the uncertainty about the true average tend towards a normal distribution centered at the true mean at a rate of $\mathcal{O}(1/\sqrt{n})$? (A small simulation sketch follows this list.)

  • Neural tangent kernels: a good framework for understanding DNNs. Apparently it relates DNNs (parametric models) to non-parametric models (kernel methods). In short, more analytical arguments can be made for non-parametric methods, and through this connection it can serve as an analytical tool for understanding over-parameterized DNNs.

  • BLEU Score for sequence to sequence evaluation. Described here.
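
A small simulation (my own sketch) of the $\mathcal{O}(1/\sqrt{n})$ claim above: the average absolute error of the sample mean of fair-coin tosses shrinks roughly like $1/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [100, 1_000, 10_000, 100_000]:
    tosses = rng.integers(0, 2, size=(200, n))   # 200 repetitions of n tosses
    errors = np.abs(tosses.mean(axis=1) - 0.5)   # deviation from the true mean 0.5
    print(f"n={n:>6}: mean |error| = {errors.mean():.4f}, 1/sqrt(n) = {1 / np.sqrt(n):.4f}")
```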

TOC

  1. Introduction

    1. notes: Data is important; Major breakthroughs in the last century leading to current state of DL
  2. Preliminaries

    1. notes
    2. data manipulation and processing.ipynb: Tensor and ndarray, initialization, operation, in-place assignment for memory management, pandas
    3. derivatives, plots, automatic differentiation: some basic matplotlib plots; automatic differentiation aka autograd; gradients are usually zeroed between steps, but accumulating them is sometimes useful; how to deal with non-scalar grads; detaching computation; backpropagation through Python control flow (for, if, anything dependent on a tensor); see the autograd sketch below;
    4. probability, statistics, documentation: drawing samples to compute the frequency of events; by the law of large numbers and the central limit theorem, coin-toss frequencies converge to the true probabilities and the error goes down at a rate of $\mathcal{O}(1/\sqrt{n})$; Bayes' theorem; Chebyshev's inequality; invoking documentation, help, and docstrings via help() or ?list;
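
A short autograd sketch (my own, covering the notebook's topics): gradients accumulate and are usually zeroed between backward calls, non-scalar outputs are reduced to a scalar before calling backward, and detach() cuts the computation graph.

```python
import torch

x = torch.arange(4.0, requires_grad=True)
y = 2 * torch.dot(x, x)
y.backward()
print(x.grad)            # tensor([ 0.,  4.,  8., 12.])

x.grad.zero_()           # gradients accumulate by default, so reset them
y = x * x                # non-scalar output
y.sum().backward()       # reduce to a scalar before calling backward
print(x.grad)            # tensor([0., 2., 4., 6.])

x.grad.zero_()
u = (x * x).detach()     # treat u as a constant with respect to x
z = u * x
z.sum().backward()
print(x.grad == u)       # gradient of z w.r.t. x is u, not 3 * x**2
```
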
  3. Linear Neural Networks for Regression

    1. notes: Generalization and model selection;
    2. Linear Regression: some implementation basics for the d2l OOP style used in the notebooks; creating a dataset using d2l; linear regression implementation from scratch; weight decay notes and implementation;
  4. Linear Neural Network for Classification

    1. notes: cross-entropy loss; a naive softmax implementation can lead to over/under-flow (see the sketch below); central limit theorem; environments and distribution shift;
    2. linear classification with Fashion-MNIST: Softmax linear regression implementation;
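
On the over/under-flow note above, a small illustration (my own; the book's concise version relies on nn.CrossEntropyLoss, which applies the same log-sum-exp trick internally):

```python
import torch

logits = torch.tensor([[1000.0, -1000.0, 10.0]])

# Naive softmax: exp(1000) overflows to inf, so the first entry becomes nan.
naive = torch.exp(logits) / torch.exp(logits).sum(dim=1, keepdim=True)
print(naive)

# Subtracting the row-wise max leaves the result unchanged mathematically
# but keeps every exponent <= 0, avoiding overflow.
shifted = logits - logits.max(dim=1, keepdim=True).values
stable = torch.exp(shifted) / torch.exp(shifted).sum(dim=1, keepdim=True)
print(stable)

# For the loss, stay in log space instead of computing log(softmax(...)).
log_probs = shifted - torch.log(torch.exp(shifted).sum(dim=1, keepdim=True))
print(log_probs)
```
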
  5. Multilayer Perceptrons

    1. notes: what happens to the derivative of ReLU at 0?; sigmoid; layer widths are usually powers of 2, why?; forward pass, backpropagation, the computational graph, and the memory requirements of training with backprop; numerical stability, vanishing and exploding gradients; early stopping; dropout;
    2. MLP: plots of relu, sigmoid, tanh, scratch and concise implementation for MLP;
    3. Dropout
    4. (Could not make it work: NaN loss) House prediction on Kaggle: implementation of house-price prediction on a Kaggle dataset; using pandas to load the CSV data and preprocess it; first checking with a simple linear regression model to verify that the data processing works and to get a baseline;
  6. Builder's Guide

    1. notes: nothing
    2. IO and saving models on disk
    3. GPU: by default, parameters are stored on the CPU and have to be moved to the GPU explicitly; operations on multiple tensors require them to be on the same device, otherwise the framework cannot decide where to store the result or where to perform the computation (see the device sketch below);
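
A device-handling sketch for the note above (my own):

```python
import torch
from torch import nn

# Pick a GPU if one is available, otherwise fall back to the CPU.
device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')

net = nn.Sequential(nn.Linear(20, 8), nn.ReLU(), nn.Linear(8, 1))
net.to(device)                         # move all parameters to `device`

X = torch.rand(2, 20, device=device)   # inputs must live on the same device
print(net(X).device)

# Mixing devices fails; on a GPU machine the following raises a RuntimeError:
# torch.rand(2, 20) + torch.rand(2, 20, device='cuda:0')
```
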
  7. CNN

    1. notes: convolution kernels, padding, striding, computation for multiple channels (shape sketch after this chapter's entries)
    2. LeNet: not much
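
A shape-bookkeeping sketch for the padding/striding note above (my own): with input size $n$, kernel size $k$, padding $p$ and stride $s$, each spatial dimension of the output is $\lfloor (n - k + 2p)/s \rfloor + 1$; input channels are summed over, and each output channel comes from its own kernel stack.

```python
import torch
from torch import nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, padding=2, stride=2)
X = torch.rand(1, 3, 32, 32)
print(conv(X).shape)   # torch.Size([1, 16, 16, 16]): (32 - 5 + 2*2) // 2 + 1 = 16
```
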
  8. Modern CNN

    1. notes: Conv blocks instead of Conv layers; the stem, body, head design pattern; multi-branching in GoogLeNet; Batch Normalization; ResNet and residual connections (see the sketch below); Grouped Convolutions to reduce memory and time by dividing the channels into multiple branches; some general design concepts for CNNs; a final note on how scalability trumps inductive biases, i.e. transformers beating CNNs;
    2. VGG: Implementation of VGG-11
    3. NiN: Implementation of NiN; takes less memory by using 1x1 convolutions in the early and intermediate layers and nn.AdaptiveAvgPool2d at the end.
    4. GoogleNet: not much to see; the model is too complicated, with a lot of channel-count parameters that do not give any insight. Maybe just the overall design pattern is interesting; the implementation of the inception block and the modular design of such a large network could also be worth a look.
    5. Batch Norm: Implementation of batch norm from scratch; batch norm is placed between the conv/FC layer and the subsequent non-linearity;
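
A minimal residual block sketch (my own, close in spirit to the book's ResNet notebook): the skip connection lets the block learn a residual on top of the identity, and an optional 1x1 convolution matches shapes when the channel count or resolution changes.

```python
import torch
from torch import nn
from torch.nn import functional as F

class Residual(nn.Module):
    def __init__(self, in_channels, out_channels, use_1x1conv=False, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               padding=1, stride=stride)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        self.conv3 = (nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride)
                      if use_1x1conv else None)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3 is not None:
            X = self.conv3(X)          # match the shape of the shortcut branch
        return F.relu(Y + X)

blk = Residual(3, 6, use_1x1conv=True, stride=2)
print(blk(torch.rand(4, 3, 32, 32)).shape)   # torch.Size([4, 6, 16, 16])
```
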
  9. RNN

    1. notes: autoregression; the $\tau$-th order Markov condition (see the data-construction sketch below);
    2. Markov model: k-step ahead prediction and accumulation of errors problem showcased on a synthetic dataset;
    3. Language Model: preprocessing raw text into sequence data, tokenizer, vocabulary; Zipf's law for n-grams; dataset sampling strategy, or how to sample train and validation datasets from a corpus;
    4. RNN:
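
A small sketch of the data construction behind the Markov note above (my own, on a synthetic sine series): a $\tau$-th order autoregressive model predicts $x_t$ from the previous $\tau$ values, so the sequence is turned into (feature, label) pairs accordingly.

```python
import torch

T, tau = 1000, 4
time = torch.arange(1, T + 1, dtype=torch.float32)
x = torch.sin(0.01 * time) + torch.normal(0, 0.2, (T,))   # noisy synthetic series

# features[i] = (x_i, ..., x_{i+tau-1}), labels[i] = x_{i+tau}
features = torch.stack([x[i: T - tau + i] for i in range(tau)], dim=1)
labels = x[tau:].reshape(-1, 1)
print(features.shape, labels.shape)   # torch.Size([996, 4]) torch.Size([996, 1])
```
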
  10. Modern RNN

    1. notes: lstm; GRU;
    2. LSTM: lstm; Deep RNN;
    3. Encoder Decoder:
  11. Transformer

    1. notes: scaled dot-product and additive attention scoring functions; a very short description of multi-head attention; self-attention; positional encoding; the transformer architecture;
    2. attention: heatmap visualization of attention weights; Nadaraya-Watson regression, a simple regression example using attention pooling; masked softmax; scaled dot-product and additive attention (condensed sketch after this chapter's entries);
    3. Bahdanau Attention: nothing too special, just the implementation of a seq2seq model using attention, though not exactly like transformers; probably better models to look at in the notebooks below.
    4. Transformer: Multi-head attention, scratch implementation with reshaping convenience functions for parallel computation of all heads; Positional encoding; encoder-decoder implementation for sequence task using transformer self-attention;
    5. ViT: Patch Embedding using conv; ViT implementation;
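
A condensed sketch of scaled dot-product attention with a masked softmax (my own compression of what the attention notebook covers): scores are $QK^\top/\sqrt{d}$, positions beyond each sequence's valid length are pushed to a very negative value before the softmax, and the output is a weighted sum of the values.

```python
import math
import torch
from torch.nn import functional as F

def scaled_dot_product_attention(queries, keys, values, valid_lens=None):
    d = queries.shape[-1]
    scores = queries @ keys.transpose(-2, -1) / math.sqrt(d)   # (B, n_q, n_k)
    if valid_lens is not None:
        n_k = scores.shape[-1]
        # Positions >= valid length get a large negative score (masked softmax).
        mask = torch.arange(n_k, device=scores.device)[None, None, :] >= valid_lens[:, None, None]
        scores = scores.masked_fill(mask, -1e6)
    weights = F.softmax(scores, dim=-1)
    return weights @ values

Q = torch.rand(2, 1, 8)
K = V = torch.rand(2, 10, 8)
out = scaled_dot_product_attention(Q, K, V, valid_lens=torch.tensor([3, 6]))
print(out.shape)   # torch.Size([2, 1, 8])
```
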
  12. Optimization Algorithms

    1. notes: Convexity; Dynamic learning rate; SGD; Momentum; AdaGrad; RMSprop; Adadelta; Adam; LR scheduling;
  13. Computational Performance

    1. notes: Imperative vs. Symbolic programming; Asynchronous computation; blockers and barriers between frontend and backend in ML frameworks; Parallelism of GPUs; non-blocking communication; Hardware short introduction; Multiple-GPU training strategies and data-parallelism;
    2. some code: Nothing special.
  14. Computer Vision

    1. notes: Some basic Augmentation methods in torchvision.transforms; Compose[transforms] for combining multiple augmentations; A particular case of transfer learning: fine-tuning; Anchor box; AnchorBoxes as training data for object detection; Multi-scale object detection; Semantic segmentation (mainly in the notebook explained and not here); transposed convolutions for upsampling; FCN; initializing transposed convolutions using bilinear interpolation; predicting semantic segmentations for images larger than input size; Neural Style Transfer (using a pretrained CNN to update parameters of a synthesized image using backpropagation);
    2. Object Detection: anchor box implementation based on a list of sizes and ratios;
    3. SSD: Single Shot Multibox Object Detection implementation;
    4. Semantic Segmentation: Implementation for reading VOC dataset, doing necessary transformation of colors into labels or indices for the label maps, creating dataset and dataloader class; Removing the head of a ResNet18, replacing with a 1x1 layer to get the num_classes channels and adding transposed conv layer to build the whole semantic segmentation network; Not continued because the rest was just training and evaluation;
    5. Neural Style Transfer: uses a style image to apply its style to a content image. A pretrained VGG is used to extract features. The parameters of the synthesized image are what gets trained, and VGG is frozen in this scenario. The content loss is the squared loss of the features at one of the layers close to the output, and the image is also initialized with the content image. For the style loss, the Gram matrices of several layers are compared rather than the features themselves, so only the correlations between features are matched, capturing the style and not the content (see the Gram-matrix sketch below). Total variation loss is used for denoising.
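
A Gram-matrix sketch for the style loss described above (my own; the layer choice and normalization are illustrative): flattening a feature map of shape (C, H, W) to (C, H*W) and taking $XX^\top$ keeps only the channel correlations, so matching Gram matrices matches style rather than content.

```python
import torch

def gram(X):
    C, H, W = X.shape
    X = X.reshape(C, H * W)
    return (X @ X.T) / (C * H * W)    # channel-by-channel correlations, normalized

def style_loss(synth_features, style_gram):
    return ((gram(synth_features) - style_gram) ** 2).mean()

style_features = torch.rand(64, 32, 32)                       # stand-in for one VGG layer's activations
synth_features = torch.rand(64, 32, 32, requires_grad=True)   # stand-in for the synthesized image's features
loss = style_loss(synth_features, gram(style_features).detach())
loss.backward()   # gradients flow back towards the synthesized image only
```
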
  15. Reinforcement Learning

    1. notes: Markov decision process; value, action-value, and policy functions; the value iteration algorithm (see the sketch below);
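
A value-iteration sketch for the notes above (my own, on a made-up 2-state, 2-action MDP): repeatedly apply the Bellman optimality update $V(s) \leftarrow \max_a \big[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \big]$.

```python
import numpy as np

P = np.array([[[0.9, 0.1],    # P[s, a, s']: transition probabilities
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.1, 0.9]]])
r = np.array([[1.0, 0.0],     # r[s, a]: immediate rewards
              [0.0, 2.0]])
gamma, V = 0.9, np.zeros(2)

for _ in range(100):
    Q = r + gamma * (P @ V)   # Q[s, a] = r(s, a) + gamma * E[V(s') | s, a]
    V = Q.max(axis=1)

print(V, Q.argmax(axis=1))    # optimal state values and the greedy policy
```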