
Distributed Framework Optimization - Data Hack Summit 2023


Deep learning frameworks form the baseline over which millions of models (LLMs, multimodal and autoregressive models) are compiled and built. Many of these frameworks require sophisticated optimization to make models train and infer faster on constrained hardware. The intrinsic kernels that form part of these frameworks (such as PyTorch) leverage adaptive features to help break performance benchmarks in supercomputing and federated deep learning. This is a glimpse of the different sub-kernel, intermediate framework and model-level optimization techniques that help people run large models such as GPTs on constrained environments and clusters.


PyTorch

Most of the session revolves around different model optimization strategies and how the PyTorch framework can make training and fine-tuning efficient. This involves features such as ATen graph capture, lowering and composite graph compilation (by Inductor), followed by device-specific IR which the device compiler can optimize further for model performance.
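A minimal sketch of how this surfaces to the user, assuming PyTorch 2.x where `torch.compile` drives the ATen/FX graph capture, lowering and Inductor compilation described above (the model here is a toy placeholder):

```python
import torch
import torch.nn as nn

# Toy model used only to illustrate graph capture + Inductor compilation.
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# torch.compile captures the model into an ATen/FX graph, lowers it, and
# hands the fused graph to the Inductor backend, which emits device-specific
# code (Triton kernels on GPU, C++/OpenMP on CPU).
compiled_model = torch.compile(model, backend="inductor")

x = torch.randn(32, 128)
out = compiled_model(x)  # first call triggers compilation; later calls reuse the compiled graph
```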


Distributed PyTorch

To extend different parallelisms over dedicated combinations of hardware devices (CPU-GPU, GPU-GPU, multi-XPU, multi-TPU, MPS), the distributed backend of PyTorch comes into the picture. It enables scale-up and scale-out of sharded models, data and parameters to efficiently distribute gradients, checkpoints and activations across devices.
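A minimal sketch of wiring this up with DistributedDataParallel, assuming a GPU node launched with `torchrun` (the helper name `setup_and_wrap` is illustrative, not a PyTorch API):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_and_wrap(model: torch.nn.Module) -> DDP:
    # Backend choice depends on hardware: "nccl" for NVIDIA GPUs, "gloo" for CPUs.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    # DDP keeps one replica per process and all-reduces gradients during
    # backward, so every replica applies the same update.
    return DDP(model, device_ids=[local_rank])

# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
```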


Data, Model and Pipeline Parallelism

In data parallel training, the dataset is split into several shards and each shard is allocated to a device. This is equivalent to parallelizing the training process along the batch dimension.
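A minimal sketch of batch-dimension sharding with `DistributedSampler`, assuming the process group has already been initialized as in the previous sketch (the dataset here is a toy placeholder):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset standing in for a real training set.
dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))

# DistributedSampler hands each rank a disjoint shard of the indices,
# so every device trains on a different slice of each epoch.
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)  # reshuffle the shards each epoch
    for inputs, labels in loader:
        ...  # forward/backward on this rank's shard; gradients are synced by DDP
```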


Model parallelism involves sharding model blocks (rather than separate tensor lists) across devices in a uniform manner.
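A minimal sketch of naive model parallelism across two GPUs, with explicit activation movement between devices (the class name and layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class TwoDeviceModel(nn.Module):
    # The first block lives on cuda:0, the second on cuda:1; activations
    # are moved between devices explicitly in forward().
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.block2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.block1(x.to("cuda:0"))
        return self.block2(x.to("cuda:1"))

model = TwoDeviceModel()
out = model(torch.randn(32, 1024))  # output tensor lives on cuda:1
```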


Pipeline parallelism splits the model layer by layer into several chunks, and each chunk is assigned to a device. The caveat is that a single optimizer.step forces forward passes (through increasing pipeline stages) and backward passes (through decreasing pipeline stages) to run in an interleaved manner.
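A hand-rolled GPipe-style sketch of that schedule on two devices, splitting each batch into micro-batches so the stages overlap, with a single `optimizer.step()` at the end (the function name `pipeline_step` and the sizes are illustrative):

```python
import torch
import torch.nn as nn

stage1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to("cuda:0")  # pipeline stage 0
stage2 = nn.Linear(512, 10).to("cuda:1")                             # pipeline stage 1
optimizer = torch.optim.SGD(list(stage1.parameters()) + list(stage2.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def pipeline_step(batch, targets, num_microbatches=4):
    optimizer.zero_grad()
    # Micro-batches flow forward through stage 0 then stage 1; backward
    # retraces the stages in reverse, and gradients accumulate across
    # micro-batches before the single optimizer.step().
    for mb_x, mb_y in zip(batch.chunk(num_microbatches), targets.chunk(num_microbatches)):
        act = stage1(mb_x.to("cuda:0"))
        out = stage2(act.to("cuda:1"))
        loss = loss_fn(out, mb_y.to("cuda:1")) / num_microbatches
        loss.backward()
    optimizer.step()
```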


DeepSpeed ZeRO

ZeRO leverages the aggregate computation and memory resources of data parallelism to reduce the memory and compute requirements of each device (GPU) used for model training. ZeRO reduces the memory consumption of each GPU by partitioning the various model training states (weights, gradients, and optimizer states) across the available devices (GPUs and CPUs) in the distributed training hardware.
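A minimal sketch of enabling ZeRO through a DeepSpeed config dict, assuming DeepSpeed is installed and the script is launched with its distributed launcher (the model and hyperparameters are placeholders):

```python
import deepspeed
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    # Stage 1 partitions optimizer states, stage 2 adds gradient partitioning,
    # stage 3 also partitions the parameters themselves across devices.
    "zero_optimization": {"stage": 3},
}

# deepspeed.initialize wraps the model and optimizer in an engine that
# applies the ZeRO partitioning described above.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```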


Triton Compiler

Triton is a deep learning compiler created specifically to abstract IR code and optimize kernels that would otherwise be difficult to optimize in CUDA.
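A minimal Triton kernel, following the canonical vector-addition pattern, to show how the block-level programming model hides the index arithmetic and masking that raw CUDA would need:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE chunk of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds lanes
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)  # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
assert torch.allclose(add(x, y), x + y)
```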

