
Network Compression and Acceleration

This document covers how to accelerate the training and inference of deep learning (and machine learning more generally), beyond numerical optimization methods, including the following topics:

  • compiler optimization for computation intensive programs;
  • system architecture design for computation intensive programs;
  • network model compression.

The World of Neural Network Acceleration

  • Choice of Algorithm
  • Parallelism
  • Distributed Computing
  • Hardware Architectures

To revolutionize deep learning with real-time AI solutions that scale from the edge to the data center.

Deep neural networks have a tremendous number of parameters, and deep learning is matrix-computation intensive. Specialized hardware such as GPUs or TPUs is used to speed up the computation of deep learning in training or inference. Optimization methods are used to train the deep neural network; to boost training we design faster optimizers such as Adam and refined network architectures such as ResNet. After training, the parameters of the deep neural network are fixed and used for inference, which consists largely of matrix multiplications with the saved parameters.
From What’s the Difference Between Deep Learning Training and Inference?

| Training | Inference |
| --- | --- |
| Acceleration | Compression |
| https://web.stanford.edu/~perdavan/DNNTrain/ | https://www.intel.ai/accelerating-tensorflow-inference-with-intel-deep-learning-boost-on-2nd-gen-intel-xeon-scalable-processors/ |
| Tutorial on Hardware Accelerators for Deep Neural Networks | Accelerating Large Scale Deep Learning Inference through DeepCPU at Microsoft |
Evolution of Model Compression and Acceleration

  • Computer Architecture: TPUs, GPUs
  • Compilers: TVM
  • Model Re-design: EfficientNet
  • Re-parameterization: Pruning
  • Transfer Learning

When computational resources are limited, as in embedded or mobile systems, can we still deploy deep learning models? Definitely yes.

Resources on ML Sys

Workshop and Conference

Patents and Products

Courses and Labs

System for Deep Learning

Over the past few years, deep learning has become an important technique to successfully solve problems in many different fields, such as vision, NLP, robotics. An important ingredient that is driving this success is the development of deep learning systems that efficiently support the task of learning and inference of complicated models using many devices and possibly using distributed resources. The study of how to build and optimize these deep learning systems is now an active area of research and commercialization.

Matrix-computation-dense applications like deep neural networks take advantage of specific architecture designs. The topic is therefore closely related to high-performance computational science when solving computation-dense problems.

Parallel Architectures and Special Hardware

Parallel architectures for parallel processing, studied as hardware/software co-design, form a subfield of systems for machine learning.

GPU

The GPU architecture works well on applications with massive parallelism, such as the matrix multiplications in a neural network. Indeed, you typically see an order of magnitude higher throughput than a CPU on deep learning training workloads. This is why the GPU is the most popular processor architecture used in deep learning at the time of writing.

But the GPU is still a general-purpose processor that has to support millions of different applications and software packages. This leads back to the fundamental problem, the von Neumann bottleneck: for every single calculation in the thousands of ALUs, the GPU needs to access registers or shared memory to read and store intermediate results. Because the GPU performs more parallel calculations on its thousands of ALUs, it also spends proportionally more energy accessing memory, and the complex wiring increases the GPU's footprint.

TPU

TPUs can't run word processors, control rocket engines, or execute bank transactions, but they can handle the massive multiplications and additions for neural networks, at blazingly fast speeds while consuming much less power and inside a smaller physical footprint.

The key enabler is a major reduction of the von Neumann bottleneck. Because the primary task of this processor is matrix processing, the TPU's hardware designers knew every calculation step needed to perform that operation. So they were able to place thousands of multipliers and adders and connect them to each other directly to form a large physical matrix of those operators. This is called a systolic array architecture.

NPU

A neural processing unit (NPU) is a microprocessor that specializes in the acceleration of machine learning algorithms, typically by operating on predictive models such as artificial neural networks (ANNs) or random forests (RFs). It is also known as a neural processor.

NPUs are designed for the following purposes:

  1. Accelerate the computation of machine learning tasks by several folds (nearly 10K times) compared to GPUs
  2. Consume low power and improve resource utilization for machine learning tasks compared to GPUs and CPUs

Compilers for Deep Learning

LLVM

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies. Despite its name, LLVM has little to do with traditional virtual machines. The name "LLVM" itself is not an acronym; it is the full name of the project.

MLIR

The Multi-Level Intermediate Representation (MLIR) is intended for easy expression and optimization of computations involving deep loop nests and dense matrices of high dimensionality. It is thus well-suited to deep learning computations in particular. Yet it is general enough to also represent arbitrary sequential computation. The representation allows high-level optimization and parallelization for a wide range of parallel architectures including those with deep memory hierarchies --- general-purpose multicores, GPUs, and specialized neural network accelerators.

IREE

IREE (Intermediate Representation Execution Environment) is an MLIR-based end-to-end compiler and runtime that lowers Machine Learning (ML) models to a unified IR that scales up to meet the needs of the datacenter and down to satisfy the constraints and special considerations of mobile and edge deployments.

TVM and Versatile Tensor Accelerator (VTA)

TVM is an open deep learning compiler stack for CPUs, GPUs, and specialized accelerators. It aims to close the gap between the productivity-focused deep learning frameworks, and the performance- or efficiency-oriented hardware backends. TVM provides the following main features:

Compilation of deep learning models in Keras, MXNet, PyTorch, TensorFlow, CoreML, and DarkNet into minimal deployable modules on diverse hardware backends. Infrastructure to automatically generate and optimize tensor operators on more backends with better performance.

The Versatile Tensor Accelerator (VTA) is an extension of the TVM framework designed to advance deep learning and hardware innovation. VTA is a programmable accelerator that exposes a RISC-like programming abstraction to describe compute and memory operations at the tensor level. We designed VTA to expose the most salient and common characteristics of mainstream deep learning accelerators, such as tensor operations, DMA load/stores, and explicit compute/memory arbitration.

DLVM

We present DLVM, a design and implementation of a compiler infrastructure with a linear algebra intermediate representation, algorithmic differentiation by adjoint code generation, domain-specific optimizations, and a code generator targeting GPU via LLVM.

JAX: Autograd and XLA

With its updated version of Autograd, JAX can automatically differentiate native Python and NumPy functions. It can differentiate through loops, branches, recursion, and closures, and it can take derivatives of derivatives of derivatives. It supports reverse-mode differentiation (a.k.a. backpropagation) via grad as well as forward-mode differentiation, and the two can be composed arbitrarily to any order.

The XLA compilation framework is invoked on subgraphs of TensorFlow computations. The framework requires all tensor shapes to be fixed, so compiled code is specialized to concrete shapes. This means, for example, that the compiler may be invoked multiple times for the same subgraph if it is executed on batches of different sizes.

Glow

Glow is a machine learning compiler and execution engine for hardware accelerators. It is designed to be used as a backend for high-level machine learning frameworks. The compiler is designed to allow state of the art compiler optimizations and code generation of neural network graphs. This library is in active development.

nGraph

nGraph is an end to end deep learning compiler for inference and training with extensive framework and hardware support.

CHET

CHET is a domain-specific optimizing compiler designed to make the task of programming FHE applications easier. Motivated by the need to perform neural network inference on encrypted medical and financial data, CHET supports a domain-specific language for specifying tensor circuits. It automates many of the laborious and error prone tasks of encoding such circuits homomorphically, including encryption parameter selection to guarantee security and accuracy of the computation, determining efficient tensor layouts, and performing scheme-specific optimizations.

NNFusion

NNFusion is a flexible and efficient DNN compiler that can generate high-performance executables from a DNN model description (e.g., TensorFlow frozen models and ONNX format).

DLR

DLR is a compact, common runtime for deep learning models and decision tree models compiled by AWS SageMaker Neo, TVM, or Treelite. DLR uses the TVM runtime, Treelite runtime, NVIDIA TensorRT™, and can include other hardware-specific runtimes. DLR provides unified Python/C++ APIs for loading and running compiled models on various devices. DLR currently supports platforms from Intel, NVIDIA, and ARM, with support for Xilinx, Cadence, and Qualcomm coming soon.

Parallel Programming

Cilk

Cilk aims to make parallel programming a simple extension of ordinary serial programming. Other concurrency platforms, such as Intel’s Threading Building Blocks (TBB) and OpenMP, share similar goals of making parallel programming easier. But Cilk sets itself apart from other concurrency platforms through its simple design and implementation and its powerful suite of provably effective tools. These properties make Cilk well suited as a platform for next-generation multicore research. Tapir enables effective compiler optimization of parallel programs with only minor changes to existing compiler analyses and code transformations. Tapir uses the serial-projection property to order logically parallel fine-grained tasks in the program's control-flow graph. This ordered representation of parallel tasks allows the compiler to optimize parallel codes effectively with only minor modifications.

Triton

The aim of Triton is to provide an open-source environment to write fast code at higher productivity than CUDA, but also with higher flexibility than other existing DSLs.

TASO

TASO optimizes the computation graphs of DNN models using automatically generated and verified graph transformations. For an arbitrary DNN model, TASO uses the auto-generated graph transformations to build a large search space of potential computation graphs that are equivalent to the original DNN model. TASO employs a cost-based search algorithm to explore the space, and automatically discovers highly optimized computation graphs.

Jittor

Jittor is a high-performance deep learning framework based on JIT compiling and meta-operators. The whole framework and meta-operators are compiled just-in-time. A powerful op compiler and tuner are integrated into Jittor, allowing it to generate high-performance code specialized for your model. Jittor also contains a wealth of high-performance model libraries, including image recognition, detection, segmentation, generation, differentiable rendering, geometric learning, reinforcement learning, etc.

halide

Halide is a programming language designed to make it easier to write high-performance image and array processing code on modern machines.

Halide is a language for fast, portable computation on images and tensors

taichi

Taichi (太极) is a programming language designed for high-performance computer graphics. It is deeply embedded in Python, and its just-in-time compiler offloads compute-intensive tasks to multi-core CPUs and massively parallel GPUs.

Taichi Lang is an open-source, imperative, parallel programming language for high-performance numerical computation. It is embedded in Python and uses just-in-time (JIT) compiler frameworks, for example LLVM, to offload the compute-intensive Python code to the native GPU or CPU instructions.

Numerical algorithms for high-performance computational science

Several key themes emerged across multiple talks in the Royal Society Discussion Meeting, all in the context of today's high-performance computing landscape, in which processor clock speeds have stagnated (with the end of Moore's law) and exascale machines are just two or three years away.

  • An important way of accelerating computations is through the use of low precision floating-point arithmetic—in particular by exploiting a hierarchy of precisions.
  • We must exploit low rank matrix structure where it exists, for example in hierarchical (H-matrix) form, combining it with randomized approximations.
  • Minimizing data movement (communication) is crucial, because of its increasing costs relative to the costs of floating-point arithmetic.
  • Co-design (the collaborative and concurrent development of hardware, software, and numerical algorithms, with knowledge of applications) is increasingly important for numerical computing.

For more on high performance computation on GPU see https://hgpu.org/.

Why GEMM is at the heart of deep learning

General Matrix Multiply (GEMM) is a common algorithm in linear algebra, machine learning, statistics, and many other domains.

Fast Matrix-vector Multiplication

Matrix-vector multiplication is a special matrix multiplication: $$\mathbb R^{n}\to \mathbb R^{m}:\quad u=Mv=\sum_{i=1}^{n}M^{(i)}v_i$$ where $M\in\mathbb R^{m\times n}$, $v\in\mathbb R^{n}$, $u\in\mathbb R^{m}$; each column $M^{(i)}$ can, metaphorically, indicate one address or house and each $v_i$ a letter addressed to it.
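
As a quick illustration of this column view of matrix-vector multiplication, here is a minimal NumPy sketch (the array names and sizes are made up for the example):

import numpy as np

M = np.arange(6, dtype=float).reshape(3, 2)   # M in R^{3x2}
v = np.array([2.0, -1.0])                      # v in R^2

# u = M v, computed as a linear combination of the columns of M
u_columns = sum(M[:, i] * v[i] for i in range(M.shape[1]))
assert np.allclose(u_columns, M @ v)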

Computation of Matrix Chain Products

Generations of students have learned that the product $xy^Tz$, where $x, y,$ and $z$ are $n$-vectors, should be written and evaluated as $x(y^Tz)$ ($O(n)$ flops) rather than $(xy^T)z$ ($O(n^2)$ flops). More generally, deciding where to put the parentheses in a matrix product $A_1A_2\dots A_k$ to minimize the number of operations in the evaluation is a nontrivial problem, known as the matrix chain multiplication problem.

A special case is when $A_1=A_2=\dots =A_k=A$; the problem is then to compute the power $A^k=\underbrace{A\cdots A}_{k}$.
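
A hedged sketch of the classical dynamic-programming solution to the matrix chain multiplication problem (the function name and dimension list are illustrative):

def matrix_chain_order(dims):
    """dims[i-1] x dims[i] is the shape of matrix A_i; returns the minimal
    number of scalar multiplications needed to evaluate A_1 A_2 ... A_k."""
    k = len(dims) - 1                      # number of matrices in the chain
    cost = [[0] * (k + 1) for _ in range(k + 1)]
    for length in range(2, k + 1):         # chain length
        for i in range(1, k - length + 2):
            j = i + length - 1
            cost[i][j] = min(
                cost[i][s] + cost[s + 1][j] + dims[i - 1] * dims[s] * dims[j]
                for s in range(i, j)
            )
    return cost[1][k]

# x (y^T z) vs (x y^T) z for n-vectors with n = 100: dims = [100, 1, 100, 1]
print(matrix_chain_order([100, 1, 100, 1]))  # 200 multiplications, i.e. O(n) not O(n^2)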

Generalized Matrix to Matrix Multiplication

If the computation speed of matrix operations is boosted, the inference of deep learning models is accelerated. Matrix multiplication $C_{M\times N}=A_{M\times K}B_{K\times N}$ via dot products is defined as $$C[m,n]=\left< A[m,:], B[:, n]\right>=\sum_{k=1}^{K}A[m, k]\times B[k, n]$$

which is essentially a product-sum.

// Naive GEMM: C = A * B, where A is M x K, B is K x N, and C is M x N.
for (int m = 0; m < M; m++) {
  for (int n = 0; n < N; n++) {
    C[m][n] = 0;
    for (int k = 0; k < K; k++) {
      C[m][n] += A[m][k] * B[k][n];
    }
  }
}

It requires $O(MNK)$ multiplications.

Each element in the result matrix $C$ is the sum of element-wise multiplications of a row from $A$ and a column from $B$.

Our program is memory bound, which means that the multipliers are not active most of the time because they are waiting for memory.

Strassen Algorithms

It is based on block multiplication and requires square matrices of size $2^n\times 2^n$ (smaller matrices can be padded).

The matrices are partitioned into blocks: $$ \mathbf{A} = \begin{bmatrix} \mathbf{A}_{1,1} & \mathbf{A}_{1,2} \\ \mathbf{A}_{2,1} & \mathbf{A}_{2,2} \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} \mathbf{B}_{1,1} & \mathbf{B}_{1,2} \\ \mathbf{B}_{2,1} & \mathbf{B}_{2,2} \end{bmatrix}, \quad \mathbf{C} = \begin{bmatrix} \mathbf{C}_{1,1} & \mathbf{C}_{1,2} \\ \mathbf{C}_{2,1} & \mathbf{C}_{2,2} \end{bmatrix}, $$ where the blocks of $\mathbf{C}=\mathbf{A}\mathbf{B}$ would ordinarily require eight block multiplications, $\mathbf{C}_{i,j}=\mathbf{A}_{i,1}\mathbf{B}_{1,j}+\mathbf{A}_{i,2}\mathbf{B}_{2,j}$.

Strassen's algorithm needs only seven block multiplications, performed in the following way: $$\begin{aligned} \mathbf{M}_{1} &=\left(\mathbf{A}_{1,1}+\mathbf{A}_{2,2}\right)\left(\mathbf{B}_{1,1}+\mathbf{B}_{2,2}\right) \\ \mathbf{M}_{2} &=\left(\mathbf{A}_{2,1}+\mathbf{A}_{2,2}\right) \mathbf{B}_{1,1} \\ \mathbf{M}_{3} &=\mathbf{A}_{1,1}\left(\mathbf{B}_{1,2}-\mathbf{B}_{2,2}\right) \\ \mathbf{M}_{4} &=\mathbf{A}_{2,2}\left(\mathbf{B}_{2,1}-\mathbf{B}_{1,1}\right) \\ \mathbf{M}_{5} &=\left(\mathbf{A}_{1,1}+\mathbf{A}_{1,2}\right) \mathbf{B}_{2,2} \\ \mathbf{M}_{6} &=\left(\mathbf{A}_{2,1}-\mathbf{A}_{1,1}\right)\left(\mathbf{B}_{1,1}+\mathbf{B}_{1,2}\right) \\ \mathbf{M}_{7} &=\left(\mathbf{A}_{1,2}-\mathbf{A}_{2,2}\right)\left(\mathbf{B}_{2,1}+\mathbf{B}_{2,2}\right) \end{aligned}$$

And then $$\begin{aligned} \mathbf{C}_{1,1} &=\mathbf{M}_{1}+\mathbf{M}_{4}-\mathbf{M}_{5}+\mathbf{M}_{7} \\ \mathbf{C}_{1,2} &=\mathbf{M}_{3}+\mathbf{M}_{5} \\ \mathbf{C}_{2,1} &=\mathbf{M}_{2}+\mathbf{M}_{4} \\ \mathbf{C}_{2,2} &=\mathbf{M}_{1}-\mathbf{M}_{2}+\mathbf{M}_{3}+\mathbf{M}_{6} \end{aligned}$$
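
A minimal one-level Strassen sketch in NumPy (recursing only once; a full implementation would recurse on the blocks and pad to $2^n\times 2^n$):

import numpy as np

def strassen_once(A, B):
    """One level of Strassen's algorithm: 7 block multiplications instead of 8.
    Assumes A and B are square with even dimension."""
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])

A, B = np.random.rand(4, 4), np.random.rand(4, 4)
assert np.allclose(strassen_once(A, B), A @ B)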

Coppersmith–Winograd Algorithms

For matrix multiplication, $U = V = W = \mathbb F^{n \times n}$, and we write this bilinear map as $\phi=\langle n, n, n\rangle$ where $\phi(\cdot, \cdot): U\times V\mapsto W$ is bilinear.

The tensor corresponding to the multiplication of an $m\times n$ matrix by an $n\times p$ matrix is $$\langle m, n, p\rangle=\sum_{i=1}^{m}\sum_{j=1}^{p}\sum_{k=1}^{n} a_{ik}\otimes b_{kj}\otimes c_{ij}.$$

One can define in a natural way the tensor product of two tensors. In particular, for matrix multiplication tensors, we obtain the following identity: for any positive integers $m, m_0, n, n_0, p, p_0$, $$\langle m, n, p\rangle\otimes \langle m_0, n_0, p_0\rangle=\langle m m_0, n n_0, p p_0\rangle.$$

Consider three vector spaces $U$, $V$ and $W$ over the field $\mathbb F$ and

  • $U=\mathrm{span}\{x_1, \dots, x_{\dim(U)}\}$,
  • $V=\mathrm{span}\{y_1, \dots, y_{\dim(V)}\}$,
  • $W=\mathrm{span}\{z_1, \dots, z_{\dim(W)}\}$.

A tensor over $(U, V, W)$ is an element of $U\otimes V\otimes W$, i.e., a formal sum $$T=\sum_{u=1}^{\dim(U)}\sum_{v=1}^{\dim(V)}\sum_{w=1}^{\dim(W)}\underbrace{d_{uvw}}_{\in\mathbb F}\, x_{u}\otimes y_{v}\otimes z_{w}.$$

We use tensors (and their low-rank decompositions) to multiply matrices faster than $O(n^3)$. Define the matrix multiplication tensor as follows: $$ M_{(a, b), (c, d), (e, f)}^{(n)} =\begin{cases} 1, &\text{if $b=c, d=e, f=a$}, \\ 0, &\text{otherwise}. \end{cases} $$ Suppose $\mathrm{rank}(M^{(n)})\leq r$, i.e.,

$$\displaystyle{M_{(a,b),(c,d),(e,f)}^{(n)}=\sum_{\ell=1}^r x_{ab}^\ell y_{cd}^\ell z_{ef}^\ell.}$$

Then we can use this decomposition to re-express matrix multiplication: $$\begin{aligned} (AB)_{ik}&=\sum_{j}A_{ij}B_{jk}\\ &=\sum_{j}M_{(i, j), (j, k), (k, i)}^{(n)}A_{ij}B_{jk}\\ &= \sum_{a=1}^n\sum_{b=1}^n\sum_{c=1}^n\sum_{d=1}^n M_{(a, b), (c, d), (k, i)}^{(n)}A_{ab}B_{cd}\\ &= \sum_{a=1}^n\sum_{b=1}^n\sum_{c=1}^n\sum_{d=1}^n \Big(\sum_{\ell=1}^{r} x_{ab}^{\ell} y_{cd}^{\ell} z_{ki}^{\ell}\Big)A_{ab}B_{cd}\\ &= \sum_{\ell=1}^r z_{ki}^{\ell}\Big(\sum_{a=1}^n\sum_{b=1}^n x_{ab}^{\ell} A_{ab}\Big)\Big(\sum_{c=1}^n\sum_{d=1}^n y_{cd}^{\ell} B_{cd}\Big) \end{aligned}$$
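
To make the definition of the matrix multiplication tensor concrete, the following sketch builds $M^{(n)}$ explicitly and checks the contraction identity above (no low-rank decomposition is used here, so this is only a correctness check of the definition, not a fast algorithm):

import numpy as np

n = 3
# M[a, b, c, d, e, f] = 1 iff b == c, d == e, f == a
M = np.zeros((n,) * 6)
for a in range(n):
    for b in range(n):
        for d in range(n):
            M[a, b, b, d, d, a] = 1.0

A, B = np.random.rand(n, n), np.random.rand(n, n)
# (AB)_{ik} = sum_{a,b,c,d} M_{(a,b),(c,d),(k,i)} A_{ab} B_{cd}
C = np.einsum('abcdki,ab,cd->ik', M, A, B)
assert np.allclose(C, A @ B)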

Linear Algebra Packages


Butterflies

Fast linear transforms are ubiquitous in machine learning, including the discrete Fourier transform, discrete cosine transform, and other structured transformations such as convolutions. All of these transforms can be represented by dense matrix-vector multiplication, yet each has a specialized and highly efficient (subquadratic) algorithm. We ask to what extent hand-crafting these algorithms and implementations is necessary, what structural priors they encode, and how much knowledge is required to automatically learn a fast algorithm for a provided structured transform. Motivated by a characterization of fast matrix-vector multiplication as products of sparse matrices, we introduce a parameterization of divide-and-conquer methods that is capable of representing a large class of transforms. This generic formulation can automatically learn an efficient algorithm for many important transforms; for example, it recovers the $O(N\log N)$ Cooley-Tukey FFT algorithm to machine precision, for dimensions N up to 1024. Furthermore, our method can be incorporated as a lightweight replacement of generic matrices in machine learning pipelines to learn efficient and compressible transformations. On a standard task of compressing a single hidden-layer network, our method exceeds the classification accuracy of unconstrained matrices on CIFAR-10 by 3.9 points ---the first time a structured approach has done so---with 4X faster inference speed and 40X fewer parameters.

Automatic Differentiation, Differentiable Programming and Program Transformations

Automatic Differentiation

All gradient-based numerical optimization methods benefit from faster computation of gradients, especially backpropagation.

Many algorithms in machine learning, computer vision, physical simulation, and other fields require the calculation of gradients and other derivatives. Manual derivation of gradients can be time consuming and error-prone. Automatic Differentiation (AD) is a technology for automatically augmenting computer programs, including arbitrarily complex simulations, with statements for the computation of derivatives, also known as sensitivities. Automatic differentiation comprises a set of techniques to calculate the derivative of a numerical computation expressed as a computer program. These techniques are commonly used in atmospheric sciences and computational fluid dynamics, and have more recently also been adopted by machine learning researchers. Practitioners across many fields have built a wide set of automatic differentiation tools, using different programming languages, computational primitives and intermediate compiler representations. Each of these choices comes with positive and negative trade-offs, in terms of their usability, flexibility and performance in specific domains.

In the ideal case, automatically generated derivatives should be competitive with manually generated ones and run at near-peak performance on modern hardware, but the most expressive systems for autodiff which can handle arbitrary, Turing-complete programs, are unsuited for performance-critical applications, such as large-scale machine learning or physical simulation. Alternatively, the most performant systems are not designed for use outside of their designated application space, e.g. graphics or neural networks.

“What does AD mean, independently of implementation?” An answer arises in the form of naturality of sampling a function and its derivative. Automatic differentiation flows out of this naturality condition, together with the chain rule. Graduating from first-order to higher-order AD corresponds to sampling all derivatives instead of just one. Next, the setting is expanded to arbitrary vector spaces, in which derivative values are linear maps. The specification of AD adapts to this elegant and very general setting, which even simplifies the development.

Differentiable Programming

Deep learning may look like another passing fad, in the vein of "expert systems" or "big data." But it's based on two timeless ideas (back-propagation and weight-tying), and while differentiable programming is a very new concept, it's a natural extension of these ideas that may prove timeless itself. Even as specific implementations, architectures, and technical phrases go in and out of fashion, these core concepts will continue to be essential to the success of AI.

Constructing neural networks using pure and higher-order differentiable functions and training them using reverse-mode automatic differentiation is unsurprisingly called Differentiable Programming.


| | Deep Learning | Differentiable Programming |
| --- | --- | --- |
| Primary purpose | Learning | Learning + Optimization |
| Typical usage | Learn-once, Eval-many | Learn-once, Eval-once |
| Input granularity | Fat objects (images, voice sequences, lidar scans, full text pages) | Thin objects (products, clients, SKUs, prices) |
| Input variety | Homogeneous objects (e.g. images all having the same height/width ratio) | Heterogeneous objects (relational tables, graphs, time-series) |

Program Transformations

Program Transformations for Machine Learning- Workshop at NeurIPS 2019 – December 13 or 14 2019, Vancouver, Canada - claims that

Machine learning researchers often express complex models as a program, relying on program transformations to add functionality. New languages and transformations (e.g., TorchScript and TensorFlow AutoGraph) are becoming core capabilities of ML libraries. However, existing transformations, such as automatic differentiation (AD or autodiff), inference in probabilistic programming languages (PPLs), and optimizing compilers are often built in isolation, and limited in scope. This workshop aims at viewing program transformations in ML in a unified light, making these capabilities more accessible, and building entirely new ones.

Program transformations are an area of active study. AD transforms a program performing numerical computation into one computing the gradient of those computations. In probabilistic programming, a program describing a sampling procedure can be modified to perform inference on model parameters given observations. Other examples are vectorizing a program expressed on one data point, and learned transformations where ML models use programs as inputs or outputs.

This workshop will bring together researchers in the fields of AD, probabilistic programming, programming languages, compilers, and ML with the goal of understanding the commonalities between disparate approaches and views, and sharing ways to make these techniques broadly available. It would enable ML practitioners to iterate faster on novel models and architectures (e.g., those naturally expressed through high-level constructs like recursion).

Open Auto-differentiation Library

Deep Model Compression


CNNs are the most widely used deep learning models in computer vision.

| Theme Name | Description | Application | More Details |
| --- | --- | --- | --- |
| Parameter pruning and sharing | Reducing redundant parameters which are not sensitive to the performance | Convolutional layer and fully connected layer | Robust to various settings, can achieve good performance, supports both training from scratch and pre-trained models |
| Low-rank factorization | Using matrix/tensor decomposition to estimate the informative parameters | Convolutional layer and fully connected layer | Standardized pipeline, easy to implement, supports both training from scratch and pre-trained models |
| Transferred/compact convolutional filters | Designing special structural convolutional filters to save parameters | Convolutional layer only | Algorithms are dependent on applications, usually achieve good performance, only support training from scratch |
| Knowledge distillation | Training a compact neural network with distilled knowledge of a large model | Convolutional layer and fully connected layer | Model performance is sensitive to applications and network structure, only supports training from scratch |

Fixed-point Arithmetic and Approximate Computing

Today’s computing systems are designed to deliver only exact solutions at high energy cost, while many of the algorithms that are run on data are at their heart statistical, and thus do not require exact answers.

It turns out that it is sometimes possible to get high-accuracy solutions from low-precision training— and here we'll describe a new variant of stochastic gradient descent (SGD) called high-accuracy low precision (HALP) that can do it. HALP can do better than previous algorithms because it reduces the two sources of noise that limit the accuracy of low-precision SGD: gradient variance and round-off error.

Huffman Encoding

Huffman coding is a type of optimal prefix code that is commonly used for lossless data compression. It produces a variable-length code table for encoding source symbols. The table is derived from the occurrence probability of each symbol. As in other entropy encoding methods, more common symbols are represented with fewer bits than less common symbols, thus saving total space.
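
A minimal Huffman-coding sketch using Python's heapq (the symbols and probabilities are illustrative):

import heapq

def huffman_code(freqs):
    """Build a prefix code from a dict {symbol: frequency}."""
    # Each heap entry: (total_frequency, tie_breaker, {symbol: code_so_far})
    heap = [(f, i, {s: ''}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # merge the two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + code for s, code in c1.items()}
        merged.update({s: '1' + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

print(huffman_code({'a': 0.45, 'b': 0.25, 'c': 0.15, 'd': 0.15}))
# more common symbols get shorter codes, e.g. 'a' -> 1 bit, 'c' and 'd' -> 3 bits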

Knowledge Distillation

Distillation (Hinton et al., 2015) is a model compression approach in which a pre-trained large model teaches a smaller model to achieve similar prediction performance. It is often called "teacher-student" training, where the large model is the teacher and the smaller model is the student.

Initially, it is used to compress the knowledge in an ensemble into a single model which is much easier to deploy.

The core idea is that an obvious way to transfer the generalization ability of the cumbersome model to a small model is to use the class probabilities produced by the cumbersome model as “soft targets” for training the small model. When the soft targets have high entropy, they provide much more information per training case than hard targets and much less variance in the gradient between training cases, so the small model can often be trained on much less data than the original cumbersome model and using a much higher learning rate.

With distillation, knowledge can be transferred from the teacher model to the student by minimizing a loss function that recovers the distribution of class probabilities predicted by the teacher model. In most situations, the probability of the correct class predicted by the teacher model is very high, and the probabilities of the other classes are close to 0, which may not provide much extra information beyond the ground-truth labels. To overcome this issue, a commonly used solution is to raise the temperature of the final softmax function until the cumbersome model produces a suitably soft set of targets. The softened probability $q_i$ of class $i$ is calculated from the logit $z_i$: $$q_i = \frac{\exp \left( z_i / T \right)}{\sum_j{\exp \left( z_j / T \right)}}$$

where $T$ is the temperature. As $T$ grows, the probability distribution becomes smoother, providing more information about which classes the cumbersome model considers similar to the predicted class. It is better to also include the standard loss ($T=1$) between the predicted class probabilities and the ground-truth labels. The overall loss function is given by:

$$L(x;W)=H(y,\sigma(z_s;T=1))+\alpha\cdot H(\sigma(z_t;T=\tau),\sigma(z_s;T=\tau))$$ where $x$ is the input, $W$ are the parameters of the distilled small model, $y$ is the ground-truth label, $\sigma$ is the softmax parameterized by temperature $T$, $H$ is the cross-entropy loss, and $\alpha$ is the coefficient of the distillation loss.
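
A hedged NumPy sketch of the distillation loss above (the logits z_t, z_s, labels, temperature, and alpha are placeholders; in practice this would be written as a framework loss function):

import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(p, q, eps=1e-12):
    # H(p, q) = -sum_i p_i log q_i, averaged over the batch
    return -(p * np.log(q + eps)).sum(axis=-1).mean()

def distillation_loss(z_s, z_t, y_onehot, T=4.0, alpha=0.5):
    hard = cross_entropy(y_onehot, softmax(z_s, T=1.0))          # standard loss at T=1
    soft = cross_entropy(softmax(z_t, T=T), softmax(z_s, T=T))   # teacher soft targets
    return hard + alpha * soft

z_t = np.random.randn(8, 10)      # teacher logits (batch of 8, 10 classes)
z_s = np.random.randn(8, 10)      # student logits
y = np.eye(10)[np.random.randint(0, 10, size=8)]
print(distillation_loss(z_s, z_t, y))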

Knowledge Distillation (KD) is a widely used approach to transfer the output information from a heavy network to a smaller network to achieve higher performance. The student network can be optimized using the following loss function based on knowledge distillation: $$\mathcal L_{KD}=\frac{1}{n}\sum_{i=1}^{n}H(y_S^i, y_T^i).$$

Therefore, utilizing the knowledge transfer technique, a portable network can be optimized without the specific architecture of the given network.

Parameter Pruning and Sharing

Pruning removes connections in a deep neural network in order to reduce the number of weights; a minimal sketch follows the list below.

  • Learn the connectivity via normal network training
  • Prune the low-weight connections
  • Retrain the sparse network
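
A minimal magnitude-pruning sketch in NumPy (the sparsity level is illustrative; retraining would follow with the mask kept fixed):

import numpy as np

def magnitude_prune(W, sparsity=0.9):
    """Zero out the smallest-magnitude weights so that `sparsity` of them are removed."""
    threshold = np.quantile(np.abs(W), sparsity)
    mask = (np.abs(W) > threshold).astype(W.dtype)
    return W * mask, mask   # keep the mask to freeze pruned weights during retraining

W = np.random.randn(256, 256)
W_pruned, mask = magnitude_prune(W, sparsity=0.9)
print(1.0 - mask.mean())     # roughly 0.9 of the connections are pruned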


Quantization and Fixed-point Arithmetic

Network quantization compresses the original network by reducing the number of bits required to represent each weight.

Uniform quantization is widely used for model compression and acceleration. Originally the weights in the network are represented by 32-bit floating-point numbers. With uniform quantization, low-precision (e.g. 4-bit or 8-bit) fixed-point numbers are used to approximate the full-precision network. For $k$-bit quantization, the memory saving can be up to $32/k$. For example, 8-bit quantization can reduce the network size by a factor of 4 with a negligible drop in performance. The $l$-th quantized ReLU $\sigma(x_l, \alpha_l)$ acts element-wise on the vector $x_l$ from a previous layer and is parameterized by a trainable scalar $\alpha_l>0$. In uniform quantization, $$ \sigma (x,\alpha ) = \begin{cases} 0, & \text{if}\quad x \leq 0,\\ k\alpha, & \text{if}\quad \left(k-1\right)\alpha < x \leq k\alpha, \quad k = 1, 2, \dots, 2^{b_a}-1,\\ \left(2^{b_a}-1\right)\alpha, & \text{if}\quad x > \left(2^{b_a}-1\right)\alpha, \end{cases} \tag{1} $$ where $x$ is the scalar input, $b_a$ is the bit-width, and $k$ is the quantization level. For a 4-bit quantization, $b_a=4$ and $2^{b_a}=16$ levels exist, including zero.
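
A hedged NumPy sketch of the quantized ReLU in equation (1) (the values of b_a and alpha are illustrative; during training, the straight-through estimator described below treats this step as the identity on the active range when backpropagating):

import numpy as np

def quantized_relu(x, alpha, b_a=4):
    """Uniform quantization of a ReLU activation as in equation (1)."""
    levels = 2 ** b_a - 1                       # e.g. 15 nonzero levels for 4 bits
    k = np.ceil(np.clip(x, 0.0, None) / alpha)  # which quantization level each input falls into
    k = np.clip(k, 0, levels)                   # clamp the top of the range
    return k * alpha

x = np.linspace(-1.0, 2.0, 7)
print(quantized_relu(x, alpha=0.25, b_a=4))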

Given a pre-defined full-precision model, the learner inserts quantization nodes and operations into the computation graph of the model. With activation quantization enabled, quantization nodes will also be placed after activation operations (e.g. ReLU).

In the training phase, both full-precision and quantized weights are kept. In the forward pass, quantized weights are obtained by applying the quantization function on full-precision weights. To update full-precision weights in the backward pass, since gradients w.r.t. quantized weights are zeros almost everywhere, we use the straight-through estimator (STE, Bengio et al., 2015) to pass gradients of quantized weights directly to full-precision weights for update.

Fixed-point Arithmetic

The precision of a fixed-point number is the number of digits to the right of the decimal point, and it normally stays the same when computations are performed on the number.

Low Bit Neural Network

8-bit-training

The state-of-the-art hardware platforms for training Deep Neural Networks (DNNs) are moving from traditional single precision (32-bit) computations towards 16 bits of precision -- in large part due to the high energy efficiency and smaller bit storage associated with using reduced-precision representations. However, unlike inference, training with numbers represented with less than 16 bits has been challenging due to the need to maintain fidelity of the gradient computations during back-propagation. Here we demonstrate, for the first time, the successful training of DNNs using 8-bit floating point numbers while fully maintaining the accuracy on a spectrum of Deep Learning models and datasets. In addition to reducing the data and computation precision to 8 bits, we also successfully reduce the arithmetic precision for additions (used in partial product accumulation and weight updates) from 32 bits to 16 bits through the introduction of a number of key ideas including chunk-based accumulation and floating point stochastic rounding.

Binarized Neural Network, Ternary Weight Networks, XOR-Net

Binarized Neural Network

Binary neural networks are networks with binary weights and activations at run time. At training time these weights and activations are used for computing gradients; however, the gradients and true weights are stored in full precision. This procedure allows us to effectively train a network on systems with fewer resources.

Forward Binarization

For forward propagation, we need two binary matrices; we thus binarize the weight matrix and the incoming activation from the previous layer.

A key to the success of BNNs is the binary activation function, which clamps all negative inputs to −1 and all positive inputs to 1. There are two binarization functions:

  • deterministic: $x^b = \mathrm{Sign}(x)$, i.e. $+1$ if $x\geq 0$ and $-1$ otherwise;
  • stochastic: $x^b = +1$ with probability $p=\sigma(x)$ and $x^b = -1$ with probability $1-p$.

Here $\sigma(x)$ is the "hard sigmoid" function: $\sigma(x)=\max(0, \min(1, \frac{x+1}{2}))$. The stochastic binarization is better than the sign function but is harder to implement. As a result, the deterministic sign function is used more often.
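
A hedged sketch of the two binarization functions (the sign convention at zero and the random generator are illustrative):

import numpy as np

def hard_sigmoid(x):
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def binarize_deterministic(x):
    # Sign function mapping to {-1, +1}; zero is mapped to +1 here by convention
    return np.where(x >= 0, 1.0, -1.0)

def binarize_stochastic(x, rng=np.random.default_rng(0)):
    # +1 with probability hard_sigmoid(x), -1 otherwise
    p = hard_sigmoid(x)
    return np.where(rng.random(x.shape) < p, 1.0, -1.0)

x = np.array([-2.0, -0.3, 0.0, 0.4, 1.5])
print(binarize_deterministic(x))
print(binarize_stochastic(x))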

Gradient Propagation Through Discretization

The derivative of the sign function is zero almost everywhere, making it incompatible with backpropagation. Thus, a straight-through estimator is used. This preserves the gradient's information and cancels large gradients.

While updating the weights, the following is done:

Each real valued weight, $w^r$, is constrained to remain between -1 and +1. If a weight update brings $w^r$ outside $[-1, 1]$, it is clipped. This is done because otherwise, the real-valued weights will grow very large without having any impact on the binary weights, $w^b$. The new updated binary weights are then calculated as $w^b = Sign(w^r)$.

This network has the following layers:

  • Fully connected (128)
  • Ramp - rectified linear unit (ReLU) activation function
  • Binarize activations
  • Fully connected (128)
  • Ramp - ReLU activation function
  • Binarize activations
  • Fully connected (10)
  • Sigmoid activation function

Ternary Weight Networks

Ternary weight networks (TWNs) are neural networks with weights constrained to +1, 0 and -1. The idea dates back to the 1988 paper Learning algorithms with neural network with ternary weights.

XNOR-Net

In Binary-Weight-Networks, the filters are approximated with binary values, resulting in 32× memory savings. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58× faster convolutional operations and 32× memory savings.


Mixed Precision Training

Mixed-precision training lowers the required resources by using lower-precision arithmetic, which has the following benefits.

  • Decrease the required amount of memory. Half-precision floating point format (FP16) uses 16 bits, compared to 32 bits for single precision (FP32). Lowering the required memory enables training of larger models or training with larger minibatches.
  • Shorten the training or inference time. Execution time can be sensitive to memory or arithmetic bandwidth. Half-precision halves the number of bytes accessed, thus reducing the time spent in memory-limited layers.

In a histogram of gradient values from one example training session, 66.8% of values were zero, whereas only 4% of values were between $2^{-32}$ and $2^{-30}$. A very efficient way to ensure that gradients fall into the range representable by half precision is to multiply the training loss by a scale factor. This adds just a single multiplication, and by the chain rule it ensures that all the gradients are scaled up (or shifted up) at no additional cost. Loss scaling ensures that relevant gradient values otherwise lost to zeros are recovered. Weight gradients need to be scaled down by the same factor $S$ before the weight update. The scale-down operation can be fused with the weight update itself (resulting in no extra memory accesses) or carried out separately.

The additions to the traditional iteration procedure are the FP16 copy and the loss-scaling steps below (a minimal sketch follows the list).

  1. Make an FP16 copy of the weights
  2. Forward propagate using FP16 weights and activations
  3. Multiply the resulting loss by the scale factor S
  4. Backward propagate using FP16 weights, activations, and their gradients
  5. Multiply the weight gradients by 1/S
  6. Optionally process the weight gradients (gradient clipping, weight decay, etc.)
  7. Update the master copy of weights in FP32
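
A hedged NumPy sketch of one loss-scaled iteration on a toy linear least-squares model (the model, data, learning rate, and scale factor S are all illustrative; real mixed-precision training relies on FP16 hardware kernels rather than NumPy casts):

import numpy as np

S = 1024.0                                   # loss scale factor
w_master = np.zeros(4, dtype=np.float32)     # FP32 master copy of the weights
X = np.random.rand(32, 4).astype(np.float32)
y = X @ np.array([1.0, -2.0, 0.5, 3.0], dtype=np.float32)

for step in range(300):
    w16 = w_master.astype(np.float16)                      # 1. FP16 copy of the weights
    pred = (X.astype(np.float16) @ w16).astype(np.float32) # 2. forward pass in FP16
    err = pred - y
    scaled_grad = (2.0 / len(X)) * (X.T @ (err * S))       # 3-4. backprop on the S-scaled loss
    grad = scaled_grad / S                                  # 5. multiply gradients by 1/S
    w_master -= 0.5 * grad                                  # 7. update the FP32 master weights

print(np.round(w_master, 2))   # close to [1, -2, 0.5, 3]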

AdderNet

In AdderNets, we take the $\ell_1$-norm distance between filters and input feature as the output response.

The convolution in a CNN is replaced by calculating the $\ell_1$-norm distance between the filter and the input.
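
A small sketch contrasting the usual cross-correlation response with an AdderNet-style $\ell_1$ response for a single filter position (the shapes are illustrative):

import numpy as np

patch = np.random.randn(3, 3, 16)    # input patch (k x k x channels)
filt = np.random.randn(3, 3, 16)     # one convolutional filter

conv_response = np.sum(patch * filt)            # standard convolution: multiply-accumulate
adder_response = -np.sum(np.abs(patch - filt))  # AdderNet: negative L1 distance, additions only
print(conv_response, adder_response)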

Blended Coarse Gradient Descent

Coarse gradient is generally not a gradient of any function but an artificial ascent direction. The weight update of BCGD goes by coarse gradient correction of a weighted average of the full-precision weights and their quantization (the so-called blending), which yields sufficient descent in the objective value and thus accelerates the training.

Low-precision Training

It turns out that DNNs can work with smaller data types with less precision, such as 8-bit integers. Roughly speaking, we work with a sparser, quantized number line: the numbers are discretized to some specific values, which we can then represent using integers instead of floating-point numbers.

High-accuracy Low Precision

High-accuracy low precision (HALP) is our algorithm which runs SVRG and uses bit centering with a full gradient at every epoch to update the low-precision representation. It can do better than previous algorithms because it reduces the two sources of noise that limit the accuracy of low-precision SGD: gradient variance and round-off error.

  • To reduce noise from gradient variance, HALP uses a known technique called stochastic variance-reduced gradient (SVRG). SVRG periodically uses full gradients to decrease the variance of the gradient samples used in SGD.
  • To reduce noise from quantizing numbers into a low-precision representation, HALP uses a new technique we call bit centering. The intuition behind bit centering is that as we get closer to the optimum, the gradient gets smaller in magnitude and in some sense carries less information, so we should be able to compress it. By dynamically re-centering and re-scaling our low-precision numbers, we can lower the quantization noise as the algorithm converges.

Ultra-Low Precision Training

There are three primary challenges that make it difficult to scale precision below 16 bits while fully preserving model accuracy. Firstly, when all the operands (i.e., weights, activations, errors, and gradients) for general matrix multiplication (GEMM) and convolution computations are simply reduced to 8 bits, most DNNs suffer noticeable accuracy degradation. Secondly, reducing the bit precision of accumulations in GEMM from 32 bits to 16 bits significantly impacts the convergence of DNN training. This is why commercially available hardware platforms exploiting scaled precision for training (including GPUs) still continue to use 32 bits of precision for accumulation. Reducing accumulation bit precision below 32 bits is critically important for reducing the area and power of 8-bit hardware. Finally, reducing the bit precision of weight updates to 16-bit floating-point impacts accuracy, while 32-bit weight updates, used in today’s systems, require an extra copy of the high-precision weights and gradients to be kept in memory, which is expensive.

ADMM-NN

We can apply the alternating direction method of multipliers (ADMM) to train deep neural networks. The first part of ADMM-NN is a systematic, joint framework of DNN weight pruning and quantization using ADMM. It can be understood as a smart regularization technique with the regularization target dynamically updated in each ADMM iteration, thereby resulting in higher performance in model compression than prior work. The second part is hardware-aware DNN optimizations to facilitate hardware-level implementations. Without accuracy loss, we can achieve 85× and 24× pruning on LeNet-5 and AlexNet models, respectively, significantly higher than prior work. The improvement becomes more significant when focusing on computation reductions. Combining weight pruning and quantization, we achieve 1,910× and 231× reductions in overall model size on these two benchmarks, when focusing on data storage. Highly promising results are also observed on other representative DNNs such as VGGNet and ResNet-50.

Transferred/Compact Convolutional Filters

Transfer learning methods have demonstrated state-of-the-art performance on various small-scale image classification tasks. This is generally achieved by exploiting the information from an ImageNet convolutional neural network (ImageNet CNN). However, the transferred CNN model generally has high computational complexity and storage requirements. This raises issues for real-world applications, especially for portable devices like phones and tablets without high-performance GPUs. Several approximation methods have been proposed to reduce the complexity by reconstructing the linear or non-linear filters (responses) in convolutional layers with a series of small ones.

Tensor Methods

Note that deep learning models are composites of linear and non-linear maps, and the linear maps are based on matrices.

These methods take a layer and decompose it into several smaller layers. Although there will be more layers after the decomposition, the total number of floating-point operations and weights will be smaller. Some reported results are on the order of 8× for entire networks (not aimed at large tasks like ImageNet, though), or 4× for specific layers inside ImageNet models. My experience was that with these decompositions I was able to get a speedup of between 2× and 4×, depending on the accuracy drop I was willing to take.

Singular value decomposition

The matrix $A_{m\times n}$ can be decomposed as the product of two thin matrices, $A_{m\times n}= Q_{m\times r}R_{r\times n}$, so that the storage drops from $O(mn)$ to $O((m+n)r)$.
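
A minimal truncated-SVD compression sketch for a fully connected layer's weight matrix (the rank r is illustrative and would in practice be chosen by an accuracy budget):

import numpy as np

m, n, r = 512, 256, 32
W = np.random.randn(m, n)

U, s, Vt = np.linalg.svd(W, full_matrices=False)
Q = U[:, :r] * s[:r]          # m x r factor
R = Vt[:r, :]                 # r x n factor
W_approx = Q @ R              # stored as two thin factors: (m + n) * r numbers

print(W.size, Q.size + R.size)                       # 131072 vs 24576 parameters
print(np.linalg.norm(W - W_approx) / np.linalg.norm(W))  # relative approximation error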

To explore a low-rank subspace combined with a sparse structure for the weight matrix $W$, we assume that $W \approx L+S$, where $L$ is a low-rank component and $S$ is a sparse matrix. Then, to compress the weight matrix, we have the following model: $$ \min_{L, S}\frac{1}{2}{\|W-L-S\|}_F^2,\quad \text{s.t.}\quad \mathrm{rank}(L) \leq r,\; \mathrm{card}(S)\leq c,$$ where $\mathrm{rank}(L)$ denotes the rank of $L$ and $\mathrm{card}(S)$ denotes the cardinality of the matrix $S$.

And a Toeplitz structure can be applied to approximate the weight matrix: $$ W = {\alpha}_1 T_{1} T_{2}^{-1} + {\alpha}_2 T_3 T_{4}^{-1} T_{5}, $$

where $W$ is the square weight matrix and $T_1, T_2, T_3, T_4, T_5$ are square Toeplitz matrices.


Compressing Recurrent Neural Network

All of the techniques above can be applied to fully connected networks or, more generally, feed-forward networks. An RNN is a feedback network: there are cycles in its computational graph.

Compressing GANs

GAN-pruning

Compressed Transformer

Compressed BERT

Hashing-accelerated neural networks

Our approach is compellingly simple: we use a hash function to group network connections into hash buckets uniformly at random such that all connections grouped to the $i$-th hash bucket share the same weight value $w_i$. Our parameter hashing is akin to prior work in feature hashing and requires no additional memory overhead. The backpropagation algorithm can naturally tune the hash bucket parameters and take into account the random weight sharing within the neural network architecture.
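
A hedged sketch of this weight-sharing idea (the layer shapes are illustrative, and a fixed random bucket assignment stands in for the hash function; backpropagation into the shared weights is not shown):

import numpy as np

def hashed_linear(x, shared_weights, out_dim, seed=0):
    """Virtual (in_dim x out_dim) weight matrix backed by a small shared parameter vector."""
    in_dim = x.shape[-1]
    rng = np.random.default_rng(seed)
    # Map each (i, j) connection to a bucket; all connections in a bucket share one weight
    buckets = rng.integers(0, len(shared_weights), size=(in_dim, out_dim))
    W_virtual = shared_weights[buckets]
    return x @ W_virtual

shared = np.random.randn(64)          # only 64 real parameters
x = np.random.randn(8, 100)
print(hashed_linear(x, shared, out_dim=50).shape)   # (8, 50) from a virtual 100x50 matrix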

Distributed Training

The challenge of deep learning $T(x;\Theta)$ is big models and big data: $\Theta$ may live in such a high-dimensional space that it cannot be stored on a single laptop computer, and the training process, which finds the optimal parameters $\arg\min_{\Theta}\sum_{i}L(T(x_i;\Theta), y_i)$, requires a sufficiently large dataset.

Sometimes we need to partition the model or the data across different machines. In other words, the model or the data is distributed over a number of machines.

Distributed training of deep learning models is a branch of distributed computation.

Training advanced deep learning models is challenging. Beyond model design, model scientists also need to set up the state-of-the-art training techniques such as distributed training, mixed precision, gradient accumulation, and checkpointing. Yet still, scientists may not achieve the desired system performance and convergence rate. Large model sizes are even more challenging: a large model easily runs out of memory with pure data parallelism and it is difficult to use model parallelism.

It is really important to reduce the cost of communication in distributed computation, including the communication time, frequency, content, and latency.

Accelerating Deep Learning Workloads

The more we know about the resource usage patterns of workloads, the better we can allocate resources.

PipeDream

PipeDream, a system developed as part of Microsoft Research’s Project Fiddle, introduces pipeline parallelism, a new way to parallelize DNN training by combining traditional intra-batch parallelism (model and data parallelism) with inter-batch parallelism (pipelining).

AdaptDL

AdaptDL is a resource-adaptive deep learning (DL) training and scheduling framework. The goal of AdaptDL is to make distributed DL easy and efficient in dynamic-resource environments such as shared clusters and the cloud.

Efficient Communication for Distributed Training

The communication cost of distributed training depends on the content which the distributed machines share.

DeepSpeed

The DeepSpeed API is a lightweight wrapper on PyTorch. This means that you can use everything you love in PyTorch and without learning a new platform. In addition, DeepSpeed manages all of the boilerplate state-of-the-art training techniques, such as distributed training, mixed precision, gradient accumulation, and checkpoints so that you can focus on your model development. Most importantly, you can leverage the distinctive efficiency and effectiveness benefit of DeepSpeed to boost speed and scale with just a few lines of code changes to your PyTorch models.

NCCL

NCCL (pronounced "Nickel") is a stand-alone library of standard communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, as well as any send/receive based communication pattern. It has been optimized to achieve high bandwidth on platforms using PCIe, NVLink, NVswitch, as well as networking using InfiniBand Verbs or TCP/IP sockets. NCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.

Gradient Code and Compression

Gradient code and compression is to accelerate the distributed training of deep learning models by reducing the communication cost.

Gradient Code and Approximate Gradient Coding

Some nodes may be much slower than others in a parallel or distributed computing environment. With redundancy, we may not need every node to finish. See prior articles for more on this topic.

Approximate gradient coding allows us to tolerate more stragglers with less work.


Gradient Compression

In distributed training of machine learning models with stochastic optimization, the exchange of parameter updates between workers often is a bottleneck that limits the scalability of distributed training. This is especially true for models with a large parameter space, such as neural networks. Several techniques have been proposed to enhance scalability by compressing gradients, e.g. by sending a sparse set of coordinates only, or by quantization. We study the gradient compression literature from both sides: on the one hand, we study properties of these algorithms in a distributed setting, and their effectiveness for speed and scalability. On the other hand, we explore properties of the minima found by these algorithms, such as robustness or generalisation.

Deep Gradient Compression @ MIT

Gradient Compression @ epfl

We study gradient compression methods to alleviate the communication bottleneck in data-parallel distributed optimization. Despite the significant attention received, current compression schemes either do not scale well or fail to achieve the target test accuracy. We propose a new low-rank gradient compressor based on power iteration that can i) compress gradients rapidly, ii) efficiently aggregate the compressed gradients using all-reduce, and iii) achieve test performance on par with SGD. The proposed algorithm is the only method evaluated that achieves consistent wall-clock speedups when benchmarked against regular SGD with an optimized communication backend. We demonstrate reduced training times for convolutional networks as well as LSTMs on common datasets.


Gradient Compression @ Edinburgh
Gradient Compression @ kaust

Count-Sketches

Sketches are a class of algorithms that use a probabilistic data structure to approximate the distribution of the input data.

The count-sketch streaming algorithm instantiates the following framework (a minimal sketch follows the list):

  1. Find a randomized streaming algorithm whose output (as a random variable) has the desired expectation but usually high variance (i.e., noise).
  2. To reduce the variance/noise, run many independent copies in parallel and combine their outputs.
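
A hedged count-sketch implementation for frequency estimation over a stream (the hash construction and table sizes are illustrative; the median over independent rows combines the noisy unbiased estimators, as described above):

import numpy as np

class CountSketch:
    def __init__(self, rows=5, width=64, seed=0):
        rng = np.random.default_rng(seed)
        self.table = np.zeros((rows, width))
        # Per-row random hash parameters: one for the bucket index, one for a random sign
        self.bucket_seeds = rng.integers(1, 2**31, size=rows)
        self.sign_seeds = rng.integers(1, 2**31, size=rows)
        self.width = width

    def _bucket(self, row, item):
        return hash((int(self.bucket_seeds[row]), item)) % self.width

    def _sign(self, row, item):
        return 1 if hash((int(self.sign_seeds[row]), item)) % 2 else -1

    def add(self, item, count=1):
        for r in range(self.table.shape[0]):
            self.table[r, self._bucket(r, item)] += self._sign(r, item) * count

    def estimate(self, item):
        # Each row gives a noisy unbiased estimate; the median across rows reduces the noise
        return np.median([self._sign(r, item) * self.table[r, self._bucket(r, item)]
                          for r in range(self.table.shape[0])])

cs = CountSketch()
for _ in range(1000):
    cs.add('frequent')
cs.add('rare')
print(cs.estimate('frequent'), cs.estimate('rare'))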

Synthetic gradient

It’s a simple idea: rather than compute gradients through backpropagation, we can train a model to predict what those gradients will be, and use our prediction to update our weights. It’s dynamic programming for neural networks.

Gradient Centralization

Gradient Centralization (GC) is a simple and effective optimization technique for Deep Neural Networks (DNNs), which operates directly on gradients by centralizing the gradient vectors to have zero mean. It can both speedup training process and improve the final generalization performance of DNNs.

Ranger now uses Gradient Centralization by default and applies it to all conv and fc layers. However, everything is customizable, so you can test with and without it on your own datasets (turn it on or off via the "use_gc" flag at init).
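
A minimal sketch of gradient centralization for a fully connected layer's weight gradient (the axis convention is illustrative: each output neuron's gradient slice is given zero mean):

import numpy as np

def centralize_gradient(grad):
    """Subtract the mean over the non-output dimensions from each weight-gradient slice."""
    if grad.ndim > 1:
        axes = tuple(range(1, grad.ndim))   # all dims except the output-channel dim
        grad = grad - grad.mean(axis=axes, keepdims=True)
    return grad

g = np.random.randn(128, 256)                # gradient of an fc layer (out_dim x in_dim)
gc = centralize_gradient(g)
print(np.allclose(gc.mean(axis=1), 0.0))     # each row now has zero mean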

Privacy and Security

Existing security mechanisms for high-performance and distributed computing infrastructure are complex and difficult to deploy. As a result, many high-performance and distributed computing facilities do not deploy sufficient security mechanisms. This has prevented privacy-sensitive applications, such as those in the medical fields, and security-sensitive applications from using such facilities. In this project, we will develop and deploy DICE, Data Insurance in the Cluster Environment, to enhance the security in HPC and distributed computing clusters. DICE will consist of three major components: a container-based virtual cluster, a component to defend against side-channel attacks, and a secure execution ledger for auditing. The container-based virtual cluster will be developed based on the Docker Linux container. The Docker security mechanism will be enhanced by deploying an effective key management scheme for groups and by reducing the attack surface exposed to containers. Novel defense mechanisms will be developed and deployed to defend against side-channel attacks in the cluster environment by exploiting new security features in recent processors. The secure execution ledger will provide a global holistic view of program execution in the whole system, enabling auditing of the behavior of individual users as well as user groups. DICE essentially creates a two-level security model: on the (physical) cluster level, a group of (mostly) mutually trusted users share a single virtual cluster for their jobs; and inside the virtual cluster, the group may use existing security mechanisms of their software-of-choice to further refine security.

Distributed deep learning libraries

Deep learning + Spark

BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can write their deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters. To makes it easy to build Spark and BigDL applications, a high level Analytics Zoo is provided for end-to-end analytics + AI pipelines.

Drizzle is a low latency execution engine for Apache Spark that is targeted at stream processing and iterative workloads. Currently, Spark uses a BSP computation model, and notifies the scheduler at the end of each task. Invoking the scheduler at the end of each task adds overheads and results in decreased throughput and increased latency.

Horovod can additionally run on top of Apache Spark, making it possible to unify data processing and model training into a single pipeline. Once Horovod has been configured, the same infrastructure can be used to train models with any framework, making it easy to switch between TensorFlow, PyTorch, MXNet, and future frameworks as machine learning tech stacks continue to evolve.

Elephas is an extension of Keras, which allows you to run distributed deep learning models at scale with Spark.

Products and Packages

TensorFlow

TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries, and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML-powered applications.

PyTorch

PyTorch is a Python package that provides two high-level features:

  • Tensor computation (like NumPy) with strong GPU acceleration
  • Deep neural networks built on a tape-based autograd system

PyTorch has minimal framework overhead. We integrate acceleration libraries such as Intel MKL and NVIDIA (cuDNN, NCCL) to maximize speed. At the core, its CPU and GPU Tensor and neural network backends (TH, THC, THNN, THCUNN) are mature and have been tested for years.

Hence, PyTorch is quite fast – whether you run small or large neural networks.

The memory usage in PyTorch is extremely efficient compared to Torch or some of the alternatives. We've written custom memory allocators for the GPU to make sure that your deep learning models are maximally memory efficient. This enables you to train bigger deep learning models than before.

PaddlePaddle

PaddlePaddle, as the only independent R&D deep learning platform in China, has been officially open-sourced to professional communities since 2016. It is an industrial platform with advanced technologies and rich features that cover core deep learning frameworks, basic model libraries, end-to-end development kits, tools & components, as well as service platforms. PaddlePaddle originated from industrial practice with dedication and commitment to industrialization. It has been widely adopted by a wide range of sectors including manufacturing, agriculture, and enterprise service, while serving more than 2.3 million developers. With such advantages, PaddlePaddle has helped an increasing number of partners commercialize AI.

DasyDL

MindSpore

MindSpore is a new open-source deep learning training/inference framework that can be used for mobile, edge, and cloud scenarios. MindSpore is designed to provide a friendly development experience and efficient execution for data scientists and algorithm engineers, native support for the Ascend AI processor, and software-hardware co-optimization. Meanwhile, MindSpore, as a global AI open-source community, aims to further advance the development and enrichment of the AI software/hardware application ecosystem.

ModelArts

MegEngine

MegEngine is a fast, scalable and easy-to-use deep learning framework, with auto-differentiation.

Oneflow

Oneflow is an open-source deep learning platform with a whole new framework design and world-leading technology for distributed systems.

Darknet

Darknet is an open source neural network framework written in C and CUDA. It is fast, easy to install, and supports CPU and GPU computation.

Edge Computation

Edge computation performs some computation on edge devices, such as monitors and sensors, in order to send less raw data to the computation center.

Machine learning models for edge devices need to have a small footprint in terms of storage, prediction latency, and energy. One instance of where such models are desirable is resource-scarce devices and sensors in the Internet of Things (IoT) setting. Making real-time predictions locally on IoT devices without connecting to the cloud requires models that fit in a few kilobytes.

Mobile Deep Learning

Mobile deep learning aims to run deep learning models (training or inference) on mobile phones.

It is necessary to compress deep learning models in order to run them on mobile phones.

Toolkits


Inference

MNN

MNN is a highly efficient and lightweight deep learning framework. It supports inference and training of deep learning models, and has industry-leading performance for on-device inference and training. At present, MNN has been integrated into more than 20 apps of Alibaba Inc., such as Taobao, Tmall, Youku, DingTalk, and Xianyu, covering more than 70 usage scenarios such as live broadcast, short video capture, search recommendation, product search by image, interactive marketing, equity distribution, and security risk control. In addition, MNN is also used on embedded devices, such as IoT.

TNN

TNN is a high-performance and lightweight inference framework for mobile devices. It provides lots of advanced features such as cross-platform, model-compression, and code-pruning. TNN, inspired by mainstream open-source industry frameworks, integrates and leverages Youtu Lab's Rapidnet, ncnn framework.

ncnn is a high-performance neural network inference computing framework optimized for mobile platforms. ncnn has deeply considered deployment and use on mobile phones from the beginning of its design. ncnn has no third-party dependencies; it is cross-platform and runs faster than all known open-source frameworks on mobile phone CPUs. Developers can easily deploy deep learning algorithm models to the mobile platform using the efficient ncnn implementation, create intelligent apps, and bring artificial intelligence to your fingertips. ncnn is currently being used in many Tencent applications, such as QQ, Qzone, WeChat, Pitu and so on.

OpenVINO™

Toolkits


DNN Acceleration Frameworks
https://hgpu.org/
NVIDIA
https://developer.nvidia.com/cudnn
http://nvdla.org/
https://docs.nvidia.com/cuda/
https://developer.nvidia.com/tensorrt
cupy
intel
ideep
Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN)
https://github.com/intel/onnxruntime
nGraph,PlaidML
https://intel.github.io/mkl-dnn/
Reference workloads for modern deep learning methods.
Minerva: a fast and flexible tool for deep learning on multi-GPU.
SigDL -- Deep Learning for IoT Device and Edge Computing Embedded Targets
Menoh: fast DNN inference library with multiple programming language support
trillium
https://github.com/alibaba/MNN
https://github.com/sql-machine-learning/elasticdl
https://github.com/Tencent/TNN

Model Compression Packages
Distiller is an open-source Python package for neural network compression research.
PocketFlow
Model compression algorithms in PocketFlow
BNN
PERMDNN: Efficient Compressed DNN Architecture with Permuted Diagonal Matrices
knowledge-distillation-pytorch
keras_compressor
TensorFlow Lite, tensorflow-compression
TensorRT
https://github.com/Tencent/ncnn
Introduction to Intel® Deep Learning Deployment Toolkit