

# Enabling Binary Neural Network Training on the Edge

Erwei Wang, *Member, IEEE*, James J. Davis, *Member, IEEE*, Daniele Moro, Piotr Zielinski, Jia Jie Lim, Claudionor Coelho, Satrajit Chatterjee, Peter Y. K. Cheung, *Senior Member, IEEE*, and George A. Constantinides, *Senior Member, IEEE*

**Abstract**—The ever-growing computational demands of increasingly complex machine learning models frequently necessitate the use of powerful cloud-based infrastructure for their training. Binary neural networks are known to be promising candidates for on-device inference due to their extreme compute and memory savings over higher-precision alternatives. However, their existing training methods require the concurrent storage of high-precision activations for all layers, generally making learning on memory-constrained devices infeasible. In this article, we demonstrate that the backward propagation operations needed for binary neural network training are strongly robust to quantization, thereby making on-the-edge learning with modern models a practical proposition. We introduce a low-cost binary neural network training strategy exhibiting sizable memory footprint reductions while inducing little to no accuracy loss vs Courbariaux & Bengio’s standard approach. These decreases are primarily enabled through the retention of activations exclusively in binary format. Against the latter algorithm, our drop-in replacement sees memory requirement reductions of 3–5 $\times$ , while reaching similar test accuracy in comparable time, across a range of small-scale models trained to classify popular datasets. We also demonstrate from-scratch ImageNet training of binarized ResNet-18, achieving a 3.78 $\times$  memory reduction. Our work is open-source, and includes the Raspberry Pi-targeted prototype we used to verify our modeled memory decreases and capture the associated energy drops. Such savings will allow for unnecessary cloud offloading to be avoided, reducing latency, increasing energy efficiency, and safeguarding end-user privacy.

**Index Terms**—Deep neural network, binary neural network, training, edge devices, embedded systems, memory reduction.

## 1 INTRODUCTION

Although binary neural networks (BNNs) feature weights and activations with just single-bit precision, many models are able to reach accuracy indistinguishable from that of their higher-precision counterparts [1], [2]. Since BNNs are functionally complete, their limited precision does not impose an upper bound on achievable accuracy [3]. BNNs represent the ideal class of neural network for edge inference, particularly for custom hardware implementation, due to their use of XNOR for multiplication: a fast and cheap operation to perform. Their compact weights also suit systems with limited memory and increases opportunities for caching, providing further potential performance boosts. FINN, the seminal BNN implementation for field-programmable gate arrays,

reached the highest CIFAR-10, and SVHN classification rates to date at the time of its publication [4].

Despite featuring binary forward propagation, existing BNN training approaches perform backward propagation using high-precision floating-point data types—typically float32—often making training infeasible on memory-constrained devices. The high-precision activations used between forward and backward propagation commonly constitute the largest proportion of the total memory footprint of a training run [11], [12]. Our understanding of standard BNN training algorithms led us to the following realization: high-precision activations should not be used since we are only concerned with weights and activations’ *signs*. In this article, we present a low-memory BNN training scheme based on this intuition featuring binary activations only, facilitated through batch normalization modification.

By increasing the viability of learning on the edge, this work will reduce the domain mismatch between training and inference—particularly in conjunction with federated learning [13], [14]—and ensure privacy for sensitive applications [15]. Via the aggressive memory footprint reductions they facilitate, our proposals will enable models to be trained without the network access reliance, latency and energy overheads or data divulgence inherent to cloud offloading. Herein, we make the following novel contributions.

- We conduct the first variable representation and lifetime analysis of Courbariaux & Bengio’s standard BNN training process [1]. We use this to identify opportunities for memory savings through approximation.
- Via our proposed BNN-specific forward and backward batch normalization operations, we implement the first

- Erwei Wang is with AMD, Westbrook Centre, Block 7, Milton Road, Cambridge, CB4 1YG, United Kingdom.  
E-mail: erweiw@amd.com
- James J. Davis, Peter Y. K. Cheung, and George A. Constantinides are with the Circuits and Systems research group, Department of Electrical and Electronic Engineering, Imperial College London, London, SW7 2AZ, United Kingdom.  
E-mail: {james.davis,p.cheung,g.constantinides}@imperial.ac.uk
- Daniele Moro and Piotr Zielinski are with Google, 2015 Stierlin Court, Mountain View, CA 94043, United States.  
E-mail: {danielemoro,zielinski}@google.com
- Jia Jie Lim is with iSize, 107 Cheapside, London, EC2V 6DN, United Kingdom.  
E-mail: jj.lim@ isize.co
- Claudionor Coelho is with Palo Alto Networks, 3000 Tannery Way, Santa Clara, CA 95054, United States.  
E-mail: claudionor.coelho@alumni.stanford.edu
- Satrajit Chatterjee is in Palo Alto, CA, United States.  
E-mail: satrajit@gmail.com

TABLE 1  
Applied approximations, where applicable, used in low-cost neural network training works.

|                             | Weights           |          |           | Activations |                         | Activation gradients | Batch normalization |
|-----------------------------|-------------------|----------|-----------|-------------|-------------------------|----------------------|---------------------|
|                             | Forward           | Backward | gradients | Forward     | Backward                |                      |                     |
| Zhou <i>et al.</i> [5]      | int6 <sup>1</sup> | int6     | int6      | int6        | int6                    | int6                 | x                   |
| Gruslys <i>et al.</i> [6]   | x                 | x        | x         | x           | Recomputed <sup>2</sup> | x                    | x                   |
| Ginsburg <i>et al.</i> [7]  | float16           | float16  | float16   | float16     | float16                 | float16              | x                   |
| Graham <i>et al.</i> [8]    | x                 | x        | x         | int         | int                     | x                    | x                   |
| Bernstein <i>et al.</i> [9] | x                 | x        | bool      | x           | x                       | x                    | x                   |
| Wu <i>et al.</i> [10]       | x                 | x        | x         | x           | x                       | x                    | $\ell_1$            |
| <b>This work</b>            | bool              | float16  | bool      | bool        | bool                    | float16              | BNN-specific        |

<sup>1</sup> Arbitrary precision was supported, but significant accuracy degradation was observed below 6 bits.

<sup>2</sup> Activations were not retained between forward and backward propagation in order to save memory.

neural network training regime featuring all-binary activations. This significantly reduces the greatest constituent of a given training run’s total memory footprint.

- We present the first successful combination of binary activations and binary weight gradients during neural network training. This aggregation allows for further reductions in memory footprint.
- We systematically evaluate the impact of each of our approximations, and provide a detailed characterization of our scheme’s memory requirements *vs* accuracy.
- Against the standard approach, we report memory reductions of up to 5.44 $\times$ , with little to no accuracy or convergence rate degradation, when training BNNs to classify MNIST, CIFAR-10, and SVHN. No hyperparameter tuning is required. We also show that the batch size used can be increased by  $\sim$ 10 $\times$  while remaining within a given memory envelope, and even demonstrate the efficacy of ImageNet training as a hard target.
- We provide an open-source release of our Keras-based training software, memory modeling tool, and Raspberry Pi-targeted prototype for the community to use and build upon<sup>1</sup>. Our memory breakdown analysis represents a clear road map to further, future reductions.

## 2 RELATED WORK

The authors of all published works on BNN inference acceleration to date made use of high-precision floating-point data types during training [1], [16], [17], [18], [19], [20], [21], [22], [23], [24]. There is precedent, however, for the use of quantization when training non-binary networks, as we show in Table 1 via side-by-side comparison of the approximation approaches taken in those works along with those detailed in this article.

The effects of quantizing the gradients of models with high-precision data, either fixed or floating point, have been studied extensively. Zhou *et al.* [5] and Wu *et al.* [25] trained networks with fixed-point weights and activations using fixed-point gradients, reporting no accuracy loss for AlexNet classifying ImageNet with gradients wider than five bits. Wen *et al.* [26] and Bernstein *et al.* [9] focused solely on aggressive weight gradient quantization, aiming to reduce communication costs for distributed learning. Weight gradients were losslessly quantized into ternary and binary formats, respectively, with forward propagation and activation gradients

kept at high precision. In this work, we make the novel observation that BNNs are more robust to approximation during training than higher-precision networks. We thus propose a data representation scheme more aggressive than all of the aforementioned works combined, delivering large memory savings with near-lossless performance.

An intuitive method to lower the memory footprint of training is to simply reduce the batch size. However, doing so generally leads to increased total training time due to reduced memory reuse [11]. The method we propose in this article does not conflict with batch size tuning, and further allows the use of large batches while remaining within the memory limits of edge devices.

Gradient checkpointing—the recomputation of activations during backward propagation—has been proposed as a method to reduce the memory consumption of training [6], [27]. Such methods introduce additional forward passes, however, and so increase each run’s duration and energy cost. Graham [8] and Chakrabarti & Moseley [28] saved memory during training by buffering activations in low-precision formats, achieving comparable accuracy to all-float32 baselines. Wu *et al.* [10] and Hoffer *et al.* [29] reported reduced computational costs via  $\ell_1$  batch normalization. Finally, Helwegen *et al.* [30] asserted that the use of both trainable weights and momenta is superfluous in BNN optimizers, proposing a weightless BNN-specific optimizer, Bop, able to reach the same level of accuracy as Adam. We took inspiration from these works in locating sources of redundancy present in standard BNN training schemes, and propose BNN-specific modifications to  $\ell_1$  batch normalization allowing for activation quantization all the way to binary, thus saving memory without increasing latency.

Recent efforts have shown that, in some circumstances, batch normalization can be completely removed from BNN training. Chen *et al.* replaced the trainable scaling factors and biases within standard  $\ell_2$  batch normalization with hand-tuned values, thereby approximating these functions via trial and error [31]. Our method follows a conventional training approach; no manual, offline steps are required. Jiang *et al.* proposed the use of batch normalization-free BNNs for super-resolution imaging [32]. The information loss incurred from the removal of batch normalization in this case is recovered by expanding the receptive fields of convolution operations using parallel sets of binary dilated convolutions. While Jiang *et al.* demonstrated promising results for super-resolution imaging, we assume a generic deep learning setting rather than focusing on a specific application domain.

<sup>1</sup> <https://github.com/awai54st/Enabling-Binary-Neural-Network-Training-on-the-Edge>



Fig. 1. Standard BNN training graph for fully connected layer  $l$ . “sgn”, “ $\times$ ”, and “BN” are sign, matrix multiplication, and batch normalization operations. Forward propagation dependencies are shown with solid lines; those for backward passes are dashed. High-precision activations must be retained due to the red dependency.

We further present an open-source Raspberry Pi-based prototype to corroborate our memory reduction estimates, making our work closer to real application deployment than both of the aforementioned publications.

The authors of works including Bi-Real Net [19], ResNetE18 [33], and ReActNet [24] discovered that the accuracy of BNNs can be significantly increased via the addition of high-precision skip connections. Many further enhanced BNN performance via improvements to gradient approximation and weight initialization [19], [24], [33], [34], [35]. Optimizations such as these are intended to increase accuracy: a goal orthogonal to ours of efficiently deploying BNNs on edge-scale devices. Nevertheless, we incorporated all of them into our work in order to reach competitive accuracy.

For works such as ReActNet [24], BN-Free [31], BN-Free ISR [32], and Real-to-Binary [35], it was found that knowledge distillation—the employment of a high-precision network as a “teacher” running alongside a BNN—can greatly improve the performance of the latter’s training. This method is, however, outside our scope; the teacher would dominate overall memory requirements and thereby make savings with regards to the BNN insignificant.

### 3 STANDARD TRAINING FLOW

For simplicity of exposition, we assume the use of a multi-layer perceptron (MLP), although the presence of convolutional layers would not change any of the principles that follow. Let  $\mathbf{W}_l$  and  $\mathbf{X}_l$  denote matrices of weights and activations, respectively, in the network’s  $l^{\text{th}}$  layer, with  $\partial\mathbf{W}_l$  and  $\partial\mathbf{X}_l$  being their gradients. For  $\mathbf{W}_l$ , rows and columns span input and output channels, respectively, while for  $\mathbf{X}_l$  they span a batch’s feature maps and their channels. Henceforth, we use  $\hat{\bullet}$  to denote binary encoding.

Fig. 1 shows the training graph of a fully connected binary layer. A detailed description of the standard BNN training procedure introduced by Courbariaux & Bengio [1] for each batch of  $B$  training samples, which we henceforth refer to as a *step*, is provided in Algorithm 1. Therein, “ $\odot$ ” signifies element-wise multiplication. For brevity, we omit some of the intricacies of the baseline implementation—lack of first-layer quantization, use of a final softmax layer, and the inclusion of weight gradient cancellation [1]—as these standard BNN practices are not impacted by our work. We initialize weights as outlined by Glorot & Bengio [36].

Many authors have established that BNNs require batch normalization in order to avoid gradient explosion [37], [38], [39], and our early experiments confirmed this to indeed be the case. We thus apply it as standard. Matrix products  $\mathbf{Y}_l$  are channel-wise batch-normalized across each layer’s  $M_l$  output channels (lines 5–7) to form the subsequent layer’s inputs,  $\mathbf{X}_{l+1}$ .  $\beta$  constitutes the batch normalization biases. Layer-wise moving means  $\mu(\mathbf{y}_l)$  and standard deviations  $\sigma(\mathbf{y}_l)$  are retained for use during backward propagation and inference. We forgo trainable scaling factors; these are irrelevant to BNNs since their activations are binarized thereafter (line 2).

As emphasized in both Fig. 1 and Algorithm 1 (line 12), *high-precision* storage of the entire network’s activations is required. Addressment of this forms our key contribution.

## 4 VARIABLE ANALYSIS

In order to quantify the potential gains from approximation, we conducted a variable representation and lifetime analysis of Algorithm 1 following the approach taken by Sohoni *et al.* [11]. Table 2 lists the properties of all variables in Algorithm 1, with each variable’s contribution to the total footprint shown for a representative example. Variables are divided into two classes: those that must remain in memory between computational phases (forward propagation, backward propagation, and weight update), and those that need not. This is of pertinence since, for those in the latter category, only the largest layer’s contribution counts towards the total memory occupancy. For example,  $\partial\mathbf{X}_l$  is read during the backward propagation of layer  $l - 1$  only, thus  $\partial\mathbf{X}_{l-1}$  can safely overwrite  $\partial\mathbf{X}_l$  for efficiency. Additionally,  $\mathbf{Y}$  and  $\partial\mathbf{X}$  are shown together since they are equally sized and only need to reside in memory during the forward and backward pass for each layer, respectively.

## 5 LOW-COST BNN TRAINING

As shown in Table 2, all variables within the standard BNN training flow use `float32` representation. In the subsections that follow, we detail the application of aggressive approximation specifically tailored to BNN training. Further to this, and in line with the observation by many authors that `float16` can be used for ImageNet training without inducing accuracy loss [7], [40], [41], we also switch all remaining variables to this format. Our final training procedure is captured in Algorithm 2, with modifications from Algorithm 1 in red and the corresponding data representations used shown in Table 2.

### 5.1 Batch Normalization Approximation

Analysis of the backward pass of Algorithm 1 reveals conflicting requirements for the precision of  $\mathbf{X}$ . When computing weight gradients  $\partial\mathbf{W}$  (line 15), only binary activations  $\hat{\mathbf{X}}$  are needed. For the batch normalization training (lines 10–13), however, high-precision  $\mathbf{X}$  is used. The latter occurrences are highlighted with dashed boxes. Per Table 2, the storage of  $\mathbf{X}$  between forward and backward propagation constitutes the single largest portion of the algorithm’s total memory. If we are able to use  $\hat{\mathbf{X}}$  in place of  $\mathbf{X}$  for these operations, there will be no need for this high-precision activation retention, significantly reducing memory footprint as a result. We achieve this as follows.

**Algorithm 1** Standard BNN training step.

```

1: for  $l \leftarrow \{1, \dots, L-1\}$  do            $\triangleright$  Forward prop.
2:    $\hat{\mathbf{X}}_l \leftarrow \text{sgn}(\mathbf{X}_l)$ 
3:    $\hat{\mathbf{W}}_l \leftarrow \text{sgn}(\mathbf{W}_l)$ 
4:    $\hat{\mathbf{Y}}_l \leftarrow \hat{\mathbf{X}}_l \hat{\mathbf{W}}_l$ 
5:   for  $m \leftarrow \{1, \dots, M_l\}$  do            $\triangleright$  Batch norm.
6:      $\psi_l^{(m)} \leftarrow \sigma(\mathbf{y}_l^{(m)})$ 
7:      $\mathbf{x}_{l+1}^{(m)} \leftarrow \frac{\mathbf{y}_l^{(m)} - \mu(\mathbf{y}_l^{(m)})}{\psi_l^{(m)}} + \beta_l^{(m)}$ 
8:
9:   for  $l \leftarrow \{L-1, \dots, 1\}$  do            $\triangleright$  Backward prop.
10:    for  $m \leftarrow \{1, \dots, M_l\}$  do            $\triangleright$  Batch norm.
11:       $\mathbf{v} \leftarrow \frac{1}{\psi_l^{(m)}} \partial \mathbf{x}_{l+1}^{(m)}$ 
12:       $\partial \mathbf{y}_l^{(m)} \leftarrow \mathbf{v} - \mu(\mathbf{v}) - \mu\left(\mathbf{v} \odot \left[\mathbf{x}_{l+1}^{(m)}\right] \left[\mathbf{x}_{l+1}^{(m)}\right]\right)$ 
13:       $\partial \beta_l^{(m)} \leftarrow \sum \partial \mathbf{x}_{l+1}^{(m)}$ 
14:       $\partial \mathbf{X}_l \leftarrow \partial \mathbf{Y}_l \hat{\mathbf{W}}_l^T$ 
15:       $\partial \mathbf{W}_l \leftarrow \hat{\mathbf{X}}_l \partial \mathbf{Y}_l$ 
16:
17:   for  $l \leftarrow \{1, \dots, L-1\}$  do            $\triangleright$  Weight update
18:      $\mathbf{W}_l \leftarrow \text{Optimize}(\mathbf{W}_l, \partial \mathbf{W}_l, \eta)$ 
19:      $\beta_l \leftarrow \text{Optimize}(\beta_l, \partial \beta_l, \eta)$ 
20:    $\eta \leftarrow \text{LearningRateSchedule}(\eta)$ 

```

• denotes binary encoding. Our refinements are shown in red. Dashed boxes highlight Algorithm 2's lack of high-precision activations.

TABLE 2  
Exemplary memory-related properties of variables used during CIFAR-10 training of BinaryNet with Adam and a batch size of 100.

| Variable                                  | Per-layer lifetime <sup>1</sup> | Standard training |                      |        | Proposed training |                      |                              |
|-------------------------------------------|---------------------------------|-------------------|----------------------|--------|-------------------|----------------------|------------------------------|
|                                           |                                 | Data type         | Modeled memory (MiB) | %      | Data type         | Modeled memory (MiB) | $\Delta (\times \downarrow)$ |
| $\mathbf{X}$                              | ✗                               | float32           | 111.33               | 21.71  | bool              | 3.48                 | 32.00                        |
| $\partial \mathbf{X}, \mathbf{Y}^2$       | ✓                               | float32           | 50.00                | 9.75   | float16           | 25.00                | 2.00                         |
| $\mu(\mathbf{y}_l), \sigma(\mathbf{y}_l)$ | ✗                               | float32           | 0.03                 | 0.00   | float16           | 0.02                 | 2.00                         |
| $\partial \mathbf{Y}$                     | ✓                               | float32           | 50.00                | 9.75   | float16           | 25.00                | 2.00                         |
| $\mathbf{W}$                              | ✗                               | float32           | 53.49                | 10.43  | float16           | 26.74                | 2.00                         |
| $\partial \mathbf{W}$                     | ✗                               | float32           | 53.49                | 10.43  | bool              | 1.67                 | 32.00                        |
| $\beta, \partial \beta$                   | ✗                               | float32           | 0.03                 | 0.00   | float16           | 0.02                 | 2.00                         |
| Momenta                                   | ✗                               | float32           | 106.98               | 20.86  | float16           | 53.49                | 2.00                         |
| Pooling masks                             | ✗                               | float32           | 87.46                | 17.06  | bool              | 2.73                 | 32.00                        |
| Total                                     |                                 |                   | 512.81               | 100.00 |                   | 138.15               | 3.71                         |

<sup>1</sup> ✓ indicates that a variable does not need to be retained between forward, backward or update phases.

<sup>2</sup>  $\partial \mathbf{X}$  and  $\mathbf{Y}$  can share memory since they are equally sized and have non-overlapping lifetimes.

**Step 1:  $\ell_1$  Normalization**

Standard batch normalization sees channel-wise  $\ell_2$  normalization performed on each layer's centralized activations. Since batch normalization is immediately followed by binarization in BNNs, however, we argue that less-costly  $\ell_1$  normalization is good enough in this circumstance. Replacement of batch normalization's backward propagation operation with our  $\ell_1$  norm-based version sees lines 11–12 of Algorithm 1 swapped with (1), where  $B$  is the batch size. Not only does our use of  $\ell_1$  batch normalization transform one occurrence of  $\mathbf{x}_{l+1}^{(m)}$  into its binary form, it also beneficially eliminates all squares and square roots.

**Algorithm 2** Proposed BNN training step.

```

1: for  $l \leftarrow \{1, \dots, L-1\}$  do            $\triangleright$  Forward prop.
2:    $\hat{\mathbf{X}}_l \leftarrow \text{sgn}(\mathbf{X}_l)$ 
3:    $\hat{\mathbf{W}}_l \leftarrow \text{sgn}(\mathbf{W}_l)$ 
4:    $\hat{\mathbf{Y}}_l \leftarrow \hat{\mathbf{X}}_l \hat{\mathbf{W}}_l$ 
5:   for  $m \leftarrow \{1, \dots, M_l\}$  do            $\triangleright$  Batch norm.
6:      $\psi_l^{(m)} \leftarrow \|\mathbf{y}_l^{(m)} - \mu(\mathbf{y}_l^{(m)})\|_1 / B$ 
7:      $\mathbf{x}_{l+1}^{(m)} \leftarrow \frac{\mathbf{y}_l^{(m)} - \mu(\mathbf{y}_l^{(m)})}{\psi_l^{(m)}} + \beta_l^{(m)}$ 
8:      $\omega_{l+1}^{(m)} \leftarrow \|\mathbf{x}_{l+1}^{(m)}\|_1 / B$ 
9:   for  $l \leftarrow \{L-1, \dots, 1\}$  do            $\triangleright$  Backward prop.
10:    for  $m \leftarrow \{1, \dots, M_l\}$  do            $\triangleright$  Batch norm.
11:       $\mathbf{v} \leftarrow \frac{1}{\psi_l^{(m)}} \partial \mathbf{x}_{l+1}^{(m)}$ 
12:       $\partial \mathbf{y}_l^{(m)} \leftarrow \mathbf{v} - \mu(\mathbf{v}) - \mu\left(\mathbf{v} \odot \left[\hat{\mathbf{x}}_{l+1}^{(m)} \omega_{l+1}^{(m)}\right] \left[\hat{\mathbf{x}}_{l+1}^{(m)}\right]\right)$ 
13:       $\partial \beta_l^{(m)} \leftarrow \sum \partial \mathbf{x}_{l+1}^{(m)}$ 
14:       $\partial \mathbf{X}_l \leftarrow \partial \mathbf{Y}_l \hat{\mathbf{W}}_l^T$ 
15:       $\partial \mathbf{W}_l \leftarrow \hat{\mathbf{X}}_l \partial \mathbf{Y}_l$ 
16:       $\partial \hat{\mathbf{W}}_l \leftarrow \text{sgn}(\partial \mathbf{W}_l)$ 
17:   for  $l \leftarrow \{1, \dots, L-1\}$  do            $\triangleright$  Weight update
18:      $\mathbf{W}_l \leftarrow \text{Optimize}(\mathbf{W}_l, \partial \hat{\mathbf{W}}_l / \sqrt{M_{l-1}}, \eta)$ 
19:      $\beta_l \leftarrow \text{Optimize}(\beta_l, \partial \beta_l, \eta)$ 
20:    $\eta \leftarrow \text{LearningRateSchedule}(\eta)$ 

```

Our derivation of this function is as follows. Let

$$\mathbf{a} = \mathbf{y}_l - \mu(\mathbf{y}_l)$$

and

$$\mathbf{v} = \frac{1}{\|\mathbf{a}\|} \partial \mathbf{x}_{l+1}.$$

$$\begin{aligned} \mathbf{v} &\leftarrow \frac{1}{\|\mathbf{y}_l^{(m)} - \mu(\mathbf{y}_l^{(m)})\|_1 / B} \partial \mathbf{x}_{l+1}^{(m)} \\ \partial \mathbf{y}_l^{(m)} &\leftarrow \mathbf{v} - \mu(\mathbf{v}) - \mu\left(\mathbf{v} \odot \left[\hat{\mathbf{x}}_{l+1}^{(m)} \omega_{l+1}^{(m)}\right] \left[\hat{\mathbf{x}}_{l+1}^{(m)}\right]\right) \end{aligned} \quad (1)$$

We have

$$\partial \mathbf{a} = \mathbf{v} - \frac{\mathbf{a}}{\|\mathbf{a}\|} \times \frac{1}{\left(\frac{\|\mathbf{a}\|}{B}\right)^2} \times \mu(\mathbf{a} \odot \partial \mathbf{x}_{l+1})$$

and thus

$$\begin{aligned} \partial \mathbf{y}_l &= \partial \mathbf{a} - \mu(\partial \mathbf{a}) \\ &= (\mathbf{v} - \mu(\mathbf{v})) - \\ &\quad \left( \frac{\mathbf{x}_{l+1}}{\frac{\|\mathbf{a}\|}{B} \odot \|\mathbf{a}\|} - \mu \left( \frac{\mathbf{x}_{l+1}}{\frac{\|\mathbf{a}\|}{B} \odot \|\mathbf{a}\|} \right) \right) (\mu(\mathbf{a} \odot \partial \mathbf{x}_{l+1})). \end{aligned}$$

Since the output of batch normalization,  $\mathbf{x}_{l+1}$ , is expected to have a mean value of zero across samples in a batch, *i.e.*,

$$\mu(\mathbf{x}_{l+1}) \approx \mathbf{0},$$

we have

$$\partial \mathbf{y}_l \approx (\mathbf{v} - \mu(\mathbf{v})) - \mu(\mathbf{v} \odot \mathbf{x}_{l+1}) \hat{\mathbf{x}}_{l+1}.$$

#### Step 2: BNN-Specific Approximation

We further replace the remaining  $\mathbf{x}_{l+1}^{(m)}$  term in (1) with the product of its signs and mean magnitude— $\hat{\mathbf{x}}_{l+1}^{(m)} \omega_{l+1}^{(m)}$ —where  $\omega_{l+1}^{(m)}$  is precomputed (line 8).

Our complete batch normalization training functions are shown on lines 10–13 of Algorithm 2. As again highlighted within dashed boxes, these only require the storage of binary  $\hat{\mathbf{X}}$  along with layer- and channel-wise mean magnitudes. With elements of  $\hat{\mathbf{X}}$  now binarized, we reduce its memory cost by  $32\times$  and also save energy thanks to the corresponding memory traffic reduction.

## 5.2 Weight Gradient Quantization

Intuitively, BNNs should be particularly robust to weight gradient binarization since their weights only constitute signs. On line 16 of Algorithm 2, therefore, we quantize and store weight gradients in binary format,  $\partial \hat{\mathbf{W}}$ , for weight update. During that phase, we attenuate the gradients by  $\sqrt{N_l}$ , where  $N_l$  is layer  $l$ 's fan-in, to reduce the learning rate and prevent premature weight clipping as advised by Sari *et al.* [38] (line 18). Since fully connected layers are used as an example in Algorithm 2,  $N_l = M_{l-1}$  in this instance.

Table 2 shows that, with binarization, the portion of our exemplary training run's memory consumption attributable to weight gradients dropped from 53.49 to just 1.67 MiB, leaving the scarce resources available for more quantization-sensitive variables such as  $\mathbf{W}$  and momenta. Energy consumption will also decrease due to the associated reduction in memory traffic.

## 6 EVALUATION

### 6.1 Keras Emulation

We built a GPU-based implementation emulating our BNN training method using Keras and TensorFlow, and experimented with the small-scale MNIST, CIFAR-10, and SVHN datasets, as well as large-scale ImageNet, using a range of network models. Our baseline for comparison was the standard BNN training method introduced by Courbariaux & Bengio [1], and we followed those authors' practice of reporting the highest test accuracy achieved in each run. Note that we did not tune hyperparameters, thus it is likely that higher accuracy than we report is achievable.

#### 6.1.1 Small-Scale Datasets

For MNIST we evaluated using a five-layer MLP—henceforth simply denoted “MLP”—with 256 neurons per hidden layer, and CNV [4] and BinaryNet [1] for both CIFAR-10 and SVHN. We used three popular BNN optimizers: Adam [42], stochastic gradient descent (SGD) with momentum, and Bop [30]. While all three function reliably with our training scheme, we used Adam by default due to its stability. We used the development-based learning rate scheduling approach proposed by Wilson *et al.* [43] with an initial learning rate  $\eta$  of 0.001 for all optimizers except for SGD with momentum, for which we used 0.1. We used batch size  $B = 100$  for all except for Bop, for which we used  $B = 50$  as recommended by Helwegen *et al.* [30]. MNIST and CIFAR-10 were trained for 1000 epochs; SVHN for 200.

Our choice of quantization targets primarily rested on the intuition that BNNs should be more robust to approximation in backward propagation than their higher-precision counterparts. To illustrate that this is indeed the case, we applied our method to both BNNs and `float32` networks, with identical topologies and hyperparameters. Results of those experiments are shown in Table 3, in which significantly higher accuracy degradation was observed for the non-binary networks, as expected.

While our proposed BNN training method does exhibit limited accuracy degradation, as can be seen for three cases in Table 4, this comes in return for a geomean modeled memory saving of  $3.67\times$ . It is also interesting to note that the reduction achievable for a given dataset depends on the model used. This observation is largely orthogonal to our work: by applying our approach to the training of a more appropriately chosen model, one can obtain the advantages of both optimized network selection and training.

In order to explore the impacts of the various facets of our scheme, we applied them sequentially while training BinaryNet to classify CIFAR-10 with multiple optimizers. As shown in Table 5, choices of data type, optimizer, and batch normalization implementation lead to tradeoffs against performance and memory costs. Major savings are attributable to the use of `float16` variables and through the high-precision activation elimination our  $\ell_1$  norm-based batch normalization facilitates.

Fig. 2 shows the modeled memory footprint savings from our proposed BNN training method for different optimizers and batch sizes, again for BinaryNet with the CIFAR-10 dataset. Across all of these, we achieved a geomean reduction of  $4.81\times$ . Also observable from Fig. 2 is that, for all optimizers, movement from the standard to our proposed BNN training allows the batch size used to increase by around  $10\times$ , facilitating faster completion, without a material memory increase. Fig. 2 finally shows that test accuracy does not drop significantly due to our approximations. With Adam and Bop, accuracy was near-identical, while with SGD we actually saw modest improvements.

While not of concern with regards to memory consumption, decreases in convergence rate are undesirable due to their elongation of training times and, consequently, reduction of energy efficiency. In order to ensure that our algorithmic modifications do not cause material convergence rate degradation, we inspected the validation accuracy curves obtained during our training runs. Figs. 3 and 4

TABLE 3

Test accuracy of non-binary networks and BNNs using the standard and our proposed training approaches with Adam and a batch size of 100. Results for our training approach applied to the former are included for reference only; we do not advocate for its use with non-binary networks.

| Model     | Dataset  | Top-1 test accuracy |         |               |                     |                            |                   |                            |
|-----------|----------|---------------------|---------|---------------|---------------------|----------------------------|-------------------|----------------------------|
|           |          | Standard training   |         |               | Reference training  |                            | Proposed training |                            |
|           |          | NN (%) <sup>1</sup> | BNN (%) | $\Delta$ (pp) | NN (%) <sup>1</sup> | $\Delta$ (pp) <sup>2</sup> | BNN (%)           | $\Delta$ (pp) <sup>3</sup> |
| MLP       | MNIST    | 98.22               | 98.24   | 0.02          | 89.98               | -8.24                      | <b>96.90</b>      | -1.34                      |
| CNV       | CIFAR-10 | 86.37               | 82.67   | -3.70         | 69.88               | -16.49                     | <b>83.08</b>      | 0.41                       |
| CNV       | SVHN     | 97.30               | 96.37   | -0.93         | 79.44               | -17.86                     | <b>94.28</b>      | -2.09                      |
| BinaryNet | CIFAR-10 | 88.20               | 88.74   | 1.61          | 76.56               | -11.64                     | <b>89.09</b>      | 0.35                       |
| BinaryNet | SVHN     | 96.54               | 97.40   | 0.86          | 85.71               | -10.83                     | <b>95.93</b>      | -1.47                      |

<sup>1</sup> Non-binary neural network.

<sup>2</sup> Baseline: non-binary network with standard training.

<sup>3</sup> Baseline: BNN with standard training.



Fig. 2. Batch size vs training memory footprint and achieved test accuracy for BinaryNet with CIFAR-10. Annotations show memory reductions for the proposed training approach. Each test accuracy point marks the mean of five independent training runs, with an error bar indicating its distribution.

TABLE 4

Test accuracy and memory footprint of the standard and our proposed training schemes using Adam and a batch size of 100.

| Model<br>(Dataset)      | Top-1 test accuracy |              |                  | Modeled memory |                |                                     |
|-------------------------|---------------------|--------------|------------------|----------------|----------------|-------------------------------------|
|                         | Std.<br>(%)         | Prop.<br>(%) | $\Delta$<br>(pp) | Std.<br>(MiB)  | Prop.<br>(MiB) | $\Delta$<br>( $\times \downarrow$ ) |
| MLP<br>(MNIST)          | 98.24               | <b>96.90</b> | -1.34            | 7.40           | <b>2.65</b>    | <b>2.78</b>                         |
| CNV<br>(CIFAR-10)       | 82.67               | <b>83.08</b> | 0.41             | 134.05         | <b>32.16</b>   | <b>4.17</b>                         |
| CNV<br>(SVHN)           | 96.37               | <b>94.28</b> | -2.09            | 134.05         | <b>32.16</b>   | <b>4.17</b>                         |
| BinaryNet<br>(CIFAR-10) | 88.74               | <b>89.09</b> | 0.35             | 512.81         | <b>138.15</b>  | <b>3.71</b>                         |
| BinaryNet<br>(SVHN)     | 97.40               | <b>95.93</b> | -1.47            | 512.81         | <b>138.15</b>  | <b>3.71</b>                         |

exemplify these for the experiments whose results were reported in Table 4 and Fig. 2, respectively. No discernible change in convergence rate can be seen in any of the plots, thus we can be confident that our proposals will not negatively impact training times.

### 6.1.2 ImageNet

We also trained ResNetE-18 [33] and Bi-Real-18 [19]—mixed-precision models with most convolutional layers binarized—to classify ImageNet. These models are representative of a broad class of ImageNet-capable networks, thus similar results should be achievable for others with which they share architectural features. Finding development-based learning rate scheduling to not work well with ResNetE-18, we resorted to the fixed decay schedule described by Bethge *et al.* [33].  $\eta$  began at 0.016 and decayed by a factor of 10 at epochs 70, 90, and 110. We trained for 120 epochs with  $B = 4096$ . For Bi-Real-18, we trained for 80 epochs with  $B = 512$  and a cosine-decaying learning rate starting from  $\eta = 0.001$ . Both models were optimized using Adam.

We show the performance of these benchmarks when applying each of our proposed approximations in turn, as well their assemblage, in Table 6. Since the Tensor Processing Units we used here natively support `bfloat16` rather than `float16`, we switched to the former for these experiments. Where `bfloat16` variables were used, these were employed across all layers; the remaining approximations were applied only to binary layers. While these savings are smaller than those for our small-scale experiments, we note that the first convolutional layer of both ResNetE-18 and Bi-Real-18 is

TABLE 5

Impacts of moving from the standard to our proposed data representations with BinaryNet and CIFAR-10 and a batch size of 100.

| Optimizer         | Data type             |                       | Batch normalization | Top-1 test accuracy |                            | Modeled memory |                                           |
|-------------------|-----------------------|-----------------------|---------------------|---------------------|----------------------------|----------------|-------------------------------------------|
|                   | $\partial \mathbf{W}$ | $\partial \mathbf{Y}$ |                     | %                   | $\Delta$ (pp) <sup>1</sup> | MiB            | $\Delta (\times \downarrow)$ <sup>1</sup> |
| Adam              | float32               | float32               | $\ell_2$            | 88.74               | –                          | 512.81         | –                                         |
|                   | float16               | float16               | $\ell_2$            | 88.71               | –0.03                      | 256.41         | 2.00                                      |
|                   | bool                  | float16               | $\ell_2$            | 87.93               | –0.81                      | 231.33         | 2.22                                      |
|                   | bool                  | float16               | $\ell_1$            | 89.69               | 0.95                       | 231.33         | 2.22                                      |
|                   | bool                  | float16               | Proposed            | 89.09               | 0.35                       | 138.15         | 3.71                                      |
| SGD with momentum | float32               | float32               | $\ell_2$            | 88.52               | –                          | 459.32         | –                                         |
|                   | float16               | float16               | $\ell_2$            | 88.54               | 0.02                       | 229.66         | 2.00                                      |
|                   | bool                  | float16               | $\ell_2$            | 87.35               | –1.17                      | 204.58         | 2.25                                      |
|                   | bool                  | float16               | $\ell_1$            | 89.09               | 0.57                       | 204.58         | 2.25                                      |
|                   | bool                  | float16               | Proposed            | 88.10               | –0.42                      | 109.20         | 4.21                                      |
| Bop               | float32               | float32               | $\ell_2$            | 91.38               | –                          | 405.83         | –                                         |
|                   | float16               | float16               | $\ell_2$            | 91.36               | –0.02                      | 202.92         | 2.00                                      |
|                   | bool                  | float16               | $\ell_2$            | 90.54               | –0.84                      | 177.84         | 2.28                                      |
|                   | bool                  | float16               | $\ell_1$            | 91.27               | –0.11                      | 177.84         | 2.28                                      |
|                   | bool                  | float16               | Proposed            | 91.48               | 0.10                       | 82.45          | 4.92                                      |

<sup>1</sup> Baseline: float32  $\partial \mathbf{W}$  and  $\partial \mathbf{X}$  with standard ( $\ell_2$ ) batch normalization.



Fig. 3. Achieved validation accuracy over time for the experiments whose results are reported in Table 4.



Fig. 4. Achieved validation accuracy over time for the experiments whose results are reported in Fig. 2.

the largest and is non-binary, thus its activation storage dwarfs that of the remaining layers. We also remark that, while  $\sim 2$  pp accuracy drops may not be acceptable for some application deployments, sizable memory reductions are otherwise achievable. The effects of binarized  $\partial W$  are insignificant since ImageNet’s large images result in proportionally small weight memory occupancy.

We acknowledge that dataset storage requirements likely render ImageNet training on edge platforms infeasible, and that network fine-tuning is a task more commonly deployed on devices of such scale. However, given that the accuracy changes and resource savings we report for more challenging, from-scratch training are favorable and reasonably consistent across a wide range of use-cases, we have confidence that positive results are readily achievable for fine-tuning as well. Nevertheless, our ImageNet proof of concept confirms the

efficacy of large-scale neural network training on the edge.

In common with our small-scale experiments, our proposals did not lead to noticeable convergence rate changes *vs* the standard BNN training algorithm. This is evident from Fig. 5, which contains the validation accuracy curves obtained for the experiments whose results were reported in Table 6.

## 6.2 Embedded Platform Prototypes

To more concretely demonstrate the benefits of our proposed training method, we also wrote software targeting an embedded-scale computing platform. We chose to use a Raspberry Pi 3B+, a popular single-board computer with hardware representative of current mobile and other edge devices, for this purpose. The platform features a four-core, 64-bit Arm Cortex-A53 CPU clocked at 1.4 GHz and 1 GiB

TABLE 6

Test accuracy and memory footprint of the standard and proposed schemes for ImageNet training with Adam and a batch size of 4096.

| Approximations            | ResNetE-18          |                            |                |                              | Bi-Real-18          |                            |                |                              |
|---------------------------|---------------------|----------------------------|----------------|------------------------------|---------------------|----------------------------|----------------|------------------------------|
|                           | Top-1 test accuracy |                            | Modeled memory |                              | Top-1 test accuracy |                            | Modeled memory |                              |
|                           | %                   | $\Delta$ (pp) <sup>1</sup> | GiB            | $\Delta \times \downarrow^1$ | %                   | $\Delta$ (pp) <sup>1</sup> | GiB            | $\Delta \times \downarrow^1$ |
| None                      | 58.77               | –                          | 70.11          | –                            | 56.71               | –                          | 70.11          | –                            |
| All-bfloat16              | 58.85               | 0.08                       | 35.45          | 1.98                         | 56.72               | 0.01                       | 35.45          | 1.98                         |
| bool $\partial W$ only    | 57.59               | –1.28                      | 70.07          | 1.00                         | 55.69               | –1.02                      | 70.07          | 1.00                         |
| $\ell_1$ batch norm. only | 58.34               | –0.43                      | 70.11          | 1.00                         | 56.08               | –0.63                      | 70.11          | 1.00                         |
| Prop. batch norm. only    | 58.23               | –0.54                      | 47.86          | 1.46                         | 55.59               | –1.12                      | 47.86          | 1.46                         |
| <b>Proposed</b>           | <b>57.04</b>        | <b>–1.73</b>               | <b>18.54</b>   | <b>3.78</b>                  | <b>54.45</b>        | <b>–2.26</b>               | <b>18.54</b>   | <b>3.78</b>                  |

<sup>1</sup> Baseline: approximation-free training.



Fig. 5. Achieved validation accuracy over time for the experiments whose results are reported in Table 6.

of LPDDR2 RAM. We used the PyPI `memory_profiler` module and Valgrind to monitor the memory occupancy of Keras- and C++-based implementations, respectively. Energy consumption was logged with an external power meter.

### 6.2.1 Naïve C++ Implementation

While existing training frameworks, including TensorFlow and PyTorch, allow for some data format customization, they lack support for direct control of variable storage. Moreover, when in training mode, they tend to reserve hundreds of MiBs of memory regardless of the model size, making their use infeasible on edge devices. TensorFlow-lite delivers low-memory inference, but it does not support training. Therefore, while these existing frameworks are useful for accuracy evaluation, implementations of our approach that realize its promised memory advantage must be built from scratch. Our first prototypes were direct implementations of Algorithms 1 and 2 in C++. We also trained using Keras, where possible within the Raspberry Pi’s memory limit, for comparison.

Measurements of the peak memory use of our naïve C++ prototypes prove the validity of our memory model. As reflected in Fig. 6, two effects cause the model to produce underestimates. There is a constant, ~5% memory increase across all experiment pairs. This is attributable to process overheads, which we left unmodeled. There is also a second, batch size-correlated overhead due to activation copying between layers. This is significantly more pronounced for the standard algorithm due to its use of `float32`—rather than `bool`—activations. While we did not model these copies



Fig. 6. Batch size vs memory footprint for our naïve C++ prototypes training MLP to classify MNIST with Adam. Annotations mark the ratio between measured and modeled memory pairs.

since they are not strictly necessary, their avoidance would have unbeneficially complicated our software.

Figs. 7(a) and 7(b) show the measured memory footprint *vs* training time for the naïve (standard and proposed) and Keras implementations across a range of batch sizes. For MLP trained to classify MNIST, our naïve implementation saw memory requirements reduce by 2.90–4.54× *vs* the standard approach, with no impact on speed. While use



Fig. 7. Measured peak memory consumption *vs* training time (a)–(b) and energy (c) per batch for implementations training MLP/MNIST and BinaryNet/CIFAR-10. In (a) and (b), each data point represents a distinct batch size. In (c), batch sizes of 200 and 40 were chosen for MLP and BinaryNet, respectively. BinaryNet/CIFAR-10 training with Keras was not possible due to the Raspberry Pi’s memory limit ( $\triangleleft$ ).

of Keras led to much shorter training times, this came at the cost of superproportional memory increases: two orders of magnitude higher than the demands of the proposed approach. Keras-based training of BinaryNet is not possible due to the platform’s 1 GiB memory limit.

### 6.2.2 CBLAS Acceleration

In a bid to close our training time gap with Keras, we optimized our prototypes using the CBLAS library, trading off memory for speed. As shown in Fig. 7(a), this reimplementation led to reductions in training times of an order of magnitude with MLP, making our optimized implementations reach similar speed to Keras. While the CBLAS-accelerated proposed algorithm requires 1.59–2.08× more memory than its naïve counterpart, this comes in return for speedups of 8.60–29.76× while remaining 2.16–2.61× more memory-efficient than the standard approach with acceleration. Our approach with CBLAS bettered Keras’ memory requirements by 27.66–58.34× while experiencing slowdowns of 2.10–3.22×. Experiments with BinaryNet and CIFAR-10 showed similar trends, with the accelerated standard implementation failing to run with a batch size over 40. Note that, due to operating system overheads, it was not possible for the running training program to occupy all of the platform’s memory.

**6.2.2.1 Energy Efficiency:** In addition to memory savings, our use of low-precision activations and gradients also reduces memory traffic, leading to reduced energy consumption. Fig. 7 shows the measured memory footprint *vs* energy consumption per epoch for both MLP with MNIST and BinaryNet with CIFAR-10. For the batch sizes we tested, the CBLAS-accelerated implementation of our proposed training method surpasses the equally optimized standard approach in terms of energy efficiency by 1.02× and 1.18× for those respective network-dataset pairs. We remark that these savings are not significant since the memory traffic-associated energy reductions are partially offset by the costs of `bool`-packing (and -unpacking) operations.

## 7 CONCLUSION

In this article, we introduced the first neural network training scheme tailored specifically to BNNs. Moving first to 16-bit floating-point representation, we selectively and opportunistically approximated beyond this based on careful analysis of the standard training algorithm presented by Courbariaux & Bengio [1]. With a comprehensive evaluation conducted across multiple models, datasets, optimizers, and batch sizes, we showed the generality of our approach and reported significant memory reductions *vs* the prior art, challenging the notion that the resource constraints of edge platforms present insurmountable barriers to on-device learning. We validated the veracity of our claimed savings with Raspberry Pi-targeted prototypes, whose source code we have made openly available for use and further development. In the future, we will explore the potential of our training approximations in the custom hardware domain, within which we expect there to be vast energy-saving opportunity via use of tailor-made arithmetic operators.

## 8 ACKNOWLEDGMENTS

The authors are grateful for the support of the United Kingdom EPSRC (grant numbers EP/P010040/1 and EP/S030069/1). They also wish to thank Sergey Ioffe and Michele Covell for their helpful suggestions.

For the purpose of open access, the authors will apply a Creative Commons Attribution (CC BY) license to any accepted version of this manuscript.

## REFERENCES

- [1] M. Courbariaux and Y. Bengio, "BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1," *arXiv preprint arXiv:1602.02830*, 2016.
- [2] E. Wang, J. J. Davis, R. Zhao, H.-C. Ng, X. Niu, W. Luk, P. Y. K. Cheung, and G. A. Constantinides, "Deep neural network approximation for custom hardware: Where we've been, where we're going," *ACM Computing Surveys*, vol. 52, no. 2, 2019.
- [3] G. A. Constantinides, "Rethinking arithmetic for deep neural networks," *Philosophical Transactions of the Royal Society A*, vol. 378, no. 2166, 2019.
- [4] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. H. W. Leong, M. Jahre, and K. Vissers, "FINN: A framework for fast, scalable binarized neural network inference," in *ACM/SIGDA International Symposium on Field-Programmable Gate Arrays*, 2017.
- [5] S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients," *arXiv preprint arXiv:1606.06160*, 2016.
- [6] A. Gruslys, R. Munos, I. Danihelka, M. Lanctot, and A. Graves, "Memory-efficient backpropagation through time," in *Advances in Neural Information Processing Systems*, 2016.
- [7] B. Ginsburg, S. Nikolaev, and P. Micikevicius, "Training of deep networks with half-precision float," in *Nvidia GPU Technology Conference*, 2017.
- [8] B. Graham, "Low-precision batch-normalized activations," *arXiv preprint arXiv:1702.08231*, 2017.
- [9] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, "SignSGD: Compressed optimisation for non-convex problems," in *International Conference on Machine Learning*, 2018.
- [10] S. Wu, G. Li, L. Deng, L. Liu, D. Wu, Y. Xie, and L. Shi, " $l_1$ -norm batch normalization for efficient training of deep neural networks," *IEEE Transactions on Neural Networks and Learning Systems*, vol. 30, no. 7, 2018.
- [11] N. S. Sohoni, C. R. Aberger, M. Leszczynski, J. Zhang, and C. Ré, "Low-memory neural network training: A technical report," *arXiv preprint arXiv:1904.10631*, 2019.
- [12] H. Cai, C. Gan, L. Zhu, and S. Han, "Tiny Transfer Learning: Towards memory-efficient on-device learning," in *IEEE Conference on Computer Vision and Pattern Recognition*, 2020.
- [13] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in *International Conference on Artificial Intelligence and Statistics*, 2017.
- [14] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konečný, S. Mazzocchi, H. B. McMahan, T. van Overveldt, D. Petrou, D. Ramage, and J. Roslander, "Towards federated learning at scale: System design," in *Conference on Machine Learning and Systems*, 2019.
- [15] N. Agarwal, A. T. Suresh, F. Yu, S. Kumar, and H. B. McMahan, "CpSGD: Communication-efficient and differentially-private distributed SGD," in *International Conference on Neural Information Processing Systems*, 2018.
- [16] M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in *Conference on Neural Information Processing Systems*, 2015.
- [17] X. Lin, C. Zhao, and W. Pan, "Towards accurate binary convolutional neural network," in *Conference on Neural Information Processing Systems*, 2017.
- [18] M. Ghasemzadeh, M. Samragh, and F. Koushanfar, "ReBNet: Residual binarized neural network," in *IEEE International Symposium on Field-Programmable Custom Computing Machines*, 2018.
- [19] Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K.-T. Cheng, "Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm," in *European Conference on Computer Vision*, 2018.
- [20] E. Wang, J. J. Davis, P. Y. K. Cheung, and G. A. Constantinides, "LUTNet: Rethinking inference in FPGA soft logic," in *IEEE International Symposium on Field-Programmable Custom Computing Machines*, 2019.
- [21] ———, "LUTNet: Learning FPGA configurations for highly efficient neural network inference," *IEEE Transactions on Computers*, vol. 69, no. 12, 2020.
- [22] Y. Umuroglu, Y. Akhauri, N. J. Fraser, and M. Blott, "LogicNets: Co-designed neural networks and circuits for extreme-throughput applications," in *International Conference on Field-Programmable Logic and Applications*, 2020.
- [23] X. He, Z. Mo, K. Cheng, W. Xu, Q. Hu, P. Wang, Q. Liu, and J. Cheng, "ProxyBNN: Learning binarized neural networks via proxy matrices," in *European Conference on Computer Vision*, 2020.
- [24] Z. Liu, Z. Shen, M. Savvides, and K.-T. Cheng, "ReActNet: Towards precise binary neural network with generalized activation functions," in *European Conference on Computer Vision*, 2020.
- [25] S. Wu, G. Li, F. Chen, and L. Shi, "Training and inference with integers in deep neural networks," in *International Conference on Learning Representations*, 2018.
- [26] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, "TernGrad: Ternary gradients to reduce communication in distributed deep learning," in *Advances in Neural Information Processing Systems*, 2017.
- [27] T. Chen, B. Xu, C. Zhang, and C. Guestrin, "Training deep nets with sublinear memory cost," *arXiv preprint arXiv:1604.06174*, 2016.
- [28] A. Chakrabarti and B. Moseley, "Backprop with approximate activations for memory-efficient network training," in *Advances in Neural Information Processing Systems*, 2019.
- [29] E. Hoffer, R. Banner, I. Golan, and D. Soudry, "Norm matters: Efficient and accurate normalization schemes in deep networks," in *Advances in Neural Information Processing Systems*, 2018.
- [30] K. Helwegen, J. Widdicombe, L. Geiger, Z. Liu, K.-T. Cheng, and R. Nusselder, "Latent weights do not exist: Rethinking binarized neural network optimization," in *Advances in Neural Information Processing Systems*, 2019.
- [31] T. Chen, Z. Zhang, X. Ouyang, Z. Liu, Z. Shen, and Z. Wang, "“BNN - BN = ?”: Training binary neural networks without batch normalization," in *IEEE Conference on Computer Vision and Pattern Recognition*, 2021.
- [32] X. Jiang, N. Wang, J. Xin, K. Li, X. Yang, and X. Gao, "Training binary neural network without batch normalization for image super-resolution," in *AAAI Conference on Artificial Intelligence*, 2021.
- [33] J. Bethge, H. Yang, M. Bornstein, and C. Meinel, "Back to simplicity: How to train accurate BNNs from scratch?" *arXiv preprint arXiv:1906.08637*, 2019.
- [34] S. Darabi, M. Belbahri, M. Courbariaux, and V. P. Nia. (2018) BNN+: Improved binary network training. [Online]. Available: <https://openreview.net/pdf?id=SJffHg2A5Q>
- [35] B. Martinez, J. Yang, A. Bulat, and G. Tzimiropoulos, "Training binary neural networks with real-to-binary convolutions," in *International Conference on Learning Representations*, 2020.
- [36] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in *International Conference on Artificial Intelligence and Statistics*, 2010.
- [37] M. Alizadeh, J. Fernández-Marqués, N. D. Lane, and Y. Gal, "An empirical study of binary neural networks' optimisation," in *International Conference on Learning Representations*, 2018.
- [38] E. Sari, M. Belbahri, and V. P. Nia, "How does batch normalization help binary training," *arXiv preprint arXiv:1909.09139*, 2019.
- [39] H. Qin, R. Gong, X. Liu, X. Bai, J. Song, and N. Sebe, "Binary neural networks: A survey," *Pattern Recognition*, vol. 105, 2020.
- [40] N. Wang, J. Choi, D. Brand, C.-Y. Chen, and K. Gopalakrishnan, "Training deep neural networks with 8-bit floating point numbers," in *Advances in Neural Information Processing Systems*, 2018.
- [41] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu, "Mixed precision training," in *International Conference on Learning Representations*, 2018.
- [42] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in *International Conference on Learning Representations*, 2015.
- [43] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht, "The marginal value of adaptive gradient methods in machine learning," in *Advances in Neural Information Processing Systems*, 2017.



**Erwei Wang** (S'18, M'21) is a Staff Machine Learning Engineer at AMD. He received a PhD in Electrical and Electronic Engineering from Imperial College London in 2021. His research interests include deep neural networks, computer vision systems, and high-performance computing architectures, with an emphasis on improving speed and energy efficiency for custom hardware implementation. He is a Member of the IEEE and the ACM.



**Claudionor Coelho** is the Vice President/Fellow for Artificial Intelligence – Head of Artificial Intelligence Labs at Palo Alto Networks, leading new research and products on time series analysis, causality, and neural-symbolic computing. Previously, he worked on machine learning and deep learning at Google. He was the Vice President of Software Engineering, Machine Learning, and Deep Learning at NVXL Technology, and has also worked at Synopsys, Cadence Design Systems, and Jasper Design Automation. He is named on

more than 100 papers, patents, and academic and industrial awards. He is currently Invited Professor for Deep Learning at Santa Clara University, and was previously Associate Professor of Computer Science at the Federal University of Minas Gerais (UFMG), Brazil. He has a PhD in Electrical Engineering and Computer Science from Stanford University, an MBA from Ibmec Business School, and an MSCS and BSEE from UFMG.



**James J. Davis** (S'08–M'16) is a Research Fellow in the Department of Electrical and Electronic Engineering's Circuits and Systems group at Imperial College London. He received a PhD in Electrical and Electronic Engineering from Imperial College London in 2016. His research is focussed on the exploitation of FPGA features for cutting-edge applications, driving up performance, energy efficiency, and reliability. Dr Davis serves on the technical program committees of the four top-tier reconfigurable computing conferences (FPGA, FCCM, FPL, and FPT) and is a multi-best paper award recipient. He is a Member of the IEEE and the ACM.



**Satrajit Chatterjee** is an independent researcher. He was previously an Engineering Leader and Researcher at Google, where his research focused on fundamental questions in deep learning and various applications of machine learning. Before Google, he was a Senior Vice President at Two Sigma, where he founded one of the first successful deep learning-based alpha research groups on Wall Street and led a team that built one of the earliest end-to-end FPGA-based trading systems for general-purpose ultra-

low-latency trading. Prior to that, he was a Research Scientist at Intel, where he worked on microarchitectural performance analysis and formal verification for on-chip networks. He completed his undergraduate studies at the Indian Institute of Technology Bombay, has a PhD in Computer Science from the University of California Berkeley, and has published in the top machine learning, design automation, and formal verification conferences.



**Daniele Moro** is a Software Engineer at Google researching the quantization of deep neural networks for constrained application-specific edge inference and training. A Boise State Top Ten Scholar, Kleiner Perkins Engineering Fellow, Carnegie Mellon Robotics Institute Summer Scholar, and Barry Goldwater Scholar, Daniele graduated from Boise State University with a degree in Computer Science and a minor in Applied Mathematics.



**Peter Y. K. Cheung** (M'85–SM'04) is Professor of Digital Systems in the Department of Electrical and Electronic Engineering at Imperial College London. His research interests include VLSI architectures for signal processing, asynchronous systems, reconfigurable computing, and architectural synthesis. Prof. Cheung received a DSc from Imperial College London. He is a Senior Member of the IEEE and Fellow of the IET.



**Piotr Zielinski** is a Software Engineer at Google, currently focused on machine learning fundamentals. Prior to that he was a Quantitative Software Engineer at Two Sigma, working on both high-frequency trading and research automation. He has a master's degree in Theoretical Computer Science from Jagiellonian University in Krakow, Poland.



**George A. Constantinides** (S'96–M'01–SM'08) received the PhD degree from Imperial College London in 2001. Since 2002, he has been with the faculty at Imperial College London, where he is currently Professor of Digital Computation and Associate Dean (Academic Workload) in the Faculty of Engineering. He was the general chair of the ACM International Symposium on Field-Programmable Gate Arrays in 2015. He serves on several program committees and has published over 200 research papers in peer-reviewed journals and international conferences. Prof. Constantinides is a Senior Member of the IEEE and a Fellow of the BCS.



**Jia Jie Lim** received an MSc degree in Future Power Networks from Imperial College London in 2021. Since 2020, he has been involved in computer vision with iSize Technology, London. His research interests include denoising, visual perception, image quality-preserving video bitrate reduction, and three-dimensional avatars.