

# A 19.4-nJ/Decision, 364-K Decisions/s, In-Memory Random Forest Multi-Class Inference Accelerator

Mingu Kang<sup>✉</sup>, Member, IEEE, Sujan K. Gonugondla, Student Member, IEEE,  
Sungmin Lim, and Naresh R. Shanbhag, Fellow, IEEE

**Abstract**—This paper presents an integrated circuit (IC) realization of a random forest (RF) machine learning classifier in a 65-nm CMOS. Algorithm, architecture, and circuits are co-optimized to achieve aggressive energy and delay benefits by taking advantage of the inherent error resiliency derived from the ensemble nature of an RF classifier. Deterministic sub-sampling (DSS) and regularized decision trees reduce interconnect complexity, and avoid irregular memory access patterns and computations, thereby reducing the energy-delay product (EDP). The prototype IC also employs low-swing analog in-memory computations embedded in a standard 6T SRAM to enable massively parallel tree node comparisons, thereby minimizing the memory fetches and reducing the EDP further. The 65-nm CMOS prototype IC achieves a 3.1× and 2.2× improved energy efficiency and throughput leading to 6.8× lower EDP compared to a conventional digital system at the same accuracies of 94% and 97.5% for two tasks: 1) eight-class traffic sign recognition and 2) face detection, respectively.

**Index Terms**—Accelerator, analog processing, in-memory computing, machine learning (ML), random forest (RF).

## I. INTRODUCTION

MAchine learning (ML)-based systems are transforming the way we live and interact with the world around us. In various recognition tasks, such as those in computer vision, machines have begun to exceed human performance [1]. However, the energy and delay costs of ML algorithms are very high and needs to be significantly reduced for the deployment on real-time sensor-rich platforms such as biomedical devices, wearables, autonomous vehicles, Internet-of-things (IoT), and many others. As a result, a number of integrated circuit (IC) implementations of ML kernels and algorithms have been proposed recently [2]–[8] to minimize the energy, delay, and latency of ML systems in silicon.

Many of these IC realizations have focused on efficient implementation of deep neural network (DNN) algorithms due to its state-of-the-art performance in various decision-making

Manuscript received November 25, 2017; revised February 19, 2018 and March 25, 2018; accepted March 26, 2018. Date of publication May 2, 2018; date of current version June 25, 2018. This paper was approved by Guest Editor Shidhartha Das. This work was supported by Systems On Nanoscale Information fabriCs, one of the six SRC STARnet Centers, sponsored by SRC and DARPA. (*Corresponding author: Mingu Kang.*)

The authors are with the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Champaign, IL 61801 USA (e-mail: mkang17@illinois.edu; gonugon2@illinois.edu; sungmin3@illinois.edu; shanbhag@illinois.edu).

Color versions of one or more of the figures in this paper are available online at <http://ieeexplore.ieee.org>.

Digital Object Identifier 10.1109/JSSC.2018.2822703

tasks [9]. However, the high complexity of DNNs with an irregular data flow across multiple layers limits the achievable energy efficiency and throughput making them unsuitable for severely resource-limited embedded platforms. In contrast, the random forest (RF) algorithm [10] is an attractive alternative due to the simplicity of its computations (mainly comparisons), applicability to multi-class problems, and high-accuracy. In addition, the RF algorithm has inherent robustness to non-ideal computations due to its ensemble nature [10]. However, realizing an energy-efficient implementation of the RF algorithm is made challenging due to its high data access rate combined with its irregular data access pattern. There have been digital implementations of the RF algorithm on field-programmable gate array (FPGAs), GPUs, and multi-core processors [11]. However, these fail to take advantage of the inherent tolerance of the algorithm to hardware non-idealities and the opportunities afforded by analog computations.

This paper presents an energy-efficient and high-throughput RF classifier IC in a 65-nm CMOS demonstrating a 6.8× lower energy-delay product (EDP) compared to a conventional 8-b fixed-point digital implementation for an eight-class traffic sign recognition and face detection tasks at the same accuracy of 94% and 97.5%, respectively. Preliminary measured results of this in-memory RF IC were reported in [12]. The energy and throughput improvements are achieved by employing the following.

- 1) Deterministic sub-sampling (DSS) to reduce the interconnect complexity.
- 2) Balanced full decision trees to regularize processing and memory access pattern.
- 3) Deep in-memory architecture (DIMA) [13]–[17] to embed energy-efficient low-swing analog memory readout and computations in the periphery of an SRAM bitcell array (BCA).
- 4) Class ADD generator (CAG) to obtain the tree-level decisions from the outputs of all the tree nodes executed in parallel.

While the previous DIMA chip [16] implemented simple ML algorithms such as the support vector machine (SVM), more than 75% of energy consumption in the RF algorithm comes from operations, which cannot be accelerated by DIMA. Therefore, this paper proposes additional techniques to complement DIMA in order to realize an RF classifier. To the best of our knowledge, this is the first IC implementation of the RF algorithm.



Fig. 1. RF algorithm.

### Summary of Notation

|                                                       |                                                      |
|-------------------------------------------------------|------------------------------------------------------|
| $\mathbf{P}_m = [p_{m,1}, p_{m,2}, \dots, p_{m,N}]$ : | pseudorandom pattern                                 |
| $p_{m,n}$ :                                           | pixel index of $n^{th}$ node in $m^{th}$ tree        |
| RSS:                                                  | random sub-sampling using $\mathbf{P}_m$             |
| $\tau_{m,n}$ :                                        | threshold of the $n^{th}$ node in the $m^{th}$ tree  |
| $x(p_{m,n})$ :                                        | $p_{m,n}^{th}$ pixel of input image $X$              |
| $q_{m,n}$ :                                           | output of the $n^{th}$ node in the $m^{th}$ tree     |
| $c_{m,l}$ :                                           | label of the $l^{th}$ leaf node in the $m^{th}$ tree |
| ( $m: 1\sim M$ , $n: 1\sim N$ , $l: 1\sim N + 1$ )    |                                                      |

This paper is organized as follows. Section II introduces the RF algorithm, its implementation challenges, and provides background on the DIMA. Section III describes the algorithm, circuit, and architecture-level techniques to address the implementation challenges listed in Section II. Design details for the architecture and prototype IC are described in Section IV. Measurement results including task-level accuracy, energy, and delay are presented in Section V. Section VI concludes this paper.

## II. BACKGROUND

This section provides background on the RF algorithm, its implementation challenges, and DIMA.

### A. Random Forest (RF) Algorithm

The RF algorithm (Fig. 1) consists of  $M$  decision trees, each comprising a maximum of  $N$  nodes where  $1 + \log_2(N + 1)$  is the tree depth including the leaf nodes. The  $m^{th}$  tree processes data obtained by random sub-sampling (RSS) the input image  $X$  based on a pseudorandom pattern vector  $\mathbf{P}_m$ , which is

obtained during the training stage. The  $n^{th}$  node in the  $m^{th}$  tree compares the pixel (or feature)  $x(p_{m,n})$  indexed by  $p_{m,n}$ , with a threshold  $\tau_{m,n}$  to obtain a node-level binary decision  $q_{m,n}$ . Either the left or right branch is taken based on  $q_{m,n}$ . This process is repeated until a leaf node is reached. The label  $c_{m,l}$  corresponding to the  $l^{th}$  leaf node is the tree-level decision. The final decision is obtained by majority voting  $M$  such tree-level decisions. Although the RF algorithm is conceptually simple, it has a number of implementation challenges as described in Section II-B.

### B. RF Implementation Challenges

To enable energy-efficient and low-latency RF system, the following implementation challenges need to be addressed:

- 1) the complex crossbar (CB) problem;
- 2) irregular trees;
- 3) parallelizing comparisons;
- 4) lookup table (LUT) inefficiency.

1) *Complex Crossbar Problem*: The RSS operation requires a complex CB (e.g.,  $K:1$ , where  $K > 256$  is the number of pixels in  $X$ ) to route a specific pixel of  $X$  to the corresponding tree node, i.e., to generate  $x(p_{m,n})$  in Fig. 1.

2) *Irregular Trees*: Each tree can have different shapes and different number of nodes (e.g., up to  $2^{(N-1)}$  possible shapes with  $N \geq 31$ ), as shown in Fig. 1. The  $q_{m,n}$ 's from each tree node is employed as the address to an LUT to obtain  $c_{m,l}$ . In this case, the irregular tree shapes incur different processing delays and complex LUT logic (e.g., requires do not care logic for nonexistent tree nodes).

3) *Parallelizing Comparisons*: Low-latency requires computing many tree nodes in parallel including the nodes on the non-selected paths. This requires highly parallel comparisons by fetching  $2N$  bytes data and  $N$  comparisons per tree (e.g.,  $N \geq 31$ ) via the limited bit widths of SRAM IO ( $B_{IO}$ ) and the bus ( $B_{BUS}$ ) (e.g.,  $B_{IO} = B_{BUS} = 64$ -b per SRAM bank) for 64 to a few hundred trees.

4) *LUT Inefficiency*: A brute-force implementation of the LUT to compute the labels  $c_{m,l}$  from tree node decisions  $q_{m,n}$  requires fetching the LUT contents of  $2^N$  cases (e.g.,  $N \geq 31$ ) each consisting of an index and corresponding label, which needs to be processed in an additional hardware.

### C. Deep In-Memory Architecture (DIMA)

DIMA addresses the high energy and latency costs of data movement between processor and memory by embedding analog computations deeply into the periphery of the memory core, i.e., the BCA [13]–[17]. DIMA has four sequentially executed processing stages.

- 1) *Functional Read (FR)*: Fetch the stored data from multiple rows in the column of BCA at a time to generate linearly weighted sum of the binary data as an analog voltage of the bitline (BL), i.e., digital-to-analog conversion.
- 2) *BL Processing (BLP)*: Compute word-level arithmetic operations, e.g., addition, subtraction, or multiplication, by processing the BL voltage level generated via



Fig. 2. Proposed RF algorithm with DSS and regularized decision trees.

FR in the column pitch-matched analog processors in parallel.

- 3) *Cross BL Processing (CBLP)*: Aggregates multiple BLP outputs via charge sharing to obtain a scalar output.
- 4) *Thresholding (TH)*: Generates the final classifier decision.

Silicon prototypes of DIMA have demonstrated significant energy and throughput benefits, e.g., up to  $56\times$  in energy-delay product (EDP) and up to  $9.7\times$  energy reduction [16] over their fixed-function digital counterparts due to its low-swing analog processing and multi-row reads per BL precharge. In this paper, we employ DIMA to efficiently compare  $x(p_{m,n})$  with  $\tau_{m,n}$ .

### III. PROPOSED SOLUTIONS TO THE RF IMPLEMENTATION CHALLENGES

This section proposes algorithm, architecture, and the circuit-level solutions to address the following implementation challenges: 1) irregular trees, 2) the complex CB problem, 3) parallelizing comparisons, and 4) LUT inefficiency described in Section II-B. We propose a modified RF algorithm (in Fig. 2) to address 1) and 2) by using regularized trees and DSS, respectively. We propose the use of DIMA for implementation of tree node comparisons to address 3) and a CAG to address 4). The required operations are summarized in Table I.

#### A. Regularized Tree

The decision trees are extended to form a balanced tree with  $N$  nodes (Fig. 2) by introducing filler nodes in order to regularize the memory access and processing patterns.

TABLE I  
REQUIRED OPERATIONS PER TREE FOR THE PROPOSED (CONVENTIONAL) ARCHITECTURE WITH  $N = 31$  AND  $K = 256$

| OPs per tree  | Memory access |           |           | Compare      | crossbar            |
|---------------|---------------|-----------|-----------|--------------|---------------------|
|               | Data          | $p_{m,n}$ | $c_{m,l}$ | $\tau_{m,n}$ |                     |
| Bit precision | 6(8)          | 3(3)      | 8(8)      | 8(8)         | $x(p_{m,n})$        |
| Size (Bytes)  | 23.3(31)      | 0.5(16)   | 0(31)     | 1            | Mux ratio           |
| # of OPs*     | 3(4)          | 0.5(2)    | 0(4)      | 31(31)       | 64 : 1<br>(256 : 1) |
|               |               |           |           |              | 1(1)                |

\*8 bytes per SRAM access



Fig. 3. DSS ratio versus simulated misclassification rate ( $1 - P_{DET}$ ) with test data set including 700 images.

Note that tree regularization does not incur any additional delay overhead or tree node hardware. This is because any RF architecture has to accommodate all possible tree shapes including the worst case scenario of having to implement  $N$  nodes and incurring the worst case delay of  $\log_2(N + 1)$  node comparisons. Furthermore, a regularized tree does not require an additional pointer address for the child node. Though one could choose to disable the filler nodes selectively to minimize computations in the regularized tree, doing so requires an additional data bit to indicate a null node and an additional memory access. Post-layout simulations show that a 1-b memory access requires  $14\times$  higher energy than a comparison operation. Thus, we allow all the nodes to be computed including filler nodes. The comparison result  $q_{m,n}$  of the filler nodes do not affect the classification results as both left and right paths end up with identical labels. Therefore, for every filler node, we assign those values of  $x(p_{m,n})$  and  $\tau_{m,n}$  for which  $|x(p_{m,n}) - \tau_{m,n}|$  is maximized. Doing so minimizes the probability of a metastability event in the analog comparator as described in Section III-C.

#### B. Deterministic Sub-Sampling (DSS)

The modified RF algorithm (Fig. 2) employs a fixed pattern DSS step prior to RSS to solve the problem of requiring a complex CB. Fig. 3 shows that sub-sampling up to a ratio of 4:1 does not affect the misclassification rate due to the highly correlated pixel values. Therefore, 4:1 DSS ratio is chosen to minimize the CB complexity without degrading  $P_{DET}$ .



Fig. 4. In-memory comparison of  $T = \sum_{i=0}^{B-1} 2^i t_i$  with  $X = \sum_{i=0}^{B-1} 2^i x_i$  for bit precision  $B = 4$ .

The optimal DSS ratio depends on the image resolution. The complexity of the subsequent RSS CB is reduced from 256:1 to 64:1 via DSS when the input  $X$  is  $16 \times 16$  image ( $K = 256$ ). This results in the additional benefit of reducing the precision of  $p_{m,n}$  from 8- to 6 b.

### C. DIMA-Based Comparison

DIMA-based tree processing addresses the requirement of highly parallel comparisons (Fig. 4), and thus eliminates the need to explicitly fetch the thresholds  $\tau_{m,n}$ . In-memory comparison requires  $B$ -b thresholds  $\tau_{m,n}$  ( $T$  in Fig. 4) and the indexed pixels  $x(p_{m,n})$  ( $X$  in Fig. 4) to be stored in a column-major format, i.e., bits of a word are stored in a column. The comparison begins with the simultaneous application of WL access pulses with binary-weighted pulse widths to all the rows storing  $T$  and  $\bar{X}$ . This process is referred to as FR. Here, the pulse width is proportional to the bit position. Doing so generates a BL voltage (discharge) swing  $\Delta V_{BL}$  proportional to  $(X - T)$  given by

$$\begin{aligned} \Delta V_{BL}(T, \bar{X}) &= \frac{V_{PRE} W_0}{R_{BL} C_{BL}} \sum_{i=0}^{B-1} 2^i (\bar{t}_i + x_i) \\ &= \Delta V_{lsb}(X - T - 1) \end{aligned} \quad (1)$$

where  $V_{PRE}$  is precharged BL voltage,  $W_0$  is the LSB pulse width,  $R_{BL}$  is the resistance of BL discharge path comprising the access and pull-down transistors of the enabled bitcells, and  $\Delta V_{lsb} = (V_{PRE} W_0 / R_{BL} C_{BL})$ . The bias “-1” is generated by expressing negative numbers in 2’s complement representation [13], [16]. Due to the complementary nature of SRAM bitcell,  $\Delta V_{BLB}$  is also proportional to  $(T - X)$ . Next, BLs



Fig. 5. Sub-ranged read [16]. (a) Column pair Implementation. (b) Equivalent capacitance model.



Fig. 6. Replica BCA operation [16]. (a) Bitcell. (b) Write access timing diagram.

and BLBs are fed into analog comparators [18] to generate node-level decisions  $(q_{m,n})$  for all columns in parallel.

Integral non-linearity (INL) of FR is improved up to 65% by sub-ranged read (Fig. 5), where  $B/2$ -b MSBs and  $B/2$ -b LSBs are read from adjacent columns (column pair) followed by a capacitively weighted charge sharing step that assigns  $16\times$  greater weight to the MSBs [16]. The  $(1/16)C_{BL}$  in Fig. 5(b) is implemented by the parasitic capacitance of the sub-range switch and a tunable capacitor to calibrate the accuracy [16]. The WL voltage  $V_{WL} < 0.65$  V to prevent destructive read and further improve linearity. The replica BCA [16] (Fig. 6) stores  $\bar{X}$  via a separate write BL (WBL) and wordline (WWL) to avoid full-swing toggling of high-capacitance BL, which is required during the SRAM write operation.

In-memory comparison is a massively parallel operation as it fetches and processes  $B/2$ -b per column per read access bypassing the column mux, whereas the conventional memory fetches only one bit per  $L$  columns per access, where  $L$  is the column muxing ratio with typical values from 4 to 16. The FR also saves precharge energy by accessing  $B/2$ -b per BL precharge, whereas conventional read fetches single bit through the column mux per  $L$  BL precharges.

### D. Class Address (ADD) Generator

The LUT inefficiency problem is solved by employing a CAG, which converts the result of comparisons  $q_{m,n}$ s into a memory address for the chosen label  $c_{m,l}$ . This conversion is achieved by generating the memory address as  $f(m) + g(q_{m,1 \sim N})$ , where  $f(\ )$  provides an offset address



Fig. 7. CAG function  $g()$ , where red tree nodes are an example of the selected path, and CAG logic generates the BCA column address corresponding to the chosen label,  $c_{31}$ .

and  $g()$  specifies the address. The functions  $f()$  and  $g()$  need to guarantee one-to-one mapping from  $(m, q_{m,n})$  to memory address. For example, in this paper,  $f()$  generates the row address  $= 12 \lfloor (m-1)/4 \rfloor + 8 + \text{mod}(\lfloor (m-1)/2 \rfloor, 2)$  and decides either the left or right half of BCA based on  $\text{mod}(m+1, 2)$ . Function  $g()$  specifies the column address via simple Boolean logic with  $q_{m,n}$ s, as shown in Fig. 7. Therefore, the only label of the chosen path (rather than the index and label of every path) is fetched.

These proposed techniques result in only 24.8 bytes/tree of data needing to be fetched as compared to 79 bytes/tree in the conventional parallel architecture, as summarized in Table I.

#### IV. PROTOTYPE RANDOM FOREST IC

This section describes the design of the prototype RF IC architecture and its implementation in a 65-nm CMOS process.

##### A. Architecture and Timing

The prototype IC architecture (Fig. 8) includes a digital controller (CTRL) and a CORE consisting of a  $512 \times 256$  SRAM BCA, peripherals for standard read/write operations, 64-b I/O ( $B_{IO} = 64$ ) with a 4:1 ( $L = 4$ ) column mux, FR WL drivers, 4:1 DSS input buffer to store the streamed 256-byte  $X(K = 256)$ , RSS CBs, CAG, label finder, and a majority voter.

The timing diagrams in Fig. 9 shows that a group of four trees are processed in parallel requiring 171 clockcycles, and  $M/4$  such groups are processed sequentially for a total of  $M \leq 168$  trees. In the beginning, four groups of 4:1 sub-sampled 64 pixel words ( $X$ ) are stored in the four DSS input buffers. First, the pixel indices  $p_{m,n}$ s are fetched from the BCA into index registers (IREG) through 12 normal SRAM read accesses. Next, a main CTRL enables four RSS CBs that places  $B = 8$ -b pixels  $x(p_{m,n})$ s into the RSS registers (RSREG) according to index  $p_{m,n}$  stored in IREGs. Next, the  $x(p_{m,n})$ s are written into the replica BCA for in-memory comparison with thresholds  $\tau_{m,n}$ . The FR WL drivers apply binary-weighted pulse width modulation (PWM) pulses to WLS simultaneously, and the discharged BLs and BLBs are fed to analog comparators (Fig. 4). The in-memory comparison generates 128 comparison outputs  $q_{m,n}$ s per



Fig. 8. Proposed RF architecture (IREG: pixel index register, RSS: RSS register, COMP: analog comparator, and CB: crossbar).



Fig. 9. Timing diagram showing the number of required clock cycles per stage.

precharge cycle at the output of the 128 pitch-matched analog comparators in parallel. The CTRL fetches four tree-level labels  $c_{m,l}$ s from the BCA via normal read operation using the address generated by CAG using the in-memory comparison outputs  $q_{m,n}$ . Finally, the majority voter makes a final decision based on the  $M$  tree-level labels  $c_{m,l}$  after processing  $M/4$  such groups.

##### B. Prototype IC Implementation

The prototype IC (Fig. 10) is fabricated in a 65-nm CMOS process and packaged in 88-pin QFN, as summarized



Fig. 10. Micrograph of the in-memory RF classifier chip.

TABLE II  
CHIP SUMMARY

|                                                 |                                       |
|-------------------------------------------------|---------------------------------------|
| Technology                                      | 65 nm CMOS                            |
| Die Size                                        | 1.2 mm × 1.2 mm                       |
| SRAM Capacity                                   | 16 kB (512 × 256 bitcells)            |
| Bitcell dimension                               | 2.11 μm × 0.92 μm                     |
| CTRL operating freq                             | 1 GHz                                 |
| Supply voltage (V)                              | CTRL: 0.75<br>CORE: 1                 |
| Energy per decision (pJ)<br>(4 trees, 64 trees) | CTRL: (0.3, 5.0)<br>CORE: (0.9, 14.4) |
| Decision throughput<br>(decisions/s)            | 4 trees: 5.6 M<br>64 trees: 364.4 K   |

in Table II. The logic blocks including DSS input buffer, IREG, CB, and RSREG occupy 25% of area, whereas less than 10% of area is occupied by the additional circuitry for in-memory comparison such as analog comparators and replica BCA. The CTRL operates with 1-GHz clock to provide fine time resolution for control signals. In this implementation, all the control signals, even for standard read/write operations, are synced with clock. On the other hand, the multi-row WL driver in Fig. 8 includes a pulse generator, whose output pulselength is set by the configuration code, which alters the number of inverter-based delay cells.

The 256-byte input image  $X$  is provided serially into the input buffer, and the configuration word defines the operations such as number of trees ( $M$ ) to be processed. The final 3-b decision, to support a maximum of eight classes, is fetched through the serial output port.

## V. MEASURED RESULTS

This section describes the measured results from the prototype IC and evaluates its energy, delay, accuracy benefits, and robustness with respect to the conventional system, which employs the same architecture as shown in Fig. 8, but with a 256:1 RSS CB and no DSS, using a digital comparator,



| Task                          | Class | Dataset                                                                                        | Image Size                        |
|-------------------------------|-------|------------------------------------------------------------------------------------------------|-----------------------------------|
| Traffic sign recognition [19] | 8     | KUL Belgium traffic sign dataset<br>- Train: 148 images per class<br>- Test: 200 random images | Resized 16×16 pixels (gray-scale) |
| Face detection [20]           | 2     | MIT CBCL dataset<br>- Train: 2000 images per class<br>- Test: 200 random images                |                                   |

Fig. 11. Data sets used in measurements.

which is an 8-b subtractor with a sign detector, instead of in-memory comparison, and a digital LUT logic to store all the labels  $c_{m,l}$ s. The digital comparators processes eight comparisons at a time with a 64-b output (eight  $\tau_{m,n}$ s) from the SRAM while the next eight words are fetched. The energy and delay of the conventional architecture are estimated from: 1) measurements of the normal SRAM read access of the prototype IC and 2) post-layout simulations of synthesized digital comparators and a 256:1 CB. The required operations (memory access, comparison, and enabling CB) of the conventional system are calculated using Table I. Layouts of the digital comparator and the 256:1 CB are matched to the horizontal dimension of the SRAM BCA to align well with the SRAM IOs. The area of the synthesized 256:1 RSS CB is found to be four times larger than that of the proposed CB due to its higher complexity, whereas the area of the digital comparators is similar to the proposed analog processors (replica BCA and analog comparators). Therefore, the conventional system occupies approximately 1.8× more area than that of the proposed system.

### A. Application Mapping and Classifier Training

The proposed architecture was tested on two data sets (Fig. 11): 1) the KUL Belgium traffic sign data set [19] for eight-class traffic sign recognition task and 2) the MIT CBCL data set [20] for face detection task. The input pixels and the threshold values  $\tau_{m,n}$  are represented in 8-b fixed point as this precision provides approximately the same accuracy as floating point [9], [21], [22]. During off-chip training, 148 training images per class are used for traffic sign recognition, whereas 2000 training images per class are used for face detection.

The maximum supported tree depth is chosen to be six, which is considered optimal for the target application [11]. To evaluate the impact of the number of trees on accuracy, two different cases of  $M = 4$  and  $M = 64$  are tested.

The classification accuracy ( $P_{DET}$ ) is measured by streaming 200 randomly chosen images and counting the correct decisions.



Fig. 12. Measured accuracy of in-memory subtraction from four dies where  $T_{\text{MSB}}$  and  $X_{\text{MSB}}$  are the decimal representations of the top four MSBs of the threshold and input pixel, respectively.



Fig. 13. Measured accuracy of DIMA comparisons with all possible combinations of  $(X_{\text{MSB}}, T_{\text{MSB}})$  with  $\Delta V_{\text{lsb}} = 25 \text{ mV}$ . Each data-point is obtained by averaging 256 measurements over 256 different locations of the BCA.

### B. Component-Level Accuracy

Fig. 12 shows that the in-memory subtraction, which is realized as part of the FR process, achieves an INL < 1.85 LSB in the range of  $-15 \leq T_{\text{MSB}} - X_{\text{MSB}} \leq 15$ .

Deviation in  $V_{\text{BL}}$  is < 25 mV over different combinations of  $T_{\text{MSB}}$  and  $X_{\text{MSB}}$ . These variations are induced by circuit non-idealities including inaccurate ratio of PWM pulse widths, asymmetry of the replica BCA and BL voltage dependence of the discharge path resistance. Spatial transistor threshold voltage ( $V_t$ ) variations caused by random dopant fluctuations lead to increased comparator errors as shown in Fig. 13, where the in-memory comparator accuracy ranges from 100% to 50% (when  $T_{\text{MSB}} \approx X_{\text{MSB}}$ ). The asymmetry of the replica BCA is due to the use of single-ended WBL as shown in Fig. 6, which results in an asymmetric discharge current between BL and BLB. This asymmetry leads to an increase in the comparator error rate for large values of  $X_{\text{MSB}}$  and  $T_{\text{MSB}}$ , e.g.,  $X_{\text{MSB}} > 10$  and  $T_{\text{MSB}} > 10$ .



Fig. 14. Measured comparator error rate. Here  $\Delta V_{\text{lsb}}^*(M)$  is the minimum  $\Delta V_{\text{lsb}}$  to avoid accuracy degradation when using  $M$  trees. The  $\Delta V_{\text{lsb}}$  is controlled by changing the voltage level of WL enabling signal to affect  $R_{\text{BL}}$  in (1). The  $\Delta V_{\text{lsb}}$  is estimated from (1) by measuring  $\Delta V_{\text{BL}}$  with  $X = 15$  and  $T = 0$  in the test mode.



Fig. 15. Energy versus misclassification rate with regard to  $\Delta V_{\text{lsb}}$  for face detection with  $M = 64$  trees, where  $\Delta V_{\text{BL}} = 8\Delta V_{\text{lsb}}$  during normal SRAM read in order to achieve zero bit-error rate at default configuration ( $\Delta V_{\text{BL}} = 200 \text{ mV}$  and  $\Delta V_{\text{lsb}} = 25 \text{ mV}$ ).

Fig. 14 shows that the measured in-memory comparison error rate increases from 1.6% to 14.5% due to the increased impact of process variations as  $\Delta V_{\text{lsb}}$  [see (1)] reduces from 25 to 5 mV. Comparator errors were measured at each  $\Delta V_{\text{lsb}}$  by counting the errors during the classification with the KUL Belgium data set.

The impact of in-memory comparison errors on the misclassification rate was studied in Section V-C. System simulations indicate that a comparison error rate of less than 9.5% will result in an indiscernible eight-class classification accuracy loss, whereas  $M = 4$  trees can tolerate a comparison error rate of only 4%. Thus, Fig. 14 shows that the minimum  $\Delta V_{\text{lsb}}$  ( $\Delta V_{\text{lsb}}^*(M)$ ) required by an  $M$ -tree RF architecture to avoid accuracy degradation is  $\Delta V_{\text{lsb}}^*(4) = 15 \text{ mV}$  and  $\Delta V_{\text{lsb}}^*(64) = 8 \text{ mV}$ . This analysis indicates that  $\Delta V_{\text{lsb}}$  should be assigned based on the number of trees ( $M$ ).

### C. Task-Level Accuracy, Energy, and Throughput

Fig. 15 shows that the proposed IC achieves a maximum of  $3.1 \times$  energy savings over the conventional architecture for



(a)



(b)

Fig. 16. Task-level accuracy versus  $\Delta V_{lsb}$  for different number of trees ( $M$ ). (a) Traffic sign recognition. (b) Face detection.

the same misclassification rate of 6%. Fig. 16 shows that the inherent error tolerance of the RF algorithm improves with the number of trees ( $M$ ). The RF algorithm with  $M = 64$  trees can operate at  $\Delta V_{lsb}$  of 15 mV without any loss in accuracy as compared to an ideal fixed-point implementation (Fig. 16), while the RF with four trees observes degradation as soon as  $\Delta V_{lsb} < 20$  mV for both data sets. This requirement is 5 ~ 7 mV greater than that predicted by Fig. 14 where only errors from FR and analog comparators were taken into account. This discrepancy can be attributed to the presence of additional error sources in other stages such as pixel read and label read, where conventional SRAM read may also become erroneous due to  $\Delta V_{lsb}$  scaling. It is expected that the energy efficiency will improve in highly complex real-world tasks with a few hundred trees as it will enable further reductions in  $\Delta V_{lsb}$ . Binary face detection is less sensitive to  $\Delta V_{lsb}$  reduction than eight-class traffic sign recognition as expected. This result indicates that the  $\Delta V_{lsb}$  can be systematically scaled based on the number of classes, the number of trees  $M$ , and the target classification accuracy.

Fig. 17 shows the energy versus accuracy tradeoff with respect to two parameters: 1) number of trees ( $M$ ) and 2) BL voltage swing  $\Delta V_{lsb}$ . The decision energy can be saved by reducing either  $M$  or  $\Delta V_{lsb}$ . However, the  $P_{DET}$  degrades



Fig. 17. Energy versus task-level accuracy for traffic sign recognition with different number of trees ( $M$ ) at  $\Delta V_{lsb} = 5$  to 25 mV.



Fig. 18. Measured misclassification rates of multiple chip instances for traffic sign recognition with different number of trees ( $M$ ) at  $\Delta V_{lsb} = 25$  mV.

more gracefully when the energy is reduced by reducing  $M$ , e.g.,  $P_{DET}$  drops by 10% achieving 14.5 $\times$  energy savings with  $M$ , whereas  $P_{DET}$  degrades by 68% for 1.7 $\times$  energy reduction with  $\Delta V_{lsb}$ . Therefore, classification with smaller  $M$  always achieves better energy efficiency for a fixed classification accuracy  $P_{DET}$ , though the maximum achievable  $P_{DET}$  also reduces.

To observe the impact of process variations,  $P_{DET}$  is measured by testing five dies. Fig. 18 shows minor differences in  $P_{DET}$ , e.g., <6% and <2% with  $M = 4$  and 64, respectively. This result indicates that the process variations are well compensated by the inherent error tolerance of ensemble classification in RF algorithm.

Fig. 19 shows the energy and delay breakdowns of the proposed IC as compared to the conventional architecture. The breakdowns are obtained from post-layout simulations as it is difficult to measure the energy and delay of each component from the measurement of prototype IC. The energy and delay are reduced by 3.1 $\times$  and 2.2 $\times$ , respectively, thereby providing an overall 6.8 $\times$  lower EDP. Fig. 19 indicates that DSS, in-memory comparison, CAG contribute 47%, 37%, and 16% of the total energy savings, and 18%, 58%, and 24% of the total delay reduction, respectively, thereby indicating the relative contributions of DIMA and non-DIMA techniques.

TABLE III  
COMPARISON OF ENERGY EFFICIENCY AND THROUGHPUT WITH PRIOR ART

| Prior art  | Process    | Algorithm         | Power (mW) | Layout area (mm <sup>2</sup> ) | Decision throughput (decisions/s) ① | Decision energy (nJ/decision) ② | # of scalar node computations /decision ③* | Node energy (pJ) ④=②/③ | Decision EDP (f J·s /decision) ⑤=②/① | Node EDP (10 <sup>-17</sup> J·s) ⑤/③ |
|------------|------------|-------------------|------------|--------------------------------|-------------------------------------|---------------------------------|--------------------------------------------|------------------------|--------------------------------------|--------------------------------------|
| [23]       | 65 nm CMOS | Vocabulary tree   | 5.6        | 6.3                            | 30                                  | 186.7K                          | 8.4M                                       | 22                     | 6.2G                                 | 74K                                  |
| [24]       | 65 nm CMOS | Vocabulary forest | 27.6       | 2.3                            | 60                                  | 460K                            | 320M                                       | 1.4                    | 7.7G                                 | 2.4K                                 |
| This work† | 65 nm CMOS | Random forest     | 7.1        | 1.0                            | 364.4K                              | 19.4                            | 1984                                       | 9.8                    | 53.2                                 | 2.7                                  |

\* (# of trees/decision) × (# of tree nodes/tree) × (dimension/tree node), e.g., 64 × 31 × 1 in this work.

† at  $\Delta V_{lsb} = 15$  mV with  $M = 64$  trees and misclassification rate of 94%.



Fig. 19. Energy and delay breakdown at  $\Delta V_{lsb} = 25$  mV ( $\Delta V_{BL} = 200$  mV) obtained via post-layout simulations.

Table III shows that the prototype IC achieves a throughput of 364 K decisions/s and energy efficiency of 19.4 nJ/decision including the energy from CTRL. To the best of our knowledge, this is the first IC prototype of the RF algorithm. Therefore, we compare our work with other tree-based classifier ICs [23], [24]. Due to the vast difference in the data sets, tree complexity, and node architecture, we focus on node-level metrics (last three columns in Table III). The proposed RF classifier achieves node-level EDP that is three to four orders-of-magnitude lower than [23] and [24].

## VI. CONCLUSION AND FUTURE WORK

This paper has presented an IC realization of RF algorithm to achieve energy-efficient and high throughput by co-optimizing algorithm, architecture, and circuit design. As a result, the prototype IC achieves a 3.1× energy savings and 2.2× speed-up providing a 6.8× lower EDP at the same accuracy of >93% compared to conventional digital architecture.

The benefits of the proposed architecture are expected to increase with large-scale applications. This is because high resolution images have increased pixel correlation, which allows higher DSS ratio without degrading classification accuracy. The random noise components of in-memory computation also get averaged out better with higher tree numbers so that a lower voltage swing or a smaller technology node can be employed.

On the other hand, complex applications require processing a large number of tall trees. This is not an issue, as it is straightforward to parallelize the proposed RF architecture by employing multiple banks and exploiting the independence between tree executions. It is also expected that the regular RF structure provides strong scalability and reconfigurability features, e.g., stacking trees (one tree on top and  $N + 1$  trees at bottom) doubles the tree height and cascading the 4:1 DSS blocks generates a 16:1 DSS ratio. These features make the proposed RF classifier suitable for resource-constrained applications such as IoT devices to sustain the always-ON functionality, whereas DNN will be suited for higher accuracy applications.

## ACKNOWLEDGMENT

The authors would like to thank S. Eilert, K. Curewitz, N. Verma, B. Murmann, and P. Hanumolu, for their constructive discussions.

## REFERENCES

- [1] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in *Proc. IEEE Conf. Comput. Vis. Pattern Recognit.*, Jun. 2016, pp. 770–778.
- [2] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," *IEEE J. Solid-State Circuits*, vol. 52, no. 1, pp. 127–138, Jan. 2017.
- [3] H. Kaul *et al.*, "A 21.5 M-query-vectors/s 3.37 nJ/vector reconfigurable k-nearest-neighbor accelerator with adaptive precision in 14 nm tri-gate CMOS," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Jan. 2016, pp. 260–261.
- [4] S. Park, K. Bong, D. Shin, J. Lee, S. Choi, and H.-J. Yoo, "A 1.93 TOPS/W scalable deep learning/inference processor with tetra-parallel SIMD architecture for big-data applications," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2015, pp. 1–3.
- [5] M. Price, J. Glass, and A. P. Chandrakasan, "A scalable speech recognizer with deep-neural-network acoustic models and voice-activated power gating," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2017, pp. 244–245.
- [6] P. N. Whatmough, S. K. Lee, H. Lee, S. Rama, D. Brooks, and G.-Y. Wei, "A 28 nm SoC with a 1.2 GHz 568nJ/prediction sparse deep-neural-network engine with > 0.1 timing error rate tolerance for IoT applications," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2017, pp. 242–243.
- [7] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, "ENVISION: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28 nm FD-SOI," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2017, pp. 246–247.

- [8] K. Bong, S. Choi, C. Kim, S. Kang, Y. Kim, and H.-J. Yoo, "A 0.62 mw ultra-low-power convolutional-neural-network face-recognition processor and a CIS integrated with always-on Haar-like face detector," in *IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2017, pp. 248–249.
- [9] D. Silver *et al.*, "Mastering the game of Go with deep neural networks and tree search," *Nature*, vol. 529, pp. 484–489, Jan. 2016.
- [10] L. Breiman, "Random forests," *Mach. Learn.*, vol. 45, no. 1, pp. 5–32, 2001.
- [11] B. Van Essen, C. Macaraeg, M. Gokhale, and R. Prenger, "Accelerating a random forest classifier: Multi-core, GP-GPU, or FPGA?" in *Proc. IEEE Annu. Int. Symp. Field-Program. Custom Comput. Mach. (FCCM)*, Apr. 2012, pp. 232–239.
- [12] M. Kang, S. K. Gonugondla, and N. R. Shanbhag, "A 19.4 nJ/decision 364K decisions/s in-memory random forest classifier in 6T SRAM array," in *Proc. IEEE Eur. Solid-State Circuits Conf. (ESSCIRC)*, Sep. 2017, pp. 263–266.
- [13] M. Kang, M.-S. Keel, N. R. Shanbhag, S. Eilert, and K. Curewitz, "An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM," in *Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP)*, May 2014, pp. 8326–8330.
- [14] N. Shanbhag, M. Kang, and M.-S. Keel, "Compute memory," U.S. Patent 9,697,877 B2, Jul. 4, 2017.
- [15] J. Zhang, Z. Wang, and N. Verma, "In-memory computation of a machine-learning classifier in a standard 6T SRAM array," *IEEE J. Solid-State Circuits*, vol. 52, no. 4, pp. 915–924, Apr. 2017.
- [16] M. Kang, S. K. Gonugondla, A. Patil, and N. R. Shanbhag, "A multi-functional in-memory inference processor using a standard 6T SRAM array," *IEEE J. Solid-State Circuits*, vol. 53, no. 2, pp. 642–655, Feb. 2018.
- [17] S. K. Gonugondla, M. Kang, and N. Shanbhag, "A 42pJ/decision 3.12 TOPS/W robust in-memory machine learning classifier with on-chip training," in *Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers*, Feb. 2018, pp. 490–492.
- [18] T. Kobayashi, K. Nogami, T. Shirotori, and Y. Fujimoto, "A current-controlled latch sense amplifier and a static power-saving input buffer for low-power architecture," *IEICE Trans. Electron.*, vol. 76, no. 5, pp. 863–867, May 1993.
- [19] V. A. Prisacariu, R. Timofte, K. Zimmermann, I. Reid, and L. Van Gool, "Integrating object detection with 3D tracking towards a better driver assistance system," in *Proc. IEEE Int. Conf. Pattern Recognit. (ICPR)*, Aug. 2010, pp. 3344–3347.
- [20] (2000). *Center for Biological & Computational Learning (CBCL) at MIT*. [Online]. Available: <http://poggio-lab.mit.edu/codedatasets>
- [21] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in *Proc. 32nd Int. Conf. Mach. Learn. (ICML)*, 2015, pp. 1737–1746.
- [22] C. Sakr, Y. Kim, and N. Shanbhag, "Analytical guarantees on numerical precision of deep neural networks," in *Proc. Int. Conf. Mach. Learn.*, 2017, pp. 3007–3016.
- [23] T.-W. Chen, Y.-C. Su, K.-Y. Huang, Y.-M. Tsai, S.-Y. Chien, and L.-G. Chen, "Visual vocabulary processor based on binary tree architecture for real-time object recognition in full-HD resolution," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 20, no. 12, pp. 2329–2332, Dec. 2012.
- [24] K. J. Lee, G. Kim, J. Park, and H. J. Yoo, "A vocabulary forest object matching processor with 2.07 M-vector/s throughput and 13.3 nJ/vector per-vector energy for full-HD 60 fps video object recognition," *IEEE J. Solid-State Circuits*, vol. 50, no. 4, pp. 1059–1069, Apr. 2015.



**Mingu Kang** (M'13) received the B.S. and M.S. degrees in electrical and electronic engineering from Yonsei University, Seoul, South Korea, in 2007 and 2009, respectively, and the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana-Champaign, Champaign, IL, USA, in 2017.

From 2009 to 2012, he was with the Memory Division, Samsung Electronics, Hwaseong, South Korea, where he was involved in the circuit and architecture design of phase change memory (PRAM). Since

2017, he has been with the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA, where he designs machine learning (ML) accelerator architecture. His current research interests include low-power integrated circuits, architecture, and system for ML, signal processing, and neuromorphic computing.



**Sujan K. Gonugondla** (S'16) received the B.Tech. and M.Tech. degrees in electrical engineering from IIT Madras, Chennai, India, in 2014. He is currently pursuing the Ph.D. degree in electrical and computer engineering with the University of Illinois at Urbana-Champaign (UIUC), Champaign, IL, USA.

His current research interests include low-power integrated circuits specifically algorithm hardware co-design for machine learning systems on resource-constrained environments.

Sujan received the Dr. Ok Kyun Kim Fellowship 2018–2019 from the ECE Department, UIUC, and was a recipient of the ADI Outstanding Student Designer Award 2018.



**Sungmin Lim** received the B.S. and M.S. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 2014 and 2016, respectively. He is currently pursuing the Ph.D. degree in electrical and computer engineering with the University of Illinois at Urbana-Champaign, Champaign, IL, USA.

His current research interests include the design of energy-efficient architectures and implementation on integrated circuits for machine learning in resource-constrained platforms.



**Naresh R. Shanbhag** (F'06) received the Ph.D. degree in electrical engineering from the University of Minnesota, Minneapolis, MN, USA, in 1993.

From 1993 to 1995, he was with the AT&T Bell Laboratories, Murray Hill, NJ, USA, where he led the design of high-speed transceiver chip-sets for very high-speed digital subscriber line. In 1995, he joined the University of Illinois at Urbana-Champaign, Champaign, IL, USA. He has held visiting faculty appointments at the National Taiwan University, Taipei, Taiwan, in 2007, and at Stanford University, Stanford, CA, USA, in 2014. He is currently the Jack Kilby Professor of electrical and computer engineering with the University of Illinois at Urbana-Champaign. He has authored or co-authored over 200 publications in his research area and holds 13 U.S. patents. His current research interests include the design of energy-efficient integrated circuits and systems for communications, signal processing, and machine learning.

Dr. Shanbhag was a recipient of the National Science Foundation CAREER Award in 1996, the IEEE Circuits and Systems Society Distinguished Lecturer in 1997, the 2010 Richard Newton GSRC Industrial Impact Award, 826 and multiple Best Paper Awards. In 2000, he co-founded and served as the Chief Technology Officer of Intersymbol Communications, Inc., (acquired in 2007 by Finisar Corporation). From 2013 to 2017, he was the Founding Director of the Systems On Nanoscale Information fabriCs Center.