

# User Manual of DNN+NeuroSim Framework V1.4

Developers: Xiaochen Peng, Shanshi Huang, Anni Lu, Junmo Lee, and James Read

PI: Prof. Shimeng Yu, Georgia Institute of Technology

August 1st, 2023

---

## Index

|     |                                                     |    |
|-----|-----------------------------------------------------|----|
| 1.  | Introduction .....                                  | 1  |
| 2.  | New Feature Highlights in Version 1.4.....          | 2  |
| 3.  | System Requirements (Linux) .....                   | 3  |
| 4.  | Installation and Usage (Linux) .....                | 3  |
| 5.  | Chip Level Architectures.....                       | 5  |
| 5.1 | Interconnect: H-Tree .....                          | 5  |
| 5.2 | Floorplan of Neural Networks.....                   | 7  |
| 5.3 | Weight Mapping Methods .....                        | 8  |
| 5.4 | Pipeline System .....                               | 9  |
| 6.  | Circuit Level: Synaptic Array Architectures .....   | 10 |
| 6.1 | Parallel Synaptic Array Architectures .....         | 11 |
| 6.2 | Array Peripheral Circuits.....                      | 13 |
| 7.  | Device Level: Technology File .....                 | 22 |
| 8.  | Algorithm Level: PyTorch Wrapper.....               | 24 |
| 9.  | Algorithm Level: Inference Accuracy Estimation..... | 25 |
| 10. | How to run <i>DNN +NeuroSim</i> .....               | 28 |
| 11. | Reference.....                                      | 31 |

---

## 1. Introduction

*DNN+NeuroSim* is an integrated framework, which is developed in C++ and wrapped by Pytorch, to emulate the deep neural networks (DNN) inference performance (in V1.0-V1.4) or on-chip training (in V2.0-V2.1) performance on the hardware accelerator based on near-memory computing or in-memory computing architectures. Various device technologies are supported, including SRAM, emerging non-volatile memory (eNVM) based on resistance switching (e.g. RRAM, PCM, STT-MRAM), and

ferroelectric FET (FeFET). SRAM is by nature 1-bit per cell, eNVMs and FeFET in this simulator could support either 1-bit or multi-bit per cell. *NeuroSim* [1] is a circuit-level macro model for benchmarking neuro-inspired architectures (including memory array, peripheral logic, and interconnect routing) in terms of circuit-level performance metrics, such as chip area, latency, dynamic energy and leakage power. With Pytorch wrapper, *DNN +NeuroSim* framework can support hierarchical organization from the device level (transistors from 130 nm down to 7 nm, eNVM and FeFET device properties) to the circuit level (periphery circuit modules such as analog-to-digital converters, ADCs), to chip level (tiles of processing-elements built up by multiple sub-arrays, and global interconnect and buffer) and then to the algorithm level (different convolutional neural network topologies), enabling instruction-accurate evaluation on the inference accuracy as well as the circuit-level performance metrics at the run-time of inference.

The target users for this simulator are circuit/architecture designers who wish to quickly estimate the system-level performance with different network and hardware configurations (e.g. device technology choices, sequential read-out or parallel read-out, etc.). Different from our earlier released simulators (*MLP+NeuroSim* [2]), where the network was fixed to a 2-layer MLP and executed purely in C++ (consumes long run-time), this *DNN+NeuroSim* framework is an integrated simulator with Pytorch wrapper (i.e. C++ wrapped by python). With the wrapper, users can define various network structures, precisions of synaptic weights and neural activations, which guarantee efficient inference running with the popular machine learning platforms. Meanwhile, the wrapper will automatically save the real traces (synaptic weights and neural activations) during inference and send them to *NeuroSim* for real-time and real-traced hardware estimation. In this released version, three networks (VGG-8 network for CIFAR-10 dataset, DenseNet-40 network for CIFAR-10 dataset, ResNet-18 network for ImageNet dataset) are provided as default models in the wrapper, with 8-bit synaptic weights and neural activations, while users could modify the precisions and neural network topologies. The hardware parameters (such as technology nodes, memory cell properties, operation modes, and so on) will be defined under *NeuroSim* in **Param.cpp**.

---

## 2. New Feature Highlights in Version 1.4.

Key changes in this released V1.4 are summarized as follows.

### 1) Technology update down to 1 nm node

In NeuroSim 1.4, the technology node for the main array (SRAM, RRAM, etc.) and peripheral circuit design (switchmatrix, interconnect, ADC) are supported down to 1 nm node. The device/standard cell/interconnect parameters are updated to capture the industry trend and IRDS 2021-2022 projections.

### 2) Partial parallel mode support

Another key feature of NeuroSim 1.4 is the introduction of partial parallel mode for both PyTorch wrapper and hardware estimation. In previous NeuroSim versions, only fully parallel mode and sequential mode were supported. In practical CIM chip operations, fully parallel MAC operations can result in computation errors caused by various factors. These include sensing errors arising from limited ADC precision, high SNR (Signal-to-Noise Ratio), non-linearity of the MAC output, and the limited on/off ratio of memory cells. These errors can accumulate throughout the neural network layers, leading to a degradation of inference accuracy. To improve the inference accuracy, rows are allowed to be partially enabled and the PPA estimation equations are modified accordingly.

At this time, we only support partial parallel mode for single-level memory cells (i.e. 1b/cell) in the python wrapper. Partial parallel mode is fully supported in the hardware estimations if the users wish to estimate the hardware performance for multi-level cells without including the inference accuracy. We plan to fully support partial parallel mode for multi-level cells in the python wrapper in the next update.

### 3) **Modification of PPA estimation for few circuit modules**

To improve generalizability of the PPA estimation for various subarray sizes and ADC settings, PPA estimation code for flash ADC has been updated from the previous version (V1.3). Detailed explanation on the changes can be found in Section 6.2.

---

## 3. System Requirements (Linux)

The tool is expected to run in Linux with required system dependencies installed. These include GCC, GNU make, GNU C libraries (glibc). We have tested the compatibility of the tool with a few different Linux environments, such as (1) Red Hat 7.8 (Maipo), gcc v4.8.5, glibc v2.17, (2) Ubuntu 16.04, gcc v5.5.0, glibc v2.23, and they are all workable.

※ The tool may not run correctly (stuck forever) if compiled with gcc 4.5 or below, because some C++11 features are not well supported.

---

## 4. Installation and Usage (Linux + Anaconda)

**1. Download Anaconda/Miniconda:** <https://docs.conda.io/en/latest/miniconda.html>

**2. Follow install instructions:** <https://docs.conda.io/en/latest/miniconda.html#installing>

**3. Get the tool from GitHub**

```
git clone https://github.com/neurosim/DNN_NeuroSim_V1.4.git  
cd DNN_NeuroSim_V1.4
```

**4. Create a conda environment**

```
conda create --name neurosim
```

**5. Download and install PyTorch packages**

```
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
```

## **6. Activate neurosim environment**

conda activate neurosim

## **7. Pick a network architecture. The following have been pre-trained and provided with NeuroSim.**

1. VGG8 on cifar10

8-bit "WAGE" mode pretrained model is uploaded to './log/VGG8.pth'

2. DenseNet40 on cifar10

8-bit "WAGE" mode pretrained model is uploaded to './log/DenseNet40.pth'

3. ResNet18 on imagenet

"FP" mode pretrained model is loaded from '<https://download.pytorch.org/models/resnet18-5c106cde.pth>'

## **8. (Optional) Train the network to get the model for inference**

### **Step 9: Compile the NeuroSim C++ code**

cd Inference\_pytorch/NeuroSIM

make

### **Step 10: Run Pytorch/Tensorflow wrapper (integrated with NeuroSim). The following are some examples with arguments.**

cd ..

python inference.py --dataset cifar10 --model VGG8 --mode WAGE --inference 1 --cellBit 1 --subArray 128 --parallelRead 64

python inference.py --dataset cifar10 --model DenseNet40 --mode WAGE --inference 1 --cellBit 2 --ADCprecision 6

python inference.py --dataset imagenet --model ResNet18 --mode FP --inference 1 --onoffratio 100

A summary of the useful commands is provided below. It is recommended to execute these commands under the tool's directory.

| Command    | Description                                                                   |
|------------|-------------------------------------------------------------------------------|
| make       | Compile the <i>NeuroSim</i> codes and build the “main” program                |
| make clean | Clean up the directory by removing the object files and the “main” executable |

※ The simulation uses OpenMP for multithreading, and it will use up all the CPU cores by default.

---

## 5. Chip Level Architectures

In this framework, we assume the on-chip memory is sufficient to store synaptic weights of the entire neural network, thus the only off-chip memory access is to fetch in the input data. Fig. 1 shows the modeled chip hierarchy, where the top level of chip consists of multiple tiles, global buffer, accumulation units, activation units (sigmoid or ReLU), and pooling units. Fig. 1 (b) shows the structure of a tile, which contains several processing elements (PEs), tile buffer to load in neural activations, accumulation modules to add up partial sums from PEs and output buffer. Similarly, as Fig. 1 (c) shows, a PE is built up by groups of synaptic sub-arrays, PE buffers, accumulation modules and output buffer. Fig. 1 (d) shows an example of synaptic sub-array, which is based on one-transistor-one-resistor (1T1R) architecture for eNVMs. At sub-array level, the array architecture is different for SRAM or FeFET (not shown in this figure).

### 5.1 Interconnect: H-Tree

To estimate the area, latency, dynamic energy and leakage of the interconnect, we assume the routing



Fig. 1. The diagram of (a) top level of chip architecture, which contains multiple tiles, global buffer, accumulation units, activation units (sigmoid or ReLU) and pooling units; (b) a tile with multiple processing elements (PEs), tile buffer to load in activations, accumulation modules to add up partial sums from PEs and output buffer; (c) a PE contains a group of synaptic arrays, PE buffer and control units, accumulation modules and output buffer; (d) an example of synaptic array based on one-transistor-one-resistor (1T1R) architecture.

among modules in each hierarchy is based on an H-tree structure. According to interconnect engineering, the wire delay could be reduced by introducing repeaters which is used to split the wire into multiple segments. As Fig. 2 shows, a wire could be considered as a group of wire segments and repeaters. To find the optimal length of wire segment between repeaters leading to minimum delay, a VLSI design function



Fig. 2. The diagram of wire with repeaters.



Fig. 3. An example of H-tree for a  $4 \times 4$  computation-unit array.

[5] is introduced in EQ (4.1).  $R$  is the resistance of a minimum-sized repeater,  $C$  is the gate capacitance,  $C_{pinv}$  is the diffusion capacitance,  $R_w$  and  $C_w$  are the unit resistance and capacitance of wire, respectively.

$$L_{optimal} = \sqrt{\frac{2RC(1+p_{inv})}{R_w C_w}} \quad (4.1)$$

The repeater size should use an NMOS transistor width of

$$W = \sqrt{\frac{RC_w}{R_w C}} \quad (4.2)$$

Alternatively, to limit the energy consumption of interconnect, we may find a semi-optimal design option of trade-offs between wire latency and energy. In this framework, we introduce two parameters called “globalBusDelayTolerance” (and “localBusDelayTolerance” for global bus and tile/PE local bus respectively) to find the semi-optimal floorplan of bus with such delay sacrifice. These parameters are defined in **param.cpp**.

Fig. 3 shows an example of H-tree structure for  $4 \times 4$  computation units (either tiles or PEs), where the bus width connected to each unit is assumed to be same. The H-tree is composed of multiple stages (horizontal and vertical) from the widest (main bus) to the narrowest (connected to computation units). The wire length decreases by  $\times 2$  at each stage from wide to narrow ones, while the sum of bus width at each stage is fixed to the width of main bus. We find that when there are many computation units, the H-tree structure becomes inefficient with long detours and limited bus width for each unit. Therefore, we support horizontal x bus and vertical y bus to connect between global buffer and tiles in the 1.4 version. To enable the X-Y bus interconnect, set “globalBusType = false” in **Param.cpp**.

## 5.2 Floorplan of Neural Networks

To map various neural networks according to the defined chip architecture, it is crucial to follow a certain rule which does not violate hardware structure (and data flow) while guarantees high-enough memory utilization. We defined an algorithm to automatically generate the floorplan based on two kinds of weight-mapping methods, which optimize the memory utilization and define the tile size, PE size, number of tiles needed, based on user-defined synaptic array size.

The floorplan starts from tile sizing to PE sizing, while the size of synaptic array is defined by users in **Param.cpp**. With pre-defined network structure and weight mapping method, *NeuroSim* automatically calculate weight-matrix size for each layer (especially for convolutional ones, where 3D kernels will be unrolled to 2D matrices), the tile size firstly is set to a maximum value which could contain the largest weight-matrix among all the layers, then *NeuroSim* calculate the memory utilization (defined as memory mapped by synaptic weights / total memory storage on chip), keep decreasing the tile size till *NeuroSim* find a solution with optimal memory utilization.

To further increase memory utilization and speed up the processing speed of whole network as much as possible, weight duplication is introduced to each layer. Since the layer structure (such as input feature size, channel depth and kernel size) varies significantly in DNNs, which could occupy various amounts of synaptic arrays, it is possible that, the weight of several layers cannot fully fill one PE or even one synaptic array, a naïve way to custom-design the hardware is to mix multiple such small layers into one tile (or even one PE), however, this could make it complicated to define tile/PE size and number of tiles needed, thus, in this framework, we assume one tile is the minimum computation units for each layer, i.e., it is not allowed to map more than one layer into one tile, but there could be multiple tiles to map one single layer.

Hence, similarly, *NeuroSim* will continue to decide the PE size and possibilities of weight duplication among PEs, with pre-defined tile size as discussed above. For example, if the weight-matrix of a specific layer is smaller than the tile size (which means the tile cannot be fully filled by one weight-matrix), it is possible to duplicate the weight-matrix and fetch in multiple neural activation vectors, thus to speed up the process of this layer. In this step, *NeuroSim* start the PE design with a maximum PE size which equals to half of the tile size (to guarantee the exist of defined hierarchy), and decide whether to duplicate the weight-matrix and how many times of duplication for each layer, then recalculate the memory utilization with weight duplication factors, keep decreasing the PE size till *NeuroSim* find the optimal solution with highest memory utilization.

Finally, weight duplication could be further utilized inside PE, i.e. duplicate weight among synaptic arrays, in the similar way as PE design, the only difference is the synaptic array size if fixed. With these three stage floorplans, *NeuroSim* could guarantee high-enough memory utilization, meanwhile optimize the inference process speed.

Table I shows the overall memory utilization of the floorplan algorithm of AlexNet, VGG-16 and ResNet-34, based on the two supported mapping methods for ImageNet dataset, as well as the VGG-8 network for CIFAR-10 dataset. The results were based on assumption that one memory cell is sufficient to map one synaptic weight (i.e. an 8-bit cell to map an 8-bit synapse), and synaptic array size is  $128 \times 128$ . With various hardware configuration (such as two 4-bit memory cells form one 8-bit synaptic weight), the memory utilization could be slightly different.

Table I Memory Utilization

| Network          | Conventional Mapping | Novel Mapping |
|------------------|----------------------|---------------|
| VGG-8 (CIFAR-10) | 90.65%               | 98.73%        |
| AlexNet          | 98%                  | 97%           |
| VGG-16           | 98.79%               | 99.24%        |
| ResNet-34        | 85.88%               | 90.13%        |

### 5.3 Weight Mapping Methods

We support two mapping methods in this framework, conventional mapping and novel mapping method which was proposed in [6]. Fig. 4 shows the example of conventional mapping for one convolutional layer, where each 3D kernel (weight) is unrolled into a long column, since the partial sums in each 3D will be summed up to get the final output. Thus, the total kernels in each convolutional layer will form a group of such long columns, i.e., a large weight matrix.

To get the output feature maps (OFMs), as Fig. 4 shows, at first cycle, a part of input feature maps (IFMs) (shown in dark blue cube) will be multiplied with each 3D kernels. If we assume a single OFM has size of  $W \times W$ , with channel depth of  $N$ , there are  $N$  such OFM in total, we call the front OFM as the first OFM, and the back one as the  $N^{\text{th}}$  OFM. In this way, the sum of dot-products from the first kernel will be the first element in the first OFM, the sum of dot-products from the second kernel will be the first element in the second OFM, and so on, thus, at the first cycle, we could get the first elements in every OFM from front to back (as shown in light green row in size  $1 \times 1 \times N$ ). In the same way, at the second cycle, the kernels will “slide over” the inputs with a stride (equals to one in this example), after the dot-product operation, we will get all the second elements in each OFM. Thus, to generate the total OFMs in layer  $< n >$ , we need to “slide over” the IFMs by  $W \times W$  times, i.e. we need  $W \times W$  cycles to finish the computation.

It should be noted that, in conventional mapping, during the entire operation, a part of the IMFs used in earlier cycle will always be reused at current cycle. Considering about the huge amount of dot-product operations in convolutional layers, these frequent revisiting of input data from upper-level buffers could cause a significant energy and latency waste. Thus, a novel mapping method is introduced to maximize input data reuse.

Fig. 5 shows an example of novel mapping for the same convolutional layer. Instead of unrolling 3D kernels into a large matrix, the weights at different spatial location of each kernel are mapped into



Fig. 4. An example of conventional mapping method of input and weight data.



- IFM size =  $W \times W \times D$ , Kernel size =  $K \times K \times D \times N$ , OFM size =  $W \times W \times N$
- Number of sub-matrix =  $K \times K$

Fig. 5. An example of novel mapping method of input and weight data.

different sub-matrices. According to the spatial location of partitioned kernel data in each kernel, we define which group of these partitioned kernel data should belong to. Hence,  $K \times K$  sub-matrices are needed for the kernels (whose first and second dimension equal to  $K$  and  $K$ ), since each sub-matrix has size  $D \times N$ , the size of total weight matrix will be  $K \times K \times D \times N$ , which equals to the size of unrolled matrix from conventional mapping method (as Fig. 3 shows). Similarly, the input data which should be assigned to various spatial location in each kernel, will be sent to the corresponding sub-matrix, respectively. Partial sums from sub-matrices could be obtained in parallel. Later, an adder tree will be used to sum up the partial sums.

Hence, such group of sub-arrays with the necessary input and output buffers and accumulation modules can be defined as a processing element (PE). The kernels are split into several PEs according to their spatial locations, and assign the input data into corresponding ones, it is possible to reuse the input data among these PEs, i.e., directly transfer input data among PEs which do not need to revisit upper-level buffers.

#### 5.4 Pipeline System

In this framework, we assume all the synaptic weights are mapped on to the inference chip, which means it is possible to build up a pipeline system with acceptable global buffer overhead (to save activations for different images), to improve throughput and energy efficiency (less leakage for idle cycles).

To avoid overhead of complicated control circuits, we assume each layer as one pipeline stage, and the pipeline system clock cycle is defined as the longest latency among all the layers. According to the mapping method of synaptic weights, the total latency of each layer is related to the size of its input feature maps (IFMs) and stride size.

Since the IFMs tend to become deeper but smaller from shallow layers to deeper layers, the speed of deeper layers will be limited by the shallow layers, since they have to wait for the shallow layers to generate the IFMs. During the waiting period, the deeper layers have to stay idle, and thus cause leakage energy.



Fig. 6. Speed-up in Pipeline System.

In this case, we defined a parameter “speedUpDegree” in file “Param.cpp” to speed up each layer, by duplicating the weight and processing different IFMs simultaneously. As Fig. 6 shown below, if the size of IFMs is  $(C_1 * C_1)$ , with 2X speed up, the actual latency will be  $(C_1 * C_1)/4$ . It should be noted that, in this framework, to avoid ultimate speed-up (i.e. ultimate weight duplication), we define a boundary of speed-up degree as the maximum speed-up allowed: there is no idle period across all the layers after speed-up.

## 6. Circuit Level: Synaptic Array Architectures

With various device technologies, the chip could operate in different modes, such as digital sequential (row-by-row) read-out for near-memory computing, or analog parallel read-out for in-memory computing. In the simulator, the parameters of synaptic devices and synaptic array modes will be instantiated in **param.cpp**.



Fig. 7. The diagram of SRAM-based (a) sequential-read-out; (b) parallel-read-out synaptic arrays.



Fig. 8. (a) sequential-read-out and (b) parallel-read-out analog eNVM pseudo-1T1R synaptic arrays; (c) sequential-read-out and (d) parallel-read-out analog FeFET synaptic arrays.

### 6.1 Parallel Synaptic Array Architectures

Fig. 7 and Fig. 8 show three kinds of supported synaptic arrays, which could be used to process analog in-memory computing. Here are some assumptions that apply to all kinds of array architectures below. The higher precision than 1-bit in the input neuron activation is represented by multiple cycles of input voltage signals to the row, and no analog voltage is used to represent the input, thus no digital-to-analog converter (DAC) is used, as the nonlinearity in I-V curve of eNVMS will introduce distortion in parallel read-out [7]. The higher precision than 1-bit in the weight could be represented by a single analog synaptic cell or multiple synaptic cells. For example, an 8-bit weight could be represented by a single 8-bit eNVM cell (assuming it is technologically viable), or 2 eNVM cells (4 bits per cell), or 4 eNVM cells (2 bits per cell), or 8 eNVM binary cells. In our design, the inference is performed in parallel mode by activating all the rows, while the weight update in the training is performed in a row-by-row fashion. It should be noted that as the peripheral ADC size is typically much larger than the column pitch of the array, therefore column sharing is used by the column mux (e.g. 8 columns share one ADC). Although the highest parallelism is achieved with all rows activated, the column currents become high with synaptic resistances parallel connected. It causes large IR drop on wires and column mux and high energy consumption. To balance this issue, partial parallel mode is added to the 1.4 version to enable customized number of rows. The leakage of SRAM is included with the rows in idle as well.

#### 1) SRAM synaptic array

Multiple digital SRAM cells can be grouped along the row to represent one weight with higher precision than 1-bit, as shown in Fig. 7. The weighted sum and weight update operations are similar to the row-by-row read and write operations in conventional SRAM for memory, respectively. In sequential-read-out mode as Fig. 7 (a) shows, to select a row, the WL is activated through the WL decoder. To access all the cells on the selected row, the BLs are pre-charged by the pre-charge and the write driver in weighted sum and weight update, respectively. After the memory data are read by the sense amplifier (S/A), the adder and register are used to accumulate the partial weighted sum in a row-by-row fashion. In parallel-read-out mode as demonstrated in [8], the input vectors will be fetched in via WL switch matrix, the partial-sums will be collected along columns simultaneously at one time with high-precision flash-ADCs based on multilevel S/A by varying references. In both modes, the adders and shift registers are used to shift and accumulate partial sums for multiple cycles of input vectors (which represent MSB to LSB of the analog neural activations).



Fig. 9. Transformation from (a) conventional 1T1R array to (b) pseudo-crossbar array by 90° rotation of BL to enable weighted sum operation.

## 2) Analog eNVM 1T1R synaptic array

Fig. 8 (a) and (b) shows the structure of 1T1R based eNVM array. The WL controls the gate of the transistor, which can be viewed as a switch for the cell. The source line (SL) connects to the source of the transistor. The eNVM cell's top electrode connects to the BL, while its bottom electrode connects to the drain of the transistor through a contact via. In such case, the cell area of 1T1R array is then determined by the transistor size, which is typically  $>6F^2$  depending on the maximum current required to be delivered into the eNVM cell. Larger current needs larger transistor gate width/length (W/L). However, conventional 1T1R array is not able to perform the parallel weighted sum operation. To solve this problem, we modify the conventional 1T1R array by rotating the BLs by 90°, which is known as the pseudo-crossbar array architecture, as shown in Fig. 9 (b). In weighted sum operation, all the transistors will be transparent when all WLs are turned on. Thus, the input vector voltages are provided to the BLs, and the weighted sum currents are read out through SLs in parallel. Then the weighted sum currents are digitalized by a current-mode sense amplifier (S/A), and a Flash-ADC with multilevel S/A by varying references.

## 3) Analog eNVM crossbar array

The crossbar array structure has the most compact and simplest array structure for analog eNVM devices to form a weight matrix, where each eNVM device is located at the cross point of a word line (WL) and a bit line (BL), as shown in Fig. 8 (c). The crossbar array structure can achieve a high integration density of  $4F^2/\text{cell}$  ( $F$  is the lithography feature size). If the input vector is encoded by read voltage signals, the weighted sum operation (matrix-vector multiplication) can be performed in a parallel fashion with the crossbar array. Here, the crossbar array assumes there is an ideal two-terminal selector device connected to each eNVM, which is desired for suppressing the sneak path currents during the row-by-row weight update. It should be noted that ideal selector device is still under research and development.

## 4) Analog FeFET array

As shown in Fig. 8 (c) and (d), the analog FeFET array is in the pseudo-crossbar fashion as proposed in [9], which is similar to the analog eNVM pseudo-crossbar one. It also has an access transistor for each cell to prevent programming on other unselected rows during row-by-row weight update. As FeFET is a three-terminal device, it needs two separate input signals to be fetched to activate WLs and introduce read voltages to RS (read select), respectively, where RS is used to fetch in input vectors as Fig. 10 shown below.



Fig. 10. Operations of (a) write and (b) read in FeFET cell.

## *6.2 Array Peripheral Circuits*

The periphery circuit modules used in the synaptic arrays in Fig.7 and Fig. 8 are described below:

### 1) Level shifter

Level-shifter is normally required for RRAM (or PCM/FeFET) array to support the need of higher write voltage (e.g. >1.5V which is higher than logic VDD). In the simulator, we take a conventional level shifter as shown in Figure.11. If the validation mode is selected, a wiring area factor  $\alpha = 1.44$  will be imposed on this module for calibration.



Fig. 11. Schematic of level shifter.

## 2) Switch matrix

Switch matrices are used for fully parallel voltage input to the array rows or columns. Fig. 12 (a) shows the BL switch matrix for example. It consists of transmission gates that are connected to all the BLs, with control signals ( $B_1$  to  $B_n$ ) of the transmission gates stored in the registers (not shown here). In the weighted sum operation, the input vector signal is loaded to  $B_1$  to  $B_n$ , which decide the BLs to be connected to either the read voltage or ground. In this way, the read voltage that is applied at the input of transmission gates can pass to the BLs and the weighted sums are read out through SLs in parallel. If the input vector is higher than 1 bit, it should be encoded using multiple clock cycles, as shown in Fig 12 (b). The reason why we do not use analog voltage to represent the input vector precision is the I-V nonlinearity of eNVM cell, which will cause the weighted sum distortion or inaccuracy as discussed above. In the simulator, all the switch matrices (`slSwitchMatrix`, `blSwitchMatrix` and `wlSwitchMatrix`) are instantiated from `SwitchMatrix` class in `SwitchMatrix.cpp`, this module is used in parallel-read-out synaptic arrays.



Fig. 12 (a) Transmission gates of the BL switch matrix in the weighted sum operation. A vector of control signals ( $B_1$  to  $B_n$ ) from the registers (not shown here) connect the BLs either to a voltage source or ground. (b) Control signals in a bit stream to represent the precision of the input vector.

### 3) Crossbar WL decoder

The crossbar WL decoder is modified from the traditional WL decoder. It has an additional feature to activate all the WLs for making all the transistors transparent for weighted sum. The crossbar WL decoder is constructed by attaching the follower circuits to every output row of the traditional decoder, as shown in Fig. 13. If  $ALLOPEN=1$ , the crossbar WL decoder will activate all the WLs no matter what input address is given, otherwise it will function as a traditional WL decoder. In the simulator, the crossbar WL decoder contains a traditional WL decoder (**wlDecoder**) instantiated from **RowDecoder** class in **RowDecoder.cpp** and a collection of follower circuits (**wlDecoderOutput**) instantiated from **WLDecoderOutput** class in **WLDecoderOutput.cpp**, this module is used in sequential-read-out synaptic arrays.



Fig. 13 Circuit diagram of the crossbar WL decoder. Follower circuit is attached to every row of the decoder to enable activation of all WLs when  $ALLOPEN=1$ .



Fig. 14 Circuit diagram of (a) decoder follower and (b) WL switch matrix, which are used to control both WLs and BLs simultaneously, for pseudo-1T1R synaptic arrays.

#### 4) Decoder driver

The decoder driver helps provide the voltage bias scheme for the write operation when its decoder selects the cells to be programmed. As the digital eNVM crossbar array has the write voltage bias scheme for both WLs and BLs, it needs the WL decoder driver (**wlDecoderDriver**) and column decoder driver (**colDecoderDriver**). These decoder drivers can be instantiated from **DecoderDriver** class in **DecoderDriver.cpp**, this module is used in sequential-read-out synaptic arrays.

#### 5) New Decoder Driver and Switch Matrix

One should be noticed that, for eNVM pseudo-crossbar and FeFET synaptic arrays, the WLs and BLs/RSs could be controlled by same input signals, but with different voltage values, thus, it could significantly save the area for unnecessary BL/RS switch matrix. To achieve this function, there are several extra control gates to be added into the WL decoder driver circuits, and into the WL switch matrix. Fig. 14 shows the circuit diagram of new decoder driver and switch matrix for eNVM pseudo-1T1R synaptic array, which could be used to control both WL and BL (or RS) at the same time. In Fig. 14 (a), with the input and decoder output, both of WL and BL will be controlled, where the WLs will be either activated or not, and the BLs to be connected to either the read voltage or ground. Similarly, in Fig. 14 (b), the each single WL switch matrix has two extra transmission gates to be used to send two separate voltages into the corresponding WL and BL. In FeFET synaptic arrays, the signals connected to BLs in this example, will be connected to RSs. In the simulator, the **WLNewDecoderDriver** (decoder driver) is instantiated from **WLNewDecoderDriver** class in **NewDecoderDriver.cpp** and the **WLNewSwitchMatrix** (WL switch matrix) is instantiated from **WLNewSwitchMatrix** class in **NewSwitchMatrix.cpp**, these new decoder follower and switch matrix are used in eNVM pseudo-1T1R and FeFET synaptic arrays.

#### 6) Multiplexer (Mux) and Mux decoder

The Multiplexer (Mux) is used for sharing the read periphery circuits among synaptic array columns, because the array cell size is much smaller than the size of read periphery circuits and it will not be area-efficient to put all the read periphery circuits underneath the array. However, sharing the read periphery circuits among synaptic array columns inevitably increases the latency of weighted sum as time multiplexing is needed, which is controlled by the Mux decoder. In the simulator, the Mux (**mux**) is instantiated from **Mux** class in **Mux.cpp** and the Mux decoder (**muxDecoder**) is instantiated from **RowDecoder** class in **RowDecoder.cpp**.

## 7) Analog-to-digital converter (ADC)

To read out the partial-sums and further process them in the subsequent logic modules (such as activation and pooling), ADCs are used at the end of SLs to generate digital outputs. In the simulator, different types of ADC are supported such as successive-approximation-register (SAR) ADC (Fig. 15), and Flash ADC using multilevel voltage-mode sense amplifiers (VSA) or current-mode sense amplifier (CSA) (Fig. 16). The flash ADC design/modeling for NeuroSim 1.4 has been updated from the previous version, to improve the generalizability to different synapse types (RRAM synapse, SRAM synapse) and the flexibility of supporting PPA estimation for various subarray sizes/ADC precisions. Additionally, new sets of flash ADC parameters (for power/latency/area estimation) are incorporated into the code based on the cadence simulation results on the updated ADC circuits (Fig. 17).



Fig. 15 Schematic of successive-approximation-register (SAR) ADC.

A PTM model is used for the device compact model for Cadence simulation. For the technology nodes beyond 14 nm, calibrated PTM models (introduced in Section. 7) to match the on/off current & gate capacitance requirement at each technology node were used. We point out general working principles of ADC circuit operation before detailing the modeling process for ADC in NeuroSim 1.4. In theory (and from the cadence simulation), the ADC latency can have a wide variety of values ranging from the ps level to the tenths of ns level depending on the read voltage ( $V_{read}$ ) of the main array and the closeness between the column resistance and the reference resistance. Generally, the higher the read voltage, the faster development of the voltage difference between the two storage nodes in the latch (part of sense amplifier), which leads to faster ADC sensing. Also, the higher the reference resistance and column resistance difference, the faster the sensing time. (In the case of SRAM mode, the supply voltages for the latch in the SRAM are assumed to be adjusted to  $V_{read}$  during ADC operation, for consistent modeling with the RRAM mode). As a side effect of increasing read voltage, the dynamic energy consumption of the bitline increases due to the increased read voltage, the higher the energy consumption in ADC, as the dynamic energy consumption is proportional to  $C_{bit}V_{dd2}$ .

(a)



(b)



Due to the complexity of modeling the latency for arbitrary read voltage values, we set a fixed read voltage for each technology node based on a certain criterion. For the default implementation provided in NeuroSim 1.4 public version, the read voltage is chosen to satisfy following requirements: (i) ~ 1 ns of 5-bit flash ADC latency is achieved at 22 nm node and (ii) 5-bit ADC latency decreases by 100~300 ps with each successive technology node scaling. The details of each power/performance/area estimation are elaborated in the following paragraphs.



Fig. 18. Layout of the CSA/VSA.



Fig. 19. (a) Worst-case CSA delay (red line) and static power/CSA (purple line) plotted against reference resistance.  
(b) Worst-case VSA delay (red line) and static power/VSA (purple line) plotted against reference resistance.

**Area.** The sense amplifier part of ADC area is calculated based on CSA/VSA layout in Fig. 18. For the sake of simplicity, we ensure the CSA/VSA design have the same layout, so that the same area equations can be used for the ADC area estimation. Reference resistances that are being compared with the column resistances are implemented using dummy columns. Dummy column has the same design as the main array (RRAM-based, SRAM-based ... etc.), but the supply voltages and the stored memory bits are varied flexibly to provide a desired equivalent resistance for the ADC operation. It is also assumed that dummy column can be shared amongst many different sense amplifiers to reduce the area overhead by the reference resistances. The parameter **dumcolshared** in **Param.cpp** determines how many sense amplifiers share one dummy column. The main advantage of such modeling approach is that the area estimation for dummy columns/arrays can be carried out simply using the same area equations as that of the main array.

**Latency.** The ADC latency is estimated as the worst-case CSA/VSA delay given the reference resistance sets, determined by the device on/off resistance and ADC bit precision. The worst-case CSA/VSA delay refers to the propagation delay between SAE and OUT/OUT\_BAR signals (Fig. 17) when there is a 10% difference between the column resistance and the reference resistance. The final ADC delay for clock period (critical delay) estimation is set as the highest value among the worst-case CSA/VSA delays associated with each reference resistance. The ADC delay is the dominant component of the critical delay of the subarray, which are used to determine the clock frequency of the chip. Typical waveforms of several signals during ADC operations and worst-case ADC delay as a function of reference resistance is plotted in Fig. 17 (a)-(b). When SAE signal is turned on and  $V_{read}$  is applied, net 038/014 start to charge toward  $V_{read}$ . Consequently, M17 and M18 gradually become stronger and start to pull down the node voltages of net 055/065 towards ground. Depending on the charging speed difference between the net 038/014, the pulldown strengths of M17 and M18 and thus M7 and M6 differ. If the net 014 charging is faster than net 038, which is the case for the worst-case delay assumption, M18 becomes stronger than M17 and M7 becomes stronger than M6. As a result, some time after “SAE” enable “out\_bar” starts to be pulled down to net 065 while “out” is being pulled up to  $V_{dd}$  in the latch (consisted of M6, M7, M11, and M14). Thus, the final voltage of CSA/VSA output is essentially determined by the charging speed difference of net055/net065, affected by the resistance difference between reference and main array columns.

**Power.** ADC power can be decomposed into two categories: (i) dynamic energy consumption from the subarray column charging/discharging + transistor switching (ii) static energy consumption from the leakage power dissipation in reference/main array cells and transistors. (i) is estimated using the dynamic energy estimation formula using the node capacitance & voltage, consistent with the general modeling equations in other circuit modules. (ii) is estimated from the look up table containing static power/1 sense amplifier as a function of resistance (can either be reference resistance or main array resistance), obtained from the Cadence simulation results. The generated worst-case CSA/VSA delay and static power consumption as a function of reference resistance values are plotted in Fig. 19 (a)-(b).

**Analysis of the latency/power trend.** It is worth noting some trends observed from Fig. 19 (a)-(b). It is expected that as the reference resistance values become higher, the ADC delay will also increase since the RC delay along the reference/main array column will increase. The increased RC delay slows down the development of the voltage difference between net 038/014 (and net 055/056), delaying the development of the final voltage in the latch of the sense amplifier. The expected tendency is well reflected in the plotted results in Fig. 19 (a)-(b). However, it is also observed that the ADC delay also increases when the reference resistance drops below a certain threshold. Such a counterintuitive trend observed from both CSA/VSA can be explained by the fact that the time required to develop a sufficient voltage difference between net038/014 increases when both reference/main array column charge faster beyond a certain threshold. Node voltages at net 038/014 reach close to  $V_{read}$  shortly after SAE enable when reference resistance/main array resistance values are sufficiently low, due to the low RC delay along column. Although the currents flowing through drain terminals of M17 & M18 are large, it takes longer time to develop final latch voltage as currents on reference/main array sides have small difference. Thus, the bottleneck of the ADC delay switches from the RC delay to charge the net 038/014 to the voltage difference development delay as the reference resistance constantly decreases towards 0.

According to Fig. 19 (a)-(b), the static power consumption from VSA is significantly less than that in CSA. This is because the ref\_voltage in Fig. 16 (b), which sets the equivalent on-resistance M33 and M34 can be set to very low values in simulation without affecting the circuit functionality (For NeuroSim 1.4, ref\_voltage is set to 0.05 V). The equivalent on resistance of M33 and M34 make the



Fig. 20. Classification accuracy of CIFAR-10 for an 8-bit CNN as a function of the ADC precision for partial sums. Non-linear quantization was used for a sub-array size of  $128 \times 128$ .

current flow through the drain terminals of M33 and M34 (which correspond to the reference/main array column current) negligibly small, resulting in order of magnitude less static power consumption in VSA. Due to the negligible power consumption in VSA, we set the static power consumption of VSA to 0 in NeuroSim 1.4.

To determine the ADC latency/power at different read voltages other than the default values, it is advised that users conduct cadence simulation under desired read voltages based on the circuits in Fig. 16 (a)-(b) to modify the latency/power table in **MultilevelSenseAmp.cpp**.

To read out the partial-sums in parallel modes, it requires ADC with high enough precision, for example, with synaptic array size  $128 \times 128$ , and each cell represents 1-bit synapse, the partial-sums along one column would be 7-bit which can be impractical for an ASIC, thus we usually truncate the precision of ADC (for partial sums) to minimize the area and energy overhead.

As Fig. 20 shows, we perform 8-bit inference of VGG-8 network on CIFAR-10 dataset, to investigate the effects of truncating ADC precision on the classification accuracy. We set the sub-array size to be  $128 \times 128$ , and investigate three schemes with 1-bit cell, 2-bit cell and 4-bit cell. To minimize the ADC truncation effects on the partial-sums, we utilize the non-linear quantization with various quantization edges (corresponding to different ADC precision), where the edges are determined according to the distribution of partial-sums, as proposed in [10]. Compared to the baseline accuracy (no ADC truncation), the results suggest that at least 4-bit ADC is required to prevent significant accuracy degradation. Compared to a prior work on binary neural network where 3-bit ADC was reportedly sufficient [10], the results in Fig. 16 suggest that higher weight-precision generally requires higher ADC-precision. With larger synaptic array size or higher cell precision, higher ADC precision is demanded.

## 8) Adder and register

As mentioned earlier, the adders and registers are used to accumulate the partial weighted sum results during the row-by-row weighted sum operation in digital synaptic array architectures. The group of adders is instantiated from **Adder** class in **Adder.cpp** and the group of registers (**dff**) is instantiated from **DFF** class in **DFF.cpp**.

## 9) Adder and shift register

The adder and shift register pair at the bottom of synaptic core performs shift and add of the weighted sum result at each input vector bit cycle ( $B_1$  to  $B_n$  in Fig 12 (b)) to get the final weighted sum. The bit-width of the adder and shift register needs to be further extended depending on the precision of input vector. If the values in the input vector are only 1 bit, then the adder and shift register pair is not required. In the simulator, a collection of the adder and shift register pairs (**ShiftAdd**) is instantiated from **ShiftAdd** class in **ShiftAdd.cpp**, where **ShiftAdd** further contains a group of adders (**adder**) instantiated from **Adder** class in **Adder.cpp** and a group of registers (**dff**) instantiated from **DFF** class in **DFF.cpp**.

---

## 7. Device Level: Technology File

Technology.cpp and Param.cpp contain the transistor dimension/performance metrics and standard cell design parameters for each technology node, required for estimating the PPA of the circuits. In NeuroSim 1.4, technology file (Technology.cpp) is updated to support down to 1 nm CMOS technology node. Stacked nanosheet devices and backside power rail are introduced in the code from 2 nm. The details of the device and standard cell design assumptions/parameters are summarized in Table II-III below:

|                                              |    | Technology Node (nm) |     |     |     |     |     |     |
|----------------------------------------------|----|----------------------|-----|-----|-----|-----|-----|-----|
|                                              |    | 14                   | 10  | 7   | 5   | 3   | 2   | 1   |
| Metal track in M0<br>(T)                     |    | 9                    | 7.5 | 6   | 6   | 6   | 5.7 | 5   |
| Metal pitch<br>(nm)                          | M2 | 64                   | 44  | 40  | 36  | 32  | 24  | 16  |
|                                              | M0 | 64                   | 44  | 40  | 30  | 24  | 20  | 16  |
|                                              | M1 | 78                   | 64  | 57  | 34  | 32  | 23  | 20  |
| # of fins per standard cell<br>(NMOS + PMOS) |    | 4+4                  | 3+3 | 2+2 | 2+2 | 2+2 | 1+1 | 1+1 |
| Cell height (nm)                             |    | 576                  | 330 | 240 | 180 | 144 | 114 | 80  |
| Contacted poly pitch<br>(nm)                 |    | 78                   | 64  | 57  | 51  | 48  | 45  | 40  |
| PN separation length<br>(nm)                 |    | 136                  | 100 | 83  | 64  | 45  | 40  | 15  |
| Fin pitch (nm)                               |    | 48                   | 36  | 30  | 28  | 24  | 26  | 24  |
| Barrier thickness for<br>M0-M2 (nm)          |    | 2.5                  | 2.5 | 2.5 | 2.0 | 1.5 | 0.5 | 0.5 |

Table II. Assumed standard cell design assumptions for NeuroSim 1.4.

|                                                    | Technology Node (nm) |         |         |         |         |                  |                 |
|----------------------------------------------------|----------------------|---------|---------|---------|---------|------------------|-----------------|
|                                                    | 14                   | 10      | 7       | 5       | 3       | 2                | 1               |
| Fin height [FH] (nm)                               | 42                   | 45      | 50      | 50      | 48      | -                | -               |
| Fin width [FW] (nm)                                | 8                    | 8       | 7       | 6       | 5       | -                | -               |
| NS thickness [NT] (nm)                             | -                    | -       | -       | -       | -       | 6                | 6               |
| NS width [NW] (nm)                                 | -                    | -       | -       | -       | -       | 15               | 10              |
| NS vertical pitch [VP] (nm)                        | -                    | -       | -       | -       | -       | 14<br>IRDS: 18   | 12<br>IRDS: 16  |
| # of stacked NSs                                   | -                    | -       | -       | -       | -       | 3.0              | 4.0             |
| Gate length (nm)                                   | 26                   | 22      | 22      | 20      | 18      | 14               | 12              |
| Equivalent oxide thickness [EOT] (nm)              | 0.9                  | 0.8     | 0.7     | 0.65    | 0.6     | 0.55             | 0.5             |
| Gate height over fin [GHOF] (nm)                   | 40.0                 | 35.0    | 30.0    | 25.0    | 20.0    | 10.0<br>IRDS: 15 | 5.0<br>IRDS: 10 |
| Spacer width [SW] (nm)                             | 12.0                 | 8.0     | 8.0     | 7.0     | 6.0     | 6.0              | 5.0             |
| Spacer k value                                     | 4.5                  | 4.5     | 4.5     | 4       | 3.5     | 3.3              | 3               |
| Contact resistivity ( $\Omega \cdot \text{cm}^2$ ) | 2e-9                 | 2e-9    | 2e-9    | 2e-9    | 2e-9    | 2e-9             | 5e-10           |
| On current/fin ( $\mu\text{A}$ )                   | 54.744               | 58.725  | 60.139  | 61.320  | 64.788  | 66.385           | 59.005          |
| On current density ( $\mu\text{A}/\text{um}$ )     | 595.045              | 599.237 | 562.048 | 578.495 | 641.463 | 526.868          | 460.980         |
| Off current/fin (pA)                               | 9.856                | 12.516  | 15.752  | 14.676  | 16.006  | 9.242            | 21.747          |
| $g_m/\text{fin}$ (mS)                              | 0.130                | 0.177   | 0.191   | 0.193   | 0.204   | 0.248            | 0.307           |
| Supply voltage (V)                                 | 0.800                | 0.750   | 0.700   | 0.700   | 0.700   | 0.650            | 0.600           |
| Gate capacitance (nF/m)                            | 1.128                | 0.995   | 0.939   | 0.772   | 0.719   | 0.633            | 0.523           |
| Junction capacitance ( $\text{F}/\text{m}^2$ )     | 0.012                | 0.013   | 0.014   | 0.012   | 0.013   | 0.009            | 0.010           |
| Doping Concentration                               |                      |         |         |         |         |                  |                 |
| PTS doping ( $\text{cm}^{-3}$ )                    | 7e18                 |         |         |         |         |                  |                 |
| Substrate doping ( $\text{cm}^{-3}$ )              | 3e17                 |         |         |         |         |                  |                 |
| S/D doping ( $\text{cm}^{-3}$ )                    | 2e20                 |         |         |         |         |                  |                 |
| S/D extension doping ( $\text{cm}^{-3}$ )          | 1e18                 |         |         |         |         |                  |                 |
| Channel doping ( $\text{cm}^{-3}$ )                | 1e15                 |         |         |         |         |                  |                 |

Table III. Assumed device parameters for NeuroSim 1.4.

Standard cell dimension scaling is the driving force of the Moore's law. We assume that FinFET and stacked nanosheet are adopted at 14 nm to 3 nm and at 2 nm/1 nm, respectively. Standard cell is largely characterized by two parameters: contacted poly pitch (CPP) and the cell height (CH). CH can be decomposed into three components: PN separation length, Active area, Outer area (Fig. 1). We determine PN separation length by assuming two dummy fins are used for the separation of PMOS and NMOS for 14-7 nm and using International Roadmap for Devices and Systems (IRDS) 2021-2022 [11] projection values for 5-1 nm nodes. CPP, CH, and fin pitch (FP) for 14-7 nm are taken from the reported major company data [12-18] and the IRDS 2021-2022 values corresponding to the high-density (HD) application are used for 5-1 nm nodes. Fin number/standard cell of 14-7 nm nodes follow the reported industry trends [19-21] and that for 5-3 nm node is assumed to be 2+2/standard cell (NMOS+PMOS/standard cell). M0/M1/M2 metal pitch for 14-7 nm is assumed to be equal to minimum metal pitch/CPP/minimum metal pitch from the reported major company data [12-18] and that for 5-1 nm nodes follow IRDS 2021-2022's projections. M0/M1/M2 metal width is assumed to be one-half of the M0/M1/M2 metal pitch. The standard cell parameters used for the extended NeuroSim are summarized in Table 1 (in the appendix).

We use Sentaurus TCAD simulation to generate the device parameters. Due to the limited accessibility to the actual foundry data for the advanced technology nodes (14-1 nm), we follow IRDS 2016-2022's projection data [11] to set the on-current density. We set target  $I_{on}$  ( $\mu A/fin$ ) as the value corresponding to the HD application. Using the device design parameters as shown in Table III, we build an NMOS transistor at each technology node using Sentaurus TCAD simulation (NMOS and PMOS are assumed to have symmetrical characteristics). Device parameters such gate capacitance (normalized to effective width) and junction capacitance (normalized to S/D junction area), and transconductance  $g_m$  which are necessary for NeuroSim hardware estimation are extracted from the simulated  $I_D$ - $V_G$ ,  $C_{GG}$ - $V_G$  curve from Sentaurus TCAD.

It should be noted that updated technology parameters beyond 14 nm are based on the limited information provided from papers by major companies and IRDS projection. User can utilize their own judgement to modify the technology parameters in Technology.cpp if necessary.

## 8. Algorithm Level: PyTorch Wrapper

The algorithm we use to get the quantized DNN model for inference is the WAGE from [22]. The Pytorch code is modified based on [23-25]. The same algorithm is realized except that we move the scale term from weight to output to make it more suitable for the hardware architecture. This algorithm could directly train a quantized network with user defined bit width for weight, activation, gradient and error. The partial sum quantization (according to ADC precision) is to be released in future version.

Here we considered inference with offline training. In general, users could either train the network with floating point and find the quantization level with statistics for weight and activation or introduce quantization with desired quantization level during training directly. We choose the quantized training scheme using WAGE since WAGE quantize both weight and activation using fixed quantization level, which is [-1, 1] with scale of  $2^{-b}$ . This mechanism is friendly to hardware implementation, which normally represent data use 2's complimentary. WAGE also apply quantization to gradient and error, which is not necessary for inference stage (but maybe useful for online training to be release later). Users could set the bit-width to -1 to make these two floating-point for inference. Users need to pay attention that some hyper-parameters need to be changed if the bit-width is changed for WAGE algorithm.

The key parameters that will be transferred from the DNN algorithm to *NeuroSim* are weight precision (determining the synaptic weight cell design), partial sum precision (determining the ADC precision), and the activation precision (determining the input clock cycle number). For inference, the weight patterns are pre-defined by offline training, and they will be transferred to *NeuroSim* only once (acting as one-time programming), and then the input dataset (e.g. 1 test image) is loaded for the hardware performance estimation.

## 9. Algorithm Level: Inference Accuracy Estimation

In this framework, the neural network model (or weights) is assumed to be pre-trained off-chip, and then mapped to the compute-in-memory (i.e. CIM) inference chip. Thus, the non-ideal effects of synaptic devices (such as nonlinearity, asymmetry and endurance during weight-update operation) are not considered in this inference version (V1.0-V1.3) but are considered in the training version (V2.0-V2.2). In the contrast, the main factors that we introduced into the accuracy estimation of inference chip are: on/off ratio, ADC quantization effects, conductance variation and retention.

As Fig.21 shows, to represent the weights from algorithm (floating-point) on the CIM architectures, due to the limited precision of synaptic devices, one ideal way is to normalize the weights to decimal integers, and then digitalize the integers to conductance levels. For example, as shown in Fig. 21, if we define the synaptic weight precision to be 4-bit (decimal integer 0 to 15), and represented by 2-bit (conductance level 0 to 3) synaptic devices, from algorithm, the floating-point weight “+0.8906” will be normalized to 15, and thus be mapped to two synaptic devices, one as LSB and one as MSB, and each of them are on conductance level 3 (i.e. $15/4=3$ ,  $15\%4=3$ ).

### 1) Conductance On/Off Ratio

Ideally, the conductance levels of synaptic devices range from 0 to  $2^N$ , where N is the precision of synaptic devices. However, the minimum conductance can be regarded as 0 only if the conductance on/off ratio (=maximum conductance/minimum conductance) of synaptic devices is infinity, which is not feasible in current technology.

Thus, in reality, the minimum conductance level is not an ideal “0”. For example, if we use a normalized synaptic device conductance range as 0~1 (as  $0 \sim 2^N/2^N$ ), where the “1” can be represented as maximum conductance, and “0” is minimum conductance, in algorithm aspect, the conductance level “1” represent ideal “1”, while the conductance level “0” actually represent a non-zero value “1/(on/off ratio)”. In this case, small on/off ratio will introduce such non-ideal zeros into the calculation, and significantly distort the inference accuracy.

One approach to remedy this situation is to eliminate the effect of the OFF-state current in every weight element with the aid of a dummy column. In this framework, as Fig. 22 shows, we map the algorithm weights (range [-1, +1]) to synaptic devices (conductance range [ $G_{min}$ ,  $G_{max}$ ]) in the synaptic arrays, while



Fig. 21. Mapping weight from algorithm to synaptic device conductance in CIM architecture.

we set a group of dummy columns beside each synaptic array, and the devices in dummy columns are set to the middle conductance  $(G_{\min}+G_{\max})/2$ . Such that, by subtracting the real outputs with the dummy outputs, the truncated conductance range will be  $[-(G_{\max}-G_{\min})/2, +(G_{\max}-G_{\min})/2]$ , which is zero-centered as  $[-1, +1]$ , and the off-state current effects are perfectly removed.

The conductance on/off ratio is defined as one argument “args.onoffratio” in “inference.py” file.

## 2) Conductance Variation

It is well known that the synaptic devices involving drift and diffusion of the ions/vacancies show considerable variation from device to device, and even from pulse to pulse within one device. Thus, in inference chip, although the weight-update operation is not required, conductance variation is still a concern during initialization or programming of the synaptic arrays.

In this framework, the conductance variation is introduced as a percentage of variation of desired conductance, for example, if the desired conductance is 0.5, with +0.1 conductance variation, the actual conductance will be 0.55, similarly, with -0.2 conductance variation, the actual conductance will be 0.4.

For a chip, the conductance variation could be different from array to array, and device to device, so we module such variation as a function of random generator, to generate conductance variation of different cells, while the standard deviation of this random generator will be the argument in “inference.py” file, as “args.vari”.

## 3) Retention

Retention refers to the ability of memory device to retain its programmed state over a long period of time. Typical retention specification for NVM in memory application is more than 10 years at 85°C. Many binary eNVM devices have been able to meet this requirement. However, there are no reported data for analog eNVM that shows such retention, which can be attributed to the instability of intermediate conductance states.



Fig. 22. Introduce dummy column to cancel out the off-state current effects.

To be general, we consider four scenarios of conductance drift for the retention analysis, as shown in Fig. 23, where the conductance can either drift toward its maximum, minimum or intermediate states, or just randomly drift. The formula for modeling the conductance drift behavior is assumed to follow the one shown below:

$$G = G_0 \left( \frac{t}{t_0} \right)^v$$

where “ $G_0$ ” is the initial conductance, “ $t$ ” is the retention time, “ $v$ ” is the drift coefficient and “ $t_0$ ” is the time constant which is assumed to be 1 second in this framework.

To estimate the retention effect on inference accuracy, we define a function called “Retention” in file “wage\_quantizer.py”, where the retention time and drift coefficient are defined as “args.t” and “args.v” separately, while “args.detect” is used to define the drift scenario, if “args.detect” is 1, then the drift scenario is drifting to a fixed value (otherwise, it is random drift), and the targeted value is then defined as “args.target”, the range is defined from 0 to 1. Those arguments can be defined in file “inference.py”.

#### 4) ADC Quantization Effects

For CIM architecture, there are mainly two read-out schemes. A sequential processing method of the matrix-vector multiplication is to read out the dot-products in a row-by-row manner, which leads to extra energy and latency for accumulations along the rows. A more efficient method is parallel processing, where multiple rows are activated simultaneously by a switch matrix, and the current summation will be read out by an ADC. Therefore, the row-by-row accumulation periphery of sequential scheme is eliminated. However, since it is impractical to use very high-precision ADC at the edge of eNVM sub-arrays, we have to truncate the precision of ADC (for partial sums) to minimize the area and energy overhead.

To minimize the ADC precision while guarantee the inference accuracy, it is necessary to run the simulation of ADC quantization before hardware design. In this framework, we support two quantization methods: linear and non-linear quantization. As Fig. 24 shows, in linear quantization, the ADC references are distributed linearly across possible partial-sum value range in the synaptic array; while in non-linear quantization, the ADC references are non-linearly distributed, according to the distribution of partial-sums, the references are more spread in high-probability area, while less in low-probability part.



Fig. 23. Different scenarios of conductance drift.

Normally, non-linear quantization can save ~1-bit ADC precision compared with linear quantization; however, the choice of non-linear references and quantized outputs is quite sensitive, the detection of partial-sum distribution is necessary. In this framework, we defined two functions called “NonLinearQuantizeOut” and “LinearQuantizeOut” in file “wage\_quantizer.py”, while the users can define the “args.ADCprecision” in file “inference.py”.

## 10. How to run *DNN + NeuroSim*

### 1) Define Network Structure in NetWork.csv

Firstly, the users have to define network structure in the NetWork.csv file, such that the NeuroSim will process the floorplan and define the hardware design. Taking the default VGG-8 with 8 layers as an example, the definition of each cell in the excel table in shown below, in the NetWork.csv file, only the numbers are supposed to be filled in, i.e. the texts cannot be written in the file, it is important to accurately modify the table to avoid segmentation fault.



Fig. 24. Linear and non-linear ADC quantization.

Table IV NetWork.csv

|         | IFM Length | IFM Width | IFM Channel Depth | Kernel Length | Kernel Width | Kernel Depth | Followed by pooling or not? |
|---------|------------|-----------|-------------------|---------------|--------------|--------------|-----------------------------|
| Layer 1 | 32         | 32        | 3                 | 3             | 3            | 128          | 0                           |
| Layer 2 | 32         | 32        | 128               | 3             | 3            | 128          | 1                           |
| Layer 3 | 16         | 16        | 128               | 3             | 3            | 256          | 0                           |
| Layer 4 | 16         | 16        | 256               | 3             | 3            | 256          | 1                           |
| Layer 5 | 8          | 8         | 256               | 3             | 3            | 512          | 0                           |
| Layer 6 | 8          | 8         | 512               | 3             | 3            | 512          | 1                           |
| Layer 7 | 1          | 1         | 8192              | 1             | 1            | 1024         | 0                           |
| Layer 8 | 1          | 1         | 1024              | 1             | 1            | 10           | 0                           |

In the default VGG-8 network, layer 1 to layer 6 are convolutional layers, and layer 7 to layer 8 are fully-connected layers. In the Table IV, the dimensions of each layer are defined in different rows, from layer 1 to layer 8 (row 1 to row 8), while the first three columns (column 1 to column 3) are used to define the

dimension of input feature maps (IFMs) of each layer. For example, the input image size of layer 1 is  $32 \times 32 \times 3$ , thus, in first row, the first three cells should be filled by 32, 32 and 3 respectively, which indicated the length, width and depth of the IFM. The next three columns (column 4 to column 6) are used to define the dimension of kernels. For example, the kernel size of layer 3 is  $3 \times 3 \times 128 \times 256$  (i.e. each single 3D kernel is  $3 \times 3 \times 128$ , the kernel depth is 256), since it is well known that the third dimension of kernel is defined by the IFM channel depth, it is not necessary to define the third dimension again, thus, from the Table IV, in row 3, the fourth, fifth and sixth cell should be filled by 3, 3 and 256, which represent the length, width and kernel depth (first, second and fourth dimension of kernel) respectively. One should notice that the fully-connected layer can also be represented in the similar way, by considering it as a special convolutional layer, which has unit length and width for IFM and kernels. The last column is used to define whether the current layer is followed by pooling, it will be read by *NeuroSim*, and properly estimate the hardware performance for pooling function, in this framework, the activation function is considered to be integrated in every layer.

## 2) Modify the hardware parameters in Param.cpp

After setting up the network structure, the users need to define the hardware parameters in **Param.cpp**. In this file, the users could define the parameters, such as technology node (**technode**), device type (**memcelltype**: SRAM, eNVM or FeFET), operation mode (**operationmode**: parallel or sequential analog, synaptic sub-array size (**numRowSubArray**, **numColSubArray**), synaptic device precision (**cellBit**), mapping method (conventional or novel), activation type (sigmoid or ReLU), cell height/width in feature size (F), clock frequency and so on. We list some recommended device parameters as below:

Table V Scaling trend of SRAM cell area with technology nodes (assuming F is the same as the technology node)

| SRAM                        | 22nm | 14nm | 10nm | 7nm  | 5nm  | 3nm  | 2nm  | 1nm   |
|-----------------------------|------|------|------|------|------|------|------|-------|
| 8T SRAM cell size ( $F^2$ ) | 360  | 480  | 720  | 1080 | -    | -    | -    | -     |
| 6T SRAM cell size ( $F^2$ ) | 280  | 326  | 420  | 551  | 1120 | 2048 | 4680 | 11500 |

Table VI Recommended eNVM device parameters.

|                     | 22nm RRAM<br>(Intel)<br>[26] | 90nm RRAM<br>(Winbond)<br>[37] | 130nm RRAM<br>(Tsinghua)<br>[38] | 90nm PCM<br>(IBM)<br>[49] | 22nm FeFET<br>(GF)<br>[50] | 22nm STT-MRAM<br>(Intel)<br>[51] |
|---------------------|------------------------------|--------------------------------|----------------------------------|---------------------------|----------------------------|----------------------------------|
| Ron ( $\Omega$ )    | 6k                           | 6k                             | 100k                             | 40k                       | 240k                       | 1.4k                             |
| On/off ratio        | 17                           | 150                            | 10                               | 12.5                      | 100                        | 2.8                              |
| Cell size ( $F^2$ ) | $5 \times 12$                | $6 \times 6$                   | $4 \times 4$                     | $4 \times 4$              | $4 \times 6$               | $10 \times 10$                   |

In this framework, all the hardware parameters that users need to define are summarized in the **Param.cpp**, thus, to successfully run the simulator, the two main files users need to visit are **NetWork.csv** and **Param.cpp**.

### 3) Compilation of *NeuroSim*

After modifying the **NetWork.csv** and **Param.cpp** files, or whenever any change is made in the files, the codes have to be recompiled by using make command as stated in **Installation and Usage (Linux + Anaconda)** section. If the compilation is successful, a screenshot like Fig. 25 can be expected.

```
model_path = './log/default/batch_size=200/decreasing_lr=200,250/grad_scale=8/seed=117/type=cifar10/wl_activate=8/wl_error=8/wl_grad=8/wl_weight=2/latest.pth'
# data loader and model
```

Fig. 26 example of load path.

### 4) Run the program with PyTorch wrapper

After compilation of *NeuroSim*, go back to the PyTorch wrapper, in the wrapper, there are three networks (VGG-8 network for CIFAR-10 dataset, DenseNet-40 network for CIFAR-10 dataset, ResNet-18 network for ImageNet dataset) as default, the users can modify their network structures, and run the simulator correspondingly.

Instructions to run the wrapper:

```
g++ -c -fopenmp -O3 -std=c++0x -w NewFlux.cpp -o NewFlux.o
g++ -c -fopenmp -O3 -std=c++0x -w ProcessingUnit.cpp -o ProcessingUnit.o
g++ -c -fopenmp -O3 -std=c++0x -w Bus.cpp -o Bus.o
g++ -c -fopenmp -O3 -std=c++0x -w XorArbiterPuf.cpp -o XorArbiterPuf.o
g++ -c -fopenmp -O3 -std=c++0x -w ArbiterPuf.cpp -o ArbiterPuf.o
g++ -c -fopenmp -O3 -std=c++0x -w formula.cpp -o formula.o
g++ -c -fopenmp -O3 -std=c++0x -w Dff.cpp -o Dff.o
g++ -c -fopenmp -O3 -std=c++0x -w FunctionUnit.cpp -o FunctionUnit.o
g++ -c -fopenmp -O3 -std=c++0x -w RippleCounter.cpp -o RippleCounter.o
g++ -c -fopenmp -O3 -std=c++0x -w Adder.cpp -o Adder.o
g++ -c -fopenmp -O3 -std=c++0x -w Technology.cpp -o Technology.o
g++ -c -fopenmp -O3 -std=c++0x -w MultilevelSenseAmp.cpp -o MultilevelSenseAmp.o
g++ -c -fopenmp -O3 -std=c++0x -w SwitchMatrix.cpp -o SwitchMatrix.o
g++ -c -fopenmp -O3 -std=c++0x -w CurrentSenseAmp.cpp -o CurrentSenseAmp.o
g++ -c -fopenmp -O3 -std=c++0x -w NewSwitchMatrix.cpp -o NewSwitchMatrix.o
g++ -c -fopenmp -O3 -std=c++0x -w WLNewDecoderDriver.cpp -o WLNewDecoderDriver.o
g++ -c -fopenmp -O3 -std=c++0x -w BitShifter.cpp -o BitShifter.o
g++ -c -fopenmp -O3 -std=c++0x -w VoltageSenseAmp.cpp -o VoltageSenseAmp.o
g++ -c -fopenmp -O3 -std=c++0x -w MapPooling.cpp -o MapPooling.o
g++ -c -fopenmp -O3 -std=c++0x -w Sigmoid.cpp -o Sigmoid.o
g++ -c -fopenmp -O3 -std=c++0x -w Param.cpp -o Param.o
g++ -c -fopenmp -O3 -std=c++0x -w SubArray.cpp -o SubArray.o
g++ -c -fopenmp -O3 -std=c++0x -w AdderTree.cpp -o AdderTree.o
g++ -c -fopenmp -O3 -std=c++0x -w Comparator.cpp -o Comparator.o
g++ -c -fopenmp -O3 -std=c++0x -w DecoderDriver.cpp -o DecoderDriver.o
g++ -c -fopenmp -O3 -std=c++0x -w Subtractor.cpp -o Subtractor.o
g++ -c -fopenmp -O3 -std=c++0x -w MultilevelSAEncoder.cpp -o MultilevelSAEncoder.o
g++ -c -fopenmp -O3 -std=c++0x -w Chip.cpp -o Chip.o
g++ -c -fopenmp -O3 -std=c++0x -w Precharger.cpp -o Precharger.o
g++ -c -fopenmp -O3 -std=c++0x -w RowDecoder.cpp -o RowDecoder.o
g++ -c -fopenmp -O3 -std=c++0x -w Matrix.cpp -o Matrix.o
g++ -c -fopenmp -O3 -std=c++0x -w SenseAmp.cpp -o SenseAmp.o
g++ -c -fopenmp -O3 -std=c++0x -w Buffer.cpp -o Buffer.o
g++ -c -fopenmp -O3 -std=c++0x -w WLDecoderOutput.cpp -o WLDecoderOutput.o
g++ -c -fopenmp -O3 -std=c++0x -w ReadCircuit.cpp -o ReadCircuit.o
g++ -c -fopenmp -O3 -std=c++0x -w DeMux.cpp -o DeMux.o
g++ -c -fopenmp -O3 -std=c++0x -w Tile.cpp -o Tile.o
g++ -c -fopenmp -O3 -std=c++0x -w ShiftAdd.cpp -o ShiftAdd.o
g++ -c -fopenmp -O3 -std=c++0x -w HTree.cpp -o HTree.o
g++ -c -fopenmp -O3 -std=c++0x -w SRAMWriteDriver.cpp -o SRAMWriteDriver.o
g++ -c -fopenmp -O3 -std=c++0x -w SramNewSA.cpp -o SramNewSA.o
g++ -c -fopenmp -O3 -std=c++0x -w main.cpp -o main.o
g++ -fopenmp -O3 -std=c++0x -w NewFlux.o ProcessingUnit.o XorArbiterPuf.o ArbiterPuf.o formula.o Dff.o FunctionUnit.o RippleCounter.o Adder.o Technology.o MultilevelSenseAmp.o SwitchMatrix.o CurrentSenseAmp.o NewSwitchMatrix.o WLNewDecoderDriver.o BitShifter.o VoltageSenseAmp.o MapPooling.o Sigmoid.o Param.o AdderTree.o Comparator.o Subtractor.o MultilevelSAEncoder.o Chip.o Precharger.o RowDecoder.o Matrix.o SenseAmp.o Buffer.o WLDecoderOutput.o ReadCircuit.o DeMux.o Tile.o ShiftAdd.o HTree.o SRAMWriteDriver.o SramNewSA.o main.o -o main
```

Fig. 25 Output of compilation.

- PyTorch (<https://pytorch.org/>)
  - The bitwidth could be set use optional parameter
  - Train
    - Python train.py
    - The model will be saved at a hierarchical folders based one the option value.
  - Inference
    - Python inference.py
    - Set model\_path to the saved model \*.pth file

The program will print out the results for each layer of the network during the simulation. The simulation will approximately take 5 minutes with a computer workstation (Intel 8-core CPU with 3.2 GHz and NVidia Titan V GPU) for the VGG-8 network. Fig. 27 shows an example of final output of an 8-bit VGG-8 inference for one CIFAR-10 image, based on parallel 1T1R synaptic array, with 2-bit per cell RRAM (100k $\Omega$  and 10M $\Omega$  as  $R_{on}$  and  $R_{off}$ ). The outputs from the simulation include hardware inference accuracy, memory utilization, and latency/energy/leakage breakdown for 1-image inference, and the equivalent energy efficiency in terms of TOPS/W, and throughput in terms of TOPS and frames per second (FPS), and compute efficiency in terms of TOPS/mm<sup>2</sup>.

```

Test set: Average loss: 1.5600, Accuracy: 9305/10000 (93%)
FloorPlan
-----
Tile and PE size are optimized to maximize memory utilization (= memory mapped by synapse / total memory on chip)

Desired Conventional Mapped Tile Storage Size: 1024x1024
Desired Conventional PE Storage Size: 512x512
Desired Novel Mapped Tile Storage Size: 9x512x512
User-defined Subarray Size: 128x128

----- # of tile used for each layer -----
layer1: 1
layer2: 1
layer3: 2
layer4: 2
layer5: 4
layer6: 4
layer7: 32
layer8: 1

----- Speed-up of each layer -----
layer1: 16
layer2: 4
layer3: 4
layer4: 2
layer5: 2
layer6: 1
layer7: 1
layer8: 8

----- Utilization of each layer -----
layer1: 0.210938
layer2: 1
layer3: 1
layer4: 1
layer5: 1
layer6: 1
layer7: 1
layer8: 0.3125
Memory Utilization of Whole Chip: 96.8384 %

FloorPlan Done
-----
Hardware Performance
-----
Estimation of Layer 8
-----
layer8's readLatency is: 1267.57ns
layer8's readDynamicEnergy is: 5052.4pJ
layer8's leakagePower is: 9.95156uW
layer8's leakageEnergy is: 568.358pJ
layer8's buffer latency is: 1067.22ns
layer8's buffer readDynamicEnergy is: 132.858pJ
layer8's ic latency is: 128.549ns
layer8's ic readDynamicEnergy is: 3803.4pJ

----- Breakdown of Latency and Dynamic Energy -----
ADC (or S/A and precharger for SRAM) readLatency is: 16.4112ns
Accumulation Circuits (subarray level: adders, shiftAddrs, PE/Tile/Global level: accumulation units) readLatency is: 53.3366ns
Other Peripherals (e.g. decoders, mux, switchmatrix, buffers, IC, pooling and activation units) readLatency is: 1197.82ns
ADC (or S/A and precharger for SRAM) readDynamicEnergy is: 368.345pJ
Accumulation Circuits (subarray level: adders, shiftAddrs, PE/Tile/Global level: accumulation units) readDynamicEnergy is: 396.968pJ
Other Peripherals (e.g. decoders, mux, switchmatrix, buffers, IC, pooling and activation units) readDynamicEnergy is: 4287.09pJ

----- Breakdown of Latency and Dynamic Energy -----
----- Summary -----
ChipArea : 4.5554e+07um^2
Chip total GIM array : 1.5408e+06um^2
Total IC Area on chip (Global and Tile/PE local): 6.17954e+06um^2
Total ADC (or S/A and precharger for SRAM) Area on chip : 1.46062e+07um^2
Total Accumulation Circuits (subarray level: adders, shiftAddrs, PE/Tile/Global level: accumulation units) on chip : 3.13322e+06um^2
Other Peripherals (e.g. decoders, mux, switchmatrix, buffers, pooling and activation units): 2.00943e+07um^2

----- Breakdown of Latency and Dynamic Energy -----
Chip clock period is: 2.05141ns
One layer-by-layer execution (per image) is: 1.45613e+06ns
Chip total readDynamicEnergy is: 3.77518e+07pJ
Chip total leakage Energy is: 1.19011e+06pJ
Chip total leakage Power is: 628.781uW
Chip buffer readLatency is: 1.00082e+06ns
Chip buffer readDynamicEnergy is: 412968pJ
Chip ic readLatency is: 141884ns
Chip ic readDynamicEnergy is: 94262e+06pJ

----- Breakdown of Latency and Dynamic Energy -----
ADC (or S/A and precharger for SRAM) readLatency is: 63462.3ns
Accumulation Circuits (subarray level: adders, shiftAddrs, PE/Tile/Global level: accumulation units) readLatency is: 246001ns
Other Peripherals (e.g. decoders, mux, switchmatrix, buffers, IC, pooling and activation units) readLatency is: 1.14667e+06ns
ADC (or S/A and precharger for SRAM) readDynamicEnergy is: 1.8274e+07pJ
Accumulation Circuits (subarray level: adders, shiftAddrs, PE/Tile/Global level: accumulation units) readDynamicEnergy is: 5.16853e+06pJ
Other Peripherals (e.g. decoders, mux, switchmatrix, buffers, IC, pooling and activation units) readDynamicEnergy is: 1.43092e+07pJ

----- Breakdown of Latency and Dynamic Energy -----
----- Performance -----
Energy Efficiency TOPS/W (Layer-by-Layer Process): 25.7176
Throughput TOPS (Layer-by-Layer Process): 0.845963
Throughput FPS (Layer-by-Layer Process): 688.75
Compute efficiency TOPS/mm^2 (Layer-by-Layer Process): 0.0185705
----- Hardware Performance Done -----
----- Simulation Performance -----
Total Run-time of NeuroSim: 132 seconds
----- Simulation Performance -----

```

Fig. 27 Example of final output.

## 11. Hardware Estimation Examples

We provide hardware estimation results for a few cases (including the added technology nodes) under VGG8-CIFAR10 workload, highlighting the new features of NeuroSim 1.4.

Here, for all cases we assume 1b/cell and XY bus for the global interconnect topology which is newly introduced in NeuroSim 1.4. Assumed subarray size and multiply-and-accumulate (MAC) operation is 128 x 128 and 8b (weight) x 8b (activation), respectively. The rest of the parameters follow the default setting in the released code.

TABLE VII Hardware Estimation Examples.

| Memory Type | Tech Node | On/Off | ADC Precision | Parallel Read | Accuracy | Area (mm <sup>2</sup> ) | TOPS    | TOPS/W  | TOPS/mm <sup>2</sup> |
|-------------|-----------|--------|---------------|---------------|----------|-------------------------|---------|---------|----------------------|
| RRAM        | 22        | 17     | 4             | 32            | 71%      | 81.451                  | 7.08885 | 20.3117 | 0.0870315            |
| RRAM        | 22        | 17     | 4             | 64            | 43%      | 80.766                  | 8.49117 | 31.7514 | 0.105133             |
| SRAM        | 22        | 17     | 4             | 32            | 77%      | 70.896                  | 6.9434  | 14.7364 | 0.0979379            |
| SRAM        | 22        | 17     | 4             | 64            | 59%      | 69.471                  | 8.33172 | 25.1646 | 0.119929             |
| SRAM        | 5         | 17     | 4             | 32            | 77%      | 10.803                  | 10.8157 | 43.205  | 1.00111              |
| SRAM        | 5         | 17     | 4             | 64            | 59%      | 10.536                  | 11.1945 | 68.1646 | 1.06245              |
| SRAM        | 1         | 17     | 4             | 32            | 77%      | 4.724                   | 11.7563 | 84.5428 | 2.48859              |
| SRAM        | 1         | 17     | 4             | 64            | 59%      | 4.651                   | 12.3144 | 129.889 | 2.64754              |

## 12. Reference

- [1]. P.-Y. Chen, X. Peng, S. Yu, "NeuroSim: A circuit-level macro model for benchmarking neuro-inspired architectures in online learning," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018.
- [2]. [github.com/neurosim/MLP\\_NeuroSim V3.0](https://github.com/neurosim/MLP_NeuroSim)
- [3]. W. Li, S. Huang, X. Sun, H. Jiang, S. Yu, "Secure-RRAM: A 40nm 16kb compute-in-memory 354 macro with reconfigurability, sparsity control, and embedded security," *IEEE Custom Integrated 355 Circuits Conference (CICC)*, 2021.
- [4]. Predictive Technology Model (PTM). Available at <http://ptm.asu.edu/>
- [5]. N. E. Weste and D. Harris, "CMOS VLSI Design – A Circuit and Systems Perspective, 4<sup>th</sup> edition," 2007.
- [6]. X. Peng, R. Liu and S. Yu, "Optimizing weight mapping and data flow for convolutional neural networks on RRAM based processing-in-memory architecture," IEEE International Symposium on Circuits and Systems (ISCAS), 2019.
- [7]. P.-Y. Chen, et al., "Technology-design co-optimization of resistive cross-point array for accelerating learning algorithms on chip," ACM/IEEE Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015.
- [8]. W. Khwa et al., "A 65nm 4Kb algorithm-dependent computing-in-memory SRAM unit-macro with 2.3ns and 55.8TOPS/W fully parallel product-sum operation for binary DNN edge processors," IEEE International Solid State Circuits Conference (ISSCC), 2018.
- [9]. M. Jerry, et al., "Ferroelectric FET analog synapse for acceleration of deep neural network training," IEEE International Electron Devices Meeting (IEDM), 2017.
- [10]. X. Sun, S. Yin, X. Peng, R. Liu, J.-S. Seo, S. Yu, "XNOR-RRAM: A scalable and parallel resistive synaptic architecture for binary neural networks," ACM/IEEE Design, Automation & Test in Europe Conference (DATE), 2018.
- [11]. IRDS 2016-2022 reports [Online] Available: <https://irds.ieee.org/editions>
- [12]. "Samsung 14 nm FinFET." <https://www.samsung.com/us/sas/Business/Foundry14nm> (accessed May. 18, 2023).

- [13]. 14 nm lithography process [Online] Available: [https://en.wikichip.org/wiki/14\\_nm\\_lithography\\_process](https://en.wikichip.org/wiki/14_nm_lithography_process)
- [14]. 10 nm lithography process [Online] Available: [https://en.wikichip.org/wiki/10\\_nm\\_lithography\\_process](https://en.wikichip.org/wiki/10_nm_lithography_process)
- [15]. 7 nm lithography process [Online] Available: [https://en.wikichip.org/wiki/7\\_nm\\_lithography\\_process](https://en.wikichip.org/wiki/7_nm_lithography_process)
- [16]. D. Schor, "TSMC 7nm HD and HP Cells, 2nd Gen 7nm, And The Snapdragon 855 DTCO." [Online]. Available: <https://fuse.wikichip.org/news/2408/tsmc-7nm-hd-and-hp-cells-2nd-gen-7nm-and-the-snapdragon-855-dtco/>.
- [17]. D. Schor, "VLSI 2018: Samsung's 8nm 8LPP, a 10nm extension." [Online]. Available: <https://fuse.wikichip.org/news/1443/vlsi-2018-samsungs-8nm-8lpp-a-10nm-extension/>.
- [18]. M. Sharma, "TSMC 5 nm vs. Intel 10nm vs. GloFo 7 nm." [Online]. Available: <https://www.apothetech.com/tsmc-5-nm-vs-intel-10nm-vs-glofo-7-nm/>.
- [19]. K. Gibb, "Samsung's 14 nm LPE FinFET transistors." [Online]. Available: <https://www.eenewseurope.com/en/samsungs-14-nm-lpe-finfil-transistors/>.
- [20]. I. Huang, "Huawei, Apple use TSMC but their 7nm SoCs are different." [Online]. Available: <https://www.eetasia.com/huawei-apple-use-tsmc-but-their-7nm-socts-are-different/>.
- [21]. Comparing SMIC 7nm vs TSMC 7nm [Online] Available: <https://files.static-nzz.ch/2022/08/30/ef873456-9644-4a36-b160-ec0d41853e68.pdf>
- [22]. S. Wu, et al. "Training and inference with integers in deep neural networks," arXiv: 1802.04680, 2018.
- [23]. [github.com/boluowefenda/WAGE](https://github.com/boluowefenda/WAGE)
- [24]. [github.com/stevenygd/WAGE.pytorch](https://github.com/stevenygd/WAGE.pytorch)
- [25]. [github.com/aaron-xichen/pytorch-playground](https://github.com/aaron-xichen/pytorch-playground)
- [26]. P. Jain, U. Arslan, M. Sekhar, B.C. Lin, L. Wei, T. Sahu, J. Alzate-vinasco, A. Vangapaty, M. Meterelliyo, N. Strutt, A.B. Chen, "A 3.6 Mb 10.1 Mb/mm<sup>2</sup> Embedded Non-Volatile ReRAM Macro in 22nm FinFET Technology with Adaptive Forming/Set/Reset Schemes Yielding Down to 0.5 V with Sensing Time of 5ns at 0.7 V," IEEE International Solid-State Circuits Conference (ISSCC), 2019.
- [27]. W. He, S. Yin, Y. Kim, X. Sun, J.J. Kim, S. Yu and J.S. Seo, "2-Bit-per-Cell RRAM based In-Memory Computing for Area-/Energy-Efficient Deep Learning," IEEE Solid-State Circuits Letters, vol. 3, pp. 194-197, 2020.
- [28]. W. Wu, H. Wu, B. Gao, P. Yao, X. Zhang, X. Peng, S. Yu, H. Qian, "A methodology to improve linearity of analog RRAM for neuromorphic computing," IEEE Symposium on VLSI Technology (VLSI), 2018.
- [29]. W. Kim, R.L. Bruce, T. Masuda, G.W. Fraczak, N. Gong, P. Adusumilli, S. Ambrogio, H. Tsai, J. Bruley, J.P. Han, M. Longstreet, "Confined PCM-based analog synaptic devices offering low resistance-drift and 1000 programmable states for deep learning," IEEE Symposium on VLSI Technology, 2019.
- [30]. K. Ni, B. Grisafe, W. Chakraborty, A.K. Saha, S. Dutta, M. Jerry, J.A. Smith, S. Gupta, S. Datta, "In-memory computing primitive for sensor data fusion in 28 nm HKMG FeFET technology," IEEE International Electron Devices Meeting (IEDM), 2018.
- [31]. L. Wei, J.G. Alzate, U. Arslan, et al. "A 7Mb STT-MRAM in 22FFL FinFET technology with 4ns read sensing time at 0.9 V using write-verify-write scheme and offset-cancellation sensing technique" IEEE International Solid-State Circuits Conference (ISSCC), 2019.