

# IterL2Norm: Fast Iterative L2-Normalization

ChangMin Ye<sup>†</sup>, Yonguk Sim<sup>‡</sup>, Youngchae Kim<sup>‡</sup>, SeongMin Jin<sup>†</sup>, and Doo Seok Jeong<sup>††\*</sup>

<sup>†</sup>Division of Materials Science and Engineering, Hanyang University, Seoul, Republic of Korea

<sup>‡</sup>Department of Semiconductor Engineering, Hanyang University, Seoul, Republic of Korea

\*Corresponding author: dooseokj@hanyang.ac.kr

**Abstract**—Transformer-based large language models are memory-bound models whose operation relies on a large amount of data that is only marginally reused. Thus, data movement between the host and the accelerator likely dictates the total wall-clock time. Layer normalization is one of the key workloads in the transformer model, following each of the multi-head attention and feed-forward network blocks. To reduce data movement, layer normalization needs to be performed on the same chip as the matrix-matrix multiplication engine. To this end, we introduce an iterative L2-normalization method for 1D input (IterL2Norm), which ensures fast convergence to the steady-state solution within five iteration steps and high precision, outperforming the fast inverse square root algorithm in six out of nine cases for FP32 and five out of nine for BFloat16 across the embedding lengths used in the OPT models. Implemented in 32/28nm CMOS, the IterL2Norm macro normalizes  $d$ -dimensional vectors, where  $64 \leq d \leq 1024$ , with a latency of 116–227 cycles at 100MHz/1.05V.

**Index Terms**—IterL2Norm, layer normalization, fast convergence, large language models

## I. INTRODUCTION

Large language models (LLMs) such as GPT [1], Gemini [2], and Llama [3] represent a recent breakthrough with profound impacts on global society. These state-of-the-art LLMs commonly adopt the transformer architecture [4] that ensures high performance in natural language processing due to self-attention with an explicit working memory. In particular, the decoder-only transformer architecture employed in these LLMs ensures high-performance associative recalls in a generative manner. The decoder-only transformer consists of multiple decoders in series, each of which consists of a masked multi-head attention and feed-forward network sub-block in series. Matrix-matrix multiplication (MatMul) operations in these sub-blocks represent a major workload. Notably, each sub-block is followed by layer normalization that L2-normalizes the output for each batch.

Transformer-based LLMs are memory-bound models: their operations depend on a large amount of data that is reused very little (cf. convolutional neural networks as representative compute-bound models) [5]. The large data volume and its limited reuse cause significant data traffic between the main memory (DRAM) and the processor, so that the overall wall-clock time is dictated by the memory bandwidth rather than the processor performance.

This research was supported by National R&D Program through the National Research Foundation of Korea (NRF) and Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (RS-2024-00406897 and IITP-(2024)-RS-2023-00253914). The EDA tool was supported by the IC Design Education Center (IDEC), Korea.

Thus, graphics processing units (GPUs) equipped with high bandwidth memories are widely used to accelerate LLMs. As an alternative to GPUs, various processors have recently been proposed to mainly accelerate MatMul operations and/or activation functions at a lower power than GPUs, including computing-in-memory units such as Function-In-Memory DRAM (FIMDRAM) [6], and GDDR6-based accelerator-in-memory (AiM) [7]. However, most of them rely on their host for layer normalization. That is, the output from each sub-block should be sent to the main memory for layer normalization, which leads to data traffic, and thus significant latency and power consumption [5].

For layer normalization to be performed on the same chip as the MatMul processing engines, additional arithmetic operations such as square root and division need to be supported on chip. However, the area and power overheads of such units are likely prohibitive, as in [8]. Alternatively, various approximations that avoid division have been implemented using digital logic circuits [9]–[11]. Unfortunately, their detailed implementations and performance data are not well documented.

A layer normalization algorithm for on-chip implementation needs to avoid vanilla division and square root operations, apply generically to various floating-point (FP) formats, and incur low power/area overheads and operational latency. To this end, we propose a fast iterative L2-normalization algorithm (IterL2Norm) that is based on a high-dimensional dynamic system with a few fixed points. One of them represents the L2-normalized 1D vector in the hyperspace, which can be attained by appropriately setting the initial point in the hyperspace. IterL2Norm is free of division and square root operations, rendering it suitable for power- and area-efficient on-chip implementations. It rests on a solid theoretical ground and is applicable to various FP formats, unlike previous methods tailored to specific FP formats [12]. It features fast convergence toward the fixed point (L2-normalized vector) in the hyperspace within five iteration steps, and thus low operational latency.

The primary contributions of our work are as follows:

- We introduce a novel L2-normalization algorithm on a solid theoretical (rather than heuristic) ground with a full derivation.
- We share the results of in-depth analyses of (i) its precision, convergence rate, and latency for various data lengths in FP32/FP16/BFloat16 and (ii) LLM-level performance when it replaces the conventional layer normalization algorithm.
- We also share our digital implementation of the IterL2Norm macro with a detailed explanation and its power and area overheads in 32/28nm CMOS technology.

The rest of this paper is organized as follows. Sec. II explains the key theories of dynamic systems that prototype IterL2Norm. Sec. III elaborates IterL2Norm and the system initialization that ensures fast convergence. Sec. IV proposes the IterL2Norm macro architecture. Sec. V evaluates IterL2Norm and the macro, and Sec. VI compares our work with previous ones. Sec. VII concludes our work.

## II. PRELIMINARIES

**Theorem II.1.** Let  $\mathbf{y}$  and  $\tilde{\mathbf{y}}$  be vectors of the same length. Let  $k$  be a nonzero scalar value such that  $k = \mathbf{y} \cdot \tilde{\mathbf{y}}$ . Consider the following differential equation for  $\tilde{\mathbf{y}}$  for a given  $\mathbf{y}$ .

$$\tau \frac{d\tilde{\mathbf{y}}}{dt} = k\mathbf{y} - \alpha k^2 \tilde{\mathbf{y}}, \quad (1)$$

where  $\alpha$  is a positive constant. For a given  $\mathbf{y}$ ,  $\tilde{\mathbf{y}}$  is initialized to  $\tilde{\mathbf{y}}_0$  such that  $k = \mathbf{y} \cdot \tilde{\mathbf{y}}_0 > 0$ . The steady state solution to this differential equation ( $\tilde{\mathbf{y}}_\infty$ ) satisfies that  $\|\tilde{\mathbf{y}}_\infty\|_2^2 = \alpha^{-1}$  and  $\tilde{\mathbf{y}}_\infty = \alpha^{-1/2} \mathbf{y} / \|\mathbf{y}\|_2$ .

*Proof.* Given that  $k = \mathbf{y} \cdot \tilde{\mathbf{y}}$ , the inner product of each side of Eq. (1) and  $\mathbf{y}$  yields

$$\tau \left( \frac{dk}{dt} - \tilde{\mathbf{y}} \cdot \frac{d\mathbf{y}}{dt} \right) = \tau \frac{dk}{dt} = k \|\mathbf{y}\|_2^2 - \alpha k^3.$$

For a positive  $\alpha$ , this dynamic system holds one unstable fixed point ( $k = 0$ ) and two stable fixed points ( $k = \pm \alpha^{-1/2} \|\mathbf{y}\|_2$ ). Therefore, the steady state  $k$  ( $\tilde{\mathbf{y}}_\infty$ ) is determined by the initial  $k$  ( $k_0 = \mathbf{y} \cdot \tilde{\mathbf{y}}_0$ ) such that  $k_\infty = \alpha^{-1/2} \|\mathbf{y}\|_2$  if  $k_0 > 0$  and  $k_\infty = -\alpha^{-1/2} \|\mathbf{y}\|_2$  if  $k_0 < 0$ . Because we consider only  $\tilde{\mathbf{y}}_0$  that leads to  $k_0 > 0$ , we have

$$k_\infty = \alpha^{-1/2} \|\mathbf{y}\|_2. \quad (2)$$

The inner product of each side of Eq. (1) and  $\tilde{\mathbf{y}}$  yields

$$\frac{\tau}{2} \frac{d \|\tilde{\mathbf{y}}\|_2^2}{dt} = k^2 - \alpha k^2 \|\tilde{\mathbf{y}}\|_2^2.$$

In the steady state, the left-hand side is zero and  $k_\infty \neq 0$ , so that we have  $\|\tilde{\mathbf{y}}_\infty\|_2^2 = \alpha^{-1}$ . Additionally, setting the left-hand side of Eq. (1) to zero gives the steady state solution  $\tilde{\mathbf{y}}_\infty = \mathbf{y} / (\alpha k_\infty)$ . Using Eq. (2), we eventually have  $\tilde{\mathbf{y}}_\infty = \alpha^{-1/2} \mathbf{y} / \|\mathbf{y}\|_2$ .  $\square$
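The statement of Theorem II.1 can be checked numerically. The following is a minimal sketch, assuming a forward-Euler integrator with a hand-picked step size; the helper name `integrate_eq1`, the test vector, and the value of $\alpha$ are illustrative choices and do not come from the paper.

```python
import math

def integrate_eq1(y, y_tilde0, alpha, lam=0.05, steps=2000):
    """Forward-Euler integration of Eq. (1) with lam = dt / tau (hypothetical helper)."""
    y_tilde = list(y_tilde0)
    for _ in range(steps):
        k = sum(a * b for a, b in zip(y, y_tilde))           # k = y . y~
        y_tilde = [yt + lam * (k * yi - alpha * k * k * yt)  # right-hand side of Eq. (1)
                   for yi, yt in zip(y, y_tilde)]
    return y_tilde

y = [0.3, -0.4, 1.2]          # ||y||_2 = 1.3
y_tilde0 = [1.0, 0.0, 0.5]    # k_0 = 0.9 > 0, deliberately not parallel to y
alpha = 4.0

y_inf = integrate_eq1(y, y_tilde0, alpha)
norm_y = math.sqrt(sum(v * v for v in y))
expected = [v / (math.sqrt(alpha) * norm_y) for v in y]  # alpha^{-1/2} y / ||y||_2

print(y_inf)
print(expected)
print(sum(v * v for v in y_inf), 1.0 / alpha)  # squared norm vs. 1 / alpha
```

In this sketch the component of $\tilde{\mathbf{y}}$ orthogonal to $\mathbf{y}$ decays while $k$ settles at $\alpha^{-1/2}\|\mathbf{y}\|_2$, so the two printed vectors should agree to within floating-point tolerance.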

## III. PROPOSED METHOD

### A. Iterative normalization method

Layer normalization for a given vector  $\mathbf{x} \in \mathbb{R}^d$  involves the following sequential steps:

**Step 1:** Shifting the mean of  $\mathbf{x}$  ( $\bar{\mathbf{x}}$ ) to zero,  $\mathbf{y} \leftarrow \mathbf{x} - \bar{\mathbf{x}}$ ,  
**Step 2:** Normalizing  $\mathbf{y}$  by the standard deviation of  $\mathbf{y}$ , i.e.,  $\sigma_y$ ,  $\hat{\mathbf{y}} \leftarrow \mathbf{y} / \sigma_y$ ,

**Step 3:** Scaling and shifting  $\hat{\mathbf{y}}$ ,  $\mathbf{z} \leftarrow \gamma \cdot \hat{\mathbf{y}} + \beta$ .
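The three steps above can be sketched as a reference implementation. This is a minimal pure-Python sketch that follows the steps exactly as written, so it assumes the population standard deviation and no small stabilizing constant in the denominator; the function name `layer_norm` is chosen here for illustration.

```python
import math

def layer_norm(x, gamma, beta):
    """Reference layer normalization following Steps 1-3 (no epsilon term assumed)."""
    d = len(x)
    mean = sum(x) / d
    y = [v - mean for v in x]                         # Step 1: shift the mean to zero
    sigma_y = math.sqrt(sum(v * v for v in y) / d)    # population standard deviation of y
    y_hat = [v / sigma_y for v in y]                  # Step 2: the only step with a division
    return [g * v + b for g, v, b in zip(gamma, y_hat, beta)]  # Step 3: scale and shift

x = [1.0, 2.0, 3.0, 4.0]
z = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
print(z)
print(sum(v * v for v in z))  # equals d when gamma = 1 and beta = 0
```

With unit scale and zero shift, the output has zero mean and squared L2 norm equal to $d$, which is the property the iterative method has to reproduce.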

Because  $\sigma_y = d^{-1/2} \|\mathbf{y}\|_2$ , **Step 2** is expressed as  $\hat{\mathbf{y}} \leftarrow d^{1/2} \mathbf{y} / \|\mathbf{y}\|_2$ , which is equivalent to L2-normalizing  $\mathbf{y}$  and subsequently scaling it by  $d^{1/2}$ . **Step 2** is the only step involving division operations, which are computationally expensive. We can replace this costly step by the iterative method supported by **Theorem II.1**.

---

**Algorithm 1** IterL2Norm-based layer normalization.

---

**Input:** input vector  $\mathbf{x}$ ; update-rate  $\lambda$ ; max-tolerated error  $\delta_{\max}$ ; scale and shift parameters ( $\gamma$  and  $\beta$ )

**Output:** output vector  $\mathbf{z}$

**Initialization:**  $\Delta a \leftarrow \delta_0 (> \delta_{\max})$ ;  $a \leftarrow a_0$   
 $\bar{\mathbf{x}} \leftarrow d^{-1} \sum_{i=1}^d x_i$   
 $\mathbf{y} \leftarrow \mathbf{x} - \bar{\mathbf{x}}$   
 $m \leftarrow \|\mathbf{y}\|_2^2$

**while**  $\Delta a > \delta_{\max}$  **do**

$\Delta a \leftarrow \lambda m a (1 - m a^2)$

$a \leftarrow a + \Delta a$

**end**

$\hat{\mathbf{y}} \leftarrow d^{1/2} a \mathbf{y}$   
 $\mathbf{z} \leftarrow \gamma \hat{\mathbf{y}} + \beta$

---

**Theorem II.1** for  $\alpha = 1$  states that the following differential equation,

$$\tau \frac{d\tilde{\mathbf{y}}}{dt} = k\mathbf{y} - k^2 \tilde{\mathbf{y}}, \text{ where } k = \mathbf{y} \cdot \tilde{\mathbf{y}}, \quad (3)$$

has the steady state solution  $\tilde{\mathbf{y}}_\infty = \mathbf{y} / \|\mathbf{y}\|_2$ . Thus, we can evaluate  $\tilde{\mathbf{y}}$  for  $\mathbf{y}$  by solving Eq. (3). We solve Eq. (3) by approximating it to a recursive form using the Euler method as follows.

$$\tilde{\mathbf{y}}_{i+1} = (1 - \lambda k_i^2) \tilde{\mathbf{y}}_i + \lambda k_i \mathbf{y}, \quad (4)$$

where  $\lambda = \Delta t / \tau$ , and  $k_i = \mathbf{y} \cdot \tilde{\mathbf{y}}_i$ . Note that the subscript  $i$  for  $\tilde{\mathbf{y}}$  and  $k$  denotes the  $i$ th iteration step. The timestep width is denoted by  $\Delta t$ . The value  $\tilde{\mathbf{y}}_{i+1}$  in Eq. (4) is repeatedly calculated until it reaches its steady state, yielding  $\tilde{\mathbf{y}}_\infty$ . Provided that the initial vector  $\tilde{\mathbf{y}}_0$  is chosen parallel to  $\mathbf{y}$ , Eq. (4) keeps  $\tilde{\mathbf{y}}_i$  parallel to  $\mathbf{y}$  for all  $i$ , so that we replace  $\tilde{\mathbf{y}}_i$  by  $a_i \mathbf{y}$ , leading to  $k_i = a_i \|\mathbf{y}\|_2^2$ . Thus, Eq. (4) is converted into a simple scalar equation.

$$\Delta a = a_{i+1} - a_i = \lambda \|\mathbf{y}\|_2^2 a_i \left( 1 - \|\mathbf{y}\|_2^2 a_i^2 \right), \quad (5)$$

which asymptotically converges towards  $a_\infty = 1 / \|\mathbf{y}\|_2$  with a positive  $a_0$  and sufficiently small  $\lambda$ . We refer to this L2-normalization method with the replacement of **Step 2** by this iterative normalization as IterL2Norm. The pseudocode for IterL2Norm-based layer normalization is shown in **Algorithm 1**.
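A software rendering of Algorithm 1 may make the data flow concrete. The sketch below is an assumption-laden illustration rather than the authors' implementation: it uses a fixed number of iteration steps instead of the tolerance test, takes $a_0$ and $\lambda$ as plain arguments, and runs in Python double precision. The function name `iter_l2norm_layer_norm` is hypothetical.

```python
import math

def iter_l2norm_layer_norm(x, gamma, beta, lam, a0, n_steps=5):
    """Layer normalization with Step 2 replaced by the scalar iteration of Eq. (5)."""
    d = len(x)
    mean = sum(x) / d
    y = [v - mean for v in x]
    m = sum(v * v for v in y)                    # m = ||y||_2^2, evaluated once
    a = a0
    for _ in range(n_steps):
        a += lam * m * a * (1.0 - m * a * a)     # Eq. (5): no division, no square root
    scale = math.sqrt(d) * a                     # d^{1/2} is a constant for a given d
    return [g * scale * v + b for g, v, b in zip(gamma, y, beta)]

x = [1.0, 2.0, 3.0, 4.0]                         # m = 5, so a_inf = 5 ** -0.5
z_iter = iter_l2norm_layer_norm(x, [1.0] * 4, [0.0] * 4, lam=0.1, a0=0.4, n_steps=10)
z_ref = [math.sqrt(4) * (v - 2.5) / math.sqrt(5.0) for v in x]
print(z_iter)
print(z_ref)
```

The only non-multiplicative constant is $d^{1/2}$, which depends on the vector length alone and can therefore be prepared ahead of time.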

### B. Initialization and update-rate setting

Each additional iteration step of IterL2Norm increases the wall-clock time. To reduce the number of steps, we should use an initial value  $a_0$  close to  $a_\infty (= 1 / \|\mathbf{y}\|_2)$  and an update-rate  $\lambda$  that is (i) sufficiently large for  $a_\infty$  to be attained within the minimum number of iteration steps (fast convergence) but (ii) sufficiently small to avoid an intolerable approximation error.

**Initialization of  $a$ :** Let  $m$  be  $\|\mathbf{y}\|_2^2$ , which is once evaluated for IterL2Norm as shown in **Algorithm 1**. We initialize  $a$  using the exponent of  $m$ ,  $E(m)$ , as follows.

$$a_0 = 2^{-(E(m) - \text{bias} + 1)/2}, \quad (6)$$



Fig. 1. (a) Architecture of the IterL2Norm macro. (b) Data organization in the Input buffer. (c) Block diagram of the Add block equipped with a total of nine 8-input adder trees.

where bias depends on the data format, e.g., bias = 127 for FP32 and BFloat16 and bias = 15 for FP16. Because  $a_\infty = m^{-1/2}$ , we can express  $a_\infty$  as follows.

$$a_\infty = \text{Significand}(m)^{-1/2} \cdot 2^{-(E(m)-\text{bias})/2},$$

where  $\text{Significand}(m)$  denotes the significand of  $m$ , which satisfies  $1 \leq \text{Significand}(m) < 2$ . Therefore, we have  $2^{-1/2} \leq a_0/a_\infty < 1$ , i.e., approximately  $0.7 < a_0/a_\infty < 1$ , implying that  $a_0$  already lies within 30% of  $a_\infty$ . Further, the evaluation of  $a_0$  involves one addition, one subtraction, and one bit-shift operation only.
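The initialization of Eq. (6) can be illustrated in software by reading the exponent field directly from the bit pattern. The sketch below assumes FP32 with normal (non-zero, non-subnormal) values of $m$ and evaluates the power of two with floating-point arithmetic; how the hardware realizes the half-integer exponent is not modeled. The helper names are hypothetical.

```python
import math
import struct

FP32_BIAS = 127

def fp32_exponent_field(m):
    """Return the 8-bit biased exponent E(m) of m after rounding to FP32."""
    bits = struct.unpack("<I", struct.pack("<f", m))[0]
    return (bits >> 23) & 0xFF

def initial_a(m, bias=FP32_BIAS):
    """Eq. (6): a_0 = 2^{-(E(m) - bias + 1) / 2}."""
    e = fp32_exponent_field(m)
    return 2.0 ** (-(e - bias + 1) / 2.0)

samples = [1e-3, 0.3, 1.0, 1.999, 5.0, 341.3]
ratios = [initial_a(m) * math.sqrt(m) for m in samples]   # a_0 / a_inf
for m, r in zip(samples, ratios):
    print(m, r)
```

Each printed ratio equals $(\text{Significand}(m)/2)^{1/2}$ and therefore falls between $2^{-1/2}$ and 1, in line with the bound stated above.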

**Update-rate  $\lambda$ :** Eq. (5) can be expressed as the following differential equation.

$$\tau \frac{da}{dt} = -m^2 a (a^2 - 1/m), \quad (7)$$

where  $m = \|\mathbf{y}\|_2^2$ . Eq. (7) has the following analytical solution.

$$a = a_0 \left[ (1 - m a_0^2) e^{-2mt/\tau} + m a_0^2 \right]^{-1/2}. \quad (8)$$
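As a check on Eq. (8), one possible derivation is sketched here. The auxiliary variable  $v = a^{-2}$  is introduced only for this illustration and does not appear in the original text. Substituting it into Eq. (7) yields a linear equation:

$$\tau \frac{dv}{dt} = -2a^{-3}\, \tau \frac{da}{dt} = 2m^2 a^{-2} \left( a^2 - \frac{1}{m} \right) = -2m (v - m).$$

Its solution with  $v_0 = a_0^{-2}$  is

$$v - m = (v_0 - m)\, e^{-2mt/\tau},$$

and converting back to  $a = v^{-1/2}$  gives

$$a = \left[ m + \left( a_0^{-2} - m \right) e^{-2mt/\tau} \right]^{-1/2} = a_0 \left[ (1 - m a_0^2) e^{-2mt/\tau} + m a_0^2 \right]^{-1/2},$$

which agrees with Eq. (8) and tends to  $m^{-1/2}$  as  $t \to \infty$ .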

Because we consider discrete iteration steps, we replace  $t$  in Eq. (8) by  $n\Delta t$  with non-negative integer  $n$  that indicates the iteration step index. Subsequently, by introducing  $\lambda (= \Delta t/\tau)$ , we have

$$a = a_0 \left[ (1 - m a_0^2) e^{-2mn\lambda} + m a_0^2 \right]^{-1/2}. \quad (9)$$

The convergence rate is determined by the exponential term on the right-hand side of Eq. (9). For fast convergence, the exponential term should fall below a tolerable error value  $\delta_c \sim 0$  within a few iteration steps  $n_c$ , i.e.,  $e^{-2mn_c\lambda} < \delta_c$ , leading to the inequality  $\lambda > -(2mn_c)^{-1} \ln \delta_c$ . We set  $\delta_c$  and  $n_c$  to  $10^{-3}$  and 5, respectively, so that we have  $\lambda > 0.69m^{-1}$ . However, calculating  $m^{-1}$  directly needs a division operation, which IterL2Norm avoids. Because the exponent of  $m$ , i.e.,  $E(m)$ , is known, the range of  $m^{-1}$  is readily available:  $0.5 \cdot 2^{-(E(m)-\text{bias})} < m^{-1} \leq 2^{-(E(m)-\text{bias})}$ . Therefore, using the lower bound of this range, we approximate the condition on  $\lambda$  for  $a$  to converge within  $n_c$  (= 5) iteration steps as follows.

$$\lambda > 0.345 \cdot 2^{-(E(m)-\text{bias})}. \quad (10)$$



Fig. 2. Architecture of (a) the initialize and (b) the update modules in the iteration controller.

This calculation needs one subtraction and one multiplication operation only.
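The constants in the update-rate condition can be reproduced with a few lines of arithmetic. The sketch below assumes FP32 and sets $\lambda$ exactly at the bound of Eq. (10), which is only one admissible choice; the function names are hypothetical.

```python
import math
import struct

def lambda_coefficient(delta_c=1e-3, n_c=5):
    """Coefficient c in lambda > c / m, obtained from exp(-2 m n_c lambda) < delta_c."""
    return -math.log(delta_c) / (2.0 * n_c)

def update_rate(m, bias=127, delta_c=1e-3, n_c=5):
    """Division-free bound of Eq. (10), using only the FP32 exponent field of m."""
    bits = struct.unpack("<I", struct.pack("<f", m))[0]
    e = (bits >> 23) & 0xFF
    # 0.5 is the lower end of the range of m^{-1} relative to 2^{-(E(m) - bias)}
    return 0.5 * lambda_coefficient(delta_c, n_c) * 2.0 ** (-(e - bias))

print(lambda_coefficient())          # about 0.69
print(0.5 * lambda_coefficient())    # about 0.345
for m in [1.0, 1.999, 5.0, 341.3]:
    print(m, update_rate(m), update_rate(m) * m)   # lambda * m is about 0.345 to 0.69
```

With this choice the product $\lambda m$ stays between roughly 0.345 and 0.69 for any normal $m$, so the effective step of the scalar iteration is bounded independently of the magnitude of the input.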

## IV. ITERL2NORM MACRO DESIGN

The IterL2Norm macro implements the IterL2Norm-based layer normalization algorithm for a  $d$ -long input vector  $\mathbf{x} = [x_0, x_1, \dots, x_{d-1}]$  with scale parameters  $\gamma = [\gamma_0, \gamma_1, \dots, \gamma_{d-1}]$  and shift parameters  $\beta = [\beta_0, \beta_1, \dots, \beta_{d-1}]$ , and outputs the layer-normalized vector  $\mathbf{z} = [z_0, z_1, \dots, z_{d-1}]$ . Fig. 1a shows a block diagram of the proposed IterL2Norm macro. The Input buffer of eight parallel banks ( $n_b = 8$ ) buffers a  $d$ -long input vector, and thus the input length  $d$  is limited by the buffer size (Fig. 1b). Because each bank can store  $16 \times 8$  input elements ( $h_b = 16$  and  $w_b = 8$ ), the IterL2Norm macro can handle at most  $d = 1024$  for a single input, i.e.,  $d_{\max} = 1024$ . A  $d$ -long input vector  $\mathbf{x}$  is distributed over the banks such that row  $i$  of bank  $b$  stores  $x[w_b(b+n_b i) : w_b(b+n_b i+1) - 1]$ , as illustrated in Fig. 1b. Because the eight parallel banks share a read pointer,  $x[n_b w_b i : n_b w_b (i+1) - 1]$  is read at a time. For  $d < d_{\max}$ , multiple ( $\lfloor d_{\max}/d \rfloor$ ) input vectors can instead be buffered and sequentially normalized. Note that, to maintain  $d_{\max} = 1024$  for FP32, FP16, and BFloat16, the IterL2Norm macro for FP32 uses an Input buffer twice as large as that for FP16 and BFloat16. Additionally, the Mul and Add blocks are tailored to each data format by using format-specific multipliers and adders, but with the same latency of two clock cycles.
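The bank interleaving described above amounts to a simple index mapping. The following sketch models it in software under the stated parameters ( $n_b = 8$ ,  $h_b = 16$ ,  $w_b = 8$ ); the function names `locate` and `read_row` are illustrative and not part of the macro.

```python
N_B, H_B, W_B = 8, 16, 8   # banks, rows per bank, elements per row

def locate(idx):
    """Map element index idx of x to (bank b, row i, column) in the Input buffer."""
    row, offset = divmod(idx, N_B * W_B)   # all banks share the same row index i
    bank, col = divmod(offset, W_B)
    return bank, row, col

def read_row(x, i):
    """Elements delivered in one access when the shared read pointer selects row i."""
    return x[N_B * W_B * i : N_B * W_B * (i + 1)]

x = list(range(N_B * H_B * W_B))            # d = d_max = 1024
print(locate(0), locate(63), locate(64), locate(1023))
print(read_row(x, 1)[:3], len(read_row(x, 1)))
```

For every bank $b$ and row $i$, the first element stored is $x[w_b(b + n_b i)]$, so one shared read pointer value yields 64 consecutive input elements.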

This macro normalizes the input vector using the following sequence.

**Initialization:** The macro is initialized with input length  $d$  and number of input vectors.

**Data loading:** The Input,  $\gamma$ , and  $\beta$  buffers are loaded with input vector(s), and scale and shift parameters, respectively, through the input channels. This is controlled by the input and main controllers.

**Mean-shift:** The  $\bar{x}$  controller retrieves the input vector from the Input buffer to calculate its element-wise sum in the Add block. The Add block is equipped with eight 8-input L<sub>1</sub> adder trees and one 8-input L<sub>2</sub> adder tree, which together add 64 input elements at a time to yield the sum of a partial input (Fig. 1c). This sum is buffered in the Partial sum buffer alongside the sums of the previous partial inputs. This is repeated  $\lceil d/64 \rceil$  times to collect a total of  $\lceil d/64 \rceil$  partial sums in the Partial sum buffer. They are then sent to the Add block to acquire the sum of the whole input vector. The sum is subsequently multiplied by  $d^{-1}$  (pre-stored in the memory), eventually outputting the mean  $\bar{x}$ . The Shift controller then shifts the mean of  $x$  to zero by subtracting  $\bar{x}$  from  $x$  and writes the mean-shifted vector,  $y = x - \bar{x}$ , back into the Input buffer.

**Inner product of  $y$  with itself:** The  $m$  controller reads the mean-shifted vector  $y$  from the Input buffer and sends it to the Mul block (equipped with 64 multipliers). The resulting vector is sent to the Add block that outputs the inner product of a partial vector of  $y$ . This result is buffered in the Partial sum buffer. This is repeated  $\lceil d/64 \rceil$  times to calculate  $m = \|y\|_2^2$ .

**Iteration:** The Iteration controller initializes  $a_0$  using Eq. (6) and sets the update rate  $\lambda$  using Eq. (10) (Fig. 2a). It then iteratively updates  $a$  using Eq. (5) to attain its steady-state value  $a_\infty$  (Fig. 2b). The number of iteration steps  $n_c$  is a programmable variable.

**Output:** The Output controller reads the mean-shifted vector  $y$  from the Input buffer and sends it to the Mul block together with the product of  $a_\infty$  and the pre-stored  $d^{1/2}$  to obtain the L2-normalization result  $\hat{y}$ . This vector is sent again to the Mul block with the scale parameters buffered in the  $\gamma$  buffer, and the scaled vector is sent to the Add block with the shift parameters buffered in the  $\beta$  buffer, finally yielding the layer-normalization result  $z$  for a given input  $x$ .
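The chunked reductions used in the mean-shift and inner-product steps above can be modeled in software. The sketch below is a behavioral illustration only: zero-padding of an incomplete final chunk is an assumption made here, timing and FP rounding of the hardware are not modeled, and the function names are hypothetical.

```python
import math

CHUNK = 64   # elements per pass: eight banks times eight elements

def adder_tree_sum(values):
    """Sum up to 64 values with eight 8-input L1 trees feeding one 8-input L2 tree."""
    assert len(values) <= CHUNK
    padded = list(values) + [0.0] * (CHUNK - len(values))    # assumed zero padding
    l1 = [sum(padded[8 * t : 8 * t + 8]) for t in range(8)]  # eight L1 trees
    return sum(l1)                                           # one L2 tree

def chunked_reduce(values, square=False):
    """Collect ceil(d / 64) partial sums, then add them in the same Add block."""
    partial_sums = []
    for start in range(0, len(values), CHUNK):
        chunk = values[start : start + CHUNK]
        if square:
            chunk = [v * v for v in chunk]                   # Mul block (64 multipliers)
        partial_sums.append(adder_tree_sum(chunk))           # Partial sum buffer entry
    return adder_tree_sum(partial_sums)                      # at most 16 entries for d <= 1024

x = [math.sin(i) for i in range(100)]
mean = chunked_reduce(x) * (1.0 / len(x))                    # d^{-1} is pre-stored
y = [v - mean for v in x]
m = chunked_reduce(y, square=True)                           # m = ||y||_2^2
print(mean, m)
```

For $d = 100$ this performs two passes per reduction, matching $\lceil d/64 \rceil = 2$.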

## V. EVALUATION

### A. Computational precision and convergence rate

We evaluated the performance of the IterL2Norm macro implemented in a Xilinx Virtex-7 FPGA. We applied the IterL2Norm-based normalization to random vectors of different lengths ( $64 \leq d \leq 1024$ ) in FP32/FP16/BFloat16. For each length and each data format, we used 1,000 random vectors sampled from a uniform distribution in the range  $(-1, 1)$  as input vectors. The number of iteration steps was set to 5. We used the absolute deviation of our results from the ground truth (absolute error) as a measure of computational precision. The ground truth values were calculated by applying the layer-normalization function in PyTorch (1.12.1) [13] to the same random vectors using a CPU. Fig. 3 shows the evaluation results. The average (maximum) absolute errors for FP32, FP16, and BFloat16 are  $2.23 \times 10^{-4}$  ( $5.0 \times 10^{-1}$ ),  $5.26 \times 10^{-4}$  ( $4.9 \times 10^{-1}$ ), and  $3.07 \times 10^{-3}$  ( $6.8 \times 10^{-1}$ ), respectively. The maximum-error cases occurred only rarely, as shown in the histograms in Fig. 3.

Fig. 3. Approximation precision of IterL2Norm for various input lengths  $d$  in (a) FP32, (b) FP16, and (c) BFloat16. The insets show the distribution of errors for  $d = 384$  over 1,000 input vectors.

TABLE I  
PRECISION COMPARISON BETWEEN ITERL2NORM AND FISR (AVG/MAX ABSOLUTE ERROR; BOLD MARKS THE CASES WHERE ITERL2NORM HAS THE LOWER AVERAGE ERROR)

| Input length | FP32, IterL2Norm ( $\times 10^{-4}$ ) | FP32, FISR ( $\times 10^{-4}$ ) | BFloat16, IterL2Norm ( $\times 10^{-3}$ ) | BFloat16, FISR ( $\times 10^{-3}$ ) |
|--------------|---------------------------------------|---------------------------------|-------------------------------------------|-------------------------------------|
| 768          | <b>0.132/29.35</b>                    | 4.124/101.6                     | <b>2.195/125.0</b>                        | 2.294/125.0                         |
| 1024         | <b>1.987/91.76</b>                    | 3.104/61.21                     | 2.243/125.0                               | 2.235/125.0                         |
| 2048         | 61.76/3699.0                          | 1.544/37.69                     | 7.423/312.5                               | 2.142/125.0                         |
| 2560         | <b>0.030/0.658</b>                    | 1.232/25.67                     | <b>2.069/125.0</b>                        | 2.137/125.0                         |
| 4096         | 1.516/94.21                           | 0.767/16.90                     | <b>2.129/125.0</b>                        | 2.154/125.0                         |
| 5120         | <b>0.032/0.782</b>                    | 0.613/14.97                     | <b>2.008/125.0</b>                        | 2.124/125.0                         |
| 7168         | 20.61/467.0                           | 0.435/8.831                     | 2.456/187.5                               | 2.109/125.0                         |
| 9216         | <b>0.203/14.98</b>                    | 0.337/8.736                     | 2.160/125.0                               | 2.129/125.0                         |
| 12288        | <b>0.015/1.831</b>                    | 0.251/5.846                     | <b>2.070/125.0</b>                        | 2.185/125.0                         |

LLMs use long embedding vectors, as seen in the OPT models, with the largest (OPT-175B) utilizing 12,288-dimensional embeddings [14]. We further evaluated the precision of IterL2Norm for the embedding lengths used in the OPT models ( $768 \leq d \leq 12,288$ ) and compared it with the layer normalization method based on the fast inverse square root (FISR) algorithm [12]. Since FISR is designed for FP formats with an 8b exponent, we limit our comparison to FP32 and BFloat16. The results are shown in Table I. In FP32, IterL2Norm outperforms the FISR-based method in terms of average precision in six out of nine cases, while in BFloat16 it does so in five out of nine cases.
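For readers unfamiliar with the baseline, the widely known FP32 form of the fast inverse square root is sketched below. This is the classic bit-manipulation routine with the well-known constant `0x5F3759DF` and a single Newton-Raphson step; the variant used in [12] may differ in its constant, number of refinement steps, or arithmetic precision, and the Newton step here runs in Python double precision rather than FP32.

```python
import math
import struct

def fisr(x):
    """Classic FP32 fast inverse square root with one Newton-Raphson refinement."""
    i = struct.unpack("<I", struct.pack("<f", x))[0]   # reinterpret the FP32 bits as an integer
    i = 0x5F3759DF - (i >> 1)                          # initial guess from exponent arithmetic
    y = struct.unpack("<f", struct.pack("<I", i))[0]   # reinterpret back as FP32
    return y * (1.5 - 0.5 * x * y * y)                 # one Newton-Raphson step

for x in [0.01, 0.5, 1.0, 2.0, 3.7, 100.0, 12288.0]:
    print(x, fisr(x), 1.0 / math.sqrt(x))
```

The integer subtraction relies on the 8-bit exponent layout of FP32, which is consistent with the comparison being limited to formats with an 8b exponent.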

To identify the convergence rate, we measured the average absolute error while varying the number of iteration steps for  $d = 1024$  in FP32, FP16, and BFloat16. Fig. 4 plots the evaluation results, where each data point was acquired from 1,000 trials. IterL2Norm in FP16 and BFloat16 ensures fast convergence within five iteration steps, while that in FP32 needs a few additional iteration steps until convergence. Nevertheless, the FP32 error after five steps is close to its steady state error and far below the steady state errors for FP16 and BFloat16. Note that, in all formats, these errors after five steps may be sufficiently low to avoid LLM-level performance degradation on some text generation tasks, as addressed in Sec. V-D.

Fig. 4. Average absolute errors of IterL2Norm in FP32, FP16, and BFloat16 with the number of iteration steps.

Fig. 5. Measured latency of IterL2Norm (five iteration steps) with input length  $d$ .

### B. Operational latency

We evaluated the IterL2Norm macro latency by varying the input length  $d$ . The results are plotted in Fig. 5. Note that the latency does not depend on the data format. The latency scales with the number of chunks  $\lceil d / (n_b w_b) \rceil$  of the input length  $d$  because the major steps addressed in Sec. IV, namely the mean calculation, mean-shift operation, inner product of  $y$  with itself, and scale and shift operations, work on a chunk of  $n_b w_b (= 64)$  input elements at a time.

### C. IterL2Norm macro in 32/28nm CMOS

We finally synthesized the IterL2Norm macro for each of FP32/FP16/BFloat16 using the Synopsys SAED 32/28nm technology PDK with a supply voltage of 1.05V and a clock frequency  $f_{\text{clk}}$  of 100MHz. We used Design Compiler V-2023.12-SP5. Note that the configuration of the Input buffer banks for all formats follows the generic architecture (eight Input buffer banks, each of which stores  $16 \times 8$  input elements) explained in Sec. IV. The synthesis results are summarized in Table II. Because the number of buffered elements is the same for all formats, the macro for FP32 needs on-chip memory (96.5 kib, i.e., kibibits) twice as large as that (48.3 kib) for FP16 and BFloat16. For FP32, each of the Input,  $\gamma$ , and  $\beta$  buffers uses 32 kib to store at most 1024 elements, and the partial sum buffer uses 0.5 kib to store at most 16 partial sums. For FP16 and BFloat16, the memory usage is half that for FP32: the Input,  $\gamma$ , and  $\beta$  buffers use 48 kib in total, and the partial sum buffer uses 0.25 kib.

TABLE II  
SYNTHESIS RESULTS FOR THE ITERL2NORM MACROS IN FP32, FP16, AND BFLOAT16

| Format   | Memory size | # cells | Area                                   | Power   |
|----------|-------------|---------|----------------------------------------|---------|
| FP32     | 96.5 kib    | 269.3k  | 2.4 (1.7) <sup>†</sup> mm <sup>2</sup> | 22.9 mW |
| FP16     | 48.3 kib    | 100.1k  | 1.1 (0.8) <sup>†</sup> mm <sup>2</sup> | 8.4 mW  |
| BFloat16 | 48.3 kib    | 87.0k   | 1.0 (0.8) <sup>†</sup> mm <sup>2</sup> | 7.3 mW  |

<sup>†</sup>: Area without the Add and Mul blocks.

Fig. 6. Area breakdowns for the IterL2Norm macro for (a) FP32, (b) FP16, and (c) BFloat16. Power breakdowns for (d) FP32, (e) FP16, and (f) BFloat16.
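The memory sizes quoted for Table II follow from simple bookkeeping, which the sketch below reproduces. It assumes that only the Input, $\gamma$, and $\beta$ buffers (1024 elements each) and the 16-entry partial sum buffer contribute to the total; the function name is hypothetical.

```python
def macro_memory_kibibits(bits_per_element, d_max=1024, n_partial=16):
    """On-chip memory in kibibits for a given element width."""
    main_buffers = 3 * d_max * bits_per_element    # Input, gamma, and beta buffers
    partial_sum = n_partial * bits_per_element     # partial sum buffer
    return (main_buffers + partial_sum) / 1024

print(macro_memory_kibibits(32))   # FP32
print(macro_memory_kibibits(16))   # FP16 and BFloat16
```

The FP32 result is 96.5 kibibits and the 16-bit result is 48.25 kibibits, which is reported as 48.3 after rounding.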

The number of standard cells used is primarily determined by the FP multipliers and adders. Among the three formats, the FP32 multiplier and adder require the most standard cells due to their larger numbers of exponent and mantissa bits. The BFloat16 multiplier and adder require fewer standard cells than those for FP16 because of their smaller number of mantissa bits subject to multiplication and addition.

As shown in Table II, the macro areas for FP32, FP16, and BFloat16 are 2.4, 1.1, and 1.0 mm<sup>2</sup>, respectively. The area breakdown for each format is shown in Figs.6a-c. For all formats, the memory (Input/ $\gamma/\beta$  and partial sum buffers) occupies the largest area in the macro, which is followed by the logic area including FP multipliers and adders. Although we considered FP multipliers and adders dedicated solely to IterL2Norm, IterL2Norm can use them in the MatMul block co-integrated on the same die. Therefore, the actual area of

TABLE III  
COMPARISON BETWEEN THE ITERL2NORM MACRO AND PREVIOUS IMPLEMENTATIONS OF LAYER NORMALIZATION

|             | <b>Implementation</b> | <b>Method</b>                           | <b>Operations</b>                   | <b>Data format</b>                | <b>Area</b>                                                                                                                   | <b>Power</b>                         | <b>Clock frequency</b> |
|-------------|-----------------------|-----------------------------------------|-------------------------------------|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------|--------------------------------------|------------------------|
| [8]         | 65nm CMOS             | approximate SQRT                        | addition, division, bit shift       | INT32                             | 68.3 mm <sup>2</sup>                                                                                                          | 2.0 W                                | 143 MHz                |
| [9]         | 7nm CMOS              | approximate 1/SQRT                      | multiplication, addition            | INT32<br>FP32<br>FP16             | 1008.9 μm <sup>2</sup><br>1136.6 μm <sup>2</sup><br>498.4 μm <sup>2</sup>                                                     | 59.1 μW<br>43.7 μW<br>25.0 μW        | -                      |
| [10]        | 28nm CMOS             | FISR                                    | multiplication, addition, bit shift | BFloat16                          | -                                                                                                                             | -                                    | 1 GHz                  |
| [11]        | 28nm CMOS             | layer normalization w/ dynamic compress | multiplication, addition, bit shift | INT8                              | -                                                                                                                             | -                                    | 1 GHz                  |
| <b>Ours</b> | <b>32/28nm CMOS</b>   | <b>IterL2Norm</b>                       | <b>multiplication, addition</b>     | <b>FP32<br/>FP16<br/>BFloat16</b> | <b>2.4 (1.7)<sup>†</sup> mm<sup>2</sup><br/>1.1 (0.8)<sup>†</sup> mm<sup>2</sup><br/>1.0 (0.8)<sup>†</sup> mm<sup>2</sup></b> | <b>22.9 mW<br/>8.4 mW<br/>7.3 mW</b> | <b>100 MHz</b>         |

†: Area without the Add and Mul blocks.

TABLE IV  
LLM-LEVEL EVALUATION OF ITERL2NORM USING OPT-125M AND OPT-350M ON TWO TEXT GENERATION TASKS

| Task       | Format   | OPT-125M |         |               | OPT-350M |         |               |
|------------|----------|----------|---------|---------------|----------|---------|---------------|
|            |          | Baseline | # steps | Perplexity    | Baseline | # steps | Perplexity    |
| Wikitext-2 | FP32     | 18.21    | 3       | 18.37 (+0.16) | 15.28    | 3       | 15.28 (+0.00) |
|            |          |          | 4       | 18.22 (+0.01) |          | 4       | 15.28 (+0.00) |
|            |          |          | 5       | 18.21 (+0.00) |          | 5       | 15.28 (+0.00) |
|            |          |          | 10      | 18.21 (+0.00) |          | 10      | 15.28 (+0.00) |
|            | FP16     | 25.35    | 3       | 25.51 (+0.16) | 27.57    | 3       | 27.57 (+0.00) |
|            |          |          | 4       | 25.35 (+0.00) |          | 4       | 27.57 (+0.00) |
|            |          |          | 5       | 25.35 (+0.00) |          | 5       | 27.57 (+0.00) |
|            |          |          | 10      | 25.35 (+0.00) |          | 10      | 27.57 (+0.00) |
|            | BFloat16 | 19.17    | 3       | 19.43 (+0.26) | 15.43    | 3       | 15.44 (+0.01) |
|            |          |          | 4       | 19.20 (+0.03) |          | 4       | 15.43 (+0.00) |
|            |          |          | 5       | 19.20 (+0.03) |          | 5       | 15.43 (+0.00) |
|            |          |          | 10      | 19.17 (+0.00) |          | 10      | 15.43 (+0.00) |
| BST        | FP32     | 17.30    | 3       | 17.36 (+0.06) | 15.41    | 3       | 15.41 (+0.00) |
|            |          |          | 4       | 17.31 (+0.01) |          | 4       | 15.41 (+0.00) |
|            |          |          | 5       | 17.30 (+0.00) |          | 5       | 15.41 (+0.00) |
|            |          |          | 10      | 17.30 (+0.00) |          | 10      | 15.41 (+0.00) |
|            | FP16     | 19.61    | 3       | 19.67 (+0.16) | 21.94    | 3       | 21.95 (+0.01) |
|            |          |          | 4       | 19.61 (+0.00) |          | 4       | 21.94 (+0.00) |
|            |          |          | 5       | 19.61 (+0.00) |          | 5       | 21.94 (+0.00) |
|            |          |          | 10      | 19.61 (+0.00) |          | 10      | 21.94 (+0.00) |
|            | BFloat16 | 17.83    | 3       | 17.91 (+0.08) | 15.49    | 3       | 15.49 (+0.00) |
|            |          |          | 4       | 17.84 (+0.01) |          | 4       | 15.49 (+0.00) |
|            |          |          | 5       | 17.84 (+0.01) |          | 5       | 15.49 (+0.00) |
|            |          |          | 10      | 17.83 (+0.00) |          | 10      | 15.49 (+0.00) |

the IterL2Norm macro likely excludes the multiplier and adder areas, which is also listed in Table II. The operational power is also primarily determined by the FP multipliers and adders, resulting in the highest (lowest) power consumption for FP32 (BFloat16), as shown in Table II and in the power breakdown for each format in Figs. 6d-f.

#### D. LLM-level evaluation

We evaluated IterL2Norm at the LLM level by implementing it as a PyTorch module. The precision of IterL2Norm in PyTorch differs negligibly from that on the FPGA. We considered the Open Pre-trained Transformer (OPT) models with 125M (OPT-125M) and 350M (OPT-350M) parameters [14] for text generation. The OPT models are decoder-only models consisting of stacks of 12 and 24 transformer blocks, each employing 12 and 16 attention heads with embedding sizes of 768 and 1024, respectively. We used two text generation datasets: WikiText-2 [15] and Blended Skill Talk (BST) [16]. All layer normalization blocks in the pre-trained OPT-125M and OPT-350M models for each dataset were replaced with IterL2Norm, and we measured the change in perplexity for text generation as the LLM-level error of IterL2Norm. The IterL2Norm module takes the number of iteration steps  $n_{\text{iter}}$  as a parameter. The perplexity scores for different numbers of iteration steps in FP32, FP16, and BFloat16 are listed in Table IV. Compared with the baseline, the perplexity on both WikiText-2 and BST increases only marginally even with three iteration steps (by at most 0.26, for OPT-125M in BFloat16 on WikiText-2). With five steps the increase is at most 0.03, and with ten steps it is zero to two decimal places in every case.
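To make the role of  $n_{\text{iter}}$  concrete, the following minimal Python sketch layer-normalizes a 1D input with an iterative inverse square root and compares the result with an exact reference. It is an illustration only. The update rule is the standard Newton iteration for $1/\sqrt{m}$, used here as a stand-in rather than the IterL2Norm update. The function names (`rsqrt_newton`, `iter_layer_norm`, `layer_norm_reference`, `max_abs_error`), the power-of-two initial guess, and the omission of the affine parameters and of the stabilizing $\epsilon$ are assumptions of this sketch. It runs in double precision in plain Python, not in the PyTorch module used for Table IV.

```python
import math


def rsqrt_newton(m, n_iter):
    """Approximate 1/sqrt(m) for m > 0 with n_iter Newton steps (stand-in update rule)."""
    # Power-of-two initial guess taken from the exponent of m,
    # chosen so that a * a * m lies in [0.25, 1), where Newton's iteration converges.
    _, e = math.frexp(m)
    a = math.ldexp(1.0, -((e + 1) // 2))
    for _ in range(n_iter):
        a = a * (1.5 - 0.5 * m * a * a)  # multiplications and additions only
    return a


def iter_layer_norm(x, n_iter):
    """Layer normalization of a 1D list without affine parameters or epsilon."""
    d = len(x)
    mean = sum(x) / d                    # d is a constant per embedding length
    y = [v - mean for v in x]
    m = sum(v * v for v in y)            # squared L2 norm of the centered input
    # y / std equals sqrt(d) * y / ||y||, so only 1/sqrt(m) must be approximated.
    scale = math.sqrt(d) * rsqrt_newton(m, n_iter)
    return [v * scale for v in y]


def layer_norm_reference(x):
    """Exact layer normalization using a true square root and division."""
    d = len(x)
    mean = sum(x) / d
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / d)
    return [(v - mean) / std for v in x]


def max_abs_error(x, n_iter):
    """Largest elementwise deviation of the iterative result from the reference."""
    ref = layer_norm_reference(x)
    return max(abs(a - b) for a, b in zip(iter_layer_norm(x, n_iter), ref))


if __name__ == "__main__":
    sample = [0.5, -1.25, 3.0, 2.0, -0.75, 1.5, 0.0, -2.0]
    for steps in (3, 4, 5, 10):
        print(steps, max_abs_error(sample, steps))
```

In this sketch the error shrinks as the number of steps grows, which is the qualitative trend behind the perplexity columns. The absolute numbers are not comparable with Table IV, since the update rule, the initial guess, and the number format all differ.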

#### VI. RELATED WORK AND COMPARISON

Our IterL2Norm macros designed for FP32, FP16, and BFloat16 are compared with previous implementations of layer normalization in Table III. The method in [8] realizes the layer normalization of INT32 vectors using integer-only arithmetic. To this end, the authors adopted an iterative algorithm [17] that approximates square root values. Because the algorithm yields the square root rather than its inverse, the method requires additional division operations to normalize the input integer vector. When implemented in 65nm CMOS, the area and power overheads of the circuit are 68.3 mm<sup>2</sup> and 2.0 W, respectively.
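The integer-only approach can be illustrated with a short Python sketch. It uses the well-known integer Newton iteration for $\lfloor\sqrt{n}\rfloor$, which we assume is representative of the kind of algorithm referred to in [17]; it is not taken from the implementation in [8]. The function names and the fixed-point output format with `frac_bits` fractional bits are hypothetical choices made for this example.

```python
def isqrt_newton(n):
    """Floor of sqrt(n) for an integer n >= 0, using integer operations only."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0
    # Power-of-two starting point that is guaranteed to be >= sqrt(n).
    x = 1 << ((n.bit_length() + 1) // 2)
    while True:
        y = (x + n // x) >> 1   # one Newton step with integer division
        if y >= x:              # the sequence stopped decreasing: x is the floor
            return x
        x = y


def int_layer_norm(q, frac_bits=8):
    """Integer layer normalization returning fixed-point values (hypothetical format)."""
    d = len(q)
    mean = sum(q) // d
    centered = [v - mean for v in q]
    var = sum(v * v for v in centered) // d
    std = max(isqrt_newton(var), 1)      # guard against a zero divisor
    # The square root alone is not enough: each element still needs a division
    # (floor division here, so negative values round toward minus infinity).
    return [(v << frac_bits) // std for v in centered]


if __name__ == "__main__":
    print(int_layer_norm([10, 20, 30, 40]))
```

The per-element division in the last line of `int_layer_norm` is the extra cost mentioned above, which an approximation of the inverse square root avoids.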

The method in [9] avoids costly division operations for normalization by using a lookup-table (LUT)-based approximation of the inverse square root function. The inverse square root function is approximated using a piecewise linear method, and the function values for multiple inputs are stored in an LUT. For a given input, the inverse square root value is calculated by interpolating between two neighboring function values. Wu et al. implemented FISR [12] in BFloat16 in 28nm CMOS technology for on-chip layer normalization [10]. Unfortunately, the detailed implementation and performance data are unavailable.
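Both division-free ideas can be sketched in a few lines of Python. The LUT part assumes uniformly spaced knots on an input range $[1, 4]$ with 16 segments; these choices, like the function names, are ours and do not reproduce the breakpoint selection of [9]. The FISR part follows the widely known FP32 form of the algorithm [12], with the magic constant `0x5F3759DF` and one Newton refinement step; it is not the BFloat16 design of [10], and the refinement is evaluated in double precision here.

```python
import math
import struct


def build_rsqrt_lut(lo=1.0, hi=4.0, segments=16):
    """Tabulate 1/sqrt(x) at segments + 1 uniformly spaced knots on [lo, hi]."""
    step = (hi - lo) / segments
    return [1.0 / math.sqrt(lo + k * step) for k in range(segments + 1)]


def lut_rsqrt(x, lut, lo=1.0, hi=4.0):
    """Piecewise linear 1/sqrt(x) for lo <= x <= hi from two neighboring LUT entries."""
    segments = len(lut) - 1
    pos = (x - lo) * (segments / (hi - lo))  # the scale is a precomputable constant
    k = min(int(pos), segments - 1)          # index of the left knot
    frac = pos - k                           # position inside the segment
    return lut[k] + frac * (lut[k + 1] - lut[k])


def fisr(x, newton_steps=1):
    """Fast inverse square root for a positive, normal FP32 value x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # reinterpret FP32 as uint32
    bits = 0x5F3759DF - (bits >> 1)                       # bit-level initial guess
    y = struct.unpack("<f", struct.pack("<I", bits))[0]   # reinterpret back as FP32
    for _ in range(newton_steps):
        y = y * (1.5 - 0.5 * x * y * y)                   # Newton refinement
    return y


if __name__ == "__main__":
    table = build_rsqrt_lut()
    for value in (1.0, 2.0, 3.7):
        exact = 1.0 / math.sqrt(value)
        print(value, exact, lut_rsqrt(value, table), fisr(value))
```

The LUT variant trades memory for accuracy through the number of segments and needs a range reduction step for inputs outside its table, whereas FISR covers the full FP32 range and trades accuracy for extra Newton steps.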

The method in [11] computes the mean and standard deviation at low precision using dynamic compression and power-of-two factor quantization. Both values are computed using 4-bit integer arithmetic, and additional LUTs are used to store inverse square root values. However, as with [10], the implementation and performance data are missing.

#### VII. CONCLUSION

We introduced IterL2Norm, an efficient method for iteratively L2-normalizing input vectors without costly division or square root operations. Grounded in solid theory, IterL2Norm is applicable to general FP data and ensures high precision, outperforming FISR in six out of nine cases for FP32 and five out of nine for BFloat16 across the embedding lengths used in the OPT models. It also converges quickly, reaching its fixed point within five iterations. Implemented in 32/28nm CMOS technology, the IterL2Norm macro processes  $d$ -dimensional input vectors, where  $64 \leq d \leq 1024$ , with a latency of 116–227 clock cycles.

## REFERENCES

- [1] A. Radford, “Improving language understanding by generative pre-training,” 2018.
- [2] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth *et al.*, “Gemini: a family of highly capable multimodal models,” *arXiv preprint arXiv:2312.11805*, 2023.
- [3] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar *et al.*, “LLaMA: Open and efficient foundation language models,” *arXiv preprint arXiv:2302.13971*, 2023.
- [4] A. Vaswani, “Attention is all you need,” *Advances in Neural Information Processing Systems*, 2017.
- [5] S. Kim, C. Hooper, T. Wattanawong, M. Kang, R. Yan, H. Genc, G. Dinh, Q. Huang, K. Keutzer, M. W. Mahoney *et al.*, “Full stack optimization of transformer inference: a survey,” *arXiv preprint arXiv:2302.14017*, 2023.
- [6] Y.-C. Kwon, S. H. Lee, J. Lee, S.-H. Kwon, J. M. Ryu, J.-P. Son, O. Seongil, H.-S. Yu, H. Lee, S. Y. Kim *et al.*, “25.4 A 20nm 6GB function-in-memory DRAM, based on HBM2 with a 1.2 TFLOPS programmable computing unit using bank-level parallelism, for machine learning applications,” in *2021 IEEE International Solid-State Circuits Conference (ISSCC)*, vol. 64. IEEE, 2021, pp. 350–352.
- [7] S. Lee, K. Kim, S. Oh, J. Park, G. Hong, D. Ka, K. Hwang, J. Park, K. Kang, J. Kim *et al.*, “A 1ynm 1.25 V 8Gb, 16Gb/s/pin GDDR6-based accelerator-in-memory supporting 1TFLOPS MAC operation and various activation functions for deep-learning applications,” in *2022 IEEE International Solid-State Circuits Conference (ISSCC)*, vol. 65. IEEE, 2022, pp. 1–3.
- [8] A. Marchisio, D. Dura, M. Capra, M. Martina, G. Masera, and M. Shafique, “SwiftTron: An efficient hardware accelerator for quantized transformers,” in *2023 International Joint Conference on Neural Networks (IJCNN)*. IEEE, 2023, pp. 1–9.
- [9] J. Yu, J. Park, S. Park, M. Kim, S. Lee, D. H. Lee, and J. Choi, “NN-LUT: Neural approximation of non-linear operations for efficient transformer inference,” in *Proceedings of the 59th ACM/IEEE Design Automation Conference*, 2022, pp. 577–582.
- [10] Y. Wu, Z. Wang, and W. D. Lu, “PIM GPT a hybrid process in memory accelerator for autoregressive transformers,” *npj Unconventional Computing*, vol. 1, no. 1, p. 4, 2024.
- [11] W. Wang, S. Zhou, W. Sun, P. Sun, and Y. Liu, “SOLE: Hardware-software co-design of softmax and layernorm for efficient transformer inference,” in *2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD)*. IEEE, 2023, pp. 1–9.
- [12] A. Q. III, “Fast inverse square root,” 1999. [Online]. Available: [https://en.wikipedia.org/wiki/Fast_inverse_square_root](https://en.wikipedia.org/wiki/Fast_inverse_square_root)
- [13] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An imperative style, high-performance deep learning library,” in *Advances in Neural Information Processing Systems*, vol. 32, 2019.
- [14] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin *et al.*, “OPT: Open pre-trained transformer language models,” *arXiv preprint arXiv:2205.01068*, 2022.
- [15] S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” *arXiv preprint arXiv:1609.07843*, 2016.
- [16] E. M. Smith, “Can you put it all together: Evaluating conversational agents’ ability to blend skills,” *arXiv preprint arXiv:2004.08449*, 2020.
- [17] R. E. Crandall and C. Pomerance, *Prime numbers: a computational perspective*. Springer, 2005, vol. 2.