

# Low Power Approximate Multiplier Architecture for Deep Neural Networks

Pragun Jaswal<sup>1</sup>, L. Hemanth Krishna<sup>2</sup>, and B. Srinivasu<sup>3</sup>

School of Computing and Electrical Engineering,  
 Indian Institute of Technology Mandi, Mandi - 175005, India  
<sup>1</sup>[thepragun@gmail.com](mailto:thepragun@gmail.com), <sup>2</sup>[hemanthkrishna412@gmail.com](mailto:hemanthkrishna412@gmail.com),  
<sup>3</sup>[srinivasu@iitmandi.ac.in](mailto:srinivasu@iitmandi.ac.in)

**Abstract.** This paper proposes a low-power approximate multiplier architecture for deep neural network (DNN) applications. A 4:2 compressor, introducing only a single combination error, is designed and integrated into an $8 \times 8$ unsigned multiplier. This integration significantly reduces the usage of exact compressors while preserving low error rates. The proposed multiplier is employed within a custom convolution layer and evaluated on neural network tasks, including image recognition and denoising. Hardware evaluation demonstrates that the proposed design achieves up to 30.24% energy savings compared to the best among existing multipliers. In image denoising, the custom approximate convolution layer achieves improved Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) compared to other approximate designs. Additionally, when applied to handwritten digit recognition, the model maintains high classification accuracy. These results demonstrate that the proposed architecture offers a favorable balance between energy efficiency and computational precision, making it suitable for low-power AI hardware implementations.

**Keywords:** Low power approximate multiplier · approximate compressor · custom approximate convolution layer · deep neural networks

## 1 Introduction

Convolutional Neural Networks (CNNs) have become fundamental in advancing fields such as computer vision, speech recognition, and multimedia processing, driven by their ability to extract complex features from large datasets. With the rapid expansion of real-time and edge-based applications, particularly in resource-constrained environments, the demand for highly efficient hardware implementations has intensified. Traditional exact arithmetic circuits impose significant overheads in terms of area, delay, and power consumption [1], making them less favorable for large-scale data processing tasks and edge-based applications [2]. To address these issues, approximate computing has emerged as a viable alternative for applications that involve human perception [3], allowing slight accuracy trade-offs in exchange for large gains in energy, area, and speed [4, 5].

Approximation techniques, particularly in arithmetic units such as adders and multipliers, have shown considerable success [6–8]. Among these, the use of approximate compressors in the partial product reduction (PPR) stage of Dadda multipliers has gained significant attention. Approximate compressors simplify the logic by approximating low-probability input combinations, reducing both the size and error probability of multipliers.

The fundamental motivation of this work is to introduce approximation selectively, only in the input combination with the lowest occurrence probability, $P(1/256)$, thereby significantly reducing the hardware complexity while keeping the error probability of the multiplier low. In this study, we specifically focus on the design and analysis of approximate 4:2 compressors and an 8-bit unsigned multiplier. A comprehensive survey of existing 4:2 compressor architectures was conducted. Based on this analysis, we propose a high-accuracy 4:2 approximate compressor that introduces only a single error combination; the multiplier built from it achieves up to a 30.24% improvement in energy consumption compared to the best existing designs.

#### **The main contributions of this paper are as follows:**

1. A 4:2 approximate compressor is proposed, integrated into a high-accuracy  $8 \times 8$  multiplier, achieving up to 8.93% energy savings and improving power efficiency.
2. The proposed multiplier achieves up to 27.64% power and 27.48% energy reduction over the best existing multiplier-1 design.
3. The proposed multiplier achieves up to 33.14% power and 30.24% energy reduction over the best existing multiplier-2 design.
4. The multiplier is integrated into a custom convolution layer for DNN tasks, such as image denoising and digit recognition, demonstrating high accuracy with reduced computational overhead.

## 2 Related Work

The exact 4:2 compressor takes four primary inputs $x_1$, $x_2$, $x_3$, and $x_4$ along with an additional carry input $C_{in}$. These five inputs are summed to produce a value that can range from 0 to 5. To represent this maximum sum of 5, the compressor requires three output bits: $C_{out}$ and $Carry$, each with a weight of $2^{n+1}$, and $Sum$, with the same weight as the inputs, $2^n$. Each output contributes to reconstructing the total sum based on its positional weight in binary representation. This structure enables efficient reduction of partial products in arithmetic circuits, especially in multiplier architectures.
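
The weighting described above can be checked with a short behavioural sketch. It assumes the common construction of the exact compressor from two cascaded full adders; the function names are illustrative and are not taken from the paper.

```python
from itertools import product


def full_adder(a, b, c):
    """Return (sum, carry) for three input bits."""
    s = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return s, carry


def exact_compressor_4_2(x1, x2, x3, x4, cin):
    """Exact 4:2 compressor modelled as two cascaded full adders.

    Returns (sum, carry, cout). The sum bit has the weight of the inputs;
    carry and cout both have twice that weight.
    """
    s1, cout = full_adder(x1, x2, x3)      # cout does not depend on cin
    sum_, carry = full_adder(s1, x4, cin)
    return sum_, carry, cout


# Exhaustive check over all 32 input combinations.
for bits in product((0, 1), repeat=5):
    sum_, carry, cout = exact_compressor_4_2(*bits)
    assert sum(bits) == sum_ + 2 * (carry + cout)
```

Because `cout` is produced by the first full adder alone, it is independent of `cin`, which is what keeps a row of such compressors from forming a long ripple chain.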

Fig. 1 illustrates the implementation of an exact 4:2 compressor using full adders. In contrast, the approximate 4:2 compressor typically eliminates the $C_{in}$ and $C_{out}$ pins, computing only ($Carry$, $Sum$) to represent the sum $x_1+x_2+x_3+x_4$. This elimination breaks the carry propagation chain between compressors, thereby accelerating the accumulation of the sum. However, the maximum value that can be encoded using only the $Sum$ and $Carry$ outputs is three. Given four input bits ($x_1$ to $x_4$), it is evident that at least one error is unavoidable, specifically when all inputs are ‘1’.

Fig. 1: Conventional Exact 4:2 Compressor [9]

### 2.1 Low Accuracy Approximate 4:2 Compressors

Many researchers have proposed approximate 4:2 compressor designs with different numbers of erroneous input combinations. For instance, if four of the 16 input combinations produce incorrect outputs, the error rate is 25% [9–11]. When five combinations result in errors, the error rate increases to 31.25% [12], and with six erroneous combinations, it rises to 37.5% [13, 14].

These designs primarily aim to reduce the hardware complexity of compressor circuits. However, this reduction often comes at the cost of accuracy. When integrated into an 8-bit multiplier architecture, such approximate compressors tend to exhibit higher error metrics, particularly in terms of Normalized Mean Error Distance (NMED) and Mean Relative Error Distance (MRED), which are further discussed in Section 4.

The 4:2 compressor design proposed in [15] uses two XOR gates for the *Sum* output, introducing up to four combination errors, which results in an error probability of $P(16/256)$, as shown in Table 3. The compressors proposed in [12] use an input reordering circuit with additional gates, eliminating XOR gates in the critical path, and introduce two combination errors, resulting in an error probability of $P(19/256)$. In [16], two compressors are proposed; Design-2, which uses only OR and AND gates, incurs a maximum of seven error combinations with an error probability of $P(55/256)$. The compressor design in [13] incorporates one XOR and one NOR gate in the critical path, introducing a maximum of six combination errors, leading to an error probability of $P(70/256)$. These low-accuracy compressors provide substantial energy savings, though at the expense of reduced computational accuracy.

### 2.2 High-Accuracy Approximate 4:2 Compressors

In compressors with a single combination error, the error typically occurs when all inputs are logic high. This corresponds to an error probability of $1/256$ and provides significantly improved accuracy in multiplier architectures. In this work, such designs are classified as *high-accuracy approximate compressors*. Examples are found in [16–19], where the proposed compressors maintain low error rates. However, these designs often require more hardware resources and introduce longer critical paths, which can lead to increased delay and power consumption.

## 3 Proposed Approximate Multiplier Architecture

This section is divided into two parts. The first describes the construction of the  $8 \times 8$  approximate unsigned multiplier architecture incorporating the proposed compressor. The second presents the design of the proposed 4:2 approximate compressor, which serves as the core component of the multiplier.

### 3.1 Efficient Architecture of Approximate Multiplier

Previous multiplier designs, shown in Fig. 2 (a, b), use a mix of exact and approximate compressors. Typically, exact compressors are used in the most significant columns to maintain accuracy, while approximate compressors are applied in the least significant columns to reduce hardware cost. Additionally, design (b) incorporates truncation in the least significant columns  $n - 4$  and an error correction module to further mitigate the error rate.

In contrast, the proposed multiplier design, shown in Fig. 2 (c), uses only approximate compressors throughout the design to further reduce hardware complexity. Although it relies more heavily on approximation, the overall error remains low. This is because the proposed compressor introduces only one combination error with a very low probability of occurrence, $P(1/256)$. As a result, error metrics such as NMED and MRED show only a minimal increase, ensuring that the accuracy of the multiplier is not significantly affected.



Fig. 2: Structure of  $8 \times 8$  Unsigned Multipliers (a) Multiplier Design-1 : from [12], [17], [19], (b) Multiplier Design-2 : from [13], [15], (c) Proposed multiplier for High Accuracy Compressor

### 3.2 Proposed High-Accuracy 4:2 Compressor

Table 1: Truth Table of the Proposed 4:2 Approximate Compressor

| Inputs $x_4 \ x_3 \ x_2 \ x_1$ | Exact Value | Prob. | Carry | Sum | Appr. Value | Difference |
|--------------------------------|-------------|-------|-------|-----|-------------|------------|
| 0 0 0 0                           | 0              | 81/256            | 0       | 0   | 0              | 0          |
| 0 0 0 1                           | 1              | 27/256            | 0       | 1   | 1              | 0          |
| 0 0 1 0                           | 1              | 27/256            | 0       | 1   | 1              | 0          |
| 0 0 1 1                           | 2              | 9/256             | 1       | 0   | 2              | 0          |
| 0 1 0 0                           | 1              | 27/256            | 0       | 1   | 1              | 0          |
| 0 1 0 1                           | 2              | 9/256             | 1       | 0   | 2              | 0          |
| 0 1 1 0                           | 2              | 9/256             | 1       | 0   | 2              | 0          |
| 0 1 1 1                           | 3              | 3/256             | 1       | 1   | 3              | 0          |
| 1 0 0 0                           | 1              | 27/256            | 0       | 1   | 1              | 0          |
| 1 0 0 1                           | 2              | 9/256             | 1       | 0   | 2              | 0          |
| 1 0 1 0                           | 2              | 9/256             | 1       | 0   | 2              | 0          |
| 1 0 1 1                           | 3              | 3/256             | 1       | 1   | 3              | 0          |
| 1 1 0 0                           | 2              | 9/256             | 1       | 0   | 2              | 0          |
| 1 1 0 1                           | 3              | 3/256             | 1       | 1   | 3              | 0          |
| 1 1 1 0                           | 3              | 3/256             | 1       | 1   | 3              | 0          |
| 1 1 1 1                           | 4              | 1/256             | 1       | 1   | 3              | -1         |
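
The probability column of Table 1 can be reproduced with a few lines of arithmetic. The sketch below assumes that each compressor input is a partial-product bit, that is, the AND of two independent and uniformly distributed operand bits, so that it equals 1 with probability 1/4; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Assumption: each input is the AND of two independent uniform bits.
P_ONE = Fraction(1, 4)


def combination_probability(bits):
    """Probability of one specific input combination (x4, x3, x2, x1)."""
    ones = sum(bits)
    return P_ONE ** ones * (1 - P_ONE) ** (len(bits) - ones)


def error_probability(erroneous_combinations):
    """Total probability of a set of erroneous input combinations."""
    return sum(combination_probability(b) for b in erroneous_combinations)


table = {bits: combination_probability(bits) for bits in product((0, 1), repeat=4)}
assert sum(table.values()) == 1

print(table[(0, 0, 0, 0)])               # 81/256
print(table[(0, 0, 1, 1)])               # 9/256
print(error_probability([(1, 1, 1, 1)])) # 1/256
```

Under this assumption, the all-ones combination is the least likely of the 16, which is why restricting the approximation to it keeps the error probability at 1/256.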

The proposed approximate compressor comprises four inputs and two outputs, defined by Equations (1) and (2) for the Carry and Sum, respectively. Equation (3) defines the intermediate signals: $A$ and $C$ are the NOR of the input pairs ($x_1$, $x_2$) and ($x_3$, $x_4$), respectively, and $B$ and $D$ are the NAND of the same pairs. Given that NOR and NAND gates exhibit superior speed and energy efficiency compared to the AND and OR gates utilized in earlier designs, this approach is well-suited for low-power, high-speed applications.

$$\text{Carry} = \overline{B \cdot D} + \overline{A + C} \quad (1)$$

$$\begin{aligned} \text{Sum} = & \overline{A} \cdot B \cdot C + \overline{A} \cdot B \cdot \overline{D} + A \cdot \overline{C} \cdot D \\ & + \overline{B} \cdot \overline{C} \cdot D + \overline{B} \cdot \overline{D} \end{aligned} \quad (2)$$

$$\begin{aligned} \text{where } A &= \overline{x_1 + x_2}, \quad B = \overline{x_1 \cdot x_2} \\ C &= \overline{x_3 + x_4}, \quad D = \overline{x_3 \cdot x_4}. \end{aligned} \quad (3)$$

Table 1 lists the Sum and Carry outputs of the proposed approximate compressor for all possible input combinations. The proposed compressor not only has high accuracy but also a shorter critical path. As shown in Fig. 3, the red dotted line marks the critical path of the approximate compressor.
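
A bit-level sketch of the compressor logic makes the single-error property easy to verify. It is a behavioural model only, not the gate-level netlist of Fig. 3. The third product term of Sum is taken as $A \cdot \overline{C} \cdot D$, which is the form that reproduces every row of Table 1; the function name is illustrative.

```python
from itertools import product


def approx_compressor_4_2(x1, x2, x3, x4):
    """Behavioural model of the approximate 4:2 compressor; returns (carry, sum)."""
    a = 1 - (x1 | x2)   # NOR of (x1, x2)
    b = 1 - (x1 & x2)   # NAND of (x1, x2)
    c = 1 - (x3 | x4)   # NOR of (x3, x4)
    d = 1 - (x3 & x4)   # NAND of (x3, x4)
    na, nb, nc, nd = 1 - a, 1 - b, 1 - c, 1 - d

    carry = (1 - (b & d)) | (1 - (a | c))
    sum_ = ((na & b & c) | (na & b & nd) | (a & nc & d)
            | (nb & nc & d) | (nb & nd))
    return carry, sum_


# Compare against the exact sum for all 16 input combinations.
mismatches = []
for x4, x3, x2, x1 in product((0, 1), repeat=4):
    carry, sum_ = approx_compressor_4_2(x1, x2, x3, x4)
    exact = x1 + x2 + x3 + x4
    approx = 2 * carry + sum_
    if approx != exact:
        mismatches.append(((x4, x3, x2, x1), approx - exact))

# Only the all-ones input deviates, by -1, as in the last row of Table 1.
assert mismatches == [((1, 1, 1, 1), -1)]
```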



Fig. 3: Proposed Approximate 4:2 Compressor with One Error Probability

The critical path comprises one NOR-2, one NAND-2, two inverters, and one AO222 gate. The proposed compressor architecture demonstrates a notable reduction in propagation delay compared to the designs presented in [17] and [19]. These performance improvements are discussed in detail in the subsequent results section.

## 4 Results and Discussion

### 4.1 Error Metrics

To evaluate the accuracy of the proposed approximate design, error metrics such as *Error Distance*, *Error Rate*, *Relative Error Distance*, and *Mean Relative Error Distance* are used. These metrics quantify the deviation of approximate outputs from their exact counterparts.

#### Error Distance (ED)

The Error Distance measures the absolute difference between the exact and approximate outputs for each test case as shown in Equation (4).

$$ED_i = |A_i - A'_i| \quad (4)$$

where  $A_i$  is the exact output and  $A'_i$  is the approximate output for the  $i^{\text{th}}$  test case.

#### Error Rate (ER)

The Error Rate indicates the percentage of test cases where the approximate output differs from the exact output. It is computed as shown in Equation (5).

$$ER = \left( \frac{1}{N} \sum_{i=1}^N \delta_i \right) \times 100 \quad (5)$$

$$\delta_i = \begin{cases} 1, & \text{if } A_i \neq A'_i \\ 0, & \text{if } A_i = A'_i \end{cases}$$

Here,  $A_i$  and  $A'_i$  denote the exact and approximate outputs respectively for the  $i^{\text{th}}$  test case, and  $N$  is the total number of test cases.

#### Relative Error Distance (RED)

The Relative Error Distance normalizes the error with respect to the exact output as shown in Equation (6).

$$\text{RED}_i = \frac{|A_i - A'_i|}{|A_i|} \quad (6)$$

#### Mean Relative Error Distance (MRED)

The Mean Relative Error Distance provides the average RED across all test cases, giving an overall measure of the accuracy degradation due to approximation as shown in Equation (7).

$$\text{MRED} = \frac{1}{N} \sum_{i=1}^N \frac{|A_i - A'_i|}{|A_i|} \quad (7)$$

where  $N$  is the total number of test cases.
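
The metrics in Equations (4) to (7) can be evaluated exhaustively for any 8-bit multiplier model, as in the sketch below. Three points are assumptions rather than details from the paper: `toy_approx_multiply` is a hypothetical stand-in and not the proposed design; test cases with an exact product of zero contribute zero to the RED sum, since Equation (6) is undefined there; and NMED, which is not defined in this section, is taken as the mean error distance divided by the maximum exact product $(2^n - 1)^2$.

```python
def toy_approx_multiply(a, b):
    """Hypothetical stand-in multiplier: clears the 4 LSBs of the exact product."""
    return (a * b) & ~0xF


def error_metrics(approx_mul, n_bits=8):
    """Exhaustively evaluate ER, NMED and MRED (all in percent)."""
    size = 1 << n_bits
    max_product = (size - 1) ** 2
    n = size * size                       # total number of test cases
    errors = 0
    ed_sum = 0
    red_sum = 0.0
    for a in range(size):
        for b in range(size):
            exact = a * b
            ed = abs(exact - approx_mul(a, b))   # Equation (4)
            errors += ed != 0                    # delta_i in Equation (5)
            ed_sum += ed
            if exact != 0:                       # assumption: skip RED when exact is 0
                red_sum += ed / exact            # Equation (6)
    return {
        "ER (%)": 100 * errors / n,
        "NMED (%)": 100 * ed_sum / n / max_product,
        "MRED (%)": 100 * red_sum / n,           # Equation (7)
    }


print(error_metrics(toy_approx_multiply))
```

Replacing `toy_approx_multiply` with a bit-accurate model of a given multiplier yields the corresponding row of a table such as Table 2.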

Table 2: Error Metrics of Proposed Multiplier Design

| Design      | ER (%) | NMED (%) | MRED (%) |
|-------------|--------|----------|----------|
| Design [12] | 68.498 | 0.596    | 3.496    |
| Design [15] | 65.425 | 0.673    | 3.531    |
| Design [16] | 6.994  | 0.046    | 0.109    |
| Design [16] | 86.326 | 1.879    | 9.551    |
| Design [17] | 21.296 | 0.162    | 0.578    |
| Design [17] | 6.994  | 0.046    | 0.109    |
| Design [19] | 6.994  | 0.046    | 0.109    |
| Design [19] | 6.994  | 0.046    | 0.109    |
| Design [13] | 95.681 | 1.565    | 20.276   |
| Design [18] | 6.994  | 0.046    | 0.109    |
| Proposed    | 6.994  | 0.046    | 0.109    |

All designs of $8 \times 8$ unsigned approximate multipliers were evaluated by simulation across the complete input space. The corresponding error metrics for each design are summarized in Table 2. Although the proposed multiplier employs fully approximate compressors, it achieves a high level of computational accuracy, with an MRED of 0.109%.

### 4.2 Hardware Synthesis Analysis

The proposed and existing 8-bit unsigned multipliers were implemented using Verilog HDL and synthesized using Cadence Genus with UMC 90 nm technology, under typical-typical (TT) process conditions, to evaluate their design efficiency and characteristics. Table 3 presents a comparative analysis of both low- and high-accuracy compressor designs. The proposed compressor reduces energy (PDP) by 8.93% compared to the best-performing existing high-accuracy compressor, Design-1 [16] (0.265 fJ versus 0.291 fJ).

Table 3: Hardware Synthesis Metrics of Existing and Proposed 4:2 Compressors

| S.No | Design        | Area ($\mu\text{m}^2$) | Power ($\mu\text{W}$) | Delay (ps) | PDP (fJ)  | Error Probability |
|------|---------------|------------------------|-----------------------|------------|-----------|-------------------|
| 1.   | Exact         | 43.90                  | 1.99                  | 436        | 0.867     | 0                 |
| 2.   | Design-1 [18] | 50.17                  | 2.39                  | 469        | **0.852** | 1/256             |
| 3.   | Design-1 [19] | 44.68                  | 1.86                  | **383**    | 0.713     | 1/256             |
| 4.   | Design-5 [19] | **28.22**              | 1.17                  | 297        | 0.347     | 1/256             |
| 5.   | Design-1 [16] | 34.49                  | 1.20                  | **226**    | 0.291     | 1/256             |
| 6.   | Design-3 [17] | **76.82**              | **3.02**              | 307        | 0.827     | 1/256             |
| 7.   | Design-1 [12] | 49.74                  | 1.83                  | 374        | 0.684     | 19/256            |
| 8.   | Design [15]   | 25.87                  | 1.02                  | 175        | 0.179     | 16/256            |
| 9.   | Design-2 [16] | 19.60                  | 0.71                  | 104        | 0.074     | 55/256            |
| 10.  | Design-2 [17] | 31.36                  | 1.37                  | 308        | 0.422     | 4/256             |
| 11.  | Design [13]   | 14.11                  | 0.52                  | 139        | 0.072     | 70/256            |
| 12.  | Proposed      | 30.57                  | **1.12**              | 237        | **0.265** | 1/256             |

\*Best and worst results for the high-accuracy compressors are highlighted in bold.
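
The PDP columns follow from power and delay, since 1 µW × 1 ns = 1 fJ. The sketch below recomputes the PDP of the proposed compressor and the proposed multiplier from their tabulated power and delay, and expresses the compressor's energy saving relative to the baseline PDP of Design-1 [16]. The helper names are illustrative; the numerical inputs are the values listed in Tables 3 and 4.

```python
def pdp_fj(power_uw, delay_ns):
    """Power-delay product in fJ, given power in microwatts and delay in ns."""
    return power_uw * delay_ns


def reduction_percent(baseline, proposed):
    """Relative saving of `proposed` with respect to `baseline`, in percent."""
    return 100 * (baseline - proposed) / baseline


# Proposed compressor (Table 3): 1.12 uW, 237 ps = 0.237 ns.
print(round(pdp_fj(1.12, 0.237), 3))            # 0.265

# Proposed multiplier (Table 4): 44.66 uW, 2.042 ns.
print(round(pdp_fj(44.66, 2.042), 2))           # 91.2

# Compressor energy saving versus Design-1 [16], using the tabulated PDPs.
print(round(reduction_percent(0.291, 0.265), 2))  # 8.93
```

The last value matches the 8.93% compressor-level saving quoted in the list of contributions.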

Furthermore, among the multipliers built from high-accuracy compressors (MRED of 0.109%), the proposed multiplier achieves the lowest Power-Delay Product (PDP), with a value of 91.20 fJ, as presented in Table 4. The proposed design demonstrates significant energy improvement, as depicted in Fig. 4, with reductions of 27.48% and 30.24% in energy consumption compared with the best of multiplier design-1 and design-2, respectively. When the high-accuracy compressors described in [17] and [19] are employed within the proposed multiplier instead, the proposed compressor achieves notable energy improvements of 49.59% and 46.35% over them, respectively.

Table 4: Hardware Synthesis and Error Metrics (MRED, Power, Delay, PDP) of 8-bit Existing and Proposed Approximate Multipliers

D1: Multiplier Design-1 [12, 17, 19]; D2: Multiplier Design-2 [13, 15]; P: Proposed Multiplier Design.

| Design      | D1 MRED (%) | D1 Power ($\mu W$) | D1 Delay (ns) | D1 PDP (fJ) | D2 MRED (%) | D2 Power ($\mu W$) | D2 Delay (ns) | D2 PDP (fJ) | P MRED (%) | P Power ($\mu W$) | P Delay (ns) | P PDP (fJ) |
|-------------|-------------|--------------------|---------------|-------------|-------------|--------------------|---------------|-------------|------------|-------------------|--------------|------------|
| Design [12] | 0.993       | 76.25              | 2.084         | 158.91      | 1.286       | 74.68              | 2.009         | 150.05      | 3.496      | 63.17             | 2.042        | **129.09** |
| Design [15] | 0.773       | 68.67              | 1.998         | 137.24      | 0.974       | 67.58              | 1.996         | 134.87      | 3.531      | 57.41             | 2.042        | **117.23** |
| Design [16] | 0.023       | 68.67              | 2.071         | 142.23      | 0.715       | 66.81              | 2.071         | 138.35      | 0.109      | 57.50             | 2.121        | **121.96** |
| Design [16] | 2.693       | 59.00              | 1.993         | 117.75      | 2.704       | 59.38              | 1.993         | 118.32      | 9.551      | 41.12             | 2.042        | **83.97**  |
| Design [17] | 0.090       | 74.94              | 2.084         | 156.15      | 0.702       | 77.44              | 2.085         | 161.40      | 0.578      | 69.21             | 2.126        | **147.14** |
| Design [17] | 0.023       | 97.30              | 2.239         | 217.86      | 0.715       | 78.83              | 2.140         | 168.70      | 0.109      | 82.65             | 2.189        | **180.92** |
| Design [19] | 0.023       | 76.94              | 2.243         | 172.54      | 0.715       | 75.30              | 2.243         | 168.82      | 0.109      | 74.13             | 2.293        | **169.98** |
| Design [19] | 0.023       | 61.73              | 2.090         | 135.18      | 0.715       | 62.42              | 2.090         | 130.46      | 0.109      | 66.10             | 2.139        | **141.39** |
| Design [13] | 4.399       | 61.73              | 1.993         | 123.06      | 4.320       | 65.51              | 1.995         | 130.73      | 20.276     | 42.46             | 2.042        | **86.70**  |
| Design [18] | 0.023       | 70.19              | 2.350         | 164.94      | 0.715       | 71.25              | 2.350         | 167.44      | 0.109      | 62.69             | 2.371        | **148.64** |
| Proposed    | 0.023       | 65.56              | 1.993         | 130.75      | 0.715       | 64.25              | 1.993         | 128.06      | 0.109      | 44.66             | 2.042        | **91.20**  |

\*Best and worst results are highlighted in green and red, respectively.



Fig. 4: Comparison of PDP and MRED for Different Designs

## 5 Applications

The proposed and existing multipliers are evaluated using two neural network applications: handwritten digit recognition on the MNIST dataset and image denoising using the FFDNet architecture. These benchmarks help assess the impact of multiplier design on inference accuracy and efficiency.

### 5.1 MNIST Handwritten Digit Recognition

The proposed multiplier was evaluated using a Keras-based [20] convolutional neural network (CNN) model, as depicted in Fig. 5, and the LeNet-5 model [21], both developed for handwritten digit classification (0–9) using the MNIST dataset [22]. In this evaluation, the exact multiplier in the convolutional layers was substituted with the proposed approximate multiplier. The dataset consists of 5,000 grayscale training images and 500 testing images of handwritten digits, each with a resolution of $28 \times 28$ pixels. The CNN architecture was trained over 50 epochs, and the classification accuracies corresponding to various multiplier designs are reported in Table 5. Although a marginal decline in classification accuracy is observed compared to the exact multiplier, the proposed design demonstrates substantial advantages in terms of reduced power consumption and lower area overhead.

Fig. 5: Architecture of Convolution Neural Network using Keras Model
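
One way to substitute an approximate multiplier into a convolution is to precompute its 256 × 256 product table and replace every multiplication with a lookup. The framework-independent sketch below illustrates the idea. It rests on several assumptions that are not taken from the paper: activations are unsigned 8-bit integers, weights are quantized to integers in [-255, 255] and handled in sign-magnitude form, and `toy_approx_multiply` is a hypothetical stand-in for a bit-accurate model of the multiplier.

```python
def toy_approx_multiply(a, b):
    """Hypothetical 8-bit unsigned approximate multiplier (clears the 4 LSBs)."""
    return (a * b) & ~0xF


def build_lut(mul):
    """Precompute mul(a, b) for all 8-bit unsigned operand pairs."""
    return [[mul(a, b) for b in range(256)] for a in range(256)]


def approx_conv2d(image, kernel, lut):
    """Valid-mode 2D cross-correlation with lookup-table products.

    image:  list of rows of ints in [0, 255]
    kernel: list of rows of ints in [-255, 255] (sign applied after the lookup)
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    w = kernel[i][j]
                    prod = lut[image[r + i][c + j]][abs(w)]
                    acc += prod if w >= 0 else -prod   # exact accumulation
            out[r][c] = acc
    return out


lut = build_lut(toy_approx_multiply)
image = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
kernel = [[1, -2], [3, 4]]
print(approx_conv2d(image, kernel, lut))
```

Only the multiplications are approximated in this sketch; the accumulation stays exact, so the resulting accuracy loss can be attributed to the multiplier alone.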

Table 5: Digit Recognition Accuracy Using Different Convolution Models

| Models       | Dataset | Design      | Accuracy(%) |
|--------------|---------|-------------|-------------|
| Keras [20]   | MNIST   | Exact       | 95.24       |
|              |         | Design [13] | 90.58       |
|              |         | Design [15] | 92.14       |
|              |         | Design [16] | 92.46       |
|              |         | Design [12] | 93.19       |
|              |         | Proposed    | 93.54       |
| LeNet-5 [21] | MNIST   | Exact       | 98.24       |
|              |         | Design [13] | 91.66       |
|              |         | Design [15] | 93.72       |
|              |         | Design [16] | 93.88       |
|              |         | Design [12] | 95.12       |
|              |         | Proposed    | 96.45       |

### 5.2 Image Denoising using FFDNet Architecture

To further evaluate the effectiveness of the proposed approximate multiplier in real-world applications, it was integrated into the convolutional layers of the FFDNet architecture, a well-established CNN model for image denoising [23], depicted in Fig. 6. In the original design, FFDNet employs a reversible down-sampling operator followed by a sequence of convolutional layers with combinations of convolution, batch normalization, and ReLU activations to restore clean images from noisy inputs.

Fig. 6: FFDNet Architecture with Custom Convolution Layers for Image Denoising [23]

Fig. 7: Denoised Output Images Using the Proposed Approximate Multiplier Integrated into the FFDNet Model, Along with Corresponding PSNR and SSIM Values for Noise Levels $\sigma = 25$ and $\sigma = 50$.

In this study, the exact multiplier in the convolutional layers was substituted with the proposed approximate multiplier, while preserving the rest of the network architecture. As illustrated in Fig. 7 and Fig. 8, the proposed multiplier achieves effective noise reduction with minimal perceptual degradation, retaining competitive denoising performance and high PSNR values compared to existing designs. These results highlight both the visual quality and hardware efficiency of the proposed architecture, which also delivers significant reductions in power consumption and area overhead.
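
For reference, PSNR can be computed from the mean squared error between the clean and the denoised image. The sketch below uses the standard definition for 8-bit grayscale images stored as lists of rows; it is not the evaluation script used for Fig. 7, and the function name is illustrative.

```python
import math


def psnr(reference, test, max_value=255):
    """PSNR in dB between two equally sized grayscale images (lists of rows)."""
    sq_err = 0
    count = 0
    for ref_row, test_row in zip(reference, test):
        for r, t in zip(ref_row, test_row):
            sq_err += (r - t) ** 2
            count += 1
    mse = sq_err / count
    if mse == 0:
        return math.inf            # identical images
    return 10 * math.log10(max_value ** 2 / mse)


clean = [[100, 120], [130, 140]]
denoised = [[101, 119], [131, 139]]
print(psnr(clean, denoised))       # MSE = 1, so PSNR = 20*log10(255) dB
```

A higher PSNR indicates a smaller pixel-wise deviation from the clean image; SSIM complements it by comparing local structure and is not covered by this sketch.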

Fig. 8: Comparison of Noisy and Denoised Images with Region of Interest (ROI)

## 6 Conclusion

This paper proposed a high-efficiency approximate $8 \times 8$ unsigned multiplier incorporating a high-accuracy 4:2 compressor, optimized for deep neural network applications. The architecture achieves notable accuracy and energy efficiency, with energy improvements of up to 30.24% compared to existing designs and an MRED of only 0.109%. Experimental results demonstrate the superior performance of the proposed design in image denoising using FFDNet, achieving higher PSNR values, as well as high classification accuracy in digit recognition tasks, thereby validating its suitability for low-power AI hardware implementations.

## References

1. A. Fahim and M. Elmasry, “Low-power high-performance arithmetic circuits and architectures,” *IEEE Journal of Solid-State Circuits*, vol. 37, no. 1, pp. 90–94, 2002.
2. H. Jiang, C. Liu, N. Maheshwari, F. Lombardi, and J. Han, “A comparative evaluation of approximate multipliers,” in *2016 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH)*, 2016, pp. 191–196.
3. V. K. Chippa, S. T. Chakradhar, K. Roy, and A. Raghunathan, “Analysis and characterization of inherent application resilience for approximate computing,” in *2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC)*, 2013, pp. 1–9.
4. A. Bosio, D. Ménard, and O. Sentieys, Eds., *Approximate Computing Techniques: From Component- to Application-Level*. Cham, Switzerland: Springer, 2022.
5. W. Liu, F. Lombardi, and M. Schulte, “A retrospective and prospective view of approximate computing [point of view],” *Proceedings of the IEEE*, vol. 108, no. 3, pp. 394–399, 2020.
6. L. B. Soares, M. M. A. da Rosa, C. M. Diniz, E. A. C. da Costa, and S. Bampi, “Design methodology to explore hybrid approximate adders for energy-efficient image and video processing accelerators,” *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 66, no. 6, pp. 2137–2150, 2019.
7. V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, “Low-power digital signal processing using approximate adders,” *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 32, no. 1, pp. 124–137, 2013.
8. M. Shafique, W. Ahmad, R. Hafiz, and J. Henkel, “A low latency generic accuracy configurable adder,” in *2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC)*, 2015, pp. 1–6.
9. A. Momeni, J. Han, P. Montuschi, and F. Lombardi, “Design and analysis of approximate compressors for multiplication,” *IEEE Transactions on Computers*, vol. 64, no. 4, pp. 984–994, 2015.
10. L. H. Krishna, J. B. Rao, S. Ayesha, S. Veeramachaneni, and S. Noor Mahammad, “Energy efficient approximate multiplier design for image/video processing applications,” in *2021 IEEE International Symposium on Smart Electronic Systems (iSES)*, 2021, pp. 210–215.
11. S. Hwang, K.-W. Kwon, and Y. Kim, “Design of a hardware-efficient approximate 4-2 compressor for multiplications in image processing,” *IEEE Embedded Systems Letters*, pp. 1–1, 2025.
12. L. H. Krishna, A. Sk, J. B. Rao, S. Veeramachaneni, and N. M. Sk, “Energy-efficient approximate multiplier design with lesser error rate using the probability-based approximate 4:2 compressor,” *IEEE Embedded Systems Letters*, vol. 16, no. 2, pp. 134–137, 2024.
13. M. Zhang, S. Nishizawa, and S. Kimura, “Area efficient approximate 4–2 compressor and probability-based error adjustment for approximate multiplier,” *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. 70, no. 5, pp. 1714–1718, 2023.
14. Y. Zhang, X. Chen, P. Guo, and G. Xie, “Design and analysis of approximate multiplier of majority-based imprecise 4–2 compressor for image processing,” in *2023 IEEE 23rd International Conference on Nanotechnology (NANO)*, 2023, pp. 1–5.
15. U. Anil Kumar, S. V. Bharadwaj, A. B. Pattaje, S. Nambi, and S. E. Ahmed, “CAAM: Compressor-based adaptive approximate multiplier for neural network applications,” *IEEE Embedded Systems Letters*, vol. 15, no. 3, pp. 117–120, 2023.
16. A. Kumari and R. P. Palathinkal, “Design and analysis of energy efficient approximate multipliers for image processing and deep neural network,” *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 72, no. 2, pp. 854–867, 2025.
17. A. G. M. Strollo, E. Napoli, D. De Caro, N. Petra, and G. D. Meo, “Comparison and extension of approximate 4-2 compressors for low-power approximate multipliers,” *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 67, no. 9, pp. 3021–3034, 2020.
18. Z. Yang, J. Han, and F. Lombardi, “Approximate compressors for error-resilient multiplier design,” in *Proceedings of the 2015 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS)*, Amherst, MA, USA, 2015, pp. 183–186.
19. T. Kong and S. Li, “Design and analysis of approximate 4–2 compressors for high-accuracy multipliers,” *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 29, no. 10, pp. 1771–1781, 2021.
20. F. Chollet *et al.*, “Keras,” <https://keras.io>, 2015.
21. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” *Proceedings of the IEEE*, vol. 86, no. 11, pp. 2278–2324, 1998.
22. Y. LeCun, C. Cortes, and C. J. Burges, “The MNIST database of handwritten digits,” <http://yann.lecun.com/exdb/mnist/>, 1998.
23. K. Zhang, W. Zuo, and L. Zhang, “FFDNet: Toward a fast and flexible solution for CNN-based image denoising,” *IEEE Transactions on Image Processing*, vol. 27, no. 9, pp. 4608–4622, 2018.