

# Carry Disregard Approximate Multipliers

Nima Amirafshar<sup>✉</sup>, Ahmad Sadigh Baroughi, Hadi Shahriar Shahhoseini<sup>✉</sup>,  
and Nima TaheriNejad<sup>✉</sup>, *Member, IEEE*

**Abstract**—Several challenges in improving the performance of computing systems have given rise to emerging computing paradigms. One of these paradigms is approximate computing. Many applications require different levels of accuracy and are error-tolerant to a certain degree. Approximate computations can reduce the calculation complexities significantly and thus improve the performance. Here, we propose a methodology for designing approximate N-bit array multipliers based on carry disregarding. We evaluate and analyze the proposed multipliers both experimentally and theoretically. The proposed 8-bit multipliers, compared to the exact multiplier, reduce the critical path delay, power consumption, and area by 29%, 29%, and 30%, on average. Compared to the existing approximate array architectures in the literature, they have improved 14.3%, 22.8%, and 26.4%, respectively. Compared to the exact 16-bit multiplier, the proposed 16-bit multipliers have reduced the delay, power consumption, and area by 35%, 24%, and 23% on average. In an image processing application, we have also demonstrated the applicability of a wide range of proposed multipliers, which have Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) over 30 dB and 94%, respectively.

**Index Terms**—Approximate computing, carry disregard multiplier, power-efficient, image processing.

## I. INTRODUCTION

One main goal of computer architecture design is to achieve high performance. Data volume, with differing characteristics, has grown substantially and, with it, the number of processing operations; the number and variety of applications have also increased significantly. The architecture of today's computer systems is therefore limited in processing such a large amount of data, leading to inefficiency. Hence, the architecture of processing systems should be centered on data characteristics. Many specific applications do not mandate precise calculations. Numerous applications are inherently error-tolerant, such as Machine Learning, Scientific Computing, Data Analytics, and Signal Processing [1], [2], [3], [4], [5]. Also, human perceptual limitations make it possible to use approximations in many further applications, such as Image Processing and Multimedia [6], [7], [8]. Therefore, approximate computing endeavors to achieve high efficiency in speed, area, and power or energy consumption by compromising accuracy.

Manuscript received 3 April 2023; revised 18 July 2023; accepted 7 August 2023. Date of publication 31 August 2023; date of current version 18 December 2023. This article was recommended by Associate Editor W. Jiang. (*Corresponding author: Hadi Shahriar Shahhoseini*.)

Nima Amirafshar, Ahmad Sadigh Baroughi, and Hadi Shahriar Shahhoseini are with the School of Electrical Engineering, Iran University of Science and Technology, Tehran 13114-16846, Iran (e-mail: nima\_amirafshar@elec.iust.ac.ir; sadighbaroughi\_a@elec.iust.ac.ir; shahhoseini@iust.ac.ir).

Nima TaheriNejad is with the Institute of Computer Engineering, Heidelberg University, 69117 Heidelberg, Germany, and also with the Institute for Computer Technology (ICT), TU Wien, 1040 Vienna, Austria (e-mail: nima.taherinejad@ziti.uni-heidelberg.de).

Color versions of one or more figures in this article are available at <https://doi.org/10.1109/TCSI.2023.3306071>.

Digital Object Identifier 10.1109/TCSI.2023.3306071

Approximate computing methods can be divided into software, architecture, and circuit levels [9]. We can mention Loop Perforation, Code Perforation, and Inexact Program Versions at the software level [10]. Methods such as Instruction Set Architecture (ISA) Extension [11], Approximate Accelerator, and Approximate Storage are at the architectural level [12]. The circuit level also includes methods such as Voltage and Frequency Scaling. However, one of the most common is Inexact Hardware, which designs inaccurate numerical and logical units [13], [14]. Multiplication is one of the most common numerical operations. Conventional exact multipliers have significant critical path delays and power consumption, which, due to their repeated use in various applications, lead to limited efficiency and increased energy consumption. Hence, approximate multipliers have become popular and improved a wide range of different applications.

This paper presents approximate array multipliers based on carry disregarding. They significantly reduce power consumption, critical path delay, and area compared to exact and existing approximate multipliers. Applications are error tolerant up to a certain level. Hence, the proposed approximate multipliers have different error levels, and some have the highest accuracy compared to previous approximate multipliers. Also, they have created a better balance between accuracy and hardware performance criteria. The key contributions of this paper can be summarized as follows:

- 1) A methodology for designing efficient approximate N-bit array multipliers based on carry disregarding,
- 2) Theoretical analysis of the proposed methodology, providing equations for accuracy criteria for optimal design,
- 3) Designing approximate 8-bit and 16-bit array multipliers with balanced hardware efficiency and accuracy criteria,
- 4) Achieving the most efficient designs (most of which lie on the Pareto frontier) in terms of power, area, power-delay product, and accuracy compared to recent state-of-the-art designs,
- 5) Evaluating proposed designs in image processing.

The rest of this paper is organized as follows: Section II reviews some related works. In Section III, we investigate the background of exact multipliers. Sections IV and V propose our 8-bit approximate multipliers and their extension, respectively, and Section VI theoretically analyzes the accuracy of the proposed multipliers. We describe our experiments and results in Section VII and make a comprehensive multidimensional comparison with recent state-of-the-art approximate multipliers in Section VIII. Section IX demonstrates the applicability of the proposed designs in an image processing application, and Section X concludes the paper.

## II. RELATED WORK

In general, multipliers consist of three main computation stages: Partial Product (PP) generation, PP accumulation (reduction), and a final addition that produces the multiplication result. There are three main architectures for accumulating and reducing PPs: the Carry-Save Adder (CSA) array, the Wallace tree, and the Dadda tree [9]. Tree-based PP accumulation has less delay than array structures. In contrast, tree structures consume more power and area than array structures due to their larger number of computing units and higher complexity [9], [15]. The array-based multiplier has a straightforward, uniform, and modular architecture. Correspondingly, such an architecture mostly has lower power consumption and area; it is also easier to develop and manage than Wallace- and Dadda-based architectures [9], [15].

Designers apply approximation in the three main computing stages of multipliers. One approach is operand truncation, which takes advantage of the fact that not all operand bits are equally significant. Hence, a substantially smaller core multiplier results from choosing only a portion of the operand bits [16], [17]. Alternatively, approximation can be applied during the PP accumulation stage. The fundamental component for accumulating PPs in approximate Wallace and Dadda multipliers is the approximate compressor, which has been studied extensively; see, for example, [18], [19], and [20].

Regarding array multipliers, using approximate adders such as approximate Full-Adders (FAs), Half-Adders (HAs), and compressors is one of the conventional methods for applying approximation in reducing PPs. For example, [21] proposed an imprecise 4:2 compressor and used it in an array structure to design an approximate 8-bit multiplier. Another method is to design a small-scale approximate multiplier and use it recursively to design larger ones. Reference [22] first proposed an approximate  $2 \times 2$  multiplier as a building block and then used it to design larger multipliers recursively. Its main idea is to change the logic function of the 2-bit multiplier so that the only input combination with a 4-bit output (i.e.,  $3 \times 3 = 9$ ) is approximated by a 3-bit output (i.e., 7). As a result, it has a simpler circuit and lacks adders and XOR gates. Reference [23] proposed approximate 2-bit multipliers based on the concept of equating similar output bits and also designed larger multipliers recursively. In exact 2-bit multiplication, the Most-Significant Bit (MSB) and Least-Significant Bit (LSB) differ in only three cases. Hence, [23] removed the logic circuit related to the LSB, connected the MSB to the LSB, and thereby reduced the maximum error magnitude compared to [22].

Another conventional method is PP truncation. Reference [24] proposed the Broken-Array Multiplier (BAM), which truncates several PPs using Vertical Break Level (VBL) and Horizontal Break Level (HBL) parameters. Additionally, [14] proposed a probabilistic analysis-based PP array and used a Propagation and Generation (PG) function to replace numerous PPs with different values. Afterward, some of them were truncated, which shortened the delay. It also uses approximate FAs and HAs for accumulating PPs and designs large multipliers recursively. Carry propagation is the primary limiting factor for array multipliers. Hence, [13] presented an approximate array multiplier in which specific columns of Partial Product Units ( $\Pi_0$ ) disregard all generated carries; our proposed designs share some architectural similarity with [13].

Fig. 1. Conventional exact 8-bit multiplier architecture and logic circuit of partial product unit ( $\Pi_0$ ).

We note that approximate compute units beyond Complementary Metal-Oxide Semiconductor (CMOS) are gaining currency as well. For instance, in [25], the authors have proposed highly-scalable Majority Gates (MGs) based on spin-CMOS technology and used them to design approximate compressors. In [8] and [26], the authors designed various approximate adders for In-Memory Computation (IMC) using memristive stateful logics. However, they are outside the scope of this work, since they require technologies beyond CMOS.

## III. BACKGROUND

Figure 1 shows a conventional 8-bit array multiplier whose main component is the  $\Pi_0$ ; each  $\Pi_0$  consists of an AND gate ( $\Lambda$ ) for single-bit multiplication and a full adder. This multiplier has a significant critical path delay due to the high dependence between  $\Pi_0$ s. An  $n$ -bit array multiplier has  $n^2$  elements. Its critical path has  $3n - 3$  elements, of which  $3n - 4$  are  $\Pi_0$ s and one is a  $\Lambda$ . For simplicity in calculating the critical path delay, we assume that the delays of the  $\Lambda$ , OR, and XOR gates are 1, 1, and 2 cycles, respectively. Hence, the critical path delay is  $12n - 15$  cycles. As the multiplier scale increases (i.e., larger  $n$ ), the total number of cells increases significantly. Thus, the dependence between  $\Pi_0$ s continues over a larger area, leading to a significant increase in critical path delay. For example, in 8-, 16-, 32-, and 64-bit multipliers, the total number of cells is 64, 256, 1024, and 4096, respectively, and the critical path delays are 81, 177, 369, and 753 cycles. Figure 1 shows the critical path of an 8-bit multiplier in red.
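The closed-form delay can be checked numerically. The following is a minimal sketch, assuming (this breakdown is an assumption of the sketch, not stated explicitly above) that each  $\Pi_0$  on the critical path contributes one full-adder delay of 4 cycles and that the single  $\Lambda$  adds 1 cycle. The function names are illustrative.

```python
# Gate delays assumed in the text, in cycles.
AND_DELAY, OR_DELAY, XOR_DELAY = 1, 1, 2

# Assumption of this sketch: a full adder's worst-case path is either
# XOR -> XOR (sum) or XOR -> AND -> OR (carry); both take 4 cycles.
FULL_ADDER_DELAY = max(2 * XOR_DELAY, XOR_DELAY + AND_DELAY + OR_DELAY)


def cell_count(n: int) -> int:
    """Number of cells in an n-bit array multiplier."""
    return n * n


def critical_path_delay(n: int) -> int:
    """Delay of a path through (3n - 4) full-adder stages and one AND gate."""
    return (3 * n - 4) * FULL_ADDER_DELAY + AND_DELAY


for n in (8, 16, 32, 64):
    print(n, cell_count(n), critical_path_delay(n))
```

Under these assumptions the function reduces to  $12n - 15$  and reproduces the cell counts and delays quoted above.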

Fig. 2. The architecture of an exact 8-bit multiplier using two exact  $8 \times 4$  multipliers.

In general, the  $\Pi_0$ s depend on the adjacent  $\Pi_0$ s due to carry and summation inputs. However, the carry has a much more significant effect on increasing the critical path delay. For example, if we disregard all the carries, all the columns of  $\Pi$ s become independent of one another and can operate in parallel. Accordingly, the main idea of the proposed methods in this paper is to disregard carries. Also, we can convert large-scale multiplications into smaller ones by using the distributive property. According to Equation (1), we can convert an 8-bit multiplication to two  $8 \times 4$  multiplications.

$$A \times B = A(BH \times 2^4 + BL) = (A \times BH)2^4 + (A \times BL), \quad (1)$$

where  $A$  and  $B$  are 8 bits, and  $BH$  and  $BL$  are the four most-significant bits and four least-significant bits of  $B$ , respectively. As shown in Figure 2, we can design an 8-bit multiplier using two smaller  $8 \times 4$  multipliers and a Carry Look-ahead Adder (CLA). In this case, the two  $8 \times 4$  multipliers operate independently and in parallel. Eventually, the CLA determines the final result of the 8-bit multiplier. There are two advantages to using two smaller multipliers together. First, due to its smaller scale, each  $8 \times 4$  multiplier has a much smaller critical path delay (49 cycles) than an 8-bit multiplier. Second, the two  $8 \times 4$  multipliers operate in parallel, so their delays overlap. Finally, a CLA significantly reduces the delay due to carry propagation in calculating the sum of the results of the two  $8 \times 4$  multipliers. Therefore, using this structure can improve the critical path delay, which is the basis of the proposed designs in this paper. Nevertheless, there is still dependence between  $\Pi$ s in every  $8 \times 4$  multiplier, and this is the leading cause of delay in the 8-bit multiplier. Therefore, the paper's primary purpose is to use approximate computing and apply appropriate error levels in the calculations to reduce the dependence between  $\Pi$ s and thereby achieve a better balance between accuracy and critical path delay, power consumption, and area.
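The decomposition of Equation (1) can be illustrated in a few lines. This is a behavioral sketch only: plain integer addition stands in for the CLA of Figure 2, and the function and variable names are hypothetical.

```python
def multiply_via_8x4(a: int, b: int) -> int:
    """Compute an 8-bit product from two 8x4 products, as in Equation (1)."""
    assert 0 <= a < 256 and 0 <= b < 256
    bh, bl = b >> 4, b & 0xF      # upper and lower 4 bits of B
    low_product = a * bl          # A x BL, fits in 12 bits
    high_product = a * bh         # A x BH, fits in 12 bits
    # Plain integer addition stands in for the CLA of Figure 2.
    return (high_product << 4) + low_product


print(multiply_via_8x4(200, 123) == 200 * 123)
```

Because the two partial results do not depend on each other, they can be computed concurrently; only the final addition needs both.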

## IV. PROPOSED APPROXIMATE 8-BIT MULTIPLIERS

Fig. 3. Circuits and symbols of (a)  $\Pi_1$ , (b)  $\Pi_2$ , and (c)  $\Pi_3$ .

This paper proposes approximate 8-bit multipliers that are inspired by the multiplier of Figure 2 and based on two  $8 \times 4$  multipliers. In this architecture, the  $8 \times 4$  multipliers operate in parallel; therefore, reducing the critical path delay in the  $8 \times 4$  multipliers significantly reduces the 8-bit multiplier delay. The carry propagation in the  $\Pi$ s of the  $8 \times 4$  multipliers is the leading reason for the strong dependence between  $\Pi$ s and prevents them from operating in parallel, causing a significant delay and an increase in power consumption. Therefore, if we disregard carries in columns of  $\Pi$ s, each column can operate in parallel and independently of the other columns, reducing delay. Also, by disregarding carries, we can use simpler partial product units with less hardware complexity. Figure 3 shows our proposed partial product units. In general, we can divide the design process into two steps. First, we design approximate  $8 \times 4$  multipliers based on disregarding the carry. For each  $8 \times 4$  multiplier, there are many combinations of  $\Pi$  columns in which to disregard the carry, and each creates a new approximate  $8 \times 4$  multiplier. In the second step, we choose two multipliers from the approximate  $8 \times 4$  multipliers of the previous step, for which many combinations are possible too. In the end, we have a large number of approximate 8-bit multipliers, each having a different critical path delay, power consumption, area, and level of accuracy. This paper aims to use approximate computing to trade accuracy for these gains as efficiently as possible.

1) *Proposed Approximate  $8 \times 4$  Multipliers:* Figure 4 shows some of the proposed carry-disregard-based approximate  $8 \times 4$  multipliers. There are nine proposed  $8 \times 4$  multipliers,  $cd_2$  to  $cd_a$ . In all of them, our starting point is to disregard the carries of the second column because the first column has no dependence on the carry and contains only one  $\Lambda$ . Therefore, in all proposed approximate  $8 \times 4$  multipliers, we disregard the carries from the second column up to a specific column. Hence, each name contains a hexadecimal digit that indicates the last column, counting from the second, whose carries we disregard. For example, the  $cd_6$  approximate multiplier (i.e., Figure 4(c)) disregards the carries from Column 2 to Column 6. As a result, in this multiplier, columns 1 to 7 all operate independently and in parallel. As stated earlier, we can use simpler and more efficient units instead of  $\Pi_0$ s since we disregard carries in partial product units. Therefore, we propose three partial product units and use them in our approximate multipliers. Figure 3 shows the logic circuits of the Carry Disregard Partial Product Unit ( $\Pi_1$ ), the Half-adder-Based Partial Product Unit ( $\Pi_2$ ), and the Full-adder-Based Partial Product Unit ( $\Pi_3$ ), which work as follows. The  $\Pi_1$  uses one  $\Lambda$  for single-bit multiplication and one XOR gate for determining the sum of  $S_{in}$  and the  $\Lambda$  output.  $\Pi_1$  has no inputs or outputs for the carry (i.e.,  $C_{in}$  and  $C_{out}$ , respectively) and disregards them. The  $\Pi_2$  has a  $\Lambda$  for single-bit multiplication and a half-adder for calculating the sum of  $S_{in}$  and the  $\Lambda$  output. The  $\Pi_2$  has a  $C_{out}$  output; its only difference from a conventional  $\Pi_0$  is that it disregards the  $C_{in}$  input. The  $\Pi_3$  has two  $\Lambda$ s for single-bit multiplications and a full-adder for determining the sum of  $S_{in}$  and the outputs of the two  $\Lambda$ s. This unit has the  $C_{out}$  output but disregards the  $C_{in}$  input. The  $\Pi_3$  is a combination of two  $\Pi_0$ s, so its use reduces the two carry outputs (i.e., the  $C_{out}$  outputs of the two  $\Pi_0$ s) to one. Therefore, for example, the  $\Pi_2$  in Column 5 of  $cd_3$  (i.e., Figure 4(b)) can operate independently and in parallel with its previous columns.
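The three partial product units can be written as single-bit Boolean functions. The sketch below follows the verbal description above rather than the gate-level circuits of Figure 3; the function names and argument order are illustrative assumptions.

```python
def pi1(a: int, b: int, s_in: int) -> int:
    """Carry-disregard unit: one AND and one XOR, no carry ports."""
    return (a & b) ^ s_in


def pi2(a: int, b: int, s_in: int) -> tuple:
    """Half-adder-based unit: returns (s_out, c_out); there is no C_in."""
    p = a & b
    return p ^ s_in, p & s_in


def pi3(a1: int, b1: int, a2: int, b2: int, s_in: int) -> tuple:
    """Full-adder-based unit: two ANDs feed a full adder together with S_in.

    Returns (s_out, c_out); there is no C_in.
    """
    p1, p2 = a1 & b1, a2 & b2
    s_out = p1 ^ p2 ^ s_in
    c_out = (p1 & p2) | (s_in & (p1 ^ p2))
    return s_out, c_out


# Pi_1 keeps only the parity of its two addends: 1 + 1 yields 0, carry dropped.
print(pi1(1, 1, 1), pi2(1, 1, 1), pi3(1, 1, 1, 1, 1))
```

Here  $\Pi_2$  and  $\Pi_3$  are exact over their own inputs (sum plus twice the carry equals the number of ones added), whereas  $\Pi_1$  loses the carry whenever both of its addends are 1.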

The approximate multiplier  $cd_2$  (i.e., Figure 4(a)) disregards only the carry of Column 2. Therefore, Column 2 has a  $\Pi_1$ , and columns 1 to 3 operate independently and in parallel. As a result, its critical path delay is 41 cycles, less than the delay of the exact  $8 \times 4$  multiplier (i.e., 49 cycles). In all approximate  $8 \times 4$  multipliers in Figure 4, we show the critical path in red. Approximate multipliers  $cd_3$  (i.e., Figure 4(b)),  $cd_4$ ,  $cd_5$ , and  $cd_6$  (i.e., Figure 4(c)) follow a similar procedure in their architecture. The  $cd_3$  disregards all the carries up to Column 3, so all the elements up to this column are of the  $\Pi_1$  type. In Column 4, there is a  $\Pi_3$  and a  $\Pi_2$ . Column 5 also contains a  $\Pi_2$ , and the other elements of this multiplier are  $\Pi_0$ s. The approximate multipliers  $cd_4$ ,  $cd_5$ , and  $cd_6$  disregard the carries up to columns 4, 5, and 6, respectively. The critical path delays of the approximate multipliers  $cd_3$  to  $cd_6$  are 37, 33, 29, and 25 cycles, respectively. We can see that if we disregard the carries in more columns, the critical path delay decreases.

The approximate multiplier  $cd_7$  (i.e., Figure 4(d)) disregards all the carries up to Column 7, and as a result, columns 1 to 8 all operate independently and in parallel. Up to Column 7, its cells are  $\Pi_1$ s, and in Column 8, it has one  $\Pi_3$  and one  $\Pi_2$ . Moreover, since the first partial product unit in Column 9 of  $cd_7$  does not generate any carry, we use a  $\Pi_1$  there. The critical path delay of  $cd_7$  is 21 cycles. The  $cd_8$  (i.e., Figure 4(e)) disregards all the carries up to Column 8; hence, columns 1 to 9 all operate independently and in parallel. Therefore, columns 2 to 8 have  $\Pi_1$ s, and in Column 9, because the first partial product unit does not generate any carry, we use a  $\Pi_1$ . The  $cd_8$  also has two  $\Pi_3$ s and one  $\Pi_2$ , whose  $C_{out}$  outputs we connect to the  $S_{in}$  input of the adjacent cell. Since we disregard the carries in more columns,  $cd_8$  has a smaller critical path delay of 13 cycles. The  $cd_9$  (i.e., Figure 4(f)) disregards all carries up to Column 9, and columns 1 to 10 operate in parallel and independently. The  $cd_9$  has two  $\Pi_2$ s, and we connect the  $C_{out}$  output of the first one to the  $S_{in}$  input of the second one. Its other partial product units are of the  $\Pi_1$  type. The  $cd_a$  disregards all carries up to Column 10, and the partial product unit in Column 11 does not generate a carry. Therefore, all its elements are of the  $\Pi_1$  type; hence, it has the simplest circuit among the proposed approximate  $8 \times 4$  multipliers, which leads to lower power consumption and smaller area. The critical path delay of both  $cd_9$  and  $cd_a$  is 7 cycles, the smallest among all the proposed approximate  $8 \times 4$  multipliers.

2) *Approximate 8-bit Carry Disregard Multiplier (CDM8):* After designing the approximate  $8 \times 4$  multipliers, we now have to choose two multipliers for groups A and B. We must choose them so that the final 8-bit multiplier has the least possible delay and, at the same time, is optimal in terms of power consumption and area. In this architecture, we can divide the factors causing the delay in the 8-bit multiplier into three parts: (i) the delay of the  $8 \times 4$  multiplier of Group A, (ii) the delay of the  $8 \times 4$  multiplier of Group B, and (iii) the delay of the CLA. Accordingly, we can minimize the final delay when the three delays are as small as possible and the units operate in parallel. The two  $8 \times 4$  multipliers of groups A and B are completely independent and operate in parallel. Nevertheless, the CLA depends on the results of both. As a result, the CLA is a limiting factor. So, if we can provide each pair of CLA input bits as quickly as possible, the CLA can begin producing correct results sooner. For example, according to Figure 2, by producing the pair of bits  $SA4$  and  $SB0$  with only a minor delay relative to each other and continuing this process toward the more significant bits, the CLA can operate in parallel with groups A and B. Therefore, to achieve this parallelism, we must choose the  $8 \times 4$  multipliers of groups A and B in a specific way.

Table I shows the approximate 8-bit CDM8<sub>xy</sub> multipliers with different ranges of accuracy, critical path delay, power consumption, and area. The hexadecimal digits  $x$  and  $y$  determine the type of  $8 \times 4$  multipliers of groups A and B (i.e.,  $cd_x$  and  $cd_y$ ), respectively. For example, the CDM8<sub>84</sub> multiplier uses the approximate  $8 \times 4$  multipliers  $cd_8$  and  $cd_4$  in groups A and B, disregarding the carry up to Column 8 and Column 4, respectively. This approximate 8-bit multiplier determines the output of Column 8 of Group A (i.e.,  $SA7$ ) after 13 cycles and the output of Column 4 of Group B (i.e.,  $SB3$ ) after 13 cycles. By contrast, the exact multiplier of Figure 2 obtains the outputs of Column 8 of Group A and Column 4 of Group B after 37 and 21 cycles, respectively, which differ by 16 cycles. Hence, the proposed multiplier CDM8<sub>84</sub> not only produces these two outputs faster but also reduces the difference between them to 0 cycles. Therefore, the CLA can calculate the sum of the two faster. In addition, the CDM8s use the simpler  $\Pi_1$ ,  $\Pi_2$ , and  $\Pi_3$  units, which also reduces power consumption and area.
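The pairing of CLA inputs and the arrival-time skew discussed above can be made concrete with a short sketch. It assumes that both  $8 \times 4$  outputs are 12 bits wide and that Group B is shifted left by four positions, so that  $SB_i$  meets  $SA_{i+4}$ ; the ready times are the four values quoted in the text, and all names are illustrative.

```python
def cla_input_pairs() -> list:
    """(SA index, SB index) pairs where both groups feed the CLA.

    Assumption: SA0..SA11 and SB0..SB11, with Group B weighted by 2**4,
    so SB_i lines up with SA_(i + 4) for i = 0..7.
    """
    return [(i + 4, i) for i in range(8)]


# Ready times in cycles for the pair (SA7, SB3), as quoted in the text.
READY_TIMES = {
    "exact": {"SA7": 37, "SB3": 21},
    "CDM8_84": {"SA7": 13, "SB3": 13},
}


def pair_skew(design: str) -> int:
    """Difference between the arrival times of SA7 and SB3."""
    times = READY_TIMES[design]
    return abs(times["SA7"] - times["SB3"])


print(cla_input_pairs())
print(pair_skew("exact"), pair_skew("CDM8_84"))
```

A smaller skew means the CLA waits less for the later bit of a pair, which is why the Group A and Group B multipliers are chosen together rather than independently.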

## V. EXTENSION OF PROPOSED APPROXIMATE MULTIPLIER

Fig. 4. Proposed approximate  $8 \times 4$  carry disregard multipliers.

TABLE I  
PROPOSED APPROXIMATE CDM8s

| Design  | CDM8_44 | CDM8_50 | CDM8_62 | CDM8_73 | CDM8_74 | CDM8_84 | CDM8_95 | CDM8_a6 | CDM8_a7 | CDM8_a8 | CDM8_a9 | CDM8_aa | CDM8_a |
|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|--------|
| Group A | $cd_4$  | $cd_5$  | $cd_6$  | $cd_7$  | $cd_7$  | $cd_8$  | $cd_9$  | $cd_a$  | $cd_a$  | $cd_a$  | $cd_a$  | $cd_a$  | $cd_a$ |
| Group B | $cd_4$  | Exact   | $cd_2$  | $cd_3$  | $cd_4$  | $cd_4$  | $cd_5$  | $cd_6$  | $cd_7$  | $cd_8$  | $cd_9$  | $cd_a$  | $cd_a$ |

Figure 5 shows the architecture of the proposed approximate 16-bit multipliers. Each multiplier has four approximate 8-bit multipliers and four approximate 8-bit adders. The four 8-bit multipliers are independent and operate in parallel, which reduces the critical path delay of the approximate 16-bit multiplier. Also, each approximate adder is a smaller-scale adder consisting of one approximate 8-bit Carry Look-ahead Adder (CLAx). Using such adders reduces hardware complexity considerably. The only difference from an exact CLA is that a CLAx has no carry output (i.e.,  $C_{out}$ ); as a result, the approximate adders related to the least-significant bits (i.e., CLAx1 and CLAx2) are independent of the approximate adders in the most-significant bits (i.e., CLAx3 and CLAx4). Consequently, their parallel operation improves the delay of the approximate 16-bit multiplier. In general, each proposed approximate  $N$ -bit multiplier ( $N \geq 16$ ) has four smaller-scale  $N/2$ -bit multipliers (i.e., Mul\_1 to Mul\_4) and four smaller-scale  $N/2$ -bit approximate adders. Each  $N/2$ -bit approximate adder is CLA-based, consisting of  $\frac{N}{16} - 1$  exact 8-bit CLAs and one 8-bit CLAx in its most-significant bits. The  $N/2$ -bit approximate adder disregards the final carry; therefore, the approximate adders related to the least-significant bits and the approximate adders in the most-significant bits are independent and operate in parallel.
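The recursive structure can be sketched as follows. The first function restates the adder composition given above. The second shows only the exact recombination of four half-width products; how the four CLAx blocks are wired is defined by Figure 5 and is not modeled here. The assignment of operand halves to Mul\_1 through Mul\_4 follows the column headings of Table II, and the function names are hypothetical.

```python
def adder_composition(n: int) -> tuple:
    """(exact 8-bit CLAs, 8-bit CLAx blocks) in one N/2-bit approximate adder."""
    assert n >= 16 and n % 16 == 0
    return n // 16 - 1, 1


def exact_split_product(a: int, b: int, n: int = 16) -> int:
    """Exact N-bit product rebuilt from four N/2-bit products.

    This is the exact reference only; the approximate design replaces the
    multipliers with CDM8s and the additions with carry-out-free adders.
    """
    half = n // 2
    mask = (1 << half) - 1
    a_low, a_high = a & mask, a >> half
    b_low, b_high = b & mask, b >> half
    mul_1 = a_low * b_low      # weight 2**0
    mul_2 = a_high * b_low     # weight 2**(N/2)
    mul_3 = a_low * b_high     # weight 2**(N/2)
    mul_4 = a_high * b_high    # weight 2**N
    return mul_1 + ((mul_2 + mul_3) << half) + (mul_4 << n)


print(adder_composition(16), adder_composition(32))
print(exact_split_product(0xBEEF, 0x1234) == 0xBEEF * 0x1234)
```

For  $N = 16$  the adder contains no exact CLA and a single CLAx, which matches the 8-bit adders described for the 16-bit design.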

The design process of approximate  $N$ -bit multipliers ( $N \geq 16$ ) is similar, and we can divide it into two steps. In the first step, we design the desired approximate  $N/2$ -bit multipliers. Hence, the starting point is designing the approximate 16-bit multiplier and using it recursively for larger multipliers. For the proposed approximate 16-bit multiplier, we use the CDM8s that we introduced earlier. The CDM8s have a much lower critical path delay than conventional 8-bit multipliers and operate in parallel. Hence, these factors cause a significant reduction in the delay of the 16-bit multiplier. In the second step, we choose four multipliers among the CDM8s. There are many choices for this, each of which creates a new approximate multiplier. Generally, the final delay of the 16-bit multiplier depends on the delay of the four 8-bit multipliers and the delay of the approximate adders (i.e., CLAx). Hence, we see a significant reduction in the delay of the approximate 16-bit multiplier when these units can operate in parallel. Nevertheless, the CLAx cannot operate fully in parallel with the 8-bit multipliers because each CLAx depends on the results of the 8-bit multipliers. However, each CLAx has eight pairs of bits as inputs. Hence, if we can provide the pairs of input bits as quickly as possible, the CLAx can start the summation process sooner. Another essential factor is the time difference between the arrival of the two bits in each pair. In other words, if we can provide the two bits of each pair with a negligible time difference, then the CLAx can start its calculations sooner. The proposed 8-bit multipliers face the same situation regarding their CLAs, which we discussed in Section IV with an example.

Based on this, we propose fourteen approximate 16-bit unsigned multipliers called 16-bit Carry Disregard Multipliers (CDM16), which have different accuracy, delay, power consumption, and area levels. Table II shows the type of 8-bit multipliers in each proposed CDM16. We name each proposed 16-bit multiplier CDM16\_wxyz, in which the hexadecimal digits  $w, x, y$ , and  $z$  indicate the output bit up to which carries are disregarded in multipliers Mul\_1 to Mul\_4, respectively. Therefore, these digits determine the type of CDM8 used for each of them. For example, in CDM16\_b330, we use CDM8\_a8, CDM8\_40, CDM8\_40, and an exact multiplier for Mul\_1 to Mul\_4, respectively. Choosing 8-bit multipliers in this way reduces the delay of the 16-bit multipliers. Also, using CDM8s and CLAx reduces hardware complexity and consequently improves power consumption and area.

Fig. 5. Proposed approximate 16-bit multiplier architecture.

TABLE II  
PROPOSED APPROXIMATE CDM16s

| Design     | Mul_1:<br>$A_{low} B_{low}$ | Mul_2:<br>$A_{high} B_{low}$ | Mul_3:<br>$A_{low} B_{high}$ | Mul_4:<br>$A_{high} B_{high}$ |
|------------|-----------------------------|------------------------------|------------------------------|-------------------------------|
| CDM16_0000 | Exact                       | Exact                        | Exact                        | Exact                         |
| CDM16_3000 | CDM8_40                     | Exact                        | Exact                        | Exact                         |
| CDM16_7000 | CDM8_84                     | Exact                        | Exact                        | Exact                         |
| CDM16_8000 | CDM8_95                     | Exact                        | Exact                        | Exact                         |
| CDM16_8330 | CDM8_95                     | CDM8_40                      | CDM8_40                      | Exact                         |
| CDM16_b330 | CDM8_a8                     | CDM8_40                      | CDM8_40                      | Exact                         |
| CDM16_f770 | CDM8_aa                     | CDM8_84                      | CDM8_84                      | Exact                         |
| CDM16_f880 | CDM8_aa                     | CDM8_95                      | CDM8_95                      | Exact                         |
| CDM16_f883 | CDM8_aa                     | CDM8_95                      | CDM8_95                      | CDM8_40                       |
| CDM16_fbb3 | CDM8_a8                     | CDM8_a8                      | CDM8_a8                      | CDM8_40                       |
| CDM16_fff7 | CDM8_aa                     | CDM8_aa                      | CDM8_aa                      | CDM8_84                       |
| CDM16_fff8 | CDM8_aa                     | CDM8_aa                      | CDM8_aa                      | CDM8_95                       |
| CDM16_fffb | CDM8_aa                     | CDM8_aa                      | CDM8_aa                      | CDM8_a8                       |
| CDM16_ffff | CDM8_aa                     | CDM8_aa                      | CDM8_aa                      | CDM8_aa                       |
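The CDM16 naming scheme amounts to a per-digit lookup. The sketch below is an illustration, not part of the proposed design: the digit-to-CDM8 mapping is read off the CDM16\_b330 example in the text and the unambiguous rows of Table II, it covers only the digits that occur there, and `decode_cdm16` is a hypothetical helper name.

```python
# Digit -> 8-bit multiplier type, as inferred from the text and Table II.
# Only the digits that actually occur in the proposed designs are listed.
DIGIT_TO_CDM8 = {
    "0": "Exact",
    "3": "CDM8_40",
    "7": "CDM8_84",
    "8": "CDM8_95",
    "b": "CDM8_a8",
    "f": "CDM8_aa",
}


def decode_cdm16(name: str) -> list:
    """Return the multiplier types for Mul_1..Mul_4 of a 'CDM16_wxyz' name."""
    digits = name.split("_")[1]
    assert len(digits) == 4
    # Unknown digits raise KeyError rather than being guessed.
    return [DIGIT_TO_CDM8[d] for d in digits]


print(decode_cdm16("CDM16_b330"))
```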

## VI. ACCURACY ANALYSIS

### A. Accuracy Criteria

Accuracy criteria indicate the error level and, consequently, the quality of the outputs. Several criteria have been proposed to assess the accuracy of approximate arithmetic circuits [18], [19], [27], [28]. Let us assume that  $\Omega_i^P$  is the output of the exact multiplier that calculates the multiplication of  $N$ -bit operands  $A_i$  and  $B_i$  accurately (i.e.,  $\Omega_i^P = A_i B_i$ ). Also, assume that  $\Omega_i^X$  is the output of the approximate multiplier, which may contain an error. Table III shows all the accuracy criteria we used in this paper. Error ( $E_i$ ) and Error Distance ( $ED_i$ ) are the primary criteria from which the other accuracy criteria are calculated. All the proposed approximate multipliers are carry-disregard-based. Therefore, their output is never larger than the output of the exact multiplier for any input combination. Consequently, in our approximate multipliers,  $E_i$  and  $ED_i$  are equal.

### B. CDM8 Accuracy Analysis

Our purpose in this section is to provide an equation for $ED_i$ that determines the error distance for all possible input combinations of any CDM8 design. Once $ED_i$ is known, the other accuracy criteria follow easily. Each CDM8\_xy multiplier has two $8 \times 4$ multipliers, one in Group A and one in Group B (i.e., $cd_x$ and $cd_y$), and an exact CLA sums their outputs. Therefore, according to Equation (2), $ED_i$ for each CDM8\_xy (i.e., $ED_{xy}^8$) is equal to the sum of the error distances of the two $8 \times 4$ multipliers (i.e., $ED_{cd_x}$ and $ED_{cd_y}$), each multiplied by its weight. First, we determine the $ED_{cd_x}$ of the Group A $8 \times 4$ multiplier. The $cd_x$ contains 11 columns and ignores all the carries from Column 2 to Column $x$. Therefore, in the first step, for each combination of inputs, we determine the number of carries that Column 2 to Column $x$ generate independently (i.e., without propagating the carries from one column to the adjacent column). In the second step, for each column, we multiply the number of generated carries (i.e., $C_k(A_i, B_i)$) by the column weight and sum the results. Hence, Equation (3) calculates $ED_{cd_x}$.

$$ED_{xy}^8 = ED_{cd_x} + 2^4 \times ED_{cd_y} \quad (2)$$

$$ED_{cd_x}(A_i, B_i) = \sum_{k=0}^x 2^k \times C_k(A_i, B_i) \quad (3)$$

In Equation (3), the $k = 0$ term corresponds to an $8 \times 4$ multiplier that does not disregard any carries and thus works accurately; hence, $C_0(A_i, B_i)$ is 0. We now determine the number of carries generated by each column for each input combination (i.e., $C_k(A_i, B_i)$). Each pair of logical ones in a column generates a carry independently. Therefore, the number of carries generated in a column equals the integer part of the number of logical ones divided by two. That is,

$$C_k(A_i, B_i) = \left\lfloor \frac{L_{1-k}(A_i, B_i)}{2} \right\rfloor, \quad (4)$$

where $L_{1-k}(A_i, B_i)$ is the number of logical ones in column $k$ for the inputs $A_i$ and $B_i$. Summing the partial-product bits (logical ones and logical zeros) in a column gives the number of logical ones in that column. Based on this, Equation (5) determines the number of carries each column generates for each input combination.

$$C_k(A_i, B_i) = \begin{cases} 0 & k = 0, 1 \\ \left\lfloor \frac{1}{2} \sum_{n=0}^{k-1} a_{k-n-1} b_n \right\rfloor & 2 \leq k \leq 4 \\ \left\lfloor \frac{1}{2} \sum_{n=0}^3 a_{k-n-1} b_n \right\rfloor & 5 \leq k \leq 8 \\ \left\lfloor \frac{1}{2} \sum_{n=k-8}^3 a_{k-n-1} b_n \right\rfloor & 9 \leq k \leq 11 \end{cases} \quad (5)$$

where $k$ is the column number of the $cd_x$ multiplier, and $a_0 - a_7$ and $b_0 - b_3$ are the input bits of $A_i$ and $B_i$, respectively. The calculation of $ED_{cd_y}$ is entirely analogous to that of $ED_{cd_x}$, so we can determine the number of carries generated in each column of the $cd_y$ (i.e., $C_k^*(A_i, B_i)$) according to Equation (6).

$$C_k^*(A_i, B_i) = \begin{cases} 0 & k = 0, 1 \\ \left\lfloor \frac{1}{2} \sum_{n=0}^{k-1} a_{k-n-1} b_{n+4} \right\rfloor & 2 \leq k \leq 4 \\ \left\lfloor \frac{1}{2} \sum_{n=0}^3 a_{k-n-1} b_{n+4} \right\rfloor & 5 \leq k \leq 8 \\ \left\lfloor \frac{1}{2} \sum_{n=k-8}^3 a_{k-n-1} b_{n+4} \right\rfloor & 9 \leq k \leq 11 \end{cases} \quad (6)$$

TABLE III  
ACCURACY CRITERIA

| Accuracy Criteria                     | Equation                                          | Description                                                                                                                                                                                                         |
|---------------------------------------|---------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Error ( $E_i$ )                       | $E_i = \Omega_i^P - \Omega_i^X$                   | Refers to the difference between the exact multiplier's output and the approximate multiplier's output [28].                                                                                                        |
| Error Distance ( $ED_i$ )             | $ED_i =  E_i  =  \Omega_i^P - \Omega_i^X $        | Represents the absolute value of $E_i$ [19], [20], [28], [29].                                                                                                                                                      |
| Mean Error Distance (MED)             | $\frac{1}{2^{2N}} \sum_{i=1}^{2^{2N}} ED_i$       | Refers to the average of $ED_i$ for all possible input combinations [19], [29].                                                                                                                                     |
| Relative Error Distance ( $RED_i$ )   | $ED_i/\Omega_i^P \quad \forall \Omega_i^P \neq 0$ | Refers to the ratio of $ED_i$ to the output of the corresponding exact multiplier that is not equal to zero [20], [28], [29].                                                                                       |
| Mean Relative Error Distance (MRED)   | $\frac{1}{2^{2N}} \sum_{i=1}^{2^{2N}} RED_i$      | Refers to the average of $RED_i$ for all possible input combinations [19], [20], [28], [29].                                                                                                                        |
| Normalized Mean Error Distance (NMED) | $MED/(2^N - 1)^2$                                 | Refers to the average of $ED_i$ divided by the largest exact multiplier output, i.e., $(2^N - 1)^2$ [19], [20], [28], [29].                                                                                         |
| Probability of Correctness (PC)       | $\#\Omega_C/2^{2N}$                               | Refers to the ratio of the number of correct outputs of the approximate multiplier to the total number of possible outputs. $\#\Omega_C$ is the number of correct outputs of the approximate multiplier [20], [28]. |
| Number of Effective Bits (NoEB)       | $2N - \log_2(1 + \sqrt{MSE})$                     | $MSE$ is the mean squared error obtained as the average of $ED_i^2$ for all possible input combinations, i.e., $MSE = \frac{1}{2^{2N}} \sum_{i=1}^{2^{2N}} ED_i^2$ [20], [28].                                      |

where $k$ is the column number of the $cd_y$ multiplier, and $a_0 - a_7$ and $b_4 - b_7$ are the input bits of $A_i$ and $B_i$, respectively. As a result, we can calculate $ED_{cd_y}$ according to Equation (7) for the inputs $A_i$ and $B_i$. Consequently, we can calculate $ED_{xy}^8$ according to Equation (8).

$$ED_{cd_y}(A_i, B_i) = \sum_{k=0}^y 2^k \times C_k^*(A_i, B_i) \quad (7)$$

$$ED_{xy}^8(A_i, B_i) = \sum_{k=0}^x 2^k \times C_k(A_i, B_i) + 2^4 \times \sum_{k=0}^y 2^k \times C_k^*(A_i, B_i) \quad (8)$$
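Equations (3) to (8) can be sketched directly in Python. The sketch below is illustrative only and rests on one assumption that the text does not state explicitly: the suffix digits $x$ and $y$ of a CDM8\_xy name are read as hexadecimal digits (so `a` means 10). All function names are hypothetical. Under that assumption, the exhaustive mean of Equation (8) gives 14.25 for CDM8\_51 and 97.75 for CDM8\_44, which agrees with the MED column of Table IV.

```python
def column_carries(k: int, a: int, b4: int) -> int:
    """C_k of Equations (5)/(6): carries generated by column k of an 8x4 multiplier.

    a is the 8-bit operand; b4 is the 4-bit slice of B that feeds this group.
    """
    if k < 2:
        return 0
    ones = 0
    for n in range(4):
        m = k - n - 1                 # index of the a-bit paired with bit n of b4
        if 0 <= m <= 7:               # keeps only the terms inside the summation bounds
            ones += ((a >> m) & 1) & ((b4 >> n) & 1)
    return ones // 2                  # Equation (4)


def ed_8x4(limit: int, a: int, b4: int) -> int:
    """Equations (3)/(7): weighted sum of the carries disregarded up to column `limit`."""
    return sum((1 << k) * column_carries(k, a, b4) for k in range(limit + 1))


def ed_cdm8(x: int, y: int, a: int, b: int) -> int:
    """Equation (8): error distance of CDM8_xy for the 8-bit inputs a and b."""
    return ed_8x4(x, a, b & 0xF) + (1 << 4) * ed_8x4(y, a, b >> 4)


def med_cdm8(x: int, y: int) -> float:
    """Mean error distance over all 2^16 input combinations."""
    total = sum(ed_cdm8(x, y, a, b) for a in range(256) for b in range(256))
    return total / 2 ** 16
```

For example, `med_cdm8(0x5, 0x1)` evaluates the CDM8\_51 configuration, and `ed_cdm8(0xA, 0xA, a, b)` gives the per-input error distance of CDM8\_aa under the same naming assumption.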

### C. CDM($N$-bit) Accuracy Analysis

This section aims to provide an equation for $ED_i$ for all possible input combinations of any CDM($N$-bit) design (for $N \geq 16$). We first determine the $ED_i$ equation for any CDM16 design and then generalize it to CDM($N$-bit) multipliers. In general, the $ED_i$ of each CDM16 results from the error distances of the four CDM8s and of the four CLAxs. Therefore, the $ED_i$ of each CDM16 is equal to the sum of the error distances of the CDM8s and of the CLAxs, each multiplied by its corresponding weight. That is,

$$\begin{aligned} ED^{16} &= ED_{X1Y1}^8 + 2^8(ED_{X2Y2}^8) + 2^8(ED_{X3Y3}^8) \\ &\quad + 2^{16}(ED_{X4Y4}^8) + 2^8(ED_{CLAx\_L}) \\ &\quad + 2^{16}(ED_{CLAx\_M}), \end{aligned} \quad (9)$$

where $ED_{XiYi}^8$ is the error distance of each CDM8, $ED_{CLAx\_L}$ is the error distance of the approximate adders in the least-significant part of the CDM16 (i.e., CLAx1 and CLAx2), and $ED_{CLAx\_M}$ is the error distance of the approximate adders in the most-significant part of the CDM16 (i.e., CLAx3 and CLAx4). Therefore, we first have to calculate $ED_{XiYi}^8$ for each approximate 8-bit multiplier, which is possible using Equation (8). Next, we have to obtain the error distances of the approximate adders (i.e., $ED_{CLAx\_L}$ and $ED_{CLAx\_M}$). To do so, we must determine the least-significant and most-significant parts of the outputs of MUL\_1 to MUL\_4. The outputs of these four multipliers can be calculated by Equations (10) to (13). According to Equation (14), we can separate their most-significant and least-significant parts (i.e., $R_{m_i\_H}$ and $R_{m_i\_L}$), where $r_{i\_0}$ to $r_{i\_15}$ are the output bits of MUL\_i. As a result, according to Equation (15), the integer part of the output of each multiplier divided by $2^8$ is the most-significant part of that output. Having obtained the most-significant part, we can easily calculate the least-significant part using Equation (16).

$$MUL\_1 : R_{m_1} = A_L B_L - ED_{X1Y1}^8(A_L, B_L) \quad (10)$$

$$MUL\_2 : R_{m_2} = A_H B_L - ED_{X2Y2}^8(A_H, B_L) \quad (11)$$

$$MUL\_3 : R_{m_3} = A_L B_H - ED_{X3Y3}^8(A_L, B_H) \quad (12)$$

$$MUL\_4 : R_{m_4} = A_H B_H - ED_{X4Y4}^8(A_H, B_H) \quad (13)$$

$$\begin{aligned} R_{m_i} &= \sum_{n=0}^{15} 2^n \times r_{i\_n} = 2^8 \times R_{m_i\_H} + R_{m_i\_L} \\ &= 2^8 \sum_{n=8}^{15} 2^{n-8} \times r_{i\_n} + \sum_{n=0}^7 2^n \times r_{i\_n} \end{aligned} \quad (14)$$

$$R_{m_i\_H} = \left\lfloor \frac{R_{m_i}}{2^8} \right\rfloor \quad (15)$$

$$R_{m_i\_L} = R_{m_i} - 2^8 \times R_{m_i\_H} \quad (16)$$

The CLAxs work like an exact 8-bit CLA, except that they disregard the last generated carry, which has a weight of $2^8$. Hence, to calculate $ED_{CLAx\_L}$, we need to obtain the number of carries that CLAx1 and CLAx2 disregard. According to Equation (17), the integer part of the sum of $R_{m_1\_H}$, $R_{m_2\_L}$, and $R_{m_3\_L}$ divided by the weight of the last generated carry (i.e., $2^8$) is the number of carries that CLAx1 and CLAx2 disregard. In the same way, according to Equation (18), we can get the number of carries that CLAx3 and CLAx4 disregard.

$$N_{CLAx\_L}(A_i, B_i) = \left\lfloor \frac{R_{m_1\_H} + R_{m_2\_L} + R_{m_3\_L}}{2^8} \right\rfloor \quad (17)$$

$$N_{CLAx\_M}(A_i, B_i) = \left\lfloor \frac{R_{m_2\_H} + R_{m_3\_H} + R_{m_4\_L}}{2^8} \right\rfloor \quad (18)$$

Hence, by Equation (19), we can calculate $ED_{CLAx\_L}$: it is equal to the number of carries that CLAx1 and CLAx2 disregard (i.e., $N_{CLAx\_L}$) multiplied by the weight of those carries (i.e., $2^8$). Similarly, Equation (20) calculates $ED_{CLAx\_M}$. Finally, using Equation (21), we can calculate the error distance of each CDM16. The $ED_i$ of each CDM($N$-bit) (for $N \geq 16$) is equal to the sum of the error distances of MUL\_1 to MUL\_4 and of the approximate adders, each multiplied by its corresponding weight. The weights corresponding to MUL\_1 to MUL\_4 are 1, $2^{N/2}$, $2^{N/2}$, and $2^N$, respectively, and the weights corresponding to the approximate adders in the least-significant and most-significant parts are $2^{N/2}$ and $2^N$, respectively. According to Equations (22) and (23), we can determine the most-significant and least-significant parts of the outputs of MUL\_1 to MUL\_4. As for the 16-bit multipliers, we can then obtain the error distance of each CDM($N$-bit). Hence, Equation (24) calculates $ED^N$ for $N \geq 16$.

$$ED_{CLAx\_L}(A_i, B_i) = 2^8 \times N_{CLAx\_L}(A_i, B_i) \quad (19)$$

$$ED_{CLAx\_M}(A_i, B_i) = 2^8 \times N_{CLAx\_M}(A_i, B_i) \quad (20)$$

$$\begin{aligned} ED^{16} &= ED_{X1Y1}^8 + 2^8(ED_{X2Y2}^8 + ED_{X3Y3}^8) + 2^{16}(ED_{X4Y4}^8) \\ &\quad + 2^{16}\left(\left\lfloor \frac{\left\lfloor \frac{R_{m_1}}{2^8} \right\rfloor + R_{m_2} - 2^8\left\lfloor \frac{R_{m_2}}{2^8} \right\rfloor + R_{m_3} - 2^8\left\lfloor \frac{R_{m_3}}{2^8} \right\rfloor}{2^8} \right\rfloor\right) \\ &\quad + 2^{24}\left(\left\lfloor \frac{\left\lfloor \frac{R_{m_2}}{2^8} \right\rfloor + \left\lfloor \frac{R_{m_3}}{2^8} \right\rfloor + R_{m_4} - 2^8\left\lfloor \frac{R_{m_4}}{2^8} \right\rfloor}{2^8} \right\rfloor\right) \end{aligned} \quad (21)$$

$$R_{m_i\_H} = \left\lfloor \frac{R_{m_i}}{2^{N/2}} \right\rfloor \quad (22)$$

$$R_{m_i\_L} = R_{m_i} - 2^{N/2} \times R_{m_i\_H} \quad (23)$$

$$\begin{aligned} ED^N &= ED_{MUL1} + 2^{\frac{N}{2}}(ED_{MUL2} + ED_{MUL3}) + 2^N(ED_{MUL4}) \\ &\quad + 2^N\left(\left\lfloor \frac{\left\lfloor \frac{R_{m_1}}{2^{\frac{N}{2}}} \right\rfloor + R_{m_2} - 2^{\frac{N}{2}}\left\lfloor \frac{R_{m_2}}{2^{\frac{N}{2}}} \right\rfloor + R_{m_3} - 2^{\frac{N}{2}}\left\lfloor \frac{R_{m_3}}{2^{\frac{N}{2}}} \right\rfloor}{2^{\frac{N}{2}}} \right\rfloor\right) \\ &\quad + 2^{\frac{3N}{2}}\left(\left\lfloor \frac{\left\lfloor \frac{R_{m_2}}{2^{\frac{N}{2}}} \right\rfloor + \left\lfloor \frac{R_{m_3}}{2^{\frac{N}{2}}} \right\rfloor + R_{m_4} - 2^{\frac{N}{2}}\left\lfloor \frac{R_{m_4}}{2^{\frac{N}{2}}} \right\rfloor}{2^{\frac{N}{2}}} \right\rfloor\right) \end{aligned} \quad (24)$$
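The evaluation of Equation (24), and of Equation (21) as its $N = 16$ case, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the `sub_ed` argument (four callables that return the error distance of MUL\_1 to MUL\_4 for their operand pair) are hypothetical, and the operand-to-multiplier mapping follows Equations (10) to (13).

```python
from typing import Callable, Sequence, Tuple

ErrorFn = Callable[[int, int], int]   # error distance of one sub-multiplier


def split_high_low(r: int, half: int) -> Tuple[int, int]:
    """Equations (22)-(23): most- and least-significant halves of a product."""
    high = r >> half                  # floor(r / 2^(N/2))
    low = r - (high << half)
    return high, low


def ed_cdm(a: int, b: int, sub_ed: Sequence[ErrorFn], n_bits: int = 16) -> int:
    """Equation (24); with n_bits = 16 this is Equation (21)."""
    half = n_bits // 2
    mask = (1 << half) - 1
    a_l, a_h = a & mask, a >> half
    b_l, b_h = b & mask, b >> half
    operands = [(a_l, b_l), (a_h, b_l), (a_l, b_h), (a_h, b_h)]   # MUL_1..MUL_4
    eds = [f(p, q) for f, (p, q) in zip(sub_ed, operands)]
    # Equations (10)-(13): approximate outputs of the four sub-multipliers
    outputs = [p * q - e for (p, q), e in zip(operands, eds)]
    (h1, _), (h2, l2), (h3, l3), (_, l4) = (split_high_low(v, half) for v in outputs)
    n_low = (h1 + l2 + l3) >> half    # Equation (17): carries dropped by CLAx1/CLAx2
    n_mid = (h2 + h3 + l4) >> half    # Equation (18): carries dropped by CLAx3/CLAx4
    return (eds[0]
            + ((eds[1] + eds[2]) << half)
            + (eds[3] << n_bits)
            + (n_low << n_bits)
            + (n_mid << (3 * half)))
```

With four exact sub-multipliers (`lambda p, q: 0`), only the adder terms remain, which isolates the contribution of the CLAxs to the total error distance.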

## VII. EXPERIMENTS AND RESULTS

### A. Hardware Efficiency Criteria

Critical path delay, power consumption, and area are the main criteria for hardware evaluation and analysis. Considering these criteria jointly is also important. Therefore, we consider the Power Delay Product (PDP), Power Area Delay Product (PADP), Power Delay Error Product (PDEP), and Power Area Delay Error Product (PADEP) for a more comprehensive evaluation and analysis of the different designs. In the PDEP and PADEP criteria, we use MRED as the error.
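The combined criteria are simple products of the primary ones. The sketch below shows one way to compute them, together with the relative improvement over the exact design; the function names are hypothetical, and the units (µW, ns, µm²) follow Tables IV and V.

```python
from typing import Dict


def combined_metrics(power_uw: float, delay_ns: float, area_um2: float, mred: float) -> Dict[str, float]:
    """PDP, PADP, PDEP, and PADEP from the primary criteria, with MRED as the error."""
    pdp = power_uw * delay_ns         # uW x ns = fJ
    padp = pdp * area_um2
    return {"PDP": pdp, "PADP": padp, "PDEP": pdp * mred, "PADEP": padp * mred}


def improvement_percent(exact: float, approx: float) -> float:
    """Relative reduction of a cost metric with respect to the exact design, in percent."""
    return 100.0 * (exact - approx) / exact


# Example with the CDM8_51 row of Table IV (power, delay, area, MRED).
cdm8_51 = combined_metrics(78.071, 0.641, 269.2, 0.0039)
```

Because MRED is tabulated with only four decimal places, products that involve it can differ slightly in the last digits from the tabulated PDEP and PADEP values.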

### B. Experimental Setups

We used Verilog HDL to describe the proposed approximate multipliers and the Xilinx ISE Design Suite to verify them. The Genus Synthesis Solution was then used to synthesize the proposed designs with the 45-nm NanGate technology. Afterward, we analyzed the three primary hardware efficiency criteria: critical path delay, power consumption, and area. To evaluate the accuracy, we computed all the accuracy criteria listed in Section VI-A in Python for the proposed 8-bit and 16-bit multipliers over all possible input combinations (i.e., $2^{16}$ and $2^{32}$, respectively).

### C. Results

Table IV shows the results of our evaluations regarding the hardware efficiency and accuracy criteria for the proposed unsigned 8-bit multipliers. Compared to the exact unsigned 8-bit multiplier (denoted as “Exact8” in Table IV), CDM8s improve the essential hardware efficiency criteria, i.e., critical path delay, power consumption, and area, by 29%, 29%, and 30% on average, respectively. Among all proposed CDM8s, CDM8\_51 has the highest delay, power consumption, and area (improved by 15.7%, 8.8%, and 10.4%, respectively, compared to the Exact8). However, it has the lowest MRED, which is 0.0039. The lowest power consumption and area belong to CDM8\_aa (improved by 51.5% and 48.1%, respectively, compared to the Exact8). Nevertheless, it has the highest MRED among all the CDM8s, which is 0.2145. Meanwhile, CDM8\_95, with an MRED of 0.0518, has the lowest critical path delay, which is 36% lower than that of the exact multiplier, and it also reduces power and area significantly, by 31.9% and 33.3%, respectively.

The PDP indicates the energy efficiency, and the PADP evaluates the energy efficiency and area together. CDM8s improve them by 48.1% and 64.8% on average, respectively. CDM8\_aa has the best PDP and PADP, mainly because its power consumption and area are the lowest among all CDM8s. The PDEP evaluates the energy efficiency and MRED together, and the PADEP, the product of PDP, area, and MRED, evaluates all these criteria together. CDM8\_51 has the lowest PDEP and PADEP, mainly due to its very low MRED compared to all other CDM8s. Also, the CDM8s produce a completely accurate result for 22.26% to 59.57% of the input combinations (PC in Table IV), and their errors are comparatively small in the cases with inaccurate results.

Table V shows the results of our evaluations for the proposed unsigned 16-bit multipliers. Compared to the exact unsigned 16-bit multiplier (denoted as “Exact16” in Table V), CDM16s improve the critical path delay, power consumption, and area by 35%, 24%, and 23% on average, respectively. Among all CDM16s, CDM16\_0000, CDM16\_3000, CDM16\_7000, and CDM16\_8000 have the highest delay, power consumption,

TABLE IV  
HARDWARE EFFICIENCY AND ACCURACY CRITERIA OF THE PROPOSED 8-BIT MULTIPLIERS AND COMPARABLE LITERATURE

| Proposed Method | Hardware efficiency criteria |                        |                      |        |             |       |         | Accuracy criteria |         |        |       |        |
|-----------------|------------------------------|------------------------|----------------------|--------|-------------|-------|---------|-------------------|---------|--------|-------|--------|
|                 | Area( $\mu\text{m}^2$ )      | Power( $\mu\text{W}$ ) | Delay( $\text{ns}$ ) | PDP    | PADP        | PDEP  | PADEP   | MED               | NMED    | MRED   | NoEB  | PC (%) |
| Exact8 *        | 300.6                        | 85.61                  | 0.76                 | 65.064 | 19558       | 0     | 0       | 0                 | 0       | 0      | 16    | 100    |
| CDM8_44         | 257.7                        | 76.777                 | 0.65                 | 49.905 | 12860       | 0.653 | 168.344 | 97.75             | 0.00150 | 0.0133 | 8.49  | 52.9   |
| CDM8_51         | 269.2                        | 78.071                 | 0.641                | 50.043 | 13472       | 0.195 | 52.405  | 14.25             | 0.00022 | 0.0039 | 11.29 | 59.57  |
| CDM8_62         | 253.2                        | 74.851                 | 0.584                | 43.713 | 11068       | 0.349 | 88.434  | 35.25             | 0.00054 | 0.0081 | 10.10 | 51.1   |
| CDM8_73         | 237.5                        | 70.826                 | 0.563                | 39.875 | 9470        | 0.644 | 152.946 | 89.25             | 0.00137 | 0.0164 | 8.87  | 42.52  |
| CDM8_74         | 227.7                        | 66.27                  | 0.529                | 35.057 | 7982        | 0.834 | 189.982 | 157.25            | 0.00242 | 0.0241 | 8.08  | 37.09  |
| CDM8_84         | 216.3                        | 62.704                 | 0.522                | 32.731 | 7080        | 1.030 | 222.873 | 225.25            | 0.00346 | 0.0319 | 7.63  | 33.58  |
| CDM8_95         | 200.6                        | 58.337                 | <b>0.486</b>         | 28.352 | 5687        | 1.448 | 290.511 | 441.25            | 0.00678 | 0.0518 | 6.66  | 29.05  |
| CDM8_a6         | 193.1                        | 55.601                 | 0.502                | 27.912 | 5390        | 2.108 | 407.034 | 777.25            | 0.01195 | 0.0766 | 5.81  | 26.40  |
| CDM8_a7         | 183.8                        | 52.056                 | 0.502                | 26.132 | 4803        | 2.793 | 513.305 | 1321.25           | 0.02032 | 0.1084 | 5.00  | 24.63  |
| CDM8_a8         | 171                          | 48.724                 | 0.502                | 24.459 | 4182        | 3.731 | 638.008 | 2409.25           | 0.03705 | 0.1548 | 4.09  | 23.06  |
| CDM8_a9         | 166.2                        | 47.57                  | 0.505                | 24.023 | 3992        | 4.581 | 761.388 | 3689.25           | 0.05673 | 0.1936 | 3.36  | 22.43  |
| CDM8_aa         | <b>156.1</b>                 | 41.548                 | 0.548                | 22.768 | <b>3554</b> | 4.809 | 750.668 | 4713.25           | 0.07248 | 0.2145 | 2.85  | 22.26  |

Other approximate 8-bit array multipliers in the literature

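| Method | Area( $\mu\text{m}^2$ ) | Power( $\mu\text{W}$ ) | Delay( $\text{ns}$ ) | PDP | PADP | PDEP | PADEP | MED | NMED | MRED | NoEB | PC (%) |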
|------|-------|-----------|------|-------------|-------|--------------|--------------|-------------|----------------|---------------|-----------------|-----------------|
| [14] | 278.7 | 82.5      | 0.65 | 53.6        | 14955 | <b>0.096</b> | <b>26.77</b> | <b>5.75</b> | <b>0.00008</b> | <b>0.0018</b> | <b>12.42</b>    | <b>67.58</b>    |
| [15] | 301.6 | 143.5     | 0.77 | 110.5       | 33340 | 3.13         | 943.52       | 397.95      | 0.00612        | 0.0283        | NR <sup>†</sup> | 30.27           |
| [30] | 217.3 | <b>40</b> | 0.53 | <b>21.2</b> | 4607  | 1.71         | 371.76       | 578.72      | 0.0089         | 0.0807        | NR <sup>†</sup> | NR <sup>†</sup> |
| [25] | 349.3 | 50.6      | 0.58 | 29.3        | 10251 | 7.42         | 2590.49      | 1664.64     | 0.0256         | 0.2527        | NR <sup>†</sup> | NR <sup>†</sup> |

\* Exact unsigned 8-bit multiplier (using two exact 8×4 multipliers).

<sup>†</sup> Not Reported.

TABLE V  
HARDWARE EFFICIENCY AND ACCURACY CRITERIA OF THE PROPOSED 16-BIT MULTIPLIERS

| Method     | Hardware efficiency criteria |                        |                      |            |               |             |             | Accuracy criteria |                 |                 |             |              |
|------------|------------------------------|------------------------|----------------------|------------|---------------|-------------|-------------|-------------------|-----------------|-----------------|-------------|--------------|
|            | Area( $\mu\text{m}^2$ )      | Power( $\mu\text{W}$ ) | Delay( $\text{ns}$ ) | PDP        | PADP          | PDEP        | PADEP       | MED               | NMED            | MRED            | NoEB        | PC (%)       |
| Exact16 *  | 1348.4                       | 501.4                  | 1.35                 | 677        | 912718        | 0           | 0           | 0                 | 0               | 0               | 32          | 100          |
| CDM16_0000 | 1318.6                       | 479.9                  | 1.03                 | 494        | 651714        | 4.84        | 6390        | <b>8222699</b>    | <b>0.001914</b> | <b>0.009806</b> | <b>7.98</b> | <b>19.61</b> |
| CDM16_3000 | 1298.9                       | 473.3                  | 1.02                 | 486        | 631316        | 4.76        | 6190        | 8222699           | 0.001914        | 0.009806        | 7.98        | 13.85        |
| CDM16_7000 | 1237.7                       | 453.1                  | 1.02                 | 464        | 575408        | 4.55        | 5642        | 8222699           | 0.001914        | 0.009806        | 7.98        | 8.19         |
| CDM16_8000 | 1225.2                       | 448.2                  | 1.02                 | 459        | 563364        | 4.50        | 5524        | 8222699           | 0.001914        | 0.009806        | 7.98        | 7.55         |
| CDM16_8330 | 1183.9                       | 436.9                  | 0.94                 | 411        | 486725        | 4.03        | 4777        | 8222766           | 0.001914        | 0.009815        | 7.98        | 5.12         |
| CDM16_b330 | 1155.5                       | 427.3                  | 0.94                 | 402        | 464574        | 3.94        | 4560        | 8222766           | 0.001914        | 0.009816        | 7.98        | 4.49         |
| CDM16_f770 | 1022.5                       | 375.9                  | 0.79                 | 300        | 306769        | 2.99        | 3059        | 8223980           | 0.001914        | 0.009974        | 7.98        | 2.04         |
| CDM16_f880 | 990.6                        | 365.1                  | <b>0.77</b>          | 284        | 281707        | <b>2.86</b> | 2842        | 8225459           | 0.001915        | 0.010089        | 7.98        | 1.81         |
| CDM16_f883 | 970.6                        | 359.3                  | 0.78                 | 280        | 272696        | 3.08        | 2993        | 8609497           | 0.002           | 0.010976        | 7.95        | 1.63         |
| CDM16_fbb3 | 905.7                        | 335.1                  | <b>0.77</b>          | 259        | 234914        | 2.97        | <b>2692</b> | 8610775           | 0.002           | 0.011462        | 7.95        | 1.33         |
| CDM16_ff77 | 825.1                        | 306.0                  | 0.81                 | 249        | 205814        | 6.69        | 5525        | 23076674          | 0.005373        | 0.026846        | 6.66        | 1.07         |
| CDM16_fff8 | 809.2                        | 301.8                  | 0.81                 | 245        | 199047        | 8.70        | 7047        | 36831042          | 0.008575        | 0.035407        | 5.92        | 1.04         |
| CDM16_fffb | 773.3                        | <b>289.0</b>           | 0.81                 | <b>235</b> | 182165        | 18.44       | 14262       | 156630850         | 0.036469        | 0.078296        | 3.59        | 1.00         |
| CDM16_ffff | <b>767.9</b>                 | <b>289.0</b>           | 0.81                 | <b>235</b> | <b>180910</b> | 25.39       | 19501       | 307625794         | 0.071626        | 0.107794        | 2.35        | 0.99         |

\* Exact 16-bit multiplier (using four exact 8-bit multipliers).

area, PDP, and PADP, but have the lowest MRED, which is 0.009806. Nevertheless, compared to the Exact16, they improve the delay, power consumption, area, PDP, and PADP by 24%, 8%, 6%, 30%, and 34%, on average. CDM16\_ffff and CDM16\_fffb have the lowest power consumption, area, PDP, and PADP, but have the highest MRED among all CDM16s. Compared to the Exact16, they improve these hardware criteria, along with the delay, by 42%, 43%, 65%, 60%, and 40%, on average. CDM16\_f880 and CDM16\_fbb3 have the lowest delay among all CDM16s. These two multipliers reduce the delay by 43% compared to the Exact16 and improve the power, area, PDP, and PADP by 30%, 30%, 60%, and 72%, on average. Also, among all CDM16s, CDM16\_f880 and CDM16\_fbb3 have the best PDEP and PADEP, respectively.

## VIII. COMPARISON AND DISCUSSION

Table IV shows the hardware efficiency and accuracy criteria of some existing approximate unsigned 8-bit array multipliers in the literature that we intend to compare with the CDM8 multipliers. CDM8s have reduced the critical path delay, power consumption, and area by 14.3%, 22.8%, and 26.4% on average compared to the other approximate array multipliers in Table IV; CDM8s have also improved PDP, PADP, PDEP, and PADEP by 37.1%, 52.7%, 37.5%, and 64.1% respectively. Regarding the accuracy criteria, CDM8s have improved MRED by 17.6%. Therefore, as the results show, the CDM8s have improved both the accuracy and the hardware efficiency.



Fig. 6. Comparison of power consumption, area, delay, and PDP versus MRED for the approximate 8-bit multipliers.

Regarding the power consumption and PDP, [29] obtained the best result, followed by the three proposed multipliers CDM8\_aa, CDM8\_a9, and CDM8\_a8. Compared to these three, [29] improves power consumption, PDP, and MRED by 12.8%, 10.7%, and 55.9%, respectively. However, those three proposed multipliers are the best in terms of area and PADP, and they are also among the best in terms of delay: compared to [29], they improve area, PADP, and delay by 24.3%, 15.1%, and 2.3%, respectively. Regarding critical path delay, CDM8\_95 is the best, being 8.3% faster than [29]. CDM8\_95 is also more accurate than [29], with a 35.8% lower MRED. On the other hand, the multipliers of [13], [14], and [24] have the highest delay, power consumption, area, PDP, and PADP; hence, compared to [13], [14], and [24], the three mentioned proposed multipliers are 22.7%, 50.2%, 46.9%, 63.2%, and 80% better, respectively.

Regarding the MRED, PC, PDEP, and PADEP, [13] has obtained the best results, followed by the four proposed multipliers CDM8\_51, CDM8\_62, CDM8\_73, and CDM8\_44. Compared to them, [13] improves MRED, PC, PDEP, and PADEP by 82%, 23.8%, 79.1%, and 76.8%, respectively, which reflects the very small error of [13]. Nevertheless, those four proposed multipliers achieve better results in power consumption, delay, area, PDP, and PADP, improving them by 9%, 8%, 9%, 15%, and 22%, respectively, compared to [13].

In the following, we compare the CDM8s with other approximate multipliers of different architectures. We selected about 80 approximate multipliers from recent years [14], [16], [17], [18], [20], [21], [22], [23], [24], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44]. All these existing approximate multipliers are 8-bit, unsigned, and synthesized in 45 nm NanGate technology in their reference papers. The architectures of the selected multipliers differ, and we divide them into four types: compressor-based multipliers [18], [20], [21], [30], [31], [32], [35], [36], [39], [40], [41], [42], [43], array multipliers [13], [14], [21], [22], [23], [24], [29], logarithmic multipliers [28], [37], [38], and operand truncation-based multipliers [16], [17], [28], [33], [34], [38].

Figure 6 shows the power consumption, area, delay, and PDP of the approximate multipliers versus MRED. For each plot, we delineated the Pareto front to show the designs with the highest efficiency. The four proposed multipliers CDM8\_51, CDM8\_62, CDM8\_73, and CDM8\_74 are placed on the Pareto front of the power consumption. Regarding the area, CDM8\_51, CDM8\_62, CDM8\_73, CDM8\_74, CDM8\_84, CDM8\_95, CDM8\_a6, and CDM8\_a7 are placed on the Pareto front. Regarding the delay, the compressor-based multipliers have the lowest delay and are placed on the Pareto front of critical path delay. Nevertheless, it should be noted that the CDM8s have the lowest delay after the compressor-based multipliers. Compared to them, CDM8s have far lower power, area, and PDP, and they are among the best designs in terms of these criteria. Also, in terms of PDP, CDM8\_51, CDM8\_62, CDM8\_73, and CDM8\_74 are the most efficient designs and are placed on the Pareto front of PDP. Therefore, we can conclude that the CDM8s, whether compared to array multipliers or to other conventional architectures, are either the best or among the best in many evaluation criteria; hence, CDM8s offer a better balance.
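A Pareto front of this kind can be extracted with a short routine. The following sketch assumes that both coordinates (here MRED and a cost such as power) are to be minimized; the function name is hypothetical, and the example uses only four rows of Table IV, so it does not reproduce Figure 6, which covers about 80 designs.

```python
from typing import List, Tuple


def pareto_front(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the points that no other point dominates, minimizing both coordinates."""
    front = []
    best_cost = float("inf")
    for err, cost in sorted(set(points)):   # ascending error, then ascending cost
        if cost < best_cost:                # strictly cheaper than every lower-error design
            front.append((err, cost))
            best_cost = cost
    return front


# (MRED, power in uW) for CDM8_51, CDM8_62, CDM8_44, and CDM8_73 from Table IV.
designs = [(0.0039, 78.071), (0.0081, 74.851), (0.0133, 76.777), (0.0164, 70.826)]
front = pareto_front(designs)
```

Within this small subset, CDM8\_44 is dominated by CDM8\_62, which has both a lower MRED and a lower power consumption.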

## IX. CASE STUDY: IMAGE PROCESSING

Image processing is one of the most frequently considered error-resilient applications, and several papers test their proposed circuits in this setting. This paper evaluates image blurring, which helps clarify the scope of applicability of the proposed designs. Low-pass filtering smooths an image by eliminating its abrupt spatial changes. The low-pass filter applies a sliding kernel, which processes each pixel together with its neighboring pixels. Each pixel requires several multiplications, the number of which depends on the kernel size. The

TABLE VI  
GAUSSIAN SMOOTHING  $3 \times 3$  KERNEL [27]

| Original |       |       | Modified |     |     |
|----------|-------|-------|----------|-----|-----|
| 0.095    | 0.118 | 0.095 | 97       | 121 | 97  |
| 0.118    | 0.148 | 0.118 | 121      | 151 | 121 |
| 0.095    | 0.118 | 0.095 | 97       | 121 | 97  |

weighted average of the adjacent pixels serves as the new value of the affected pixel. Additionally, since the human eye cannot perceive insignificant deviations, image blurring is an application that can tolerate errors.

The aim of this study is to investigate the impact of approximating more partial product units ( $\Pi_0$ s) on the performance of Gaussian smoothing. The performance is measured by the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM). This study focuses on comparing the performance of the proposed 8-bit approximate multipliers. We take the output image of the exact 8-bit multiplier, in which Group A and Group B have exact partial product units ( $\Pi_0$ ), as the baseline in our comparison. Therefore, PSNR and SSIM values of the output images of proposed 8-bit approximate multipliers are calculated regarding the output image of the exact 8-bit multiplier.

#### A. Experimental Setup

In this paper, a two-dimensional, rotationally symmetric  $3 \times 3$  Gaussian low-pass filter with a standard deviation of 1.5 is used as the image smoothing kernel, similar to [20]. The kernel's floating-point values are multiplied by  $2^{10}$  and then rounded, so that the kernel values fit the 8-bit inputs of the multipliers.
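The kernel construction described above can be sketched in Python as follows. This is a minimal illustration that assumes NumPy and a kernel normalized to unit sum; the helper names `gaussian_kernel` and `quantize_kernel` are ours and do not come from the original implementation.

```python
import numpy as np


def gaussian_kernel(size: int = 3, sigma: float = 1.5) -> np.ndarray:
    """Rotationally symmetric Gaussian kernel normalized to sum to 1."""
    half = size // 2
    ax = np.arange(-half, half + 1)
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / k.sum()


def quantize_kernel(kernel: np.ndarray, frac_bits: int = 10) -> np.ndarray:
    """Scale by 2**frac_bits and round to the nearest integer."""
    return np.rint(kernel * (1 << frac_bits)).astype(np.int64)


kernel = gaussian_kernel()
kernel_q = quantize_kernel(kernel)

print(np.round(kernel, 3))  # floating-point kernel
print(kernel_q)             # integer kernel, every entry below 256
```

Under these assumptions the largest quantized coefficient is 151, so every coefficient fits in an 8-bit operand, and the coefficients sum to 1023, which keeps the smoothed pixel within the 8-bit range after the result is scaled back by $2^{10}$.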

As in [27], the original and modified kernels are presented in Table VI. A Gaussian smoothing filter built on the proposed multipliers was applied to blur a test image, and the resulting images are presented in Figure 7. The same processing was also performed with an exact multiplier to provide a baseline for comparing the proposed designs.

Software models that are functionally equivalent to the proposed multipliers and to the exact multiplier were written in Python. These models yield the same accuracy metrics as those reported in Table IV. They are embedded into the multiplication step of Gaussian smoothing, in which the 8-bit  $128 \times 128$  input image is convolved with the modified kernel presented in Table VI.
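A minimal sketch of such a setup is given below, assuming NumPy, edge-replicated borders, and a right shift by 10 bits to undo the kernel scaling; none of these details is stated in the text, so they are illustrative assumptions. The multiplier is passed in as a function, and `exact_mul8` is a placeholder that would be swapped for a software model of an approximate multiplier. The names `smooth` and `exact_mul8` are hypothetical.

```python
from typing import Callable

import numpy as np

# Modified kernel from Table VI (floating-point kernel scaled by 2**10).
KERNEL_Q = np.array([[97, 121, 97], [121, 151, 121], [97, 121, 97]])


def exact_mul8(a: int, b: int) -> int:
    """Exact 8-bit x 8-bit multiplication (placeholder for a multiplier model)."""
    return (a & 0xFF) * (b & 0xFF)


def smooth(image: np.ndarray,
           mul: Callable[[int, int], int] = exact_mul8,
           frac_bits: int = 10) -> np.ndarray:
    """3x3 smoothing in which every product goes through `mul`."""
    # Assumption: borders are handled by replicating edge pixels.
    padded = np.pad(image.astype(np.int64), 1, mode="edge")
    out = np.zeros(image.shape, dtype=np.uint8)
    rows, cols = image.shape
    for r in range(rows):
        for c in range(cols):
            acc = 0
            for i in range(3):
                for j in range(3):
                    acc += mul(int(padded[r + i, c + j]), int(KERNEL_Q[i, j]))
            # Undo the 2**frac_bits scaling and clamp to the 8-bit range.
            out[r, c] = min(acc >> frac_bits, 255)
    return out


flat = np.full((8, 8), 128, dtype=np.uint8)
blurred = smooth(flat)
print(blurred[0, 0])
```

Passing the multiplier as a parameter keeps the convolution code identical for the exact and the approximate runs, so any difference between the output images comes from the multiplier alone.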

#### B. Metrics

The SSIM and PSNR quantify how well each proposed multiplier performs in image smoothing. When evaluating image quality in image processing, it is important to consider both metrics.

While PSNR is a commonly used metric for image quality assessment, it has some limitations. Specifically, PSNR is sensitive to small changes in pixel values and may not accurately reflect the perceived quality of an image. This is because PSNR only considers the mean squared error between the original and reconstructed images, without taking into account the structural similarity between the images.
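For reference, PSNR follows directly from the mean squared error as $10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$, where MAX is 255 for 8-bit images. The Python sketch below assumes NumPy; the function name `psnr` is ours and is not taken from the original code.

```python
import numpy as np


def psnr(reference, test, peak: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB between two same-sized images."""
    ref = np.asarray(reference, dtype=np.float64)
    tst = np.asarray(test, dtype=np.float64)
    mse = np.mean((ref - tst) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak**2 / mse))


a = np.zeros((4, 4), dtype=np.uint8)
b = np.ones((4, 4), dtype=np.uint8)  # every pixel off by one, so MSE = 1
print(round(psnr(a, b), 2))
```

In the case study, `reference` would be the image produced with the exact multiplier and `test` the image produced with an approximate one.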

TABLE VII  
PERFORMANCE OF  $8 \times 8$  APPROXIMATE MULTIPLIERS  
IN GAUSSIAN SMOOTHING

|          | Design    | SSIM (%) | PSNR (dB) |
|----------|-----------|----------|-----------|
| [13]     | CDM8_40   | 99.99    | 63.04     |
| Proposed | CDM8_44   | 99.89    | 52.24     |
|          | CDM8_51   | 99.98    | 60.11     |
|          | CDM8_62   | 99.95    | 56.66     |
|          | CDM8_73   | 99.89    | 52.95     |
|          | CDM8_74   | 99.86    | 51.38     |
|          | CDM8_84   | 99.73    | 48.02     |
|          | CDM8_95   | 99.07    | 41.52     |
|          | CDM8_a6   | 98.40    | 37.27     |
|          | CDM8_a7   | 94.46    | 30.78     |
|          | CDM8_a8   | 81.60    | 22.38     |
| [28]     | CDM8_a9   | 67.15    | 15.77     |
|          | CDM8_aa   | 67.15    | 15.77     |
|          | N8-L1     | 97.85    | 41.70     |
|          | N8-L2     | 97.66    | 39.50     |
| [15]*    | N8-5      | 97.98    | 43.00     |
|          | N8-6      | 97.98    | 43.00     |
|          | Ax8_1     | 97.96    | 43.00     |
| [46]*    | Ax8_2     | 97.85    | 39.20     |
|          | Ax8_3     | 97.25    | 35.60     |
|          | AxRM1     | 97.97    | 43.00     |
| [47]     | AxRM2     | 97.90    | 41.50     |
|          | AxRM3     | 97.85    | 41.20     |
| [47]     | SSM_m4    | 94.39    | 26.80     |
|          | SSM_m4_u3 | 96.41    | 38.90     |
| [48]*    | DT2       | 97.67    | 42.31     |
|          | DT4       | 97.67    | 42.31     |
|          | DT8       | 97.37    | 35.61     |

\* Reported by [28].

On the other hand, SSIM takes into account the structural information of the image, by comparing local patterns of pixel intensities rather than just the overall pixel values. As a result, SSIM is a more perceptually relevant measure of image quality. It has been shown in several studies that SSIM correlates more closely with human perception of image quality than PSNR [48].

Therefore, by considering both PSNR and SSIM in image processing, we can obtain a more comprehensive evaluation of image quality. This can help ensure that the resulting images are not only technically accurate but also visually appealing to the human eye.
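To make the structural comparison concrete, the sketch below evaluates the SSIM expression of [48] once over the whole image. This is a simplification: [48] computes the index over local windows and averages the results, and the text does not state which SSIM implementation was used here. The constants $K_1 = 0.01$ and $K_2 = 0.03$ follow [48]; the function name `global_ssim` is ours.

```python
import numpy as np


def global_ssim(x, y, peak: float = 255.0,
                k1: float = 0.01, k2: float = 0.03) -> float:
    """Single-window SSIM computed from global image statistics."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1, c2 = (k1 * peak) ** 2, (k2 * peak) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    luminance = (2 * mu_x * mu_y + c1) / (mu_x**2 + mu_y**2 + c1)
    structure = (2 * cov + c2) / (var_x + var_y + c2)
    return float(luminance * structure)


img = np.arange(64, dtype=np.float64).reshape(8, 8)
print(global_ssim(img, img))       # identical images
print(global_ssim(img, img + 10))  # brightness shift lowers the index
```

Unlike PSNR, the index separates a change in mean brightness from a change in local structure, which is why the two metrics can rank the same pair of images differently.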

#### C. Results

Figure 7 shows the output images of Gaussian smoothing using the proposed 8-bit multipliers. Except for CDM8\_a8, CDM8\_a9, and CDM8\_aa, all multipliers achieve acceptable PSNR and SSIM values of over 30 dB and 94%, respectively. Consequently, CDM8\_40 [13] to CDM8\_a7 are suitable for image processing applications.

Figure 7 indicates that increasing the number of approximated partial product units leads to a degradation in the PSNR and SSIM. CDM8\_44, which approximates four columns of



Fig. 7. Gaussian smoothing of images obtained with proposed 8-bit multipliers.

the least significant bits of both Group A and Group B, is more heavily approximated than CDM8\_51, resulting in slightly worse performance in the case study. CDM8\_73, which has more carry-disregarded columns in Group A but approximates only three columns in Group B, performs closest to CDM8\_44. This suggests that the Group B approximation is the primary contributor to the accuracy and performance degradation.

The study shows that as the number of approximated columns in both Group A and Group B increases, the performance degradation becomes more significant. However, the degradation is gradual from CDM8\_44 to CDM8\_95. Hence, it is possible to gain hardware efficiency (see Table IV) while maintaining acceptable performance in Gaussian smoothing, with PSNR and SSIM over 41 dB and 99%, respectively. CDM8\_a6 and CDM8\_a7 also maintain an acceptable performance level in the case study, with slightly more noticeable drops. Since all columns of Group A are approximated in both of these multipliers, their performance drops slightly, yet their PSNR and SSIM remain over 30 dB and 94%, respectively.

In summary, the study demonstrates that increasing the number of approximated partial product units degrades the performance of the proposed 8-bit approximate multipliers in Gaussian smoothing, with the Group B approximation having the largest effect. However, it is possible to improve hardware efficiency while maintaining acceptable performance even with a higher number of approximated partial product units; e.g., CDM8\_73 is more hardware-efficient than CDM8\_44 while performing similarly in our case study. Compared with other 8-bit approximate multipliers in the literature evaluated on the same case study, the proposed multipliers CDM8\_44 to CDM8\_84 achieve higher PSNR and SSIM, as shown in Table VII.
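The thresholds and comparisons quoted in this section can be rechecked from the numbers in Table VII. The Python sketch below copies the SSIM and PSNR values from that table; the dictionary names and the split into CDM8 designs and designs from the literature are illustrative choices made for this example.

```python
# (SSIM in %, PSNR in dB), copied from Table VII.
cdm8 = {
    "CDM8_40": (99.99, 63.04), "CDM8_44": (99.89, 52.24),
    "CDM8_51": (99.98, 60.11), "CDM8_62": (99.95, 56.66),
    "CDM8_73": (99.89, 52.95), "CDM8_74": (99.86, 51.38),
    "CDM8_84": (99.73, 48.02), "CDM8_95": (99.07, 41.52),
    "CDM8_a6": (98.40, 37.27), "CDM8_a7": (94.46, 30.78),
    "CDM8_a8": (81.60, 22.38), "CDM8_a9": (67.15, 15.77),
    "CDM8_aa": (67.15, 15.77),
}
literature = {
    "N8-L1": (97.85, 41.70), "N8-L2": (97.66, 39.50),
    "N8-5": (97.98, 43.00), "N8-6": (97.98, 43.00),
    "Ax8_1": (97.96, 43.00), "Ax8_2": (97.85, 39.20),
    "Ax8_3": (97.25, 35.60), "AxRM1": (97.97, 43.00),
    "AxRM2": (97.90, 41.50), "AxRM3": (97.85, 41.20),
    "SSM_m4": (94.39, 26.80), "SSM_m4_u3": (96.41, 38.90),
    "DT2": (97.67, 42.31), "DT4": (97.67, 42.31), "DT8": (97.37, 35.61),
}

# Designs above the 94% SSIM and 30 dB PSNR acceptance thresholds.
acceptable = [n for n, (ssim, psnr) in cdm8.items() if ssim > 94 and psnr > 30]

# Designs that beat the best literature entry in both metrics.
best_ssim = max(s for s, _ in literature.values())
best_psnr = max(p for _, p in literature.values())
better = [n for n, (s, p) in cdm8.items() if s > best_ssim and p > best_psnr]

print(acceptable)
print(better)
```

With these values, the first list excludes only CDM8\_a8, CDM8\_a9, and CDM8\_aa, and the second list ends at CDM8\_84.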

## X. CONCLUSION

This paper proposes a methodology for designing approximate array multipliers based on carry disregard. Carries can be ignored in various ways, and each method leads to different results in terms of accuracy and hardware criteria. As shown by our application case study, a smaller number of disregarded carries does not necessarily lead to a more accurate and better-performing multiplier. In other words, by judiciously selecting the location of carry disregard, it is possible to disregard a larger number of carries (and thus gain better speed and smaller area) while achieving better overall accuracy and suitability for the application. Thus, the essential point is to choose a suitable way to disregard the carries, which depends on the architecture of the multiplier. Our study also shows that the absolute value of the approximation metrics of a multiplier does not necessarily predict its performance in the end application accurately, even though it is a good general indicator and a ballpark estimator.

This method simplifies the computing units and reduces hardware complexity, which significantly improves the hardware efficiency criteria. Compared to the exact multiplier, the proposed 8-bit approximate multipliers improve critical path delay, power consumption, and area by 29%, 29%, and 30% on average. Compared to the existing approximate array architectures in the literature, they reduce the delay, power consumption, and area by 14.3%, 22.8%, and 26.4% on average. The proposed 16-bit approximate multipliers improve critical path delay, power consumption, and area by 35%, 24%, and 23% on average compared to the exact multiplier. The proposed multipliers cover different accuracy levels, creating an acceptable and better balance between hardware efficiency and accuracy criteria.

The proposed designs are based on conventional CMOS technology, while new technologies such as spin-CMOS and memristive in-memory computing are emerging. A different underlying technology is expected to lead to different results. Within the same technology, CMOS in our case, the use of smaller technology nodes brings about better results.

We applied the proposed 8-bit multipliers to image processing and reported their PSNR and SSIM. Most of them are applicable to image blurring, with PSNR and SSIM over 30 dB and 94%, respectively.

For future work, we plan to use multi-operand approximate adders, such as compressors, in the process of PP reduction. The compatibility of our proposed methodology with these types of adders suggests combining them, and we predict that this will achieve better results. In addition, for the final summation step of the multiplication, there are various types of adders, such as CSA and Parallel Prefix Adder (PPA), whose effect we intend to investigate. Finally, different applications have different error tolerances, so dynamically adjusting the accuracy level of the proposed multipliers across applications is another direction for our future work.

## REFERENCES

- [1] W. Liu, F. Lombardi, and M. Schulte, "A retrospective and prospective view of approximate computing [point of view]," *Proc. IEEE*, vol. 108, no. 3, pp. 394–399, Mar. 2020.
- [2] P. Schober, S. N. Estiri, S. Aygun, N. TaheriNejad, and M. H. Najafi, "Sound source localization using stochastic computing," in *Proc. IEEE/ACM Int. Conf. Comput. Aided Design (ICCAD)*, Oct. 2022, pp. 1–9.
- [3] N. TaheriNejad and S. Shakibhamedan, "Energy-aware adaptive approximate computing for deep learning applications," in *Proc. IEEE Comput. Soc. Annu. Symp. VLSI (ISVLSI)*, Jul. 2022, p. 328.
- [4] P. Schober, S. N. Estiri, S. Aygun, A. H. Jalilvand, M. H. Najafi, and N. TaheriNejad, "Stochastic computing design and implementation of a sound source localization system," *IEEE J. Emerg. Sel. Topics Circuits Syst.*, vol. 13, no. 1, pp. 295–311, Mar. 2023.
- [5] H. Anzt, M. Casas, A. C. I. Malossi, E. S. Quintana-Ortí, F. Scheidegger, and S. Zhuang, *Approximate Computing for Scientific Applications*. Cham, Switzerland: Springer, 2022, pp. 415–465.
- [6] S. E. Fatemieh, M. R. Reshadinezhad, and N. TaheriNejad, "Approximate in-memory computing using memristive IMPLY logic and its application to image processing," in *Proc. IEEE Int. Symp. Circuits Syst. (ISCAS)*, May 2022, pp. 3115–3119.
- [7] C. Ossimitz and N. TaheriNejad, "A fast line segment detector using approximate computing," in *Proc. IEEE Int. Symp. Circuits Syst. (ISCAS)*, May 2021, pp. 1–5.
- [8] S. E. Fatemieh, M. R. Reshadinezhad, and N. TaheriNejad, "Fast and compact serial IMPLY-based approximate full adders applied in image processing," *IEEE J. Emerg. Sel. Topics Circuits Syst.*, vol. 13, no. 1, pp. 175–188, Mar. 2023.
- [9] H. Jiang, F. J. H. Santiago, H. Mo, L. Liu, and J. Han, "Approximate arithmetic circuits: A survey, characterization, and recent applications," *Proc. IEEE*, vol. 108, no. 12, pp. 2108–2135, Dec. 2020.
- [10] Q. Xu, T. Mytkowicz, and N. S. Kim, "Approximate computing: A survey," *IEEE Des. Test*, vol. 33, no. 1, pp. 8–22, Feb. 2016.
- [11] A. S. Baroughi, S. Huemer, H. S. Shahhoseini, and N. TaheriNejad, "AxE: An approximate-exact multi-processor system-on-chip platform," in *Proc. 25th Euromicro Conf. Digit. Syst. Design (DSD)*, Aug. 2022, pp. 60–66.
- [12] S. Mittal, "A survey of techniques for approximate computing," *ACM Comput. Surv.*, vol. 48, no. 4, pp. 1–33, Mar. 2016.
- [13] N. Amirafshar, A. S. Baroughi, H. S. Shahhoseini, and N. TaheriNejad, "An approximate carry disregard multiplier with improved mean relative error distance and probability of correctness," in *Proc. 25th Euromicro Conf. Digit. Syst. Design (DSD)*, Aug. 2022, pp. 46–52.
- [14] H. Waris, C. Wang, W. Liu, J. Han, and F. Lombardi, "Hybrid partial product-based high-performance approximate recursive multipliers," *IEEE Trans. Emerg. Topics Comput.*, vol. 10, no. 1, pp. 507–513, Jan. 2022.
- [15] H. Jiang, C. Liu, L. Liu, F. Lombardi, and J. Han, "A review, classification, and comparative evaluation of approximate arithmetic circuits," *ACM J. Emerg. Technol. Comput. Syst.*, vol. 13, no. 4, pp. 1–34, Oct. 2017.
- [16] S. Hashemi, R. I. Bahar, and S. Reda, "DRUM: A dynamic range unbiased multiplier for approximate applications," in *Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD)*, Nov. 2015, pp. 418–425.
- [17] S. Vahdat, M. Kamal, A. Afzali-Kusha, and M. Pedram, "TOSAM: An energy-efficient truncation- and rounding-based scalable approximate multiplier," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 27, no. 5, pp. 1161–1173, May 2019.
- [18] P. J. Edavoor, S. Raveendran, and A. D. Rahulkar, "Approximate multiplier design using novel dual-stage 4:2 compressors," *IEEE Access*, vol. 8, pp. 48337–48351, 2020.
- [19] A. G. M. Strollo, E. Napoli, D. D. Caro, N. Petra, and G. D. Meo, "Comparison and extension of approximate 4:2 compressors for low-power approximate multipliers," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 67, no. 9, pp. 3021–3034, Sep. 2020.
- [20] D. Esposito, A. G. M. Strollo, E. Napoli, D. De Caro, and N. Petra, "Approximate multipliers based on new approximate compressors," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 65, no. 12, pp. 4169–4182, Dec. 2018.
- [21] Y. Chang, Y. Cheng, Y. Lin, S. Liao, C. Lai, and T. Wu, "Imprecise 4:2 compressor design used in image processing applications," *IET Circuits, Devices Syst.*, vol. 13, no. 6, pp. 848–856, Sep. 2019.
- [22] P. Kulkarni, P. Gupta, and M. Ercegovac, "Trading accuracy for power with an underdesigned multiplier architecture," in *Proc. 24th Int. Conf. VLSI Design*, Jan. 2011, pp. 346–351.
- [23] S. Rehman, W. El-Harouni, M. Shafique, A. Kumar, and J. Henkel, "Architectural-space exploration of approximate multipliers," in *Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD)*, Nov. 2016, pp. 1–8.
- [24] H. R. Mahdiani, A. Ahmadi, S. M. Fakhraie, and C. Lucas, "Bio-inspired imprecise computational blocks for efficient VLSI implementation of soft-computing applications," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 57, no. 4, pp. 850–862, Apr. 2010.
- [25] H. Jiang, S. Angizi, D. Fan, J. Han, and L. Liu, "Non-volatile approximate arithmetic circuits using scalable hybrid spin-CMOS majority gates," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 68, no. 3, pp. 1217–1230, Mar. 2021.
- [26] S. Muthulakshmi, C. S. Dash, and S. R. S. Prabaharan, "Memristor augmented approximate adders and subtractors for image processing applications: An approach," *AEU Int. J. Electron. Commun.*, vol. 91, pp. 91–102, Jul. 2018.
- [27] E. Zacharelos, I. Nunziata, G. Saggese, A. G. M. Strollo, and E. Napoli, "Approximate recursive multipliers using low power building blocks," *IEEE Trans. Emerg. Topics Comput.*, vol. 10, no. 3, pp. 1315–1330, Jul. 2022.
- [28] P. Yin, C. Wang, H. Waris, W. Liu, Y. Han, and F. Lombardi, "Design and analysis of energy-efficient dynamic range approximate logarithmic multipliers for machine learning," *IEEE Trans. Sustain. Comput.*, vol. 6, no. 4, pp. 612–625, Oct. 2021.
- [29] V. Mrazek, R. Hrbacek, Z. Vasicek, and L. Sekanina, "EvoApprox8b: Library of approximate adders and multipliers for circuit design and benchmarking of approximation methods," in *Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE)*, Mar. 2017, pp. 258–261.
- [30] M. S. Ansari, H. Jiang, B. F. Cockburn, and J. Han, "Low-power approximate multipliers using encoded partial products and approximate compressors," *IEEE J. Emerg. Sel. Topics Circuits Syst.*, vol. 8, no. 3, pp. 404–416, Sep. 2018.
- [31] S. K. N. Mahamad, "Low power, high speed approximate multiplier for error resilient applications," *Integration*, vol. 84, pp. 37–46, May 2022.
- [32] Y. Guo, H. Sun, and S. Kimura, "Design of power and area efficient lower-part-OR approximate multiplier," in *Proc. TENCON IEEE Region Conf.*, Oct. 2018, pp. 2110–2115.
- [33] S. Vahdat, M. Kamal, A. Afzali-Kusha, and M. Pedram, "LETAM: A low energy truncation-based approximate multiplier," *Comput. Electr. Eng.*, vol. 63, pp. 1–17, Oct. 2017.
- [34] S. Narayanamoorthy, H. A. Moghaddam, Z. Liu, T. Park, and N. S. Kim, "Energy-efficient approximate multiplication for digital signal processing and classification applications," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 23, no. 6, pp. 1180–1184, Jun. 2015.
- [35] S. Venkatachalam and S.-B. Ko, "Design of power and area efficient approximate multipliers," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 25, no. 5, pp. 1782–1786, May 2017.
- [36] M. Ha and S. Lee, "Multipliers with approximate 4:2 compressors and error recovery modules," *IEEE Embedded Syst. Lett.*, vol. 10, no. 1, pp. 6–9, Mar. 2018.
- [37] W. Liu, J. Xu, D. Wang, C. Wang, P. Montuschi, and F. Lombardi, "Design and evaluation of approximate logarithmic multipliers for low power error-tolerant applications," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 65, no. 9, pp. 2856–2868, Sep. 2018.
- [38] M. S. Kim, A. A. D. Barrio, L. T. Oliveira, R. Hermida, and N. Bagherzadeh, "Efficient Mitchell's approximate log multipliers for convolutional neural networks," *IEEE Trans. Comput.*, vol. 68, no. 5, pp. 660–675, May 2019.
- [39] K. M. Reddy, M. H. Vasantha, Y. B. N. Kumar, and D. Dwivedi, "Design and analysis of multiplier using approximate 4:2 compressor," *AEU Int. J. Electron. Commun.*, vol. 107, pp. 89–97, Jul. 2019.

- [40] O. Akbari, M. Kamal, A. Afzali-Kusha, and M. Pedram, "Dual-quality 4:2 compressors for utilizing in dynamic accuracy configurable multipliers," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 25, no. 4, pp. 1352–1361, Apr. 2017.
- [41] C.-H. Lin and I.-C. Lin, "High accuracy approximate multiplier with error correction," in *Proc. IEEE 31st Int. Conf. Comput. Design (ICCD)*, Oct. 2013, pp. 33–38.
- [42] A. Momeni, J. Han, P. Montuschi, and F. Lombardi, "Design and analysis of approximate compressors for multiplication," *IEEE Trans. Comput.*, vol. 64, no. 4, pp. 984–994, Apr. 2015.
- [43] A. Gorantla, "Design of approximate compressors for multiplication," *ACM J. Emerg. Technol. Comput. Syst.*, vol. 13, no. 3, pp. 1–17, Apr. 2017.
- [44] R. Zendegani, M. Kamal, M. Bahadori, A. Afzali-Kusha, and M. Pedram, "RoBA multiplier: A rounding-based approximate multiplier for high-speed yet energy-efficient digital signal processing," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 25, no. 2, pp. 393–401, Feb. 2017.
- [45] H. Waris, C. Wang, C. Xu, and W. Liu, "AxRMs: Approximate recursive multipliers using high-performance building blocks," *IEEE Trans. Emerg. Topics Comput.*, vol. 10, no. 2, pp. 1229–1235, Apr. 2022.
- [46] A. G. M. Strollo, E. Napoli, D. D. Caro, N. Petra, G. Saggese, and G. Di Meo, "Approximate multipliers using static segmentation: Error analysis and improvements," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 69, no. 6, pp. 2449–2462, Jun. 2022.
- [47] F. Frustaci, S. Perri, P. Corsonello, and M. Alioto, "Approximate multipliers with dynamic truncation for energy reduction via graceful quality degradation," *IEEE Trans. Circuits Syst. II, Exp. Briefs*, vol. 67, no. 12, pp. 3427–3431, Dec. 2020.
- [48] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," *IEEE Trans. Image Process.*, vol. 13, no. 4, pp. 600–612, Apr. 2004.



**Nima Amirafshar** received the B.Sc. degree in electrical engineering from the Ferdowsi University of Mashhad (FUM), Mashhad, Iran, in 2019. He is currently pursuing the M.Sc. degree with the School of Electrical Engineering, Iran University of Science and Technology (IUST), Tehran, Iran. His current research interests include computer architecture, approximate computing, and digital circuit design.



**Ahmad Sadigh Baroughi** received the M.Sc. degree in electrical engineering from Tabriz University, Tabriz, Iran, in 2018. He has published three conference papers on high-performance computing and approximate hardware design. His current research interests include systems on chips, approximate computing, and digital system design.



**Hadi Shahriar Shahhoseini** received the B.Sc. degree in electrical engineering in 1990, the M.Sc. degree in electrical engineering in 1994, and the Ph.D. degree in electrical engineering in 1999. He is currently an Associate Professor with the School of Electrical Engineering, IUST. He has published more than 200 papers from his research works in scientific journals and conference proceedings. His current research interests include high-performance computing, computer networking, and approximate computing.



**Nima TaheriNejad** (Member, IEEE) received the Ph.D. degree in electrical and computer engineering from The University of British Columbia (UBC), Vancouver, Canada, in 2015. He is currently a Full Professor with Heidelberg University, Heidelberg, Germany, and also affiliated with TU Wien (formerly known as the Vienna University of Technology), Vienna, Austria. He has published three books, three patents, and more than 90 articles. His current research interests include in-memory computing, cyber-physical and embedded systems, systems on chips, memristor-based circuits and systems, self-\* systems, and health care. He has received several awards and scholarships from universities, conferences, and competitions. He received the Best University Booth Award at DATE in 2021, the First Prize in the 15th Digilent Design Contest in 2019, the Open-Source Hardware Competition at Eurolab4HPC in 2019, and the Best Teacher and Best Course Award at TU Wien in 2020. He has also been an organizer and the chair of various conferences and workshops. He has served as a reviewer and an editor for many journals and conferences.