

# IMPLY-based Approximate Full Adders for Efficient Arithmetic Operations in Image Processing and Machine Learning

Melanie Qiu<sup>+</sup>, Caoyueshan Fan<sup>+</sup>, Gulafshan<sup>+</sup>, Salar Shakibhamedan\*, Fabian Seiler\* and Nima TaheriNejad<sup>++</sup>

<sup>+</sup>Heidelberg University, Germany, <sup>\*</sup>Technische Universität Wien (TU Wien), Austria

melanie.qiu@stud.uni-heidelberg.de, caoyueshan.fan@stud.uni-heidelberg.de, gulafshan@ziti.uni-heidelberg.de, salar.shakibhamedan@tuwien.ac.at, fabian.seiler@tuwien.ac.at, nima.taherinejad@ziti.uni-heidelberg.de

**Abstract**—To overcome the performance limitations in modern computing, such as the power wall, emerging computing paradigms are gaining increasing importance. Approximate computing offers a promising solution by substantially enhancing energy efficiency and reducing latency, albeit with a trade-off in accuracy. Another emerging method is memristor-based In-Memory Computing (IMC) which has the potential to overcome the Von Neumann bottleneck. In this work, we combine these two approaches and propose two Serial APProximate IMPLY-based full adders (SAPPI). When embedded in a Ripple Carry Adder (RCA), our designs reduce the number of steps by 39%-41% and the energy consumption by 39%-42% compared to the exact algorithm. We evaluated our approach at the circuit level and compared it with State-of-the-Art (SoA) approximations where our adders improved the speed by up to 10% and the energy efficiency by up to 13%. We applied our designs in three common image processing applications where we achieved acceptable image quality with up to half of the RCA approximated.

We performed a case study to demonstrate the applicability of our approximations in Machine Learning (ML) underscoring the potential gains in more complex scenarios. The proposed approach demonstrates energy savings of up to 296 mJ (21%) and a reduction of 1.3 billion (20%) computational steps when applied to Convolutional Neural Networks (CNNs) trained on the MNIST dataset while maintaining accuracy.

## I. INTRODUCTION

Reducing energy consumption is one of the most relevant challenges in Information and Communication Technology (ICT). A large portion of the energy budget of modern computing devices (e.g. 63% at Google [1]) is consumed by data movement alone, which totals to 1.07PWh of global energy [2]. Data movement, constrained by the Von Neumann bottleneck, is a key limiting factor in throughput, making it an optimal target for performance enhancements. With In-Memory Computing (IMC) the computation takes place directly in the memory, offering a potential solution to this issue. Memristors stand out as ideal components for IMC, since they can store data non-volatile with their resistive states and perform logical operations. Other advantages such as low power consumption, a small form factor, and compatibility with CMOS technology emphasize the memristor as a promising candidate [3]–[8]. Plenty of stateful logic forms based on memristors were proposed such as [9]–[13]. Memristor-based Material Implication (IMPLY) proves to be a favorable

choice, as it is well established, compatible with the crossbar array, and more reliable when compared to other stateful logic forms [6], [7], [14]. Approximate Computing (AxC) is another emerging computer paradigm that presents a possible solution to the power-wall problem [15]–[17]. With AxC important metrics like energy efficiency and latency can be improved drastically. The trade-off for this is reduced accuracy. Since applications such as image and video processing, as well as Machine Learning (ML) are inherently error-resilient, they provide the optimal environment for AxC. As in modern computing, most of the operations are based on addition and multiplication, improving them with approximations is the most beneficial approach.

In this work, we propose two IMPLY-based approximated adder algorithms for the serial topology (SAPPI-1&2) that minimize the energy consumption and number of computational cycles by utilizing the unique properties of IMPLY. We present the first adder algorithm that maintains all input states after the operation and analyze the applicability in three image processing tasks and two machine learning tasks. This paper is structured as follows: In Section II we review the related literature. Section III introduces the proposed approximate adders. In Section IV we simulate and analyze the adders and compare them to the SoA in Section V. The results of three image processing applications and their quality is discussed in Section VI. In Section VII, we show the applicability of our approach in machine learning, and in Section VIII, we conclude the paper.

## II. IMPLY-BASED APPROXIMATE ADDERS

Memristors, originally discovered by L. Chua [18] and physically realized by R. Williams et al. [8], can store logical values via electrical resistance in a non-volatile fashion. They have low power consumption and small dimensions, deeming them the ideal memory cell [6], [7]. It is the convention to assume the minimum ( $R_{on}$ ) and maximum ( $R_{off}$ ) resistance as logical ‘1’ and ‘0’ which can be set by the applied voltage and the direction of the current [8], [18]–[21]. Stateful logic forms based on memristors include various models such as [9]–[12]. In this work, we focus on IMPLY [13], [22] due to its reliability and the presence of other approximations [14],



Fig. 1: IMPLY operation: (top) Basic Gate and Truth Table, (bottom) Serial Topology [6], [23]

[23]–[25]. An IMPLY operation is denoted as  $P \rightarrow Q$  and follows the truth table in Figure 1. The result is stored in the  $Q$ -memristor, overwriting the initial state. The IMPLY structure (Figure 1) consists of two memristors connected to a resistor where the resistances must satisfy the condition  $R_{on} \ll R_G \ll R_{off}$ . Voltages must meet  $V_{COND} < V_C < V_{SET}$  for correct operation, where  $V_C$  denotes the threshold voltage of the memristors [6], [13], [20], [22]. IMPLY-based full adders can be categorized into serial [6], [7], [19], parallel [20], [22] and hybrid forms such as semi-serial [21], [26] and semi-parallel [27]. In this work we utilize the serial topology, illustrated in Figure 1, as it stands out for its low hardware complexity and small area, requiring only  $2n + 3$  memristors and no additional switches to function [19]. Approximate computing allows for significant reductions in energy consumption and speed albeit at the cost of reduced accuracy. This technique is suitable for error-resilient applications like image processing and machine learning [17], [23], [28], [29]. The resulting accuracy is evaluated using error metrics such as the Error Distance (ED), Mean Error Distance (MED), Normalized Mean Error Distance (NMED), and Mean Relative Error Distance (MRED) [17], [23], [30]. In image processing, quality metrics such as the Peak Signal-to-Noise Ratio (PSNR) and Mean Structural Similarity Index (MSSIM) are used to assess the resulting quality. In the literature, a PSNR over 30dB is considered acceptable quality [31]. IMPLY-based approximate adders have been proposed in both serial [23], [32] and semi-serial topologies [24], focusing on minimizing error metrics for image processing. Here, we propose two serial adders, SAPPI-1 and SAPPI-2, designed to optimize energy consumption and computational steps while maintaining acceptable quality for image processing and accuracy in machine learning.

### III. PROPOSED APPROXIMATE FULL ADDERS

#### A. Design Methodology

The design of SAPPI-1 and SAPPI-2 is driven by the need to balance computational complexity and accuracy while ensuring energy efficiency and fast processing. The primary goal is to present fast and energy-efficient adders that deliver acceptable quality in applications like image processing and machine learning. In the serial IMPLY topology, only one

IMPLY or FALSE operation can be executed per cycle. Since these two functions form the complete logic set  $\{\rightarrow, \perp\}$ , all boolean operations can be emulated using them. Rather than directly implementing boolean expressions in IMPLY, which is inefficient [19], [24], we utilize the unique properties of IMPLY logic. In IMPLY logic, the result of the operation overwrites the initial state of an input, losing the stored information. With our methodology, we propose an approach to calculate the outputs without losing the initial states of the input memristors. When the inputs are required for multiple calculations our approach does not require an inefficient reloading process of the input data. We propose two approaches that utilize the work memristor  $M$  to accumulate information from both inputs while preserving the input information. For this, we sequentially implicate the content of the  $A$ -memristor and the  $B$ -memristor to the initially reset (set to ‘0’) work memristor, which represents the  $Sum$  output. The  $C_{out}$  is calculated by overwriting the  $C$  input by implicating the  $Sum$  output to  $C$ -memristor. With this, the carry-out is only erroneous in a single place, which reduces the incorrect error propagation. We will denote both of our approaches as **SAPPI** **APP**roximate **I**MPLY-based full adders (SAPPI-1&2). In our first approach (SAPPI-1), both inputs are preserved, and the  $Sum$  is stored directly in the  $M$ -memristor. With SAPPI-2 we further improve the accuracy of the approximation and store the  $Sum$  output in the  $A$ -memristor. Therefore only the  $B$  input is preserved in this version.

#### B. SAPPI-1

In this algorithm, we utilize the property of IMPLY that the second operand is overwritten and the result of the operation is stored instead. For this, we use the work memristor to store the temporary results and apply the  $A$  and  $B$  inputs sequentially. With this the  $Sum$  is stored in the  $M$ -memristor, keeping the original input states. We calculate the  $C_{out}$  by applying  $M \rightarrow C$  and store it in the  $C$ -memristor. So, we reduce the ER of  $C_{out}$  to only  $\frac{1}{8}$  where the only erroneous case is “ $A_{in}B_{in}C_{in}$ ” = “001”. The  $Sum$  output has an error rate (ER) of  $\frac{4}{8}$  which can be seen in the truth table in Table I. The corresponding logical functions in boolean and IMPLY form for  $Sum$  and  $C_{out}$  are:

$$Sum = B \rightarrow \bar{A} = \overline{AB} \quad (1)$$

$$C_{out} = (B \rightarrow \bar{A}) \rightarrow C = AB + C \quad (2)$$

The exact procedure of SAPPI-1 is outlined in Table II, beginning with a reset of the work memristor to logical ‘0’ using a FALSE operation. In the second and third steps, the inputs are implied on the work memristor which stores the  $Sum$  in the third step. In the fourth step, the  $C_{out}$  is computed and saved in the  $C$ -memristor. This algorithm requires only  $4n$  steps for an  $n$ -bit addition. When only SAPPI-1 adders are used, an  $n$ -bit addition would require  $4n$  steps and  $3n + 1$  memristors.

TABLE I: Truth table of proposed algorithms

| Inputs          |                 |                 | Exact |                  | SAPPI-1 |                  | SAPPI-2 |                  |
|-----------------|-----------------|-----------------|-------|------------------|---------|------------------|---------|------------------|
| A <sub>in</sub> | B <sub>in</sub> | C <sub>in</sub> | Sum   | C <sub>out</sub> | Sum     | C <sub>out</sub> | Sum     | C <sub>out</sub> |
| 0               | 0               | 0               | 0     | 0                | 1 ✗     | 0 ✓              | 1 ✗     | 0 ✓              |
| 0               | 0               | 1               | 1     | 0                | 1 ✓     | 1 ✗              | 0 ✗     | 1 ✗              |
| 0               | 1               | 0               | 1     | 0                | 1 ✓     | 0 ✓              | 1 ✓     | 0 ✓              |
| 0               | 1               | 1               | 0     | 1                | 1 ✗     | 1 ✓              | 0 ✓     | 1 ✓              |
| 1               | 0               | 0               | 1     | 0                | 1 ✓     | 0 ✓              | 1 ✓     | 0 ✓              |
| 1               | 0               | 1               | 0     | 1                | 1 ✗     | 1 ✓              | 1 ✗     | 1 ✓              |
| 1               | 1               | 0               | 0     | 1                | 0 ✓     | 1 ✓              | 1 ✗     | 1 ✓              |
| 1               | 1               | 1               | 1     | 1                | 0 ✗     | 1 ✓              | 1 ✓     | 1 ✓              |

TABLE II: Computational steps of the SAPPI-1

| Step | Operation                    | Equivalent logic                                  |
|------|------------------------------|---------------------------------------------------|
| 1    | $M_1 = 0$                    | FALSE( $M_1$ )                                    |
| 2    | $A \rightarrow M_1 = M_1'$   | NOT(A)                                            |
| 3    | $B \rightarrow M_1' = M_1''$ | $Sum = B \rightarrow \bar{A}$                     |
| 4    | $M_1'' \rightarrow C = C'$   | $C_{out} = (B \rightarrow \bar{A}) \rightarrow C$ |

### C. SAPPI-2

In SAPPI-2, the approach also involves sequentially implating the inputs into the work memristor to store temporary results. However, instead of storing the *Sum* result in the work memristor as done in SAPPI-1, the *Sum* is saved in the *A*-memristor using an additional step. This modification aligns with the exact serial algorithm from [19] and ensures compatibility. As with SAPPI-1, the ER for  $C_{out}$  is  $\frac{1}{8}$ . Although the *Sum* output has an ER of  $\frac{4}{8}$ , two error placements have been altered as shown in Table I. The logical equations for SAPPI-2 are:

$$Sum = ((B \rightarrow \bar{A}) \rightarrow C) \rightarrow A = \bar{AB} + \bar{C} + A \quad (3)$$

$$C_{out} = (B \rightarrow \bar{A}) \rightarrow C = AB + C \quad (4)$$

A notable aspect is that a single erroneous bit in the *Sum* now mirrors the error placement in  $C_{out}$ , meaning both *Sum* and  $C_{out}$  are swapped in this scenario. This results in a reduced overall error, as indicated by improved error metrics. The exact procedure for SAPPI-2 is described in Table III, which is similar to SAPPI-1 except for an additional fifth step where the *Sum* is stored in the *A*-memristor. This algorithm requires  $5n$  steps and uses  $2n + 2$  memristors when it is applied  $n$  times.

## IV. CIRCUIT-LEVEL SIMULATION AND ERROR METRICS

### A. Simulation Setup

We simulated the proposed approximated full adders via LT-SPICE to verify their functionality. We used the Voltage Threshold Adaptive Memristor (VTEAM) model implemented in SPICE [33] with the parameters fitted to a discrete Known

TABLE III: Computational steps of the SAPPI-2

| Step | Operation                    | Equivalent logic                                              |
|------|------------------------------|---------------------------------------------------------------|
| 1    | $M_1 = 0$                    | FALSE( $M_1$ )                                                |
| 2    | $A \rightarrow M_1 = M_1'$   | NOT(A)                                                        |
| 3    | $B \rightarrow M_1' = M_1''$ | $B \rightarrow \text{NOT}(A)$                                 |
| 4    | $M_1'' \rightarrow C = C'$   | $C_{out} = (B \rightarrow \bar{A}) \rightarrow C$             |
| 5    | $C' \rightarrow A = A'$      | $Sum = ((B \rightarrow \bar{A}) \rightarrow C) \rightarrow A$ |

TABLE IV: IMPLY parameter values [23]

| Parameter | $V_{SET}$ | $V_{RESET}$ | $V_{COND}$ | $R_G$ | $t_{pulse}$ |
|-----------|-----------|-------------|------------|-------|-------------|
| Value     | 1 V       | -1 V        | 900 mV     | 40 kΩ | 30 μs       |

TABLE V: VTEAM model and setup values [23]

| Parameter | $v_{off}$ | $v_{on}$  | $\alpha_{off}$ | $\alpha_{on}$ | $R_{off}$ | $R_{on}$  |
|-----------|-----------|-----------|----------------|---------------|-----------|-----------|
| Value     | 0.7 V     | -10 mV    | 3              | 3             | 1 MΩ      | 10 kΩ     |
|           | $k_{on}$  | $k_{off}$ | $w_{off}$      | $w_{on}$      | $w_C$     | $a_{off}$ |
| -0.5 nm/s | 1 cm/s    | 0 nm      | 3 nm           | 107 pm        | 3 nm      | 0 nm      |

memristor [33], [34], shown in Table V. We have to note that similar to the differences in discrete and integrated CMOS devices, discrete memristors lead to slower operations and increased power consumption. It is therefore reasonable to assume that integrated memristors have a significantly better performance. We chose the IMPLY-specific parameters similar to Table IV to allow for a fair comparison to SoA papers such as [23], [24], [32]. Since real memristors are non-ideal devices, we included the deviation of  $R_{on}$  and  $R_{off}$  in our experiments as the resistive deviation is one of the most important behaviors that has to be expected. We simulated all eight input combinations with different deviations of up to ±30% which can be seen as the shaded area in Figure 2 and Figure 3.

### B. Simulation Results

For both presented algorithms, we evaluated all eight input combinations with deviations of up to ±30%. All combinations agree with the intended behavior as *Sum* and  $C_{out}$  match the corresponding truth table. Each step in the algorithms corresponds to 30 μs in the simulations. We arbitrarily illustrated one exact and one incorrect by-design case for SAPPI-1 in Figure 2. The *Sum* output is saved in the  $M_1$ -memristor at 60 μs-90 μs which corresponds to the third step. In the fourth step from 90 μs-120 μs, the  $C_{out}$  is stored in the *C*-memristor. The waveforms for SAPPI-2 are shown in Figure 3 where the  $C_{out}$  is stored in the *C*-memristor at 90 μs-120 μs which corresponds to step four. For this algorithm, the *Sum* is stored in the *A*-memristor at the fifth step (120 μs-150 μs).

### C. Error Metrics

To quantify the erroneous behavior of the presented approximated adders, we used common error metrics such as MED, NMED and MRED to efficiently evaluate the accuracy. More details on these metrics can be found in [17], [28]. We evaluated all input cases for an 8-bit RCA that was partially approximated by the presented adders. The results are shown in Table VI where we can see that both presented adders have very similar metrics until an approximation degree of 4/8. With more approximated adders, the accuracy of SAPPI-1 is getting increasingly worse compared to SAPPI-2. This indicates that the intentional error at the case “ $A_{in}B_{in}C_{in}$ ” = “001” leads to a non-mitigated error propagation for SAPPI-1. In SAPPI-2 this case is erroneous for both *Sum* and  $C_{out}$  leading to a mitigation of error.

TABLE VI: Error Metrics of the presented algorithms in an 8-bit approximated full adder (Ax FA) with different approximation degrees

| Ax FA      | SAPPI-1  |        |        | SAPPI-2  |        |        |
|------------|----------|--------|--------|----------|--------|--------|
|            | MED      | NMED   | MRED   | MED      | NMED   | MRED   |
| Ax FA: 1/8 | 0.2500   | 0.0004 | 0.0013 | 0.5000   | 0.0009 | 0.0027 |
| Ax FA: 2/8 | 1.2500   | 0.0024 | 0.0069 | 1.5000   | 0.0029 | 0.0082 |
| Ax FA: 3/8 | 3.5312   | 0.0069 | 0.0197 | 3.5000   | 0.0068 | 0.0194 |
| Ax FA: 4/8 | 8.6250   | 0.0169 | 0.0492 | 7.5000   | 0.0147 | 0.0423 |
| Ax FA: 5/8 | 19.6347  | 0.0385 | 0.1156 | 15.5000  | 0.0303 | 0.0896 |
| Ax FA: 8/8 | 191.0572 | 0.3746 | 1.4026 | 127.5000 | 0.2500 | 0.8841 |



Fig. 2: Two example simulations of SAPPI-1, illustrating the resistive deviation of  $\pm 30\%$  as shaded areas.

## V. CIRCUIT-LEVEL COMPARISON

In this section, we compare the relevant circuit-level criteria with exact [19], [35] and approximated [23], [32] adders that are based on the serial topology. We do not compare to other algorithms in other topologies [24] in this work as they are built for different optimization goals. An overview of the different criteria compared to SoA algorithms is shown in Table VII. Since IMPLY-based approaches are algorithm-based, the approximation level can be reconfigured for each process or even during the run. In the following sections, we just compare with an approximation degree of 4/8 as it is the highest approximation degree that leads to acceptable results on application-level.

### A. Energy Consumption

We recreated the energy consumption of the exact serial algorithm from [19] with our specified simulation parameters in LT-SPICE as our partially approximated RCA uses this algorithm at the higher bits. The energy consumption for the presented and recreated adders was evaluated as the mean of all different input cases. The total energy consumption of such an RCA with  $k$  approximated out of  $n$  total bits is:

$$E_{SAPPI-1}(n, k) = 0.7980k + 4.8250(n - k) \quad (5)$$

$$E_{SAPPI-2}(n, k) = 1.0919k + 4.8250(n - k) \quad (6)$$

SAPPI-1 is more energy efficient than SAPPI-2 and requires 5% less energy. Compared to exact adders the presented algorithms reduce the required energy by 39%-42%. Even to



Fig. 3: Two example simulations of SAPPI-2, illustrating the resistive deviation of  $\pm 30\%$  as shaded areas.

the already optimized approximated adders from [23], [32] our proposed methods improve the energy consumption by 9%-13%, rendering our algorithms more efficient than the SoA.

### B. Number of Steps

Another important criterion on the circuit level is the number of steps that are required per bit. SAPPI-1 and SAPPI-2 require only 4 and 5 steps per bit. When we now include them as the lower bits in an RCA where the higher bits are again calculated by the exact adder from [19], the total number of steps with  $k$  approximated and  $n - k$  exact adders are:

$$S_{SAPPI-1}(n, k) = 4k + 22(n - k) \quad (7)$$

$$S_{SAPPI-2}(n, k) = 5k + 22(n - k) \quad (8)$$

As our approach requires 17-18 steps per bit less than the exact adder from [19], up to 39%-41% steps are saved. Compared to the fastest approximated adder from [23], [32] which is SAFAN, our adders require 7%-10% fewer steps.

### C. Area

For SAPPI-2 and SoA approximations from [23], [32] the work memristors are reused for each bit. So the number of memristors is bounded by the exact serial algorithm from [19] which means they require  $2n + 3$  memristors for  $n$ -bit. In SAPPI-1 the *Sum* result is stored in the work memristor and the inputs are preserved. As the work memristors cannot be reused, additional  $k$  memristors are required. So depending on the approximation degree, SAPPI-1 needs  $2n + k + 3$  memristors for a  $n$ -bit addition. This does not necessarily mean it is more area-inefficient as it preserves the inputs of the adder, compared to other approaches.

## VI. APPLICATION IN IMAGE PROCESSING

Image processing is an error-resilient application widely used in computer vision, robotics, industrial systems and media [36]. We simulated three common applications with RCA that partially consist of our approximated full adders. We evaluated different approximation degrees and determined

TABLE VII: Circuit-Level Comparison to the SoA full adder

| Full adder     | Energy consumption (nJ)      |                |              | No. of steps        |              |              | No. of memristors |              |
|----------------|------------------------------|----------------|--------------|---------------------|--------------|--------------|-------------------|--------------|
|                | n, k                         | n=8-bit, k=4   | Imp. to [19] | n, k                | n=8-bit, k=4 | Imp. to [19] | n, k              | n=8-bit, k=4 |
| Exact 1 [19]   | 4.8250n                      | 38.6000        | -            | 22n                 | 176          | -            | 2n+3              | 19           |
| Exact 2 [35]   | 4.0772n                      | 32.6176        | 15%          | 23n                 | 184          | -4%          | 2n+3              | 19           |
| SIAFA1,3 [23]  | 1.7090k + 4.8250(n-k)        | 26.1360        | 32%          | 8k + 22(n-k)        | 120          | 32%          | 2n+3              | 19           |
| SIAFA2 [23]    | 2.5131k + 4.8250(n-k)        | 29.3524        | 24%          | 10k + 22(n-k)       | 128          | 27%          | 2n+3              | 19           |
| SIAFA4 [23]    | 1.7066k + 4.8250(n-k)        | 26.1264        | 32%          | 8k + 22(n-k)        | 120          | 32%          | 2n+3              | 19           |
| SAFAN [32]     | 1.6628k + 4.8250(n-k)        | 25.9512        | 33%          | 7k + 22(n-k)        | 116          | 34%          | 2n+3              | 19           |
| <b>SAPPI-1</b> | <b>0.7980k + 4.8250(n-k)</b> | <b>22.4920</b> | <b>42%</b>   | <b>4k + 22(n-k)</b> | <b>104</b>   | <b>41%</b>   | 2n+k+3            | 23           |
| <b>SAPPI-2</b> | 1.0919k + 4.8250(n-k)        | 23.6676        | 39%          | 5k + 22(n-k)        | 108          | 39%          | 2n+3              | 19           |

We have simulated the circuits from [19], [23], [32], [35] similar to ours and obtained different numbers to the reported values from [23], [32] and are not sure why. So to allow for a fair comparison, we are reporting our own simulated results.



Fig. 4: Results of different image processing applications with 4/8 Ax FA: image addition (top), grayscale conversion (middle), and with 8/20 Ax FA: Gaussian blurring (bottom).

the quality metrics PSNR and MSSIM which are represented in Table IX. In the literature, a PSNR of over 30dB is deemed acceptable quality. We presented the resulting images with half of the adders approximated in Figure 4. In Section VI-B we analyze possible gains of our approach on application-level and compare to SoA approximations.

#### A. Example Applications

1) *Image Addition*: Image addition is used for masking and enhancing images [16]. We simulated this by adding the corresponding pixels of two images via an RCA and halving the result. We evaluated two 8-bit standard images with varying approximation degrees. Both of the presented algorithms achieved a PSNR of over 30dB with up to 4/8 adders approximated which demonstrates acceptable image quality.

2) *Gray Scale Conversion*: The grayscale conversion is a standard pre-processing procedure for different computer vision and ML tasks. It converts an RGB image to grayscale by averaging the red, green and blue values of each pixel. To perform grayscaling the individual color values are added together in two iterations and the result is then divided by

three. We simulated with varying approximation degrees and achieved acceptable quality (PSNR>30dB) for both presented algorithms with up to 4/8 adders approximated.

3) *Gaussian Smoothing*: To showcase the applicability of our approximated adders on more complex tasks, we evaluated Gaussian smoothing as it is often used for denoising and as a preprocessing step in various computer vision tasks. We implemented this by using a partially approximated 20-bit RCA which we embedded in a shift-and-add multiplier to apply the  $3 \times 3$  kernel from [37] over the whole image. We simulated the Gaussian smoothing with varying approximation degrees and achieved acceptable PSNR over 30dB for both presented algorithms with up to 8/20 adders approximated.

#### B. Application-Level Gains and Comparison

The presented approaches exhibit substantial gains in energy efficiency and required cycles over the exact algorithm [19]. Table VIII presents an overview of the possible gains for the previously covered image processing tasks. Our algorithms reduce the number of steps by 39% – 41% and the energy consumption by 39% – 42%. In image addition we can save up to 1.1 mJ of energy and up to 19 million steps. In the grayscale conversion example, up to 20 mJ of energy and 359 million steps can be saved when our approach is applied. For Gaussian blurring of a  $576 \times 700$  8-bit image, these gains sum up to 581 mJ of energy and 2596 million steps. Compared to SoA approximations our approach is not able to reach the 30dB threshold of PSNR with 5/8 approximated adders. But with four adders approximated, our adders reduce the number of steps by 7% – 10% and improve the energy efficiency by 9% – 13% compared to the most efficient SoA approximation (SAFAN).

## VII. APPLICATION IN MACHINE LEARNING

The demand for efficient computation in machine learning (ML) applications, especially in resource-constrained environments, necessitates innovative approaches to optimize power and performance. In this case, we want to highlight the potential of our IMPLY-based approximations for ML applications, which was disregarded in similar approximation approaches [23], [24], [32]. We verify the applicability to Fully Connected Neural Networks (FC-NN) and Convolutional Neural Networks (CNN) that are pretrained on the MNIST training dataset split (60,000 images) [38]. We utilized a

TABLE VIII: Application-level gains to exact serial algorithm for different image processing applications.

| Ax algorithm        | Image addition (256×256 8-bit Images) |                       | Grayscale conversion (684×912×3 8-bit Image) |                       | Gaussian blurring (576×700 8-bit Image) |                       |
|---------------------|---------------------------------------|-----------------------|----------------------------------------------|-----------------------|-----------------------------------------|-----------------------|
|                     | Energy saved (mJ)                     | Steps saved (million) | Energy saved (mJ)                            | Steps saved (million) | Energy saved (mJ)                       | Steps saved (million) |
| SAPPI-1 (4/8 Ax FA) | 1.0557                                | 18.8744               | 20.0966                                      | 359.3134              | 580.8332                                | 2596.2250             |
| SAPPI-2 (4/8 Ax FA) | 0.9786                                | 17.8258               | 18.6299                                      | 339.3516              | 538.4426                                | 2451.9902             |

TABLE IX: Quality metrics of image processing applications

| Ax algorithm                    | Image addition |        | Grayscale filter |        | Gaussian blurring |             |
|---------------------------------|----------------|--------|------------------|--------|-------------------|-------------|
|                                 | PSNR (dB)      | MSSIM  | PSNR (dB)        | MSSIM  | PSNR (dB)         | MSSIM       |
| Approximation degree: 1/8 Ax FA |                |        |                  |        |                   | 2/20 Ax FA  |
| SAPPI-1                         | 54.10          | 0.9992 | 52.34            | 0.9982 | 88.98             | 1.0000      |
| SAPPI-2                         | 51.12          | 0.9989 | 49.43            | 0.9982 | 79.12             | 1.0000      |
| Approximation degree: 2/8 Ax FA |                |        |                  |        |                   | 4/20 Ax FA  |
| SAPPI-1                         | 48.10          | 0.9974 | 46.08            | 0.9949 | 72.82             | 1.0000      |
| SAPPI-2                         | 46.34          | 0.9978 | 43.71            | 0.9949 | 65.53             | 1.0000      |
| Approximation degree: 3/8 Ax FA |                |        |                  |        |                   | 6/20 Ax FA  |
| SAPPI-1                         | 40.51          | 0.9866 | 38.95            | 0.9758 | 54.08             | 0.9998      |
| SAPPI-2                         | 40.70          | 0.9937 | 37.84            | 0.9827 | 48.75             | 0.9998      |
| Approximation degree: 4/8 Ax FA |                |        |                  |        |                   | 8/20 Ax FA  |
| SAPPI-1                         | 33.42          | 0.9420 | 31.91            | 0.8936 | 35.46             | 0.9893      |
| SAPPI-2                         | 35.01          | 0.9800 | 31.76            | 0.9378 | 33.57             | 0.9942      |
| Approximation degree: 5/8 Ax FA |                |        |                  |        |                   | 10/20 Ax FA |
| SAPPI-1                         | 26.03          | 0.8193 | 24.65            | 0.6764 | 20.33             | 0.9092      |
| SAPPI-2                         | 28.52          | 0.9408 | 25.28            | 0.8004 | 19.69             | 0.9331      |

partially approximated 20-bit RCA which we embedded in a shift-and-add multiplier to evaluate the networks in a post-training fashion. We used the 10,000 test images of MNIST to calculate the accuracy. The resulting accuracy and possible energy savings for the different networks with varying approximation degrees are illustrated in Figure 5.

#### A. Fully Connected Neural Network (FC-NN)

We implemented a simple fully connected network with  $28 \times 28$  inputs that are connected to one hidden layer with 128 nodes and an output stage that represents the 10 possible classes. After each layer, we used a ReLU as an activation function and a normalization. Only after the last layer we used an argmax to calculate the outputs. Our results indicate that with up to six approximated adders the accuracy for both SAPPI-1 and SAPPI-2 are maintained. SAPPI-1 has better accuracy results for all approximation degrees. With seven or more adders approximated, the accuracy rapidly degrades, rendering the network unusable. Using our approach in a fully connected neural network (FC-NN) can result in energy savings of up to 5.3 mJ (29%) and a reduction of 23 million steps (29%) all while preserving accuracy.

#### B. Convolution Neural Network (CNN)

To also show the applicability of our approaches to more complex networks, we used a CNN to classify the same problem defined before. We used a LeNet-5 inspired architecture [39] for this problem since it is a classical approach to solving MNIST image classification task. It is a compact approach, compared to more modern architectures, making it the ideal example for our approximations as the impact of errors has more impact, validating our approach for more complex and error-resilient networks as well. It consists of



Fig. 5: Results of MNIST using the proposed approximations

five convolutional layers with ReLU activation functions with increasing width and height and decreasing channel dimensions. The output of the last convolutional layer is flattened and passed through three fully connected layers with decreasing dimensions. For the exact architecture, we refer the reader to [40], [41]. Our experiments indicate that with up to four adders approximated, there is no degradation in accuracy for both SAPPI-1 and SAPPI-2. With five adders approximated the accuracy for SAPPI-1 only decreases by about 1.43% while the accuracy with SAPPI-2 decreases to 93.59%. With higher approximation degrees our methods do not lead to acceptable results. With our SAPPI-1 approach, the energy can be reduced by up to 296 mJ (21%) while also requiring 1.33 billion fewer steps (20%) and maintaining accuracy.

## VIII. CONCLUSION

In this work, we presented two IMPLY-based approximate full adders in the serial topology which can drastically improve the efficiency of image processing and machine learning applications. The primary emphasis was to reduce the number of steps and energy consumption to a minimum while still being applicable in the aforementioned tasks. When embedded in partially approximated RCA, our approaches are 39%–42% more energy efficient and require 39% – 41% less steps compared to exact adders. With 4/8 adders approximated, our algorithms use 10% – 13% less energy and need 7% – 10% fewer steps than the already efficient SoA approximations. When applied to image processing tasks our approaches can save up to 581 mJ of energy and 2.6 billion steps (cycles) in Gaussian smoothing compared to the exact adders. We demonstrated the applicability of our approximated adders in machine learning where 296 mJ and 1.3 billion steps can

be saved for one inference in the MNIST dataset image classification while maintaining accuracy.

## REFERENCES

- [1] A. Boroumand *et al.* Google workloads for consumer devices: Mitigating data movement bottlenecks. *SIGPLAN Notices*, 53(2):316–331, 2018.
- [2] N. TaheriNejad. In-memory computing: Global energy consumption, carbon footprint, technology, and products status quo. In *IEEE Nano Conference*, pp. 1–6, 2024.
- [3] N. Taherinejad *et al.* Memristors’ potential for multi-bit storage and pattern learning. In *2015 IEEE European Modelling Symposium (EMS)*, pp. 450–455, Oct 2015.
- [4] N. Taherinejad *et al.* Fully digital write-in scheme for multi-bit memristive storage. In *2016 13th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE)*, pp. 1–6, Sept 2016.
- [5] D. Radakovits and N. TaheriNejad. Implementation and characterization of a memristive memory system. In *2019 IEEE 32nd Canadian Conference on Electrical and Computer Engineering (CCECE)*, pp. 1–5, May 2019.
- [6] J. Borghetti *et al.* Memristive switches enable stateful logic operations via material implication. *Nature*, 464:873–6, 04 2010.
- [7] E. Lehtonen and M. Laiho. Stateful implication logic with memristors. *2009 IEEE/ACM International Symposium on Nanoscale Architectures*, pp. 33–36, 2009.
- [8] D. B. Strukov *et al.* The missing memristor found. *Nature*, 453:80–83, 2008.
- [9] S. Gupta *et al.* Felix: Fast and energy-efficient logic in memory. In *2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)*, pp. 1–7, 2018.
- [10] N. TaheriNejad. Sixor: Single-cycle in-memristor xor. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, 29(5):925–935, 2021.
- [11] P. Huang *et al.* Reconfigurable nonvolatile logic operations in resistance switching crossbar array for large-scale circuits. *Advanced Materials*, 28(44):9758–9764, 2016.
- [12] S. Kvatinsky *et al.* Magic—memristor-aided logic. *IEEE Transactions on Circuits and Systems II: Express Briefs*, 61(11):895–899, 2014.
- [13] S. Kvatinsky *et al.* Memristor-based imply logic design procedure. In *2011 IEEE 29th International Conference on Computer Design (ICCD)*, pp. 142–147, 2011.
- [14] D. Radakovits and N. Taherinejad. Behavioral leakage and inter-cycle variability emulator model for rerams (BELIEVER). *CoRR*, abs/2103.04179, 2021.
- [15] W. Liu *et al.* A retrospective and prospective view of approximate computing. *Proceedings of the IEEE*, 108:394–399, 03 2020.
- [16] V. Gupta *et al.* Low-power digital signal processing using approximate adders. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 32(1):124–137, 2013.
- [17] H. Jiang *et al.* A review, classification, and comparative evaluation of approximate arithmetic circuits. *ACM Journal on Emerging Technologies in Computing Systems (JETC)*, 13(4), 2017.
- [18] L. Chua. Memristor—the missing circuit element. *IEEE Transactions on Circuit Theory*, 18(5):507–519, 1971.
- [19] S. G. Rohani and N. TaheriNejad. An improved algorithm for imply logic based memristive full-adder. In *2017 IEEE 30th Canadian Conference on Electrical and Computer Engineering (CCECE)*, pp. 1–4, 2017.
- [20] A. Karimi and A. Rezai. Novel design for a memristor-based full adder using a new imply logic approach. *Journal of Computational Electronics*, 17(3):1303–1314, 2018.
- [21] D. Radakovits *et al.* A memristive multiplier using semi-serial imply-based adder. *IEEE Transactions on Circuits and Systems I: Regular Papers*, 67(5):1495–1506, 2020.
- [22] S. Kvatinsky *et al.* Memristor-based material implication (imply) logic: Design principles and methodologies. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, 22(10):2054–2066, 2014.
- [23] S. E. Fatemeh *et al.* Fast and compact serial imply-based approximate full adders applied in image processing. *IEEE Journal on Emerging and Selected Topics in Circuits and Systems*, 13(1):175–188, 2023.
- [24] F. Seiler and N. TaheriNejad. An imply-based semi-serial approximate in-memristor adder. In *2023 IEEE Nordic Circuits and Systems Conference (NorCAS)*, pp. 1–7, 2023.
- [25] S. E. Fatemeh *et al.* Approximate in-memory computing using memristive imply logic and its application to image processing. In *2022 IEEE International Symposium on Circuits and Systems (ISCAS)*, pp. 3115–3119, 2022.
- [26] N. TaheriNejad *et al.* A semi-serial topology for compact and fast imply-based memristive full adders. In *2019 17th IEEE International New Circuits and Systems Conference (NEWCAS)*, pp. 1–4, 2019.
- [27] S. Ganjeheizadeh Rohani *et al.* A semiparallel full-adder in imply logic. *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, 28(1):297–301, 2020.
- [28] H. Jiang *et al.* Approximate arithmetic circuits: A survey, characterization, and recent applications. *Proceedings of the IEEE*, 108(12):2108–2135, 2020.
- [29] H. A. Almurib *et al.* Inexact designs for approximate low power addition by cell replacement. In *2016 Design, Automation & Test in Europe Conference & Exhibition (DATE)*, pp. 660–665, 01 2016.
- [30] H. Jiang *et al.* A review, classification, and comparative evaluation of approximate arithmetic circuits. *ACM Journal on Emerging Technologies in Computing Systems (JETC)*, 13:1 – 34, 2017.
- [31] S. Mittal. A survey of techniques for approximate computing. *ACM Computing Surveys*, 48(4):1–33, 2016.
- [32] S. Asgari *et al.* Energy-efficient and fast imply-based approximate full adder applying nand gates for image processing. *Computers and Electrical Engineering*, 113:109053, 2024.
- [33] D. R. M. Jungwirth and N. TaheriNejad. Spice implementation of vteam model, 2018.
- [34] Knownm sdc memristors, [knowm.org/downloads/Knowm\\_Memristors.pdf](http://knowm.org/downloads/Knowm_Memristors.pdf). Last accessed Feb 2024.
- [35] A. Karimi and A. Rezai. Novel design for a memristor-based full adder using a new imply logic approach. *Journal of Computational Electronics*, 17, 09 2018.
- [36] M. Khaleqi Qaleh Joop *et al.* Ultraefficient imprecise multipliers based on innovative 4:2 approximate compressors. *International Journal of Circuit Theory and Applications*, 49:169–184, 2021.
- [37] N. Amirafshar *et al.* Carry disregard approximate multipliers. *IEEE Transactions on Circuits and Systems I: Regular Papers*, 70(12):4840–4853, 2023.
- [38] L. Deng. The mnist database of handwritten digit images for machine learning research. *IEEE Signal Processing Magazine*, 29(6):141–142, 2012.
- [39] Y. Lecun *et al.* Gradient-based learning applied to document recognition. *Proceedings of the IEEE*, 86(11):2278–2324, 1998.
- [40] S. Shakibamedan *et al.* An analytical approach to enhancing DNN efficiency and accuracy using approximate multiplication. In *2nd Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ICML 2024)*, 2024.
- [41] S. Shakibamedan *et al.* Ace-cnn: Approximate carry disregard multipliers for energy-efficient cnn-based image classification. *IEEE Transactions on Circuits and Systems I: Regular Papers*, 71(5):2280–2293, 2024.