

# A Leak Resistant SoC to Counteract Side Channel Attacks

Daniel Mesquita, Benoît Badrignans, Lionel Torres,  
Gilles Sassatelli, Michel Robert  
*LIRMM – Laboratoire d’Informatique, de Robotique et de  
Microélectronique de Montpellier – UMII*  
161 rue Ada – CEDEX 5 – 34392 – Montpellier – France  
Phone : +33 4 67 41 85 69  
{lastname}@lirmm.fr

This work addresses the problem of hardware attacks against cryptographic circuits. The most dangerous side-channel attack: the differential power analysis (DPA) is discussed, as well the state of art countermeasures. Then new reconfigurable System on Chip resistant against DPA attacks is proposed. Results shows that our architecture is efficient against DPA attacks, but also outcomes the performance of classical implementation of modular exponentiation, for key size exceeding 2048 bits, with a reasonable extra area overhead.

## I. INTRODUCTION

Even a carefully implementation of a cryptographic algorithm, at the software abstraction level, may be vulnerable at the hardware point of view. Hardware implementation may leak some information – like time to compute, power consumption, among others – that can be used to realise hardware attacks. These attacks are called Side Channel Attacks (SCA) and they have done a new perspective in cryptanalysis. These attacks do not rely on purely cryptanalytic considerations, but rather focus the VLSI implementation of the cryptographic algorithm.

For instance, the power consumed by a cryptographic engine has a strict relation with the data computed. This characteristic is due to the fact that power consumed by every single logic gate differs depending on the output states before and after switching (i.e. switching from logic 0 to logic 1 or from logic 1 to logic 0) [1]. An encryption process generates a power signature depending on the encrypted text and the cryptographic key used for encryption. If the cryptographic hardware does not have any protection measures, a Simple Power Analysis (SPA) attack can be performed. But even with some countermeasures present, the Differential Power Analysis (DPA) may be still efficient.

This paper is organized as follows. Firstly it gives a brief introduction concerning DPA attacks and countermeasures. Then, as the base of the proposed architecture, the Leak Resistant Arithmetic is explained. Afterwards, a reconfigurable architecture resistant against some SCA attacks is introduced. Finally some results, conclusions and further works are discussed.

Fernando Gehm Moraes

*Pontifícia Universidade Católica do Rio Grande do Sul*  
Av. Ipiranga, 6681 – Prédio 30 – Bloco 4  
90.619-900 – Porto Alegre – RS – Brasil  
Phone: +55 51 33 20 36 11 – R.29  
moraes@inf.pucrs.br

## II. DPA ATTACK AND COUNTERMEASURES

### A. DPA

Differential power analysis consists of statistical analysis and error-correction methods to recover keys of hardware implementation of cryptographic algorithms [1].

For a typical attack, an adversary repeatedly samples the target device’s power consumption through each of several thousand cryptographic computations. These power traces can be obtained using high-speed analog-to-digital converters. Figure 1 illustrates this method used to attack a secure device, but experiments were performed also against FPGA implementations.



Figure 1 - A DPA attack platform

For instance, a DPA attack conducted against a device implementing the DES algorithm is given. The Data Encryption Standard (DES) [2] is used because of its widespread use and simplicity to demonstrate the DPA principle. The DES is composed by 16 rounds of substitutions, with initial and final permutations. In each of the 16 rounds, the DES encryption algorithm performs eight S-box operations. The 8 S-boxes take as input the XOR between the six key bits and the six bits of an internal register and produce four output bits.

The attack consists in to guess the sub-keys at the input of the S-box, and predict the output. For any of the four predicted bits, the power traces are divided into two subsets:

This work is partially founded by the Brazilian CAPES Agency

one where the predicted bit value is  $0$ , and one set where the predicted value is  $1$ . Next, an average trace is computed for each subset, where the  $n^{\text{th}}$  sample in each average trace is the average of the  $n^{\text{th}}$  samples in all traces in the subset. Finally, the adversary computes the difference of the average traces. If the original assumption is incorrect, the computed averages tend to zero. Otherwise, if the key hypothesis is correct, the average trace will reveal peaks, a statistically-significant deviation from zero in the difference trace.

### B. Algorithmic Countermeasures

There are several algorithmic countermeasures to thwart DPA attacks. Some of the first ones were proposed in [3], and the three proposed countermeasures are efficient against SPA and classical DPA attacks. For RSA cryptosystems [4] the first method described by Coron is applicable, and the second one is just an adaptation of the Chaum's blind signature [5]. The third method is only suitable for ECC (Elliptic Curve Cryptosystems). But the recently proposed Refined Power Analysis (RPA) [6] overrules these countermeasures.

The BRIP method counteracts the RPA but is also targeted to ECC, not tailored to work with the widely used RSA algorithm [7]. The message blinding proposed by P. Kocher [8] seems to be an efficient countermeasure against the MRED [9], an attack targeting CRT (Chinese Remainder Theorem) implementation of RSA.

In general, the countermeasures protecting the RSA algorithm of DPA attacks relies on message or exponent blinding. These methods contribute or not to the security of the system, depending on the way they are implemented and the kind of attack. It is not rare that the defence against one attack may benefit another kind of attack.

So, the best way to counteract DPA attacks is to target the DPA principle: the correlation between the data computed and the power consumption. Differently of the works that generally proposes CRT to accelerate RSA, like [10], another proposes a full RNS (Residue Number System) representation to compute RSA [13], [14]. Besides the acceleration, a full RNS implementation of RSA can intelligently be used to counteract DPA and DFA attacks, as shown below

## III. LEAK RESISTANT ARITHMETIC (LRA)

The Leak Resistant Arithmetic (henceforth called LRA) is based on the RNS representation and the RNS Montgomery's modular multiplication proposed in [12].

A Residue Number System relies on the Chinese Remainder Theorem. The CRT indicates that is possible to represent a large integer using a set of smaller integers, so that computation may be performed more efficiently.

So, arithmetic operations can be made using this representation, as addition and multiplication – for instance, absolutely in fully parallel mode, because once the number are represented in RNS the operations do not generates any carry. For cryptographic applications, modular reduction ( $x \bmod M$ ), modular multiplication ( $x \cdot y \bmod M$ ) and modular

exponentiation ( $x^y \bmod M$ ) are some of the most important operations. They can be calculated using the Montgomery's algorithm, modified to RNS representation, as described in the next sub-section.

### A. Montgomery's modular multiplication in RNS

The version of the Montgomery's modular multiplication presented below was proposed in [13]. In the RNS representation the value  $M$  is taken from (2):

$$M = \prod_{i=1}^k m_i \quad (1)$$

So,  $M$  is chosen as the Montgomery constant instead  $\beta^k$  in the classical representation. Then, with  $A$ ,  $B$ ,  $R$  and  $N$  represented in RNS within the base  $\beta_1 = \{m_1, m_2, m_3, \dots, m_k\}$ . The result of the algorithm must be:

$$R = A \cdot B \cdot M_1^{-1} \bmod N$$

However, the value  $M_1^{-1}$  cannot be computed in  $\beta_1$ . So another base  $\beta_2$  is defined as an extension of  $\beta_1$ , with  $k$  extra moduli all co-primes among them and with  $\beta_1$ . So, before calculate (Algorithm 1, in Figure 2, points 3 and 5)  $M_1^{-1}$  a base extension from  $\beta_1$  to  $\beta_2$  is performed.

The Algorithm 3 describes the Montgomery's modular multiplication in RNS. As inputs we have two RNS bases  $\beta_1$  and  $\beta_2$ , such that  $M$  and  $M'$  can be computed as the product of the moduli that composes respectively  $\beta_1$  and  $\beta_2$ . The inputs  $A$ ,  $B$  and  $N$  are also represented in both  $\beta_1$  and  $\beta_2$  bases. Besides a redundant modulus  $m_r$  such that  $\gcd(m_r, m_i) = 1$  is needed. Also,  $N$  and  $M$  must be co-primes. The result is given by  $R$  in  $\beta_1$ .

$$\begin{aligned} 1: & T \leftarrow A \otimes_{\text{RNS}} B \text{ in } \beta_1 \text{ and } \beta_2 \\ 2: & Q \leftarrow T \otimes_{\text{RNS}} (-N^{-1}) \text{ in } \beta_1 \\ 3: & \text{Extend } Q \text{ from } \beta_1 \text{ to } \beta_2 \\ 4: & R \leftarrow (T \oplus_{\text{RNS}} Q \otimes_{\text{RNS}} N) \otimes_{\text{RNS}} M^{-1} \text{ in } \beta_2 \\ 5: & \text{Extend } R \text{ from } \beta_2 \text{ to } \beta_1 \end{aligned} \quad (2)$$

The points 1, 2 and 4 of the algorithm consist of full RNS operations that can be realized in parallel. Therefore, the most complex operations rely on the base extensions (points 3 and 5). There are some methods to compute a base extension, but here the Mixed Radix System, described in [11] that has the advan that it requires only a of  $k$  values for each base modification [12]. Due to space constraints, the algorithm for modular exponentiation was omitted in this paper, but details can be obtained in the same reference [12].

### B. The security provided by the LRA

Besides performance due to the intrinsic parallelism, RNS algorithms provide also the possibility of randomize the basis: the algorithm's robustness relies on this concept. The LRA proposes two approaches of data randomization: one at the circuit level (spatial randomization) and the data level (arithmetic masking). They represent a good trade-off between security and implementation cost. The considered approaches are:

- **Random choice of initial bases:** Randomization of the input data is provided by randomly choosing the

elements of  $\beta_1$  and  $\beta_2$  before each modular exponentiation.

- **Random change of bases before and during the exponentiation:** A generic algorithm is proposed in [12], offering many degrees of freedom in the implementation and at the security level.

The main goal of these approaches is to lead to a randomization of all intermediate data computed at the cryptographic circuit for the same input data and output. Based on the same principle of the DPA, if the data change during an operation, consequently the power consumption became random, thwarting DPA attacks. The security of this method was demonstrated in [12].

#### IV. THE LR<sup>2</sup>A RECONFIGURABLE SoC ARCHITECTURE

As the RNS implementation of RSA leads to parallel operations, we conceived a reconfigurable SoC to run the Leak Resistant Arithmetic called “*Leak Resistant Reconfigurable Architecture*” (LR<sup>2</sup>A). The LR<sup>2</sup>A is a SoC because includes processor, reconfigurable fabric, memory and a flexible interconnection. Besides, the LR<sup>2</sup>A intends to be a flexible solution for cryptographic algorithms, so it can be viewed as a coarse grain reconfigurable architecture, because it is possible to reconfigure this architecture to perform other cryptosystems based on modular arithmetic, like ECC or RC6. Furthermore, the LR<sup>2</sup>A can be easily modified to run supplementary applications, like data compression or image processing, but the discussion about the flexibility exceeds the scope of the paper.

The LR<sup>2</sup>A is built around three main structures: a configuration and data injection controller, some homogeneous processing elements (PE), and memory resources. A controller is responsible of data injection and configuration control to/of the PEs. There are  $k$  PE, where  $k$  is given by the number of bases used for the LRA. The LR<sup>2</sup>A memory schema is non-uniform memory access architecture, i.e. the memory organization can be viewed as a local memory for each PE, but accessible for all PEs and the controller.

##### A. The controller

The main difference from the LR<sup>2</sup>A and some common reconfigurable architectures is that the LR<sup>2</sup>A is not only a loop-core for a specific algorithm class. The proposed reconfigurable architecture includes a controller to bring the configuration specific to each node, to inject the data into the distributed memory, and after that, to recover the computed data. The Figure 2 shows an overview of the LR<sup>2</sup>A.

To perform these operations the Plasma processor was chosen, due to its availability and the fact it has a C compiler, which makes easier the configuration model programming. The Plasma CPU is an open source processor, based on the MIPS R3000 instruction set, ant it has as an advantage the software compatibility with others processors used for embedded systems.



Figure 2 - The LR<sup>2</sup>A overview

To perform these operations the Plasma processor was chosen, due to its availability and the fact it has a C compiler, which makes easier the configuration model programming. When the system starts, the controller reads the configuration for each PE from the configuration memory, sending it to the target PE. Also, the controller stores internally the status of each node into a data structure. After, the initial data supposed to be computed is loaded into the local memories. Since configuration and data are in place, the controller starts the computation. Finally, the controller is the element charged to recuperate the computed data and to store it into the main *memory*. For the next changes (i.e., data arriving, or other configuration needed), the controller verifies the current status of the PEs and their respective memories before send the new information.

##### B. The processing element

The LR<sup>2</sup>A’s processing element is a load-store architecture similar to Plasma: the logic and arithmetic instructions are executed using internal registers only, while the memory access instructions execute either the reading from (load) or the writing to (store) one memory position. In fact, the processor’s organization is similar to the Plasma, but the PEs are optimized for cryptographic operations, and consequently, simplest and smaller.

Due to the load/store architecture option, the processor must have a relatively large set of data manipulation general-purpose registers, to reduce the number of memory accesses (this always represents a time penalty with regard to the processor internal operation). Regarding the instruction format, all instructions have exactly the same size, occupying 1 memory word each. The instruction contains the operation code and the operands specification, in case they exist.

Some important features of the PE are:

- Address and data size are of 32 bits.
- Memory addressing is performed on a word basis.
- The register bank contains 16 general-purpose registers, each of 32 bits.
- There are 4 status flags named: negative, zero, carry, and overflow.

The processor datapath includes specific cryptographic operators. Because the LRA’s basic operations are the modular ones, the following operations are hardwired and incorporated to the ALU:

- Modular reduction ( $X \bmod M$ )
- Modular reduction by  $2^k$  factor ( $X \bmod 2^k$ )
- Modular addition ( $X+Y \bmod M$ )
- Modular subtraction ( $X-Y \bmod M$ )
- Modular multiplication ( $X \times Y \bmod M$ )
- Modular multiplication by  $2^k$  factor ( $X \times Y \bmod 2^k$ )

- Shifts and rotations performed through a Barrel shifter.

## V. RESULTS

Taking the 32bits datapath configuration, and considering the size of the controller being around 40k gates, a LR<sup>2</sup>A composed by 32 PEs (capable of performing 1024bits exponentiation), occupies 352k gates on silicon. Comparing with state-of-art hardware accelerators for the same purpose, the LR<sup>2</sup>A takes five to eight times the commonly used area.

But in terms of performance, as shown in the Figure 3, for cryptographic keys larger than 1024 bits, the LR<sup>2</sup>A outcomes classical implementations. The comparison was made taking into account the square and multiply method to perform modular exponentiation. The reference implementation concerns a modular exponentiation calculated with a dedicated circuit implementing the Montgomery algorithm for modular multiplication, with a word size of 32 bits, and running at 40MHz. On the other hand, the LR<sup>2</sup>A implementation features are a 32 bit datapath version, with 32 processors running at 40MHz. For both versions the area corresponds the exponentiation control was not considered.

As the Figure 3 illustrates, in terms of performance, the LR<sup>2</sup>A is equivalent to classic implementations of modular exponentiation for key sizes until 2048 bits. Beyond 4096 bits, the LR<sup>2</sup>A is clearly more interesting in performance, but with an area penalty.



**Figure 3 - - LR<sup>2</sup>A versus a classic implementation of the modular exponentiation**

Concerning the area, the report is 5 times for 2048 bits and 10 times for 4096. This is the cost of the security. The classic implementation does not contain any countermeasure against DPA, while the LR<sup>2</sup>A incorporate the robustness needed to counteract this attack.

Regarding the robustness aspect, by changing intermediate results via base extension within the LRA method, our architecture consumes differently for the same data set, according with the Figure 4. In the first part we can see the consumption for the first base  $\beta_1$ , while in the second part, the consumption is for the base  $\beta_2$ . The arrows highlight a visible different consumption.

## VI. CONCLUSIONS

The Leak Resistant Arithmetic improves the robustness of cryptographic applications counteracting the principle of some hardware attacks, like DPA. Due to its nature, the LRA leads to a parallel architecture, requiring important hardware resources. Even with area penalties, the LR<sup>2</sup>A is competitive in terms of performance, equivalent to state-of-art cryptographic accelerators.



**Figure 4 - Different consumption for different bases, with the same calculated data**

## REFERENCES

- [1] P. Kocher, J. Jaffe, et al. "Differential Power Analysis : Leaking Secrets". *Advances in Cryptology: Proceedings of CRYPTO'99*, LNCS, Vol 1666, Springer, pp. 388-397. 1999.
- [2] – . "Data Encryption Standard (DES)". Federal Information Processing Standards Publications (FIPS PUBS) N° 46-3. EUA.October 25, 1999.
- [3] J-S Coron. "Resistance against Differential Power Analysis for Elliptic Curve Cryptosystems". *Cryptographic Hardware and Embedded Systems, Proceedings of CHES 1999*. LNCS, Vol. 1717, Springer, Pp 292-302, 1999.
- [4] R. Rivest, A. Shamir, et al. "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems". *ACM Communications*, vol 21, pp 120-126. 1978.
- [5] D. Chaum. "Security without identification: transaction systems to make Big Brother obsolete". *Communication of the ACM*. Vol. 8., n° 10, pp 1030-144. 1985.
- [6] L. Goubin. "A refined power-analysis attack on elliptic curve cryptosystems". *Public Key Cryptography: Proceedings of PKC '03*. LNCS, Vol. 2567, Springer, pp 199-210. 2003.
- [7] M. Hideyo, M. Atsuko. "Efficient Countermeasures against RPA, DPA, and SPA". *Cryptographic Hardware and Embedded Systems, Proceedings of CHES 2004*. LNCS, Vol. 3156, Springer, Pp 343-356, 2004.
- [8] P. Kocher. "Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems". *16<sup>th</sup> Workshop in Cryptology: Proceedings of Crypto '96*. LNCS, Vol. 1109, Springer, pp 104-113. Santa Barbara, USA. 1996.
- [9] B. Boer. "A DPA Attack against the Modular Reduction within a CRT Implementation of RSA". *Cryptographic Hardware and Embedded Systems, Proceedings of CHES 2002*. LNCS, Vol. 2523, Springer, pp 228-243, 2002.
- [10] C. Kim, J. Ha, et al. "A CRT-Based RSA Countermeasure against Physical Cryptanalysis". *International Conference on High Performance Computing and Communications: Proceedings of HPCC '05*. Pp 549-554, Naples, Italy, 2005.
- [11] H. Garner. "The Residue Number System". *IRE Transactions in electronic Computers*. Vol 8, pp. 140-147, 1959.
- [12] J-C. Bajard, L. Imbert, et al. "Leak Resistant Arithmetic". *Cryptographic Hardware and Embedded Systems, Proceedings of CHES 2004*. LNCS, Vol. 3156, Springer, Pp 62-75, 2004.
- [13] J-C. Bajard, L. Imbert. "A Full RNS Implementation of RSA". *IEEE Transactions on Computers*. Vol. 53, n° 6, pp. 769-774. 2004.
- [14] M. Ciet, M. Neve, et al. "Parallel FPGA implementation of RSA with residue number systems – can side-channel threats be avoided?". *40<sup>th</sup> IEEE International Midwest Symposium on Circuits and Systems: Proceedings of MWSCAS '03*. Cairo, Egypt, December 2003.
- [15] E-L. Carvalho, N. Calazans, et al. "Reconfiguration Control for Dynamically Reconfigurable Systems". *19<sup>th</sup> Conference on Design of Circuits and Integrated Systems: Proceedings of DCIS '04*. Bordeaux, France, 2004.