

See discussions, stats, and author profiles for this publication at: <https://www.researchgate.net/publication/4123255>

# Coarsely integrated operand scanning (CIOS) architecture for high-speed Montgomery modular multiplication

Conference Paper · January 2005

DOI: 10.1109/FPT.2004.1393267 · Source: IEEE Xplore

---

CITATIONS

14

READS

1,651

3 authors:



Maire Mcloone

Queen's University Belfast

91 PUBLICATIONS 2,577 CITATIONS

[SEE PROFILE](#)



Corinne McIvor

University of the West of Scotland

11 PUBLICATIONS 428 CITATIONS

[SEE PROFILE](#)



J.V. McCanny

Queen's University Belfast

210 PUBLICATIONS 3,341 CITATIONS

[SEE PROFILE](#)

# Coarsely Integrated Operand Scanning (CIOS) Architecture For High-Speed Montgomery Modular Multiplication

Máire McLoone, Ciaran McIvor, John V McCanny

The Institute of Electronics, Communications & Information Technology (ECIT)  
Queen's University Belfast, Northern Ireland Science Park,  
Queen's Road, Queen's Island, Belfast BT3 9DT

## Abstract

A generic Coarsely Integrated Operand Scanning (CIOS) architecture that provides high speed Montgomery modular multiplication is presented in this paper. The architecture is capable of supporting varying operand sizes. It achieves a throughput of 210 Mbps, 289 Mbps and 334 Mbps for 128-bit, 256-bit and 512-bit operand sizes respectively, when implemented on a Virtex XC2VP50 FPGA. Throughputs of up to 400 Mbps are achieved if the final subtraction in the Montgomery algorithm is excluded. To the authors' knowledge this is the fastest Montgomery multiplication architecture reported in the literature.

## 1. Introduction

The CIOS algorithm is one of five Montgomery multiplication algorithms proposed by Koç *et al* [1]. In the algorithm, multiplication and reduction are integrated and are carried out using operand scanning, where the operations are performed on words of one of the operands.

Montgomery multiplication was developed in 1985 [2] and has emerged as the most efficient way to perform modular multiplication in both hardware and software. It is highly efficient as it replaces division by the modulus,  $n$ , with division modulo a power of 2, an operation which is easily implemented on hardware and software platforms.

Elliptic Curve Cryptosystems [7, 8] comprise modular multiplications, and modular exponentiation is an integral part of the Diffie-Hellman key exchange protocol [3], the Digital Signature Algorithm (DSA) [4] and of public key algorithms, such as RSA [5] and ElGamal [6]. Modular exponentiation comprises multiple modular multiplications and this proves to be the bottleneck in implementations of these algorithms. Therefore, a high-speed efficient modular multiplication implementation is essential in order to provide the security requirements of present day real-time communication applications.

This paper describes a high-speed modular multiplication architecture, which can support numerous multiplication data lengths. The architecture utilizes a state machine methodology which allows the complex CIOS algorithm to be easily implemented. Although a significant amount of research work has been published on architectures which comprise Montgomery multiplication [9-13], they fail to provide specific performance metrics. A number of publications do provide comparative results and these are discussed in section 4 of this paper. The only known previous hardware architecture which uses the CIOS algorithm is the work of the authors [14]. This paper improves on the previous work by supporting multiple operand sizes and achieving a higher throughput.

In section 2, the CIOS algorithm is explained. Section 3 describes the architecture that has been developed, and performance results are provided in section 4. Finally, conclusions are given in section 5.

## 2. The Coarsely Integrated Operand Scanning (CIOS) Algorithm

The CIOS algorithm is a representation of the Montgomery modular multiplication algorithm.

### 2.1 Montgomery Multiplication

Montgomery multiplication is performed on integers in  $n$ -residue format, where the modulus  $n$  is a  $k$ -bit integer,  $r = 2^k$ , and  $n$  and  $r$  are relatively prime. The  $n$ -residue of an integer  $y$  is given by,

$$\bar{y} = y \cdot r \pmod{n} \quad (1)$$

and the Montgomery multiplication of two  $n$ -residue integers,  $x$  and  $y$ , is,

$$\bar{z} = \bar{x} \cdot \bar{y} \cdot r^{-1} \pmod{n} \quad (2)$$

where,  $r^{-1}$  is the inverse of  $r$  modulo  $n$ . To obtain the original integer given the  $n$ -residue form, it is necessary to multiply the  $n$ -residue product by the

integer  $r$ . In Montgomery multiplication, a value  $n'$  is also required, such that,

$$r \cdot r^{-1} - n \cdot n' = 1 \quad (3)$$

The overall Montgomery modular multiplication algorithm can be described as follows [15]:

```
MontMult ( $\bar{x}, \bar{y}$ )
1.  $t := \bar{x}, \bar{y}$ 
2.  $m := t \cdot n' \bmod r$ 
3.  $u := (t + m \cdot n) / r$ 
4. if  $u \geq n$  then return  $u - n$ 
   else return  $u$  (4)
```

## 2.2 The CIOS Algorithm

The CIOS algorithm alternates between the multiplication and reduction steps involved in Montgomery modular multiplication, rather than carrying out full multiplication followed by reduction. These multiplications and reductions are conducted using two loops, where an outer loop cycles through words of one of the operands. The CIOS algorithm is described in Figure 1, where  $s$  is the number of words in the operand,  $C$  and  $S$  are the carry and sum respectively,  $t$  is the output product,  $a$  and  $b$  are the input operands,  $n'$  is as defined by equation 3,  $W = 2^w$ , where  $w$  is the wordsize of the computer, and  $n$  is the modulus.

Step 4 of the Montgomery multiplication algorithm is not included in the CIOS algorithm description. Under certain conditions, this subtraction step is not necessary. It has been proven that with a correct hardware implementation and within particular bounds, Montgomery exponentiation can be carried out without utilising any modular subtraction [16-18]. It has also been shown that the subtraction step can be avoided in software implementations of Montgomery exponentiation with even stricter bounds [19]. However, this does not hold for hardware implementations when the modulus length is smaller than the multiplier length.

In this paper, we present and compare Montgomery modular multiplication architectures, which include final subtraction, and architectures, which do not include this step.

## 3. CIOS Algorithm Architecture

The CIOS architecture described in this paper assumes that the number of words,  $s$ , in each operand is 4. Therefore, from Figure 1, it is evident that the  $i$  loop and both  $j$  loops need to be unrolled 4 times, leading to quite a complex algorithm to implement. This paper proposes the use of a state

machine methodology for ease of implementation. This involves outlining the core components required at any one stage of the algorithm's execution, and therefore, during any one state of the state machine.

```
CIOS ( $a, b, n, n'$ )
for  $i = 0$  to  $s - 1$  loop
   $C := 0$ 
  for  $j = 0$  to  $s - 1$  loop
     $(C, S) := t[j] + a[j] * b[i] + C$ 
     $t[j] := S$ 
  end loop
   $(C, S) := t[s] + C$ 
   $t[s] := S$ 
   $t[s + 1] := C$ 
   $C := 0$ 
   $m := t[0] * n'[0] \bmod W$ 
   $(C, S) := t[0] + m * n[0]$ 
  for  $j = 1$  to  $s - 1$  loop
     $(C, S) := t[j] + m * n[j] + C$ 
     $t[j - 1] := S$ 
  end loop
   $(C, S) := t[s] + C$ 
   $t[s - 1] := S$ 
   $t[s] := t[s + 1] + C$ 
end loop
```

Figure 1. CIOS Algorithm

The components required by the CIOS algorithm and the inputs and outputs of each component, are outlined in Table 1. Thus, the components needed to implement the CIOS algorithm per state include a multiplication, an addition and two  $j$  loop components. A  $j$  loop component is the realization of the  $j$  loops in the CIOS algorithm description of Figure 1, and comprises a multiplication and two additions. These components are performed in parallel and the iterative architecture cycles through them during each state. Overall, an 18-state state machine is utilized to execute the complete CIOS algorithm and to obtain the 4-word output product,  $t = t[0], t[1], t[2], t[3]$ . Finally, if step 4 of the Montgomery multiplication algorithm is to be implemented, a subtract block is required. This subtract block is the critical path of the overall architecture since it involves a comparison of the full-length modulus with the full-length product operand. In the architecture described here, the full-length product and modulus are split into  $s = 4$  words and the individual words are compared. As such, step 4 of the Montgomery multiplication algorithm is implemented according to the pseudo-

code outlined in Figure 2. This allows the individual word comparisons to be registered, significantly improving the speed of the subtraction block at the expense of a 4-cycle latency. An outline of the CIOS architecture incorporating the subtract block is depicted in Figure 3.

**Table 1. Components Required by CIOS Architecture ( $s = 4$ )**

| State | Comp.                    | Inputs                                                                                                                                                              | Outputs                                                                                                                           |
|-------|--------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------|
| 0     | j loop                   | a(0), b(0), '0', '0'                                                                                                                                                | C <sub>out0</sub> , t <sub>out0_0</sub>                                                                                           |
| 1     | j loop<br>Mult           | a(1), b(0), C <sub>out0</sub> , '0'<br>t <sub>out0_0</sub> , n'(0)                                                                                                  | C <sub>out1</sub> , t <sub>out1_1</sub><br>m <sub>0</sub>                                                                         |
| 2     | j loop<br>j loop         | a(2), b(0), C <sub>out1</sub> , '0'<br>m <sub>0</sub> , n(0), '0', t <sub>out0_0</sub>                                                                              | C <sub>out2</sub> , t <sub>out2_2</sub><br>C <sub>out3</sub> , null                                                               |
| 3     | j loop<br>j loop         | a(3), b(0), C <sub>out2</sub> , '0'<br>m <sub>0</sub> , n(1), C <sub>out3</sub> , t <sub>out0_1</sub>                                                               | C <sub>out3</sub> , t <sub>out3_3</sub><br>C <sub>out5</sub> , t <sub>out1_0</sub>                                                |
| 4     | Add<br>j loop<br>j loop  | C <sub>out4</sub> , '0'<br>m <sub>0</sub> , n(2), C <sub>out5</sub> , t <sub>out0_2</sub><br>a(0), b(1), '0', t <sub>out1_0</sub>                                   | t <sub>out1_4</sub> , t <sub>out1_5</sub><br>C <sub>out6</sub> , t <sub>out1_1</sub><br>C <sub>out7</sub> , t <sub>out2_0</sub>   |
| 5     | j loop<br>j loop<br>Mult | m <sub>0</sub> , n(3), C <sub>out6</sub> , t <sub>out0_3</sub><br>a(1), b(1), C <sub>out7</sub> , t <sub>out1_1</sub><br>t <sub>out2_0</sub> , n'(0)                | C <sub>out8</sub> , t <sub>out1_2</sub><br>C <sub>out9</sub> , t <sub>out2_1</sub><br>m <sub>1</sub>                              |
| 6     | Add<br>j loop<br>j loop  | C <sub>out8</sub> , t <sub>out0_4</sub><br>a(2), b(1), C <sub>out9</sub> , t <sub>out1_2</sub><br>m <sub>1</sub> , n(0), '0', t <sub>out2_0</sub>                   | C <sub>sig</sub> , t <sub>out1_3</sub><br>C <sub>out10</sub> , t <sub>out2_2</sub><br>C <sub>out11</sub> , null                   |
| 7     | Add<br>j loop<br>j loop  | C <sub>sig</sub> , t <sub>out0_5</sub><br>a(3), b(1), C <sub>out10</sub> , t <sub>out1_3</sub><br>m <sub>1</sub> , n(1), C <sub>out11</sub> , t <sub>out2_1</sub>   | t <sub>out1_4</sub><br>C <sub>out12</sub> , t <sub>out2_3</sub><br>C <sub>out13</sub> , t <sub>out3_0</sub>                       |
| 8     | Add<br>j loop<br>j loop  | C <sub>out12</sub> , t <sub>out1_4</sub><br>m <sub>1</sub> , n(2), C <sub>out13</sub> , t <sub>out2_2</sub><br>a(0), b(2), '0', t <sub>out3_0</sub>                 | t <sub>out2_4</sub> , t <sub>out2_5</sub><br>C <sub>out14</sub> , t <sub>out3_1</sub><br>C <sub>out15</sub> , t <sub>out4_0</sub> |
| 9     | j loop<br>j loop<br>Mult | m <sub>1</sub> , n(3), C <sub>out14</sub> , t <sub>out2_3</sub><br>a(1), b(2), C <sub>out15</sub> , t <sub>out3_1</sub><br>t <sub>out4_0</sub> , n'(0)              | C <sub>out16</sub> , t <sub>out3_2</sub><br>C <sub>out17</sub> , t <sub>out4_1</sub><br>m <sub>2</sub>                            |
| 10    | Add<br>j loop<br>j loop  | C <sub>out16</sub> , t <sub>out2_4</sub><br>a(2), b(2), C <sub>out17</sub> , t <sub>out3_2</sub><br>m <sub>2</sub> , n(0), '0', t <sub>out4_0</sub>                 | C <sub>sig1</sub> , t <sub>out3_3</sub><br>C <sub>out18</sub> , t <sub>out4_2</sub><br>C <sub>out19</sub> , null                  |
| 11    | Add<br>j loop<br>j loop  | C <sub>sig1</sub> , t <sub>out2_5</sub><br>a(3), b(2), C <sub>out18</sub> , t <sub>out3_3</sub><br>m <sub>2</sub> , n(10), C <sub>out19</sub> , t <sub>out4_1</sub> | t <sub>out3_4</sub><br>C <sub>out20</sub> , t <sub>out4_32</sub><br>C <sub>out21</sub> , t <sub>out5_1</sub>                      |
| 12    | Add<br>j loop<br>j loop  | C <sub>out20</sub> , t <sub>out3_4</sub><br>m <sub>2</sub> , n(2), C <sub>out21</sub> , t <sub>out4_2</sub><br>a(0), b(3), '0', t <sub>out5_0</sub>                 | t <sub>out4_4</sub> , t <sub>out4_5</sub><br>C <sub>out22</sub> , t <sub>out5_1</sub><br>C <sub>out23</sub> , t <sub>out6_0</sub> |
| 13    | j loop<br>j loop<br>Mult | m <sub>2</sub> , n(3), C <sub>out22</sub> , t <sub>out4_3</sub><br>a(1), b(3), C <sub>out23</sub> , t <sub>out5_1</sub><br>t <sub>out6_0</sub> , n'(0)              | C <sub>out24</sub> , t <sub>out5_2</sub><br>C <sub>out25</sub> , t <sub>out6_1</sub><br>m <sub>3</sub>                            |
| 14    | Add<br>j loop<br>j loop  | C <sub>out24</sub> , t <sub>out4_4</sub><br>a(2), b(3), C <sub>out25</sub> , t <sub>out5_2</sub><br>m <sub>3</sub> , n(0), '0', t <sub>out6_0</sub>                 | C <sub>sig2</sub> , t <sub>out5_3</sub><br>C <sub>out26</sub> , t <sub>out6_2</sub><br>C <sub>out27</sub> , null                  |
| 15    | Add<br>j loop<br>j loop  | C <sub>sig2</sub> , t <sub>out4_5</sub><br>a(3), b(3), C <sub>out26</sub> , t <sub>out5_3</sub><br>m <sub>3</sub> , n(1), C <sub>out27</sub> , t <sub>out6_1</sub>  | t <sub>out5_4</sub><br>C <sub>out28</sub> , t <sub>out6_3</sub><br>C <sub>out29</sub> , t[0]                                      |
| 16    | Add<br>j loop            | C <sub>out28</sub> , t <sub>out5_4</sub><br>m <sub>3</sub> , n(2), C <sub>out29</sub> , t <sub>out6_2</sub>                                                         | t <sub>out6_4</sub> , t <sub>out6_5</sub><br>C <sub>out30</sub> , t[1]                                                            |
| 17    | j loop                   | m <sub>3</sub> , n(3), C <sub>out30</sub> , t <sub>out6_3</sub>                                                                                                     | C <sub>out31</sub> , t[2]                                                                                                         |
| 18    | Add                      | C <sub>out31</sub> , t <sub>out6_4</sub>                                                                                                                            | t[3]                                                                                                                              |

#### 4. Performance of CIOS Architecture

The generic CIOS architecture described in this paper supports varying operand sizes. If a different multiplication length is required, the architecture

does not need to be re-designed - the generic operand length is simply changed and the design re-synthesised.

```

for i = 0 to 3 loop
    if t(i) ≥ n(i)
        flag(i) = 1
    else
        flag(i) = 0
    end if
end loop
for i = 0 to 3 loop
    if flag(i) = 1
        Product = t - n
    else
        Product = t
    end if
end loop

```

**Figure 2. Pseudo-Code for Subtract Block**

To analyse the architecture's performance, 128-bit, 256-bit and 512-bit multiplier designs, with and without the subtract block, were implemented on XC2VP50 Virtex-II Pro devices. The designs were simulated using Modelsim and synthesised using Synplify Pro v7.3.1 and Xilinx Foundation software v5.2i. In order to verify correct operation of the CIOS multiplier architectures, test vectors were generated using the Big Number Calculator developed by Reinhold [20]. The following calculations are performed:

$$r = 2^k \quad (5)$$

where  $k$  is the operand length, and,

$$r^{-1} \pmod{n} \quad (6)$$

Next,  $n'$  is found, where,

$$n' = [(r * r^{-1}) - 1] / n \quad (6)$$

Finally, the Montgomery product,  $P$ , of two  $k$ -bit operands,  $A$  and  $B$  is calculated such that,

$$P = A * B * r^{-1} \pmod{n} \quad (6)$$

The 256-bit and 512-bit Montgomery multiplication test vectors are given in the Appendix. The 128-bit test-vectors are provided in [15].

The 128-bit, 256-bit and 512-bit operands,  $A$ ,  $B$  and  $n$  are input in 32-bit blocks, thus reducing the overall IOB count.



**Figure 3. CIOS Algorithm Architecture**

The full-length operand  $n'$  is not input since only the first word,  $n'[0]$  is utilised in the algorithm. When  $s = 4$ ,  $n'[0]$  is 32-bits, 64-bits or 128-bits in length and this is also input in 32-bit blocks.

The multipliers in the architecture are implemented using the 18x18-bit multiplier blocks available on the Virtex-II devices. Also, since there are a number of high fanout signals in the design, Synplify performs replication to improve timing. The performance results for the 128-bit, 256-bit and 512-bit multiplier architectures, with and without the subtract block, are outlined in Table 2.

Removing the subtract block from the Montgomery multiplication architecture does not significantly improve the overall clock speed. It does reduce the overall latency since the 4-cycle latency of the individual word comparisons is not incurred and hence, the throughput is improved. Evidently, the area is also reduced.

The five Montgomery multiplication methods described by Koç et al [1], are the CIOS algorithm, the Separated Operand Scanning (SOS) algorithm, the Finely Integrated Operand Scanning (FIOS) algorithm, the Finely Integrated Product Scanning (FIPS) algorithm and the Coarsely Integrated Hybrid Scanning (CIHS) algorithm. In their paper they provide performance figures for software implementations of these algorithms and conclude that the CIOS method is the most efficient. A 512-bit CIOS multiplication implemented in Assembly code on a Pentium-60 Linux system is executed in 0.122 ms [1]. In work previously carried out by the authors [14], hardware architectures of the SOS, CIOS and FIOS algorithms were implemented on Xilinx Virtex-II Pro devices and compared. The CIOS multiplier proved to be the fastest with the 128-bit

and 256-bit multiplier architectures both achieving speeds of 100 Mbps. Satoh and Takano [21] describe a scalable dual-field elliptic curve processor which comprises a Montgomery multiplication architecture. When implemented using a 0.13  $\mu$ m CMOS standard cell library, 256-bit multiplication is executed in 162 clock cycles. No throughput or area metrics for the multiplication architecture are given. Tenca and Koç [22] present a modular multiplication design which is scalable in terms of both area and operand size. Their 256-bit multiplication design is implemented using 0.5  $\mu$ m CMOS technology and for a wordsize of 8 and 40 pipeline stages, it executes in 7.4  $\mu$ s. All of these hardware Montgomery multiplication implementations exclude the conditional subtract.

**Table 2. CIOS Architecture Performance Results**

| Architecture Type    | Clock Speed (MHz) | Area (slices & multipliers) | No. of clock cycles | Throughput (Mbps)    |
|----------------------|-------------------|-----------------------------|---------------------|----------------------|
| 128-bit: Subtract    | 70.7              | 1522 slices<br>11 mults     | 43                  | 210 (0.6 $\mu$ s)    |
| 128-bit: No Subtract | 72                | 1341 slices<br>11 mults     | 39                  | 236.3 (0.54 $\mu$ s) |
| 256-bit: Subtract    | 53                | 3512 slices<br>42 mults     | 47                  | 289 (0.89 $\mu$ s)   |
| 256-bit: No Subtract | 56                | 3208 slices<br>42 mults     | 43                  | 334 (0.77 $\mu$ s)   |
| 512-bit: Subtract    | 35.8              | 8299 slices<br>164 mults    | 55                  | 333.6 (1.5 $\mu$ s)  |
| 512-bit: No Subtract | 40                | 7704 slices<br>164 mults    | 51                  | 401.5 (1.3 $\mu$ s)  |

The CIOS architectures presented here, and even those incorporating the subtract block, outperform these previously reported Montgomery multiplication architectures and in particular, they are capable of executing any length of multiplication in a significantly lower number of clock cycles. The difference in latency for the 128-bit, 256-bit and 512-bit operand lengths lies in the extra number of clock cycles necessary to input the values in 32-bit blocks. Therefore, the execution times for the 256-bit and 512-bit multiplications will be 4 and 12 clock cycles, respectively, more than the execution time for the 128-bit multiplication.

## 5. Conclusion

This paper describes a high speed Montgomery modular multiplication architecture. The CIOS algorithm, which is the most efficient [1, 14] of the five algorithms proposed by Koç *et al*, is utilised to

perform the Montgomery multiplication. The CIOS algorithm hardware architecture supports varying operand lengths. When the final subtraction of Montgomery's algorithm is included in the design, it runs at speeds of up to 334 Mbps. When the subtraction is not incorporated, the architecture achieves speeds of up to 400 Mbps. The final subtraction step can be removed if the multiplication architecture is used in implementations of Montgomery exponentiation, which is required by commonly used public key algorithms, such as RSA and El Gamal.

The high speed Montgomery multiplication architecture is accomplished through implementation using a state machine methodology. This enables the complex CIOS algorithm to be presented in a simple, straightforward manner and core components identified and designed in parallel. It also allows the multiplication to be executed in a fixed number of clock cycles regardless of the operand length. Variations in execution time for the different multiplier inputs only arise due to the latency incurred in loading the operand lengths.

The CIOS multiplication architecture presented is 3 times faster than previously reported CIOS hardware implementations and 80 times faster than equivalent software implementations.

## 6. Acknowledgements

This research has been funded under a Royal Academy of Engineering/EPSRC post-doctoral research fellowship.

## 7. References

- [1] C.K. Koç, T. Acar, B.S. Kaliski Jr., "Analyzing and Comparing Montgomery Multiplication Algorithms", *IEEE Micro.*, Vol. 16, No. 3, pp 26-33, June 1996.
- [2] P.L. Montgomery, "Modular Multiplication without Trial Division", *Mathematics of Computation*, Vol. 44, pp 512-521, April 1985.
- [3] W. Diffie and M. E. Hellman, "New directions in cryptography", *IEEE Transactions on Information Theory*, IT-22:644-654, 1976.
- [4] Digital Signature Standard (DSS), US Federal Information Processing Standards Publication 186-2, <http://csrc.nist.gov/publications/fips/fips186-2/fips186-2-change1.pdf>
- [5] R.L. Rivest, A. Shamir, L. Adleman, "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems", *Communications of the ACM*, 21(2): 120–126, February 1978.
- [6] T. ElGamal, "A Public-key cryptosystem and a Signature Scheme based on Discrete Logarithms", *IEEE Trans. on Information Theory*, vol.31, no.4, pp 469-472, '85
- [7] V.S. Miller, "Use of Elliptic Curves in Cryptography", *Advances in Cryptology (Crypto' 85)*, pp. 417-426, 1986.
- [8] N. Koblitz, "Elliptic Curve Cryptosystems", *Math. Computing*, Vol. 48, pp. 203-209, 1987.
- [9] S.E. Eldridge, C.D. Walter, "Hardware Implementation of Montgomery's Modular Multiplication Algorithm", *IEEE Trans. on Computers*, Vol 42, Issue 6, pp 693-699, 1993.
- [10] T. Blum and C. Paar, "Montgomery Modular Exponentiation on Reconfigurable Hardware", 14<sup>th</sup> IEEE Symposium on Computer Arithmetic (ARITH-14), April 14-16, Adelaide, Australia, 1999.
- [11] A. Tiountchik, E. Trichina, "RSA Acceleration with Field Programmable Gate Arrays", Proc. of 4th Australasian Conference on Information Security and Privacy, pp: 164 – 176, 1999
- [12] G. Orlando and C. Paar, "A scalable GF(p) elliptic curve processor architecture for programmable hardware", *Cryptographic Hardware and Embedded Systems*, CHES 2001, May 14-16, France, 2001
- [13] S. B. Ors, L. Batina, B. Preneel, J. Vandewalle, "Hardware Implementation of an Elliptic Curve Processor over GF(p)", Proc. of IEEE 14th International Conference on Application-specific Systems, Architectures and Processors, The Hague, The Netherlands, June 24-26, 2003, pp. 433-443.
- [14] C. McIvor, M. McLoone, J.V. McCanny, "FPGA Montgomery Multiplier Architectures - A Comparison", *IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'04)*, April 2004.
- [15] M. McLoone, C. McIvor, J.V. McCanny, "Montgomery Modular Multiplication Architecture for Public Key Cryptosystems", *IEEE Workshop on Signal Processing Systems (SiPS'04)*, October 2004.
- [16] C.D. Walter, "Montgomery's Multiplication Technique: How to make it Smaller and Faster", *Workshop on Cryptographic Hardware and Embedded Systems (CHES'99)*, LNCS 1717, pp 80-93, August 1999.
- [17] C.D. Walter, "Montgomery Exponentiation Needs no Final Subtractions", *Electronics Letters*, 35(21):1831-1832, October 1999.
- [18] C.D. Walter, "Precise Bounds for Montgomery Modular Multiplication and some Potentially Insecure RSA Moduli", *Topics in Cryptology – CT-RSA 2002*, LNCS 2271, pp 30-39, 2002.
- [19] G. Hachez, J. Quisquater, "Montgomery Exponentiation with no Final Subtractions: Improved Results", *Workshop on Cryptographic Hardware and Embedded Systems (CHES'00)*, LNCS 1965, pp 293-301, July 2000.
- [20] A. Reinhold, "Big Number Calculator Applet", URL: <http://world.std.com/~reinhold/BigNumCalc.html>, Mar 2004.

- [21] A. Satoh, K. Takano, "A Scalable Dual-Field Elliptic Curve Cryptographic Processor", *IEEE Transactions on Computers*, vol. 52, no. 4, April 2003.

- [22] A. Tenca, C.K. Koç, "A Scalable Architecture for Modular Multiplication Based on Montgomery's Algorithm", *IEEE Transactions on Computers*, vol. 52, no. 9, Sept 2003.

## **8. Appendices**

## **Appendix 1: 256-bit Montgomery Modular Multiplication Test Vectors**

TEST1:

$n = b_0$

MonMult =

412093eaf12d0d473be5966aa7ad29c  
9172b3877641ecda4a75afb9bb90ba87

## TEST2 :

— — — — —

**MonMult** =

33

### MonMuLT

Format -  
487448233b556c30a1820ef6bb44893d  
d67cd631adf7af48c86c667306f04cb2

#### TEST4 :

```
MonMult =  
    349a9b139d4796e8d9077d162d175fcb  
    b5f5721a17a35410625487eee5146778
```

TEST5

```
MonMult =  
    5e002a4b652c9b3cf0cbbb062fecbb96  
    79e3c25ada22092d5461b760eb7ed05b
```

## **Appendix 2: 512-bit Montgomery Modular Multiplication Test Vectors**

### TEST1 :



