

# Digital Circuit Lab 2 Report

林軒霆, 蔡丞彥, 侯奕安

October 14, 2025

## 1 Introduction

In Lab 2, we implemented an RSA-256 decryption module. The overall structure includes a `core.sv` for the calculation and a `wrapper.sv` for controlling the core and transferring data between RS232 and the FPGA. The core uses exponentiation by squaring to reduce the number of multiplications and applies the Montgomery algorithm to calculate  $ab \bmod N$  faster.

## 2 File Structure

For Lab2, our code is written in the provided `Rsa256Core.sv` and `Rsa256Wrapper.sv` files, and no other code is modified or added. The files we submitted are shown in the figure above. For demonstration, replacing the original files with these would be sufficient.



Figure 1: Hierarchy of Files

### 3 System Architecture

The top module is `Rsa256Wrapper`, which includes a submodule `Rsa256Core`. The core contains two submodules, `RsaPrep` and `RsaMont`. The first submodule prepares  $a^{2^{256}} \bmod N$  and  $b^{2^{256}} \bmod N$  for the Montgomery algorithm, while `RsaMont` performs the algorithm itself. The relationship between signals is shown in the figure below (clock and reset signals are synchronized with the wrapper input and are thus omitted).



Figure 2: Architecture of Our Design

## 4 Hardware Scheduling

### 4.1 FSM in Wrapper

The FSM used in the wrapper is shown in the figure below. We use four states, and the transitions between them are straightforward. The initial state is `S_GET_KEY`:

1. `S_GET_KEY` to `S_GET_DATA`: When the input bus is ready, the input address is `RX_BASE`, and the byte counter has decreased to zero. This means that the key has been successfully received.
2. `S_GET_DATA` to `S_WAIT_CALCULATE`: When the input bus is ready, the input address is `RX_BASE`, and the byte counter has decreased to zero. This means that the encrypted data has been successfully received.
3. `S_WAIT_CALCULATE` to `S_SEND_DATA`: When the RSA decryption is done.
4. `S_SEND_DATA` to `S_GET_KEY`: When the output bus is ready, the output address is `TX_BASE`, and the byte counter has decreased to zero. This means that the decrypted data has been successfully sent.



Figure 3: FSM in Wrapper

## 4.2 FSM in Core

The FSM used in the core is shown in the figure below. The transitions between states are as follows:

1. S\_IDLE to S\_PREP: When the start signal is triggered.
2. S\_PREP to S\_CALC: When the preparation is done, which is signaled by RsaPrep.
3. S\_CALC to S\_MONT: When  $\text{bit\_counter} < 256$  (still processing bits).
4. S\_MONT to S\_CALC: When the current Montgomery calculation is done and no extra multiplication is needed.
5. S\_MONT to S\_MONT: When the current Montgomery calculation is done and one extra multiplication ( $t \times t \bmod N$ ) is needed.
6. S\_CALC to S\_DONE: When  $\text{bit\_counter} \geq 256$  (all bits processed).



Figure 4: FSM in Core

### 4.2.1 FSM in RsaPrep and RsaMont

The FSM used in RsaPrep is shown in the figure below. The transitions between states are

1. S\_IDLE to S\_REDUCE: When start signal is triggered.

## 4 HARDWARE SCHEDULING

---

2. **S\_REDUCE** to **S\_REDUCE**: When  $x \geq i\_n$ . Make sure  $x < n$  before proceeding to **S\_REDUCE**.
3. **S\_REDUCE** to **S\_DOUBLE**: When  $x < n \ \&\& \text{counter} < 256$ . In this state, perform the operation of  $2 \times x \bmod i\_n$ .
4. **S\_REDUCE** to **S\_FINAL**: When  $x < n \ \&\& \text{counter} \geq 256$ , now  $x = y \times 2^{256} \bmod i\_n$ .
5. **S\_FINAL** will only take one cycle and then return to **S\_DONE**. Here, the last modulus operation is performed to ensure the output  $t < i\_n$ .
6. In **S\_DONE**, the finished signal is set to 1, and the state returns to **S\_IDLE** to wait for the next use.



Figure 5: FSM in RsaPrep

The FSM used in RsaMont is shown in the figure below. The transitions between states are

1. **S\_IDLE** to **S\_CALC**: When start signal is triggered.
2. **S\_CALC** to **S\_REDUCE**: When  $\text{counter} \geq 256$ . **S\_CALC** iterates through all 256 bits of operand **a**, performing Montgomery multiplication's core algorithm. For each bit position **i**, it examines **a[i]**:

## 4 HARDWARE SCHEDULING

---

- $a[i] = 1$ : Add  $b$  to  $m$ , then divide by 2. Since division must be exact, the algorithm ensures  $b + m$  is even before the right shift. It checks the XOR of the LSBs: if  $m[0] \oplus b[0] = 1$  is odd, it adds the odd modulus  $n$  to make the sum even. Otherwise,  $m = (m + b) >> 1$ .
- $a[i] = 1$ : Only divide  $m$  by 2. If  $m[0] = 1$ , add  $n$  first:  $m = m + n >> 1$ . If  $m[0] = 0$ :  $m = m >> 1$ .

When all 256 bits are processed, the state transitions to  $S\_REDUCE$ .

3.  $S\_REDUCE$  will only take one cycle and then return to  $S\_DONE$ . Here, the last modulus operation is performed to ensure the output  $m < n$ .
4. In  $S\_DONE$ , the finished signal is set to 1, and the state returns to  $S\_IDLE$  to wait for the next use.



Figure 6: FSM in RsaMont

## 5 SCREEN SHOT OF FITTER SUMMARY

---

### 5 Screen shot of Fitter Summary

| Fitter Summary                     |                                             |
|------------------------------------|---------------------------------------------|
| Fitter Status                      | Successful - Tue Oct 14 18:01:33 2025       |
| Quartus II 64-Bit Version          | 15.0.0 Build 145 04/22/2015 SJ Full Version |
| Revision Name                      | DE2_115                                     |
| Top-level Entity Name              | DE2_115                                     |
| Family                             | Cyclone IV E                                |
| Device                             | EP4CE115F29C7                               |
| Timing Models                      | Final                                       |
| Total logic elements               | 8,070 / 114,480 ( 7 % )                     |
| Total combinational functions      | 6,911 / 114,480 ( 6 % )                     |
| Dedicated logic registers          | 4,015 / 114,480 ( 4 % )                     |
| Total registers                    | 4015                                        |
| Total pins                         | 518 / 529 ( 98 % )                          |
| Total virtual pins                 | 0                                           |
| Total memory bits                  | 0 / 3,981,312 ( 0 % )                       |
| Embedded Multiplier 9-bit elements | 0 / 532 ( 0 % )                             |
| Total PLLs                         | 1 / 4 ( 25 % )                              |

Figure 7: Fitter Summary

## 7 PROBLEMS ENCOUNTERED

---

## 6 Screen shot of Timing Analyzer



Figure 8: Timing Analyzer

## 7 Problems Encountered

The most significant obstacle we encountered in this lab was related to Quartus. Since we needed to include a .qsys file in this experiment, the overall procedure in Quartus was much more complicated than in Lab 1. During our first attempt, the .sv files could not run on the FPGA board. After several hours of debugging, we finally discovered that the issue was caused by not importing the .qsf file. This experience taught us that carefully following every step in the tutorial can save a huge amount of time.