

## EL6463 Project Report

# 32-bit Processor Design

Group No.08

|              |        |
|--------------|--------|
| Junchen Su   | js9133 |
| Liu Yang     | ly1032 |
| Zhiyu Wen    | zw1305 |
| Yuchen Zhang | yz3814 |

## 1. Introduction

In this project a 32-bit processor called NYU-6463 Processor was designed and implemented on the Nexys 4 Artix-7. The processor has three instruction types: (a) R-Type for arithmetic instructions, (b) I-Type for immediate value operations, load and store instructions, and (c) J-Type for jump instructions. To test the performance of the processor, we wrote the assembly program for RC5 algorithm, including key expansion, encryption, and decryption.

### Performance Summary

|                                                      |           |
|------------------------------------------------------|-----------|
| Encryption Cycles for Assigned Test Vector           | 1030      |
| Decryption Cycles for Assigned Test Vector           | 1032      |
| Round Key Generation Cycles for Assigned Test Vector | 4663      |
| Maximum Frequency                                    | 137.8 MHz |
| LUT Utilization                                      | 4.5%      |
| Number of Test Vectors for Functional Simulation     | 1001      |
| Number of Test Vectors for Timing Simulation         | 5         |

## 2. Processor Structure

Key components includes ALU, Instruction Fetch, Control Unit, Data Memory, Register Memory, and etc. As shown in Figure 2.1, our design is based on a hierarchy of three levels, and the block diagram can be found in Appendix I.

```

    ▾  cpu(Behavioral) (cpu.vhd) (5)
        ▾  DMem : DataMemory(Behavioral) (DataMemory.vhd)
        ▾  InstrFetch : Instruction_Fetch(Behavioral) (Instruction_Fetch.vhd) (8)
            ▾  PC : Program_Counter(Behavioral) (Program_Counter.vhd)
            ▾  InstrMem : InstructionMemory(Behavioral) (InstructionMemory.vhd)
            ▾  PC_Adder : ADDER(Behavioral) (ADDER.vhd)
            ▾  Branch_Adder : ADDER(Behavioral) (ADDER.vhd)
            ▾  BranchMUX : MUX(Behavioral) (MUX.vhd)
            ▾  JumpMUX : MUX(Behavioral) (MUX.vhd)
            ▾  HaltMUX : MUX(Behavioral) (MUX.vhd)
            ▾  nextpcMUX : MUX(Behavioral) (MUX.vhd)
            ▾  CtrlUn : control_unit(Behavioral) (control_unit.vhd)
            ▾  ALUMAIN : ALU(Behavioral) (ALU.vhd)
            ▾  RegFile : RegisterFile(Behavioral) (RegisterFile.vhd)

```

Figure 2.1. Architecture of the Processor.

### 3. Functional Simulation

The key expansion part starts from 1 to 246 in instruction address, encryption from 247 to 658, and decryption from 247 to 260 & 659 to 1056.

For this testing case,

ukey:=x"a1b21111a1b22222a1b28888abcd1111" for key expansion

din="dd0000bb00991111" for encryption

din="6854b2a051e26180" for decryption

To verify the results in a more general range, 1000 cases are tested in Part 6 using TextIO



Figure 3.1. Key Expansion.



Figure 3.2. Key Expansion Detailed Waveform.



Figure 3.3. Encryption, with result 6854b2a051e26180.



Figure 3.4. Decryption, with result dd0000bb00991111.

## 4. Timing Simulation

### 4.1 Screenshots of Waveform

The skey will be stored in data memory [25 downto 0], we used five test cases for both encode and decode, and before them get executed, they key expansion will be run too. So 10 test cases for key expansion. All of our test case starts from 291ns, which is the time that the instruction begin to be executed, so we can calculate the total cycles of each function, and finally we will calculate the average latency of these functions.



Figure 4.1.0

## Encoding Test case1:

ukey:=x"215447ea3c30c306e990034a87d6664e" din="26f249bec27b6a35"

Results generated by C code:

skey: 9bbbd8c8 1a37f7fb 46f8e8c5 460c6085 70f83b8a 284b8303 513e1454 f621ed22  
 3125065d 11a83a5d d427686b 713ad82d 4b792f99 2799a4dd a7901c49 dede871a 36c03196  
 a7efc249 61a78bb8 3b0a1d2b 4dbfc76 ae162167 30d76b0a 43192304 f6cc1431 65046380

dout: 6c2ccfc34482ae08

## Key Expansion:



Figure 4.1.1

## Encoding:



Figure 4.1.2

### Encoding Test case2:

ukey:=x"915f4619be41b2516355a50110a9ce91" din="eedba5216d8f4b15"

Results generated by C code:

skey: da22168a d00cc77e c36b960f ddccdd485 b7afcd42 2a013c76 b49bfa76 d55d677e 6ca89043 2d1f2e62 f2a3096c 5eb4f29c 7a1ff0b3 999f4e5d 22443887 9d6e90af cb96af38 97bcb916 f195c8c5 5feb96df fcac3a65 21e070f7 764c1b47 65fa14e2 13f9851f 31e62f57

dout: 2ab8bc63a296b969

### Key Expansion:



Figure 4.1.3

**Encode:**

Figure 4.1.4

**Encoding Test case3:**

ukey:=x"0317a4b760adbaa3888ddaa727831103" din="2ab8bc63a296b969"

Results generated by C code:

skey: 76145621 200edca6 f5689e9f e30540e4 8961fee2 326f7fdf 8b539794 6691637c  
aa8de1c8 8d952b58 0717ae17 5605d2cb 4c7d9520 69eebf50 ee79a5dd 9b13a3a5 ab9444f1  
91bd3630 7d61b3a2 bdb1d70a 746f682f bf047814 605b9a27 702eccef a028fe3d 26bbab42

dout: ca890f566e6d7511

**Key Expansion:**

Figure 4.1.5

**Encode:**

Figure 4.1.6

**Encoding Test case4:**

ukey:=x"b9f8f092def0f53602c2147ab69808e6" din="ca890f566e6d7511"

Results generated by C code:

skey: 62aff981 7293a930 dbaac5f3 dc6e0b4a f0f29881 773ec5db 8f25847b f4119559  
 ab2628a9 39323f32 303c58a6 7b9a708a 60e54974 7a107051 38aaacc4 9ff1ade1 dec16101  
 03690bda 8c970453 6557a021 78673a75 a8959705 9b3bfcbc 0e1f57e0 e5e755c2 e86c3228

dout: b2945e5eb833de33

**Key Expansion:**

Figure 4.1.7

**Encode:**

Figure 4.1.8

**Encoding Test case5:**

ukey:=x"ue040cbd769fb85516abe8aee6a302255e" din="b2945e5eb833de33"

Results generated by C code:

skey: 55886c33 b7253089 385a037a cf39203d c0462d1e bf22be0e b3a26f0e b02087de  
 220b6129 a5d00e1c 42653bc3 17ea11f3 220c14ca 0e348b9d a140cff0 3b5a5c00 72549888  
 fd0a5fb5 fa2eafc1 1139b8a4 926a425a 06900e4a 753935a2 aa94c7c7 4f6fb6c5 ee22c8f0

dout: 26f249bec27b6a35

**Key Expansion:**



Figure 4.1.9

### Encode:



Figure 4.1.10

### Decode Test case1:

ukey:=x"915f4619be41b2516355a50110a9ce91" din="2ab8bc63a296b969"

Results generated by C code:

skey: da22168a d00cc77e c36b960f ddccdd485 b7afcd42 2a013c76 b49bfa76 d55d677e 6ca89043 2d1f2e62 f2a3096c 5eb4f29c 7a1ff0b3 999f4e5d 22443887 9d6e90af cb96af38 97bcb916 f195c8c5 5feb96df fcac3a65 21e070f7 764c1b47 65fa14e2 13f9851f 31e62f57

dout: eedb5216d8f4b15

### Key Expansion



Figure 4.1.11

### Decode:



Figure 4.1.12

### Decode Test case2:

ukey:=x"0317a4b760adbaa3888ddaa727831103" din="ca890f566e6d7511"

Results generated by C code:

skey: 76145621 200edca6 f5689e9f e30540e4 8961fee2 326f7fdf 8b539794 6691637c  
aa8de1c8 8d952b58 0717ae17 5605d2cb 4c7d9520 69eebf50 ee79a5dd 9b13a3a5 ab9444f1  
91bd3630 7d61b3a2 bdb1d70a 746f682f bf047814 605b9a27 702eccef a028fe3d 26bbab42

dout: 2ab8bc63a296b969

### Key Expansion



Figure 4.1.13

### Decode:



Figure 4.1.14

### Decode Test case3:

ukey:=x"b9f8f092def0f53602c2147ab69808e6" din="b2945e5eb833de33"

Results generated by C code:

skey: 62aff981 7293a930 dbaac5f3 dc6e0b4a f0f29881 773ec5db 8f25847b f4119559  
 ab2628a9 39323f32 303c58a6 7b9a708a 60e54974 7a107051 38aaacc4 9ff1ade1 dec16101  
 03690bda 8c970453 6557a021 78673a75 a8959705 9b3bfcbc 0e1f57e0 e5e755c2 e86c3228

dout: ca890f566e6d7511

**Key:**



Figure 4.1.15

**Decode:**



Figure 4.1.16

#### Decode Test case4:

ukey:=x"040cbd769fb85516abe8aee6a302255e" din="26f249bec27b6a35"

Results generated by C code:

skey: 55886c33 b7253089 385a037a cf39203d c0462d1e bf22be0e b3a26f0e b02087de  
220b6129 a5d00e1c 42653bc3 17ea11f3 220c14ca 0e348b9d a140cff0 3b5a5c00 72549888  
fd0a5fb5 fa2eafc1 1139b8a4 926a425a 06900e4a 753935a2 aa94c7c7 4f6fb6c5 ee22c8f0

dout: b2945e5eb833de33

#### Key Expansion:



Figure 4.1.17

### Decode:



Figure 4.1.18

### Decode Test case5:

ukey:=x"215447ea3c30c306e990034a87d6664e" din="6c2ccfc34482ae08"

Results generated by C code:

skey: c4fe7385 8f3ecb7e 75967249 d38d83ea d0d25abc a1a3901e 1f5c872a d5f913ac  
 cce9d784 10eb1c2e 8f90dcf7 d2ae0f9e 156df151 43ab34a6 c3e66456 31b6ea41 e7e88ad1  
 22129b1f 4d051e51 6d6791b5 7a0f5281 7190ade8 d37ad057 f683b767 3a94ddfa e3015728

dout: 26f249bec27b6a35

### Key Expansion:



Figure 4.1.19

### Decode:



Figure 4.1.20

## 4.2 Resources Utilization

The resource utility of the synthesis is illustrated as below:

| Resource | Utilization | Available | Utilization % |
|----------|-------------|-----------|---------------|
| LUT      | 2876        | 63400     | 4.54          |
| FF       | 2320        | 126800    | 1.83          |
| IO       | 49          | 210       | 23.33         |

Figure 4.2.1

The time summary of the synthesis is illustrated as below:

Figure 4.2.2

The resource utility of the place and route is illustrated as below:

| Resource | Utilization | Available | Utilization % |
|----------|-------------|-----------|---------------|
| LUT      | 2874        | 63400     | 4.53          |
| FF       | 2320        | 126800    | 1.83          |
| IO       | 49          | 210       | 23.33         |

Figure 4.2.3

The time summary of the place and route is illustrated as below:

Figure 4.2.4

### 4.3 Timing Performance

*Table 1. Performance Summary*

|                         |                                                                                               |
|-------------------------|-----------------------------------------------------------------------------------------------|
| Critical path delay     | 7.258 ns                                                                                      |
| Maximum clock frequency | 137.779 MHz                                                                                   |
| Latency                 | 4616 clock cycles for key expansion<br>1136 clock cycles for enc<br>1146 clock cycles for dec |
| Propagation delay       | 33.50 us for key expansion<br>8.25 us for enc<br>8.32 us for dec                              |

## 5. RC5 in Assembly Code and Machine Code

## 5.1 Key Expansion

- This part is used to compare the value of  $A+B$  with 1 to 31

```

"00011100000001100000000000000000", --247 Lw 0(R00) S[0](R03) 0(ImmE0 Load s[0] to R3
"00011000000100000000000000000001", --248 Lw 0(R00) S[1](R04) 1(Imm) Load s[1] to R4
"00000100000010100000000000000000", --249 Addi 0(R00) Temp1(R05) Din(63 downto 48)
"00010100101001010000000000000000", --250 Shl Temp1(R05) Temp1(R05) 16 Load Din(63 downto 48)
"00000100000010000000000000000000", --251 Addi 0(R00) Temp2(R06) Din(47 downto 32)
"00010100110001100000000000000000", --252 Shl Temp2(R06) Temp2(R06) 16
"00011000110001100000000000000000", --253 Shr Temp2(R06) Temp2(R06) 16 Load Din(47 downto 32)
"00010001010011000000000000000000", --254 Add Temp1(R05) Temp2(R06) A(R01) Load A
"00000100000010100000000000000000", --255 Addi 0(R00) Temp1(R05) Din(31 downto 16)
"00010100101001000000000000000000", --256 Shl Temp1(R05) Temp1(R05) 16 Load Din(31 downto 16)
"00000100000011000000000000000000", --257 Addi 0(R00) Temp2(R06) Din(15 downto 0)
"00010100110001100000000000000000", --258 Shr Temp2(R06) Temp2(R06) 16
"00011000110001100000000000000000", --259 Shr Temp2(R06) Temp2(R06) 16 Load Din(15 downto 0)
"00000000101001000100000000000000", --260 Add Temp1(R05) Temp2(R06) B(R02) Load A

```

-----SELECTION-----

```

"00000100000010110000000000000000", --261 Addi 0(R00) Mode(R11) 0—encode 1— decode
"00100100000010110000001100001100", --262 Blt 0(R00) Mode(R11)
"00000000001000100010000000000000", --263 Add A(R01) S[0](R03) A(R01) A+s[0]
"00000000001000100000000000000000", --264 Add B(R02) S[1](R04) B(R02) B+s[1]
"00000000000000000000000000000000", --265 Add 0(R00) 0(R00) I(R05) Initialize I
"000001000000101000000000000000", --266 Addi 0(R00) 24(R10) 24
"00101000010101000000000000000000", --267 Beq I(R05) 24(R10) 388 ±0

"00000100100101001000000000000000", --268 Addi I(R05) I(R05) 2 I=I+2
"00011000101000110000000000000000", --269 Lw I(R05) S[2*i](R03) 0 Load s[2*i]
"00011000101000100000000000000000", --270 Lw I(R05) S[2*i+1](R04) 1 Load s[2*i+1]
"00000000001000000000000000000000", --271 And A(R01) B(R02) C(R06)
"00000000001000000000000000000000", --272 Nor C(R06) 0(R00) C(R06)
"00000000001000000000000000000000", --273 Or A(R01) B(R02) D(R07)
"00000000001000000000000000000000", --274 And C(R06) D(R07) A(R01) A Xor B
"00000000001000000000000000000000", --275 Andi B(R02) Rotator(R07) 0000:11111 B(4 downto 0)
"00000100000000000000000000000000", --276 Addi 0(R00) counter(R11) 1
"00101100011001010000000000000000", --277 Bne Rotator(R07) counter(R11) 5
"00010100000000000000000000000000", --278 Shl A(R01) Temp_left(R08) 1
"00011000000000000000000000000000", --279 Shr A(R01) Temp_right(R09) 31
"00000000000000000000000000000000", --280 Add Temp_left(R08) Temp_right(R09) A(R01) Round rotate A
"00110000000000000000000000000000", --281 Jmp ½x
"00000000000000000000000000000000", --282 Addi counter(R11) counter(R11) 1
"00101100011001010000000000000000", --283 Bne Rotator(R07) counter(R11) 4
"00010100000000000000000000000000", --284 Shl A(R01) Temp_left(R08) 2
"00011000000000000000000000000000", --285 Shr A(R01) Temp_right(R09) 30
"00000000000000000000000000000000", --286 Add Temp_left(R08) Temp_right(R09) A(R01) Round rotate A
"00110000000000000000000000000000", --287 Jmp ½x
"00000000000000000000000000000000", --288 Addi counter(R11) counter(R11) 1
"00101100011001010000000000000000", --289 Bne Rotator(R07) counter(R11) 4
"00010100000000000000000000000000", --290 Shl A(R01) Temp_left(R08) 3
"00011000000000000000000000000000", --291 Shr A(R01) Temp_right(R09) 29
"00000000000000000000000000000000", --292 Add Temp_left(R08) Temp_right(R09) A(R01) Round rotate A
"00110000000000000000000000000000", --293 Jmp ½x
"00000000000000000000000000000000", --294 Addi counter(R11) counter(R11) 1
"00101100011001010000000000000000", --295 Bne Rotator(R07) counter(R11) 4
"00010100000000000000000000000000", --296 Shl A(R01) Temp_left(R08) 4
"00011000000000000000000000000000", --297 Shr A(R01) Temp_right(R09) 28
"00000000000000000000000000000000", --298 Add Temp_left(R08) Temp_right(R09) A(R01) Round rotate A
"00110000000000000000000000000000", --299 Jmp ½x
"00000000000000000000000000000000", --300 Addi counter(R11) counter(R11) 1
"00101100011001010000000000000000", --301 Bne Rotator(R07) counter(R11) 4
"00010100000000000000000000000000", --302 Shl A(R01) Temp_left(R08) 5
"00011000000000000000000000000000", --303 Shr A(R01) Temp_right(R09) 27
"00000000000000000000000000000000", --304 Add Temp_left(R08) Temp_right(R09) A(R01) Round rotate A
"00110000000000000000000000000000", --305 Jmp ½x
"00000000000000000000000000000000", --306 Addi counter(R11) counter(R11) 1
"00101100011001010000000000000000", --307 Bne Rotator(R07) counter(R11) 4

```

We firstly import the ukey into the data memory by the addi operation, and store it in data memory 26 to 29 as L array. Since the addi operation can only get 16 bits of immediate number, so in order to import 128 bits to 4 32-bits slot, we need to do two addi operations for each slot.

For example, in order to import the ukey[31 downto 0], we first addi ukey[31 downto 16], and store it into a register, then import ukey[15 downto 0]. Finally, we shift the ukey[31 downto 16] to left by 16 bits, and then add it to ukey[15 downto 0]. But there is a trick here: when we use addi operation, we actually get the sign extension of the immediate number, so if the ukey[15 downto 0] has MSB '1', then the final addition result will be wrong. The following key expansion operations are regular loops that keep updating S array.

## 5.2 Enc&Dec:

A select mode is used to determine running encryption or decryption, from line 247 to 262 in instruction. Din is loaded by 16 bit per time before the actual instruction for selection is being executed, since the Initialization for encryption and decryption is same. The output of encryption is storage in R30&R31, and the output of decryption is storage in R28&R29.

## 5.3 Optimization:

In our previous design, when we want to do

$$B = L[j] = (L[j] + A + B) \ll (A + B);$$

we set the A+B value into a register, say R11, and we set an index register, start in loop with value 0, say R12. Then we use BNE to compare R11 and R12, if R11 not equal to R12, we shift (L[j] + A + B) to left one bit, then finally jump back to the BNE instruction. But we found that this method could be really time consuming. It cost us 9000 cycles to finish the key expansion.

Then we optimized our algorithm, when we get the A+B value, we compared it with 1 to 31. For example, we put immediate number 1 into R11, and we compared it with A+B with BNE, if equal, shift 1 bit left, if not, jump to the next step BNE and compare it with 2 and so on so forth. Although it takes more instruction memory, it does save us a lot of time, we can finish the key expansion about 4800 cycles.

# 6. Verification of the Results

## 6.1 TextIO Verification

To verify the results in a more general range, 1001 cases generated by C code are tested using TextIO. Results are shown in the lower windows as Round Key Success, Encryption Success, and Decryption Success.

**Encryption:**



Figure 6.1

## Decryption:



Figure 6.2

## Round Key:



Figure 6.3

## 6.2 Test Vector for Final Analysis:

### Encode Test:

ukey: x" a1b21111 a1b22222 a1b28888 abcd11161", din: x"dd0000bb 00991111"

skey: d79c5c23 595062a9 243d544f f8403345 442273c1 ce49b445 48f9b1bd e2962dc9  
 ae01748c 8b021996 2a9a3f93 5a159f09 5b33435f 14c0deae b2f89959 e6ce3921 d2b1be3b  
 6ceac0c7 87c826ec 70cd4587 edca72f2 eb1327f4 ae7326d0 7f02eb80 4196cb32 7c0738a3

dout: x"6854b2a0 51e26180"

### Key Expansion:



Figure 6.4

### Encode:



Figure 6.5

### Decode Test:

ukey: x" a1b21111 a1b22222 a1b28888 abcd1111", din: x"6854b2a0 51e26180"

skey: d79c5c23 595062a9 243d544f f8403345 442273c1 ce49b445 48f9b1bd e2962dc9  
 ae01748c 8b021996 2a9a3f93 5a159f09 5b33435f 14c0deae b2f89959 e6ce3921 d2b1be3b  
 6ceac0c7 87c826ec 70cd4587 edca72f2 eb1327f4 ae7326d0 7f02eb80 4196cb32 7c0738a3

dout: x"dd0000bb 00991111"

### Key Expansion:



Figure 6.6

### Decode:



Figure 6.7

The number of cycles for round key generation: 4663 cycles

The number of cycles for encryption: 1030 cycles

The number of cycles for decryption: 1032 cycles

## 7. Implementation on FPGA

The input of the FPGA board uses 3 buttons, 13 switches.

**Buttons:**

There are 3 buttons used for input: Button center, Button down and Button right.

Button right: press to +1 to the temporary bit for input user key and Din. The temporary bit will cycle from 0 to F and back to 0. The temporary bit can be select to replace any hex bit in the Input.

Button center: press to reset the whole FPGA.

Button down: in single instruction mode, press to execute the next instruction

**Switches:**

Switch<0>: used to select the between single instruction mode or full execution mode. sw<0>=0 means full execution mode. sw<0>=1 means single instruction mode.

Switch<1>: used to start the encode or decode process. sw<1>=1 means start encode/decode. sw<1>=0 is when user will input Din and user key.

Switch<2>: used to select to input for Din or Ukey. sw<2>=1 means input for Ukey. sw<2>=0 means input for Din.

Switch<3>: used to select encode or decode operation. sw<3>=1 means decode. sw<3>=0 means encode.

Switch<8 downto 4> : used to select which hex bit of Din or Ukey does user want to input. “00000” means input for the first hex bit. “00001” means input for the second hex bit.

*Notice: the Din only has 16 hex bits. So any combination larger than “01111” will not change any bit on Din.*

Switch<10>: used to push the temporary bit into the selected hex bit. Push the switch up once to push the temporary hex bit into the selected hex bit.

Switch <15 downto 13>: used to select which 8 hex bit to display on the 7 segment Digital LED display. Combining with the mode selector sw<0>. It can display several value. the combinations are: sw<15 downto 13> & sw<0>:

*Full execution mode:*

“0000” Din(31 downto 0)

“1000” Din(63 downto 32)

“0100” Dout(31 downto 0)

“1100” Dout(63 downto 0)

“0010” ukey(31 downto 0)

“0110” ukey(63 downto 32)

“1010” ukey(95 downto 64)

“1110” ukey(127 downto 96)

*Single instruction mode:*

“0001” Current Instruction (32 bit)  
“0011” Relevant Rs value (32 bit)  
“0101” Relevant Rt value (32 bit)  
“0111” Relevant Rd value(32 bit)  
“1001” Relevant Immediate number value (32 bit)  
“1011” Current instruction number  
“1101” Next instruction number

**Output:**

The 7 segment digital LED display shows hex value of input and output.

**LED lights:**

led<15>: when led<15> is on, it indicates the Dout is ready.

In single instruction mode, led<6 downto 0> are useful.

led<3 downto 0>: hard-wired to opcode<3 downto 0>. So that it can distinguish R type instruction and addi, subi, jump, halt etc.

led<6 downto 4>: hard\_wired to Function code<2 downto 0>. So when led<3 downto 0> are all off (all 0s). It can distinguish different R type instruction, add, and, or, etc.

**Demo Video Address**

<https://youtu.be/JrCvFFhndmw>

**Appendix I. Block Diagram (In Next Page)**

