

# EL-GY 6463 Final Project Report

## NYU-6463 Processor Design and RC5 Implementation

Mengyuan Zhu ( mz1893 )  
Shuai Zhang ( sz1950 )  
Yezhong Xu ( yx1209 )  
Can Xu ( cx461 )  
Johnny Chu ( jc6875 )  
QiaoXin Xing ( qx389 )

## Part I Introduction

For the advance computer hardware final project we are tasked with implementing a 32 bit processor in VHDL, creating all the individual components that make up the processor, and the following test benches for testing the processor and each of its individual components. The processor that we chose to simulate for this project is the following shown below:



The following architecture was taken off the slides in class. The core components that will be discussed in this report are the CPU main code, the program counter, the control unit, the data memory, the instruction memory, the ALU, and the register.

The CPU code

This is our top level code

Program counter

The program counter also known as the PC indicates where a computer is in its program sequence. The PC is incremented after fetching an instruction, and holds the memory address of the next instruction that would be executed. In the following code snippet shown below:

```

23
24 entity pc is
25   port(
26     clk: in std_logic;
27     address_to_load: in std_logic_vector(31 downto 0);
28     current_address: out std_logic_vector(31 downto 0)
29   );
30 end pc;
31
32 architecture Behavioral of pc is
33
34   signal address: std_logic_vector(31 downto 0):= "00000000000000000000000000000000";
35
36 begin
37
38   process(clk)
39   begin
40     current_address <= address;
41     if clk='1' and clk'event then
42       address <= address_to_load;
43     end if;
44   end process;
45
46 end Behavioral;
47

```

## Instruction Memory

The instruction memory is the part of a CPU's control unit that holds the instruction currently being executed or decoded. The instruction to be executed is loaded into the instruction register which holds it while it is being decoded, prepared, and executed.

## The control unit

The control unit is the component that fetches instructions from the memory and begins the execution of those instructions by directing the coordinated operations of the ALU, registers and other components.

The control unit sets up the behavior of all the MUXes used in the processor. It takes in the opcode from the top level CPU code main and configures MUX1 with reg\_dest which tells the register to read or write. In MUX2 with alu\_src it modifies the data headed into the ALU. In mux3 it decides if the alu\_result and read\_data from data memory is going into the Register's Write Data. In MUX4 it is one of the factors in deciding whether or not to increment by 2 or 4. In MUX5 with jump it chooses between the result of mux4 and the jump address.

```

200
201 CONTROL1: control port map (
202   opcode => opcode,
203   reg_dest => reg_dest,
204   jump => jump,
205   branch => branch,
206   mem_read => mem_read,
207   mem_to_reg => mem_to_reg,
208   mem_write => mem_write,
209   alu_src => alu_src,
210   reg_write => reg_write,
211   alu_op => alu_op
212 );

```

Other components that it influences is the data memory, the register, and the ALU control. It tells the data memory whether or not it is in read or write mode with mem\_read and

mem\_write, tells the register when it is allowed to write with reg\_write, and informs the ALU control of what operations to do.

## Data Memory

The data memory is a register that stores the data that is being transferred to and from the immediate access storage. In our processor data memory takes in mem\_write, mem\_read, di\_vld, din, and ukey from the Top level CPU code, alu\_result from the ALU, and read\_data\_2 from the register. While sending out read\_data, enc\_out, and dec\_out back to the information back to the top level CPU code main. It takes in

```
299
300  MEM: memory port map (
301    clk => clk,
302    clr => clr,
303    address => alu_result,
304    write_data => read_data_2,
305    MemWrite => mem_write,
306    MemRead => mem_read,
307    di_vld => di_vld,
308    din => din,
309    ukey => ukey,
310    read_data => mem_read_data,
311    enc_out => enc_out,
312    dec_out => dec_out
313 );
```

## The ALU

The ALU or arithmetic logic unit is a combinational digital electronic circuit that performs arithmetic and bitwise operations on integer binary numbers. In this simulated processor the ALU is split into 2 parts, the ALU and the ALU control.

First is the ALU control, which takes in funct from the top level CPU main code and the alu\_op from the control unit and tells the ALU what operation it should perform next.

```
233
234  ALU_CTRL: alu_control port map (funct, alu_op, alu_control_funct);
235
```

It is in the ALU where the actual operations are performed. It takes in alu\_in\_2 from the SGN\_EXT mux that has modified data from the register's read\_data\_2 and read\_data\_1 from the register and performs the instructed operation sent over in alu\_control\_funct from the ALU control.

```
245
246  ALU1: alu port map (read_data_1, alu_in_2, alu_control_funct, alu_zero, alu_result);
247
```

```

-- Stimulus process
stim_proc: process
begin
    -- hold reset state for 100 ns.
    --wait for 100 ns;
    in_1 <= x"00000001";
    in_2 <= x"00000002";

    -- insert stimulus here
    wait for 10 ns;
    alu_control_fuct <= "0000";
    wait for 10 ns;
    alu_control_fuct <= "0001";
    wait for 10 ns;
    alu_control_fuct <= "1000";
    wait for 10 ns;
    alu_control_fuct <= "0010";
    wait for 10 ns;
    alu_control_fuct <= "0011";
    wait for 10 ns;
    alu_control_fuct <= "1111";
    wait for 10 ns;
    alu_control_fuct <= "0110";
    wait for 10 ns;
    alu_control_fuct <= "0111";
    wait for 10 ns;
    alu_control_fuct <= "1001";
    wait for 10 ns;
    alu_control_fuct <= "1010";
    wait for 10 ns;

    wait;
end process;

```



## Register

A register is a small amount of storage that the CPU can quickly access to read or write whatever information it wants to store. It takes in data from the top level CPU main code in rs and rt and stores it in read\_reg\_1 and read\_reg\_2. Where it waits to be informed by the control unit on when to write with reg\_write, MUX1 which chooses between rt and rs with write\_reg, and MUX3 which chooses between rt and rs.

```

222   REG: registers port map (
223     clk => clk,
224     clr => clr,
225     reg_write => reg_write,
226     read_reg_1 => rs,
227     read_reg_2 => rt,
228     write_reg => write_reg,
229     write_data => write_data,
230     read_data_1 => read_data_1,
231     read_data_2 => read_data_2
232   );
233

```

```

97      begin
98          -- hold reset state for 100 ns.
99          --wait for 100 ns;
100
101     wait for clk_period*1;
102
103     -- insert stimulus here
104     reg_write <= '1';
105     write_reg <= "00001";
106     write_data <= x"FFFFFF";
107     wait for 10 ns;
108     write_reg <= "00010";
109     write_data <= x"AAAAAAA";
110     wait for 10 ns;
111     reg_write <= '0';
112     read_req_1 <= "00001";
113     read_req_2 <= "00010";
114
115
116     wait;
117 end process;
118
119 END;

```



## Part II Simulation screenshots

### Function simulation(1000 test vectors)

- Ukey = X"00000000 00000000 00000000 00000000"  
 Din = X"00000000 00000000"



2.

- Ukey = X"00000000 00000000 00000000 00000001"  
 Din = X"00000000 00000000"



3.





4.

```
1.key=00000000000000000000000000000000  
plaintext 00 01 ----- encryption 65EE5710 293B016C ----- decryption BFED3837 D65EFF3E
```



## 1000 Test Vectors:

```

95900000 E18592/5A6AF2AB10AE128A5635BA169 8C5CAF49 E9C1D18U
9600000 C59FC471F5C5687DC4237F01C235AAC5 FF8F2115 B828C0CB
9610000 104AEED765F2E6AA67D886FD67F1C632E 3C45BFCE BF9E9C34
9620000 026527E344E37497035957AB1D531C2F 1992F65F 2DEEBEBC
9630000 257B43C7647567ABC3D9D98FB5C7898B 5E7852FB 25CA7086
9640000 3F1B5A515CDF542DE92D788500911D5D 5F6BF67D C7028969
9650000 692AD0DE9A90C3360ED84B6A964C1696 E193CD26 240C94C1
9660000 6DDFB05B9460D3739F6D9A13BED7813 F82D1433 B1E097E0
9670000 B245EF55283561A92B0D94355EADDCC1 E68E88B1 CBC40676
9680000 55A0C45E1C8498EAFD6506FAE22BE E5F21E5E 8A6C8502
9690000 710C908C954E5F80750E1C84444EA8B0 68785040 6B750B32
9700000 B34C2AAA3F5E9276ACEC09BA08C41D0E 8A56C40E C8ED82EF
9710000 48D66354182E01E4ACC0B62CAE00AB84 399F1B54 D4C99029
9720000 01A3615B28D10578CD35B3FA3B9A897 56C3857 FEF84807
9730000 B2DC1B70F7E47C18252EC8C0D3EC1A58 CE201B8 49C6E6B1
9740000 3959B6B7D6A7CF17495D5C67B7ABA57 DD58FB07 4885746
9750000 088BA47D05F31D39198353562571571 1DE434D1 6DAF2997
9760000 EE1211A057F86477838AC0D806A304B58 BCF21728 A5261482
9770000 413D24DD62A5AD8540EFC8ED9E19957D 4E960F4D 86B4CBC
9780000 20449F5C5A8C0E66C9EE0A2546E7E1E04 58803314 BF0558A
9790000 A7AE6544D7CAD7D4AF6E2F28A72E325C E07E5BEC 21D2588D
9800000 165167CTE5393147F0B511736D7D3007 11FB0D47 22633755
9810000 C807158F1F69260B0153509F2CC57423 42536FC3 7E435860
9820000 B53D064F649749D7BD737963C453AA97 5C6CB537 9740D560
9830000 DEF646B436B636A40E66BB2499463684 2584F014 BC8A2DE9
9840000 837AE3D47606CE48E144100C17284EB0 D0436F00 59BAE488
9850000 CF018CF9E81136412F152095DE811139 E2C69C89 9931985E
9860000 AB7A34BA0C2E458643D2BA86D5763606 C8AD3F6 47246EF4
9870000 B3EE80B6E772B012F1D4317638E4B6C2 35E21982 4C525FF1
9880000 C2DA3F507622BF0D250C7D82334B568 38CB2698 3470F67C
9890000 6A989E0A707EE87E1282CE0A378844A6 9174CD96 C6AD7E3D
9900000 DC0B26437F05F79F37075FA3439F48AF 556D88AF 7483B7A7
9910000 6C36AEA2F29EB466FC04E2A09AE214E EDAE319E 80FE7F89
9920000 B93AB35652DC3BA264C21EB605014AA 3056290A 915788FB
9930000 F8D7E317731726D7837D3A370E5FE98F A2B6405F 9989810A
9940000 3D7BF67FF04782C3444324576D0FE003 C4EE0683 A95EEC77
9950000 F8B827E6B24E894E8A4A99A6A7E2BC9E 95C54F4E D96B4538

```



### Timing simulation (5 test vectors):

- Ukey = X"00000000 00000000 00000000 00000000"  
 Din = X"00000000 00000000"



```
1.key=000000000000000000000000000000000000000000000000000000000000000
plaintext 00 00 ----- encryption EEDBA521 6D8F4B15 -----decryption 48E56C13 9616F90F
```

2.



3.



1.key=0000000000000000000000000000000000000000000000000000000000000000

plaintext 00 02 ----- encryption C147A878 7172DD98 ----- decryption 1F6C0E36 1DAADBC9

4.

1.key=0000000000000000000000000000000000000000000000000000000000000000

plaintext 01 01 ----- encryption 8731223 2C9311D6 ----- decryption AFFA8B9C 14FBBC4



5.

1.key=0000000000000000000000000000000000000000000000000000000000000000

plaintext 23 02 ----- encryption 459FE157 1F36C18E ----- decryption 879476D6 724B194E



### Part III Performance and area analysis:

Number of clock cycles to perform Encryption/Decryption:

Encryption: 914

Decryption: 932

Number of clock cycles to perform Round Key Generation:

3897

Clock frequency of the processor:

| Constraint                                           | Check         | Worst Case     | Best Case  | Timing   | I |
|------------------------------------------------------|---------------|----------------|------------|----------|---|
|                                                      |               | Slack          | Achievable | Errors   |   |
| Autotimespec constraint for clock net clk _slow_BUFG | SETUP<br>HOLD | N/A<br>0.016ns | 12.347ns   | N/A<br>0 |   |

Best case achievable is 12.347 ns. (propagation delay)

The fastest speed it can run is  $1 / 12.347\text{ns} = 8.099 \times 10^7\text{Hz}$ .

FPGA utilization (LUTs):

|                              | Synthesis stage | Place and Route stage |
|------------------------------|-----------------|-----------------------|
| LUT and FF pairs usage       | 6332(2%)        | 5129(8%)              |
| IOB usage                    | 44              | 44                    |
| RAM/DSP blocks used (if any) | 0               | 0                     |

## **PART IV RC5 implementation**

### **1. Key Expansion:**

Before the key expansion starts, the Instruction Memory and Data Memory have been initialized by 155 and 64 components. The Ins Mem is filled with the machine code of translated assembly program to implement key expansion algorithm. The ukey is fed through top-level port map, and the first 4 addresses in the Data Mem are used to store the 128 bit ukey.

### **2. Encryption:**

After the key expansion process, the Data Mem has been loaded with 26 keys starting from address 0 to 25. The encryption process will be proceeded. Similarly, the machine code of translated assembly program to implement RC5 encryption algorithm has been loaded into Ins Mem at the very beginning. Then encrypted text will be stored in the address 32 and 33 of the Data Mem.

### **3. Decryption:**

After the encryption process, the decryption will be proceeded. The instructions have been loaded at the very beginning and the deciphered text will be stored in the address 34 and 35 of the Data Mem.

### Description of RC5 implementation in assembly:

How to realize shift faster: if we need to shift Rt for Rs bits

- 1   **if (Rs == 0) jump to 12**
- 2   **if (Rs == 3) jump to 9**
- 3   **if (Rs == 2) jump to 10**
- 4   **if (Rs == 1) jump to 11**
- 5   **Shift Rt for 4 bits**
- 6   **Rs = Rs - 4**
- 7   **if (Rs != 0) jump to 1**
- 8   **if (Rs == 0) jump to 12**
- 9   **Shift Rt for 1 bits**
- 10   **Shift Rt for 1 bits**
- 11   **Shift Rt for 1 bits**
- 12   **continue the code**

Key generation:

```
25  LW(1, 0, 0)
26  ADD(2, 0, 0)
27  ADD(3, 0, 0)
28  SHL(0, 2, 3)
29  ADD(3, 2, 0)
30  LW(5, 4, 0)
31  ADD(0, 4, 3)
32  ANDI(0, 0, 25)
33  SHL(3, 3, 1)
34  SUBI(0, 0, 1)
35  BNE(0, 0, -3)
36  BNE(1, 25, 1)
37  SUBI(1, 1, 25)
38  ADDI(1, 1, 1)
39  BNE(5, 3, 1)
40  SUBI(5, 5, 3)
41  ADDI(5, 5, 1)
42  BNE(6, 77, 1)
43  HAL()
44  BEQ(1, 1, -19)
```

## Encryption:

```

LW(0, 1, 0)          #LW R1, 30(R0)      // R1=A(0)
LW(0, 2, 1)          #LW R2, 31(R0)      // R2=B(0)
LW(0, 9, 6)          #LW R9, 0(R0)       //R9=S[0]
ADDI(0, 11, 1)        #ADD R1, R1, R9    //A=A+S[0]
ADDI(0, 12, 2)        #LW R9, 1(R0)       //R9=S[1]
ADDI(0, 12, 3)        #ADD R2, R2, R9    //B=B+S[1]
ADDI(1, 9, 7)         #ADDI R3, R0, 2     // I=2
ADDI(0, 3, 8)         #ADDI R8, R0, 25   // R8=25
NOR(1, 1, 4)          #NOR R4, R1, R1    // NOR(A)
NOR(2, 2, 5)          #NOR R5, R2, R2    // NOR(B)
AND(1, 5, 6)          #AND R6, R1, R5    // A NOR(B)
AND(2, 4, 7)          #AND R7, R2, R4    // NOR(A) B
OR(6, 7, 1)           #OR R1, R6, R7    // A XOR B
ANDI(2, 10, 31)       #AND R10, R2, 31   // LEAST 5-BITS OF B  -----andi

BEQ(10, 0, 9)         //if r10=0  jump
BEQ(10, 11, 7)
BEQ(10, 12, 5)
BEQ(10, 13, 3)
SHL(1, 1, 4)          #SHL R1, R1, 4      // SHL 4-BIT
SUBI(10, 10, 4)       #SUBI R10, R10, 1   // R10=R10-1
BNE(10, 0, -6)        #BNE R10, R0, -3   // SHL R10-BITS
SHL(1, 1, 1)
SHL(1, 1, 1)
SHL(1, 1, 1)

LW(3, 9, 0)          #LW R9, 0(R3)      //R9=S[I]
ADD(1, 9, 1)          #ADD R1, R1, R9    //A=A+S[I]
ADDI(3, 3, 1)         #ADDI R3, R3, 1     //I=I+1
NOR(1, 1, 4)          #NOR R4, R1, R1    // NOR(A)
NOR(2, 2, 5)          #NOR R5, R2, R2    // NOR(B)
AND(1, 5, 6)          #AND R6, R1, R5    // A NOR(B)
AND(2, 4, 7)          #AND R7, R2, R4    // NOR(A) B
OR(6, 7, 2)           #OR R2, R6, R7    // A XOR B  111
ANDI(1, 10, 31)       #AND R10, R1, 31   // LEAST 5-BITS OF A  -----andi

BEQ(10, 0, 9)         #if r10=0  jump
BEQ(10, 11, 7)
BEQ(10, 12, 5)
BEQ(10, 13, 3)
SHL(2, 2, 4)          #SHL R2, R2, 1      // SHL 4-BIT
SUBI(10, 10, 4)       #SUBI R10, R10, 1   // R10=R10-1
BNE(10, 0, -6)        #BNE R10, R0, -3   // SHL R10-BITS
SHL(1, 1, 1)
SHL(1, 1, 1)
SHL(1, 1, 1)

LW(3, 9, 0)          #LW R9, 0(R3)      //R9=S[I]
ADD(2, 9, 2)          #ADD R2, R2, R9    //B=B+S[I]
ADDI(3, 3, 1)         #ADDI R3, R3, 1     // I=I+1
BNE(3, 8, -27)        #BNE R3, R8, -25   //I!=25 LOOP
SW(0, 1, 32)          #SW R1, 32(R0)    // A(OUT)=R1
SW(0, 2, 33)          #SW R2, 33(R0)    // B(OUT)=R2

```

## Decryption:

```
LW 0 1 30          R1 = M[30]
LW 0 2 31          R2 = M[31]
ADDI 0 3 25         R3 = i = 25
ADDI 0 10 1        R10 = 1 (break loop)
ADDI 0 11 4        R11 = 4
ADDI 0 12 3        R12 = 3
ADDI 0 13 2        R13 = 2
ADDI 0 14 1        R14 = 1
LW 3 8 0           R8 = M[R3] = M[i]
SUB 2 8 2           R2 = R2 - R8
ANDI 1 9 31         R9 = R1 last 5 bits
BEQ 9 0 9

-----
BEQ 12 9 5        R2 shift R9 bits
BEQ 13 9 5
BEQ 14 9 5
SHR 2 2 4
SUBI 9 9 4
BEQ 0 9 -7
SHR 2 2 1
SHR 2 2 1
SHR 2 2 1

-----
NOR 1 1 4          R2 do XOR with R1
NOR 2 2 5
AND 1 5 6
AND 2 4 7
OR 6 7 2
SUBI 3 3 1
LW 3 8 0
SUB 1 8 1
ANDI 2 9 31
BEQ 9 0 9
```

```
-----
BEQ 12 9 6        R1 shift R9 bits
BEQ 13 9 6
BEQ 14 9 6
SHR 1 1 1
SUBI 9 9 4
BEQ 0 9 -6
BEQ 0 9 3
SHR 1 1 1
SHR 1 1 1
SHR 1 1 1

-----
NOR 1 1 4          R1 Xor R2
NOR 2 2 5
AND 1 5 6
AND 2 4 7
OR 6 7 2
SUBI 3 3 1
BNE 3 10 -41       loop

LW 3 8 0
SUB 2 8 2
SUBI 3 3 1
LW 3 8 0
SUB 1 8 1
SW 0 1 34          M[34] = R1
SW 0 2 35          M[35] = R2
```

## Part V Processor interfaces (how do you provide inputs and display results)

```
U0: final_inout      PORT MAP(clr=>clr, clk_slow=>clk_slow, in_switch=>in_switch, led_an=>led_segment,
                                key_vld_button=>key_vld_button, data_vld_button=>data_vld_button, work_button=>work_button,
                                work=>work,
                                dout_button=>dout_button, enc=>enc, enc_button=>enc_button, enc_indicate=>enc_indicate,
                                led_indicate=>led_indicate, di_vld=>di_vld, ukey=>ukey, din=>din, dout=>dout);
U1: Processor        PORT MAP(clr=>clr, clk_slow=>clk_slow, di_vld=>di_vld,
                                ukey=>ukey, din=>din, dout_enc=>dout_enc, dout_dec=>dout_dec);
```

We use the 16 switches to input for din and ukeys, when we press P17 and switch we will put the ukeys. The N17 and switches will be responsible for din, the M17 will start RC5 and the P18 will change the result for encryption and decryption.





You can check the youtube video for details and the whole process.

## Part VI Details about how you verified your overall design

The following below is the test bench designed for testing the instruction memory. In the test bench 4 values was inserted into the input for the instruction memory.

```
50 BEGIN
51
52     -- Instantiate the Unit Under Test (UUT)
53     uut: instruction_memory PORT MAP (
54         read_address => read_address,
55         instruction => instruction
56     );
57
58
59
60     -- Stimulus process
61     stim_proc: process
62     begin
63         -- hold reset state for 100 ns.
64         wait for 100 ns;
65
66         --wait for <clock>_period*10;
67
68         -- insert stimulus here
69         read_address <= x"00000004";
70         wait for 10 ns;
71         read_address <= x"00000008";
72         wait for 10 ns;
73         read_address <= x"0000000C";
74         wait for 10 ns;
75         read_address <= x"00000010";
76
77         wait;
78     end process;
79
80 END;
```

The following behavioral analysis showed that the inputted addressed did lead to the corresponding output that we expected.



## **Part VII Conclusion**

Github repo:

[https://github.com/yemazon/EL6463Project\\_Processor\\_RC5](https://github.com/yemazon/EL6463Project_Processor_RC5)

Demo Video:

<https://youtu.be/skBHDx-i6Pg>