

VIETNAM NATIONAL UNIVERSITY - HO CHI MINH CITY  
HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY  
Faculty of Computer Science and Engineering



**LAB 4**  
**INTRODUCTION TO SYSTEM ON CHIP**  
**LABORATORY REPORT**  
Instructor: Pham Kieu Nhat Anh

Class: CC01  
Name: Hoang Thuy Tram  
Student ID: 2353202

Ho Chi Minh city, December 2025

# Contents

|          |                                                |           |
|----------|------------------------------------------------|-----------|
| <b>1</b> | <b>Github</b>                                  | <b>1</b>  |
| <b>2</b> | <b>RTL implementation</b>                      | <b>1</b>  |
| 2.1      | DividerUnsignedPipelined file . . . . .        | 1         |
| 2.2      | DatapathMultiCycle file . . . . .              | 4         |
| <b>3</b> | <b>Testbench</b>                               | <b>17</b> |
| 3.1      | Divider testbench . . . . .                    | 17        |
| 3.2      | Processor testbench . . . . .                  | 20        |
| <b>4</b> | <b>Testbench results</b>                       | <b>20</b> |
| 4.1      | Divider testbench results . . . . .            | 20        |
| 4.2      | Processor testbench results . . . . .          | 21        |
| <b>5</b> | <b>Simulation waveform</b>                     | <b>24</b> |
| 5.1      | Divider simulation waveform . . . . .          | 24        |
| 5.2      | Processor simulation waveform . . . . .        | 24        |
| <b>6</b> | <b>Timing closure and Resource utilization</b> | <b>26</b> |
| 6.1      | Timing closure . . . . .                       | 26        |
| 6.2      | Resource utilization . . . . .                 | 26        |

# 1 Github

The Lab 4 project files (RTL source code, testbenches, simulation scripts) are located in the Lab04 folder. They are available at the following repository: <https://github.com/trxmhoang/SoC.git>

## 2 RTL implementation

### 2.1 DividerUnsignedPipelined file

The DividerUnsignedPipelined module implements a high-throughput divider using an 8-stage pipeline. Instead of calculating the full 32-bit division in a single cycle, which would require a long critical path, the workload is distributed across 8 clock cycles. Each stage processes 4 bits of the quotient using a chain of `divu_1iter` sub-modules. Registers (`r_dividend`, `r_divisor`, etc) are placed between each stage to store intermediate results, allowing the divider to accept new inputs every clock cycle while producing a result with a latency of 8 cycles.

```
1  `timescale 1ns / 1ns
2
3 // quotient = dividend / divisor
4
5 module DividerUnsignedPipelined (
6     input          clk, rst, stall,
7     input [31:0] i_dividend,
8     input [31:0] i_divisor,
9     output reg [31:0] o_remainder,
10    output reg [31:0] o_quotient
11);
12
13 // TODO: your code here
14 reg [31:0] r_dividend [0:6];
15 reg [31:0] r_divisor [0:6];
16 reg [31:0] r_quotient [0:6];
17 reg [31:0] r_remainder [0:6];
18
19 // STAGE 0
20 wire [31:0] s0_dividend [0:4];
21 wire [31:0] s0_quotient [0:4];
22 wire [31:0] s0_remainder [0:4];
23
24 assign s0_dividend[0] = i_dividend;
25 assign s0_quotient[0] = 32'b0;
26 assign s0_remainder[0] = 32'b0;
27
28 genvar i, j;
29 generate
30     for (j = 0; j < 4; j = j + 1) begin : stage0_loop
31         divu_1iter u_div0 (
32             .i_dividend (s0_dividend[j]),
33             .i_divisor (i_divisor),
34             .i_remainder (s0_remainder[j]),
35             .i_quotient (s0_quotient[j]),
36             .o_dividend (s0_dividend[j + 1]),
```

```

37         .o_remainder (s0_remainder[j + 1]),
38         .o_quotient (s0_quotient[j + 1])
39     );
40   end
41 endgenerate
42
43 always @(posedge clk) begin
44   if (rst) begin
45     r_dividend[0] <= 32'b0;
46     r_divisor[0] <= 32'b0;
47     r_quotient[0] <= 32'b0;
48     r_remainder[0] <= 32'b0;
49   end else if (!stall) begin
50     r_dividend[0] <= s0_dividend[4];
51     r_divisor[0] <= i_divisor;
52     r_quotient[0] <= s0_quotient[4];
53     r_remainder[0] <= s0_remainder[4];
54   end
55 end
56
57 // STAGE 1 - STAGE 6
58 generate
59   for (i = 1; i < 7; i = i + 1) begin : pipe_loop
60     wire [31:0] chain_dividend [0:4];
61     wire [31:0] chain_quotient [0:4];
62     wire [31:0] chain_remainder [0:4];
63     wire [31:0] stage_divisor;
64
65     assign chain_dividend[0] = r_dividend[i - 1];
66     assign chain_quotient[0] = r_quotient[i - 1];
67     assign chain_remainder[0] = r_remainder[i - 1];
68     assign stage_divisor = r_divisor[i - 1];
69
70     for (j = 0; j < 4; j = j + 1) begin : iter_loop
71       divu_1iter u_div (
72         .i_dividend (chain_dividend[j]),
73         .i_divisor (stage_divisor),
74         .i_remainder (chain_remainder[j]),
75         .i_quotient (chain_quotient[j]),
76         .o_dividend (chain_dividend[j + 1]),
77         .o_remainder (chain_remainder[j + 1]),
78         .o_quotient (chain_quotient[j + 1])
79       );
80     end
81
82     always @(posedge clk) begin
83       if (rst) begin
84         r_dividend[i] <= 32'b0;
85         r_divisor[i] <= 32'b0;
86         r_quotient[i] <= 32'b0;
87         r_remainder[i] <= 32'b0;
88       end else if (!stall) begin
89         r_dividend[i] <= chain_dividend[4];

```

```

90         r_divisor[i] <= stage_divisor;
91         r_quotient[i] <= chain_quotient[4];
92         r_remainder[i] <= chain_remainder[4];
93     end
94   end
95 endgenerate
96
97 // STAGE 7
98 wire [31:0] s7_dividend [0:4];
99 wire [31:0] s7_quotient [0:4];
100 wire [31:0] s7_remainder [0:4];
101
102 assign s7_dividend[0] = r_dividend[6];
103 assign s7_quotient[0] = r_quotient[6];
104 assign s7_remainder[0] = r_remainder[6];
105
106 generate
107   for (j = 0; j < 4; j = j + 1) begin : stage7_loop
108     divu_1iter u_div7 (
109       .i_dividend (s7_dividend[j]),
110       .i_divisor (r_divisor[6]),
111       .i_remainder (s7_remainder[j]),
112       .i_quotient (s7_quotient[j]),
113       .o_dividend (s7_dividend[j + 1]),
114       .o_remainder (s7_remainder[j + 1]),
115       .o_quotient (s7_quotient[j + 1])
116     );
117   end
118 endgenerate
119
120 always @(posedge clk) begin
121   if (rst) begin
122     o_quotient <= 32'b0;
123     o_remainder <= 32'b0;
124   end else if (!stall) begin
125     o_quotient <= s7_quotient[4];
126     o_remainder <= s7_remainder[4];
127   end
128 end
129 endmodule
130
131 module divu_1iter (
132   input      [31:0] i_dividend,
133   input      [31:0] i_divisor,
134   input      [31:0] i_remainder,
135   input      [31:0] i_quotient,
136   output reg [31:0] o_dividend,
137   output reg [31:0] o_remainder,
138   output reg [31:0] o_quotient
139 );
140
141   wire [31:0] new_remainder, subtract;

```

```

143 assign new_remainder = {i_remainder[30:0], i_dividend[31]};
144 assign subtract = new_remainder - i_divisor;
145
146 always @(*) begin
147   o_dividend = i_dividend << 1;
148   if (new_remainder >= i_divisor) begin
149     o_remainder = subtract;
150     o_quotient = {i_quotient[30:0], 1'b1};
151   end else begin
152     o_remainder = new_remainder;
153     o_quotient = {i_quotient[30:0], 1'b0};
154   end
155 end
156 endmodule

```

Program 1: RTL implementation of the DividerUnsignedPipelined file.

## 2.2 DatapathMultiCycle file

The DatapathMulticycle file is an enhanced version of the previous DatapathSingleCycle file. While the core instruction decoding and execution logic remain similar, this module integrates the 8-stage pipelined divider to handle division operations (DIV, DIVU, REM, REMU). This change allows the processor to manage long-latency arithmetic operations without halting the entire system permanently, although in this initial integration, the processor stalls to wait for the result.

Since the new divider takes 8 clock cycles to compute a result, the processor cannot proceed to the next instruction immediately when a division opcode is detected. A stall counter mechanism was added to freeze the Program Counter (PC) and the instruction fetch process.

- Logic: When a division-type instruction (`is_div_op`) is detected, a counter increments from 0 to 8.
- Stall signal (`stall_en`): This signal remains high while `stall_counter < 8`, effectively pausing the PC update and instruction fetch.
- Resume: Once the counter reaches 8, the divider's output is valid. The stall signal goes low, allowing the result to be written to the register file (`rf_we` enables) and the PC to advance to the next instruction.

```

1 /* INSERT NAME AND PENKEY HERE */
2
3 'timescale 1ns / 1ns
4
5 // registers are 32 bits in RV32
6 'define REG_SIZE 31
7
8 // RV opcodes are 7 bits
9 'define OPCODE_SIZE 6
10
11 // Don't forget your CLA and Divider
12 /* 'include "cla.v"
13 'include "DividerUnsignedPipelined.v" */

```

```

15 module RegFile (
16     input      [        4:0] rd ,
17     input      ['REG_SIZE:0] rd_data ,
18     input      [        4:0] rs1 ,
19     output reg ['REG_SIZE:0] rs1_data ,
20     input      [        4:0] rs2 ,
21     output reg ['REG_SIZE:0] rs2_data ,
22     input                  clk ,
23     input                  we ,
24     input                  rst
25 );
26
27 // TODO: copy your homework #3 code here
28 localparam NumRegs = 32;
29 reg ['REG_SIZE:0] regs[0:NumRegs-1];
30 integer i;
31
32 always @ (posedge clk or posedge rst) begin
33     if (rst) begin
34         for (i = 0; i < NumRegs; i = i + 1) begin
35             regs[i] <= 0;
36         end
37     end else if (we && (rd != 0)) begin
38         regs[rd] <= rd_data;
39     end
40 end
41
42 always @(*) begin
43     rs1_data = (rs1 == 0) ? 32'd0 : regs[rs1];
44     rs2_data = (rs2 == 0) ? 32'd0 : regs[rs2];
45 end
46 endmodule
47
48 module DatapathMultiCycle (
49     input                  clk ,
50     input                  rst ,
51     output reg              halt ,
52     output      ['REG_SIZE:0] pc_to_imem ,
53     input      ['REG_SIZE:0] inst_from_imem ,
54     // addr_to_dmem is a read-write port
55     output reg ['REG_SIZE:0] addr_to_dmem ,
56     input      ['REG_SIZE:0] load_data_from_dmem ,
57     output reg ['REG_SIZE:0] store_data_to_dmem ,
58     output reg [        3:0] store_we_to_dmem
59 );
60
61 // TODO: your code here (largely based on homework #3)
62 // components of the instruction
63 wire [        6:0] inst_funct7;
64 wire [        4:0] inst_rs2;
65 wire [        4:0] inst_rs1;
66 wire [        2:0] inst_funct3;
67 wire [        4:0] inst_rd;

```

```

68   wire ['OPCODE_SIZE:0] inst_opcode;
69
70 // split R-type instruction - see section 2.2 of RiscV spec
71 assign {inst_func7, inst_rs2, inst_rs1, inst_func3, inst_rd,
72         inst_opcode} = inst_from_imem;
73
74 // setup for I, S, B & J type instructions
75 // I - short immediates and loads
76 wire [11:0] imm_i;
77 assign imm_i = inst_from_imem[31:20];
78 wire [4:0] imm_shamt = inst_from_imem[24:20];
79
80 // S - stores
81 wire [11:0] imm_s;
82 assign imm_s = {inst_func7, inst_rd};
83
84 // B - conditionals
85 wire [12:0] imm_b;
86 assign {imm_b[12], imm_b[10:1], imm_b[11], imm_b[0]} = {inst_func7,
87             inst_rd, 1'b0};
88
89 // J - unconditional jumps
90 wire [20:0] imm_j;
91 assign {imm_j[20], imm_j[10:1], imm_j[11], imm_j[19:12], imm_j[0]} =
92     {inst_from_imem[31:12], 1'b0};
93
94 wire ['REG_SIZE:0] imm_i_sext = {{20{imm_i[11]}}, imm_i[11:0]};
95 wire ['REG_SIZE:0] imm_s_sext = {{20{imm_s[11]}}, imm_s[11:0]};
96 wire ['REG_SIZE:0] imm_b_sext = {{19{imm_b[12]}}, imm_b[12:0]};
97 wire ['REG_SIZE:0] imm_j_sext = {{11{imm_j[20]}}, imm_j[20:0]};
98
99 // opcodes - see section 19 of RiscV spec
100 localparam ['OPCODE_SIZE:0] OpLoad      = 7'b00_000_11;
101 localparam ['OPCODE_SIZE:0] OpStore      = 7'b01_000_11;
102 localparam ['OPCODE_SIZE:0] OpBranch     = 7'b11_000_11;
103 localparam ['OPCODE_SIZE:0] OpJalr      = 7'b11_001_11;
104 localparam ['OPCODE_SIZE:0] OpMiscMem   = 7'b00_011_11;
105 localparam ['OPCODE_SIZE:0] OpJal       = 7'b11_011_11;
106
107 localparam ['OPCODE_SIZE:0] OpRegImm    = 7'b00_100_11;
108 localparam ['OPCODE_SIZE:0] OpRegReg    = 7'b01_100_11;
109 localparam ['OPCODE_SIZE:0] OpEnviron   = 7'b11_100_11;
110
111 localparam ['OPCODE_SIZE:0] OpAuipc     = 7'b00_101_11;
112 localparam ['OPCODE_SIZE:0] OpLui       = 7'b01_101_11;
113
114 wire inst_lui      = (inst_opcode == OpLui      );
115 wire inst_auiipc   = (inst_opcode == OpAuipc   );
116 wire inst_jal      = (inst_opcode == OpJal      );
117 wire inst_jalr     = (inst_opcode == OpJalr     );
118
119 wire inst_beq      = (inst_opcode == OpBranch ) & (inst_from_imem
120             [14:12] == 3'b000);

```

```

117    wire inst_bne      = (inst_opcode == OpBranch) & (inst_from_imem
118        [14:12] == 3'b001);
119    wire inst_blt      = (inst_opcode == OpBranch) & (inst_from_imem
120        [14:12] == 3'b100);
121    wire inst_bge      = (inst_opcode == OpBranch) & (inst_from_imem
122        [14:12] == 3'b101);
123    wire inst_bltu     = (inst_opcode == OpBranch) & (inst_from_imem
124        [14:12] == 3'b110);
125    wire inst_bgeu     = (inst_opcode == OpBranch) & (inst_from_imem
126        [14:12] == 3'b111);

127    wire inst_lb       = (inst_opcode == OpLoad) & (inst_from_imem
128        [14:12] == 3'b000);
129    wire inst_lh       = (inst_opcode == OpLoad) & (inst_from_imem
130        [14:12] == 3'b001);
131    wire inst_lw       = (inst_opcode == OpLoad) & (inst_from_imem
132        [14:12] == 3'b010);
133    wire inst_lbu      = (inst_opcode == OpLoad) & (inst_from_imem
134        [14:12] == 3'b100);
135    wire inst_lhu      = (inst_opcode == OpLoad) & (inst_from_imem
136        [14:12] == 3'b101);

137    wire inst_sb       = (inst_opcode == OpStore) & (inst_from_imem
138        [14:12] == 3'b000);
139    wire inst_sh       = (inst_opcode == OpStore) & (inst_from_imem
140        [14:12] == 3'b001);
141    wire inst_sw       = (inst_opcode == OpStore) & (inst_from_imem
142        [14:12] == 3'b010);

143    wire inst_addi     = (inst_opcode == OpRegImm) & (inst_from_imem
144        [14:12] == 3'b000);
145    wire inst_slti     = (inst_opcode == OpRegImm) & (inst_from_imem
146        [14:12] == 3'b010);
147    wire inst_sltiu    = (inst_opcode == OpRegImm) & (inst_from_imem
148        [14:12] == 3'b011);
149    wire inst_xori     = (inst_opcode == OpRegImm) & (inst_from_imem
150        [14:12] == 3'b100);
151    wire inst_ori      = (inst_opcode == OpRegImm) & (inst_from_imem
152        [14:12] == 3'b110);
153    wire inst_andi     = (inst_opcode == OpRegImm) & (inst_from_imem
154        [14:12] == 3'b111);

155    wire inst_slli     = (inst_opcode == OpRegImm) & (inst_from_imem
156        [14:12] == 3'b001) & (inst_from_imem[31:25] == 7'd0);
157    wire inst_srli     = (inst_opcode == OpRegImm) & (inst_from_imem
158        [14:12] == 3'b101) & (inst_from_imem[31:25] == 7'd0);
159    wire inst_srai     = (inst_opcode == OpRegImm) & (inst_from_imem
160        [14:12] == 3'b101) & (inst_from_imem[31:25] == 7'b0100000);

161    wire inst_add      = (inst_opcode == OpRegReg) & (inst_from_imem
162        [14:12] == 3'b000) & (inst_from_imem[31:25] == 7'd0);
163    wire inst_sub      = (inst_opcode == OpRegReg) & (inst_from_imem
164        [14:12] == 3'b000) & (inst_from_imem[31:25] == 7'b0100000);

```

```

146     wire inst_sll      = (inst_opcode == OpRegReg) & (inst_from_imem
147         [14:12] == 3'b001) & (inst_from_imem[31:25] == 7'd0      );
148     wire inst_slt      = (inst_opcode == OpRegReg) & (inst_from_imem
149         [14:12] == 3'b010) & (inst_from_imem[31:25] == 7'd0      );
150     wire inst_sltu     = (inst_opcode == OpRegReg) & (inst_from_imem
151         [14:12] == 3'b011) & (inst_from_imem[31:25] == 7'd0      );
152     wire inst_xor      = (inst_opcode == OpRegReg) & (inst_from_imem
153         [14:12] == 3'b100) & (inst_from_imem[31:25] == 7'd0      );
154     wire inst_srl      = (inst_opcode == OpRegReg) & (inst_from_imem
155         [14:12] == 3'b101) & (inst_from_imem[31:25] == 7'd0      );
156     wire inst_sra      = (inst_opcode == OpRegReg) & (inst_from_imem
157         [14:12] == 3'b101) & (inst_from_imem[31:25] == 7'b0100000);
158     wire inst_or       = (inst_opcode == OpRegReg) & (inst_from_imem
159         [14:12] == 3'b110) & (inst_from_imem[31:25] == 7'd0      );
160     wire inst_and      = (inst_opcode == OpRegReg) & (inst_from_imem
161         [14:12] == 3'b111) & (inst_from_imem[31:25] == 7'd0      );
162
163     wire inst_mul      = (inst_opcode == OpRegReg) & (inst_from_imem
164         [31:25] == 7'd1 ) & (inst_from_imem[14:12] == 3'b000      );
165     wire inst_mulh     = (inst_opcode == OpRegReg) & (inst_from_imem
166         [31:25] == 7'd1 ) & (inst_from_imem[14:12] == 3'b001      );
167     wire inst_mulhsu   = (inst_opcode == OpRegReg) & (inst_from_imem
168         [31:25] == 7'd1 ) & (inst_from_imem[14:12] == 3'b010      );
169     wire inst_mulhu   = (inst_opcode == OpRegReg) & (inst_from_imem
170         [31:25] == 7'd1 ) & (inst_from_imem[14:12] == 3'b011      );
171     wire inst_div      = (inst_opcode == OpRegReg) & (inst_from_imem
172         [31:25] == 7'd1 ) & (inst_from_imem[14:12] == 3'b100      );
173     wire inst_divu    = (inst_opcode == OpRegReg) & (inst_from_imem
174         [31:25] == 7'd1 ) & (inst_from_imem[14:12] == 3'b101      );
175     wire inst_rem      = (inst_opcode == OpRegReg) & (inst_from_imem
176         [31:25] == 7'd1 ) & (inst_from_imem[14:12] == 3'b110      );
177     wire inst_remu    = (inst_opcode == OpRegReg) & (inst_from_imem
178         [31:25] == 7'd1 ) & (inst_from_imem[14:12] == 3'b111      );
179
180     wire inst_ecall   = (inst_opcode == OpEnviron) & (inst_from_imem
181         [31:7] == 25'd0  );

```

```

182     end else begin
183         stall_counter <= 0;
184     end
185 end
186
187 assign stall_en = is_div_op & (stall_counter < 8);
188
189 // program counter
190 reg ['REG_SIZE:0] pcNext, pcCurrent;
191 always @(posedge clk) begin
192     if (rst) begin
193         pcCurrent <= 32'd0;
194     end else if (!stall_en) begin
195         pcCurrent <= pcNext;
196     end
197 end
198 assign pc_to_imem = pcCurrent;
199
200 // cycle/_inst._from_imem counters
201 reg ['REG_SIZE:0] cycles_current, num_inst_current;
202 always @(posedge clk) begin
203     if (rst) begin
204         cycles_current <= 0;
205         num_inst_current <= 0;
206     end else begin
207         cycles_current <= cycles_current + 1;
208         if (!rst && !stall_en) begin
209             num_inst_current <= num_inst_current + 1;
210         end
211     end
212 end
213
214
215 // NOTE: don't rename your RegFile instance as the tests expect it
216 // to be 'rf'
217 // TODO: you will need to edit the port connections, however.
218 wire ['REG_SIZE:0] rs1_data;
219 wire ['REG_SIZE:0] rs2_data;
220 reg ['REG_SIZE:0] rd_data;
221 reg rf_we;
222
223 RegFile rf (
224     .clk      (clk),
225     .rst      (rst),
226     .we       (rf_we),
227     .rd       (inst_rd),
228     .rd_data  (rd_data),
229     .rs1      (inst_rs1),
230     .rs2      (inst_rs2),
231     .rs1_data (rs1_data),
232     .rs2_data (rs2_data)
233 );

```

```

234     wire [31:0] cla_res, cla_op_b, cla_op_b_raw;
235     wire cla_cin;
236
237     assign cla_op_b_raw = (inst_opcode == OpRegImm) ? imm_i_sext :
238         rs2_data;
239     assign cla_op_b = inst_sub ? ~cla_op_b_raw : cla_op_b_raw;
240     assign cla_cin = inst_sub ? 1'b1 : 1'b0;
241
242     cla u_cla (
243         .a      (rs1_data),
244         .b      (cla_op_b),
245         .cin    (cla_cin),
246         .sum    (cla_res)
247     );
248
249     wire [31:0] div_op_a, div_op_b;
250     wire is_signed_div;
251
252     assign is_signed_div = inst_div | inst_rem;
253     assign div_op_a = (is_signed_div & rs1_data[31]) ? (~rs1_data + 1) :
254         rs1_data;
255     assign div_op_b = (is_signed_div & rs2_data[31]) ? (~rs2_data + 1) :
256         rs2_data;
257
258     wire [31:0] div_quot_raw, div_rem_raw;
259     DividerUnsignedPipelined u_div (
260         .clk (clk),
261         .rst (rst),
262         .stall (1'b0),
263         .i_dividend (div_op_a),
264         .i_divisor (div_op_b),
265         .o_quotient (div_quot_raw),
266         .o_remainder(div_rem_raw)
267     );
268
269     reg [31:0] div_res;
270     always @(*) begin
271         if (inst_div || inst_divu) begin
272             if (rs2_data == 0) begin
273                 div_res = 32'hFFFF_FFFF;
274             end else if (inst_div && (rs1_data == 32'h8000_0000) && (
275                 rs2_data == 32'hFFFF_FFFF)) begin
276                 div_res = 32'h8000_0000;
277             end else if (is_signed_div && (rs1_data[31] ^ rs2_data[31]))
278                 begin
279                     div_res = ~div_quot_raw + 32'd1;
280                 end else begin
281                     div_res = div_quot_raw;
282                 end
283             end else begin // rem or remu
284                 if (rs2_data == 0) begin
285                     div_res = rs1_data;
286                 end else if (inst_rem && (rs1_data == 32'h8000_0000) && (

```

```

      rs2_data == 32'hFFFF_FFFF)) begin
282      div_res = 32'd0;
283    end else if (is_signed_div && (rs1_data[31])) begin
284      div_res = ~div_rem_raw + 32'd1;
285    end else begin
286      div_res = div_rem_raw;
287    end
288  end
289
290
291  wire [63:0] mul_u, mul_s, mul_su;
292  assign mul_u = {32'b0, rs1_data} * {32'b0, rs2_data};
293  assign mul_s = $signed(rs1_data) * $signed(rs2_data);
294  assign mul_su = $signed(rs1_data) * $signed({1'b0, rs2_data});
295
296  reg take_branch;
297  always @(*) begin
298    case (inst_func3)
299      3'b000: take_branch = (rs1_data == rs2_data);
300          // beq
301      3'b001: take_branch = (rs1_data != rs2_data);
302          // bne
303      3'b100: take_branch = ($signed(rs1_data) < $signed(rs2_data));
304          // blt
305      3'b101: take_branch = ($signed(rs1_data) >= $signed(rs2_data));
306          // bge
307      3'b110: take_branch = (rs1_data < rs2_data);
308          // bltu
309      3'b111: take_branch = (rs1_data >= rs2_data);
310          // bgeu
311    default: take_branch = 1'b0;
312  endcase
313 end
314
315
316  reg [31:0] load_data;
317  always @(*) begin
318    case (addr_to_dmem[1:0])
319      2'b00: load_data = load_data_from_dmem;
320      2'b01: load_data = load_data_from_dmem >> 8;
321      2'b10: load_data = load_data_from_dmem >> 16;
322      2'b11: load_data = load_data_from_dmem >> 24;
323    endcase
324  end
325
326
327  reg illegal_inst;
328
329  always @(*) begin
330    illegal_inst = 1'b0;
331    pcNext = pcCurrent + 4;
332    rf_we = 0;
333    rd_data = 0;
334    halt = 0;
335    addr_to_dmem = 0;

```

```

328     store_data_to_dmem = 0;
329     store_we_to_dmem = 0;
330
331     case (inst_opcode)
332       OpLui: begin
333         // TODO: start here by implementing lui
334         rf_we = 1;
335         rd_data = {inst_from_imem[31:12], 12'b0};
336       end
337
338       OpAuipc: begin
339         rf_we = 1;
340         rd_data = pcCurrent + {inst_from_imem[31:12], 12'b0};
341       end
342
343       OpJal: begin
344         rf_we = 1;
345         rd_data = pcCurrent + 4;
346         pcNext = pcCurrent + imm_j_sext;
347       end
348
349       OpJalr: begin
350         rf_we = 1;
351         rd_data = pcCurrent + 4;
352         pcNext = (rs1_data + imm_i_sext) & ~32'd1;
353       end
354
355       OpBranch: begin
356         if (take_branch) begin
357           pcNext = pcCurrent + imm_b_sext;
358         end
359       end
360
361       OpLoad: begin
362         rf_we = 1;
363         addr_to_dmem = rs1_data + imm_i_sext;
364
365         case (inst_func3)
366           3'b000: rd_data = {{24{load_data[7]}}, load_data[7:0]};    //
367             lb
368           3'b001: rd_data = {{16{load_data[15]}}, load_data[15:0]};  //
369             lh
370           3'b010: rd_data = load_data;                                //
371             lw
372           3'b100: rd_data = {24'b0, load_data[7:0]};                //
373             lbu
374           3'b101: rd_data = {16'b0, load_data[15:0]};              //
375             lhu
376         endcase
377       end
378
379       OpStore: begin
380         addr_to_dmem = rs1_data + imm_s_sext;

```

```

376     store_data_to_dmem = rs2_data << (addr_to_dmem[1:0] * 8);
377
378     case (inst_funct3)
379         3'b000: store_we_to_dmem = 4'b0001 << addr_to_dmem[1:0]; //sb
380         3'b001: store_we_to_dmem = 4'b0011 << addr_to_dmem[1:0]; //sh
381         3'b010: store_we_to_dmem = 4'b1111; //sw
382     endcase
383 end
384
385 OpRegImm: begin
386     rf_we = 1;
387     case (inst_funct3)
388         3'b000: rd_data = cla_res; // addi
389         3'b010: rd_data = ($signed(rs1_data) < $signed(imm_i_sext))
390             ? 32'd1 : 32'd0; // slti
391         3'b011: rd_data = (rs1_data < imm_i_sext) ? 32'd1 : 32'd0; // sltiu
392         3'b100: rd_data = rs1_data ^ imm_i_sext; // xor
393         3'b110: rd_data = rs1_data | imm_i_sext; // ori
394         3'b111: rd_data = rs1_data & imm_i_sext; // andi
395         3'b001: rd_data = rs1_data << imm_shamt; // slli
396         3'b101: begin
397             if (inst_funct7 == 7'd0) begin
398                 rd_data = rs1_data >> imm_shamt; // srli
399             end else begin
400                 rd_data = $signed(rs1_data) >>> imm_shamt; // srai
401             end
402         endcase
403     end
404
405 OpRegReg: begin
406     if (is_div_op) begin
407         rf_we = !stall_en;
408     end else begin
409         rf_we = 1;
410     end
411
412     if (inst_funct7 == 7'd1) begin
413         case (inst_funct3)
414             3'b000: rd_data = rs1_data * rs2_data; // mul
415             3'b001: rd_data = mul_s[63:32]; // mulh
416             3'b010: rd_data = mul_su[63:32]; //

```

```

        mulhsu
417      3'b011: rd_data = mul_u[63:32];                                // mulhu
418      3'b100, 3'b101, 3'b110, 3'b111: rd_data = div_res; // div,
419          divu, rem, remu
420      endcase
421  end else begin
422    case (inst_funct3)
423      3'b000: rd_data = cla_res;                                         // add,
424          sub
425      3'b001: rd_data = rs1_data << rs2_data[4:0];
426          // sll
427      3'b010: rd_data = ($signed(rs1_data) < $signed(rs2_data))
428          ? 32'd1 : 32'd0; // slt
429      3'b011: rd_data = (rs1_data < rs2_data) ? 32'd1 : 32'd0;
430          // sltu
431      3'b100: rd_data = rs1_data ^ rs2_data;
432          // xor
433      3'b101: begin
434        if (inst_funct7[5]) rd_data = $signed(rs1_data) >>>
435            rs2_data[4:0];           // sra
436        else rd_data = rs1_data >> rs2_data[4:0];
437            // srl
438      end
439      3'b110: rd_data = rs1_data | rs2_data;
440          // or
441      3'b111: rd_data = rs1_data & rs2_data;
442          // and
443    endcase
444  end
445
446 OpEnviron: begin
447   if (inst_from_imem[31:7] == 25'd0) halt = 1'b1; // ecall
448 end
449
450 default: begin
451   illegal_inst = 1'b1;
452 end
453 endcase
454 end
455 endmodule
456
457 module MemorySingleCycle #(
458   parameter NUM_WORDS = 512
459 ) (
460   input                           rst,                      // rst for both imem
461   and dmém
462   input                           clock_mem,           // clock for both imem
463   and dmém
464   input      [‘REG_SIZE:0] pc_to_imem,           // must always be
465   aligned to a 4B boundary

```

```

454     output reg ['REG_SIZE:0] inst_from_imem,           // the value at memory
455         location pc_to_imem
456     input      ['REG_SIZE:0] addr_to_dmem,           // must always be
457         aligned to a 4B boundary
458     output reg ['REG_SIZE:0] load_data_from_dmem, // the value at memory
459         location addr_to_dmem
460     input      ['REG_SIZE:0] store_data_to_dmem, // the value to be
461         written to addr_to_dmem, controlled by store_we_to_dmem
462 // Each bit determines whether to write the corresponding byte of
463 // store_data_to_dmem to memory location addr_to_dmem.
464 // E.g., 4'b1111 will write 4 bytes. 4'b0001 will write only the
465 // least-significant byte.
466     input      [            3:0] store_we_to_dmem
467 );
468
469 // memory is arranged as an array of 4B words
470 reg ['REG_SIZE:0] mem_array[0:NUM_WORDS-1];
471
472 // preload instructions to mem_array
473 always @(posedge rst) begin
474     $readmemh("mem_initial_contents.hex", mem_array);
475 end
476
477 localparam AddrMsb = $clog2(NUM_WORDS) + 1;
478 localparam AddrLsb = 2;
479
480 always @(posedge clock_mem) begin
481     inst_from_imem <= mem_array[{pc_to_imem[AddrMsb:AddrLsb]}];
482 end
483
484 always @(negedge clock_mem) begin
485     if (store_we_to_dmem[0]) begin
486         mem_array[addr_to_dmem[AddrMsb:AddrLsb]][7:0] <=
487             store_data_to_dmem[7:0];
488     end
489     if (store_we_to_dmem[1]) begin
490         mem_array[addr_to_dmem[AddrMsb:AddrLsb]][15:8] <=
491             store_data_to_dmem[15:8];
492     end
493     if (store_we_to_dmem[2]) begin
494         mem_array[addr_to_dmem[AddrMsb:AddrLsb]][23:16] <=
495             store_data_to_dmem[23:16];
496     end
497     if (store_we_to_dmem[3]) begin
498         mem_array[addr_to_dmem[AddrMsb:AddrLsb]][31:24] <=
499             store_data_to_dmem[31:24];
500     end
501 // dmem is "read-first": read returns value before the write
502     load_data_from_dmem <= mem_array[{addr_to_dmem[AddrMsb:AddrLsb]}];
503 end
504 endmodule
505 /*

```

```

497 This shows the relationship between clock_proc and clock_mem. The
498   clock_mem is
499 phase-shifted 90° from clock_proc. You could think of one proc cycle
500   being
501 broken down into 3 parts. During part 1 (which starts @posedge
502   clock_proc)
503 the current PC is sent to the imem. In part 2 (starting @posedge
504   clock_mem) we
505 read from imem. In part 3 (starting @negedge clock_mem) we read/write
506   memory and
507 prepare register/PC updates, which occur at @posedge clock_proc.
508
509
510 proc: | ---- |
511      | |
512      | |
513
514 */
515
516 module Processor (
517   input  clock_proc,
518   input  clock_mem,
519   input  rst,
520   output halt
521 );
522
523
524   wire [‘REG_SIZE:0] pc_to_imem, inst_from_imem, mem_data_addr,
525     mem_data_loaded_value, mem_data_to_write;
526   wire [3:0] mem_data_we;
527
528   // This wire is set by cocotb to the name of the currently-running
529   // test, to make it easier
530   // to see what is going on in the waveforms.
531   wire [(8*32)-1:0] test_case;
532
533
534 MemorySingleCycle #(
535   .NUM_WORDS(8192)
536 ) memory (
537   .rst           (rst),
538   .clock_mem     (clock_mem),
539   // imem is read-only
540   .pc_to_imem    (pc_to_imem),
541   .inst_from_imem (inst_from_imem),
542   // dmem is read-write
543   .addr_to_dmem  (mem_data_addr),
544   .load_data_from_dmem (mem_data_loaded_value),
545   .store_data_to_dmem (mem_data_to_write),
546   .store_we_to_dmem (mem_data_we)
547 );
548
549
550 DatapathMultiCycle datapath (
551   .clk           (clock_proc),
552   .rst           (rst),
553   .pc_to_imem    (pc_to_imem),
554   .inst_from_imem (inst_from_imem),

```

```

543     .addr_to_dmem      (mem_data_addr),
544     .store_data_to_dmem (mem_data_to_write),
545     .store_we_to_dmem  (mem_data_we),
546     .load_data_from_dmem (mem_data_loaded_value),
547     .halt              (halt)
548 );
549 endmodule

```

Program 2: RTL implementation of the DatapathMultiCycle file.

## 3 Testbench

### 3.1 Divider testbench

This testbench is used to verify the functional correctness and the latency of the 8-stage Pipelined Unsigned Divider. Unlike standard combinational logic tests, this testbench must account for the 8-cycle latency between input stimulus and valid output.

The testbench uses two separate `initial` blocks running in parallel to handle the pipelined nature of the device:

- Stimulus generator: Feeds input data into the divider on every clock cycle.
- Output checker: Waits for the pipeline latency (roughly 10 cycles for reset and pipeline stages) before beginning to check the output against the expected values.

The testbench generates 15 random and directed test cases:

- Test 0: The first iteration test has exact division.
- Test 1: The second iteration test has remainder.
- Test 2: The third iteration test has dividend smaller than divisor.
- Test 3-14: The remaining cases use `$random` to generate arbitrary 32-bit values to ensure robust coverage.

```

1  `timescale 1ns / 1ps
2  module tb_div;
3  reg clk, rst, stall;
4  reg [31:0] i_dividend, i_divisor;
5  wire [31:0] o_remainder, o_quotient;
6
7  integer i, j;
8  integer pass, fail;
9  parameter TEST_CNT = 15;
10
11 reg [31:0] i_divd [0:TEST_CNT-1];
12 reg [31:0] i_divs [0:TEST_CNT-1];
13 reg [31:0] exp_quot [0:TEST_CNT-1];
14 reg [31:0] exp_rem [0:TEST_CNT-1];
15
16 DividerUnsignedPipelined dut (
17     .clk(clk),

```

```

18     .rst(rst),
19     .stall(stall),
20     .i_dividend(i_dividend),
21     .i_divisor(i_divisor),
22     .o_remainder(o_remainder),
23     .o_quotient(o_quotient)
24 );
25
26 task divi;
27   $display ("%0s", {80{"-"}});
28 endtask
29
30 task br;
31   $display ("%0s", {100{"="}});
32 endtask
33
34 task msg (input [700:0] txt);
35 begin
36   br();
37   $display ("%0s", txt);
38   br();
39 end
40 endtask
41
42 initial begin
43   clk = 0;
44   forever #5 clk = ~clk;
45 end
46
47 initial begin
48   $dumpfile("wave_div.vcd");
49   $dumpvars(0, tb_div);
50   msg ("TEST BEGIN");
51
52   rst = 1;
53   stall = 0;
54   i_dividend = 0;
55   i_divisor = 1;
56   pass = 0;
57   fail = 0;
58   #20;
59
60   @(negedge clk);
61   rst = 0;
62   repeat (2) @(posedge clk);
63
64   for (i = 0; i < TEST_CNT; i = i + 1) begin
65     @(negedge clk);
66     if (i == 0) begin
67       i_divd[i] = 100;
68       i_divs[i] = 10; // chia het
69     end else if (i == 1) begin
70       i_divd[i] = 100;

```

```

71         i_divs[i] = 30; // co du
72     end else if (i == 2) begin
73         i_divd[i] = 5;
74         i_divs[i] = 10; // bi chia < chia
75     end else begin
76         i_divd[i] = $random;
77         i_divs[i] = ($random % 100) + 1;
78     end
79
80     i_dividend = i_divd[i];
81     i_divisor = i_divs[i];
82
83     exp_quot[i] = i_divd[i] / i_divs[i];
84     exp_rem[i] = i_divd[i] % i_divs[i];
85 end
86
87 @(negedge clk);
88 i_dividend = 0;
89 i_divisor = 1;
90 end
91
92 initial begin
93     wait (rst == 0);
94     repeat (10) @(posedge clk); // 2 delay + 8 latency = 10
95
96     for (j = 0; j < TEST_CNT; j = j + 1) begin
97         @(negedge clk);
98         if ((o_quotient === exp_quot[j]) && (o_remainder === exp_rem[j]))
99             begin
100                 $display("[PASS] Time = %0t | Test #%-2d: %d / %d | Output
101                     Q = %d, R = %d", $time, j, i_divd[j], i_divs[j],
102                     o_quotient, o_remainder);
103                 pass = pass + 1;
104             end else begin
105                 $display("[FAIL] Time = %0t | Test #%-2d: %d / %d | Expect
106                     Q = %d, R = %d | Output Q = %d, R = %d", $time, j,
107                     i_divd[j], i_divs[j], exp_quot[j], exp_rem[j],
108                     o_quotient, o_remainder);
109                 fail = fail + 1;
110             end
111         end
112
113         msg ("SUMMARY");
114         $display ("TOTAL TESTS : %0d", pass + fail);
115         $display ("TOTAL PASSED: %0d", pass);
116         $display ("TOTAL FAILED: %0d", fail);
117
118         if (fail == 0)
119             $display ("=====> ALL TESTS PASSED");
120         else if (pass == 0)
121             $display ("=====> ALL TESTS FAILED");
122         else
123             $display ("=====> SOME TESTS FAILED");

```

```

118     msg ("TEST END");
119
120     #100;
121     $finish;
122 end
123 endmodule

```

Program 3: Testbench implementation of the DividerUnsignedPipelined module.

### 3.2 Processor testbench

To verify the `DatapathMultiCycle` file, we reused the testbench originally designed for the `DatapathSingleCycle` file (`tb_single.v`), can be viewed again in Section 4.2 in Lab 3 report. This approach is valid because the Instruction Set Architecture (ISA) interface remains identical between the two designs. Both processors:

- Read 32-bit instruction from memory.
- Execute standard RV32I instructions (ADD, SUB, BEQ, etc).
- Perform division operations (DIV, DIVU, REM, REMU), although with different internal latencies.
- Ultimately update the register file and memory with the same correct values.

Since the testbench verifies correctness based on the final state rather than internal details, it serves as an effective regression test to ensure that the new multi-cycle features did not break existing functionality.

While the test logic remains the same, the execution time for division-heavy programs increase significantly due to the 8-cycle latency per division operation.

## 4 Testbench results

All test cases executed successfully for all tested modules, with each marked as PASSED. This outcome confirms that the test modules behave as intended and that the implemented RTL designs meet the expected functionality.

### 4.1 Divider testbench results

```

1 # =====
2 # TEST BEGIN
3 # =====
4 # [PASS] Time = 120000 | Test # 0:          100 /           10 | Output Q
5 #             =          10, R =            0
6 # [PASS] Time = 130000 | Test # 1:          100 /           30 | Output Q
7 #             =          3, R =            10
8 # [PASS] Time = 140000 | Test # 2:          5 /            10 | Output Q
9 #             =          0, R =            5
# [PASS] Time = 150000 | Test # 3: 303379748 / 4294967198 | Output Q
#             =          0, R = 303379748

```

```

8 # [PASS] Time = 160000 | Test # 4: 2223298057 / 4294967288 | Output Q
9   = 0, R = 2223298057
10 # [PASS] Time = 170000 | Test # 5: 112818957 / 58 | Output Q
11   = 1945154, R = 25
12 # [PASS] Time = 180000 | Test # 6: 2999092325 / 4294967283 | Output Q
13   = 0, R = 2999092325
14 # [PASS] Time = 190000 | Test # 7: 15983361 / 30 | Output Q
15   = 532778, R = 21
16 # [PASS] Time = 200000 | Test # 8: 992211318 / 98 | Output Q
17   = 10124605, R = 28
18 # [PASS] Time = 210000 | Test # 9: 1993627629 / 13 | Output Q
19   = 153355971, R = 6
20 # [PASS] Time = 220000 | Test #10: 2097015289 / 4294967227 | Output Q
21   = 0, R = 2097015289
22 # [PASS] Time = 230000 | Test #11: 3807872197 / 4294967223 | Output Q
23   = 0, R = 3807872197
24 # [PASS] Time = 240000 | Test #12: 1924134885 / 4294967256 | Output Q
25   = 0, R = 1924134885
26 # [PASS] Time = 250000 | Test #13: 2301810194 / 40 | Output Q
27   = 57545254, R = 34
28 # [PASS] Time = 260000 | Test #14: 2033215986 / 4294967251 | Output Q
29   = 0, R = 2033215986
30 # =====
# SUMMARY
# =====
# TOTAL TESTS : 15
# TOTAL PASSED: 15
# TOTAL FAILED: 0
# =====> ALL TESTS PASSED
# =====
# TEST END
# =====
# ** Note: $finish : ../../tb/tb_div.v(122)
#     Time: 360 ns Iteration: 0 Instance: /tb_div

```

Simulation 1: Testbench results of the DividerUnsignedPipelined module.

## 4.2 Processor testbench results

```

1 # =====
2 # TEST BEGIN
3 # =====
4 # =====
5 # TEST 1
6 # =====
7 # [PASS] Time = 691 | ADDI x1, x0, 10 | Output x1 = 0
8   x0000000a
9 # [PASS] Time = 691 | ADDI x2, x0, -5 | Output x2 = 0
10  xfffffff
11 # [PASS] Time = 691 | ADD x3, x1, x2 | Output x3 = 0
12  x00000005
13 # [PASS] Time = 691 | SUB x4, x1, x2 | Output x4 = 0
14  x0000000f

```

```

11 # [PASS] Time = 691 | AND x5, x1, x2 | Output x5 = 0
12 # [PASS] Time = 691 | OR x6, x1, x2 | Output x6 = 0
13 # [PASS] Time = 691 | XOR x7, x1, x2 | Output x7 = 0
14 # [PASS] Time = 691 | SLLI x8, x1, 2 | Output x8 = 0
15 # [PASS] Time = 691 | SRAI x9, x2, 1 | Output x9 = 0
16 # [PASS] Time = 691 | SRLI x10, x2, 1 | Output x10 = 0
17 # [PASS] Time = 691 | SLT x11, x2, x1 | Output x11 = 0
18 # [PASS] Time = 691 | SLTU x12, x2, x1 | Output x12 = 0
19 # [PASS] Time = 691 | MUL x13, x1, x1 | Output x13 = 0
20 # [PASS] Time = 691 | DIV x14, x1, x2 | Output x14 = 0
21 # [PASS] Time = 691 | REM x15, x1, x2 | Output x15 = 0
22 # [PASS] Time = 691 | DIVU x16, x2, x1 | Output x16 = 0
23 # [PASS] Time = 691 | LW x18, 100(x0) | Output x18 = 0
24 # [PASS] Time = 691 | ADDI x19, x0, 1 | Output x19 = 0
25 # [PASS] Time = 691 | JAL x20, 8 | Output x20 = 0
26 # [PASS] Time = 691 | ADDI x21, x0, 2 | Output x21 = 0
27 # [PASS] Time = 691 | LUI x22, 0x12345 | Output x22 = 0
28 # [PASS] Time = 691 | XORI x23, x1, 5 | Output x23 = 0
29 # [PASS] Time = 691 | ANDI x24, x1, 0xF | Output x24 = 0
30 # [PASS] Time = 691 | ORI x25, x13, 0xFF00 | Output x25 = 0
31 # [PASS] Time = 691 | SLTI x26, x7, 0x28 | Output x26 = 0
32 # [PASS] Time = 691 | LTIU x27, x8, 0xFFFF_FFF1 | Output x27 = 0
33 # [PASS] Time = 691 | SLL x28, x1, x19 | Output x28 = 0
34 # [PASS] Time = 691 | SRL x29, x6, x19 | Output x29 = 0
35 # [PASS] Time = 691 | SRA x30, x6, x19 | Output x30 = 0
36 # [PASS] Time = 691 | REMU x31, x6, x3 | Output x31 = 0
37 # [PASS] HALT asserted.

```

```

38 # =====
39 # TEST 2 (REUSED REGISTERS)
40 # =====
41 # [PASS] Time = 971 | ADDI x1, x0, 0xFFFF_FFFB | Output x1 = 0
42 #           xfffffff
43 # [PASS] Time = 971 | ADDI x2, x0, 3 | Output x2 = 0
44 #           x00000003
45 # [PASS] Time = 971 | MULH x3, x1, x2 | Output x3 = 0
46 #           xfffffff
47 # [PASS] Time = 971 | MULHSU x4, x1, x2 | Output x4 = 0
48 #           xfffffff
49 # [PASS] Time = 971 | MULHU x5, x1, x2 | Output x5 = 0
50 #           x00000002
51 # [PASS] Time = 971 | ADDI x6, x0, 0xABCD_FE98 | Output x6 = 0
52 #           xfffffe98
53 # [PASS] Time = 971 | LB x7, 100(x0) | Output x7 = 0
54 #           xfffff98
55 # [PASS] Time = 971 | LH x8, 104(x0) | Output x8 = 0
56 #           xfffffe98
57 # [PASS] Time = 971 | LBU x9, 105(x0) | Output x9 = 0
58 #           x000000fe
59 # [PASS] Time = 971 | LHU x10, 104(x0) | Output x10 = 0
60 #           x0000fe98
61 # [PASS] Time = 971 | ADD x11, x9, x10 | Output x11 = 0
62 #           x0000ff96
63 # [PASS] Time = 971 | OR x12, x9, x10 | Output x12 = 0
64 #           x0000fefefe
65 # [PASS] Time = 971 | LW x13, 120(x0) | Output x13 = 0
66 #           x000000fe
67 # [PASS] Time = 971 | JALR x14, 130(x0) | Output x14 = 0
68 #           x00000054
69 # [PASS] HALT asserted.
70 # =====
71 # SUMMARY
72 # =====
73 # TOTAL TESTS : 46
74 # TOTAL PASSED: 46
75 # TOTAL FAILED: 0
76 # =====> ALL TESTS PASSED
77 # =====
78 # TEST END
79 # =====
80 # ** Note: $finish : ../../tb/tb_single.v(230)
81 #     Time: 1071 ns Iteration: 0 Instance: /tb_single

```

Simulation 2: Testbench results of the Processor module.

# 5 Simulation waveform

## 5.1 Divider simulation waveform



Figure 1

In Figure 1, the testbench updates the dividend and divisor values on each negative clock edge to initiate new division operations. At 40,000 ps, the first test case loads dividend = 100 (0x64) and divisor = 10 (0x0a). The `DividerUnsignedPipelined` module has an 8-stage pipeline latency, meaning the result appears after 8 positive clock edges. Consequently, at 120,000 ps (8 cycles later), the outputs are updated: quotient = 10 (0x0a) and remainder = 0.

The second test case loads dividend = 100 (0x64) and divisor = 30 (0x1e) at 50,000 ps (one clock cycle after the first). Due to the pipelined architecture, this operation also requires 8 clock cycles to complete. At 130,000 ps, the second result emerges: quotient = 3 and remainder = 10 (0x0a). This pattern continues for subsequent test cases, with each new operation initiated on consecutive negative edges.



Figure 2

As shown in Figure 2, once the pipeline is filled (after the initial 8-cycle latency), the divider produces one result per clock cycle in steady-state operation. The testbench loads 15 test cases sequentially, feeding new operands every cycle. After all valid test cases complete, the testbench sets dividend = 0 and divisor = 1 as safe default values to avoid division-by-zero scenarios, which are not handled in this design. Once the pipeline is full, we get one result per cycle, although each operation needs 8 cycles.

## 5.2 Processor simulation waveform



Figure 3

Both processor implementations were tested using the same testbench file `tb_single.v` to ensure they produce identical functional results. As shown in Figure 3, the multi-cycle processor executes most instructions with the same behavior as the single-cycle implementation. The register file values, ALU results and memory interface signals match between both designs, confirming that both processors correctly implement the RISC-V instruction set. For instructions like ADD, SUB, LW, SW, and branches, both processors complete execution in one clock cycle and advance the program counter by 4 bytes to fetch the next instruction.



Figure 4

However, a significant difference appears when executing division instructions. The single-cycle processor uses a combinational divider that completes the entire division operation within a single clock cycle. In contrast, the multi-cycle processor employs a pipelined divider which has an 8-stage pipeline architecture. As shown in Figure 4 at approximately 338 ns, when the DIVU instruction is executed, the processor detects that this is a division operation and activates the stall mechanism. The `stall_counter` begins counting from 0 to 8 over eight consecutive clock cycles, during which the `stall_en` signal remains high. While `stall_en` is active, the program counter freezes at its current value (0x3C in this case) and does not advance to the next instruction. The register file write enable signal `rf_we` is also disabled during the stall period, preventing any premature register updates. After finishing the 8-stages pipelines in 8 clock cycles, on the next positive edge `clock_proc`, the program counter finally advanced to the next instruction. Therefore, each division instruction takes 9 clock cycles total: 8 for computation and 1 for safe control handoff, which is standard practice in pipelined processor design where the write-back stage occurs one cycle after the execution stage completes.

Although each individual division takes 9 cycles to complete in the multi-cycle design compared to just 1 cycle in the single-cycle design, the multi-cycle processor can run at a higher clock frequency. For programs that do not heavily rely on division operations, the multi-cycle processor achieves better overall performance through its higher clock frequency despite having a variable CPI (Cycles Per Instruction) of 1 for most instructions and 9 for divisions.

# 6 Timing closure and Resource utilization

## 6.1 Timing closure

| Design Timing Summary                    |                                     |                                                   |
|------------------------------------------|-------------------------------------|---------------------------------------------------|
| Setup                                    | Hold                                | Pulse Width                                       |
| Worst Negative Slack (WNS): -9,451 ns    | Worst Hold Slack (WHS): -0,239 ns   | Worst Pulse Width Slack (WPWS): 3,020 ns          |
| Total Negative Slack (TNS): -8431,276 ns | Total Hold Slack (THS): -189,649 ns | Total Pulse Width Negative Slack (TPWS): 0,000 ns |
| Number of Failing Endpoints: 1290        | Number of Failing Endpoints: 1587   | Number of Failing Endpoints: 0                    |
| Total Number of Endpoints: 4276          | Total Number of Endpoints: 4276     | Total Number of Endpoints: 1664                   |

**Timing constraints are not met.**

Figure 5: Timing summary.

After synthesizing and implementing the multi-cycle datapath design on the Arty Z7-20 FPGA, we obtained the following timing results:

- Constraint clock (clock\_proc) frequency: 125 MHz (Period: 8.000 ns)
- Worst Negative Slack (WNS): -9.451 ns
- Conclusion: The design has not achieved Timing closure (Timing constraints are not met).

Based on the WNS, the maximum operating frequency ( $F_{max}$ ) is calculated as follows:

$$T_{min} = T_{constraint} - WNS = 8.000 - (-9.451) = 17.451ns$$

$$F_{max} = \frac{1}{T_{min}} \approx 57.30MHz$$

The previous results we get from the single-cycle datapath design in Lab 3 (can be viewed in Section 7.1) had a significantly lower  $F_{max}$  ( 8.75 MHz). Meanwhile, the introduction of the 8-stage pipelined divider has successfully broken the critical path. The new  $F_{max}$  of 57.30 MHz represents a massive improvement, approximately 6.5 times faster, compared to the single-cycle version. Although the design still does not meet the aggressive 125 MHz constraint, the pipelining strategy has proven effective in reducing the critical path delay, validating the design goal of the multi-cycle architecture.

## 6.2 Resource utilization

| Resource | Utilization | Available | Utilization % |
|----------|-------------|-----------|---------------|
| LUT      | 4941        | 53200     | 9.29          |
| FF       | 1627        | 106400    | 1.53          |
| BRAM     | 8           | 140       | 5.71          |
| DSP      | 12          | 220       | 5.45          |
| IO       | 4           | 125       | 3.20          |



Figure 6: Utilization summary.

Both single-cycle datapath and multi-cycle datapath resulted in 9% LUT usage, indicating that the combinational logic complexity is similar. However, there is a noticeable increase in Flip-Flop utilization, 1.53% vs 0.96% in single-cycle). This increase is directly attributed to the pipeline registers added between the 8 stages of the divider and the new state registers (stall counter) in the datapath. In conclusion, the design fits easily within the Arty Z7-20 FPGA.