

# FPGA Homework 2 Report

Parsa Haghaghoo  
Student ID : 40030644

December 27, 2025

## 1 Question 1

The task is to write Verilog code for the circuit shown in Figure 1. The circuit contains two combinational logic blocks feeding two D flip-flops that share a common clock:

- The upper D flip-flop receives  $D_1 = a \wedge b$ .
- The lower D flip-flop receives  $D_2 = b \vee c$ .
- Both flip-flops sample their inputs on the rising edge of `clk`.

1. Write the Verilog code for the circuit below.



Figure 1

Figure 1: Given circuit:  $q1 \leftarrow (a \wedge b)$  and  $q2 \leftarrow (b \vee c)$  at each rising edge of `clk`.

## 2 Design Approach

The circuit is modeled using:

- Two internal wires for the combinational outputs:

$$d1 = a \wedge b, \quad d2 = b \vee c$$

- A sequential `always @(posedge clk)` block to represent the two D flip-flops.

On every rising edge of `clk`, the flip-flops capture the combinational values and update `q1` and `q2`.

### 3 Verilog Implementation (RTL)

Listing 1 shows the Verilog module that directly matches the logic in Figure 1.

```

1  module q1 (
2    input  wire a,
3    input  wire b,
4    input  wire c,
5    input  wire clk,
6    output reg q1,
7    output reg q2
8  );
9
10 wire d1 = a & b;
11 wire d2 = b | c;
12
13 always @ (posedge clk) begin
14   q1 <= d1;
15   q2 <= d2;
16 end
17
18 endmodule

```

Listing 1: RTL Verilog module for Question 1.

### 4 Testbench

A testbench was created to verify sequential behavior. It generates a clock with a 10 ns period and applies different input combinations of  $(a, b, c)$  over time. The outputs are observed in simulation to confirm that they only update on clock edges.

```

1  `timescale 1ns/1ps
2
3  module q1_tb;
4
5  reg a, b, c;
6  reg clk;
7  wire q1, q2;
8
9  q1 uut (
10   .a(a),
11   .b(b),
12   .c(c),
13   .clk(clk),
14   .q1(q1),
15   .q2(q2)

```

```

16 );
17
18 // Clock generation (10 ns period)
19 always #5 clk = ~clk;
20
21 initial begin
22     // Initialize
23     clk = 0;
24     a = 0; b = 0; c = 0;
25
26     // Apply test vectors
27     #10 a=0; b=0; c=0;
28     #10 a=0; b=0; c=1;
29     #10 a=0; b=1; c=0;
30     #10 a=0; b=1; c=1;
31     #10 a=1; b=0; c=0;
32     #10 a=1; b=0; c=1;
33     #10 a=1; b=1; c=0;
34     #10 a=1; b=1; c=1;
35
36     #20 $stop;
37 end
38
39 endmodule

```

Listing 2: Testbench for Question 1 (Vivado simulation).

## 5 Simulation Results and Output Analysis

Figure 2 shows the simulation waveform for `a`, `b`, `c`, `clk`, `q1`, and `q2`.

### 5.1 Expected Behavior

Because the design uses D flip-flops:

- `q1` should update only on each rising edge of `clk` to the value ( $a \& b$ ).
- `q2` should update only on each rising edge of `clk` to the value ( $b | c$ ).

Therefore, changes on inputs `a`, `b`, and `c` do not immediately change `q1` and `q2`; the outputs change only after the next rising clock edge.

### 5.2 Observed Behavior

In the waveform:

- `q1` transitions to 1 only when both `a` and `b` are 1 at the sampling edge.
- `q2` becomes 1 when either `b` or `c` is 1 at the sampling edge.
- Both `q1` and `q2` remain stable between rising edges, indicating correct flip-flop behavior.

Any initial unknown (X) values are normal at time 0 since no reset signal is defined in the circuit.



Figure 2: Simulation waveform: q1 updates to  $(a \& b)$  and q2 updates to  $(b | c)$  on each rising edge of clk.

## 6 Synthesis / Resource Utilization

The synthesized design is minimal and uses only a small number of FPGA resources because it consists of:

- Two flip-flops (for q1 and q2)
- One LUT for AND logic and one LUT for OR logic (often mapped efficiently)

Figure 23 shows the Vivado utilization report for this design.



Figure 3: Vivado utilization summary for the synthesized design.



Figure 4: Vivado Schematic design.

## 7 Conclusion

This report implemented the given sequential circuit in Verilog by:

- Creating combinational signals  $d1 = a \& b$  and  $d2 = b | c$
- Capturing them on `posedge clk` using two D flip-flops

Simulation results confirm that  $q1$  and  $q2$  update only on clock edges and match the expected Boolean functions of the inputs. The utilization report verifies a compact hardware implementation.

## 8 Question 2: Full Adder and Ripple-Carry Adder

### 8.1 Problem Statement

Figure 5 illustrates a *full adder* circuit with inputs  $a$ ,  $b$ , and  $c_i$  (carry-in), and outputs  $s$  (sum) and  $c_o$  (carry-out). The full adder computes the two-bit binary result:

$$\{c_o, s\} = a + b + c_i$$

where the addition operator represents arithmetic addition, not logical OR.

The figure also demonstrates how multiple full adder blocks can be cascaded to form a *four-bit ripple-carry adder*, where the carry output of each stage feeds the carry input of the next stage.



(a) Full adder circuit



(b) Full adder symbol



(c) Four-bit ripple-carry adder circuit

Figure 2: A ripple-carry adder circuit.

Figure 5: (a) Full adder circuit, (b) full adder symbol, and (c) four-bit ripple-carry adder.

## 8.2 Full Adder Logic

A full adder produces:

- A **sum** bit, which is the XOR of all inputs
- A **carry-out** bit, which is asserted when at least two inputs are high

The Boolean equations are:

$$s = a \oplus b \oplus c_i$$

$$c_o = (a \wedge b) \vee (a \wedge c_i) \vee (b \wedge c_i)$$

These equations match the truth table and schematic shown in the problem statement.

## 8.3 Verilog Implementation

The full adder was implemented as a purely combinational module using continuous assignments.

```

1 module full_adder (
2     input wire a,
3     input wire b,
```

```

4     input  wire  ci,
5     output wire  s,
6     output wire  co
7 );
8
9 assign s = a ^ b ^ ci;
10 assign co = (a & b) | (a & ci) | (b & ci);
11
12 endmodule

```

Listing 3: Verilog implementation of the full adder.

## 8.4 Testbench

To verify correctness, a testbench was written to apply all eight possible input combinations of  $(a, b, c_i)$ . Because the full adder is combinational, no clock signal is required.

```

1 `timescale 1ns/1ps
2
3 module full_adder_tb;
4
5 reg a, b, ci;
6 wire s, co;
7
8 // Instantiate the full adder
9 full_adder dut (
10   .a(a),
11   .b(b),
12   .ci(ci),
13   .s(s),
14   .co(co)
15 );
16
17 initial begin
18   a = 0; b = 0; ci = 0; #10;    // 0 + 0 + 0 = 0
19   a = 0; b = 0; ci = 1; #10;    // 0 + 0 + 1 = 1
20   a = 0; b = 1; ci = 0; #10;    // 0 + 1 + 0 = 1
21   a = 0; b = 1; ci = 1; #10;    // 0 + 1 + 1 = 2
22   a = 1; b = 0; ci = 0; #10;    // 1 + 0 + 0 = 1
23   a = 1; b = 0; ci = 1; #10;    // 1 + 0 + 1 = 2
24   a = 1; b = 1; ci = 0; #10;    // 1 + 1 + 0 = 2
25   a = 1; b = 1; ci = 1; #10;    // 1 + 1 + 1 = 3
26
27   $stop;
28 end
29
30 endmodule

```

Listing 4: Testbench for the full adder.

## 8.5 Simulation Results and Analysis

Figure 6 shows the simulation waveform of the full adder.

- The  $s$  output correctly represents the least significant bit of the sum.
- The  $co$  output is asserted whenever the arithmetic sum exceeds 1.
- All output transitions occur immediately after input changes, as expected for combinational logic.

The observed results match the full adder truth table exactly.



Figure 6: Simulation waveform of the full adder showing correct sum and carry behavior.

---

## 8.6 Conclusion

In this question, a full adder was successfully implemented and verified using Verilog. The design correctly computes the two-bit sum  $\{c_o, s\}$  for all input combinations of  $a$ ,  $b$ , and  $c_i$ . Simulation results confirm that the implementation matches the theoretical equations and the provided circuit diagrams. This full adder module can be directly used as a building block for larger arithmetic circuits such as ripple-carry adders.

## 8.7 Synthesis Results (Resource Utilization)

After synthesis and implementation in Vivado, the full adder maps to a very small amount of FPGA hardware. Figure 7 shows that the design uses only **one LUT** for the logic, since modern FPGA LUTs can implement multi-input Boolean functions efficiently. Because the full adder is purely combinational, it requires **no flip-flops**.



Figure 7: Vivado utilization report for the full adder design. The design uses minimal resources (one LUT as logic).

### Discussion:

- The full adder outputs ( $s$  and  $c_o$ ) are combinational functions of three inputs ( $a$ ,  $b$ ,  $c_i$ ).
- FPGA tools can pack these functions into LUT resources efficiently; therefore, the utilization is very small.
- The I/O count corresponds to the ports of the module (3 inputs and 2 outputs).

## 8.8 Implemented Design Schematic (Technology Mapping)

Vivado's implemented schematic is shown in Figure 8. The inputs are connected through **input buffers (IBUF)** and the outputs are driven through **output buffers (OBUF)**, which is standard FPGA I/O structure. The internal logic is mapped into LUT primitives labeled **LUT3**, meaning each LUT implements a Boolean function of three inputs.



Figure 8: Implemented schematic of the full adder in Vivado. Each output is implemented using a 3-input LUT (LUT3), with IBUF/OBUF at the FPGA pins.

## Discussion:

- The **sum** output  $s = a \oplus b \oplus c_i$  is realized by a 3-input LUT (LUT3).
- The **carry** output  $c_o = (ab) + (ac_i) + (bc_i)$  is also realized by a LUT3.
- IBUF and OBUF blocks indicate the FPGA's physical input/output buffering around the LUT logic.
- This mapping confirms that the Verilog code is synthesized exactly as intended for the full adder.

## 8.9 Four-Bit Ripple-Carry Adder Implementation

Using the full adder module developed in the previous section, a four-bit ripple-carry adder was constructed as shown in Figure ???. This circuit adds two 4-bit binary numbers  $a[3 : 0]$  and  $b[3 : 0]$  along with an input carry  $c_{in}$ , producing a 4-bit sum  $s[3 : 0]$  and a final carry-out  $c_{out}$ .

The carry signal propagates sequentially from the least significant bit (LSB) to the most significant bit (MSB), which is why this design is referred to as a *ripple-carry adder*.

## 8.10 Design Description

The design consists of four full adder instances connected in series:

- The first full adder computes the sum of the LSBs using  $c_{in}$
- Each subsequent full adder receives the carry-out from the previous stage
- The final carry-out is produced by the MSB full adder

Mathematically, the circuit computes:

$$\{c_{out}, s[3 : 0]\} = a[3 : 0] + b[3 : 0] + c_{in}$$

## 8.11 Verilog Implementation

The Verilog module for the four-bit ripple-carry adder is shown in Listing 5.

```

1 module ripple_carry_adder4 (
2     input wire [3:0] a,
3     input wire [3:0] b,
4     input wire       cin,
5     output wire [3:0] s,
6     output wire       cout
7 );
8
9     wire c1, c2, c3;
10
11    full_adder fa0 (.a(a[0]), .b(b[0]), .ci(cin), .s(s[0]), .co(c1));
12    full_adder fa1 (.a(a[1]), .b(b[1]), .ci(c1),   .s(s[1]), .co(c2));
13    full_adder fa2 (.a(a[2]), .b(b[2]), .ci(c2),   .s(s[2]), .co(c3));

```

```

14 full_adder fa3 (.a(a[3]), .b(b[3]), .ci(c3), .s(s[3]), .co(cout)
15 );
16 endmodule

```

Listing 5: Verilog implementation of the 4-bit ripple-carry adder.

## 8.12 Testbench

A testbench was written to verify the operation of the ripple-carry adder using representative test cases, including overflow scenarios.

```

1 `timescale 1ns/1ps
2
3 module ripple_carry_adder4_tb;
4
5 reg [3:0] a, b;
6 reg cin;
7 wire [3:0] s;
8 wire cout;
9
10 ripple_carry_adder4 dut (
11     .a(a), .b(b), .cin(cin),
12     .s(s), .cout(cout)
13 );
14
15 initial begin
16     // test 1: 3 + 5 = 8
17     a = 4'd3; b = 4'd5; cin = 1'b0;
18     #10;
19
20     // test 2: 9 + 6 = 15
21     a = 4'd9; b = 4'd6; cin = 1'b0;
22     #10;
23
24     // test 3: 15 + 1 = 16 -> s=0, cout=1
25     a = 4'd15; b = 4'd1; cin = 1'b0;
26     #10;
27
28     // test 4: 7 + 8 + cin(1) = 16 -> s=0, cout=1
29     a = 4'd7; b = 4'd8; cin = 1'b1;
30     #10;
31
32     $stop;
33 end
34
35 endmodule

```

Listing 6: Testbench for the 4-bit ripple-carry adder.

## 8.13 Simulation Results and Output Analysis

Figure 9 shows the simulation waveform for the ripple-carry adder.



Figure 9: Simulation waveform of the 4-bit ripple-carry adder.

### Observed Results:

- For  $a = 3, b = 5, c_{in} = 0: s = 8, c_{out} = 0$
- For  $a = 9, b = 6, c_{in} = 0: s = 15, c_{out} = 0$
- For  $a = 15, b = 1, c_{in} = 0: s = 0, c_{out} = 1$  (overflow)
- For  $a = 7, b = 8, c_{in} = 1: s = 0, c_{out} = 1$  (overflow)

These results confirm correct ripple-carry behavior. The carry propagates through each full adder stage until the final sum and carry-out are produced.

## 8.14 Discussion

The ripple-carry adder correctly performs multi-bit binary addition using modular full adder blocks. Although simple and easy to implement, ripple-carry adders suffer from carry propagation delay, as each bit must wait for the previous carry to resolve. Despite this limitation, the ripple-carry adder is widely used due to its simplicity and clear structure.

## 8.15 Conclusion

The four-bit ripple-carry adder was successfully implemented using four cascaded full adders. Simulation results demonstrate correct functionality for normal addition and overflow cases. This design completes the implementation of the full adder-based arithmetic circuit described in Question 2.

# 9 Question 3(a): JK Flip-Flop Design

## 9.1 Problem Statement

The objective of this question is to design a *JK flip-flop* module. A JK flip-flop is a sequential storage element with inputs  $J$ ,  $K$ , a clock signal, and outputs  $Q$  and  $\bar{Q}$ . Depending on the values of  $J$  and  $K$ , the flip-flop can hold its state, set, reset, or toggle its output.

## 9.2 JK Flip-Flop Operation

The behavior of a JK flip-flop on the rising edge of the clock is summarized as follows:

| J | K | Next State $Q^+$     |
|---|---|----------------------|
| 0 | 0 | Hold ( $Q$ )         |
| 0 | 1 | Reset (0)            |
| 1 | 0 | Set (1)              |
| 1 | 1 | Toggle ( $\bar{Q}$ ) |

An asynchronous active-high reset is included to initialize the output to a known state.

## 9.3 Verilog Implementation

The JK flip-flop is implemented using a sequential `always` block triggered by the rising edge of the clock or the reset signal. The complement output  $\bar{Q}$  is generated using a continuous assignment.

```
1 module jk_ff (
2     input wire clk,
3     input wire rst,      // async reset (active-high)
4     input wire j,
5     input wire k,
6     output reg q,
7     output wire qbar
8 );
9
10 assign qbar = ~q;
11
12 always @ (posedge clk or posedge rst) begin
13     if (rst) begin
14         q <= 1'b0;
15     end else begin
16         case ({j,k})
17             2'b00: q <= q;          // hold
18             2'b01: q <= 1'b0;      // reset
19             2'b10: q <= 1'b1;      // set
20             2'b11: q <= ~q;        // toggle
21         endcase
22     end
23 end
```

```

24
25 endmodule

```

Listing 7: Verilog implementation of the JK flip-flop.

## 9.4 Testbench

A self-checking testbench was written to verify the correct operation of the JK flip-flop. The testbench applies all major JK input combinations and checks the output against the expected behavior after each rising clock edge.

```

1  `timescale 1ns/1ps
2
3 module jk_ff_tb;
4
5   reg clk, rst;
6   reg j, k;
7   wire q, qbar;
8
9   jk_ff dut (
10     .clk(clk), .rst(rst),
11     .j(j), .k(k),
12     .q(q), .qbar(qbar)
13 );
14
15 // 10 ns clock
16 always #5 clk = ~clk;
17
18 task step_and_check(input reg exp_q, input [1:0] jk);
19   begin
20     {j,k} = jk;
21     @(posedge clk);
22     #1;
23     if (q !== exp_q) begin
24       $display("FAIL time=%0t JK=%b exp_q=%b got_q=%b",
25               $time, jk, exp_q, q);
26       $stop;
27     end else begin
28       $display("PASS time=%0t JK=%b q=%b qbar=%b",
29               $time, jk, q, qbar);
30     end
31   end
32 endtask
33
34 initial begin
35   clk = 0; rst = 1; j = 0; k = 0;
36
37   // asynchronous reset
38   #2 rst = 1;
39   #8 rst = 0;
40

```

```

41      step_and_check(1'b0, 2'b00); // hold
42      step_and_check(1'b1, 2'b10); // set
43      step_and_check(1'b1, 2'b00); // hold
44      step_and_check(1'b0, 2'b01); // reset
45      step_and_check(1'b1, 2'b11); // toggle
46      step_and_check(1'b0, 2'b11); // toggle
47
48      $display("JK FF TEST PASSED");
49      $stop;
50
51
52 endmodule

```

Listing 8: Testbench for the JK flip-flop.

## 9.5 Simulation Results and Output Analysis

Figure 10 shows the simulation waveform of the JK flip-flop.



Figure 10: Simulation waveform of the JK flip-flop showing reset, hold, set, reset, and toggle operations.

### Observed Behavior:

- When `rst` is asserted,  $Q$  is immediately cleared to 0.
- With  $J = 0$  and  $K = 0$ , the output holds its previous value.
- With  $J = 1$  and  $K = 0$ , the output is set to 1.
- With  $J = 0$  and  $K = 1$ , the output is reset to 0.

- With  $J = 1$  and  $K = 1$ , the output toggles on each rising edge of the clock.

The waveform confirms that the flip-flop responds only on clock edges (except for reset), which is the correct behavior for a synchronous JK flip-flop with asynchronous reset.

## 9.6 Conclusion

A JK flip-flop with asynchronous active-high reset was successfully designed and verified. The Verilog implementation correctly follows the JK truth table, and simulation results confirm proper hold, set, reset, and toggle behavior. This module can be used as a fundamental building block in counters and other sequential digital systems.

## 10 Question 3(b): 4-bit Up-Counter Using JK Flip-Flops

### 10.1 Problem Statement

Using the JK flip-flop designed in Question 3(a) and the circuit diagram provided in Figure 11, design a 4-bit up-counter. The counter output should increment by one on every rising edge of the clock and wrap around after reaching 15 (i.e., modulo-16 behavior). A reset input is used to initialize the counter output to zero.



Figure 3: Counter With JK Flip-Flop

2

Figure 11: Counter with JK flip-flops (given diagram).

### 10.2 Design Approach

A JK flip-flop toggles its output when  $J = K = 1$ . Therefore, by setting  $J = K = T$ , a JK flip-flop behaves like a T flip-flop:

$$T = 1 \Rightarrow \text{toggle}, \quad T = 0 \Rightarrow \text{hold}$$

To construct a synchronous 4-bit up-counter:

- Bit 0 toggles every clock:  $T_0 = 1$
- Bit 1 toggles when bit 0 is 1:  $T_1 = Q_A$
- Bit 2 toggles when bits 1:0 are 11:  $T_2 = Q_A Q_B$
- Bit 3 toggles when bits 2:0 are 111:  $T_3 = Q_A Q_B Q_C$

This matches the AND-gate enable structure shown in the diagram.

### 10.3 Verilog Implementation

The 4-bit up-counter is built by instantiating four JK flip-flops and driving each one with the proper toggle-enable signal.

```

1 module upcounter4_jk (
2     input wire clk,
3     input wire rst,
4     output wire [3:0] out
5 );
6 wire qa, qb, qc, qd;
7
8 wire t0 = 1'b1;
9 wire t1 = qa;
10 wire t2 = qa & qb;
11 wire t3 = qa & qb & qc;
12
13 jk_ff FFA (.clk(clk), .rst(rst), .j(t0), .k(t0), .q(qa), .qbar())
14     ;
14 jk_ff FFB (.clk(clk), .rst(rst), .j(t1), .k(t1), .q(qb), .qbar())
15     ;
15 jk_ff FFC (.clk(clk), .rst(rst), .j(t2), .k(t2), .q(qc), .qbar())
16     ;
16 jk_ff FFD (.clk(clk), .rst(rst), .j(t3), .k(t3), .q(qd), .qbar())
17     ;
17
18 assign out = {qd, qc, qb, qa};
19
20 endmodule

```

Listing 9: Verilog implementation of 4-bit JK up-counter.

### 10.4 Testbench

A clock is generated with a 10 ns period. The reset is asserted initially and then released, allowing the counter to run.

```

1 `timescale 1ns/1ps
2
3 module upcounter4_jk_tb;
4
5 reg clk;
6 reg rst;

```

```

7   wire [3:0] out;
8
9   upcounter4_jk dut (
10    .clk(clk),
11    .rst(rst),
12    .out(out)
13  );
14
15 // Clock generation: 10 ns period
16 always #5 clk = ~clk;
17
18 initial begin
19   clk = 0;
20   rst = 1;
21
22   // Apply reset
23   #15;
24   rst = 0;
25
26   // Let the counter run
27   #200;
28
29   $stop;
30 end
31
32 endmodule

```

Listing 10: Testbench for the 4-bit JK up-counter.

## 10.5 Simulation Results and Output Analysis

Figure 12 shows the simulation waveform. After reset is deasserted, the output increments on each rising edge of the clock and cycles through hexadecimal values:

$$0, 1, 2, \dots, 9, a, b, c, d, e, f, 0, 1, \dots$$

This confirms correct modulo-16 up-counter behavior.



Figure 12: Simulation waveform of the 4-bit JK up-counter.

## 10.6 Implemented Schematic View

Vivado's schematic view confirms the design is composed of four JK flip-flop instances with the correct interconnections, and the 4-bit output is routed through output buffers.



Figure 13: Implemented schematic of the 4-bit JK up-counter in Vivado.

## 10.7 Resource Utilization

The utilization report in Figure 14 shows the counter uses a small number of FPGA resources. As expected, the design requires four registers (one per bit) and a few LUTs to implement the AND enable logic.



Figure 14: Vivado utilization report for the 4-bit JK up-counter.

## 10.8 Timing Summary

Figure 15 shows the timing summary. The report indicates that timing constraints are met (no failing endpoints), confirming the counter can operate correctly at the specified clock constraint in the project.



Figure 15: Vivado timing summary for the 4-bit JK up-counter.

## 10.9 Conclusion

A 4-bit synchronous up-counter was successfully implemented using four JK flip-flops. By driving each JK flip-flop with the appropriate toggle-enable condition, the design increments by one on each rising edge and wraps after 15. Simulation, schematic mapping, utilization, and timing results confirm correct operation.

# 11 Question 3(c): RTL Implementation of a 4-bit Up-Counter

## 11.1 Problem Statement

Design a 4-bit up-counter directly using an `always` block. The counter must increment on every rising edge of the clock and reset asynchronously to zero when the reset signal is asserted.

## 11.2 Design Approach

Instead of explicitly instantiating JK flip-flops, this design uses a register-based RTL description. The counter is implemented using a sequential `always` block that triggers

on the rising edge of the clock or the rising edge of the reset signal.

When reset is asserted, the output is cleared to zero. Otherwise, the counter increments by one on each clock cycle. Since the output is 4 bits wide, the counter naturally wraps around after reaching 15, implementing modulo-16 behavior.

### 11.3 Verilog Implementation

The RTL-based counter is shown in Listing 11.

```
1 module upcounter4_rtl (
2     input wire clk,
3     input wire rst,           // async reset (active-high)
4     output reg [3:0] out
5 );
6
7 always @ (posedge clk or posedge rst) begin
8     if (rst)
9         out <= 4'b0000;
10    else
11        out <= out + 1'b1;
12 end
13
14 endmodule
```

Listing 11: RTL Verilog implementation of 4-bit up-counter.

### 11.4 Testbench

The testbench generates a 10 ns clock and applies an initial reset before allowing the counter to run freely.

```
1 `timescale 1ns/1ps
2
3 module upcounter4_rtl_tb;
4
5 reg clk;
6 reg rst;
7 wire [3:0] out;
8
9 upcounter4_rtl dut (
10     .clk(clk),
11     .rst(rst),
12     .out(out)
13 );
14
15 // Clock generation: 10 ns period
16 always #5 clk = ~clk;
17
18 initial begin
19     clk = 0;
20     rst = 1;
21
```

```

22 // Apply reset
23 #15;
24 rst = 0;
25
26 // Let counter run freely
27 #200;
28
29 $stop;
30 end
31
32 endmodule

```

Listing 12: Testbench for RTL up-counter.

## 11.5 Simulation Results

Figure 16 shows the simulation waveform. After reset is deasserted, the output increments sequentially:

$$0, 1, 2, \dots, 9, a, b, c, d, e, f, 0, 1, \dots$$

This confirms correct modulo-16 up-counter behavior.



Figure 16: Simulation waveform of RTL 4-bit up-counter.

## 11.6 Implemented Schematic

The synthesized schematic in Figure 17 shows four flip-flops with minimal combinational logic, reflecting the simplicity of the RTL description.



Figure 17: Implemented schematic of RTL up-counter in Vivado.

## 11.7 Resource Utilization

Figure 18 presents the utilization report. Compared to the JK-based counter, this RTL implementation uses fewer LUTs while still requiring four registers.

| Hierarchy             |                |                        |                             |                  |                          |                     |
|-----------------------|----------------|------------------------|-----------------------------|------------------|--------------------------|---------------------|
|                       | Name           | Slice LUTs<br>(303600) | Slice Registers<br>(607200) | Slice<br>(75900) | LUT as Logic<br>(303600) | Bonded IOB<br>(600) |
| Summary               | upcounter4_rtl | 2                      | 4                           | 1                | 2                        | 6                   |
| Slice Logic           |                |                        |                             |                  |                          | 1                   |
| Slice LUTs (<1%)      |                |                        |                             |                  |                          |                     |
| LUT as Logic (<       |                |                        |                             |                  |                          |                     |
| Slice Registers (<1%) |                |                        |                             |                  |                          |                     |
| Register as Flip      |                |                        |                             |                  |                          |                     |

Figure 18: Vivado utilization report for RTL up-counter.

## 11.8 Timing Analysis

The timing summary in Figure 19 confirms that all setup, hold, and pulse width constraints are met, with no failing paths reported.

| Design Timing Summary |  |                                                       |          |                              |          |                                           |          |
|-----------------------|--|-------------------------------------------------------|----------|------------------------------|----------|-------------------------------------------|----------|
| General Information   |  | Setup                                                 |          | Hold                         |          |                                           |          |
| Timer Settings        |  | Worst Negative Slack (WNS):                           | 9.003 ns | Worst Hold Slack (WHS):      | 0.205 ns | Worst Pulse Width Slack (WPWS):           | 4.600 ns |
| Design Timing Summary |  | Total Negative Slack (TNS):                           | 0.000 ns | Total Hold Slack (THS):      | 0.000 ns | Total Pulse Width Negative Slack (TPWNS): | 0.000 ns |
| Clock Summary (1)     |  | Number of Failing Endpoints:                          | 0        | Number of Failing Endpoints: | 0        | Number of Failing Endpoints:              | 0        |
| > Check Timing (5)    |  | Total Number of Endpoints:                            | 4        | Total Number of Endpoints:   | 4        | Total Number of Endpoints:                | 5        |
| > Intra-Clock Paths   |  | <b>All user specified timing constraints are met.</b> |          |                              |          |                                           |          |
| Inter-Clock Paths     |  |                                                       |          |                              |          |                                           |          |
| Other Path Groups     |  |                                                       |          |                              |          |                                           |          |

Figure 19: Timing summary for RTL up-counter.

## 11.9 Comparison with JK-Based Counter

Compared to the JK flip-flop implementation in Question 3(b), the RTL counter:

- Uses fewer LUTs
- Is simpler and more concise
- Is easier to scale to wider counters

Both designs, however, produce identical functional behavior.

## 11.10 Conclusion

A 4-bit up-counter was successfully implemented using a direct RTL approach with an `always` block. Simulation, schematic, utilization, and timing results verify correct operation and demonstrate the efficiency of RTL-based counter design.

# 12 Question 3(d): Design Comparison

## 12.1 Problem Statement

Compare the designs of the 4-bit up-counters implemented in Question 3(b) (JK flip-flop based counter) and Question 3(c) (RTL counter using an `always` block) in terms of:

- RTL schematic structure
- Maximum clock speed
- Resource utilization

The comparison is performed using synthesis and implementation results generated by Xilinx Vivado for the target FPGA device.

## 12.2 RTL Schematic Comparison

The RTL schematic of the JK-based counter (Question 3(b)) shows four explicitly instantiated JK flip-flops connected through AND-gate logic that generates the toggle enable signals for each counter bit. This design closely follows the theoretical JK counter structure typically presented in digital design textbooks.

In contrast, the RTL schematic of the always-block counter (Question 3(c)) consists of four D-type registers and a small incrementer circuit that computes the next state  $\text{out} + 1$ . The RTL description allows the synthesis tool to infer the optimal hardware implementation automatically.

Overall, the JK-based design is more structural and explicit, while the RTL-based design is higher-level and more abstract.

### 12.3 Resource Utilization Comparison

Figures from the Vivado utilization reports indicate the following resource usage:

Table 1: Resource utilization comparison of the two counter designs.

| Design            | Slice LUTs | Registers | BUFG | IOBs |
|-------------------|------------|-----------|------|------|
| JK Counter (Q3b)  | 3          | 4         | 1    | 6    |
| RTL Counter (Q3c) | 2          | 4         | 1    | 6    |

Both designs require four registers, corresponding to the four counter bits. However, the RTL counter uses fewer LUTs than the JK-based design. This reduction is due to Vivado's ability to optimize the incrementer logic more efficiently than the explicit AND-gate toggle network used in the JK design.

### 12.4 Maximum Clock Speed Comparison

Both designs were constrained with a clock period of 10 ns (100 MHz), as shown in the Clock Summary report. The maximum achievable clock speed was estimated using the Worst Negative Slack (WNS) reported by Vivado.

#### JK-Based Counter

$$T_{\min} = 10.000 \text{ ns} - 9.157 \text{ ns} = 0.843 \text{ ns}$$
$$F_{\max} \approx \frac{1}{0.843 \text{ ns}} \approx 1186 \text{ MHz}$$

#### RTL Counter

$$T_{\min} = 10.000 \text{ ns} - 9.003 \text{ ns} = 0.997 \text{ ns}$$
$$F_{\max} \approx \frac{1}{0.997 \text{ ns}} \approx 1003 \text{ MHz}$$

### 12.5 Timing Summary Interpretation

Both counters meet all timing constraints, with no failing endpoints reported. The JK-based design exhibits a slightly higher maximum theoretical clock frequency due to a marginally simpler critical path. However, the difference is small and not significant for practical operation at moderate clock speeds such as 100 MHz.

### 12.6 Overall Comparison and Discussion

- The RTL counter is more resource-efficient, using fewer LUTs.
- The JK counter closely matches the conceptual flip-flop-based design.
- The JK counter achieves a slightly higher theoretical maximum clock speed.
- Both designs operate correctly well above the required clock frequency.

## 12.7 Conclusion

Both counter designs successfully implement a 4-bit up-counter. The JK-based design provides clear insight into flip-flop-level counter construction, while the RTL always-block implementation offers a simpler, more compact, and scalable solution. In practice, the RTL approach is preferred for FPGA design due to its reduced resource usage and ease of development, while the JK implementation remains valuable for educational purposes.

## Question 4

The following finite impulse response (FIR) filter is considered:

$$y(n) = ax(n) + bx(n - 1) + cx(n - 2) \quad (1)$$

This report presents the design, implementation, simulation, and timing analysis for the architecture with the **lowest latency**.

## 13 (a) Lowest Latency FIR Filter

### 13.1 Proposed Architecture

Figure 20 shows the hand-drawn architecture of the proposed FIR filter. The design follows the direct-form FIR structure and minimizes latency by avoiding any pipelining in the arithmetic datapath.



Figure 20: Hand-drawn lowest-latency FIR filter architecture

The architecture consists of:

- Two delay elements implementing  $x(n - 1)$  and  $x(n - 2)$
- Three multipliers for coefficients  $a$ ,  $b$ , and  $c$
- Two adders to accumulate the partial products

Since no pipeline registers are inserted in the multiply-add chain, the output is produced with minimum latency.

## 13.2 RTL Implementation

The FIR filter was implemented in Verilog using fixed-point arithmetic. The delay elements are implemented using registers, and the output is registered to ensure synchronous operation.

## 13.3 RTL Schematic

Figure 21 shows the RTL schematic generated by Vivado after synthesis. The diagram confirms the presence of the expected multipliers, adders, and delay registers corresponding to the direct-form FIR architecture.



Figure 21: RTL schematic of the lowest-latency FIR filter

The absence of pipeline registers in the datapath confirms that the design is optimized for minimum latency.

### 13.4 Simulation Results

Behavioral simulation was performed to verify functional correctness. Figure 22 shows the simulation waveform.



Figure 22: Behavioral simulation waveform of the FIR filter

From the waveform:

- Initial undefined values occur before reset due to uninitialized registers

- After reset deassertion, the delay line fills correctly
- The output follows the FIR equation exactly

For example, with coefficients  $a = 1$ ,  $b = 2$ , and  $c = 3$ , the output matches:

$$y(n) = x(n) + 2x(n - 1) + 3x(n - 2)$$

This confirms the correctness of the implementation.

### 13.5 Resource Utilization

Figure 23 shows the post-synthesis resource utilization report.



Figure 23: Post-synthesis resource utilization

The design uses:

- 38 Slice LUTs
- 64 Slice Registers
- DSP blocks inferred for multipliers
- 1 BUFG

The low utilization is a result of the compact direct-form structure and the absence of pipeline registers.

### 13.6 Timing Analysis Without Clock Constraint

Figure 24 shows the timing report before applying a clock constraint.



Figure 24: Timing summary without user-defined clock constraint

Since no clock constraint is defined, Vivado reports infinite slack and cannot determine the maximum operating frequency.

## 13.7 Timing Analysis With Clock Constraint

A clock constraint of 10 was applied. Figure 25 shows the post-implementation timing results.



Figure 25: Timing summary with 10 ns clock constraint

The reported worst negative slack (WNS) is:

$$WNS = +4.632 \text{ ns}$$

The minimum achievable clock period is:

$$T_{\min} = 10 - 4.632 = 5.368 \text{ ns}$$

Thus, the maximum operating frequency is:

$$F_{\max} \approx 186 \text{ MHz}$$

## 13.8 Critical Path Discussion

The critical path of the design spans:

- A register holding a delayed input sample
- One multiplier
- Two cascaded adders
- The output register

The lack of pipelining results in a long combinational path, which limits the maximum achievable clock frequency despite the low latency.

## 13.9 Conclusion

The lowest-latency FIR filter design achieves minimal sample delay and low hardware utilization. However, the long combinational critical path limits the maximum operating frequency. This highlights the trade-off between latency and timing performance in FIR filter architectures.

## 14 (b) Highest Throughput FIR Filter

### 14.1 Proposed Architecture

Figure 26 shows the hand-drawn architecture of the highest-throughput FIR filter. In this design, pipeline registers are inserted between arithmetic blocks in order to reduce the combinational delay and maximize the achievable clock frequency.



Figure 26: Hand-drawn pipelined FIR filter architecture for highest throughput

The architecture is divided into multiple pipeline stages:

- Stage 1: Multiplication of input samples with coefficients
- Stage 2: First addition and pipeline alignment

- Stage 3: Final accumulation and output register

By inserting registers between stages, the critical path is significantly shortened, allowing the circuit to operate at a much higher clock frequency.

## 14.2 RTL Implementation

The FIR filter was implemented in Verilog using a fully pipelined datapath. Pipeline registers are placed after each multiplier and adder to maximize throughput. Although this increases the latency in terms of clock cycles, the design is capable of processing one input sample per clock cycle at a higher operating frequency.

## 14.3 RTL Schematic

Figure 28 shows the RTL schematic generated by Vivado after synthesis. Compared to the lowest-latency design, a larger number of registers is visible, confirming the insertion of pipeline stages in the datapath.



Figure 27: RTL schematic of the highest-throughput FIR filter



Figure 28: RTL schematic of the highest-throughput FIR filter

The schematic clearly illustrates the pipelined structure, which is responsible for reducing the combinational delay of the critical path.

## 14.4 Simulation Results

Behavioral simulation was performed to verify functional correctness. Figure 29 shows the simulation waveform.



Figure 29: Behavioral simulation waveform of the pipelined FIR filter

From the waveform:

- Initial undefined values are observed before reset due to pipeline registers
- After reset deassertion, the pipeline gradually fills
- Once filled, the output produces one valid sample per clock cycle

Although the output is delayed by several clock cycles compared to the lowest-latency design, the functional behavior follows the FIR equation exactly, confirming correctness.

## 14.5 Resource Utilization

Figure 30 shows the post-synthesis resource utilization.



Figure 30: Resource utilization for the highest-throughput FIR filter

The design utilizes:

- 58 Slice LUTs
- 144 Slice Registers
- DSP blocks inferred for multipliers
- 1 BUFG

Compared to part (a), the number of registers is significantly higher due to pipeline insertion. This increase in resource usage is the trade-off required to achieve higher throughput.

## 14.6 Timing Analysis

A clock constraint of 10 was applied. Figure 32 shows the post-implementation timing summary.



Figure 31: Post-implementation timing summary for highest-throughput design



Figure 32: Post-implementation timing summary for highest-throughput design

The reported worst negative slack is:

$$WNS = +7.731 \text{ ns}$$

The minimum achievable clock period is therefore:

$$T_{\min} = 10 - 7.731 = 2.269 \text{ ns}$$

The corresponding maximum operating frequency is:

$$F_{\max} \approx 441 \text{ MHz}$$

This represents a substantial improvement compared to the lowest-latency design.

## 14.7 Critical Path Discussion

Due to pipelining, the critical path is reduced to:

- A register-to-register path containing at most one multiplier or one adder

This short critical path enables a much higher clock frequency, which directly translates to increased throughput.

## 14.8 Discussion

The highest-throughput FIR design successfully maximizes performance by introducing pipeline stages in the datapath. While the latency in clock cycles increases, the design can operate at a significantly higher clock frequency, making it suitable for high-speed signal processing applications.

The comparison between parts (a) and (b) highlights the fundamental trade-off between latency and throughput in FIR filter architectures.

## 15 (c) Lowest Area FIR Filter

### 15.1 Proposed Architecture

Figure 33 shows the hand-drawn architecture of the lowest-area FIR filter. In this design, hardware resources are minimized by sharing a single multiplier and a single adder across multiple clock cycles.



Figure 33: Hand-drawn resource-shared FIR filter architecture

A phase counter controls multiplexers that sequentially select the input samples  $x(n)$ ,  $x(n - 1)$ , and  $x(n - 2)$ , along with their corresponding coefficients  $a$ ,  $b$ , and  $c$ . The partial products are accumulated over three clock cycles, producing one output sample every three cycles.

### 15.2 RTL Implementation

The FIR filter was implemented using time-multiplexing and an accumulator-based architecture. Only one multiplier and one adder are instantiated, significantly reducing the overall hardware area. A small control unit (phase counter) orchestrates the computation across multiple cycles.



Figure 34: RTL Implementation of the lowest-area FIR filter

### 15.3 Simulation Results

Behavioral simulation was performed to verify correctness. Figure 35 shows the simulation waveform.



Figure 35: Behavioral simulation waveform of the lowest-area FIR filter

The waveform shows that:

- New inputs are accepted only when `x_valid` is asserted
- The output `y` is updated only when `y_valid` is asserted
- One valid output is produced every three clock cycles

This confirms correct functionality with reduced throughput due to resource sharing.

## 15.4 Resource Utilization

Figure 36 presents the post-synthesis resource utilization.



Figure 36: Resource utilization for the lowest-area FIR filter

The design uses:

- 102 Slice LUTs
- 100 Slice Registers
- 31 Slices
- A single shared multiplier

Compared to designs (a) and (b), this architecture minimizes arithmetic resources by reusing hardware across multiple clock cycles.

## 15.5 Timing Analysis

Timing analysis was performed with a 10 clock constraint. Figure 38 shows the post-implementation timing summary.



Figure 37: Timing summary for the lowest-area FIR filter



Figure 38: Timing summary for the lowest-area FIR filter

The reported worst negative slack is:

$$WNS = +5.536 \text{ ns}$$

The minimum clock period is:

$$T_{\min} = 10 - 5.536 = 4.464 \text{ ns}$$

Resulting in a maximum operating frequency of:

$$F_{\max} \approx 224 \text{ MHz}$$

## 15.6 Critical Path Discussion

The critical path consists of a register-to-register path containing a multiplexer, a single multiplier, and an adder. Due to the reduced combinational complexity per cycle, timing constraints are easily met despite the shared architecture.

## 15.7 Discussion

The lowest-area FIR filter design achieves minimal hardware usage by employing resource sharing and time-multiplexing. While throughput is reduced compared to the pipelined design, the architecture is highly area-efficient and suitable for resource-constrained applications.

This design illustrates the trade-off between hardware area and throughput in FIR filter implementations.