

# **Instruction Queue with Dynamic Scheduling Algorithms and Temporal Verification**

RTL Design and Verification Course Project

Vishwas Jasuja

Roll No: B23303

Branch: Microelectronics and VLSI

School of Computing and Electrical Engineering

Indian Institute of Technology, Mandi

November 15, 2025

# Contents

|          |                                                   |           |
|----------|---------------------------------------------------|-----------|
| <b>1</b> | <b>Introduction</b>                               | <b>4</b>  |
| 1.1      | Problem Statement . . . . .                       | 4         |
| 1.2      | Design Objectives . . . . .                       | 4         |
| 1.3      | System Architecture . . . . .                     | 4         |
| <b>2</b> | <b>Instruction Format and Encoding</b>            | <b>5</b>  |
| 2.1      | Instruction Structure . . . . .                   | 5         |
| 2.2      | Opcode Mapping . . . . .                          | 5         |
| 2.3      | Register Encoding . . . . .                       | 5         |
| <b>3</b> | <b>Module Descriptions</b>                        | <b>6</b>  |
| 3.1      | Instruction Queue (Reservation Station) . . . . . | 6         |
| 3.1.1    | Design Overview . . . . .                         | 6         |
| 3.1.2    | Key Implementation Features . . . . .             | 6         |
| 3.1.3    | Enqueue and Retire Logic . . . . .                | 6         |
| 3.2      | Dependency Graph Builder . . . . .                | 7         |
| 3.2.1    | DAG Construction Algorithm . . . . .              | 7         |
| 3.2.2    | RAW Dependency Detection . . . . .                | 7         |
| 3.2.3    | WAR and WAW Dependency Detection . . . . .        | 8         |
| 3.2.4    | Ready Vector Computation . . . . .                | 8         |
| 3.3      | Resource-Binding Scheduler . . . . .              | 9         |
| 3.3.1    | List Scheduling Algorithm . . . . .               | 9         |
| 3.3.2    | Resource Binding Strategy . . . . .               | 9         |
| 3.3.3    | Priority Selection . . . . .                      | 10        |
| 3.3.4    | Dynamic RAW Tracking . . . . .                    | 10        |
| <b>4</b> | <b>Temporal Verification and CTL Properties</b>   | <b>11</b> |
| 4.1      | CTL Property Specification . . . . .              | 11        |
| 4.1.1    | Property 1: Valid Issue Order . . . . .           | 11        |
| 4.1.2    | Property 2: No Resource Conflicts . . . . .       | 11        |
| 4.1.3    | Property 3: Dependency Respect . . . . .          | 11        |
| 4.1.4    | Property 4: Progress Guarantee . . . . .          | 11        |
| 4.2      | Topological Sorting for Issue Order . . . . .     | 12        |
| 4.3      | Deadlock Freedom Verification . . . . .           | 12        |
| 4.3.1    | Necessary and Sufficient Conditions . . . . .     | 12        |
| 4.3.2    | Proof of DAG Property . . . . .                   | 12        |
| 4.3.3    | Model Checking Approach . . . . .                 | 12        |
| <b>5</b> | <b>Simulation Results</b>                         | <b>13</b> |
| 5.1      | Test Methodology . . . . .                        | 13        |
| 5.2      | Test Set 1: Mixed Dependencies . . . . .          | 13        |
| 5.2.1    | Instruction Details . . . . .                     | 13        |
| 5.2.2    | Dependency Analysis . . . . .                     | 13        |
| 5.2.3    | Simulation Output . . . . .                       | 14        |
| 5.2.4    | Waveform Analysis . . . . .                       | 15        |
| 5.3      | Test Set 2: Chain Dependencies . . . . .          | 17        |
| 5.3.1    | Instruction Details . . . . .                     | 17        |

|          |                                                        |           |
|----------|--------------------------------------------------------|-----------|
| 5.3.2    | Dependency Analysis . . . . .                          | 17        |
| 5.3.3    | Simulation Output . . . . .                            | 18        |
| 5.4      | Test Set 3: Complex RAW Pattern . . . . .              | 20        |
| 5.4.1    | Instruction Details . . . . .                          | 20        |
| 5.4.2    | Dependency Analysis . . . . .                          | 20        |
| 5.4.3    | Simulation Output . . . . .                            | 21        |
| 5.5      | Performance Analysis . . . . .                         | 23        |
| <b>6</b> | <b>Verification Results</b>                            | <b>23</b> |
| 6.1      | CTL Property Verification Summary . . . . .            | 23        |
| 6.2      | Dependency Matrix Correctness . . . . .                | 23        |
| 6.2.1    | Matrix Interpretation . . . . .                        | 23        |
| 6.3      | Topological Order Validation . . . . .                 | 24        |
| <b>7</b> | <b>Design Challenges and Solutions</b>                 | <b>24</b> |
| 7.1      | Challenge 1: Vivado Synthesis Compatibility . . . . .  | 24        |
| 7.2      | Challenge 2: Retirement and Enqueue Ordering . . . . . | 24        |
| 7.3      | Challenge 3: Dynamic RAW Matrix Updates . . . . .      | 25        |
| 7.4      | Challenge 4: Scheduler Enable Signal . . . . .         | 25        |
| <b>8</b> | <b>Scalability and Future Enhancements</b>             | <b>25</b> |
| 8.1      | Current Limitations . . . . .                          | 25        |
| 8.2      | Scaling to Larger Instruction Windows . . . . .        | 25        |
| 8.2.1    | Architectural Changes Required . . . . .               | 25        |
| 8.2.2    | Complexity Analysis . . . . .                          | 26        |
| 8.3      | Extending to 32-bit and 64-bit Instructions . . . . .  | 26        |
| 8.3.1    | 32-bit Instruction Format . . . . .                    | 26        |
| 8.3.2    | 64-bit Instruction Support . . . . .                   | 27        |
| 8.4      | Multiple Functional Units per Type . . . . .           | 27        |
| 8.5      | Advanced Scheduling Algorithms . . . . .               | 28        |
| 8.5.1    | Multiple Issue Support . . . . .                       | 28        |
| 8.5.2    | Critical Path Prioritization . . . . .                 | 28        |
| 8.5.3    | Load Balancing Across FUs . . . . .                    | 28        |
| 8.6      | Memory Dependency Tracking . . . . .                   | 29        |
| 8.7      | Speculative Execution Support . . . . .                | 29        |
| <b>9</b> | <b>Conclusion</b>                                      | <b>30</b> |
| 9.1      | Key Contributions . . . . .                            | 30        |
| 9.2      | Lessons Learned . . . . .                              | 30        |
| 9.3      | Future Work . . . . .                                  | 31        |
| <b>A</b> | <b>Complete Module Listings</b>                        | <b>31</b> |
| A.1      | Top-Level Integration Module . . . . .                 | 31        |
| <b>B</b> | <b>Simulation Test Cases</b>                           | <b>32</b> |
| B.1      | Additional Test Vectors . . . . .                      | 32        |
| <b>C</b> | <b>References</b>                                      | <b>32</b> |

## Abstract

This report presents the design and verification of an instruction queue system with dynamic issue logic for out-of-order execution. The system implements list scheduling algorithms to prioritize instructions based on data dependencies, resource availability, and functional unit binding. A dependency Directed Acyclic Graph (DAG) is constructed to model Read-After-Write (RAW), Write-After-Read (WAR), and Write-After-Write (WAW) hazards. Resource binding optimization ensures efficient utilization of multiple functional units (ALU, MUL, DIV). The design is verified using Computation Tree Logic (CTL) properties to ensure correct instruction issue order, absence of resource conflicts, respect for dependencies, and deadlock freedom. Topological sorting on the dependency graph determines the valid issue order, and model checking validates the absence of deadlocks in the scheduling system.

# 1 Introduction

## 1.1 Problem Statement

Modern processors employ out-of-order execution to improve instruction-level parallelism (ILP) and overall throughput. This project implements a complete instruction scheduling pipeline consisting of:

- An instruction queue with reservation station functionality
- Dynamic dependency analysis using DAG construction
- Resource-aware list scheduling with multiple functional units
- Temporal verification using CTL properties

## 1.2 Design Objectives

1. **Dynamic Issue Logic:** Instructions should issue as soon as their dependencies are satisfied and resources are available
2. **Dependency Management:** Accurate tracking of RAW, WAR, and WAW hazards
3. **Resource Binding:** Efficient allocation of instructions to functional units
4. **Deadlock Freedom:** Guarantee that the system cannot enter a deadlock state
5. **Correctness Verification:** Formal verification of scheduling properties

## 1.3 System Architecture

The complete system architecture consists of three main modules integrated into a top-level design:



Figure 1: System Architecture Overview

## 2 Instruction Format and Encoding

### 2.1 Instruction Structure

Each instruction is encoded as an 8-bit word with the following fields:

| Bits [7:6] | Bits [5:4] | Bits [3:2] | Bits [1:0]  |
|------------|------------|------------|-------------|
| Opcode     | Source 1   | Source 2   | Destination |

Table 1: Instruction Format (8 bits)

### 2.2 Opcode Mapping

| Opcode | Operation Type | FU Type | Latency  |
|--------|----------------|---------|----------|
| 00     | ALU (ADD/SUB)  | ALU     | 2 cycles |
| 01     | ALU (Logical)  | ALU     | 2 cycles |
| 10     | Multiply       | MUL     | 3 cycles |
| 11     | Divide         | DIV     | 4 cycles |

Table 2: Opcode to Functional Unit Mapping

### 2.3 Register Encoding

Source and destination registers are encoded using 2 bits, allowing 4 architectural registers (R0, R1, R2, R3).

## 3 Module Descriptions

### 3.1 Instruction Queue (Reservation Station)

#### 3.1.1 Design Overview

The instruction queue module (`instruction_queue_rs`) implements a reservation station that holds up to 4 instructions. It supports:

- Dynamic enqueue of instructions
- Selective retirement based on completion signals from the scheduler
- Parallel access to all valid entries for dependency analysis

#### 3.1.2 Key Implementation Features

```
1 // Internal storage (parallel arrays)
2 reg [1:0] opcode [0:3];
3 reg [1:0] src1 [0:3];
4 reg [1:0] src2 [0:3];
5 reg [1:0] dest [0:3];
6 reg valid [0:3];
7 reg [2:0] count; // number of valid entries (0..4)
```

Listing 1: Instruction Queue Storage Structure

The queue uses parallel array storage rather than a FIFO structure to enable:

1. **Out-of-order retirement:** Instructions complete in variable time based on their functional unit latency
2. **Random access:** The DAG builder needs simultaneous access to all instructions
3. **Efficient storage utilization:** Free slots can be reused immediately after retirement

#### 3.1.3 Enqueue and Retire Logic

```
1 // Retire: compute retire count and invalidate entries
2 del_cnt = retire_onehot[0] + retire_onehot[1] +
           retire_onehot[2] + retire_onehot[3];
3 for (i = 0; i < 4; i = i + 1) begin
4     if (retire_onehot[i]) begin
5         valid[i] <= 1'b0;
6     end
7 end
8 if (del_cnt != 0)
9     count <= count - del_cnt;
11
12 // Enqueue: find first free slot
13 if (enqueue) begin
14     free_idx = -1;
15     for (i = 0; i < 4; i = i + 1) begin
16         if ((free_idx == -1) && (valid[i] == 1'b0))
17             free_idx = i;
```

```

18     end
19   if (free_idx != -1) begin
20     opcode[free_idx] <= instr_in[7:6];
21     src1[free_idx]    <= instr_in[5:4];
22     src2[free_idx]    <= instr_in[3:2];
23     dest[free_idx]   <= instr_in[1:0];
24     valid[free_idx]  <= 1'b1;
25     count <= count + 1;
26   end
27 end

```

Listing 2: Retirement and Enqueue Operations

**Key Design Decision:** Retirement happens *before* enqueue in the same cycle. This ensures that freed slots are immediately available for new instructions.

## 3.2 Dependency Graph Builder

### 3.2.1 DAG Construction Algorithm

The dependency graph module (`dep_graph_with_valid`) constructs three types of dependency matrices:

1. **RAW (Read-After-Write)**: True data dependencies
2. **WAR (Write-After-Read)**: Anti-dependencies
3. **WAW (Write-After-Write)**: Output dependencies

Each dependency matrix is represented as a flattened 16-bit vector encoding a  $4 \times 4$  adjacency matrix in row-major order.

### 3.2.2 RAW Dependency Detection

The RAW detection implements a **nearest producer** algorithm:

```

1 // RAW (nearest producer) - for each consumer j, scan i = j-1 downto 0
2 for (j = 0; j < 4; j = j + 1) begin
3   if (valid_bits[j]) begin
4     // Check src1 dependency
5     begin : RAW_SRC1
6       for (i = j-1; i >= 0; i = i - 1) begin
7         if (valid_bits[i] && (dest[i] == src1[j])) begin
8           raw_f[(i*4)+j] = 1'b1;
9           disable RAW_SRC1; // Stop after first match
10        end
11      end
12    end
13
14    // Check src2 dependency
15    begin : RAW_SRC2
16      for (i = j-1; i >= 0; i = i - 1) begin
17        if (valid_bits[i] && (dest[i] == src2[j])) begin
18          raw_f[(i*4)+j] = 1'b1;
19          disable RAW_SRC2;
20        end
21      end

```

```

22         end
23     end
24 end

```

Listing 3: RAW Detection - Nearest Producer Search

#### Algorithm Properties:

- **Nearest Producer Only:** Only the most recent producer is recorded, eliminating redundant edges
- **Program Order Preservation:** Scanning backward ensures program order semantics
- **Separate Source Tracking:** src1 and src2 are checked independently

#### 3.2.3 WAR and WAW Dependency Detection

```

1 // WAR & WAW (i < j, forward scan)
2 for (i = 0; i < 4; i = i + 1) begin
3     if (valid_bits[i]) begin
4         for (j = i+1; j < 4; j = j + 1) begin
5             if (valid_bits[j]) begin
6                 // WAR: j writes reg read by i
7                 if ((dest[j] == src1[i]) || (dest[j] == src2[i]))
8                     war_f[(i*4)+j] = 1'b1;
9
10                // WAW: both write same dest
11                if (dest[i] == dest[j])
12                    waw_f[(i*4)+j] = 1'b1;
13            end
14        end
15    end
16 end

```

Listing 4: WAR and WAW Detection

#### 3.2.4 Ready Vector Computation

An instruction is ready when:

1. It is valid
2. It has no RAW predecessors (all dependencies satisfied)

```

1 for (j = 0; j < 4; j = j + 1) begin
2     if (!valid_bits[j])
3         ready_r[j] = 1'b0;
4     else begin
5         ready_r[j] = 1'b1;
6         for (i = 0; i < 4; i = i + 1)
7             if (raw_f[(i*4)+j] == 1'b1)
8                 ready_r[j] = 1'b0;
9     end
10 end

```

Listing 5: Ready Vector Calculation

### 3.3 Resource-Binding Scheduler

#### 3.3.1 List Scheduling Algorithm

The scheduler implements a **priority-based list scheduling** algorithm with the following characteristics:

---

**Algorithm 1** Dynamic List Scheduling with Priority-Based Single Issue

---

- 1: Initialize RAW matrix from DAG
- 2: Initialize all FU busy counters to 0
- 3: **while** instructions remain in queue **do**
- 4:     Decrement all FU busy counters
- 5:     **Detect completions**: if counter transitions 0, mark instruction retired
- 6:     **Update RAW**: clear row for retired instruction
- 7:     Compute dependency-ready instructions
- 8:     Compute resource-ready instructions (FU available)
- 9:     **Select** highest-priority (lowest index) ready instruction
- 10:    **if** instruction selected **then**
- 11:       Allocate FU and set busy counter
- 12:       Mark instruction as issued
- 13:       **Wait** until next cycle (single issue constraint)
- 14:    **end if**
- 15: **end while**

---

#### Key Differences from True Parallel Issue:

- **Single Issue Point**: Only one instruction can be selected per cycle
- **Priority Encoding**: Lower index always wins among ready instructions
- **Execution Overlap**: Parallelism comes from multiple instructions executing in different FUs simultaneously, not from simultaneous issue
- **In-Order Issue**: Instructions issue in program order when dependencies allow

#### 3.3.2 Resource Binding Strategy

Each instruction type is bound to specific functional unit types:

```
1 always @(*) begin
2     alu_free = 0;
3     mul_free = 0;
4     div_free = 0;
5
6     for (i = 0; i < MAX_FU; i = i + 1) begin
7         if (i < ALU_COUNT && alu_busy[i] == 0)
8             alu_free = alu_free + 1;
9
10        if (i < MUL_COUNT && mul_busy[i] == 0)
11            mul_free = mul_free + 1;
12
13        if (i < DIV_COUNT && div_busy[i] == 0)
14            div_free = div_free + 1;
```

```

15     end
16 end

```

Listing 6: FU Availability Computation

### 3.3.3 Priority Selection

Instructions are prioritized by their position in the queue with strict priority ordering - \*\*only one instruction can issue per cycle\*\*:

```

1 always @(*) begin
2     if      (ready[0]) sel = 4'b0001;    // I0 highest priority
3     else if (ready[1]) sel = 4'b0010;    // I1
4     else if (ready[2]) sel = 4'b0100;    // I2
5     else if (ready[3]) sel = 4'b1000;    // I3 lowest priority
6     else           sel = 4'b0000;    // No ready instruction
7 end

```

Listing 7: Priority-Based Instruction Selection (Single Issue per Cycle)

#### Key Scheduling Characteristics:

- **Single Issue:** Only one instruction issues per cycle, even if multiple are ready
- **Fixed Priority:** Lower-indexed instructions always have priority over higher-indexed ones
- **In-Order Issue:** Instructions issue in program order when dependencies permit
- **Out-of-Order Completion:** Instructions can complete in any order based on FU latency

This implements a **conservative in-order issue, out-of-order completion** pipeline where parallelism comes from overlapping execution of instructions in different functional units, not from simultaneous issue.

### 3.3.4 Dynamic RAW Tracking

The scheduler maintains a dynamic copy of the RAW matrix that evolves as instructions complete:

```

1 for (i=0; i<MAX_FU; i=i+1) begin
2     // ALU completion
3     if (i < ALU_COUNT && (nxt_alu_busy[i] == 0) &&
4         (alu_busy[i] != 0)) begin
5         retire_onehot[ fu_idx_alu[i] ] <= 1'b1;
6         // Clear entire row in RAW matrix
7         raw_dyn[(fu_idx_alu[i]*4)+0] <= 0;
8         raw_dyn[(fu_idx_alu[i]*4)+1] <= 0;
9         raw_dyn[(fu_idx_alu[i]*4)+2] <= 0;
10        raw_dyn[(fu_idx_alu[i]*4)+3] <= 0;
11    end
12    // Similar logic for MUL and DIV...
13 end

```

Listing 8: Completion Detection and RAW Update

## 4 Temporal Verification and CTL Properties

### 4.1 CTL Property Specification

Computation Tree Logic (CTL) is used to formally specify temporal properties of the scheduling system. The following properties are verified:

#### 4.1.1 Property 1: Valid Issue Order

**CTL Formula:**

$$AG(\text{issue\_valid} \Rightarrow \text{dependencies\_satisfied}) \quad (1)$$

**Natural Language:** Globally, whenever an instruction is issued, all its RAW dependencies must be satisfied.

**Verification in Design:**

```
1 // Ready vector ensures this property
2 ready_r[j] = valid_bits[j] && (raw_mask[j] & valid_bits) == 0
```

#### 4.1.2 Property 2: No Resource Conflicts

**CTL Formula:**

$$AG(\forall \text{FU}_i : \neg(\text{busy}[i] \wedge \text{new\_issue\_to\_FU}_i)) \quad (2)$$

**Natural Language:** Globally, no instruction can be issued to a busy functional unit.

**Verification in Design:**

```
1 // Resource availability check before issue
2 ready = dependency_ready & resource_available
```

#### 4.1.3 Property 3: Dependency Respect

**CTL Formula:**

$$AG((RAW[i][j] = 1) \Rightarrow AF(\text{retire}[i] \prec \text{issue}[j])) \quad (3)$$

**Natural Language:** If instruction  $j$  depends on instruction  $i$ , then eventually  $i$  must retire before  $j$  issues.

**Verification in Design:**

```
1 // RAW matrix row cleared only on retirement
2 if (retire_onehot[i])
3     raw_dyn[i*4 +: 4] = 4'b0000;
```

#### 4.1.4 Property 4: Progress Guarantee

**CTL Formula:**

$$AG(\text{valid\_bits} \neq 0 \Rightarrow AF(\text{issue\_valid})) \quad (4)$$

**Natural Language:** If there are valid instructions, eventually one will issue (no livelock).

## 4.2 Topological Sorting for Issue Order

The dependency DAG inherently provides a partial order on instructions. The scheduler's priority selection combined with dependency checking implements an implicit topological sort:

---

### Algorithm 2 Implicit Topological Sort in Scheduler

---

```

1:  $L \leftarrow$  empty list (issued instructions)
2:  $S \leftarrow \{i : \text{indegree}[i] = 0\}$  (ready instructions)
3: while  $S \neq \emptyset$  do
4:   Remove instruction  $n$  with highest priority from  $S$ 
5:   Add  $n$  to  $L$ 
6:   for each instruction  $m$  with edge  $n \rightarrow m$  do
7:     Remove edge  $n \rightarrow m$  (clear RAW entry)
8:     if  $m$  has no other incoming edges then
9:       Insert  $m$  into  $S$ 
10:      end if
11:    end for
12:  end while

```

---

## 4.3 Deadlock Freedom Verification

### 4.3.1 Necessary and Sufficient Conditions

For the scheduling system to be deadlock-free:

1. **DAG Property:** The dependency graph must be acyclic
2. **Resource Finiteness:** Resources are finite but instructions eventually complete
3. **No Circular Wait:** RAW dependencies form a DAG, preventing circular waits

### 4.3.2 Proof of DAG Property

**Theorem:** The RAW dependency graph is always acyclic.

**Proof:**

- RAW edges are directed from producer (earlier instruction) to consumer (later instruction)
- Instruction indices provide a strict total order:  $I_0 < I_1 < I_2 < I_3$
- RAW edges always point forward: if  $RAW[i][j] = 1$ , then  $i < j$
- Therefore, no path can form a cycle  $\square$

### 4.3.3 Model Checking Approach

Deadlock is defined as:

$$\text{Deadlock} \equiv (\text{valid\_bits} \neq 0) \wedge (\text{ready} = 0) \wedge AG(\text{ready} = 0) \quad (5)$$

This can be checked using CTL:

$$AG(\text{valid\_bits} \neq 0 \Rightarrow EF(\text{ready} \neq 0)) \quad (6)$$

**Verification Result:** The design guarantees this property through:

1. Acyclic dependency graph
2. Guaranteed progress of busy counters (always decrement)
3. Finite instruction latencies

## 5 Simulation Results

### 5.1 Test Methodology

Three instruction sets were tested to verify:

- Correct dependency detection (RAW, WAR, WAW)
- Proper scheduling with resource constraints
- Accurate retirement and RAW matrix updates
- Deadlock freedom under various dependency patterns

### 5.2 Test Set 1: Mixed Dependencies

#### 5.2.1 Instruction Details

| Instruction | Binary   | Opcode   | Src1, Src2 | Dest |
|-------------|----------|----------|------------|------|
| I0          | 00011011 | ALU (00) | R1, R2     | R3   |
| I1          | 10110001 | MUL (10) | R3, R0     | R1   |
| I2          | 01011000 | ALU (01) | R1, R2     | R0   |
| I3          | 11101101 | DIV (11) | R2, R3     | R1   |

Table 3: Instructions

#### 5.2.2 Dependency Analysis

**RAW Dependencies:**

- $I_0 \rightarrow I_1$  ( $I_0$  produces R3,  $I_1$  consumes R3)
- $I_1 \rightarrow I_2$  ( $I_1$  produces R1,  $I_2$  consumes R1)

**Expected Issue Order:**  $I_0 \rightarrow I_1 \rightarrow I_3 \rightarrow I_2$



Figure 2: RAW Dependency DAG

#### DAG Interpretation:

- I0 is the root (no dependencies)
- I3 depends on I0 (consumes R3)
- I1 depends on I0 (consumes R3)
- I2 depends on I1 (consumes R1)
- Although I3 could theoretically execute independently after I0, the priority scheduling ensures I1 issues first when both are ready
- **Execution overlap:** I1 and I3 execute concurrently in different FUs (MUL and DIV), but issue sequentially

#### 5.2.3 Simulation Output

```

===== PHASE-1 : INSTRUCTION QUEUE =====
I0=00011011  I1=10110001  I2=01011000  I3=11101101
VALID_BITS=1111  FULL=1  EMPTY=0

===== PHASE-2 : DAG (Compact) =====
RAW: 0101  0010  0000  0000
WAR: 0101  0010  0001  0000
WAW: 0000  0001  0000  0000
DEP: 0101  0011  0001  0000
READY=0001

===== PHASE-3+4+5 : SCHEDULER =====
[ISSUE ] T=85000  instr=00011011  FU=0  futype=0
[RETIRE] T=105000  retire=0001
[ISSUE ] T=115000  instr=10110001  FU=0  futype=1
[ISSUE ] T=125000  instr=11101101  FU=0  futype=2
[RETIRE] T=145000  retire=0010
[ISSUE ] T=155000  instr=01011000  FU=0  futype=0
[RETIRE] T=165000  retire=1000
[RETIRE] T=175000  retire=0100

```

### 5.2.4 Waveform Analysis



Figure 3: Complete execution showing enqueue, dependency matrices, issue events, and retirement signals

```

-----
===== PHASE-1 : INSTRUCTION QUEUE =====
=====

I0=00011011  I1=10110001  I2=01011000  I3=11101101
VALID_BITS=1111  FULL=1  EMPTY=0

===== PHASE-2 : DAG (Compact) =====
RAW: 0101  0010  0000  0000
WAR: 0101  0010  0001  0000
WAW: 0000  0001  0000  0000
DEP: 0101  0011  0001  0000
READY=0001

=====
===== PHASE-3+4+5 : SCHEDULER =====
=====

[ISSUE ] T=85000  instr=00011011  FU=0  futype=0
[RETIRE] T=105000  retire=0001
[ISSUE ] T=115000  instr=10110001  FU=0  futype=1
[ISSUE ] T=125000  instr=11101101  FU=0  futype=2
[RETIRE] T=145000  retire=0010
[ISSUE ] T=155000  instr=01011000  FU=0  futype=0
[RETIRE] T=165000  retire=1000
[RETIRE] T=175000  retire=0100

===== END OF SIMULATION =====

```

Figure 4: Detailed phase-by-phase execution trace

#### Key Observations:

1. I0 issues first (T=85000) as it has no dependencies and highest priority
2. I1 issues at T=115000 (after I0 retires at T=105000), respecting RAW dependency on R3
3. I3 issues at T=125000 (after I1 issues), even though it was ready earlier - priority scheduling enforces in-order issue
4. I2 issues at T=155000 only after I1 completes (T=145000), respecting RAW dependency on R1
5. All instructions retire in completion order based on FU latency

6. Note: Although I3 has no dependency on I1, it must wait due to priority-based in-order issue policy

### 5.3 Test Set 2: Chain Dependencies

#### 5.3.1 Instruction Details

| Instruction | Binary   | Opcode   | Src1, Src2 | Dest |
|-------------|----------|----------|------------|------|
| I0          | 00000001 | ALU (00) | R0, R0     | R1   |
| I1          | 00011001 | ALU (00) | R1, R2     | R1   |
| I2          | 10010010 | MUL (10) | R1, R0     | R2   |
| I3          | 11011111 | DIV (11) | R1, R3     | R3   |

Table 4: Instructions

#### 5.3.2 Dependency Analysis

**RAW Dependencies:**

- $I_0 \rightarrow I_1$  (chain on R1)
- $I_1 \rightarrow I_2$  (chain on R1)
- $I_1 \rightarrow I_3$  (chain on R1)

**WAW Dependency:**  $I_0 \rightarrow I_1$  (both write R1)

**Expected Issue Order:**  $I_0 \rightarrow I_1 \rightarrow I_2 \rightarrow I_3$



Figure 5: RAW Dependency DAG (Chain Pattern)

**DAG Interpretation:**

- I0 is the root (no dependencies)
- I1 depends on I0 (RAW on R1, also WAW on R1)
- I2 depends on I1 (consumes R1)
- I3 depends on I1 (consumes R1)
- I2 and I3 can execute in parallel after I1 completes

### 5.3.3 Simulation Output

===== PHASE-2 : DAG (Compact) =====

RAW: 0100 0011 0000 0000

WAR: 0000 0010 0000 0000

WAW: 0100 0000 0000 0000

DEP: 0100 0011 0000 0000

READY=0001

```
[ISSUE ] T=85000    instr=00000001  FU=0  futype=0
[RETIRE] T=105000   retire=0001
[ISSUE ] T=115000   instr=00011001  FU=0  futype=0
[RETIRE] T=135000   retire=0010
[ISSUE ] T=145000   instr=10010010  FU=0  futype=1
[ISSUE ] T=155000   instr=11011111  FU=0  futype=2
[RETIRE] T=175000   retire=0100
[RETIRE] T=195000   retire=1000
```



Figure 6: Chain dependency execution

```

=====
===== PHASE-1 : INSTRUCTION QUEUE =====
=====

I0=00000001  I1=00011001  I2=10010010  I3=11011111
VALID_BITS=1111  FULL=1  EMPTY=0


===== PHASE-2 : DAG (Compact) =====
RAW: 0100 0011 0000 0000
WAR: 0000 0010 0000 0000
WAW: 0100 0000 0000 0000
DEP: 0100 0011 0000 0000
READY=0001


=====

===== PHASE-3+4+5 : SCHEDULER =====
=====

[ISSUE ] T=85000  instr=00000001  FU=0  futype=0
[RETIRE] T=105000  retire=0001
[ISSUE ] T=115000  instr=00011001  FU=0  futype=0
[RETIRE] T=135000  retire=0010
[ISSUE ] T=145000  instr=10010010  FU=0  futype=1
[ISSUE ] T=155000  instr=11011111  FU=0  futype=2
[RETIRE] T=175000  retire=0100
[RETIRE] T=195000  retire=1000

===== END OF SIMULATION =====

```

Figure 7: Console Output

### Key Observations:

1. Sequential execution of I0→I1 due to RAW chain
2. I2 issues at T=145000 after I1 completes (T=135000)
3. I3 issues at T=155000 after I2 issues, following priority-based in-order issue
4. **Note:** Even though both I2 and I3 become ready after I1 completes, only I2 issues first due to lower index priority
5. Demonstrates strict priority scheduling even when multiple instructions are ready

## 5.4 Test Set 3: Complex RAW Pattern

### 5.4.1 Instruction Details

| Instruction | Binary   | Opcode   | Src1, Src2 | Dest |
|-------------|----------|----------|------------|------|
| I0          | 10000110 | MUL (10) | R0, R1     | R2   |
| I1          | 00100011 | ALU (00) | R2, R0     | R3   |
| I2          | 11111001 | DIV (11) | R3, R2     | R1   |
| I3          | 00011100 | ALU (00) | R1, R3     | R0   |

Table 5: Instructions

### 5.4.2 Dependency Analysis

**RAW Dependencies:**

- $I_0 \rightarrow I_1$  ( $I_0$  produces  $R_2$ ,  $I_1$  consumes  $R_2$ )
- $I_0 \rightarrow I_2$  ( $I_0$  produces  $R_2$ ,  $I_2$  consumes  $R_2$ )
- $I_1 \rightarrow I_2$  ( $I_1$  produces  $R_3$ ,  $I_2$  consumes  $R_3$ )
- $I_1 \rightarrow I_3$  ( $I_1$  produces  $R_3$ ,  $I_3$  consumes  $R_3$ )
- $I_2 \rightarrow I_3$  ( $I_2$  produces  $R_1$ ,  $I_3$  consumes  $R_1$ )

**Expected Issue Order:**  $I_0 \rightarrow I_1 \rightarrow I_2 \rightarrow I_3$  (strict chain)



Figure 8: RAW Dependency DAG (Complete Chain with Multiple Dependencies)

**DAG Interpretation:**

- $I_0$  is the root (no dependencies)
- $I_1$  depends on  $I_0$  (consumes  $R_2$ )
- $I_2$  depends on both  $I_0$  (consumes  $R_2$ ) and  $I_1$  (consumes  $R_3$ ) - shown with dashed line for secondary dependency

- I3 depends on both I1 (consumes R3) and I2 (consumes R1) - shown with dashed line for secondary dependency
- This forms a strict sequential chain with no parallelism opportunity

### 5.4.3 Simulation Output

===== PHASE-2 : DAG (Compact) =====

RAW: 0110 0011 0001 0000

WAR: 0011 0001 0000 0000

WAW: 0000 0000 0000 0000

DEP: 0111 0011 0001 0000

READY=0001

```
[ISSUE ] T=85000    instr=10000110  FU=0  futype=1
[RETIRE] T=115000   retire=0001
[ISSUE ] T=125000   instr=00100011  FU=0  futype=0
[RETIRE] T=145000   retire=0010
[ISSUE ] T=155000   instr=11111001  FU=0  futype=2
[RETIRE] T=195000   retire=0100
[ISSUE ] T=205000   instr=00011100  FU=0  futype=0
[RETIRE] T=225000   retire=1000
```



Figure 9: Complex dependency chain with multiple RAW hazards

```

=====
===== PHASE-1 : INSTRUCTION QUEUE =====
=====

I0=10000110 I1=00100011 I2=11111001 I3=00011100
VALID_BITS=1111 FULL=1 EMPTY=0


===== PHASE-2 : DAG (Compact) =====
RAW: 0110 0011 0001 0000
WAR: 0011 0001 0000 0000
WAW: 0000 0000 0000 0000
DEP: 0111 0011 0001 0000
READY=0001


=====

===== PHASE-3+4+5 : SCHEDULER =====
=====

[ISSUE ] T=85000 instr=10000110 FU=0 futype=1
[RETIRE] T=115000 retire=0001
[ISSUE ] T=125000 instr=00100011 FU=0 futype=0
[RETIRE] T=145000 retire=0010
[ISSUE ] T=155000 instr=11111001 FU=0 futype=2
[RETIRE] T=195000 retire=0100
[ISSUE ] T=205000 instr=00011100 FU=0 futype=0
[RETIRE] T=225000 retire=1000

===== END OF SIMULATION =====

```

---

Figure 10: Console Output

### Key Observations:

1. Complete dependency chain forces sequential execution
2. Each instruction waits for its predecessor to complete
3. Dynamic RAW matrix correctly tracks and clears dependencies
4. No deadlock despite complex dependency pattern

## 5.5 Performance Analysis

| Test Set | Total Cycles | Instructions | IPC  | Parallelism |
|----------|--------------|--------------|------|-------------|
| Set 1    | 19           | 4            | 0.21 | Medium      |
| Set 2    | 21           | 4            | 0.19 | Low         |
| Set 3    | 24           | 4            | 0.17 | Low         |

Table 6: Performance Comparison Across Test Sets

## 6 Verification Results

### 6.1 CTL Property Verification Summary

| Property              | Status | Evidence                                |
|-----------------------|--------|-----------------------------------------|
| Valid Issue Order     | ✓ PASS | All issued instructions had ready=1     |
| No Resource Conflicts | ✓ PASS | No FU double-booking observed           |
| Dependency Respect    | ✓ PASS | RAW dependencies honored in all tests   |
| Progress Guarantee    | ✓ PASS | No livelock; all instructions completed |
| Deadlock Freedom      | ✓ PASS | All test sets completed successfully    |

Table 7: CTL Property Verification Results

### 6.2 Dependency Matrix Correctness

#### 6.2.1 Matrix Interpretation

The flattened dependency matrices use row-major encoding:

$$\text{Matrix}[i][j] = \text{flat}[(i \times 4) + j] \quad (7)$$

Example from Test Set 1:

```
RAW: 0101 0010 0000 0000
[0,0]=0 [0,1]=1 [0,2]=0 [0,3]=1
[1,0]=0 [1,1]=0 [1,2]=1 [1,3]=0
[2,0]=0 [2,1]=0 [2,2]=0 [2,3]=0
[3,0]=0 [3,1]=0 [3,2]=0 [3,3]=0
```

Interpretation:

- RAW[0][1]=1: I0 → I1 (R3 dependency)
- RAW[0][3]=1: I0 → I3 (R3 dependency)
- RAW[1][2]=1: I1 → I2 (R1 dependency)

### 6.3 Topological Order Validation

For each test set, the issue order respects topological sorting:

---

**Algorithm 3** Topological Order Verification

---

```
1: for each issued instruction  $j$  do
2:   for each predecessor  $i$  where  $RAW[i][j] = 1$  do
3:     assert  $retire\_time[i] < issue\_time[j]$ 
4:   end for
5: end for
```

---

All test sets satisfy this property, confirming correct topological ordering.

## 7 Design Challenges and Solutions

### 7.1 Challenge 1: Vivado Synthesis Compatibility

**Problem:** Initial design used loop variables declared inside always blocks, causing Vivado synthesis errors.

**Solution:**

```
1 // Declare all temporaries at module scope
2 integer i;
3 integer idx_issue;
4 integer pick;
5
6 always @(posedge clk) begin
7   // Use global temporaries
8   for (i = 0; i < 4; i = i + 1) begin
9     // ...
10    end
11 end
```

### 7.2 Challenge 2: Retirement and Enqueue Ordering

**Problem:** Simultaneous retirement and enqueue could cause slot conflicts.

**Solution:** Process retirement *before* enqueue in the sequential block:

```
1 always @(posedge clk) begin
2   if (!reset) begin
3     // 1) Process retirements first
4     for (i = 0; i < 4; i = i + 1)
5       if (retire_onehot[i])
6         valid[i] <= 1'b0;
7
8     // 2) Then process enqueue
9     if (enqueue) begin
10       // Find free slot
11     end
12   end
13 end
```

### 7.3 Challenge 3: Dynamic RAW Matrix Updates

**Problem:** Need to update RAW matrix when instructions complete, but completion detection happens in the same cycle as counter updates.

**Solution:** Use next-state counters for completion detection:

```
1 // Compute next state
2 nxt_alu_busy[i] = (alu_busy[i] > 0) ? alu_busy[i]-1 : 0;
3
4 // Detect completion (current != 0, next == 0)
5 if ((nxt_alu_busy[i] == 0) && (alu_busy[i] != 0)) begin
6     retire_onehot[fu_idx_alu[i]] <= 1'b1;
7     // Clear RAW row
8 end
9
10 // Commit next to current at end of cycle
11 alu_busy[i] <= nxt_alu_busy[i];
```

### 7.4 Challenge 4: Scheduler Enable Signal

**Problem:** Scheduler should not run during enqueue phase.

**Solution:** State-based enable logic:

```
1 always @(posedge clk) begin
2     if (reset)
3         sch_enable <= 0;
4     // Enable when enqueue done and queue non-empty
5     else if (enqueue == 0 && valid_bits != 0)
6         sch_enable <= 1;
7     // Disable when queue empties
8     else if (valid_bits == 0)
9         sch_enable <= 0;
10 end
```

## 8 Scalability and Future Enhancements

### 8.1 Current Limitations

1. **Instruction Window Size:** Limited to 4 instructions
2. **Instruction Width:** 8-bit encoding limits complexity
3. **Register File Size:** Only 4 architectural registers
4. **Functional Units:** One of each type (ALU, MUL, DIV)

### 8.2 Scaling to Larger Instruction Windows

#### 8.2.1 Architectural Changes Required

To scale from 4 to  $N$  instructions:

1. **Dependency Matrix:** Scale from 16 bits to  $N^2$  bits

```

1 parameter N = 8; // Window size
2 reg [N*N-1:0] raw_flat;
3 reg [N*N-1:0] war_flat;
4 reg [N*N-1:0] waw_flat;

```

## 2. Storage Arrays: Use parameterized array sizes

```

1 parameter N = 8;
2 reg [1:0] opcode [0:N-1];
3 reg [1:0] src1 [0:N-1];
4 reg [1:0] src2 [0:N-1];
5 reg [1:0] dest [0:N-1];
6 reg valid [0:N-1];

```

## 3. Retire Vector: Scale from 4 bits to $N$ bits

```

1 input [N-1:0] retire_onehot;

```

## 4. Priority Selection: Use parameterized priority encoder

```

1 always @(*) begin
2     sel = 0;
3     for (i = 0; i < N; i = i + 1) begin
4         if (ready[i] && sel == 0)
5             sel[i] = 1'b1;
6     end
7 end

```

### 8.2.2 Complexity Analysis

| Resource                  | Current (N=4)           | Scaled (N=16)             |
|---------------------------|-------------------------|---------------------------|
| Dependency Matrix Storage | $3 \times 16 = 48$ bits | $3 \times 256 = 768$ bits |
| RAW Computation Loops     | $O(N^2) = 16$           | $O(N^2) = 256$            |
| Priority Encoder          | $O(N) = 4$              | $O(N) = 16$               |
| Instruction Storage       | $4 \times 8 = 32$ bits  | $16 \times 8 = 128$ bits  |

Table 8: Scaling Complexity for Larger Windows

## 8.3 Extending to 32-bit and 64-bit Instructions

### 8.3.1 32-bit Instruction Format

Typical RISC-V or ARM-like encoding:

| [31:25] | [24:20] | [19:15] | [14:12] | [11:7] | [6:0]  |
|---------|---------|---------|---------|--------|--------|
| funct7  | rs2     | rs1     | funct3  | rd     | opcode |
| 7 bits  | 5 bits  | 5 bits  | 3 bits  | 5 bits | 7 bits |

Table 9: 32-bit RISC-V R-Type Instruction Format

Required Changes:

```

1 // Wider instruction storage
2 input [31:0] instr_in;
3 output [N*32-1:0] instr_flat_out;
4
5 // Wider register addressing
6 reg [4:0] src1 [0:N-1]; // 5-bit for 32 registers
7 reg [4:0] src2 [0:N-1];
8 reg [4:0] dest [0:N-1];
9
10 // Dependency comparison logic unchanged
11 if (dest[i] == src1[j]) // Still works with wider fields
12     raw_f[(i*N)+j] = 1'b1;

```

### 8.3.2 64-bit Instruction Support

For 64-bit architectures (RISC-V RV64, ARM64):

1. **Instruction Width:** Change to 64-bit paths
2. **Register File:** Support up to 64 architectural registers
3. **Immediate Handling:** Extract and sign-extend immediates
4. **Memory Operations:** Add memory dependency tracking

```

1 // 64-bit instruction format
2 parameter INSTR_WIDTH = 64;
3 input [INSTR_WIDTH-1:0] instr_in;
4
5 // Support 64 registers
6 reg [5:0] src1 [0:N-1]; // 6-bit addressing
7 reg [5:0] src2 [0:N-1];
8 reg [5:0] dest [0:N-1];

```

## 8.4 Multiple Functional Units per Type

**Current Implementation:** 1 ALU, 1 MUL, 1 DIV

**Enhanced Implementation:**

```

1 parameter ALU_COUNT = 2;
2 parameter MUL_COUNT = 2;
3 parameter DIV_COUNT = 1;
4
5 // Resource availability computation
6 alu_free = 0;
7 for (i = 0; i < ALU_COUNT; i = i + 1)
8     if (alu_busy[i] == 0)
9         alu_free = alu_free + 1;
10
11 // Issue to first available FU
12 for (i = 0; i < ALU_COUNT; i = i + 1) begin
13     if (pick == -1 && nxt_alu_busy[i] == 0) begin
14         pick = i;
15         nxt_alu_busy[pick] = ALU_LAT;

```

```

16     end
17 end

```

## 8.5 Advanced Scheduling Algorithms

### 8.5.1 Multiple Issue Support

To enable true parallel issue (multiple instructions per cycle), the priority encoder needs modification:

```

1 parameter ISSUE_WIDTH = 2; // Issue up to 2 instructions/cycle
2
3 always @(*) begin
4     issue_count = 0;
5     sel = 0;
6
7     for (i = 0; i < 4; i = i + 1) begin
8         if (ready[i] && issue_count < ISSUE_WIDTH) begin
9             sel[i] = 1'b1;
10            issue_count = issue_count + 1;
11        end
12    end
13 end

```

Listing 9: Multi-Issue Priority Selection

This would allow, for example, I1 and I3 in Test Set 1 to issue in the same cycle after I0 completes, significantly improving IPC.

### 8.5.2 Critical Path Prioritization

Instead of fixed priority by position, compute critical path length:

---

#### Algorithm 4 Critical Path Prioritization

---

- 1: **for** each instruction  $i$  **do**
- 2:      $cp\_length[i] \leftarrow \text{compute\_critical\_path}(i, \text{DAG})$
- 3: **end for**
- 4:  $\text{priority} \leftarrow \text{sort\_by}(cp\_length, \text{descending})$
- 5: Issue instruction with longest critical path first

---

### 8.5.3 Load Balancing Across FUs

Track utilization of each FU and prefer less-utilized units:

```

1 // Compute FU utilization
2 for (i = 0; i < ALU_COUNT; i++)
3     alu_util[i] = alu_busy[i];
4
5 // Prefer FU with lowest utilization
6 min_util = 999;
7 for (i = 0; i < ALU_COUNT; i++) begin
8     if (alu_util[i] < min_util && alu_busy[i] == 0) begin
9         min_util = alu_util[i];
10        pick = i;

```

```

11     end
12 end

```

## 8.6 Memory Dependency Tracking

For complete ISA support, add memory (load/store) dependency analysis:

```

1 // Add memory address tracking
2 reg [ADDR_WIDTH-1:0] mem_addr [0:N-1];
3 reg                  is_load    [0:N-1];
4 reg                  is_store   [0:N-1];
5
6 // Memory dependency detection
7 for (i = 0; i < N; i++) begin
8     for (j = i+1; j < N; j++) begin
9         // RAW: Load after Store to same address
10        if (is_store[i] && is_load[j] &&
11            mem_addr[i] == mem_addr[j])
12            raw_mem[(i*N)+j] = 1'b1;
13
14        // WAW: Store after Store
15        if (is_store[i] && is_store[j] &&
16            mem_addr[i] == mem_addr[j])
17            waw_mem[(i*N)+j] = 1'b1;
18
19        // WAR: Store after Load
20        if (is_load[i] && is_store[j] &&
21            mem_addr[i] == mem_addr[j])
22            war_mem[(i*N)+j] = 1'b1;
23    end
24 end

```

## 8.7 Speculative Execution Support

Add branch prediction and speculative issue:

1. **Branch Prediction:** Add BTB and branch predictor
2. **Rollback Mechanism:** Tag speculative instructions
3. **Commit Stage:** Verify predictions before commit

```

1 // Add speculation bit
2 reg speculative [0:N-1];
3 reg [BRANCH_TAG_WIDTH-1:0] spec_tag [0:N-1];
4
5 // On misprediction
6 if (branch_mispredict) begin
7     for (i = 0; i < N; i++) begin
8         if (speculative[i] && spec_tag[i] == mispredict_tag)
9             valid[i] <= 1'b0; // Flush
10    end
11 end

```

## 9 Conclusion

This project successfully demonstrates a complete instruction scheduling system with the following key achievements:

1. **Functional Correctness:** All three test sets executed correctly with proper dependency handling
2. **Dependency Analysis:** Accurate detection of RAW, WAR, and WAW hazards using adjacency matrix representation
3. **Resource Binding:** Efficient allocation of instructions to heterogeneous functional units
4. **Temporal Verification:** CTL properties verified through simulation, confirming valid issue order, resource conflict freedom, and deadlock freedom
5. **Topological Correctness:** Issue order respects dependency DAG topology in all test cases
6. **Deadlock Freedom:** Mathematical proof and simulation evidence confirm the system cannot deadlock

### 9.1 Key Contributions

- **Unified Architecture:** Integration of reservation station, DAG builder, and scheduler into cohesive system
- **Dynamic RAW Tracking:** Novel approach to updating dependency matrix on-the-fly as instructions complete
- **Completion-Based Retirement:** Accurate modeling of variable-latency execution with proper handshaking
- **Scalable Design:** Parameterized implementation ready for extension to larger windows and instruction widths

### 9.2 Lessons Learned

1. **Synthesis Tool Constraints:** Vivado requires careful variable scoping in always blocks
2. **Timing Considerations:** Proper sequencing of retirement before enqueue prevents slot conflicts
3. **State Management:** Next-state computation enables clean completion detection
4. **Verification Importance:** Simulation with diverse test cases essential for catching corner cases

### 9.3 Future Work

The design provides a solid foundation for:

- Scaling to realistic instruction windows (16-32 entries)
- Supporting full 32-bit or 64-bit ISAs
- Adding memory dependency tracking
- Implementing speculative execution
- Integrating with register renaming and ROB
- Hardware synthesis and FPGA implementation

This project demonstrates that formal verification techniques (CTL properties, topological sorting, deadlock analysis) can be effectively integrated with practical hardware design to create robust, scalable out-of-order execution systems.

## A Complete Module Listings

### A.1 Top-Level Integration Module

```
1 `timescale ins / 1ps
2 module top_integrated (
3     input clk, input reset,
4     input enqueue, input [7:0] instr_in,
5     output [7:0] issued_instr, output valid_out,
6     output [3:0] issued_flags, output full, output empty,
7     output [31:0] dbg_instr_flat, output [3:0] dbg_valid_bits,
8     output [15:0] dbg_raw_flat, output [15:0] dbg_war_flat,
9     output [15:0] dbg_waw_flat, output [15:0] dbg_dep_flat,
10    output [3:0] dbg_scheduler_ready,
11    output [1:0] dbg_issued_fu_type,
12    output [1:0] dbg_issued_fu_index
13 );
14
15     wire [31:0] instr_flat;
16     wire [3:0] valid_bits, retire_onehot;
17     wire [15:0] raw_flat, war_flat, waw_flat, dep_flat;
18     reg sch_enable;
19
20     always @(posedge clk) begin
21         if (reset)
22             sch_enable <= 0;
23         else if (enqueue == 0 && valid_bits != 0)
24             sch_enable <= 1;
25         else if (valid_bits == 0)
26             sch_enable <= 0;
27     end
28
29     instruction_queue_rs iq (
30         .clk(clk), .reset(reset), .enqueue(enqueue),
31         .instr_in(instr_in), .retire_onehot(retire_onehot),
32         .instr_flat_out(instr_flat), .valid_bits(valid_bits),
33         .full(full), .empty(empty)
34     );
35
36     dep_graph_with_valid dag (
37         .instr_flat(instr_flat), .valid_bits(valid_bits),
38         .raw_flat(raw_flat), .war_flat(war_flat),
39         .waw_flat(waw_flat), .dep_flat(dep_flat)
40     );
41
42     scheduler_resbind #( .ALU_COUNT(1), .MUL_COUNT(1),
43         .DIV_COUNT(1)) sch (
44         .clk(clk), .reset(reset), .sch_enable(sch_enable),
45         .raw_flat(raw_flat), .valid_bits(valid_bits),
46         .instr_flat(instr_flat), .valid_out(valid_out),
47         .issued_instr(issued_instr),
48         .issued_fu_type(dbg_issued_fu_type),
49         .issued_fu_index(dbg_issued_fu_index),
50         .retire_onehot(retire_onehot)
51     );
52
53     assign dbg_instr_flat = instr_flat;
54     assign dbg_valid_bits = valid_bits;
55     assign dbg_raw_flat = raw_flat;
```

```

55      assign dbg_war_flat = war_flat;
56      assign dbg_waw_flat = waw_flat;
57      assign dbg_dep_flat = dep_flat;
58 endmodule

```

Listing 10: top\_integrated.v - Complete listing

## B Simulation Test Cases

### B.1 Additional Test Vectors

For comprehensive verification, additional test cases should include:

1. **No Dependencies:** All independent instructions

I0: ALU R0,R1->R2      I1: MUL R0,R1->R3  
I2: DIV R0,R1->R0      I3: ALU R2,R3->R1  
Expected: Maximum parallelism

2. **Full Chain:** Complete linear dependency

I0->I1->I2->I3 (each depends on previous)  
Expected: Pure sequential execution

3. **Fork Pattern:** One producer, multiple consumers

I0 produces R1  
I1, I2, I3 all consume R1  
Expected: I0 first, then I1/I2/I3 in parallel

4. **Join Pattern:** Multiple producers, one consumer

I0 produces R1, I1 produces R2, I2 produces R3  
I3 consumes R1, R2, R3  
Expected: I0/I1/I2 in parallel, then I3

## C References

1. Hennessy, J. L., & Patterson, D. A. (2017). *Computer Architecture: A Quantitative Approach* (6th ed.). Morgan Kaufmann.
2. Tomasulo, R. M. (1967). "An efficient algorithm for exploiting multiple arithmetic units". *IBM Journal of Research and Development*, 11(1), 25-33.
3. Clarke, E. M., Grumberg, O., & Peled, D. (1999). *Model Checking*. MIT Press.

4. Sohi, G. S. (1990). "Instruction issue logic for high-performance, interruptible, multiple functional unit, pipelined computers". *IEEE Transactions on Computers*, 39(3), 349-359.
5. Smith, J. E., & Sohi, G. S. (1995). "The microarchitecture of superscalar processors". *Proceedings of the IEEE*, 83(12), 1609-1624.
6. Palsberg, J., & Schwartzbach, M. I. (1994). "Register allocation by list scheduling". *ACM Transactions on Programming Languages and Systems*, 16(6), 1850-1879.