

A Project Report  
on

# “Design and Implementation of a 5-Stage Pipelined RV32IM RISC-V Processor with Custom AI Instruction Acceleration”

Submitted for partial fulfillment of award of

**POST GRADUATE DIPLOMA**

In

**DVLSI**

By

Name : Nilesh Nadekar (PRN: 250820533005)  
Name: Rajmani Singh (PRN: 250820533004)



DEPARTMENT OF ELECTRONICS  
CENTRE FOR DEVELOPMENT OF ADVANCED  
COMPUTING, NOIDA, UTTAR PRADESH, INDIA

SESSION - AUG 2025 - FEB 2026



## **CERTIFICATE**

We hereby declare that the work, which is being presented in the project report, entitled Design and Implementation of a **5-Stage Pipelined RV32IM RISC-V Processor with Custom AI Instruction Acceleration** in the partial fulfillment of the requirement for the award of Post Graduate Diploma in VLSI Design, submitted to C-DAC Noida as an authentic record of our own work carried out under the supervision of Dr. Ravi Payal, CENTRE FOR DEVELOPMENT OF ADVANCED COMPUTING, NOIDA.

**The matter embodied in this dissertation has not been submitted by us for the award of any other degree.**

**Date: 03 / 02 / 2026**

Nilesh Nadekar  
(PRN: 250820533005)  
**Name of Student**

Rajmani Singh  
(PRN: 250820533004)  
**Name of Student**

**This is to certify that the above statement made by the candidates is correct to the best of my knowledge.**

**Date: 03 / 02 / 2026**  
**Place: Noida**

**Dr. Ravi Payal  
Sc. 'E', CDAC NOIDA  
PROJECT GUIDE.**

## **ACKNOWLEDGEMENT**

**I would like to thank Our Project Guide Dr. Ravi Payal who helped us in successful completion of our project and also we appreciate the guidance given to us and it is a pleasure to express our feeling of gratitude towards all those who have contributed in the completion of this training.**

**We are highly thankful to our friends who have been kind enough to discuss with us various aspects involved with the topic and thus have helped in completing this project.**

**CDAC, Noida  
FEB 2026**

**Nilesh Nandekar  
Rajmani Singh**

# ABSTRACT

The rapid growth of artificial intelligence and machine learning applications has created a strong demand for processor architectures that are both high-performance and flexible while remaining energy efficient. RISC-V, being an open and extensible instruction set architecture, provides an ideal platform for incorporating application-specific enhancements without compromising standard compliance.

This project presents the **design and implementation of a 5-stage pipelined 32-bit RV32IM RISC-V processor with custom AI instruction acceleration**. The processor follows the classical **Instruction Fetch (IF)**, **Instruction Decode (ID)**, **Execute (EX)**, **Memory Access (MEM)**, and **Write-Back (WB)** pipeline architecture and fully supports the **RV32I base ISA along with the RV32M extension** for multiplication and division operations. To enhance performance for machine-learning-oriented workloads, **custom AI instructions (VDOT4 and VMAX4)** are introduced, enabling efficient vector dot-product and maximum operations on packed data elements.

The processor is implemented in **synthesizable Verilog HDL**, incorporating essential micro-architectural features such as **hazard detection**, **data forwarding**, **pipeline stalling**, and a dedicated **M-unit with handshake control** for multi-cycle multiply and divide operations. A separate **AI execution unit** is integrated into the EX stage, allowing seamless coexistence of standard arithmetic, RV32M operations, and AI-specific instructions within the same pipeline.

Functional verification is carried out using a **self-checking testbench in QuestaSim**, where directed test programs validate arithmetic instructions, pipeline behavior, RV32M operations, and custom AI acceleration. **Waveform analysis and simulation transcripts** confirm correct instruction execution, pipeline timing, and memory updates. The design is further prepared for **synthesis and timing analysis using Synopsys Design Vision**, ensuring hardware feasibility for FPGA or ASIC implementation.

The completed processor demonstrates a **scalable and extensible RISC-V microarchitecture**, capable of supporting standard computation as well as AI-oriented acceleration, making it suitable for **edge AI and embedded processing applications**.

# CONTENTS

| Chapter  | TITLE                                            | Page. No. |
|----------|--------------------------------------------------|-----------|
|          | <b>CERTIFICATE</b>                               |           |
|          | <b>ACKNOWLEDGEMENT</b>                           |           |
|          | <b>ABSTRACT</b>                                  |           |
|          | <b>CONTENTS</b>                                  |           |
|          | <b>LIST OF FIGURES</b>                           |           |
|          | <b>LIST OF TABLES</b>                            |           |
|          | <b>LIST OF ABBREVIATIONS</b>                     |           |
| <b>1</b> | <b>INTRODUCTION</b>                              | 12        |
| 1.1      | Overview of RISC-V Architecture                  | 13        |
| 1.2      | Motivation for AI-Oriented Processor Design      | 15        |
| 1.3      | Need for Custom ISA Extensions                   | 16        |
| 1.4      | Applications of AI-Accelerated RISC-V Processors | 17        |
| <b>2</b> | <b>LITERATURE SURVEY</b>                         | 18        |
| 2.1      | Overview of Existing RISC-V Processor Designs    | 18        |
| 2.2      | Pipelined Processor Architectures                | 19        |
| 2.3      | AI Acceleration in Embedded Processors           | 20        |
| 2.4      | Limitations of Existing Approaches               | 22        |

|          |                                              |    |
|----------|----------------------------------------------|----|
| <b>3</b> | <b>SYSTEM ARCHITECTURE AND DESIGN</b>        | 24 |
| 3.1      | Overall Processor Architecture               | 24 |
| 3.2      | Instruction Set Overview                     | 25 |
| 3.3      | Pipeline Architecture                        | 28 |
| 3.4      | Hazard Detection and Data Forwarding         | 31 |
| 3.5      | Multiply-Divide Unit (M-Unit)                | 33 |
| 3.6      | Custom AI Instruction Unit (VDOT4 and VMAX4) | 37 |
| <b>4</b> | <b>IMPLEMENTATION DETAILS</b>                | 42 |
| 4.1      | Verilog HDL Implementation                   | 42 |
| 4.2      | Pipeline Control and Stall Mechanism         | 46 |
| 4.3      | Integration of AI Acceleration Instructions  | 50 |
| 4.4      | Memory Interface Design                      | 54 |
| <b>5</b> | <b>VERIFICATION AND SIMULATION</b>           | 58 |
| 5.1      | Verification Methodology                     | 58 |
| 5.2      | Testbench Architecture                       | 62 |
| 5.3      | Directed Test Programs                       | 66 |

|          |                                             |    |
|----------|---------------------------------------------|----|
| 5.4      | Simulation Results and Waveform Analysis    | 70 |
| <b>6</b> | <b>SYNTHESIS AND ANALYSIS</b>               | 74 |
| 6.1      | Synthesis Flow using Synopsys Design Vision | 74 |
| 6.2      | Resource Utilization                        | 77 |
| 6.3      | Timing Analysis                             | 80 |
| <b>7</b> | <b>RESULTS AND DISCUSSION</b>               | 83 |
| 7.1      | Functional Results                          | 83 |
| 7.2      | Performance Evaluation                      | 86 |
| 7.3      | AI Instruction Performance Analysis         | 88 |
| <b>8</b> | <b>CONCLUSION AND FUTURE SCOPE</b>          | 90 |
| 8.1      | Conclusion                                  | 90 |
| 8.2      | Future Enhancements                         | 92 |
|          | <b>References</b>                           | 93 |
|          | <b>Appendix</b>                             | 94 |
| A        | SystemVerilog HDL Source Code               |    |
| B        | Testbench and Simulation Environment        |    |

|   |                                       |  |
|---|---------------------------------------|--|
| C | Instruction Memory Program (imem.hex) |  |
| D | Simulation Waveforms                  |  |
| E | Synthesis Reports                     |  |
| F | Toolchain and Software Environment    |  |

## LIST OF FIGURES

| Figure No. | Title                                | Page No. |
|------------|--------------------------------------|----------|
| Fig. 1     | General RISC-V Processor Overview    | 13       |
| Fig. 2     | Existing RISC-V Core Architecture    | 19       |
| Fig. 3     | Vector Processor                     | 22       |
| Fig. 4     | RISC-V Instruction Set Architecture  | 27       |
| Fig. 5     | Top Level Schematic in Design Vision | 80       |
| Fig. 6     | Transcript Output Data               | 98       |
| Fig. 7     | Simulation Data                      | 98       |
| Fig. 8     | Gate Level Schematic                 | 99       |

## **LIST OF TABLES**

| <b>Table No.</b> | <b>Title</b>                                                                       | <b>Page No.</b> |
|------------------|------------------------------------------------------------------------------------|-----------------|
| TABLE I          | Comparison of RISC Processor Generations (RISC-I to RISC-V)                        | 15              |
| TABLE II         | RISC-V Standard Extensions: Functionality, Specifications, and Application Domains | 28              |

## **LIST OF ABBREVIATIONS**

| <b>Abbreviation</b> | <b>Full Form</b>                             |
|---------------------|----------------------------------------------|
| RISC                | Reduced Instruction Set Computer             |
| ISA                 | Instruction Set Architecture                 |
| RV32IM              | RISC-V 32-bit Integer with Multiply/Divide   |
| IF                  | Instruction Fetch                            |
| ID                  | Instruction Decode                           |
| EX                  | Execute                                      |
| MEM                 | Memory Access                                |
| WB                  | Write Back                                   |
| AI                  | Artificial Intelligence                      |
| HDL                 | Hardware Description Language                |
| DSP                 | Digital Signal Processing                    |
| FPU                 | Floating Point Unit                          |
| SoC                 | System on Chip                               |
| CDAC                | Centre for Development of Advanced Computing |

## **CHAPTER 1 : INTRODUCTION**

The rapid advancement of computing technologies has led to an increasing demand for high-performance, energy-efficient, and customizable processors, especially for embedded systems and emerging artificial intelligence (AI) applications. Traditional proprietary processor architectures often restrict flexibility due to licensing constraints and limited extensibility. In contrast, RISC-V, an open-standard instruction set architecture (ISA), has emerged as a powerful alternative that enables designers to develop processors tailored to specific application requirements.

Modern applications such as edge AI, signal processing, and real-time data analytics require processors that not only deliver high computational throughput but also support specialized operations efficiently. Pipeline-based processor architectures play a crucial role in improving instruction throughput by overlapping multiple stages of instruction execution. A 5-stage pipelined architecture is widely adopted in academic and industrial designs due to its balance between performance, complexity, and scalability.

This project focuses on the design and implementation of a 5-stage pipelined 32-bit RISC-V processor based on the RV32IM ISA, enhanced with custom AI-oriented instructions. The processor integrates standard integer operations, multiplication and division support, and domain-specific acceleration for AI workloads. The complete design is implemented using Verilog HDL, verified through simulation, and prepared for synthesis using industry-standard tools



**Fig 1:- General RISC-V Processor Overview**

## 1.1 Overview of RISC-V Architecture

RISC-V is a **Reduced Instruction Set Computer (RISC)** architecture that was developed to provide an **open, modular, and extensible ISA** suitable for a wide range of computing platforms, from low-power embedded devices to high-performance processors. Unlike proprietary ISAs, RISC-V is free from licensing fees, making it highly attractive for academic research and commercial development.

The RISC-V architecture is designed around a **small and stable base instruction set**, known as **RV32I** for 32-bit implementations. This base ISA includes fundamental integer operations, load/store instructions, control-flow instructions, and system operations. Additional functionality is provided through **standard extensions**, such as the **M**

**extension** for multiplication and division, allowing designers to include only the features required for their target applications.

One of the key strengths of RISC-V is its **extensibility**, which allows the introduction of **custom instructions** without violating compatibility with the base ISA. This feature is particularly beneficial for application-specific acceleration, such as AI and machine-learning workloads, where custom operations can significantly improve performance and energy efficiency.

The modular nature of RISC-V, combined with its clean instruction encoding and pipeline-friendly design, makes it an ideal choice for implementing **pipelined processors with custom execution units**. These characteristics form the foundation for the processor architecture developed in this project.

| Aspect              | RISC-I    | RISC-II   | RISC-III  | RISC-IV    | RISC-V                    |
|---------------------|-----------|-----------|-----------|------------|---------------------------|
| <b>Year</b>         | 1981–1984 | 1983–1986 | 1986–1989 | 1989–1992  | 2010+                     |
| <b>Transistors</b>  | 44.4K     | 40.8K     | N/A       | N/A        | N/A<br>(modular design)   |
| <b>Instructions</b> | 31        | 39        | 45+       | 60+        | 40+ (with extensions)     |
| <b>Pipeline</b>     | 5-stage   | 6-stage   | 7-stage   | 8–10 stage | 2–13 stage (configurable) |
| <b>Frequency</b>    | 2 MHz     | 3 MHz     | 5 MHz     | 10 MHz     | 16 MHz – 5 GHz            |
| <b>Performance</b>  | 12.5 MIPS | 20 MIPS   | 50 MIPS   | 100+ MIPS  | GHz+ class                |
| <b>Registers</b>    | 138       | 152       | 160+      | 192+       | 32                        |

|                               |                      |                      |                        |                        | (standard)       |
|-------------------------------|----------------------|----------------------|------------------------|------------------------|------------------|
| <b>Technology</b>             | 3 $\mu\text{m}$ NMOS | 2 $\mu\text{m}$ NMOS | 1.2 $\mu\text{m}$ CMOS | 0.8 $\mu\text{m}$ CMOS | 5 nm – 28 nm     |
| <b>FPU Support</b>            | No                   | Yes                  | Yes                    | Yes                    | Optional (F/D/V) |
| <b>Multiprocessor Support</b> | No                   | No                   | No                     | Yes                    | Yes              |
| <b>Customization</b>          | No                   | No                   | No                     | No                     | Yes (Open ISA)   |

**TABLE I: - Comparison of RISC Processor Generations (RISC-I to RISC-V)**

## 1.2 Motivation for AI-Oriented Processor Design

In recent years, the rapid growth of **Artificial Intelligence (AI) and Machine Learning (ML)** applications has significantly influenced processor design methodologies. AI workloads such as vector operations, matrix computations, pattern recognition, and data-intensive processing demand **high computational efficiency, low latency, and optimized data movement**. General-purpose processors, although versatile, often fail to deliver optimal performance and energy efficiency for such specialized workloads.

Most AI algorithms involve repetitive arithmetic operations and parallel data processing. Executing these workloads on conventional processors leads to increased instruction counts, higher power consumption, and longer execution times. This limitation has motivated the development of **AI-oriented processors** that incorporate architectural optimizations and specialized execution units to accelerate frequently used operations.

Another major motivation is the growing importance of **edge AI systems**, where computations are performed locally on embedded devices rather than cloud servers. These systems require processors that are **compact, power-efficient, and**

**application-specific**, while still maintaining adequate performance. AI-oriented processor designs enable on-device intelligence, reduced communication latency, enhanced privacy, and lower dependence on external infrastructure.

By integrating AI acceleration directly into the processor pipeline, it becomes possible to **improve throughput, reduce instruction overhead, and achieve better performance-per-watt**. This project addresses these challenges by designing a pipelined RISC-V processor augmented with custom AI instructions, making it suitable for modern intelligent embedded applications.

## 1.3 Need for Custom ISA Extensions

Instruction Set Architecture (ISA) extensions play a crucial role in enhancing processor functionality beyond basic operations. While the standard RISC-V ISA provides essential computational instructions, many application domains—particularly AI and signal processing—require **specialized operations** that are not efficiently supported by generic instruction sets.

Custom ISA extensions allow designers to introduce **application-specific instructions** that can perform complex operations in fewer cycles compared to multiple standard instructions. This leads to **reduced instruction count, lower execution latency, and improved energy efficiency**. In AI workloads, operations such as vector dot products, maximum selection, and packed data processing are frequently executed and greatly benefit from custom hardware support.

RISC-V uniquely supports such customization by reserving opcode spaces specifically for user-defined extensions. This flexibility enables innovation without compromising compatibility with the base ISA. Custom instructions can coexist with standard RISC-V instructions, ensuring that software portability and toolchain support are maintained.

In this project, custom ISA extensions are introduced to accelerate AI-oriented operations by integrating dedicated execution logic within the processor pipeline. These extensions demonstrate how **custom instructions can significantly enhance performance while preserving the simplicity and modularity of the RISC-V architecture**.

## 1.4 Applications of AI-Accelerated RISC-V Processors

AI-accelerated RISC-V processors have emerged as a powerful solution for a wide range of modern computing applications due to their **open-source nature, scalability, and support for custom instruction extensions**. By integrating AI-specific acceleration within the processor pipeline, such architectures enable efficient execution of data-intensive and compute-heavy workloads.

One of the primary application areas is **embedded and edge AI systems**, where real-time data processing is required under strict power and performance constraints. Applications such as smart sensors, surveillance systems, and industrial automation benefit from AI-accelerated RISC-V processors by achieving **low-latency inference and reduced power consumption** without relying on cloud-based computation.

Another important domain is **Internet of Things (IoT)** devices, where on-device intelligence is becoming increasingly essential. AI-enabled RISC-V processors can perform tasks such as anomaly detection, predictive maintenance, and environmental monitoring locally, thereby improving reliability and reducing communication overhead.

AI-accelerated RISC-V processors are also widely applicable in **image and signal processing** applications, including facial recognition, object detection, speech processing, and biomedical signal analysis. Custom AI instructions allow these operations to be executed efficiently, leading to faster processing and improved accuracy.

In the field of **robotics and autonomous systems**, such processors enable real-time decision-making, sensor fusion, and motion control. The ability to customize the ISA ensures that domain-specific algorithms can be optimized at the hardware level.

Overall, AI-accelerated RISC-V processors provide a **flexible, efficient, and future-ready computing platform** suitable for a broad spectrum of applications ranging from low-power embedded devices to intelligent autonomous systems.

## **CHAPTER 2 : LITERATURE SURVEY**

### **2.1 Overview of Existing RISC-V Processor Designs**

RISC-V is an open-source Instruction Set Architecture (ISA) that has gained significant attention in both academic research and industrial development due to its **modular, extensible, and royalty-free nature**. Numerous RISC-V processor designs have been proposed and implemented to address a wide range of performance, power, and application requirements.

Early RISC-V processor designs primarily focused on **simple in-order pipelines**, typically implementing the RV32I or RV64I base instruction set. These designs emphasized simplicity, ease of verification, and suitability for educational and embedded applications. Common examples include single-cycle and multi-cycle cores that demonstrate the fundamental principles of RISC-V architecture.

As the adoption of RISC-V increased, more advanced processor designs emerged featuring **pipelined architectures**, usually consisting of 3-stage or 5-stage pipelines. These processors improved instruction throughput by overlapping instruction execution stages such as instruction fetch, decode, execute, memory access, and write-back. Many designs also incorporated hazard detection and forwarding mechanisms to enhance pipeline efficiency.

Several RISC-V processors have integrated **standard ISA extensions**, particularly the M-extension for multiplication and division operations, to support computationally intensive workloads. These implementations often include dedicated arithmetic units or multi-cycle execution blocks to handle complex operations efficiently.

In recent years, research has expanded toward **customizable and domain-specific RISC-V processors**, where designers introduce custom instructions or accelerators tailored for specific applications such as digital signal processing, cryptography, and machine learning. These processors leverage the open nature of RISC-V to extend functionality without violating ISA compatibility.

Overall, existing RISC-V processor designs demonstrate a wide spectrum of architectural choices, ranging from lightweight embedded cores to high-performance processors,

establishing RISC-V as a versatile and scalable computing platform.



**Fig 2:- Existing RISC-V Core Architecture**

## 2.2 Pipelined Processor Architectures

Pipelined processor architecture is a widely adopted technique used to improve instruction throughput by overlapping the execution of multiple instructions. Instead of completing one instruction before starting the next, pipelining divides instruction execution into multiple stages, allowing several instructions to be processed simultaneously at different stages of execution.

A typical pipelined processor consists of stages such as **Instruction Fetch (IF)**, **Instruction Decode (ID)**, **Execute (EX)**, **Memory Access (MEM)**, and **Write Back (WB)**. Each stage performs a specific function, and intermediate results are passed between stages through pipeline registers. This structure significantly enhances performance by increasing the number of instructions completed per clock cycle.

One of the major challenges in pipelined architectures is the presence of **pipeline**

**hazards**, which can degrade performance if not properly handled. These hazards are broadly classified into **structural hazards**, **data hazards**, and **control hazards**. Techniques such as **stalling**, **forwarding**, and **branch prediction** are commonly employed to mitigate these issues and ensure correct program execution.

Pipelined architectures are particularly well-suited for RISC-based designs like RISC-V due to their **simple and uniform instruction formats**, which simplify instruction decoding and pipeline control. Many modern RISC-V processors utilize 3-stage or 5-stage pipelines to balance performance and design complexity.

In advanced designs, pipelined processors may also integrate specialized execution units, such as multiplication/division units or custom accelerators, which may require multi-cycle execution. Proper pipeline control and handshake mechanisms are essential in such cases to maintain correct operation without sacrificing throughput.

Overall, pipelined processor architectures play a crucial role in achieving higher performance in modern processors and serve as the foundation for more advanced designs, including superscalar and out-of-order execution architectures.

## 2.3 AI Acceleration in Embedded Processors

Artificial Intelligence (AI) acceleration in embedded processors has gained significant importance due to the increasing demand for real-time data processing, low latency, and energy-efficient computation in edge devices. Traditional general-purpose processors often struggle to meet the performance and power requirements of AI workloads, prompting the integration of specialized hardware acceleration techniques within embedded processor architectures.

AI acceleration in embedded systems typically focuses on optimizing computationally intensive operations such as **vector arithmetic**, **matrix multiplication**, **dot-product calculations**, and **comparison-based operations**, which are commonly used in machine learning and neural network algorithms. These accelerators are designed to execute such operations in fewer cycles compared to conventional instruction execution, thereby improving overall system performance.

One common approach to AI acceleration is the use of **custom instruction set**

**extensions**, where new instructions are added to the processor's ISA to perform AI-specific operations. This method allows seamless integration of AI functionality without requiring a separate accelerator block, resulting in lower area overhead and simplified software support. The RISC-V architecture, due to its modular and extensible nature, is particularly well-suited for such enhancements.

Another approach involves integrating **dedicated hardware accelerators** or **co-processors** alongside the main processor core. These accelerators can operate in parallel with the core and are often optimized for specific AI tasks. However, this approach may introduce additional complexity in terms of memory access, data movement, and synchronization.

In embedded environments, power efficiency is a critical design constraint. AI acceleration techniques in embedded processors are therefore designed to minimize energy consumption by reducing instruction count, lowering memory accesses, and exploiting data-level parallelism. This makes AI-accelerated embedded processors highly suitable for applications such as **edge AI**, **Internet of Things (IoT)**, **smart sensors**, **image processing**, and **real-time control systems**.

In conclusion, AI acceleration in embedded processors enables efficient execution of intelligent workloads within constrained environments. By combining custom instruction extensions, optimized execution units, and efficient pipeline integration, modern embedded processors can deliver enhanced AI performance while maintaining low power and area overhead.



**Fig 3: - Vector Processor**

## 2.4 Limitations of Existing Approaches

Despite significant advancements in RISC-V processors and AI acceleration techniques, existing approaches still face several limitations that restrict their efficiency, scalability, and suitability for embedded and real-time applications.

One major limitation is the **lack of efficient support for AI-specific operations** in many traditional processor designs. General-purpose RISC-V cores primarily focus on standard arithmetic and control instructions, resulting in a high instruction count when executing AI workloads such as vector dot products, comparisons, and accumulation operations. This increases execution latency and power consumption.

Another challenge lies in the use of **external or tightly coupled hardware accelerators**. While dedicated accelerators can provide high performance, they often introduce complexity in terms of data transfer, memory coherence, and synchronization between the processor core and the accelerator. This additional overhead can negate performance

gains, especially in resource-constrained embedded systems.

Many existing designs also suffer from **limited pipeline optimization** for AI workloads. Traditional pipelined architectures are optimized for scalar instruction execution and may not efficiently exploit data-level parallelism required by AI algorithms. This results in pipeline stalls, inefficient resource utilization, and reduced throughput.

Power efficiency is another critical concern. Several AI acceleration approaches focus primarily on performance improvement without adequately addressing **energy constraints**, making them unsuitable for battery-powered and edge computing applications. Increased switching activity, higher memory accesses, and longer execution times contribute to higher power dissipation.

Additionally, some processor designs lack **flexibility and scalability**. Hardwired accelerators or fixed-function units may perform well for specific algorithms but are not easily adaptable to evolving AI models or new application requirements. This limits their long-term usability.

In summary, existing RISC-V and AI acceleration approaches face limitations related to performance inefficiency, power consumption, architectural complexity, and lack of flexibility. These challenges highlight the need for a **balanced and integrated design approach**, combining pipelined processing with custom AI-oriented instruction extensions, which forms the motivation for the proposed processor architecture in this project.

# **CHAPTER 3 : SYSTEM ARCHITECTURE AND DESIGN**

## **3.1 Overall Processor Architecture**

The proposed processor architecture is a **32-bit RISC-V based processor** designed with a **5-stage pipelined architecture** and enhanced with **custom AI instruction acceleration**. The processor implements the **RV32IM instruction set**, supporting the base integer instructions along with the multiplication and division extension, while also allowing custom ISA extensions for AI-oriented operations.

The architecture follows the classical **Instruction Fetch (IF)**, **Instruction Decode (ID)**, **Execute (EX)**, **Memory Access (MEM)**, and **Write Back (WB)** pipeline stages. Pipeline registers are placed between each stage to enable parallel instruction execution and improve throughput. The design incorporates hazard detection and data forwarding mechanisms to ensure correct execution in the presence of pipeline hazards.

At the core of the processor is a **register file** consisting of 32 general-purpose 32-bit registers, which supports two read ports and one write port. The **control unit** decodes instructions and generates appropriate control signals for pipeline operation, arithmetic execution, memory access, and register write-back.

The **execution stage** contains an Arithmetic Logic Unit (ALU) that supports standard arithmetic and logical operations defined by the RV32I instruction set. In addition, a dedicated **multiply/divide unit** is integrated to support the RV32M extension. This unit operates using a handshake-based interface to accommodate multi-cycle execution without disrupting pipeline flow.

To accelerate AI workloads, the processor includes a **custom AI execution unit** that supports specialized instructions such as vector dot-product and maximum-value operations. These instructions are executed directly within the pipeline, significantly reducing instruction count and execution latency for AI-related computations.

The memory subsystem interfaces with separate instruction and data memories, supporting load and store operations. Control flow instructions such as branches and

jumps are resolved within the pipeline with appropriate flushing and redirection logic.

Overall, the proposed architecture combines the simplicity and flexibility of the RISC-V ISA with pipelined execution and AI-specific enhancements, resulting in an efficient and scalable processor suitable for embedded and edge AI applications.

## 3.2 Instruction Set Overview

The proposed processor implements the **RV32IM instruction set architecture**, which consists of the **RV32I base integer instruction set** along with the **M extension** that supports multiplication and division operations. The RV32IM ISA provides a compact, efficient, and modular instruction framework suitable for embedded and high-performance computing applications.

The **RV32I base instruction set** includes fundamental instructions required for general-purpose computation. These instructions support integer arithmetic, logical operations, data movement, and control flow. Key instruction categories under RV32I include:

- Arithmetic and logical instructions such as ADD, SUB, AND, OR, XOR, and shift operations.
- Load and store instructions for memory access.
- Control flow instructions including branches, jumps, and procedure calls.
- Immediate-based instructions for efficient constant handling.

To enhance computational capability, the processor integrates the **RV32M extension**, which introduces hardware support for multiplication and division operations. The RV32M extension includes instructions such as:

- Integer multiplication operations (MUL, MULH, MULHSU, MULHU)
- Division operations (DIV, DIVU)

- Remainder operations (REM, REMU)

These operations are implemented using a dedicated multiply/divide unit that operates in the execute stage. The unit supports multi-cycle execution and is integrated with the pipeline using a handshake mechanism to maintain correctness while minimizing pipeline stalls.

In addition to the standard RV32IM instruction set, the processor supports **custom instruction extensions** designed specifically for AI acceleration. These custom instructions are encoded using the RISC-V custom opcode space and enable operations such as vector dot-product and maximum value computation. By offloading AI-intensive operations to specialized hardware units, the processor significantly improves execution efficiency for machine learning and signal processing workloads.

The modular nature of the RV32IM instruction set allows seamless integration of these custom extensions while maintaining compatibility with standard RISC-V toolchains. This makes the processor highly extensible, scalable, and suitable for future enhancements.

## RISC-V Instruction Set Architecture



**Fig 4: - RISC-V Instruction Set Architecture**

| Extension | Purpose                                    | Instructions                              | Latency    | Applications                             |
|-----------|--------------------------------------------|-------------------------------------------|------------|------------------------------------------|
| <b>M</b>  | Integer multiply and divide operations     | 8 (MUL, MULH, DIV, DIVU, REM, REMU, etc.) | 2–5 cycles | DSP, cryptography, arithmetic processing |
| <b>A</b>  | Atomic memory operations                   | 6+ (LR, SC, AMO operations)               | 1–2 cycles | Synchronization, multi-threading         |
| <b>F</b>  | Single-precision floating-point operations | 26 instructions                           | 3–5 cycles | AI/ML, graphics, signal processing       |
| <b>D</b>  | Double-precision                           | 26 instructions                           | 4–6 cycles | Scientific                               |

|   |                                   |                         |                |                                          |
|---|-----------------------------------|-------------------------|----------------|------------------------------------------|
|   | on floating-point operations      |                         |                | computing, HPC                           |
| C | Compressed 16-bit instruction set | 50+ instructions        | Same as 32-bit | Embedded systems, code density reduction |
| V | Vector SIMD processing            | 100+ instructions       | 1–4 cycles     | AI/ML acceleration, DSP, HPC             |
| K | Cryptographic operations          | 50+ instructions        | 1–3 cycles     | Encryption, hashing, security            |
| H | Hypervisor support                | Privileged instructions | Varies         | Virtualization, guest operating systems  |

**TABLE II: - RISC-V STANDARD EXTENSIONS: FUNCTIONALITY, SPECIFICATIONS, AND APPLICATION DOMAINS**

### 3.3 Pipeline Architecture

The proposed processor is designed using a **5-stage pipelined architecture**, which improves instruction throughput by allowing multiple instructions to be processed concurrently. Each instruction is divided into five sequential stages, with pipeline registers placed between stages to enable parallel execution. This architecture balances performance, hardware complexity, and design clarity, making it suitable for both academic and practical implementations.

The five pipeline stages implemented in the processor are:

1. Instruction Fetch (IF)
2. Instruction Decode (ID)

3. Execute (EX)
4. Memory Access (MEM)
5. Write-Back (WB)

Each stage performs a specific function in the instruction execution cycle, as described below.

### **3.3.1 Instruction Fetch (IF) Stage**

The Instruction Fetch (IF) stage is responsible for fetching the instruction from instruction memory based on the current value of the Program Counter (PC). The PC is incremented by 4 after each instruction fetch to support sequential execution.

The IF stage also handles control flow changes caused by branch and jump instructions. In such cases, the PC is updated with the branch or jump target address. Pipeline stalling mechanisms are implemented to handle hazards arising from branch resolution and multi-cycle operations.

The fetched instruction and the updated PC value are stored in the IF>ID pipeline register for use in the next stage.

### **3.3.2 Instruction Decode (ID) Stage**

The Instruction Decode (ID) stage decodes the fetched instruction and extracts fields such as opcode, source registers, destination register, function codes, and immediate values. The register file is accessed in this stage to read the source operand values.

Immediate generation is performed based on the instruction format (R-type, I-type, S-type, B-type, U-type, or J-type). The control unit generates necessary control signals for subsequent stages.

Hazard detection logic is incorporated in the ID stage to identify load-use hazards. When such hazards are detected, the pipeline is stalled to ensure correct execution.

### **3.3.3 Execute (EX) Stage**

The Execute (EX) stage performs arithmetic, logical, and comparison operations using the Arithmetic Logic Unit (ALU). This stage also computes branch target addresses and evaluates branch conditions.

For multiplication and division instructions from the RV32M extension, a dedicated multiply/divide unit is used. These operations may require multiple cycles, and a handshake mechanism is implemented to stall the pipeline until the result becomes available.

Additionally, custom AI instructions are executed in this stage using a specialized AI execution unit. These instructions accelerate operations such as vector dot-product and maximum value computation.

Data forwarding mechanisms are implemented in the EX stage to reduce pipeline stalls caused by data hazards.

### **3.3.4 Memory Access (MEM) Stage**

The Memory Access (MEM) stage is responsible for accessing data memory for load and store instructions. For load instructions, data is read from memory, while for store instructions, data is written to memory.

This stage also ensures proper alignment and address calculation for memory operations. Instructions that do not require memory access simply pass through this stage without modification.

The output of the MEM stage is forwarded to the WB stage through the MEM/WB pipeline register.

### **3.3.5 Write-Back (WB) Stage**

The Write-Back (WB) stage writes the final result of instruction execution back to the register file. Depending on the instruction type, the data written back may come from:

- The ALU result

- The multiply/divide unit output
- The AI execution unit
- The data memory (for load instructions)

Write-back control signals ensure that only valid destination registers are updated. This stage completes the instruction execution cycle.

## 3.4 Hazard Detection and Data Forwarding

In a pipelined processor, hazards occur when the normal execution flow of instructions leads to incorrect results due to resource conflicts or data dependencies. To ensure correct and efficient execution, the proposed processor incorporates **hazard detection and data forwarding mechanisms**.

The hazards addressed in this design are primarily **data hazards**, which arise when an instruction depends on the result of a previous instruction that has not yet completed its execution.

### 3.4.1 Data Hazards

Data hazards occur when an instruction requires data that is still being produced by an earlier instruction in the pipeline. In a 5-stage pipeline, such hazards commonly arise between the EX, MEM, and WB stages.

The most critical data hazard handled in this design is the **load-use hazard**, where an instruction immediately following a load instruction attempts to use the loaded data.

### 3.4.2 Hazard Detection Unit

A dedicated **Hazard Detection Unit** is implemented in the Instruction Decode (ID) stage. This unit monitors the source registers of the current instruction and compares them with the destination register of the instruction in the Execute (EX) stage.

When a load-use dependency is detected, the hazard detection unit:

- Inserts a pipeline stall
- Prevents the Program Counter (PC) from updating
- Freezes the IF/ID pipeline register
- Inserts a NOP (bubble) into the pipeline

This ensures that the dependent instruction waits until the required data becomes available, maintaining correct program execution.

### 3.4.3 Data Forwarding Mechanism

To minimize performance degradation caused by data hazards, **data forwarding (bypassing)** is implemented. The forwarding logic allows the processor to use results from later pipeline stages without waiting for them to be written back to the register file.

The forwarding unit checks for register dependencies between:

- EX stage and EX/MEM stage
- EX stage and MEM/WB stage

Based on these comparisons, the forwarding unit selects the appropriate data source for the ALU operands using multiplexers.

This mechanism significantly reduces the number of pipeline stalls and improves overall instruction throughput.

### 3.4.4 Integration with Multiply/Divide and AI Units

The hazard detection and forwarding logic is extended to support:

- Multi-cycle RV32M multiply and divide instructions
- Custom AI instructions executed in the EX stage

For multi-cycle operations, the pipeline is stalled until the result is ready. Forwarding paths ensure that results from the multiply/divide unit and AI execution unit are correctly routed to dependent instructions.

### 3.4.5 Performance Impact

By combining hazard detection with efficient data forwarding, the processor achieves:

- Correct execution of dependent instructions
- Reduced pipeline stalls
- Improved performance compared to a non-forwarded pipeline

This design ensures a balance between hardware complexity and execution efficiency, making it suitable for AI-accelerated RISC-V applications.

## 3.5 Multiply–Divide Unit (M-Unit)

The RV32M extension of the RISC-V instruction set architecture introduces integer multiplication and division operations, which are essential for computationally intensive workloads such as signal processing, control systems, and AI-oriented arithmetic. In this project, a dedicated **Multiply–Divide Unit (M-Unit)** has been designed and integrated into the **Execute (EX) stage** of the 5-stage pipelined RV32IM processor to support all RV32M instructions efficiently.

### 3.5.1 Purpose of the M-Unit

The base RV32I instruction set does not support multiplication and division operations. To overcome this limitation and enhance computational capability, the RV32M extension has been implemented. The M-Unit is responsible for executing the following instructions:

- **MUL** – Lower 32 bits of signed multiplication

- **MULH** – Upper 32 bits of signed  $\times$  signed multiplication
- **MULHSU** – Upper 32 bits of signed  $\times$  unsigned multiplication
- **MULHU** – Upper 32 bits of unsigned  $\times$  unsigned multiplication
- **DIV** – Signed division
- **DIVU** – Unsigned division
- **REM** – Signed remainder
- **REMU** – Unsigned remainder

These operations are offloaded from the main ALU to the M-Unit to maintain pipeline clarity and modularity.

### 3.5.2 Integration in the Execute (EX) Stage

The M-Unit is tightly coupled with the **EX stage** of the pipeline. During instruction decode, RV32M instructions are identified using:

- **opcode = OP**
- **funct7 = 0000001**

Once detected, the EX stage routes the operands (**rs1** and **rs2**) along with the decoded operation type to the M-Unit instead of the standard ALU.

A **handshake-based control mechanism** is used to coordinate execution between the pipeline and the M-Unit. This mechanism ensures correct operation even when multi-cycle instructions such as division are executed.

### 3.5.3 EX–M-Unit Handshake Mechanism

The following control signals are used for synchronization:

- **m\_req\_valid** – Indicates a valid multiply/divide request from the EX stage
- **m\_ready** – Indicates the M-Unit can accept a new request
- **m\_done** – Indicates completion of the operation
- **m\_result** – Holds the final computation result

When an RV32M instruction reaches the EX stage:

1. The EX stage asserts **m\_req\_valid**.
2. The pipeline **stalls** until **m\_done** is asserted.
3. Once the operation is complete, the result is forwarded to the **Write-Back (WB) stage**.

This stall mechanism ensures **pipeline correctness** while supporting variable-latency operations.

### 3.5.4 M-Unit Internal Operation

The M-Unit is implemented as a **finite state machine (FSM)** that handles:

- Operand latching
- Operation execution
- Result generation

- Completion signaling

Multiplication operations typically complete in fewer cycles, while division and remainder operations may take multiple cycles depending on operand values. Signed and unsigned operations are handled correctly as per the RV32M specification.

### 3.5.5 Write-Back and Verification

Once the M-Unit completes execution:

- The result is forwarded to the WB stage.
- The destination register (**rd**) is updated.
- The pipeline resumes normal operation.

The correctness of the M-Unit implementation has been verified using:

- **Directed test programs (`i.mem.hex`)**
- **Simulation waveform analysis**
- **Transcript-based result validation**

The waveform clearly shows:

- Stall cycles during M-Unit execution
- Assertion of **m\_done**
- Correct propagation of **m\_result** to the register file

### 3.5.6 Significance of the M-Unit in the Project

The integration of the RV32M M-Unit significantly enhances the processor's computational capabilities. It enables efficient execution of arithmetic-heavy workloads and forms a strong foundation for **AI-oriented instruction acceleration**, complementing the custom AI instructions implemented in the processor.

This modular and handshake-based design ensures scalability and makes the processor compliant with the **RV32IM RISC-V specification**, preparing it for future extensions and advanced verification.

## 3.6 Custom AI Instruction Unit (VDOT4 and VMAX4)

To enhance the computational efficiency of Artificial Intelligence (AI) and machine learning workloads, the proposed RV32IM processor integrates a **Custom AI Instruction Unit**. This unit accelerates data-level parallel operations that are frequently used in neural networks, signal processing, and vector-based computations. The AI acceleration is achieved through **custom ISA extensions** implemented using RISC-V's *custom opcode space*.

The custom AI unit is tightly coupled with the **Execute (EX) stage** of the 5-stage pipeline and operates alongside the standard ALU and RV32M Multiply-Divide Unit (M-Unit).

### 3.6.1 Rationale for Custom AI Instructions

Traditional RISC-V scalar instructions process one data element at a time, which leads to inefficient execution for AI workloads that rely on vector and dot-product operations. To overcome this limitation, the processor introduces **packed-byte parallel instructions** that perform multiple operations within a single instruction cycle.

The objectives of introducing custom AI instructions are:

- **Reduction in instruction count** for AI workloads
- **Improved execution throughput**
- **Lower power consumption** by minimizing instruction fetch and decode overhead

- Seamless integration with existing RV32IM pipeline

### 3.6.2 Overview of Implemented AI Instructions

Two custom AI instructions have been implemented:

#### 1. VDOT4 (Vector Dot Product of 4 Elements)

- Performs **parallel dot-product computation** on four 8-bit signed elements packed into a 32-bit register.
- Each byte from source registers contributes to the final accumulated result.
- The result is a **32-bit scalar output**.

#### Operation:

- Input: Two 32-bit registers (rs1, rs2)
- Each register is divided into four 8-bit values
- Corresponding bytes are multiplied and accumulated

#### Mathematical Representation:

$$\text{VDOT4} = (\text{rs1}[7:0] \times \text{rs2}[7:0]) + (\text{rs1}[15:8] \times \text{rs2}[15:8]) + (\text{rs1}[23:16] \times \text{rs2}[23:16]) + (\text{rs1}[31:24] \times \text{rs2}[31:24])$$

#### 2. VMAX4 (Vector Maximum of 4 Elements)

- Computes the **maximum value among four packed 8-bit elements**.
- Used in AI activation functions and comparison-based workloads.

#### Operation:

- Input: One or two 32-bit registers
- Each byte is treated as an independent signed value
- The maximum byte value is selected and expanded to 32-bit output

### 3.6.3 Instruction Encoding and Decode Mechanism

- Both instructions utilize the **RISC-V CUSTOM-0 opcode**.
- Instruction identification is based on:
  - Opcode = CUSTOM-0
  - funct7 field differentiates VDOT4 and VMAX4
- During the **Instruction Decode (ID) stage**, control signals are generated:
  - `id_ex_do_vdot4`
  - `id_ex_do_vmax4`

These signals ensure that the instruction is routed to the AI unit instead of the standard ALU.

### 3.6.4 Integration in the Execute (EX) Stage

The AI unit is placed in the **EX stage** and operates in parallel with other execution units. The execution flow is as follows:

1. Source operands are read from the register file.
2. Packed-byte extraction logic separates 8-bit elements.

3. The selected AI operation (VDOT4 or VMAX4) is executed.
4. The computed result is forwarded to:
  - Write-Back (WB) stage
  - Data memory (if required)

This integration ensures **minimal pipeline disruption** and maintains compatibility with existing hazard detection and forwarding logic.

### 3.6.5 Pipeline Behavior and Hazard Handling

- AI instructions are designed to complete in **a single EX cycle**, avoiding multi-cycle stalls.
- Existing **data forwarding paths** are reused to handle dependencies.
- Control hazards are managed using the same mechanisms as standard instructions.

This design choice ensures that AI acceleration does not introduce complexity into the pipeline control logic.

### 3.6.6 Verification and Simulation Results

The functionality of the AI unit was validated using:

- **Directed test programs** loaded through `imem.hex`
- Waveform analysis in **QuestaSim**
- Transcript-based result verification

#### Observed Results:

- **VDOT4 Output:** `0x0000000A`
- **VMAX4 Output:** `0x01020304`
- Results were correctly written to data memory and verified in simulation.

The waveform clearly demonstrates correct signal assertion in the EX stage and successful write-back of AI results.

### **3.6.7 Advantages of the Proposed AI Unit**

- ✓ High performance for AI workloads
- ✓ Minimal hardware overhead
- ✓ Fully compatible with RV32IM pipeline
- ✓ Scalable design for future vector extensions

### **3.6.8 Summary**

The Custom AI Instruction Unit significantly enhances the computational capability of the RV32IM processor by introducing application-specific acceleration for AI workloads. Through the implementation of VDOT4 and VMAX4 instructions, the processor achieves higher performance and efficiency while preserving the simplicity and modularity of the RISC-V architecture.

This unit demonstrates the **extensibility of RISC-V ISA** and validates the effectiveness of custom instruction acceleration in modern embedded processor design.

## **CHAPTER 4 : IMPLEMENTATION DETAILS**

This chapter describes the practical realization of the proposed **5-Stage Pipelined RV32IM RISC-V Processor with Custom AI Instruction Acceleration**. The complete processor has been implemented using **Verilog HDL**, simulated using **QuestaSim**, and functionally verified through directed test programs. The design follows a modular and hierarchical approach to ensure clarity, scalability, and ease of verification.

## 4.1 Verilog HDL Implementation

The entire processor is implemented using **structural and behavioral Verilog HDL**, with each functional block developed as an independent module. This modular design approach improves readability, debugging efficiency, and future extensibility of the processor.

### 4.1.1 Design Methodology

The implementation follows these key principles:

- **Modular architecture:** Each pipeline stage and execution unit is implemented as a separate Verilog module.
- **Pipeline-based execution:** A 5-stage pipeline is used to increase instruction throughput.
- **ISA compliance:** The core supports the **RV32I base instruction set** and the **RV32M extension**.
- **Custom acceleration:** AI-specific instructions are integrated using custom opcode decoding.
- **Synchronous design:** All state elements are clocked using a single system clock with active-low reset.

### 4.1.2 Top-Level Processor Module

The top-level module **MyCoreAI32** integrates all pipeline stages and functional units. It includes:

- Program Counter (PC) logic
- Instruction Fetch (IF) interface
- Instruction Decode (ID) and control logic
- Execute (EX) stage with ALU, M-Unit, and AI Unit
- Memory Access (MEM) stage
- Write-Back (WB) stage

Pipeline registers are placed between each stage to store instruction data and control signals.

#### 4.1.3 Pipeline Stage Implementation

##### Instruction Fetch (IF) Stage

- Fetches instructions from instruction memory using the Program Counter.
- Supports sequential execution and branch redirection.
- Instruction and PC values are stored in the IF/ID pipeline register.

##### Instruction Decode (ID) Stage

- Decodes opcode, funct3, and funct7 fields.
- Generates control signals for ALU, memory, M-Unit, and AI unit.
- Reads source operands from the register file.

- Includes hazard detection logic for load-use and M-unit stalls.

### **Execute (EX) Stage**

- Performs arithmetic, logical, and comparison operations using the ALU.
- Executes multiply/divide instructions using the **RV32M M-Unit**.
- Executes custom AI instructions (VDOT4 and VMAX4) using the **AI Unit**.
- Includes operand forwarding logic to resolve data hazards.

### **Memory Access (MEM) Stage**

- Handles load and store operations.
- Interfaces with data memory using address, write enable, and data signals.
- Passes data to the WB stage through pipeline registers.

### **Write-Back (WB) Stage**

- Writes results back to the destination register.
- Selects data from ALU, M-Unit, AI Unit, or data memory.
- Ensures correct register update with write-enable control.

#### **4.1.4 Multiply-Divide Unit Integration**

The RV32M extension is implemented using a dedicated **muldiv\_unit** module. Key features include:

- Support for **MUL**, **MULH**, **DIV**, **DIVU**, **REM**, and **REMU** instructions.

- Handshake-based interface (`valid`, `ready`, `done`) with the EX stage.
- Multi-cycle execution support with pipeline stall control.
- Result forwarding once computation is complete.

The EX stage stalls automatically while the M-Unit is busy, ensuring correct execution without data corruption.

#### 4.1.5 Custom AI Unit Implementation

The AI unit is implemented as a separate Verilog module (`ai_unit`) and integrated into the EX stage. Features include:

- Support for **VDOT4** and **VMAX4** instructions.
- Packed 8-bit data extraction from 32-bit registers.
- Parallel computation within a single clock cycle.
- Output directly forwarded to the WB stage or memory.

Control signals ensure mutual exclusivity between ALU, M-Unit, and AI unit execution paths.

#### 4.1.6 Hazard Detection and Forwarding Logic

To maintain pipeline correctness, the following mechanisms are implemented:

- **Load-use hazard detection** with pipeline stall insertion.
- **Data forwarding** from EX/MEM and MEM/WB stages.
- **M-Unit stall handling** to pause dependent instructions.

These mechanisms ensure correct execution without unnecessary pipeline flushing.

#### 4.1.7 Simulation and Verification Setup

- Simulation is performed using **QuestaSim**.
- Test programs are loaded through **imem.hex**.
- Waveforms are generated using WLF format for detailed signal analysis.
- Correct execution is verified using memory and register value checks.

Simulation results confirm that:

- RV32IM instructions execute correctly.
- Custom AI instructions produce expected outputs.
- Pipeline hazards are correctly resolved.

### 4.2 Pipeline Control and Stall Mechanism

Efficient pipeline control is essential to ensure correct execution of instructions in a pipelined processor. In the proposed **5-Stage Pipelined RV32IM RISC-V Processor**, a comprehensive pipeline control mechanism has been implemented to handle **data hazards, control hazards, and multi-cycle execution stalls**, particularly due to the RV32M multiply-divide unit.

The pipeline control logic ensures that instructions are executed in the correct order while maintaining maximum throughput wherever possible.

#### 4.2.1 Need for Pipeline Control

In a pipelined architecture, multiple instructions are processed simultaneously across

different stages. However, this parallelism introduces hazards such as:

- **Data hazards** when an instruction depends on the result of a previous instruction.
- **Control hazards** due to branch and jump instructions.
- **Structural hazards** caused by shared hardware resources.
- **Multi-cycle hazards** introduced by operations like multiplication and division.

To address these issues, stall and forwarding mechanisms are implemented.

#### 4.2.2 Load-Use Hazard Detection

A **load-use hazard** occurs when an instruction in the Decode (ID) stage depends on the result of a load instruction that is still in the Execute (EX) stage.

To resolve this:

- The hazard detection unit compares the source registers of the current instruction with the destination register of the preceding load instruction.
- If a dependency is detected, a **pipeline stall** is introduced for one cycle.
- The Program Counter (PC) and IF/ID pipeline register are frozen.
- A NOP (bubble) is inserted into the EX stage to prevent incorrect execution.

This mechanism guarantees data correctness without requiring complex forwarding for load instructions.

#### 4.2.3 Data Forwarding Mechanism

To minimize performance loss due to stalls, **data forwarding** is implemented.

Forwarding allows the processor to use results from later pipeline stages before they are written back to the register file.

The forwarding unit supports:

- Forwarding from the **EX/MEM stage** to the EX stage.
- Forwarding from the **MEM/WB stage** to the EX stage.

Priority is given to the most recent result to ensure correctness. This significantly reduces the number of pipeline stalls caused by data dependencies.

#### 4.2.4 M-Unit Stall Handling

The RV32M multiply and divide instructions are executed using a dedicated **multi-cycle M-Unit**. Since these operations may take multiple clock cycles, special stall handling is required.

The following handshake mechanism is used:

- A **valid** signal is asserted when a multiply or divide instruction enters the EX stage.
- The M-Unit asserts **ready** when it accepts the operation.
- While the M-Unit is busy, the pipeline is stalled to prevent dependent instructions from proceeding.
- Once computation completes, the M-Unit asserts **done**, and the result is forwarded to the Write-Back stage.

This mechanism ensures correct execution of multi-cycle instructions while maintaining pipeline integrity.

#### 4.2.5 Control Hazard Handling

Control hazards arise due to branch and jump instructions that may alter the program

flow.

In this design:

- Branch conditions are evaluated in the EX stage.
- If a branch is taken, the Program Counter is updated with the target address.
- Instructions that were fetched speculatively are invalidated.
- The pipeline is synchronized to resume execution from the correct instruction address.

This approach ensures correctness with minimal control complexity.

#### 4.2.6 Combined Stall and Control Logic

All stall conditions—including load-use hazards, M-Unit stalls, and control hazards—are combined using centralized pipeline control logic. The final stall signal controls:

- Program Counter update
- IF/ID and ID/EX pipeline register enable
- Injection of NOP instructions when required

This unified control ensures robust and predictable pipeline behavior.

#### 4.2.7 Summary

The implemented pipeline control and stall mechanism:

- Ensures correct execution of dependent instructions
- Minimizes performance loss through data forwarding

- Supports multi-cycle RV32M operations
- Maintains pipeline consistency during branches and jumps

As a result, the processor achieves both **functional correctness** and **efficient pipelined performance**, making it suitable for AI-accelerated and general-purpose workloads.

## 4.3 Integration of AI Acceleration Instructions

To enhance the computational capability of the processor for machine learning and data-parallel workloads, **custom AI acceleration instructions** have been integrated into the proposed RV32IM processor core. These instructions are designed to accelerate commonly used operations in lightweight AI and signal processing applications while maintaining compatibility with the RISC-V architecture.

The integration is achieved through custom instruction decoding and a dedicated execution unit without disrupting the standard pipeline flow.

### 4.3.1 Motivation for AI Instruction Integration

Conventional scalar processors execute AI-related operations such as dot products and vector comparisons using multiple instructions, leading to increased execution latency and power consumption. To address this limitation, the processor introduces **custom ISA extensions** that enable parallel computation within a single instruction.

The custom instructions are optimized for:

- Edge AI workloads
- Embedded machine learning inference
- Signal processing and multimedia applications

### 4.3.2 Custom Instruction Encoding

The AI acceleration instructions are implemented using the **RISC-V Custom-0 opcode**

**space**, ensuring no conflict with standard RV32I or RV32M instructions.

Key characteristics:

- Uses a dedicated opcode (**CUSTOM0**)
- Differentiated using **funct7** fields
- Operates on standard 32-bit general-purpose registers

This approach preserves ISA compatibility while allowing custom functionality

### 4.3.3 Supported AI Instructions

The processor supports the following custom AI instructions:

#### **VDOT4 (Vector Dot Product of 4 Elements)**

- Operates on two 32-bit source registers.
- Each register is interpreted as four packed 8-bit signed elements.
- Performs parallel multiplication and accumulation.
- Produces a 32-bit scalar output.

#### **VMAX4 (Vector Maximum of 4 Elements)**

- Compares corresponding 8-bit elements from two source registers.
- Selects the maximum value for each element.
- Packs the results into a single 32-bit output register.

These instructions significantly reduce the instruction count for vector operations.

#### 4.3.4 AI Unit Architecture

A dedicated **AI Unit** is implemented in the Execute (EX) stage to process custom AI instructions. The unit operates in parallel with the standard ALU and M-Unit.

Key features of the AI Unit include:

- Packed byte extraction logic
- Parallel arithmetic and comparison circuits
- Single-cycle execution for AI operations
- Direct result forwarding to the Write-Back stage

Control logic ensures that the AI unit is activated only for valid custom instructions.

#### 4.3.5 Pipeline Integration

The AI unit is seamlessly integrated into the pipeline with minimal disruption:

- AI instructions are decoded in the ID stage.
- Control signals select the AI execution path in the EX stage.
- Results bypass unnecessary stages when applicable.
- Standard pipeline registers handle result propagation.

Data hazards involving AI instructions are resolved using existing forwarding logic.

#### 4.3.6 Interaction with RV32M and ALU

The processor ensures mutual exclusivity between execution units:

- ALU executes standard arithmetic and logic instructions.
- M-Unit handles multiply and divide operations.
- AI Unit handles vector-based AI instructions.

A centralized control mechanism selects the appropriate execution unit based on instruction decoding, preventing resource conflicts.

#### **4.3.7 Verification of AI Instructions**

The functionality of the AI acceleration instructions is verified using directed test programs:

- Known vector inputs are loaded into registers.
- AI instructions are executed.
- Results are written to data memory for validation.
- Waveform analysis confirms correct packed computation.

Simulation results demonstrate accurate and efficient execution of both VDOT4 and VMAX4 instructions.

#### **4.3.8 Summary**

The integration of AI acceleration instructions enhances the processor's capability to efficiently execute data-parallel workloads. By leveraging custom ISA extensions and a dedicated execution unit, the processor achieves:

- Reduced instruction count

- Lower execution latency
- Improved suitability for edge AI applications

This design maintains full compatibility with RV32IM while enabling domain-specific acceleration.

## 4.4 Memory Interface Design

The memory interface plays a crucial role in enabling correct and efficient communication between the processor core and external memory. In the proposed **5-Stage Pipelined RV32IM RISC-V Processor**, a simple yet effective memory interface is implemented to support instruction fetch and data access operations.

The design follows a **Harvard-style separation** between instruction memory and data memory to allow simultaneous instruction fetch and data access.

### 4.4.1 Instruction Memory Interface

The instruction memory interface is used exclusively by the **Instruction Fetch (IF) stage**.

Key features include:

- A 32-bit address bus generated by the Program Counter (PC).
- A 32-bit instruction output fetched from instruction memory.
- Word-aligned access using the PC value.
- Read-only operation, as instruction memory is not modified during execution.

Instructions are fetched combinational and stored in the IF/ID pipeline register for further decoding.

### 4.4.2 Data Memory Interface

The data memory interface is accessed during the **Memory Access (MEM) stage** for load and store instructions.

The interface consists of:

- 32-bit address bus
- 32-bit write data bus
- 32-bit read data bus
- Memory read enable signal
- Memory write enable signal

This interface supports word-level load (**LW**) and store (**SW**) operations as defined in the RV32I specification.

#### 4.4.3 Load Operation Handling

For load instructions:

- The effective memory address is computed in the EX stage.
- The MEM stage reads data from memory using the computed address.
- The read data is forwarded to the Write-Back (WB) stage.
- The WB stage writes the data into the destination register.

Pipeline registers ensure proper synchronization of data across stages.

#### 4.4.4 Store Operation Handling

For store instructions:

- The EX stage calculates the effective memory address.
- The MEM stage writes data from the source register into memory.
- No write-back is performed for store instructions.

Write enable signals ensure that memory is updated only for valid store operations.

#### **4.4.5 Address Alignment and Access Control**

The memory interface enforces:

- Word-aligned access by ignoring lower address bits.
- Controlled access through explicit read and write enable signals.
- Safe operation by preventing simultaneous read and write on the same cycle.

These measures ensure predictable memory behavior during simulation and execution.

#### **4.4.6 Integration with Pipeline Control**

The memory interface is tightly integrated with pipeline control logic:

- Load-use hazards are detected and handled through pipeline stalls.
- Forwarding logic supports load result propagation.
- Store instructions do not interfere with pipeline progression.

This integration ensures correct memory operation without compromising pipeline efficiency.

#### **4.4.7 Verification of Memory Interface**

The correctness of the memory interface is verified through simulation:

- Directed test programs store known values into data memory.
- Load instructions retrieve stored values and verify correctness.
- Simulation waveforms confirm correct timing of read/write signals.

Observed simulation results confirm reliable instruction fetch and data access behavior.

#### **4.4.8 Summary**

The memory interface design provides a robust and efficient mechanism for instruction and data access. By maintaining simplicity and strict adherence to the RV32I specification, the design ensures:

- Correct memory operations
- Seamless pipeline integration
- Ease of verification and future extensibility

This completes the implementation of core memory interactions within the processor.

## **CHAPTER 5 : VERIFICATION AND SIMULATION**

This chapter presents the verification and simulation strategy adopted to validate the functional correctness and performance of the proposed **5-Stage Pipelined RV32IM RISC-V Processor with Custom AI Instruction Acceleration**. Verification is a critical phase in processor design to ensure compliance with architectural specifications and correct handling of pipeline hazards and custom instructions.

## 5.1 Verification Methodology

The verification of the processor core is carried out using a **simulation-based verification approach**. The methodology focuses on validating individual functional units, pipeline behavior, and instruction-level correctness under various operating conditions.

The verification strategy is structured to progressively test the processor from basic functionality to complex instruction interactions.

### 5.1.1 Simulation Environment

The simulation environment is set up using:

- **QuestaSim** as the primary simulation tool
- **SystemVerilog testbench** for stimulus generation and result checking
- Hex-based instruction memory initialization (`imem.hex`)
- Waveform generation using WLF format for detailed signal analysis

The testbench instantiates the processor core and models instruction memory and data memory behavior.

### 5.1.2 Directed Test-Based Verification

A **directed test methodology** is used to verify processor functionality. In this approach,

specific instruction sequences are manually written to test individual features of the processor.

The directed tests are designed to validate:

- RV32I base instructions (ADD, SUB, LOAD, STORE, BRANCH)
- RV32M instructions (MUL, DIV, REM)
- Custom AI instructions (VDOT4, VMAX4)
- Pipeline hazard handling
- Correct memory access and write-back behavior

Each test program produces predictable results that are checked against expected outputs.

### **5.1.3 Instruction Memory Initialization**

The instruction memory is initialized using a hex file (`imem.hex`), which contains machine code instructions generated specifically for verification.

This approach enables:

- Precise control over instruction execution order
- Repeatable simulation runs
- Easy modification of test programs

The processor begins execution from a known reset address, ensuring deterministic simulation behavior.

### **5.1.4 Data Memory Result Checking**

To verify correctness, computed results are written to specific locations in data memory. The testbench reads these memory locations after execution and compares them with expected values.

This method is used to validate:

- Arithmetic and logical operation results
- Multiply-divide unit outputs
- AI instruction outputs
- Correct load and store functionality

Pass/fail messages are displayed in the simulation transcript for quick verification.

### **5.1.5 Pipeline and Hazard Verification**

Pipeline behavior is verified through:

- Load-use hazard test cases
- Data forwarding scenarios
- Multi-cycle stall conditions due to the RV32M unit
- Control flow changes from branch instructions

Simulation waveforms are analyzed to confirm:

- Correct stall insertion
- Proper forwarding paths
- No incorrect register updates during hazards

## 5.1.6 Waveform-Based Analysis

Waveforms generated during simulation are used extensively for debugging and validation.

Key signals observed include:

- Program Counter (PC)
- Pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB)
- ALU, M-Unit, and AI Unit outputs
- Memory read/write signals
- Write-back control signals

Waveform analysis confirms correct timing and interaction between pipeline stages

## 5.1.7 Verification Coverage

Although advanced verification frameworks such as UVM are not used in this project, the directed tests provide strong functional coverage for:

- Instruction execution correctness
- Pipeline control mechanisms
- Custom instruction functionality

The verification methodology is sufficient for academic validation and final project submission.

## 5.1.8 Summary

The verification methodology ensures that the processor design is:

- Functionally correct
- Pipeline-safe
- Capable of executing RV32IM and custom AI instructions reliably

Simulation results and waveform analysis confirm that the processor meets the intended design specifications.

## 5.2 Testbench Architecture

The testbench architecture is designed to provide a controlled and repeatable environment for verifying the functional correctness of the proposed **RV32IM pipelined processor with custom AI instructions**. A **SystemVerilog-based testbench** is used to stimulate the processor core, model memory behavior, and validate execution results.

The testbench follows a **non-synthesizable, behavioral architecture** focused on functional verification.

### 5.2.1 Top-Level Testbench Structure

The testbench instantiates the processor core (**MyCoreAI32**) as the Device Under Test (DUT). It connects the DUT to behavioral models of instruction memory and data memory.

Major components of the testbench include:

- Clock generator
- Reset generator
- Instruction memory model
- Data memory model

- Result checking and reporting logic

This structure allows isolation of the processor core from external dependencies while ensuring accurate verification.

### **5.2.2 Clock and Reset Generation**

- A single system clock is generated using an always block.
- The clock period is fixed to simulate synchronous operation.
- An active-low reset signal initializes the processor state.
- Reset is asserted for a few initial clock cycles before normal execution begins.

This ensures deterministic startup and repeatable simulation results.

### **5.2.3 Instruction Memory Model**

The instruction memory is modeled as a simple word-addressable array.

Key characteristics:

- Instructions are loaded from a hexadecimal file (`imem.hex`).
- Instruction fetch is purely combinational.
- Word-aligned access is enforced.

This approach allows easy modification of test programs without changing the testbench code.

### **5.2.4 Data Memory Model**

The data memory is implemented as a behavioral array within the testbench.

Features include:

- Support for load and store operations.
- Write operations synchronized to the clock edge.
- Combinational read access for simplicity.
- Word-level access consistent with RV32I specification.

Data memory is used to store intermediate and final results for verification.

### **5.2.5 Result Observation and Checking**

To verify correctness:

- The processor writes computation results to predefined data memory locations.
- After program execution, the testbench reads these locations.
- Expected values are compared against actual values.
- Pass or fail messages are printed in the simulation transcript.

This mechanism enables quick identification of functional correctness.

### **5.2.6 Debug and Visibility Support**

The testbench supports extensive debugging through:

- Waveform dumping (WLF format)
- Observation of pipeline registers and control signals

- Monitoring of ALU, M-Unit, and AI Unit outputs

Waveform analysis is used to validate pipeline timing, hazard handling, and instruction execution flow.

### 5.2.7 Execution Control

The testbench:

- Allows the processor to run for a fixed number of clock cycles.
- Terminates simulation automatically after execution completes.
- Ensures no manual intervention is required during simulation.

This automation simplifies regression testing.

### 5.2.8 Summary

The testbench architecture provides a robust and flexible environment for validating the processor design. By combining directed instruction tests, memory-based result checking, and waveform analysis, the testbench ensures reliable verification of:

- RV32IM instruction execution
- Custom AI instruction functionality
- Pipeline control and hazard mechanisms

This completes the verification infrastructure of the project.

## 5.3 Directed Test Programs

Directed test programs are used to verify the functional correctness of the proposed **RV32IM pipelined processor with custom AI instruction support**. These programs are carefully written to exercise specific instruction classes, pipeline behavior, and custom extensions in a controlled manner.

Unlike random testing, directed tests focus on **known scenarios with predictable outcomes**, making them ideal for early-stage verification and debugging.

### 5.3.1 Purpose of Directed Testing

The primary objectives of directed test programs are:

- To validate correct execution of RV32I base instructions
- To verify RV32M multiply and divide operations
- To confirm correct functionality of custom AI instructions (VDOT4 and VMAX4)
- To observe pipeline behavior, hazards, and forwarding
- To ensure correct memory read/write operations

Each test program is designed to produce **deterministic results**, which are stored in data memory for verification.

### 5.3.2 Instruction Memory Initialization

All directed tests are encoded as machine instructions and stored in a hexadecimal file ([imem.hex](#)). This file is loaded into the instruction memory model at the start of simulation.

Key characteristics:

- Instructions are word-aligned
- Programs start execution from address [`0x00000000`](#)

- No operating system or bootloader is used
- Programs terminate using an infinite loop or jump instruction

This setup allows full control over instruction sequencing.

### 5.3.3 RV32I Functional Tests

Directed programs include verification of:

- Arithmetic instructions (ADD, SUB, AND, OR, XOR)
- Immediate instructions (ADDI, ANDI)
- Load and store instructions (LW, SW)
- Branch instructions (BEQ, BNE)
- Jump instructions (JAL)

Results of these operations are written to data memory and compared against expected values.

### 5.3.4 RV32M Multiply-Divide Tests

Dedicated test cases are created to verify RV32M instructions, including:

- MUL
- DIV
- REM

These tests validate:

- Correct interaction with the M-Unit
- Proper stall behavior during multi-cycle operations
- Correct result write-back after M-Unit completion

The correctness of these operations is confirmed through memory-mapped result checking.

### 5.3.5 Custom AI Instruction Tests

Special directed tests are written to validate custom AI instructions:

- **VDOT4**: Performs vector dot product on packed 8-bit operands
- **VMAX4**: Computes element-wise maximum of packed vector elements

Test programs:

- Load known vector values into registers
- Execute custom instructions
- Store results to data memory
- Verify expected outputs via simulation transcript and waveform

These tests demonstrate successful integration of AI acceleration into the processor pipeline.

### 5.3.6 Pipeline Hazard and Forwarding Tests

Directed tests are also used to validate pipeline behavior, including:

- Data hazards between dependent instructions
- Forwarding paths from EX and MEM stages
- Load-use hazard stall insertion
- Correct instruction ordering during stalls

Waveform analysis confirms that hazards are correctly handled without data corruption.

### 5.3.7 Pass/Fail Criteria

Each directed test follows a simple validation strategy:

- Expected results are stored in predefined data memory locations
- The testbench compares actual and expected values
- Simulation prints **PASS** or **FAIL** messages in the transcript

This method ensures fast and reliable verification.

### 5.3.8 Summary

Directed test programs play a crucial role in validating the correctness and robustness of the processor design. Through systematic testing of standard instructions, RV32M operations, custom AI instructions, and pipeline hazards, the directed tests establish confidence in the functional behavior of the processor.

## 5.4 Simulation Results and Waveform Analysis

Simulation and waveform analysis were carried out to validate the functional correctness, pipeline behavior, and custom instruction execution of the proposed **5-stage pipelined RV32IM processor with AI acceleration**. All simulations were performed using **QuestaSim**, and waveforms were analyzed to ensure proper signal transitions and timing relationships.

### 5.4.1 Simulation Environment

The processor design was simulated using a self-contained Verilog testbench. The simulation environment includes:

- Clock and reset generation
- Instruction memory initialized using `imem.hex`
- Data memory model for load and store operations
- Monitoring of pipeline control and write-back signals
- Automated pass/fail reporting through the simulation transcript

The simulation runs without any operating system or bootloader, enabling full control over instruction execution.

### 5.4.2 Functional Simulation Results

The simulation results confirm correct execution of:

- **RV32I base instructions**, including arithmetic, logical, load/store, and control flow operations
- **RV32M instructions**, such as MUL, DIV, and REM, executed through the dedicated M-Unit
- **Custom AI instructions (VDOT4 and VMAX4)**, producing correct vector

computation results

All computed values are written to data memory and verified at the end of simulation. The simulation transcript reports **PASS** messages when expected results match actual outputs.

### 5.4.3 Write-Back and Memory Validation

Waveform inspection confirms that:

- Write-back (**WB**) stage correctly updates the register file
- Memory write enable (**dmem\_we**) is asserted only during valid store operations
- Load data is correctly forwarded to the write-back stage
- Results from the M-Unit and AI unit are properly routed to the WB stage

These observations verify correct memory and register interactions.

### 5.4.4 Pipeline Behavior Analysis

Waveform analysis clearly demonstrates the operation of the 5-stage pipeline:

- Instruction flow through IF, ID, EX, MEM, and WB stages is continuous under no-hazard conditions
- Pipeline stalls are correctly inserted during load-use and multi-cycle multiply/divide operations
- Data forwarding paths eliminate unnecessary stalls for dependent instructions
- Program counter redirection is correctly handled during branch and jump instructions

The pipeline operates efficiently while maintaining correctness.

### 5.4.5 M-Unit Execution Timing

Waveforms show that:

- Multiply and divide instructions activate the M-Unit in the EX stage
- The pipeline stalls while waiting for M-Unit completion
- The result is captured only when the M-Unit signals completion
- Normal pipeline execution resumes after result write-back

This confirms correct handshake and integration of the M-Unit.

### 5.4.6 Custom AI Instruction Waveform Analysis

For the AI instructions:

- **VDOT4**: Packed byte multiplication and accumulation occur correctly in a single EX stage
- **VMAX4**: Element-wise maximum values are computed accurately
- Control signals uniquely identify AI instruction execution
- AI results are forwarded to memory or registers without pipeline disruption

Waveform traces validate both functional correctness and timing integrity of AI acceleration.

### 5.4.7 Summary

The simulation results and waveform analysis confirm that the processor design meets all functional requirements. The correct execution of RV32IM instructions, efficient pipeline operation, proper hazard handling, and successful integration of custom AI instructions demonstrate the robustness and effectiveness of the proposed architecture.

## **CHAPTER 6 : SYNTHESIS AND ANALYSIS**

## 6.1 Synthesis Flow using Synopsys Design Vision

Synthesis is a critical phase in the VLSI design flow where a Register Transfer Level (RTL) description is transformed into a gate-level netlist that can be mapped onto a target technology. In this project, the proposed **5-stage pipelined RV32IM RISC-V processor with custom AI instruction acceleration** was synthesized using **Synopsys Design Vision**, a widely used industry-standard logic synthesis tool.

The primary objective of synthesis in this work was to validate the **hardware feasibility**, **logical correctness**, and **structural efficiency** of the designed processor core, including the RV32M multiply-divide unit and custom AI instruction unit.

### 6.1.1 RTL Preparation for Synthesis

The complete processor design was written in **Verilog HDL**, following synthesizable coding guidelines. The RTL hierarchy includes:

- Instruction Fetch, Decode, Execute, Memory, and Write-Back stages
- Register file and immediate generator
- Hazard detection and forwarding unit
- RV32M Multiply-Divide (M-Unit)
- Custom AI Instruction Unit (VDOT4 and VMAX4)

Before synthesis, the RTL code was carefully reviewed to ensure:

- No non-synthesizable constructs (such as delays or initial blocks in design modules)
- Proper clock and reset usage
- Clear module hierarchy and signal definitions

## 6.1.2 Synthesis Tool and Environment

The synthesis was performed using **Synopsys Design Vision** in a Linux-based environment. Design Vision provides both command-line and graphical interfaces to:

- Read and elaborate RTL designs
- Apply timing and design constraints
- Optimize logic for area, speed, and power
- Generate gate-level netlists and reports

The tool was configured to target a **generic standard-cell library** suitable for academic and research-level synthesis analysis.

## 6.1.3 Synthesis Flow Steps

The synthesis process followed a structured flow as described below:

### 1. Reading the RTL Design

All Verilog HDL files corresponding to the processor core were read into Design Vision using the `read_verilog` command. The design was then elaborated to resolve module hierarchies and interconnections.

### 2. Design Elaboration and Linking

The top-level module of the processor was specified, and the design was linked against the target technology library. This step ensured that all referenced cells and components were available for mapping.

### 3. Constraint Definition

Basic design constraints were applied, including:

- Clock definition for the processor
- Input and output timing assumptions

These constraints guide the synthesis tool in optimizing the design for correct timing behavior.

#### **4. Logic Optimization and Mapping**

Design Vision performed logic optimization to minimize redundant logic and improve efficiency. The RTL was then mapped to available standard cells from the target library.

#### **5. Netlist Generation**

A synthesized gate-level netlist was generated, representing the hardware implementation of the processor.

#### **6. Report Generation**

Various synthesis reports were generated to analyze:

- Area utilization
- Cell usage
- Timing characteristics
- Hierarchical module contribution

#### **6.1.4 Synthesis of Custom and RV32M Units**

Special attention was given to the synthesis of:

- **RV32M Multiply-Divide Unit:** Verified for correct handling of multi-cycle operations and control logic.
- **Custom AI Instruction Unit:** Confirmed that the VDOT4 and VMAX4 operations

were synthesized into efficient combinational logic without excessive area overhead.

The synthesis results demonstrate that the addition of AI acceleration introduces **minimal hardware complexity** while significantly enhancing computational capability.

### 6.1.5 Importance of Synthesis Results

The synthesis process confirms that:

- The proposed processor design is **fully synthesizable**
- The architecture is suitable for **real hardware implementation**
- Custom ISA extensions can be integrated efficiently into a RISC-V pipeline

This synthesis step bridges the gap between functional RTL simulation and physical implementation, validating the practicality of the proposed design.

## 6.2 Resource Utilization

Resource utilization analysis provides insight into the hardware cost of the proposed **5-Stage Pipelined RV32IM RISC-V Processor with Custom AI Instruction Acceleration**. Using **Synopsys Design Vision**, the synthesized design was evaluated in terms of logic elements, functional units, and hierarchical module contribution.

The synthesis results indicate that the processor core is **resource-efficient**, despite the inclusion of advanced features such as:

- RV32M multiply-divide extension
- Custom AI acceleration instructions (VDOT4 and VMAX4)
- Hazard detection and forwarding logic

### 6.2.1 Area Breakdown by Functional Blocks

The total area utilization is distributed across the following major modules:

- **Pipeline Datapath (IF, ID, EX, MEM, WB)**

This forms the core of the processor and contributes the largest share of the area due to registers, multiplexers, and arithmetic logic.

- **Register File and Control Logic**

The register file and decode logic occupy moderate area and scale linearly with pipeline complexity.

- **RV32M Multiply-Divide Unit (M-Unit)**

The multiply-divide unit introduces additional arithmetic hardware. However, due to its controlled execution and reuse of logic, the area overhead remains reasonable.

- **Custom AI Instruction Unit**

The AI unit implementing **VDOT4** and **VMAX4** instructions uses compact combinational logic. The synthesis results show that this unit adds **minimal area overhead** while significantly improving computational throughput.

- **Hazard Detection and Forwarding Unit**

This unit consists mainly of comparators and multiplexers and contributes a small fraction of the overall area.

### 6.2.2 Impact of Custom AI Instructions on Area

One of the key objectives of this project was to evaluate whether AI acceleration could be integrated without excessive hardware cost. The synthesis results confirm that:

- The AI instruction unit occupies a **small percentage of total area**

- No significant duplication of datapath resources was required

- The overall design remains scalable and modular

This demonstrates that **custom ISA extensions are a practical approach** for enhancing processor functionality in embedded and AI-oriented applications.

### 6.2.3 Observations

- The overall area utilization is well within acceptable limits for a custom embedded processor.
- Modular design allows independent optimization of RV32M and AI units.
- The synthesis confirms that the proposed architecture is suitable for further physical design stages.



**Fig 5: - Top Level Schematic in Design Vision**

## 6.3 Timing Analysis

Timing analysis is a crucial step to ensure that the processor meets performance requirements under the defined clock constraints. In this project, timing analysis was performed using **Synopsys Design Vision** after synthesis to verify that all critical paths satisfy setup and hold timing constraints.

### 6.3.1 Clock Definition and Timing Constraints

A primary clock was defined for the processor core to represent synchronous operation across all pipeline stages. The clock constraint guided the synthesis tool in optimizing logic paths and balancing combinational delays between pipeline registers.

Basic timing constraints included:

- Clock period specification
- Input and output timing assumptions
- Register-to-register path analysis

### 6.3.2 Critical Path Analysis

The timing report generated by Design Vision identified the **Execute (EX) stage** as the most timing-sensitive part of the pipeline. This is expected due to:

- Arithmetic and logical operations
- RV32M multiply/divide control logic
- Custom AI instruction execution paths

Despite this, the pipeline structure ensures that:

- Long combinational paths are broken by pipeline registers
- Multi-cycle operations (such as division) do not violate timing constraints
- AI instructions execute within allowable timing limits

### 6.3.3 Timing Closure Results

The synthesized design successfully meets the defined timing constraints, indicating:

- No setup time violations
- No hold time violations
- Stable operation at the target clock frequency

The successful timing closure confirms that the processor design is **functionally correct and performance-viable**.

### 6.3.4 Significance of Timing Results

The timing analysis validates that:

- The 5-stage pipelined architecture effectively improves clock performance
- Custom AI and RV32M extensions do not negatively impact timing
- The design is suitable for real-time and compute-intensive embedded applications

## **CHAPTER 7 : RESULTS AND DISCUSSION**

### **7.1 Functional Results**

This section presents the functional validation results of the **5-Stage Pipelined RV32IM RISC-V Processor with Custom AI Instruction Acceleration**. The objective of functional testing was to verify the correctness of instruction execution, pipeline behavior, and custom instruction integration under simulation.

Functional verification was carried out using a **self-developed Verilog testbench** and executed in **QuestaSim**. The test programs were designed to validate both standard RV32IM instructions and the newly introduced custom AI instructions.

#### **7.1.1 Verification of Base RV32I Instructions**

The processor was first validated using base RV32I instructions, including:

- Arithmetic operations (ADD, SUB)
- Logical operations (AND, OR, XOR)

- Immediate operations
- Load and store instructions
- Branch and control-flow instructions

Simulation results confirmed that:

- All instructions produced correct results
- Program Counter (PC) updates occurred as expected
- Register write-back values matched expected outputs
- Memory read and write operations were correctly aligned and synchronized

### **7.1.2 Verification of RV32M Extension (Multiply and Divide Instructions)**

The **RV32M extension** was validated using directed test programs that exercised:

- Multiplication instructions (MUL, MULH)
- Division and remainder operations (DIV, REM)

The functional results demonstrated that:

- The multiply-divide unit correctly executed multi-cycle operations
- Pipeline stalls were correctly inserted during long-latency operations
- Results were written back to the register file without data corruption

- No hazards or incorrect forwarding were observed during M-unit execution

This confirms that the integration of the RV32M extension into the pipeline is **functionally correct and robust**.

### 7.1.3 Verification of Custom AI Instructions

The custom AI acceleration instructions **VDOT4** and **VMAX4** were thoroughly validated using dedicated test cases.

- **VDOT4 Instruction**

This instruction computes the dot product of packed 8-bit elements. Simulation results showed correct accumulation behavior, producing the expected scalar output.

- **VMAX4 Instruction**

This instruction computes the maximum value among packed 8-bit elements. The waveform analysis confirmed correct comparison and selection logic.

The results stored in data memory matched the expected values, demonstrating correct execution and seamless integration with the existing pipeline.

### 7.1.4 Pipeline Hazard Handling Validation

Functional testing also verified the effectiveness of:

- Data forwarding mechanisms
- Load-use hazard detection
- Stall insertion logic

The simulation waveforms clearly showed:

- Correct forwarding paths from EX/MEM and MEM/WB stages

- Single-cycle stalls for load-use hazards
- No incorrect instruction execution due to hazards

### 7.1.5 Waveform-Based Validation

Waveform analysis played a key role in functional verification. Signals observed included:

- Pipeline registers across all stages
- ALU and M-unit control signals
- AI instruction enable signals
- Memory interface signals

The waveforms confirmed that:

- Instructions progressed correctly through pipeline stages
- Multi-cycle operations were properly synchronized
- Custom AI instructions did not interfere with standard instruction flow

## 7.2 Performance Evaluation

The performance evaluation of the proposed **5-Stage Pipelined RV32IM RISC-V Processor** was carried out to assess the efficiency of the pipelined architecture and the impact of integrating multi-cycle arithmetic and custom AI acceleration instructions.

The processor follows a classic **five-stage pipeline** consisting of Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM), and Write-Back (WB).

This pipelined organization allows multiple instructions to be processed simultaneously, significantly improving instruction throughput compared to a single-cycle or multi-cycle non-pipelined design.

### 7.2.1 Instruction Throughput

For standard RV32I arithmetic and logical instructions, the processor achieves:

- **One instruction per clock cycle** after pipeline fill
- Efficient utilization of all pipeline stages
- Minimal performance degradation due to hazards, owing to effective data forwarding and stall mechanisms

This demonstrates that the pipeline is well-balanced and capable of sustaining high throughput under normal workloads.

### 7.2.2 Impact of RV32M Multiply-Divide Unit

The inclusion of the **RV32M extension** introduces multi-cycle operations such as multiplication and division. Performance evaluation showed that:

- Multiply and divide instructions are handled using a dedicated **M-unit**
- Pipeline stalls are inserted only when required, preventing incorrect execution
- Other independent instructions continue execution wherever possible

Although multi-cycle operations increase latency for individual instructions, the overall pipeline efficiency remains high due to controlled stalling and proper handshake mechanisms.

### 7.2.3 Pipeline Efficiency and Hazard Management

Performance measurements from simulation waveforms indicate that:

- Data hazards are resolved using forwarding paths, reducing unnecessary stalls
- Load-use hazards introduce only single-cycle stalls
- Control hazards are handled without pipeline instability

These mechanisms ensure optimal pipeline utilization while maintaining correctness.

## 7.3 AI Instruction Performance Analysis

This section analyzes the performance benefits introduced by the custom AI acceleration instructions **VDOT4** and **VMAX4**, which are specifically designed to optimize data-parallel operations commonly used in AI and signal-processing workloads.

### 7.3.1 Motivation for AI Instruction Acceleration

AI workloads often involve repeated vector operations such as dot products and maximum value computations. Executing these operations using scalar instructions results in:

- Higher instruction count
- Increased execution time
- Greater power consumption

The custom AI instructions address these challenges by enabling **packed data processing** within a single instruction.

### 7.3.2 Performance of VDOT4 Instruction

The **VDOT4 instruction** performs a dot product operation on four packed 8-bit elements in parallel. Performance analysis shows that:

- Multiple arithmetic operations are executed within a single EX-stage operation
- Instruction count is significantly reduced compared to scalar implementations
- Execution latency is minimized without additional pipeline complexity

Simulation results confirm that VDOT4 produces correct results while improving computational efficiency.

### 7.3.3 Performance of VMAX4 Instruction

The **VMAX4 instruction** computes the maximum value among four packed elements. Performance observations indicate that:

- The operation completes in a single execute phase
- Comparison logic is efficiently integrated into the AI unit
- The instruction avoids multiple branch or comparison instructions required in scalar code

This leads to reduced execution cycles and improved performance for reduction-type operations.

### 7.3.4 Overall Impact of AI Acceleration

The inclusion of custom AI instructions results in:

- Reduced instruction count

- Lower execution latency for AI kernels
- Improved pipeline utilization
- Enhanced suitability for AI and DSP workloads

The performance analysis confirms that custom AI instruction acceleration provides a meaningful improvement over conventional scalar execution without compromising pipeline integrity.

## **CHAPTER 8 : CONCLUSION AND FUTURE SCOPE**

### **8.1 Conclusion**

In this project, a **5-Stage Pipelined RV32IM RISC-V Processor with Custom AI Instruction Acceleration** has been successfully designed, implemented, and verified using **Verilog HDL**. The processor adheres to the open-source **RISC-V ISA**, incorporating the **RV32I base instruction set** along with the **RV32M extension** to support multiplication and division operations. In addition, custom AI-oriented instructions were integrated to enhance performance for data-parallel workloads.

The proposed processor architecture employs a classical five-stage pipeline comprising **Instruction Fetch (IF)**, **Instruction Decode (ID)**, **Execute (EX)**, **Memory Access (MEM)**, and **Write-Back (WB)** stages. This pipelined organization significantly improves instruction throughput while maintaining correctness through effective **hazard detection, data forwarding, and stall control mechanisms**.

A dedicated **Multiply-Divide Unit (M-Unit)** was implemented in the Execute stage to support RV32M instructions. The M-unit operates using a controlled multi-cycle execution model with handshake-based stall management, ensuring seamless integration with the pipeline without affecting overall stability.

To address the growing need for AI acceleration in embedded systems, two custom instructions—**VDOT4** and **VMAX4**—were designed and implemented. These instructions enable parallel processing of packed data elements within a single instruction, resulting in reduced instruction count and improved execution efficiency for AI and signal-processing tasks.

Functional verification was carried out using a **custom SystemVerilog testbench**, and extensive simulation was performed using **QuestaSim**. Simulation results and waveform analysis confirmed the correct operation of the pipeline, arithmetic units, and AI acceleration logic. The processor successfully executed both standard RISC-V programs and AI-specific test cases, producing correct and expected outputs.

Furthermore, synthesis was performed using **Synopsys Design Vision**, demonstrating that the design is synthesizable and suitable for hardware implementation. Resource utilization and timing analysis indicated that the processor meets design constraints while offering a balanced trade-off between performance and hardware complexity.

Overall, this project demonstrates a practical and extensible RISC-V processor design that combines **pipeline efficiency**, **multi-cycle arithmetic support**, and **AI-oriented instruction acceleration**, making it well-suited for modern embedded and intelligent computing applications.

## 8.2 Future Enhancements

Although the proposed **5-Stage Pipelined RV32IM RISC-V Processor with Custom AI Instruction Acceleration** has been successfully implemented and verified, several enhancements can be incorporated in the future to further improve its performance, scalability, and applicability.

One important future enhancement is the **implementation of the RISC-V Compressed (C) extension (RV32C)**. This would reduce code size, improve instruction cache utilization, and enhance performance in memory-constrained embedded systems.

The processor can also be extended to support **advanced branch prediction techniques**, such as dynamic branch predictors, to minimize control hazards and further improve pipeline efficiency. Currently, simple branch handling is used, and predictive mechanisms would significantly enhance instruction throughput.

Another potential enhancement is the integration of a **Memory Management Unit (MMU)** with support for **virtual memory (Sv32)**. This would enable the processor to run operating systems and more complex software stacks, increasing its usability in real-world applications.

From an AI acceleration perspective, additional **custom vector and matrix operations** can be introduced to support more complex machine learning workloads. The AI unit can

also be expanded to support **wider data paths or configurable vector lengths**, making it adaptable to different AI algorithms.

The verification framework can be further strengthened by adopting **UVM-based verification**, automated test generation, and compliance testing using the **official RISC-V compliance suite**. This would ensure higher functional coverage and industry-level verification quality.

Finally, the design can be optimized for **low power consumption** and targeted toward **FPGA or ASIC implementation**, including clock gating and power-aware synthesis techniques. These enhancements would make the processor suitable for deployment in energy-efficient embedded and AI-enabled systems

## REFERENCES

1. RISC-V International, *The RISC-V Instruction Set Manual – Volume I: Unprivileged ISA*, Version 2.2, 2019.
2. RISC-V International, *The RISC-V Instruction Set Manual – Volume II: Privileged Architecture*, Version 1.12, 2021.
3. David A. Patterson and John L. Hennessy, *Computer Organization and Design: The Hardware/Software Interface*, 5th Edition, Morgan Kaufmann, 2014.
4. Andrew Waterman and Krste Asanović, *The RISC-V Instruction Set Manual*, University of California, Berkeley.
5. John L. Hennessy and David A. Patterson, *Computer Architecture: A Quantitative Approach*, 6th Edition, Morgan Kaufmann, 2019.
6. Synopsys Inc., *Design Vision User Guide*, Synopsys Documentation, 2023.
7. Clifford E. Cummings, *SystemVerilog Assertions and Functional Coverage*, Sunburst Design, Inc.

8. Stuart Sutherland, Simon Davidmann, and Peter Flake, *SystemVerilog for Design*, 2nd Edition, Springer, 2006.
9. Pong P. Chu, *RTL Hardware Design Using VHDL and Verilog*, Wiley, 2006.
10. IEEE Standard for SystemVerilog – Unified Hardware Design, Specification, and Verification Language, IEEE Std 1800™.
11. Krste Asanović et al., *Instruction Set Extensions for Efficient Machine Learning Acceleration*, UC Berkeley Technical Report.
12. S. Mittal, “A Survey of Techniques for Improving Energy Efficiency in Embedded Computing Systems,” *International Journal of Computer Aided Engineering and Technology*, 2014.

## APPENDIX

### APPENDIX A : SystemVerilog HDL Source Code

This appendix contains the complete Verilog/SystemVerilog source code developed as part of the project. The code implements a **5-stage pipelined RV32IM RISC-V processor** along with **custom AI acceleration instructions**.

The following major modules are included:

- **MyCoreAI32.sv** – Top-level processor module
- **mycoreai32\_pkg.sv** – Package file containing opcode definitions, RV32IM constants, and control enums
- **alu.sv / alu\_sv\_unit.sv** – Arithmetic and logic unit implementation
- **muldiv\_unit.sv** – RV32M Multiply/Divide unit with EX-stage stall handshake
- **ai\_unit.sv** – Custom AI instruction unit implementing **VDOT4** and **VMAX4**

operations

- **hazard\_forward.sv** – Hazard detection and data forwarding logic
- **regfile.sv** – Register file implementation
- **immgen.sv** – Immediate generation logic

All modules were designed using **SystemVerilog** and verified through simulation using **QuestaSim**.

## APPENDIX B : Testbench and Simulation Environment

This appendix includes the verification infrastructure used to validate the processor design.

- **tb\_MyCoreAI32.sv** – Top-level testbench
- Clock and reset generation
- Instruction memory (IMEM) and data memory (DMEM) models
- Directed test programs to verify:
  - RV32I base instructions
  - RV32M multiply and divide instructions
  - Custom AI instructions (VDOT4 and VMAX4)

Simulation results confirmed correct execution of all implemented instructions, including pipeline behavior and hazard resolution.

## APPENDIX C : Instruction Memory Program (imem.hex)

The instruction memory hex file used during simulation is provided in this appendix. It includes directed test programs to verify:

- Arithmetic and logical operations
- Multiply and divide instructions
- AI acceleration instructions producing expected outputs:
  - **VDOT4 → 0x0000000A**
  - **VMAX4 → 0x01020304**

The program also includes store instructions to write results into data memory for validation.

## APPENDIX D : Simulation Waveforms

Waveforms generated using **QuestaSim** are included for analysis and debugging. Key signals observed include:

- Program Counter (PC) progression
- Pipeline stage registers (IF, ID, EX, MEM, WB)
- Hazard stall and forwarding signals
- Multiply/Divide unit handshake signals
- AI unit input operands and output results

These waveforms demonstrate correct pipeline operation, stall handling, and execution of custom AI instructions.

```

# INFO: Using NEW ai_unit.sv (VDOT4/VMAX4 packed-byte version)
# ----- IMEM DUMP [0..15] -----
# IMEM[0] = 0x01020137
# IMEM[1] = 0x30410113
# IMEM[2] = 0x010101b7
# IMEM[3] = 0x10118193
# IMEM[4] = 0x00000013
# IMEM[5] = 0x00000013
# IMEM[6] = 0x0031020b
# IMEM[7] = 0x00402223
# IMEM[8] = 0x0231038b
# IMEM[9] = 0x00702423
# IMEM[10] = 0x00202623
# IMEM[11] = 0x0aa00513
# IMEM[12] = 0x00a02023
# IMEM[13] = 0x0000006f
# IMEM[14] = 0x00000013
# IMEM[15] = 0x00000013
# -----
# CYC=5 PC=0x00000000 IF=0x00000013 stall(lu,m,t)=0,0,0 redir=0
# CYC=6 PC=0x00000004 IF=0x01020137 stall(lu,m,t)=0,0,0 redir=0
# CYC=7 PC=0x00000008 IF=0x30410113 stall(lu,m,t)=0,0,0 redir=0
# CYC=8 PC=0x0000000c IF=0x010101b7 stall(lu,m,t)=0,0,0 redir=0
# CYC=9 PC=0x00000010 IF=0x10118193 stall(lu,m,t)=0,0,0 redir=0
#     WB: x2 <= 0x01020000
# CYC=10 PC=0x00000014 IF=0x00000013 stall(lu,m,t)=0,0,0 redir=0
#     WB: x2 <= 0x01020304
# CYC=11 PC=0x00000018 IF=0x00000013 stall(lu,m,t)=0,0,0 redir=0
#     WB: x3 <= 0x01010000
# CYC=12 PC=0x0000001c IF=0x0031020b stall(lu,m,t)=0,0,0 redir=0
#     WB: x3 <= 0x01010101
# CYC=13 PC=0x00000020 IF=0x00402223 stall(lu,m,t)=0,0,0 redir=0
# CYC=14 PC=0x00000024 IF=0x0231038b stall(lu,m,t)=0,0,0 redir=0
# CYC=15 PC=0x00000028 IF=0x00702423 stall(lu,m,t)=0,0,0 redir=0
#     WB: x4 <= 0x02030405
#     DMEM STORE: addr=0x00000004 data=0x02030405
#     AI: vdot4=1 vmax4=0 rs1=0x01020304 rs2=0x01010101 ai_y=0x0000000a
# CYC=16 PC=0x0000002c IF=0x00202623 stall(lu,m,t)=0,0,0 redir=0
# CYC=17 PC=0x00000030 IF=0x0aa00513 stall(lu,m,t)=0,0,0 redir=0
#     WB: x7 <= 0x0000000a
#     DMEM STORE: addr=0x00000008 data=0x0000000a
# CYC=18 PC=0x00000034 IF=0x00a02023 stall(lu,m,t)=0,0,0 redir=0
#     DMEM STORE: addr=0x0000000c data=0x01020304
# CYC=19 PC=0x00000038 IF=0x0000006f stall(lu,m,t)=0,0,0 redir=0
# CYC=20 PC=0x0000003c IF=0x00000013 stall(lu,m,t)=0,0,0 redir=1

```

```

# DBG: dmem[0] (PASS/FAIL) =0x000000aa
# DBG: dmem[1] (VDOT4)      =0x02030405
# DBG: dmem[2] (VMAX4)       =0x0000000a
# DBG: dmem[3] (expected)    =0x01020304
# PASS: dmem[0]=0x000000aa
# ** Note: $finish : tb/tb_MyCoreAI32.sv(137)
#   Time: 8045 ns Iteration: 1 Instance: /tb_MyCoreAI32
# 1
# Break in Module tb_MyCoreAI32 at tb/tb_MyCoreAI32.sv line 137

```

VSIM 12> |

**Fig 6:- Transcript Output Data**



**Fig 7: - Simulation Data**

## APPENDIX E : Synthesis Reports

This appendix contains synthesis-related information obtained using **Synopsys Design Vision**, including:

- Area utilization report
- Gate-level resource usage
- Timing summary and critical path analysis

The synthesis confirms that the design is suitable for implementation in standard-cell-based ASIC or FPGA flows.



Fig 8: - Gate Level Schematic

## APPENDIX F : Toolchain and Software Environment

The following tools were used during the project:

- **QuestaSim** – Simulation and waveform analysis
- **Synopsys Design Vision** – RTL synthesis and timing analysis
- **RISC-V GCC Toolchain** – Assembly and hex generation
- **Linux Environment** – Development and testing