

# **FPGA-Based Image Processing Pipeline Simulator in C++**

Hardware Accelerator Modeling  
and Simulation Report

## **Author**

Adil Rafiq

December 27, 2025

## Contents

---

|                                                        |          |
|--------------------------------------------------------|----------|
| <b>1 Abstract</b>                                      | <b>2</b> |
| <b>2 Introduction</b>                                  | <b>2</b> |
| <b>3 System Architecture: The Synchronous Pipeline</b> | <b>2</b> |
| 3.1 Cycle-Accurate Simulation Loop . . . . .           | 2        |
| 3.2 Pipeline Stage Definitions . . . . .               | 2        |
| 3.3 Pipeline Data Flow Diagram . . . . .               | 2        |
| 3.4 Ping-Pong Buffering . . . . .                      | 3        |
| <b>4 Modular Design and Memory Modeling</b>            | <b>3</b> |
| 4.1 Object-Oriented Hardware Mapping . . . . .         | 3        |
| 4.2 Templates and Bus Width Flexibility . . . . .      | 3        |
| 4.3 Dynamic Allocation as Physical RAM . . . . .       | 3        |
| <b>5 Mathematical Optimization for Silicon</b>         | <b>3</b> |
| 5.1 Floating-Point Reference . . . . .                 | 3        |
| 5.2 Fixed-Point Hardware Accuracy . . . . .            | 4        |
| <b>6 DSP Architecture and Filter Hierarchy</b>         | <b>4</b> |
| 6.1 Pre-defined Filter Kernels . . . . .               | 4        |
| 6.2 The Sobel Block . . . . .                          | 4        |
| <b>7 Design Constraints and Simplifications</b>        | <b>4</b> |
| <b>8 Simulation Environment and Results</b>            | <b>4</b> |
| 8.1 Linux Setup . . . . .                              | 4        |
| 8.2 Code Implementation . . . . .                      | 5        |
| 8.3 Execution Outputs . . . . .                        | 5        |
| 8.4 Results and Comparisons . . . . .                  | 7        |
| <b>9 Conclusion</b>                                    | <b>8</b> |

## 1 Abstract

---

This report provides a comprehensive technical overview of a C++ based simulator designed to model an FPGA-based image processing pipeline. It bridges the gap between high-level algorithmic development and low-level Register Transfer Level (RTL) implementation. Key features include the modeling of synchronous pipeline latches, fixed-point arithmetic optimization, and hardware-software co-design. This system serves as a model for verifying the accuracy of hardware implementation.

## 2 Introduction

---

Software modeling is a critical requirement before physical implementation in the domain of IC design and FPGA development. This simulator enforces the spatial and temporal constraints of ICs, specifically modeling data movement, resource utilization, and the synchronous nature of digital logic.

## 3 System Architecture: The Synchronous Pipeline

---

The simulator is designed as a **Synchronous 4-Stage Pipeline**. In a physical FPGA, all logic gates evaluate in parallel every clock cycle.

### 3.1 Cycle-Accurate Simulation Loop

A common challenge in software simulation is "race conditions" where data propagates through a chain in a single software loop. To resolve this, the simulator uses a central loop acting as a system clock. Within this loop, stages are executed in **Reverse Order** (Stage 4 to Stage 1).

By processing the end of the pipeline first, we ensure that data produced by Stage  $N$  in Cycle  $X$  is only available to Stage  $N + 1$  in Cycle  $X + 1$ . This perfectly replicates the behavior of physical hardware registers (latches).

### 3.2 Pipeline Stage Definitions

- **Stage 1: Frame Reader:** Simulates the hardware physical layer reading from host memory into an input register (`reg_RawData`).
- **Stage 2: ISP Block:** Consumes `reg_RawData`, performs color conversion, and latches the result into `reg_GrayData`.
- **Stage 3: DSP Engine:** Processes grayscale data using spatial filters and latches the result into `reg_ProcessedData`.
- **Stage 4: Frame Writer:** Consumes `reg_ProcessedData` and writes the final frame back to memory.

### 3.3 Pipeline Data Flow Diagram



Figure 1: FPGA-based image processing pipeline.

### 3.4 Ping-Pong Buffering

In hardware convolution pipelines, a processing unit must never read from and write to the same memory region in a single cycle. The filters require access to neighboring pixels; overwriting input data prematurely would corrupt the computation.

To address this constraint, the simulator implements a **ping-pong buffering scheme**. Two frame buffers are alternately assigned as input and output to the DSP engine. After each filter operation, buffer roles are swapped, ensuring that every convolution stage reads from a stable data source and writes results to a separate memory region.

This approach minimizes memory usage while preserving correctness and mirrors the double-buffering strategy commonly used in FPGA image pipelines and DMA-based accelerators.

---

## 4 Modular Design and Memory Modeling

### 4.1 Object-Oriented Hardware Mapping

The project uses a hybrid of `structs` and `classes` to mimic hardware components:

- **Structs (Data Packets):** Types like `Pixel` and `Kernel` are implemented as `structs`. In hardware, these represent wires or data packets that carry signals across the bus.
- **Classes (IP Cores):** Modules like `ColorConverter` and `ConvolutionEngine` are implemented as `classes`. These represent independent Intellectual Property (IP) cores. They encapsulate the logic and internal state required to transform data.

### 4.2 Templates and Bus Width Flexibility

The `FrameBuffer` class uses C++ **Templates**. In hardware, memory controllers are often generic. By using `FrameBuffer<T>`, we simulate a unified memory interface that can handle different bus widths—such as 24-bit RGB or 8-bit Grayscale—using the same underlying logic.

### 4.3 Dynamic Allocation as Physical RAM

To simulate DDR RAM, the system utilizes `malloc` and `free`.

- `malloc`: This simulates the **Memory Arbiter** reserving a contiguous block of physical addresses. The resulting pointer acts as the **Base Address Register** used by DMA engines.
- `free`: This simulates the release of a memory bank or the draining of a FIFO after a frame has been successfully transmitted.

---

## 5 Mathematical Optimization for Silicon

The simulator supports two distinct modes of calculation, toggled via preprocessor macros (`USE_FLOAT` or `USE_FIXED_POINT`).

### 5.1 Floating-Point Reference

In floating-point mode, the ISP and DSP blocks use standard IEEE-754 math. This serves as a reference to determine the maximum possible image quality. Requires more complex calculations and Floating-Point Units (FPUs).

$$Y = 0.299 \cdot R + 0.587 \cdot G + 0.114 \cdot B \quad (1)$$

## 5.2 Fixed-Point Hardware Accuracy

In fixed-point mode, the system uses integer coefficients scaled by 256 ( $2^8$ ). The division is replaced by a right shift, which only involves bit-rewiring in hardware.

$$Y = (77 \cdot R + 150 \cdot G + 29 \cdot B) \gg 8 \quad (2)$$

## 6 DSP Architecture and Filter Hierarchy

---

The **Convolution Engine** is the main complex block in the system, managing multiple filters through a unified 3x3 Multiply-Accumulate (MAC) structure.

### 6.1 Pre-defined Filter Kernels

Filters are stored in `kernel.h` as static coefficient matrices. These represent the hard-wired constants in the hardware's look-up tables.

- **Box Blur (Always-On):** This filter is implemented as a mandatory pre-processing step. In real-world sensors, high-frequency noise is common; the Box Blur acts as a "noise floor" reduction unit, smoothing the signal before complex enhancement.
- **Gaussian and Sharpen:** These two filters share the same internal **Convolution Block**. Since both are 3x3 spatial operations, they utilize the same MAC logic units. The only difference is the coefficients loaded from the registers, demonstrating a reconfigurable hardware design.

### 6.2 The Sobel Block

Unlike the standard filters, **Sobel Edge Detection** is implemented in a separate logic block within the engine. This is because Sobel requires dual kernels ( $G_x$  and  $G_y$ ) to be calculated simultaneously followed by a magnitude calculation:

$$|G| \approx |G_x| + |G_y| \quad (3)$$

## 7 Design Constraints and Simplifications

---

To maintain clarity and focus, several real-world hardware complexities were intentionally abstracted:

- Line buffers and sliding window generators are modeled implicitly rather than cycle-by-cycle.
- Memory access latency is assumed to be uniform and deterministic.
- Control logic for arbitration and interrupts is not modeled.

These simplifications allow the simulator to emphasize architectural correctness and numerical accuracy while remaining extensible for future enhancements.

## 8 Simulation Environment and Results

---

### 8.1 Linux Setup

The project was developed and tested on an Ubuntu 24.04 LTS environment.

## 8.2 Code Implementation

The following figures provide snippets of the main source code and the makefile build system.



Figure 2: Linux environment and project directory.

```
Dec 19 16:29
Terminal

// --- STAGE 3: DSP ACCELERATOR (Convolution) ---
if (reg_GrayData != nullptr) {
    #ifndef DEBUG
    std::cout << " [STG 3] Running Filter Pipeline..." << std::endl;
    #endif
    int w = reg_GrayData->getWidth();
    int h = reg_GrayData->getHeight();

    // Ping-pong buffer management within the accelerator
    FrameBuffer<GrayPixel*> src = reg_GrayData;
    FrameBuffer<GrayPixel*>* dst = new FrameBuffer<GrayPixel>(w, h);

    // Base Filtering (Mandatory Box Blur)
    dsp.process(src, dst, k_blur);
    std::swap(src, dst);

    // Extended Filtering
    if (enable_gaussian) {
        dsp.process(src, dst, k_gaussian);
        std::swap(src, dst);
    }
    if (enable_sharpen) {
        dsp.process(src, dst, k_sharpen);
        std::swap(src, dst);
    }
    if (enable_sobel) {
        dsp.processSobel(src, dst);
        std::swap(src, dst);
    }
    delete dst; // Cleanup the swap buffer
}
```

Figure 3: Main code (main.cpp).

```
Dec 19 16:26
Terminal

# Makefile for Hardware Accelerator Simulator
# Compiler and Flags
CXX = g++
CXXFLAGS = -std=c++11 -Wall
# Target Executable Name
TARGET = pipelined
# Source Files - Includes all .cpp files
SRCS = pipelined.cpp frame_reader.cpp frame_writer.cpp color_converter.cpp convolution.cpp
# Build Rules
all: $(TARGET)

$(TARGET): $(SRCS)
    @echo "Building Hardware Simulator..."
    $(CXX) $(CXXFLAGS) -o $(TARGET) $(SRCS)
    @echo "Build Complete. Run ./$(TARGET)"

# Debug Build Rule
debug: CXXFLAGS += -DDEBUG
debug: clean $(TARGET)

# Fixed-Point Build Rule
fixed: CXXFLAGS += -DUSE_FIXED_POINT
fixed: clean $(TARGET)

# Floating-Point Build Rule
float: CXXFLAGS += -DUSE_FLOAT
float: clean $(TARGET)
```

Figure 4: Project Makefile.

## 8.3 Execution Outputs

The simulator was verified across three primary build modes.

```
adil@adil-VMware-Virtual-Platform:~/cpp_tasks/ha$ cd cpp_tasks/ha
adil@adil-VMware-Virtual-Platform:~/cpp_tasks/ha$ make
Building Hardware Simulator...
g++ -std=c++11 -Wall -o ha main.cpp frame_reader.cpp frame_writer.cpp color_converter.cpp convolution.cpp
Build Complete. Run ./ha
adil@adil-VMware-Virtual-Platform:~/cpp_tasks/ha$ ./ha
== C++ Hardware Accelerator Model ==
Usage: ./ha [options] <file1.bmp> <file2.bmp> ...
Options:
  -gaussian  Apply Gaussian Blur
  -sharpen    Apply Sharpening
  -sobel      Apply Sobel Edge Detection

Note: Box Blur is always applied as the base filter.
adil@adil-VMware-Virtual-Platform:~/cpp_tasks/ha$ ./ha -gaussian -sobel lena.bmp blackbuck.bmp
PGA Pipeline Initialized...
[MODE] FIXED-POINT MODE
[CONF] Processing 2 frame(s).
[CONF] Box Blur: ALWAYS ON
[CONF] Gaussian: ENABLED
[CONF] Sharpen: DISABLED
[CONF] Sobel: ENABLED

== Simulation Complete ==
Total Clock Cycles: 5
Frames Processed: 2
Results saved!
adil@adil-VMware-Virtual-Platform:~/cpp_tasks/ha$
```

Figure 5: Normal (Fixed-Point) execution output.

```
adil@adil-VMware-Virtual-Platform:~/cpp_tasks/ha$ rm -f ha *.o output_*.bmp input_*.bmp
adil@adil-VMware-Virtual-Platform:~/cpp_tasks/ha$ make debug
rm -f ha *.o output_*.bmp input_*.bmp
Building Hardware Simulator...
g++ -std=c++11 -Wall -DDEBUG -o ha main.cpp frame_reader.cpp frame_writer.cpp color_converter.cpp convolution.cpp
Build Complete. Run ./ha
adil@adil-VMware-Virtual-Platform:~/cpp_tasks/ha$ ./ha -sharpen lena.bmp
PGA Pipeline Initialized...
[MODE] FIXED-POINT MODE
[CONF] Processing 1 frame(s).
[CONF] Box Blur: ALWAYS ON
[CONF] Gaussian: DISABLED
[CONF] Sharpen: ENABLED
[CONF] Sobel: DISABLED

--- CLK Cycle 0 ---
[STG 1] Loading lena.bmp
Input Detected: 512x512 (24-bit)
File loaded successfully into RAM.

--- CLK Cycle 1 ---
[STG 2] Converting RGB -> Gray
[ISP] Fixed-Point Conversion (RGB -> Gray)...
[ISP] Conversion Complete.

--- CLK Cycle 2 ---
[STG 3] Running Filter Pipeline
[DSP] Fixed-Point Convolution...
[DSP] Fixed-Point Convolution...

--- CLK Cycle 3 ---
[STG 4] Writing output_0.bmp
Output Writer: Saved output_0.bmp
```

Figure 6: Execution with debug logging enabled.



```

adil@adil-VMware-Virtual-Platform:~/cpp_tasks/ha$ make float
rm -f ha *.o output_*.bmp input_*.bmp
Building Hardware Simulator...
g++ -std=c++11 -Wall -DUSE_FLOAT -o ha main.cpp frame_reader.cpp frame_writer.cpp color_converter.cpp convolution.cpp
Build Complete. Run ./ha
adil@adil-VMware-Virtual-Platform:~/cpp_tasks/ha$ ./ha -gaussian -sharpen -sobel blackbuck.bmp
PGA Pipeline Initialized...
[MODE] FLOATING-POINT MODE
[CONF] Processing 1 frame(s).
[CONF] Box Blur: ALWAYS ON
[CONF] Gaussian: ENABLED
[CONF] Sharpen: ENABLED
[CONF] Sobel: ENABLED

==== Simulation Complete ====
Total Clock Cycles: 4
Frames Processed: 1
Results saved!
adil@adil-VMware-Virtual-Platform:~/cpp_tasks/ha$ make clean
rm -f ha *.o output_*.bmp input_*.bmp
adil@adil-VMware-Virtual-Platform:~/cpp_tasks/ha$ make fixed
rm -f ha *.o output_*.bmp input_*.bmp
Building Hardware Simulator...
g++ -std=c++11 -Wall -DUSE_FIXED_POINT -o ha main.cpp frame_reader.cpp frame_writer.cpp color_converter.cpp convolution.cpp
Build Complete. Run ./ha
adil@adil-VMware-Virtual-Platform:~/cpp_tasks/ha$ ./ha -gaussian -sharpen -sobel blackbuck.bmp
PGA Pipeline Initialized...
[MODE] FIXED-POINT MODE
[CONF] Processing 1 frame(s).
[CONF] Box Blur: ALWAYS ON
[CONF] Gaussian: ENABLED
[CONF] Sharpen: ENABLED

```

Figure 7: Floating-Point reference execution.

## 8.4 Results and Comparisons

Two input images, `lena.bmp` and `blackbuck.bmp`, were used for verification of output.



Figure 8: Comparison of input and output of `lena.bmp` for Sharpen filter.



Figure 9: Comparison of input and output of `blackbuck.bmp` for Gaussian and Sobel filters.



Figure 10: Comparison between Fixed-Point and Floating-Point results.

## 9 Conclusion

---

The FPGA Pipeline Simulator provides a robust framework for hardware-software co-design. By enforcing synchronous loops, template-based memory management, and dual-mode arithmetic, the system accurately predicts hardware performance. This documentation serves as the final specification for the hardware implementation phase.