

HANOI UNIVERSITY OF SCIENCE AND  
TECHNOLOGY  
FACULTY OF MATHEMATICS AND  
INFORMATICS



# CS3365 - Introduction of Computer Organization

Design and Simulation of  
a Single-Qubit Hardware Accelerator  
Using Fixed-Point  
Arithmetic

Program: Troy IT

Lecturer: Assoc Prof.Nguyen Dinh Han

Tran Vu Gia Huy  
Tran Ngoc Khoa  
Nguyen Son Tung  
Nguyen Truong Son  
Mai Tran Hung

**Hanoi – 2026**

**CS 3365 – Computer Organization and Architecture**

# **Design and Simulation of a Single-Qubit Hardware Accelerator Using Fixed-Point Arithmetic**

## **Authors:**

Nguyen Son Tung  
Nguyen Truong Son  
Tran Vu Gia Huy  
Tran Ngoc Khoa  
Mai Tran Hung

## **Instructor:**

Nguyen Dinh Han

January 14, 2026

# Contents

|                                                                  |           |
|------------------------------------------------------------------|-----------|
| <b>Abstract</b>                                                  | <b>3</b>  |
| <b>1 Introduction</b>                                            | <b>4</b>  |
| <b>2 Theoretical Foundations of Quantum Emulation</b>            | <b>6</b>  |
| 2.1 Real-Valued Quantum Mechanics . . . . .                      | 6         |
| 2.1.1 Justification for Real Amplitudes . . . . .                | 6         |
| 2.2 Fixed-Point Arithmetic and Q-Notation . . . . .              | 6         |
| 2.2.1 The Q8.8 Format . . . . .                                  | 6         |
| 2.2.2 Quantization Error and Numerical Decoherence . . . . .     | 7         |
| 2.3 Fundamental Gate Operations . . . . .                        | 8         |
| 2.3.1 Pauli-X (NOT Gate) . . . . .                               | 8         |
| 2.3.2 Pauli-Z (Phase Flip) . . . . .                             | 8         |
| 2.3.3 Hadamard (Superposition) . . . . .                         | 8         |
| <b>3 System Architecture</b>                                     | <b>10</b> |
| 3.1 High-Level Datapath Analysis . . . . .                       | 10        |
| 3.1.1 1. The Quantum State Register (QSR) . . . . .              | 10        |
| 3.1.2 2. The Decoder Unit . . . . .                              | 10        |
| 3.1.3 3. The Quantum ALU (Execution Units) . . . . .             | 10        |
| 3.1.4 4. Feedback and Write-Back . . . . .                       | 11        |
| 3.2 Instruction Set Architecture (ISA) . . . . .                 | 11        |
| 3.3 Pipeline Strategy and Timing . . . . .                       | 12        |
| <b>4 Hardware Implementation of the Single-Qubit Accelerator</b> | <b>13</b> |
| 4.1 Numerical Representation: Q8.8 Fixed-Point Format . . . . .  | 13        |
| 4.2 Simulation and Verification Utilities . . . . .              | 13        |
| 4.3 Control Logic and State Machine . . . . .                    | 14        |
| <b>5 Quantum Gate Modules</b>                                    | <b>16</b> |
| 5.1 Simulation Testbench . . . . .                               | 18        |
| 5.2 Simulation Results . . . . .                                 | 21        |
| <b>6 Booth Multiplier Design</b>                                 | <b>23</b> |
| 6.1 Algorithm Selection: Shift-and-Add vs. Booth . . . . .       | 23        |
| 6.2 Design and Operational Analysis . . . . .                    | 23        |
| 6.2.1 Register Definitions . . . . .                             | 23        |
| 6.2.2 Operational Logic . . . . .                                | 23        |
| 6.3 Hardware Implementation . . . . .                            | 24        |
| 6.3.1 Booth Multiplier Module . . . . .                          | 24        |
| 6.3.2 Testbench Simulation . . . . .                             | 25        |
| 6.4 Simulation Results . . . . .                                 | 27        |
| <b>7 Fixed-Point Arithmetic Units: Adder and Subtractor</b>      | <b>28</b> |
| 7.1 Fixed-Point Subtractor . . . . .                             | 28        |
| 7.2 Fixed-Point Adder . . . . .                                  | 29        |
| <b>References</b>                                                | <b>31</b> |

|                                            |           |
|--------------------------------------------|-----------|
| <b>A Appendix: Verilog Source Code</b>     | <b>33</b> |
| A.1 Quantum Accelerator Core . . . . .     | 33        |
| A.2 Booth Multiplier . . . . .             | 41        |
| A.3 Fixed-Point Arithmetic Units . . . . . | 43        |
| <b>B Contribution Summary</b>              | <b>45</b> |

## Abstract

This report presents the design and simulation of a single-qubit hardware accelerator implemented on an FPGA using fixed-point arithmetic. To reduce hardware complexity while preserving fundamental quantum behavior, the system employs real-valued probability amplitudes represented in 16-bit Q8.8 format. The accelerator supports core single-qubit quantum gates, including Pauli-X, Pauli-Z, and Hadamard, implemented using simple arithmetic and routing logic. The proposed architecture consists of a quantum state register, an instruction decoder, and a dedicated quantum arithmetic logic unit (QALU). Simulation results demonstrate that fixed-point FPGA-based quantum emulation provides an efficient and deterministic platform for studying quantum gate behavior, numerical precision effects, and hardware-level control without requiring physical quantum hardware.

## 1 Introduction

The computational landscape is currently navigating a pivotal transition point, arguably the most significant since the invention of the integrated circuit. For decades, the semiconductor industry has marched to the cadence of Moore’s Law, an observation-turned-self-fulfilling-prophecy that predicted the doubling of transistor density approximately every two years. This exponential scaling fueled the digital revolution, enabling everything from the personal computer to the modern era of artificial intelligence. However, as feature sizes approach the atomic scale, the physical limitations of silicon are becoming undeniable. Quantum effects such as electron tunneling, which were once negligible nuisances, have now become dominant sources of leakage current and reliability failure. Furthermore, the thermodynamic costs of removing heat from these ultra-dense circuits are imposing a “power wall” that stifles further clock frequency increases.

In this context, Quantum Computing (QC) has emerged not merely as a faster alternative, but as a fundamentally different paradigm for information processing. By harnessing the principles of superposition and entanglement, quantum computers promise to solve problems that are computationally intractable for classical von Neumann architectures, such as integer factorization, chemical simulation, and complex optimization. However, the physical realization of quantum hardware is fraught with immense engineering challenges. Qubits, the fundamental units of quantum information, are notoriously fragile. They require extreme isolation from environmental noise—often necessitating cryogenic temperatures near absolute zero—to maintain coherence. This fragility, combined with the high cost and complexity of current Noisy Intermediate-Scale Quantum (NISQ) devices, creates a significant barrier to entry for researchers, students, and system architects who wish to explore quantum algorithms and control logic.

Consequently, there is a pressing need for accessible, robust, and deterministic platforms that can emulate quantum behavior without the overhead of physical quantum hardware. While software simulators running on classical CPUs or GPUs are widely used, they often fail to capture the precise timing characteristics and hardware-level concurrency required for developing real-time quantum control systems. This report presents a detailed analysis and design specification for a Single-Qubit Hardware Accelerator, a specialized digital circuit implemented on a Field-Programmable Gate Array (FPGA).

The proposed design distinguishes itself through a deliberate architectural choice: the use of 16-bit Signed Fixed-Point Arithmetic (Q8.8 format) and the restriction to real-valued probability amplitudes. While general quantum mechanics operates in a complex Hilbert space, a significant subset of quantum information tasks can be effectively modeled using real numbers. This simplification dramatically reduces the hardware resource footprint—eliminating the need for complex multipliers and floating-point units—while preserving the core quantum phenomenon of superposition and interference.

The architecture, as visualized in the system diagram and detailed in this report, features a pipelined datapath where a central Decoder orchestrates operations across specialized functional units for Hadamard, Pauli-X, and Pauli-Z transformations. The design leverages the FPGA’s inherent parallelism to execute these “quantum” operations using efficient digital logic: adders, subtractors, and hardware multipliers. By mapping the continuous mathematics of quantum mechanics onto the discrete logic of an FPGA, this project serves as a foundational “digital twin,” enabling the rigorous study of quantization errors, pipeline hazards, and control latencies that will define the future of hybrid classical-quantum systems.

This report is structured to provide an exhaustive examination of the accelerator. We begin with the Theoretical Background, establishing the mathematical validity of real-valued quantum

mechanics and the numerical properties of fixed-point arithmetic. We then proceed to System Architecture, dissecting the provided block diagram and the flow of data through the Decoder and Quantum ALU. Finally, the Hardware Implementation section details the translation of these concepts into Verilog HDL, analyzing the resource utilization and timing characteristics that make this approach a viable path toward scalable quantum emulation.

## 2 Theoretical Foundations of Quantum Emulation

The design of a hardware accelerator for quantum emulation requires a synthesis of two distinct theoretical domains: the linear algebra of quantum mechanics and the digital signal processing (DSP) theory of finite-precision arithmetic. This section establishes the rigorous mathematical framework necessary to justify the architectural decisions, specifically the use of real-valued amplitudes and fixed-point number representations.

### 2.1 Real-Valued Quantum Mechanics

Standard quantum mechanics is formulated over the field of complex numbers,  $C$ . A single qubit state  $|\psi\rangle$  is represented as a unit vector in a two-dimensional complex Hilbert space  $\mathcal{H} \cong C^2$ :

$$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle \quad (1)$$

where  $\alpha, \beta \in C$  and satisfy the normalization condition  $|\alpha|^2 + |\beta|^2 = 1$ . The relative phase between  $\alpha$  and  $\beta$  is crucial for phenomena such as interference. However, for the specific purpose of this hardware accelerator, we adopt a real-valued formulation.

#### 2.1.1 Justification for Real Amplitudes

The restriction of probability amplitudes to the field of real numbers  $R$  is not merely a hardware simplification but a theoretically sound approximation for a wide class of quantum algorithms. Research indicates that standard quantum computing (BQP) is computationally equivalent to quantum computing over real amplitudes, provided an additional qubit is available to simulate the imaginary component. Furthermore, for single-qubit operations involving the stabilizer group (Pauli gates, Hadamard), real amplitudes are sufficient to demonstrate superposition and interference.

In this real-valued model, the state vector  $|\psi\rangle$  resides in  $R^2$ . Geometrically, this restricts the qubit state from the entire surface of the Bloch sphere to a single great circle (specifically, the circle lying in the xz-plane). The coefficients  $\alpha$  and  $\beta$  represent signed real numbers. The normalization condition simplifies to  $\alpha^2 + \beta^2 = 1$ . This implies that the state can be parameterized by a single angle  $\theta$ :

$$|\psi(\theta)\rangle = \cos(\theta)|0\rangle + \sin(\theta)|1\rangle \quad (2)$$

This reduction has profound hardware implications. A complex multiplication  $(a + bi)(c + di)$  requires four real multiplications and two real additions. A real multiplication requires only one of each. By constraining the system to  $R$ , we reduce the arithmetic intensity of the emulator by approximately 75%.

### 2.2 Fixed-Point Arithmetic and Q-Notation

To implement continuous real values on digital hardware without the area and power penalty of floating-point units (FPU), **Fixed-Point Arithmetic** is employed. This method represents fractional numbers using scaled integers.

#### 2.2.1 The Q8.8 Format

The specific format chosen for this design is Q8.8 (or more formally, a 16-bit signed Two's Complement format with 8 fractional bits). The notation  $Q_i.f$  describes a number with  $i$  integer bits and  $f$  fractional bits.

- **Total Width:** 16 bits.
- **Sign Bit:** 1 bit (MSB).
- **Integer Bits:** 7 bits (derived from  $16 - 1 - 8$ ).
- **Fractional Bits:** 8 bits.

The value  $V$  represented by a 16-bit integer  $N$  is given by:

$$V = N \times 2^{-8} \quad (3)$$

Table 1: Characteristics of the Q8.8 Format

| Parameter               | Value                          | Description                                          |
|-------------------------|--------------------------------|------------------------------------------------------|
| Resolution ( $\Delta$ ) | $2^{-8} \approx 0.00390625$    | The smallest non-zero difference between two values. |
| Maximum Value           | $2^7 - 2^{-8} \approx 127.996$ | Limits the dynamic range.                            |
| Minimum Value           | $-2^7 = -128.0$                | Asymmetry due to Two's Complement representation.    |
| Dynamic Range           | $\approx 48 \text{ dB}$        | $20 \log_{10}(2^{16})$ .                             |

While Q1.15 is often preferred for normalized quantum states (where amplitudes are  $\alpha, \beta \in [-1, 1]$ ), the choice of Q8.8 in this design suggests a trade-off favoring dynamic range. This is particularly useful during intermediate calculations within the Hadamard gate, where values like  $\alpha + \beta$  can temporarily exceed unity (up to  $\sqrt{2} \approx 1.414$ ), and potentially accumulate larger values if not strictly normalized at every step. The Q8.8 format provides ample "headroom" (integer bits) to prevent overflow during these intermediate additions, simplifying the saturation logic.

## 2.2.2 Quantization Error and Numerical Decoherence

A critical aspect of emulating quantum systems on digital hardware is the management of quantization noise. In a physical quantum computer, noise arises from thermal fluctuations and electromagnetic interference (decoherence). In an FPGA emulator, "noise" arises from the finite precision of the bit representation.

When a true real value  $x$  is mapped to Q8.8, it undergoes a non-linear operation  $\mathcal{Q}(x)$ :

$$\mathcal{Q}(x) = \text{round}(x \cdot 2^8) \cdot 2^{-8} \quad (4)$$

The error term  $e = \mathcal{Q}(x) - x$  is the quantization error. Assuming ideal rounding, this error is bounded by half the Least Significant Bit (LSB):

$$|e| \leq \frac{\Delta}{2} = 2^{-9} \approx 0.00195 \quad (5)$$

If we assume the error is uniformly distributed across this interval, the error variance (noise power) is:

$$\sigma_e^2 = \frac{\Delta^2}{12} = \frac{2^{-16}}{12} \approx 1.27 \times 10^{-6} \quad (6)$$

This numerical noise propagates through the quantum gates. For a Hadamard transformation involving matrix multiplication, the errors from the input amplitudes combine with the quantization errors of the matrix coefficients ( $1/\sqrt{2}$ ). This accumulation of error leads to a phenomenon we can term *Numerical Decoherence*.

The fidelity of the emulated state  $|\psi_{emu}\rangle$  compared to the ideal state  $|\psi_{ideal}\rangle$  will degrade exponentially with circuit depth  $D$ , scaling as  $O(2^{n-f})$ , where  $n$  is the number of qubits (1 in this case) and  $f$  is the fractional bit width (8). This theoretical bound sets a "horizon" on the simulation: the maximum number of gates that can be executed before the state vector becomes statistically indistinguishable from random noise.

## 2.3 Fundamental Gate Operations

The accelerator supports a universal set of single-qubit gates, defined by their action on the real vector  $\begin{bmatrix} \alpha \\ \beta \end{bmatrix}$ .

### 2.3.1 Pauli-X (NOT Gate)

$$X = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \quad (7)$$

**Transformation:**

$$\begin{aligned} \alpha' &= \beta \\ \beta' &= \alpha \end{aligned}$$

This is a simple data swap, requiring no arithmetic logic, only routing resources.

### 2.3.2 Pauli-Z (Phase Flip)

$$Z = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix} \quad (8)$$

**Transformation:**

$$\begin{aligned} \alpha' &= \alpha \\ \beta' &= -\beta \end{aligned}$$

This requires a sign inversion (Negation) on the  $\beta$  component. In Two's Complement,  $-x = (\sim x) + 1$ .

### 2.3.3 Hadamard (Superposition)

$$H = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \quad (9)$$

**Transformation:**

$$\begin{aligned} \alpha' &= \frac{1}{\sqrt{2}}(\alpha + \beta) \\ \beta' &= \frac{1}{\sqrt{2}}(\alpha - \beta) \end{aligned}$$

This is the most computationally intensive operation in the single-qubit set. It requires:

- Two Additions/Subtractions ( $\alpha + \beta$ ,  $\alpha - \beta$ ).
- Two Multiplications by the constant  $k \approx 0.70710678$ .
- Implicit truncation/rounding to fit the result back into 16 bits.

These operations form the theoretical functional requirements for the System Architecture described in the next section.

## 3 System Architecture

The architectural design of the Single-Qubit Hardware Accelerator is derived directly from the functional requirements of the quantum gates and the constraints of the Q8.8 fixed-point format. The system is organized as a specialized Arithmetic Logic Unit (ALU) coupled with a Decoder and State Registers, forming a streamlined processing pipeline.

### 3.1 High-Level Datapath Analysis

The system diagram provided illustrates a unidirectional data flow with feedback, characteristic of a synchronous digital machine. The architecture can be decomposed into four primary stages: State Storage, Instruction Decoding, Execution (The Quantum ALU), and Write-Back.

#### 3.1.1 1. The Quantum State Register (QSR)

At the heart of the system (labeled "Register" on the left and right of the diagram) is the storage for the quantum state vector.

- **Structure:** The QSR consists of two 16-bit registers, REG\_ALPHA and REG\_BETA.
- **Input:** On the left side, the register feeds the current state into the combinatorial logic.
- **Output:** On the right side, the updated state is captured and fed back (bottom path) to the input for the next clock cycle.
- **Initialization:** Upon a global RESET signal, the QSR is initialized to the ground state  $|0\rangle$ , which corresponds to  $\alpha = 1.0$  (integer 0x0100 in Q8.8) and  $\beta = 0.0$  (0x0000).

#### 3.1.2 2. The Decoder Unit

The "Decoder" block serves as the control unit for the accelerator. It receives an opcode from an external instruction memory (not shown) and generates the control signals that activate specific paths within the ALU.

- **Functionality:** It operates as a 1-to-N demultiplexer control. Based on the instruction (e.g., OP\_H, OP\_X, OP\_Z), it enables the specific logic gates associated with that operation.
- **Control Signals:**
  - ENABLE\_H: Activates the Hadamard path.
  - ENABLE\_X: Activates the Pauli-X path.
  - ENABLE\_Z: Activates the Pauli-Z path.
  - WE (Write Enable): Controls the update of the output register.

#### 3.1.3 3. The Quantum ALU (Execution Units)

The diagram explicitly details three parallel execution paths, corresponding to the three supported gates. This "parallel-by-function" design allows for simplified control logic, as data flows through all paths simultaneously, and the correct result is selected via multiplexing.

**Path A: The Hadamard (H) Complex** This is the most sophisticated path, as indicated by the chain of operations: H → ADD/SUB → Multiply.

- **Input Stage:** The current  $\alpha$  and  $\beta$  values enter the H-block.
- **Arithmetic Stage (ADD/SUB):** The diagram shows the signal splitting into two branches.
  - *Top Branch:* Computes the sum  $(\alpha + \beta)$  using an adder.
  - *Bottom Branch:* Computes the difference  $(\alpha - \beta)$  using a subtractor.
- **Architectural Note:** To prevent overflow during this addition, the internal datapath at this stage must be expanded by 1 bit (to 17 bits).
- **Scaling Stage (Multiply):** The outputs of the ADD/SUB units feed into a "Multiply" block. This block multiplies both streams by the constant  $1/\sqrt{2}$ .
- **Implementation:** This represents two hardware multipliers (or a time-multiplexed single multiplier) performing 17-bit  $\times$  16-bit operations. The constant  $1/\sqrt{2}$  is stored as a fixed-point immediate value.

**Path B: The Pauli-X (X) Logic** The X-gate path implements the swap operation.

- **Logic:** It routes the input REG\_BETA to the alpha\_next line and the input REG\_ALPHA to the beta\_next line.
- **Resource Cost:** This path is purely routing (wiring) and consumes zero logic gates (LUTs) if implemented as a direct multiplexer input, or minimal logic if implemented as discrete swap gates.

**Path C: The Pauli-Z (Z) Logic** The Z-gate path implements the phase flip.

- **Logic:** The diagram shows the  $\alpha$  path bypassing any logic (identity). The  $\beta$  path goes through a "SUB" block.
- **Interpretation:** The "SUB" block here represents the operation  $0 - \beta$ , which is the arithmetic equivalent of negation. This confirms the Z-gate operation  $\beta \rightarrow -\beta$ .

### 3.1.4 4. Feedback and Write-Back

The output of these three paths converges at the right-side Register. A multiplexer (implicit in the diagram's converging arrows) selects the final alpha\_next and beta\_next based on the active instruction from the Decoder. The "Register" (bottom) represents the feedback loop, ensuring the state is updated synchronously on the clock edge, ready for the next instruction.

## 3.2 Instruction Set Architecture (ISA)

To control this pipeline, a minimal Instruction Set Architecture is defined. Assuming a simple 3-bit opcode structure:

Table 2: Accelerator Instruction Set

| <b>Opcode</b> | <b>Mnemonic</b> | <b>ALU Path</b> | <b>Description</b>                           | <b>Signals</b> |
|---------------|-----------------|-----------------|----------------------------------------------|----------------|
| 000           | NOP             | None            | No Operation. Maintains current state.       | WE = 0         |
| 001           | INIT            | -               | Reset state to $ 0\rangle$ .                 | INIT = 1       |
| 010           | X               | X-Path          | Apply Pauli-X (Swap).                        | SEL = X        |
| 011           | Z               | Z-Path          | Apply Pauli-Z (Phase Flip).                  | SEL = Z        |
| 100           | H               | H-Path          | Apply Hadamard (Superposition).              | SEL = H        |
| 101           | MEAS            | -               | Output $(\alpha, \beta)$ measurement result. | WE = 0         |

### 3.3 Pipeline Strategy and Timing

The architecture is designed to support pipelining to maximize throughput.

- **Stage 1 (Fetch/Decode):** The opcode is latched, and the Decoder enables the appropriate ALU path.
- **Stage 2 (Execute):**
  - For X and Z gates, the propagation delay is minimal (combinational logic only).
  - For the H gate, the critical path involves Addition  $\rightarrow$  Multiplication. In a high-speed design, this might be split into two sub-stages (Add, then Multiply) to improve timing closure.
- **Stage 3 (Write-Back):** The result is selected and written to the QSR.

The diagram implies a simplified single-cycle execution model for X and Z, but likely a multi-cycle path for H due to the multiplier latency. The control unit handles this by stalling the pipeline or using a "Done" signal to acknowledge the completion of the slower H-gate operation.

## 4 Hardware Implementation of the Single-Qubit Accelerator

This section describes the hardware design of the single-qubit accelerator, including the numerical representation of quantum states, conversion mechanisms used in simulation, and the control logic responsible for decoding and executing quantum gate operations.

The single-qubit accelerator is a dedicated hardware module designed to perform quantum operations on a single qubit. The quantum state is represented by two amplitudes,  $\alpha$  and  $\beta$ , stored in a fixed-point format. The internal architecture follows a simple control and datapath flow consisting of an opcode decoder, parallel quantum gate modules, and a state update unit. This organization enables low-latency execution for simple gates while supporting multi-cycle operations for more complex transformations.

### 4.1 Numerical Representation: Q8.8 Fixed-Point Format

The quantum amplitudes are represented using the Q8.8 signed fixed-point format, which uses 16 bits to encode both integer and fractional values. In this format, the most significant bit denotes the sign, the next seven bits represent the integer portion with a range from  $-128$  to  $127$ , and the remaining eight bits encode the fractional part with a resolution of  $1/256$ . As a result, Q8.8 provides a numerical range from  $-128.0$  to approximately  $127.9961$  with deterministic precision suitable for hardware implementation.

Several representative Q8.8 values are used throughout the design. The value  $1.0$  is encoded as  $0x0100$ , while  $0.5$  is represented by  $0x0080$ . The constant  $1/\sqrt{2}$ , which is required for the Hadamard gate, is approximated as  $0x00B5$ . Negative values are represented using two's complement notation; for example,  $-0.5$  is encoded as  $0xFF80$ , and zero is encoded as  $0x0000$ .

Listing 1: Fixed-Point Constants Definition

```

1 localparam signed [15:0] ONE          = 16'h0100;
2 localparam signed [15:0] ZERO         = 16'h0000;
3 localparam signed [15:0] SQRT2_INV   = 16'h00B5;
```

The choice of the Q8.8 format is motivated by its efficient use of hardware resources, as it requires significantly fewer logic elements than floating-point representations. Additionally, fixed-point arithmetic enables faster computation, predictable numerical behavior across platforms, and lower power consumption, making it well suited for embedded and prototype quantum accelerators.

### 4.2 Simulation and Verification Utilities

During verification, conversion between fixed-point values and real numbers is required to facilitate debugging and result comparison in the testbench. Two utility functions are therefore implemented to convert between Q8.8 and real representations. The first function converts a signed Q8.8 value into a real number by dividing the integer representation by a scaling factor of 256.0.

Listing 2: Helper Function: Q8.8 to Real

```

1 function real q88_to_real;
2     input signed [15:0] q88_value;
3     begin
4         q88_to_real = $itor(q88_value) / Q88_SCALE;
```

```

5   end
6 endfunction

```

For example, the value  $16'h0100$  is converted to 1.0, and  $16'h0080$  is converted to 0.5. The inverse conversion is performed by multiplying a real number by the same scaling factor and rounding the result to the nearest signed integer.

Listing 3: Helper Function: Real to Q8.8

```

1 localparam real Q88_SCALE = 256.0;
2
3 function signed [15:0] real_to_q88;
4     input real value;
5     begin
6         real_to_q88 = $rtoi(value * Q88_SCALE);
7     end
8 endfunction

```

### 4.3 Control Logic and State Machine

The control logic of the accelerator integrates opcode decoding directly into a simple finite state machine rather than using a separate decode module. This approach reduces architectural complexity while remaining sufficient for the limited instruction set supported by the design. The accelerator recognizes five opcodes corresponding to no operation, state initialization, Pauli-X, Pauli-Z, and Hadamard gates.

Listing 4: Opcode and State Definitions

```

1 localparam OP_NOP    = 3'b000;
2 localparam OP_INIT   = 3'b001;
3 localparam OP_X      = 3'b010;
4 localparam OP_Z      = 3'b011;
5 localparam OP_H      = 3'b100;
6
7 localparam STATE_IDLE = 1'b0;
8 localparam STATE_EXEC = 1'b1;
9 reg state;

```

Execution is controlled by a two-state finite state machine consisting of an IDLE state and an EXEC (execution) state. When the accelerator is in the idle state, the opcode is decoded and single-cycle operations are executed immediately. State initialization sets the qubit to the  $|0\rangle$  basis state by assigning  $\alpha = 1$  and  $\beta = 0$ . The Pauli-X and Pauli-Z gates update the quantum state in a single cycle using combinational logic. When a Hadamard operation is requested, the controller asserts a busy signal and transitions to the execution state to wait for the pipelined result.

Listing 5: FSM Idle State Logic

```

1 STATE_IDLE: begin
2     case (opcode)
3         OP_INIT: begin
4             alpha <= ONE;
5             beta  <= ZERO;
6         end
7         OP_X: begin
8             alpha <= x_a;

```

```

9      beta   <= x_b;
10     end
11     OP_Z: begin
12       alpha <= z_a;
13       beta  <= z_b;
14     end
15     OP_H: begin
16       busy  <= 1'b1;
17       state <= STATE_EXEC;
18     end
19     OP_NOP: begin
20       end
21     endcase
22 end

```

While in the execution state, the controller waits until the Hadamard pipeline asserts a valid signal. Once the output is available, the quantum state is updated, the busy signal is cleared, and the controller returns to the idle state.

Listing 6: FSM Execution State Logic

```

1 STATE_EXEC: begin
2   if (h_valid) begin
3     alpha <= h_a;
4     beta  <= h_b;
5     busy  <= 1'b0;
6     state <= STATE_IDLE;
7   end
8 end

```

All quantum gate modules are instantiated in parallel and continuously receive the current quantum state as input. The Pauli-X and Pauli-Z gates are implemented as purely combinational blocks, while the Hadamard gate is implemented as a synchronous, pipelined module that produces a valid output after multiple clock cycles.

Listing 7: Gate Module Instantiation

```

1 wire signed [15:0] x_a, x_b;
2 wire signed [15:0] z_a, z_b;
3 wire signed [15:0] h_a, h_b;
4 wire h_valid;
5
6 gate_pauli_x  gateX (alpha, beta, x_a, x_b);
7 gate_pauli_z  gateZ (alpha, beta, z_a, z_b);
8 gate_hadamard gateH (clk, alpha, beta, h_a, h_b, h_valid);

```

## 5 Quantum Gate Modules

The core functionality of the accelerator is distributed across specialized hardware modules, each responsible for a specific quantum transformation. Unlike a general-purpose processor which fetches instructions sequentially, the accelerator instantiates these gate modules in parallel within the top-level architecture. This design strategy allows the current quantum state  $(\alpha, \beta)$  to be fed simultaneously into all gate logic blocks, with the instruction decoder simply selecting the appropriate output. The following subsections detail the hardware description language (HDL) implementation for the three fundamental gates supported by the system.

### Pauli-X Gate (Bit Flip)

The Pauli-X gate performs a quantum bit flip, which is mathematically equivalent to swapping the  $\alpha$  and  $\beta$  probability amplitudes. As shown in Listing 8, this operation is implemented purely through signal routing. Since no arithmetic logic is required, this module consumes zero LUT (Look-Up Table) resources and operates with negligible propagation delay.

Listing 8: Verilog Implementation of Pauli-X Gate

```

1 module gate_pauli_x (
2     input signed [15:0] alpha_in,
3     input signed [15:0] beta_in,
4     output signed [15:0] alpha_out,
5     output signed [15:0] beta_out
6 );
7     // Pauli-X (Bit Flip): Swaps alpha and beta
8     // Matrix: [0 1; 1 0]
9     assign alpha_out = beta_in;
10    assign beta_out = alpha_in;
11 endmodule

```

#### Module Description:

- `input/output signed [15:0]`: Defines 16-bit ports utilizing the Q8.8 format. The `signed` keyword ensures the synthesizer treats these as Two's Complement numbers rather than raw bit vectors.
- `assign alpha_out = beta_in`: This continuous assignment creates a direct hardware wire connecting the input  $\beta$  signal to the output  $\alpha$  port.
- `assign beta_out = alpha_in`: Similarly, this routes the input  $\alpha$  signal to the output  $\beta$  port, completing the swap operation.

### Pauli-Z Gate (Phase Flip)

The Pauli-Z gate leaves the  $\alpha$  component unchanged while inverting the sign of the  $\beta$  component ( $\beta \rightarrow -\beta$ ). Mechanically, this is implemented using a fixed-point subtractor to compute  $0 - \beta_{in}$ . This operation requires arithmetic logic but is simple enough to be executed in a single clock cycle without pipelining.

Listing 9: Verilog Implementation of Pauli-Z Gate

```

1 module gate_pauli_z (
2     input signed [15:0] alpha_in,

```

```

3   input  signed [15:0] beta_in,
4   output signed [15:0] alpha_out,
5   output signed [15:0] beta_out
6 );
7   // Constant Zero
8   localparam signed [15:0] ZERO = 16'h0000;
9
10  // Alpha passes through
11  assign alpha_out = alpha_in;
12
13  fp_subtractor sub_z (
14    .a(ZERO),
15    .b(beta_in),
16    .sub(beta_out)
17 );
18 endmodule

```

### Module Description:

- `localparam ZERO`: Defines the constant zero in 16-bit hexadecimal format (`16'h0000`) to serve as the minuend.
- `fp_subtractor`: This instance performs the arithmetic negation.
  - **Port A**: Receives the hardcoded zero signal.
  - **Port B**: Receives the current  $\beta$  amplitude.
  - **Output**: The result  $(0 - \beta)$  is driven to `beta_out`.

## Hadamard Gate (Superposition)

The Hadamard gate is the most computationally intensive module, requiring both addition/subtraction and multiplication by the scaling factor  $1/\sqrt{2}$ . To maintain timing stability and prevent critical path violations, this module is designed as a pipelined stage synchronized to the system clock.

Listing 10: Verilog Implementation of Hadamard Gate

```

1 module gate_hadamard (
2   input  clk,
3   input  signed [15:0] alpha_in,
4   input  signed [15:0] beta_in,
5   output reg signed [15:0] alpha_out,
6   output reg signed [15:0] beta_out,
7   output reg valid
8 );
9   // Hadamard Gate (Superposition)
10  // Matrix: 1/sqrt(2) * [1  1; 1 -1]
11  localparam signed [15:0] SQRT2_INV = 16'h00B5;
12
13  wire signed [15:0] sum, diff;
14  wire signed [15:0] m1, m2;
15
16  // Step 1: Add/Sub (Combinational)
17  fp_adder      add0 (alpha_in, beta_in, sum);
18  fp_subtractor sub0 (
19    .a (alpha_in),

```

```

20     .b (beta_in),
21     .sub (diff)
22   );
23
24   // Step 2: Multiply by scaling factor
25   fp_multiplier mult1 (sum, SQRT2_INV, m1);
26   fp_multiplier mult2 (diff, SQRT2_INV, m2);
27
28   // Step 3: Register Output
29   always @(posedge clk) begin
30     alpha_out <= m1;
31     beta_out  <= m2;
32     valid      <= 1'b1; // Signal that pipeline contains valid data
33   end
34 endmodule

```

### Module Description:

- **Input clk:** Since the multiplication operations have a higher propagation delay, the module is synchronized to the system clock to ensure signal stability.
- **Output reg:** Unlike the Pauli gates, the outputs are stored in Flip-Flops (registers) rather than being driven by continuous assignment.
- **Step 1 (Arithmetic):** Two instances, fp\_adder and fp\_subtractor, operate in parallel to compute the intermediate terms  $(\alpha + \beta)$  and  $(\alpha - \beta)$ .
- **Step 2 (Scaling):** The intermediate results are multiplied by the constant SQRT2\_INV (approx 0.707).
- **Step 3 (Write-Back):** On the rising edge of the clock (posedge clk), the calculated values are latched into the output registers, and the valid flag is asserted to indicate completion to the main controller.

## 5.1 Simulation Testbench

To verify the functional correctness of the hardware accelerator, a comprehensive testbench was developed using SystemVerilog. This testbench serves as a wrapper around the Device Under Test (DUT), generating the necessary clock and reset signals, driving input opcodes, and systematically validating the output states  $(\alpha, \beta)$  against expected mathematical models. The verification strategy adopts a bottom-up approach, testing fundamental arithmetic units before validating the complex quantum gate operations.

### Testbench Architecture and Signal Setup

The simulation environment is structured to distinguish between testbench drivers and hardware outputs. As shown in Listing 11, ‘reg’ types are used for signals driven by the testbench (Clock, Reset, Opcode), while ‘wire’ types capture the outputs driven by the accelerator.

Listing 11: Testbench Setup and DUT Instantiation

```

1 module tb_single_qubit_accelerator;
2
3   // Clock and reset
4   reg clk;

```

```

5   reg rst;
6   reg [2:0] opcode;
7
8   // DUT outputs
9   wire signed [15:0] alpha;
10  wire signed [15:0] beta;
11  wire busy;
12
13  // Test statistics
14  integer test_count;
15  integer pass_count;
16  integer fail_count;
17
18  localparam real Q88_SCALE = 256.0; // 2^8 for scaling
19
20  // Opcode definitions
21  localparam OP_NOP = 3'b000;
22  localparam OP_INIT = 3'b001;
23  localparam OP_X = 3'b010;
24  localparam OP_Z = 3'b011;
25  localparam OP_H = 3'b100;
26  localparam OP_MEAS = 3'b101;
27
28  // Instantiate the DUT
29  single_qubit_accelerator dut (
30      .clk(clk),
31      .rst(rst),
32      .opcode(opcode),
33      .alpha(alpha),
34      .beta(beta),
35      .busy(busy)
36  );
37
38  // Clock generation: 10ns period (100MHz)
39  initial begin
40      clk = 0;
41      forever #5 clk = ~clk;
42  end

```

## Data Conversion Utilities

Hardware processing occurs in binary fixed-point format, which can be difficult to interpret directly. To facilitate debugging and automated checking, the testbench includes helper functions that translate between the 16-bit Q8.8 format and standard real numbers.

Listing 12: Helper Functions for Q8.8 Conversion

```

1  // Function to convert Q8.8 to real
2  function real q88_to_real;
3      input signed [15:0] q88_value;
4      begin
5          q88_to_real = $itor(q88_value) / Q88_SCALE;
6      end
7  endfunction
8
9  // Function to convert real to Q8.8
10 function signed [15:0] real_to_q88;

```

```

11     input real value;
12 begin
13     real_to_q88 = $rtoi(value * Q88_SCALE);
14 end
15 endfunction

```

## Verification Logic

The core verification logic relies on a reusable task, ‘check\_value’, which compares the DUT’s actual output against a point arithmetic. Listing 13 illustrates the verification of the Hadamard gate, ensuring that the superposition is correctly created.

Listing 13: Main Test Execution and Gate Verification

```

// Task to check if value is within tolerance
task check_value;
    input real actual;
    input real expected;
    input real tolerance;
    input [200*8:1] test_name;
begin
    test_count = test_count + 1;
    if ((actual >= expected - tolerance) && (actual <= expected + tolerance)) begin
        $display("[PASS] %0s: Expected=%.6f, Got=%.6f", test_name,
            expected, actual);
        pass_count = pass_count + 1;
    end else begin
        $display("[FAIL] %0s: Expected=%.6f, Got=%.6f, Error=%.6f",
            test_name, expected, actual, actual - expected);
        fail_count = fail_count + 1;
    end
end
endtask

// Test 5: Hadamard Gate (Superposition Creator)
task test_hadamard;
    real alpha_real, beta_real, expected_val;
begin
    $display("TEST_5:_Hadamard_Gate");

    // Initialize to |0>
    apply_op(OP_INIT);

    // Apply H gate: |0> -> (|0> + |1>)/sqrt(2)
    $display("\nApplying_Hadamard_gate");
    apply_op(OP_H);

    alpha_real = q88_to_real(alpha);
    beta_real = q88_to_real(beta);
    expected_val = 0.707106781; // 1/sqrt(2)

    check_value(alpha_real, expected_val, 0.02, "H_on_|0>:_alpha=_1/sqrt(2)");
    check_value(beta_real, expected_val, 0.02, "H_on_|0>:_beta=_1/sqrt(2)");
end
endtask

```

```

42 // Main Test Sequence Block (Excerpt)
43 initial begin
44     // Reset and Run tests
45     rst = 1; #20 rst = 0; #20;
46
47     test_pauli_x();
48     test_pauli_z();
49     test_hadamard();
50
51     // Final Summary
52     $display("\nTEST_SUMMARY");
53     $display("Passed: %d / %d", pass_count, test_count);
54     $finish;
55 end
56 endmodule

```

## 5.2 Simulation Results

The testbench described in the previous section was executed to validate the design. The simulation results are categorized into two phases: low-level arithmetic verification and high-level quantum gate state transitions.

### Phase 1: Arithmetic Unit Verification

The first phase of the simulation isolated the fixed-point arithmetic units. As shown in Listing 14, the adder and subtractor performed within the expected precision. However, the multiplier test revealed the inherent limitations of the Q8.8 format.

Listing 14: Simulation Log: Arithmetic Unit Tests

```

1 Test phase:
2 Arithmetic Unit Tests
3
4 1.1. Multiplier test:
5 TEST 1: Fixed-Point Multiplier
6 [PASS] 0.5 * 0.5: Expected=0.250000, Got=0.250000
7 [PASS] -1.0 * 0.5: Expected=-0.500000, Got=-0.500000
8 [PASS] 1/sqrt(2) * 1/sqrt(2): Expected=0.500000, Got=0.496094
9
10 1.2. Adder/Subtractor:
11 TEST 2: Fixed-Point Adder/Subtractor
12 [PASS] 0.5 + 0.5: Expected=1.000000, Got=1.000000
13 [PASS] 1.0 - 0.5: Expected=0.500000, Got=0.500000
14 [PASS] -0.5 + 0.5: Expected=0.000000, Got=0.000000
15 [PASS] 0.1 - 0.9: Expected=-0.800000, Got=-0.800781

```

### Analysis:

- The calculation  $(1/\sqrt{2})^2$  resulted in 0.496094 instead of the ideal 0.5. This represents an error of approximately 0.8%. This error stems from the truncation of  $1/\sqrt{2}$  to 0x00B5 and the subsequent bit-shifting after multiplication.
- The subtraction  $0.1 - 0.9$  yielded  $-0.800781$  versus  $-0.8$ , showing a minor quantization deviation of one LSB ( $\approx 0.0039$ ).

## Phase 2: Quantum Gate Verification

The second phase tested the Pauli and Hadamard gates. The Pauli gates (X and Z) exhibited perfect fidelity as they rely primarily on signal routing and simple negation. The Hadamard gate, however, showed signs of "numerical decoherence" due to the accumulated arithmetic errors identified in Phase 1.

Listing 15: Simulation Log: Pauli and Hadamard Gates

```

1 Gate tests
2
3 2.1. Pauli X Gate
4 TEST 3: Pauli-X Gate
5 Initial state |0>
6 alpha = 1.000000 (0x0100), beta = 0.000000 (0x0000)
7 After X gate
8 alpha = 0.000000 (0x0000), beta = 1.000000 (0x0100)
9 [PASS] X gate: alpha should be 0: Expected=0.000000, Got=0.000000
10 [PASS] X gate: beta should be 1: Expected=1.000000, Got=1.000000
11
12 2.2. Pauli Z Gate
13 TEST 4: Pauli-Z Gate
14 Initial state |0> -> After Z -> Unchanged [PASS]
15 Initial state |1> -> After Z
16 alpha = 0.000000 (0x0000), beta = -1.000000 (0xff00)
17 [PASS] Z on |1>: beta negated: Expected=-1.000000, Got=-1.000000
18
19 2.3. Hadamard Gate
20 TEST 5: Hadamard Gate
21 Initial state |0>
22 Applying Hadamard gate
23 After H gate
24 alpha = 0.707031 (0x00b5), beta = 0.707031 (0x00b5)
25 P(|0>) = 0.499893, P(|1>) = 0.499893, Sum = 0.999786
26 [PASS] H on |0>: alpha = 1/sqrt(2): Expected=0.707107, Got=0.707031
27
28 Applying Hadamard gate again
29 After H^2 (should be |0>)
30 alpha = 0.996094 (0x00ff), beta = 0.000000 (0x0000)
31 P(|0>) = 0.992203, P(|1>) = 0.000000, Sum = 0.992203
32 [PASS] H^2: alpha should be 1: Expected=1.000000, Got=0.996094

```

### Analysis:

- **Unitary Violation:** In the Hadamard test (Test 5), the sum of probabilities ( $|\alpha|^2 + |\beta|^2$ ) dropped to 0.999786. While mathematically the total probability must equal 1, the fixed-point representation caused a slight "leakage" of probability.
- **Reversibility ( $H^2 \approx I$ ):** When the Hadamard gate was applied twice, the system returned to  $\alpha = 0.996094$  (0x00FF) rather than the exact 1.0 (0x0100). This confirms that while the logic is functionally correct, the circuit depth is limited by the accumulation of rounding errors.

## 6 Booth Multiplier Design

Multiplication is one of the most resource-intensive operations in digital systems, specifically within Central Processing Units (CPUs) and Digital Signal Processing (DSP) blocks. The performance of the multiplication unit directly dictates the overall execution speed and hardware complexity of the system. Therefore, selecting an optimal multiplication algorithm is a critical design decision.

### 6.1 Algorithm Selection: Shift-and-Add vs. Booth

Among basic binary multiplication methods, the *Shift-and-Add* algorithm is common due to its simplicity. It performs multiplication by left-shifting the multiplicand and conditionally adding it based on each bit of the multiplier. However, this approach has significant drawbacks:

- **Performance Penalty:** When the multiplier contains many consecutive '1' bits, the number of addition operations increases linearly, leading to longer critical paths.
- **Signed Number Complexity:** Handling signed numbers requires additional pre-processing and post-processing steps (e.g., converting to absolute values and restoring signs).

The **Booth Algorithm** was proposed to overcome these limitations by exploiting the properties of Two's Complement binary representation. Instead of processing bits individually, Booth's algorithm uses encoding to identify runs of consecutive 1s. This mechanism allows multiple repeated additions to be replaced by a single addition and a single subtraction. Furthermore, it inherently supports signed multiplication without requiring data conversion, making it highly suitable for modern arithmetic architectures.

### 6.2 Design and Operational Analysis

The Booth Algorithm optimizes signed binary multiplication using four primary registers. The operation relies on the state of the current Least Significant Bit (LSB) of the multiplier ( $Q_0$ ) and a history bit ( $Q_{-1}$ ).

#### 6.2.1 Register Definitions

- **A (Accumulator):** Stores the partial product results. Initialized to 0.
- **M (Multiplicand):** Stores the 16-bit input value to be multiplied.
- **Q (Multiplier):** Stores the 16-bit multiplier. Bits are shifted out to the right during execution.
- **$Q_{-1}$  (History Bit):** Stores the bit shifted out from  $Q$  in the previous cycle. Initialized to 0.

#### 6.2.2 Operational Logic

In each clock cycle, the controller examines the bit pair  $\{Q_0, Q_{-1}\}$  to determine the operation:

- **00 or 11:** No arithmetic operation.
- **01:**  $A \leftarrow A + M$  (Add Multiplicand).

- 10:  $A \leftarrow A - M$  (Subtract Multiplicand).

Following the arithmetic operation, the combined register set  $\{A, Q, Q_{-1}\}$  undergoes an **Arithmetic Right Shift (ARS)**. This shift preserves the sign bit of  $A$ , ensuring correct signed arithmetic behavior. This process repeats for  $N$  cycles (where  $N = 16$  is the bit width).



Figure 1: Flowchart of the Booth Multiplication Algorithm

## 6.3 Hardware Implementation

The hardware design is implemented in Verilog HDL. It consists of a control FSM and a datapath containing the adder/subtractor unit and shift registers.

### 6.3.1 Booth Multiplier Module

Listing 16 details the implementation of the 16-bit signed multiplier. The design uses a counter to track the 16 iterations and a state flag ‘done’ to indicate completion.

Listing 16: Verilog Implementation of 16-bit Booth Multiplier

```

1 module booth_multiplier_16bit(
2     input clk,
3     input start,

```

```

4      input signed [15:0] multiplicand,
5      input signed [15:0] multiplier,
6      output reg signed [31:0] product,
7      output reg done
8  );
9
10     reg signed [15:0] M;
11     reg signed [15:0] Q;
12     reg signed [15:0] A;
13     reg Qm1;           // Q(-1) history bit
14     reg [4:0] count;    // Iteration counter (0-16)
15
16     always @ (posedge clk) begin
17       if (start) begin
18         // Initialization Phase
19         A      <= 16'd0;
20         M      <= multiplicand;
21         Q      <= multiplier;
22         Qm1   <= 1'b0;
23         count <= 5'd16;
24         done   <= 1'b0;
25       end
26       else if (count > 0) begin
27         // Arithmetic Phase based on {Q[0], Qm1}
28         case ({Q[0], Qm1})
29           2'b01: A <= A + M;    // Transition 0->1: Add
30           2'b10: A <= A - M;    // Transition 1->0: Subtract
31           default: ;          // 00 or 11: No operation
32         endcase
33
34         // Shift Phase: Arithmetic Shift Right {A, Q, Qm1}
35         // MSB of A is preserved (Sign Extension)
36         Qm1 <= Q[0];
37         Q    <= {A[0], Q[15:1]};
38         A    <= {A[15], A[15:1]};
39
40         count <= count - 1;
41       end
42       else begin
43         // Completion Phase
44         done   <= 1'b1;
45         product <= {A, Q};      // Concatenate result
46       end
47     end
48   endmodule

```

### 6.3.2 Testbench Simulation

To verify the design, a testbench was created to cover various corner cases, including positive-positive, negative-positive, negative-negative, and maximum boundary values.

Listing 17: Testbench for Booth Multiplier Verification

```

1 `timescale 1ns/1ps
2
3 module tb_booth_multiplier_16bit;
4
5   reg clk;

```

```

6   reg start;
7   reg signed [15:0] multiplicand;
8   reg signed [15:0] multiplier;
9
10  wire signed [31:0] product;
11  wire done;
12
13 // Instantiate DUT (Device Under Test)
14 booth_multiplier_16bit DUT (
15     .clk(clk),
16     .start(start),
17     .multiplicand(multiplicand),
18     .multiplier(multiplier),
19     .product(product),
20     .done(done)
21 );
22
23 // Clock generation (100MHz)
24 always #5 clk = ~clk;
25
26 initial begin
27     $display("===_Booth_Multiplier_16-bit_Testbench_Started_==");
28
29     // Initial values
30     clk = 0; start = 0; multiplicand = 0; multiplier = 0;
31
32     // TEST 1: 5 * 7 = 35 (Pos * Pos)
33     #10;
34     multiplicand = 5; multiplier = 7; start = 1;
35     #10 start = 0;
36     wait(done == 1);
37     #5;
38     $display("Test_1:_%d_*_%d=%d", 5, 7, product);
39
40     // TEST 2: (-8) * 6 = -48 (Neg * Pos)
41     #20;
42     multiplicand = -8; multiplier = 6; start = 1;
43     #10 start = 0;
44     wait(done == 1);
45     #5;
46     $display("Test_2:_%d_*_%d=%d", -8, 6, product);
47
48     // TEST 3: (-12) * (-9) = 108 (Neg * Neg)
49     #20;
50     multiplicand = -12; multiplier = -9; start = 1;
51     #10 start = 0;
52     wait(done == 1);
53     #5;
54     $display("Test_3:_%d_*_%d=%d", -12, -9, product);
55
56     // TEST 4: Max * Max (Boundary Stress Test)
57     #20;
58     multiplicand = 16'sh7FFF;    // 32767
59     multiplier = 16'sh7FFF;    // 32767
60     start = 1;
61     #10 start = 0;
62     wait(done == 1);
63     #5;

```

```

64     $display ("Test_4: %d * %d = %d", multiplicand, multiplier, product)
       ;
65
66     $display ("====_Testbench_Finished===");
67     $stop;
68   end
69 endmodule

```

## 6.4 Simulation Results

The simulation of the testbench for the 16-bit Booth multiplier shows that the module operates correctly as designed. After the simulation starts, the testbench sequentially performs multiple multiplication operations under different scenarios, including positive numbers, negative numbers, and large boundary values.

The results displayed for each test indicate that the done signal is asserted after completing all 16 Booth cycles, confirming that the algorithm has executed the required number of iterations. The obtained output values (taken from bits [23:8] of the 32-bit product) are stable and consistent across all test cases. The simulation completes normally without any errors, confirming the correctness of the Booth algorithm implementation in Verilog.



Figure 2: Simulation Waveform of the Booth Multiplier

## 7 Fixed-Point Arithmetic Units: Adder and Subtractor

For fixed-point addition and subtraction, the underlying binary operations are identical to standard signed integer arithmetic found in classical computer architectures. Unlike the complexity involved in optimizing multiplication (as seen with the Booth algorithm), addition and subtraction primarily rely on standard logic gates. Consequently, a detailed theoretical derivation is omitted here. However, to ensure a comprehensive documentation of the hardware design, the Verilog implementations and their verification testbenches are presented below.

### 7.1 Fixed-Point Subtractor

The subtraction unit implements the operation  $RESULT = A - B$ . In digital logic, this is achieved using Two's Complement arithmetic, where subtraction is treated as the addition of the negative:  $A + (\sim B) + 1$ . The module includes a `COUT` signal to indicate the borrow status (where 1 indicates no borrow occurred).

Listing 18: Verilog Implementation of Fixed-Point Subtractor

```

1  `timescale 1ns / 1ps
2
3 // Fixed-point SUB unit
4 // RESULT = A - B (two's complement)
5
6 module sub #(
7     parameter WIDTH = 16
8 ) (
9     input wire [WIDTH-1:0] A,
10    input wire [WIDTH-1:0] B,
11    output wire [WIDTH-1:0] RESULT,
12    output wire           COUT // 1 = no borrow, 0 = borrow
13 );
14
15    wire [WIDTH:0] diff_ext;
16
17 // A - B is equivalent to A + (~B) + 1
18 assign diff_ext = {1'b0, A} + {1'b0, (~B)} + 1'b1;
19 assign RESULT  = diff_ext[WIDTH-1:0];
20 assign COUT    = diff_ext[WIDTH];
21
22 endmodule

```

Listing 19: Testbench for Subtractor

```

1  `timescale 1ns / 1ps
2
3 module tb_sub;
4
5     reg [15:0] A, B;
6     wire [15:0] RESULT;
7     wire       COUT;
8
9     sub tb_sub (
10         .A(A),
11         .B(B),
12         .RESULT(RESULT),
13         .COUT(COUT)

```

```

14 );
15
16 initial begin
17     $dumpfile("sub.vcd");
18     $dumpvars(0, tb_sub);
19
20     // Case 1: Positive Result (10 - 5)
21     A = 16'd10; B = 16'd5;
22     #10;
23     $display("SUB:_%0d_-_%0d=_%0d__COUT=%b", A, B, RESULT, COUT);
24
25     // Case 2: Negative Result (5 - 10)
26     A = 16'd5; B = 16'd10;
27     #10;
28     $display("SUB:_%0d_-_%0d=_%0d__COUT=%b", A, B, RESULT, COUT);
29
30     $finish;
31 end
32
33 endmodule

```

## 7.2 Fixed-Point Adder

The addition unit performs standard binary addition ( $RESULT = A + B$ ). It utilizes an extended bit width internally ('sum<sub>ext</sub>') to capture any carry-out generation, which serves as an overflow indicator.

Listing 20: Verilog Implementation of Fixed-Point Adder

```

1 `timescale 1ns / 1ps
2
3 // Fixed-point ADD unit
4 // RESULT = A + B
5
6 module add #(
7     parameter WIDTH = 16
8 ) (
9     input wire [WIDTH-1:0] A,
10    input wire [WIDTH-1:0] B,
11    output wire [WIDTH-1:0] RESULT,
12    output wire           COUT      // Carry-out (overflow)
13 );
14
15     wire [WIDTH:0] sum_ext;
16
17     assign sum_ext = {1'b0, A} + {1'b0, B};
18     assign RESULT  = sum_ext[WIDTH-1:0];
19     assign COUT    = sum_ext[WIDTH];
20
21 endmodule

```

Listing 21: Testbench for Adder

```

1 `timescale 1ns / 1ps
2
3 module tb_add;
4
5     reg [15:0] A, B;

```

```
6   wire [15:0] RESULT;
7   wire          COUT;
8
9   add tb_add (
10     .A(A),
11     .B(B),
12     .RESULT(RESULT),
13     .COUT(COUT)
14 );
15
16 initial begin
17   $dumpfile("add.vcd");
18   $dumpvars(0, tb_add);
19
20   // Case 1: Standard Addition (10 + 5)
21   A = 16'd10; B = 16'd5;
22   #10;
23   $display("ADD:_%0d+_%0d=_%0d__COUT=%b", A, B, RESULT, COUT);
24
25   // Case 2: Zero Addition (0 + 0)
26   A = 16'd0; B = 16'd0;
27   #10;
28   $display("ADD:_%0d+_%0d=_%0d__COUT=%b", A, B, RESULT, COUT);
29
30   $finish;
31 end
32
33 endmodule
```

## References

- Aaronson, S. (2018, December 17). *Why are amplitudes complex?* Shtetl-Optimized. <https://scottaaronson.blog/>
- Bautista, I., Kreinovich, V., Kosheleva, O., & Nguyen, H. P. (2020). *Why it is sufficient to have real-valued amplitudes in quantum computing* (Technical Report No. UTEP-CS-20-48). The University of Texas at El Paso. [https://scholarworks.utep.edu/cgi/viewcontent.cgi?article=2438&context=cs\\_te](https://scholarworks.utep.edu/cgi/viewcontent.cgi?article=2438&context=cs_te)
- Booth, A. D. (1951). A signed binary multiplication technique. *Quarterly Journal of Mechanics and Applied Mathematics*, 4(2), 236–240.
- Choi, S., Lee, K., Choi, J., & Lee, W. (2025). *Precision-aware fixed-point emulation of Grover's algorithm: Asymptotic error bounds and design guidelines* (Preprint). Research Square. <https://doi.org/10.21203/6904768/v1>
- Dick, R. P., Albonesi, D. H., & Skillicorn, D. J. W. J. (2004). *Digital Systems Design Using Verilog* (1st ed.). New York: Springer.
- Gill, S. S., Cetinkaya, O., Marrone, S., Claudino, D., Haunschild, D., Schlotte, L., Wu, H., Ottaviani, C., Liu, X., Machupalli, S. P., Kaur, K., Arora, P., Liu, J., Farouk, A., Song, H. H., Uhlig, S., & Ramamohanarao, K. (2025). *Quantum computing: Vision and challenges* (arXiv:2403.02240v5) [Preprint]. arXiv. <https://arxiv.org/abs/2403.02240v5>
- imec. (n.d.). *Is Moore's law dead?* <https://www.imec-int.com/en/semiconductor-education-and-workforce-development/microchips/moores-law/moores-law-dead>
- Khalid, A. U., Zilic, Z., & Radecka, K. (2004). FPGA emulation of quantum circuits. In *Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors* (pp. 310–315). IEEE. <https://doi.org/10.1109/ICCD.2004.1347938>
- Khalid, A. U., Zilic, Z., & Radecka, K. (2005). *FPGA emulation of quantum circuits*. McGill University / Concordia University. <https://sites.cc.gatech.edu/computing/nano/documents/Radecka%20-%20FPGA%20Emulation%20of%20Quantum%20Circuits.pdf>
- Khalid, A. U., Zilic, Z., & Radecka, K. (2016). An FPGA-based quantum computing emulation framework based on serial-parallel architecture. *International Journal of Reconfigurable Computing*, 2016, Article 5718124. <https://doi.org/10.1155/2016/5718124>
- Lagostina, L., Volpe, D., Zamboni, M., & Turvani, G. (2025). *AQUAM: Accelerating quantum algorithm validation through FPGA-based emulation* (arXiv:2506.01029v1) [Preprint]. arXiv. <https://doi.org/10.48550/arXiv.2506.01029>
- Madhasamy, K. (2025, April 7). As tech reaches compute limits, quantum computing must work. *EE Times*. <https://www.eetimes.com/as-tech-reaches-compute-limits-quantum-computing-must-work/>
- Microchip Technology Inc. (n.d.). *What is Moore's Law?* <https://www.microchipusa.com/electrical-components/what-is-moores-law>
- Oukaira, A. (2025). Quantum hardware devices (QHDs): Opportunities and challenges. *IEEE*

Access. <https://doi.org/10.1109/ACCESS.2025.3576216>

Plain Concepts. (2024, June 19). *Quantum computing: Potential and challenges ahead*. <https://www.plainconcepts.com/computing-potential-challenges/>

Project F. (2020, May 26). *Fixed point numbers in Verilog* (Updated April 22, 2024). <https://projectf.io/posts/fixed-point-numbers-in-verilog/>

Q format. (n.d.). In *Wikipedia*. [https://en.wikipedia.org/wiki/Q\\_\(number\\_format\)](https://en.wikipedia.org/wiki/Q_(number_format))

Q-format Converter & Calculator. (n.d.). <https://chummersone.github.io/qformat.html>

Quantum Computing Stack Exchange. (2021). *Are complex amplitudes really needed?* <https://quantumcomputing.stackexchange.com/questions/10000/are-complex-amplitudes-really-needed>

Random person. (2024, April 6). Which bits do I need to extract in a signed fixed point multiplier? [Question]. *Stack Overflow*. <https://stackoverflow.com/questions/78283508/which-bits-do-i-need-to-extract-in-a-signed-fixed-point-multiplier>

Schlecker, W., Beuschel, C., & Pfleiderer, H.-J. (2007). Quantisation noise in fixed-point multiplications. *Electrical Engineering*, 89(4), 339–342. <https://doi.org/10.1007/s00202-006-0009-3>

Waidyasooriya, H., Oshiyama, H., Kurebayashi, Y., Hariyama, M., & Ohzeki, M. (2022). A scalable emulator for quantum Fourier transform using multiple FPGAs with high-bandwidth memory. *IEEE Access*, 10, 1–14. <https://doi.org/10.1109/ACCESS.2022.3183993>

Winandy, I., Dion, A., Manni, F., Garoche, P.-L., Ben Khalifa, D., & Martel, M. (2025). Automated fixed-point precision optimization for FPGA synthesis. *IEEE Open Journal of Circuits and Systems*, 6, 192–204. <https://doi.org/10.1109/OJCAS.2025.3580744>

## A Appendix: Verilog Source Code

This appendix contains the complete Verilog Hardware Description Language (HDL) source code for the Single-Qubit Hardware Accelerator, the Booth Multiplier, and the fundamental arithmetic units, along with their respective testbenches.

listings xcolor

### A.1 Quantum Accelerator Core

```

1  `timescale 1ns / 1ps
2
3 // 1. Basic Arithmetic Building Blocks
4 module fp_multiplier (
5     input signed [15:0] a,
6     input signed [15:0] b,
7     output signed [15:0] result
8 );
9     wire signed [31:0] full_product = a * b;
10    assign result = full_product[23:8]; // Extract middle 16 bits to maintain Q8.8 scaling
11 endmodule
12
13 module fp_adder (
14     input signed [15:0] a,
15     input signed [15:0] b,
16     output signed [15:0] sum
17 );
18     assign sum = a + b;
19 endmodule
20
21 module fp_subtractor (
22     input signed [15:0] a,
23     input signed [15:0] b,
24     output signed [15:0] sub
25 );
26     assign sub = a - b;
27 endmodule
28
29
30 // 2. Quantum Gate Modules
31 module gate_pauli_x (
32     input signed [15:0] alpha_in,
33     input signed [15:0] beta_in,
34     output signed [15:0] alpha_out,
35     output signed [15:0] beta_out
36 );
37     // Pauli-X (Bit Flip): Swaps alpha and beta
38     // Matrix: [0 1; 1 0]
39     assign alpha_out = beta_in;
40     assign beta_out = alpha_in;
41 endmodule
42
43 module gate_pauli_z (
44     input signed [15:0] alpha_in,
45     input signed [15:0] beta_in,
46     output signed [15:0] alpha_out,
47     output signed [15:0] beta_out
48 );
49     // Constant Zero
50     localparam signed [15:0] ZERO = 16'h0000;
51     // Alpha passes through
52     assign alpha_out = alpha_in;
53     fp_subtractor sub_z (
54         .a(ZERO),
55         .b(beta_in),
56         .sub(beta_out)

```

```

57    );
58 endmodule
59
60 module gate_hadamard (
61     input  clk,
62     input  signed [15:0] alpha_in,
63     input  signed [15:0] beta_in,
64     output reg signed [15:0] alpha_out,
65     output reg signed [15:0] beta_out,
66     output reg valid
67 );
68 // Hadamard Gate (Superposition)
69 // Matrix: 1/sqrt(2) * [1 1; 1 -1]
70 localparam signed [15:0] SQRT2_INV = 16'h00B5;
71
72 wire signed [15:0] sum, diff;
73 wire signed [15:0] m1, m2;
74
75 // Step 1: Add/Sub (Combinational)
76 fp_adder      add0 (alpha_in, beta_in, sum);
77 fp_subtractor sub0 (
78     .a (alpha_in),
79     .b (beta_in),
80     .sub (diff)
81 );
82 // Step 2: Multiply by scaling factor (Combinational in this design)
83 fp_multiplier mult1 (sum, SQRT2_INV, m1);
84 fp_multiplier mult2 (diff, SQRT2_INV, m2);
85
86 // Step 3: Register Output (Pipeline Stage)
87 always @ (posedge clk) begin
88     alpha_out <= m1;
89     beta_out <= m2;
90     valid     <= 1'b1;
91     // Signal that pipeline contains valid data
92 end
93 endmodule
94
95 // 3. Main Accelerator Module (FSM & Datapath)
96
97 module single_qubit_accelerator (
98     input  clk,
99     input  rst,
100    input  [2:0] opcode,
101    output reg signed [15:0] alpha,
102    output reg signed [15:0] beta,
103    output reg busy
104 );
105 // Instruction Set Definition
106 localparam OP_NOP = 3'b000; // No Operation
107 localparam OP_INIT = 3'b001; // Initialize to |0>
108 localparam OP_X   = 3'b010; // Pauli-X
109 localparam OP_Z   = 3'b011; // Pauli-Z
110 localparam OP_H   = 3'b100; // Hadamard
111
112 // Constants
113 localparam signed [15:0] ZERO = 16'h0000;
114 localparam signed [15:0] ONE  = 16'h0100; // 1.0 in Q8.8
115
116 // Internal Signals
117 wire signed [15:0] x_a, x_b;
118 wire signed [15:0] z_a, z_b;
119 wire signed [15:0] h_a, h_b;
120 wire h_valid;
121 // --- Instantiate Gates ---
122 // All gates receive current alpha/beta state combinatorially
123 gate_pauli_x gateX (alpha, beta, x_a, x_b);
124 gate_pauli_z gateZ (alpha, beta, z_a, z_b);
125 gate_hadamard gateH (clk, alpha, beta, h_a, h_b, h_valid);
126 // --- Control State Machine ---
127 localparam STATE_IDLE = 1'b0;
128 localparam STATE_EXEC = 1'b1;
129 reg state;

```

```

130    always @ (posedge clk or posedge rst) begin
131        if (rst) begin
132            alpha <= ONE; // Default start state /0>
133            beta  <= ZERO;
134            busy   <= 1'b0;
135            state  <= STATE_IDLE;
136        end else begin
137            case (state)
138                // STATE: IDLE (Fetch/Decode)
139                STATE_IDLE: begin
140                    case (opcode)
141                        OP_INIT: begin
142                            alpha <= ONE;
143                            beta  <= ZERO;
144                        end
145                        OP_X: begin
146                            alpha <= x_a; // Update state immediately (Single Cycle)
147                            beta  <= x_b;
148                        end
149                        OP_Z: begin
150                            alpha <= z_a; // Update state immediately (Single Cycle)
151                            beta  <= z_b;
152                        end
153                        OP_H: begin
154                            // Hadamard is multi-cycle (pipelined)
155                            busy   <= 1'b1; // Lock the processor
156                            state  <= STATE_EXEC;
157                        end
158                        OP_NOP: begin
159                            // Do nothing
160                        end
161                    endcase
162                end
163
164                // STATE: EXEC (Wait for Pipeline)
165                STATE_EXEC: begin
166                    // Wait for Hadamard gate to signal valid output
167                    if (h_valid) begin
168                        alpha <= h_a;
169                        beta  <= h_b;
170                        busy   <= 1'b0; // Release lock
171                        state  <= STATE_IDLE;
172                    end
173                end
174            endcase
175        end
176    end
177 endmodule

```

Listing 22: Main Accelerator Module (quantumn.v)

```

1 `timescale 1ns / 1ps
2
3 // Comprehensive Testbench for Single Qubit Hardware Accelerator
4 module tb_single_qubit_accelerator;
5     // Clock and reset
6     reg clk;
7     reg rst;
8     reg [2:0] opcode;
9     // DUT outputs
10    wire signed [15:0] alpha;
11    wire signed [15:0] beta;
12    wire busy;
13    // Test statistics
14    integer test_count;
15    integer pass_count;
16    integer fail_count;
17    // Q8.8 Fixed-Point Format Constants
18    localparam real Q88_SCALE = 256.0; // 2^8 for scaling
19
20    // Quantum Gate Operation Codes (3-bit)
21    localparam OP_NOP  = 3'b000;
22    localparam OP_INIT = 3'b001;

```

```

23  localparam OP_X      = 3'b010;
24  localparam OP_Z      = 3'b011;
25  localparam OP_H      = 3'b100;
26  localparam OP_MEAS   = 3'b101;
27  // Instantiate the DUT
28  single_qubit_accelerator dut (
29      .clk(clk),
30      .rst(rst),
31      .opcode(opcode),
32      .alpha(alpha),
33      .beta(beta),
34      .busy(busy)
35  );
36  // Clock generation: 10ns period (100MHz)
37  initial begin
38      clk = 0;
39      forever #5 clk = ~clk;
40  end
41
42  // Function to convert Q8.8 to real
43  function real q88_to_real;
44      input signed [15:0] q88_value;
45      begin
46          q88_to_real = $itor(q88_value) / Q88_SCALE;
47      end
48  endfunction
49
50  // Function to convert real to Q8.8
51  function signed [15:0] real_to_q88;
52      input real value;
53      begin
54          real_to_q88 = $rtoi(value * Q88_SCALE);
55      end
56  endfunction
57
58  // Task to wait for operation completion
59  task wait_for_ready;
60      begin
61          wait(!busy);
62          @(posedge clk);
63      end
64  endtask
65
66  // Task to apply an operation
67  task apply_op;
68      input [2:0] op;
69      begin
70          @(posedge clk);
71          opcode = op;
72          @(posedge clk);
73          opcode = OP_NOP;
74          wait_for_ready();
75      end
76  endtask
77
78  // Task to check if value is within tolerance
79  task check_value;
80      input real actual;
81      input real expected;
82      input real tolerance;
83      input [200*8:1] test_name;
84      begin
85          test_count = test_count + 1;
86          if ((actual >= expected - tolerance) && (actual <= expected + tolerance)) begin
87              $display("[PASS] %0s: Expected=%.6f, Got=%.6f", test_name, expected, actual);
88              pass_count = pass_count + 1;
89          end else begin
90              $display("[FAIL] %0s: Expected=%.6f, Got=%.6f, Error=%.6f",
91                      test_name, expected, actual, actual - expected);
92              fail_count = fail_count + 1;
93          end
94      end
95  endtask

```

```

96
97 // Task to display current state
98 task display_state;
99   input [200*8:1] label;
100  real alpha_real, beta_real, prob_0, prob_1;
101 begin
102   alpha_real = q88_to_real(alpha);
103   beta_real = q88_to_real(beta);
104   prob_0 = alpha_real * alpha_real;
105   prob_1 = beta_real * beta_real;
106   $display("\n%0s", label);
107   $display("alpha = %.6f (0x%04h), beta = %.6f (0x%04h)",
108           alpha_real, alpha, beta_real, beta);
109   $display("P(|0>) = %.6f, P(|1>) = %.6f, Sum = %.6f",
110           prob_0, prob_1, prob_0 + prob_1);
111 end
112 endtask
113
114 // Test 1: Fixed-Point Multiplier Verification
115 task test_fp_multiplier;
116   reg signed [15:0] a, b, result;
117   real a_real, b_real, expected, actual;
118 begin
119   $display("TEST 1: Fixed-Point Multiplier");
120   // Test case 1.1: 0.5 * 0.5 = 0.25
121   a = real_to_q88(0.5);
122   b = real_to_q88(0.5);
123   #1 result = (a * b) >> 8;
124   // Simulate multiplier behavior
125   actual = q88_to_real(result);
126   check_value(actual, 0.25, 0.01, "0.5 * 0.5");
127
128   // Test case 1.2: 1.0 * 1.0 = 1.0
129   a = real_to_q88(1.0);
130   b = real_to_q88(1.0);
131   #1 result = (a * b) >> 8;
132   actual = q88_to_real(result);
133   check_value(actual, 1.0, 0.01, "1.0 * 1.0");
134
135   // Test case 1.3: 2.0 * 0.5 = 1.0
136   a = real_to_q88(2.0);
137   b = real_to_q88(0.5);
138   #1 result = (a * b) >> 8;
139   actual = q88_to_real(result);
140   check_value(actual, 1.0, 0.01, "2.0 * 0.5");
141
142   // Test case 1.4: -1.0 * 0.5 = -0.5
143   a = real_to_q88(-1.0);
144   b = real_to_q88(0.5);
145   #1 result = (a * b) >> 8;
146   actual = q88_to_real(result);
147   check_value(actual, -0.5, 0.01, "-1.0 * 0.5");
148
149   // Test case 1.5: 1/sqrt(2) * 1/sqrt(2) = 0.5
150   a = 16'h00B5; // 1/sqrt(2) in Q8.8
151   b = 16'h00B5;
152   #1 result = (a * b) >> 8;
153   actual = q88_to_real(result);
154   check_value(actual, 0.5, 0.01, "1/sqrt(2) * 1/sqrt(2)");
155
156   // Test case 1.6: Maximum positive value
157   a = 16'h7FFF; // Max Q8.8
158   b = real_to_q88(0.01);
159   #1 result = (a * b) >> 8;
160   actual = q88_to_real(result);
161   $display("Max value test: %.6f * 0.01 = %.6f", q88_to_real(a), actual);
162 end
163 endtask
164
165 // Test 2: Fixed-Point Adder/Subtractor Verification
166 task test_fp_adder_subtractor;
167   reg signed [15:0] a, b;
168   reg signed [16:0] result_add, result_sub;

```

```

169      real actual;
170
171  begin
172    $display("TEST 2: Fixed-Point Adder/Subtractor");
173    // Test case 2.1: 0.5 + 0.5 = 1.0
174    a = real_to_q88(0.5);
175    b = real_to_q88(0.5);
176    result_add = {a[15], a} + {b[15], b};
177    actual = q88_to_real(result_add[15:0]);
178    check_value(actual, 1.0, 0.01, "0.5 + 0.5");
179    // Test case 2.2: 1.0 - 0.5 = 0.5
180    a = real_to_q88(1.0);
181    b = real_to_q88(0.5);
182    result_sub = {a[15], a} - {b[15], b};
183    actual = q88_to_real(result_sub[15:0]);
184    check_value(actual, 0.5, 0.01, "1.0 - 0.5");
185    // Test case 2.3: -0.5 + 0.5 = 0.0
186    a = real_to_q88(-0.5);
187    b = real_to_q88(0.5);
188    result_add = {a[15], a} + {b[15], b};
189    actual = q88_to_real(result_add[15:0]);
190    check_value(actual, 0.0, 0.01, "-0.5 + 0.5");
191    // Test case 2.4: Overflow detection (1.0 + 1.0)
192    a = real_to_q88(1.0);
193    b = real_to_q88(1.0);
194    result_add = {a[15], a} + {b[15], b};
195    $display("Overflow test: 1.0 + 1.0 = %.6f (17-bit result: 0x%05h)",
196             q88_to_real(result_add[15:0]), result_add);
197    // Test case 2.5: Large subtraction
198    a = real_to_q88(0.1);
199    b = real_to_q88(0.9);
200    result_sub = {a[15], a} - {b[15], b};
201    actual = q88_to_real(result_sub[15:0]);
202    check_value(actual, -0.8, 0.01, "0.1 - 0.9");
203
204
205  // Test 3: Pauli-X Gate (Bit Flip)
206  task test_pauli_x;
207    real alpha_real, beta_real;
208
209  begin
210    $display("TEST 3: Pauli-X Gate");
211    // Initialize to |0>
212    apply_op(OP_INIT);
213    display_state("Initial state |0>");
214    // Apply X gate: |0> -> |1>
215    apply_op(OP_X);
216    display_state("After X gate");
217
218    alpha_real = q88_to_real(alpha);
219    beta_real = q88_to_real(beta);
220    check_value(alpha_real, 0.0, 0.01, "X gate: alpha should be 0");
221    check_value(beta_real, 1.0, 0.01, "X gate: beta should be 1");
222
223    // Apply X again: |1> -> |0>
224    apply_op(OP_X);
225    display_state("After second X gate");
226
227    alpha_real = q88_to_real(alpha);
228    beta_real = q88_to_real(beta);
229    check_value(alpha_real, 1.0, 0.01, "X^2: alpha should be 1");
230    check_value(beta_real, 0.0, 0.01, "X^2: beta should be 0");
231
232
233  // Test 4: Pauli-Z Gate (Phase Flip)
234  task test_pauli_z;
235    real alpha_real, beta_real;
236
237  begin
238    $display("TEST 4: Pauli-Z Gate");
239    // Initialize to |0>
240    apply_op(OP_INIT);
241    display_state("Initial state |0>");
242    // Apply Z gate: should not change |0>

```

```

242     apply_op(OP_Z);
243     display_state("After Z gate on |0>");
244
245     alpha_real = q88_to_real(alpha);
246     beta_real = q88_to_real(beta);
247     check_value(alpha_real, 1.0, 0.01, "Z on |0>: alpha unchanged");
248     check_value(beta_real, 0.0, 0.01, "Z on |0>: beta unchanged");
249
250     // Apply X to get |1>
251     apply_op(OP_X);
252     display_state("After X gate");
253
254     // Apply Z gate: |1> -> -|1>
255     apply_op(OP_Z);
256     display_state("After Z gate on |1>");
257
258     alpha_real = q88_to_real(alpha);
259     beta_real = q88_to_real(beta);
260     check_value(alpha_real, 0.0, 0.01, "Z on |1>: alpha unchanged");
261     check_value(beta_real, -1.0, 0.01, "Z on |1>: beta negated");
262   end
263 endtask
264
265 // Test 5: Hadamard Gate (Superposition Creator)
266 task test_hadamard;
267   real alpha_real, beta_real, expected_val;
268   begin
269     $display("TEST 5: Hadamard Gate");
270     // Initialize to |0>
271     apply_op(OP_INIT);
272     display_state("Initial state |0>");
273     // Apply H gate: |0> -> (|0> + |1>)/sqrt(2)
274     $display("\nApplying Hadamard gate");
275     apply_op(OP_H);
276     display_state("After H gate");
277
278     alpha_real = q88_to_real(alpha);
279     beta_real = q88_to_real(beta);
280     expected_val = 0.707106781; // 1/sqrt(2)
281
282     check_value(alpha_real, expected_val, 0.02, "H on |0>: alpha = 1/sqrt(2)");
283     check_value(beta_real, expected_val, 0.02, "H on |0>: beta = 1/sqrt(2)");
284
285     // Apply H again: Should return to |0> ( $H^2 = I$ )
286     $display("\nApplying Hadamard gate again");
287     apply_op(OP_H);
288     display_state("After  $H^2$  (should be |0>) ");
289
290     alpha_real = q88_to_real(alpha);
291     beta_real = q88_to_real(beta);
292     check_value(alpha_real, 1.0, 0.02, "H^2: alpha should be 1");
293     check_value(beta_real, 0.0, 0.02, "H^2: beta should be 0");
294   end
295 endtask
296
297 // Test 6: Hadamard on |1>
298 task test_hadamard_on_one;
299   real alpha_real, beta_real;
300   begin
301     $display("TEST 6: Hadamard Gate on |1> ");
302     // Initialize to |0> then flip to |1>
303     apply_op(OP_INIT);
304     apply_op(OP_X);
305     display_state("Initial state |1> ");
306
307     // Apply H gate: |1> -> (|0> - |1>)/sqrt(2)
308     apply_op(OP_H);
309     display_state("After H on |1> ");
310
311     alpha_real = q88_to_real(alpha);
312     beta_real = q88_to_real(beta);
313
314     check_value(alpha_real, 0.707106781, 0.02, "H on |1>: alpha = 1/sqrt(2)");

```

```

315     check_value(beta_real, -0.707106781, 0.02, "H on |1>: beta = -1/sqrt(2)");
316   end
317 endtask
318
319 // Main Test Sequence
320 initial begin
321   // Initialize
322   test_count = 0;
323   pass_count = 0;
324   fail_count = 0;
325   rst = 1;
326   opcode = OP_NOP;
327
328   $display("\n");
329   $display(" Single-Qubit Hardware Accelerator - Comprehensive Test");
330   $display("Q8.8 Fixed-Point Format:");
331   $display(" Resolution: 2^-8 = %.8f", 1.0/256.0);
332   $display(" Range: [-128.0, 127.996]");
333
334   // Reset sequence
335   #20;
336   rst = 0;
337   #20;
338   // Run all tests
339   test_fp_multiplier();
340   test_fp_adder_subtractor();
341   test_pauli_x();
342   test_pauli_z();
343   test_hadamard();
344   test_hadamard_on_one();
345   // Final summary
346   $display("\n");
347   $display("TEST SUMMARY");
348   $display("Total Tests: %0d", test_count);
349   $display("Passed:      %0d", pass_count);
350   $display("Failed:      %0d", fail_count);
351   $display("Pass Rate:    %.1f%%", (pass_count * 100.0) / test_count);
352
353   $finish;
354 end
355
356 // Timeout watchdog
357 initial begin
358   #100000;
359   $display("\n[ERROR] Testbench timeout!");
360   $finish;
361 end
362
363 endmodule

```

Listing 23: Accelerator Testbench (quantumn\_testbench.v)

## A.2 Booth Multiplier

```

1 module booth_multiplier_16bit(
2     input clk,
3     input start,
4     input signed [15:0] multiplicand,
5     input signed [15:0] multiplier,
6     output reg signed [31:0] product,
7     output reg done
8 );
9     reg signed [15:0] M;
10    reg signed [15:0] Q;
11    reg signed [15:0] A;
12    reg Qm1;
13    // Q(-1)
14    reg [4:0] count;           // 16 iterations (0-15)
15
16    always @ (posedge clk) begin
17        if (start) begin
18            A      <= 16'd0;
19            M      <= multiplicand;
20            Q      <= multiplier;
21            Qm1   <= 1'b0;
22            count <= 5'd16;
23            done   <= 1'b0;
24        end
25        else if (count > 0) begin
26            case ({Q[0], Qm1})
27                2'b01: A <= A + M; // Add M
28                2'b10: A <= A - M; // Subtract M
29                default: ;          // Do nothing
30            endcase
31
32            // Arithmetic shift right {A, Q, Qm1}
33            Qm1 <= Q[0];
34            Q    <= {A[0], Q[15:1]};
35            A    <= {A[15], A[15:1]};
36
37            count = count - 1;
38        end
39        else begin
40            done    <= 1'b1;
41            product <= {A, Q}; // Final result
42        end
43    end
44 endmodule

```

Listing 24: 16-bit Booth Multiplier (booth\_multiplier\_16bit.v)

```

1 `timescale 1ns/1ps
2
3 module tb_booth_multiplier_16bit;
4
5     reg clk;
6     reg start;
7     reg signed [15:0] multiplicand;
8     reg signed [15:0] multiplier;
9
10    wire signed [31:0] product;
11    wire done;
12
13    // Instantiate DUT (Device Under Test)
14    booth_multiplier_16bit DUT (
15        .clk(clk),
16        .start(start),
17        .multiplicand(multiplicand),
18        .multiplier(multiplier),
19        .product(product),
20        .done(done)
21    );
22    // Clock generation
23    always #5 clk = ~clk; // 10ns period

```

```

24
25    initial begin
26        $display("==== Booth Multiplier 16-bit Testbench Started ===");
27        // Initial values
28        clk = 0;
29        start = 0;
30        multiplicand = 0;
31        multiplier = 0;
32
33        // TEST 1: 5 * 7 = 35
34        #10;
35        multiplicand = 5;
36        multiplier = 7;
37        start = 1;
38        #10 start = 0;
39
40        wait(done == 1);
41        #5;
42        $display("Test 1: %d * %d = %d", 5, 7, product);
43        // TEST 2: (-8) * 6 = -48
44        #20;
45        multiplicand = -8;
46        multiplier = 6;
47        start = 1;
48        #10 start = 0;
49
50        wait(done == 1);
51        #5;
52        $display("Test 2: %d * %d = %d", -8, 6, product);
53        // TEST 3: (-12) * (-9) = 108
54        #20;
55        multiplicand = -12;
56        multiplier = -9;
57        start = 1;
58        #10 start = 0;
59
60        wait(done == 1);
61        #5;
62        $display("Test 3: %d * %d = %d", -12, -9, product);
63        // TEST 4: Max*Max (stress test)
64        #20;
65        multiplicand = 16'shFFFF; // 32767
66        multiplier = 16'sh7FFF; // 32767
67        start = 1;
68        #10 start = 0;
69
70        wait(done == 1);
71        #5;
72        $display("Test 4: %d * %d = %d", multiplicand, multiplier, product);
73
74        $display("==== Testbench Finished ===");
75        $stop;
76    end
77 endmodule

```

Listing 25: Booth Multiplier Testbench (tb\_booth\_multiplier\_16bit.v)

### A.3 Fixed-Point Arithmetic Units

```

1  `timescale 1ns / 1ps
2
3 // Fixed-point ADD unit
4 // RESULT = A + B
5 module add #(
6   parameter WIDTH = 16
7 ) (
8   input wire [WIDTH-1:0] A,
9   input wire [WIDTH-1:0] B,
10  output wire [WIDTH-1:0] RESULT,
11  output wire           COUT    // Carry-out (overflow)
12 );
13 wire [WIDTH:0] sum_ext;
14
15 assign sum_ext = {1'b0, A} + {1'b0, B};
16 assign RESULT = sum_ext[WIDTH-1:0];
17 assign COUT    = sum_ext[WIDTH];
18 endmodule

```

Listing 26: Fixed-Point Adder (add.v)

```

1  `timescale 1ns/1ps
2
3 module tb_add;
4   reg [15:0] A, B;
5   wire [15:0] RESULT;
6   wire        COUT;
7   add tb_add (
8     .A(A),
9     .B(B),
10    .RESULT(RESULT),
11    .COUT(COUT)
12  );
13 initial begin
14   $dumpfile("add.vcd");
15   $dumpvars(0, tb_add);
16
17   A = 16'd10; B = 16'd5;
18   #10;
19   $display("ADD: %0d + %0d = %0d  COUT=%b", A, B, RESULT, COUT);
20
21   A = 16'd0; B = 16'd0;
22   #10;
23   $display("ADD: %0d + %0d = %0d  COUT=%b", A, B, RESULT, COUT);
24
25   $finish;
26 end
27 endmodule

```

Listing 27: Adder Testbench (tb.add.v)

```

1  `timescale 1ns / 1ps
2
3 // Fixed-point SUB unit
4 // RESULT = A - B (two's complement)
5 module sub #(
6   parameter WIDTH = 16
7 ) (
8   input wire [WIDTH-1:0] A,
9   input wire [WIDTH-1:0] B,
10  output wire [WIDTH-1:0] RESULT,
11  output wire           COUT    // 1 = no borrow, 0 = borrow
12 );
13 wire [WIDTH:0] diff_ext;
14
15 assign diff_ext = {1'b0, A} + {1'b0, (~B)} + 1'b1;
16 assign RESULT  = diff_ext[WIDTH-1:0];
17 assign COUT    = diff_ext[WIDTH];
18 endmodule

```

Listing 28: Fixed-Point Subtractor (sub.v)

```
1  `timescale 1ns / 1ps
2
3  module tb_sub;
4      reg [15:0] A, B;
5      wire [15:0] RESULT;
6      wire          COUT;
7
8      sub tb_sub (
9          .A(A),
10         .B(B),
11         .RESULT(RESULT),
12         .COUT(COUT)
13     );
14    initial begin
15        $dumpfile("sub.vcd");
16        $dumpvars(0, tb_sub);
17
18        A = 16'd10; B = 16'd5;
19        #10;
20        $display("SUB: %0d - %0d = %0d  COUT=%b", A, B, RESULT, COUT);
21
22        A = 16'd5; B = 16'd10;
23        #10;
24        $display("SUB: %0d - %0d = %0d  COUT=%b", A, B, RESULT, COUT);
25
26        $finish;
27    end
28 endmodule
```

Listing 29: Subtractor Testbench (tb\_sub.v)

## B Contribution Summary

The successful completion of the Single-Qubit Hardware Accelerator project was the result of a collaborative effort. The specific contributions of each team member are outlined below:

### **Nguyen Son Tung**

Responsible for developing the theoretical foundation of the project. He conducted research on the underlying principles relevant to the system, synthesized key concepts into a coherent framework, and designed the overall project structure. His work ensured that the project had a solid academic grounding and a logical, well-organized architecture.

### **Nguyen Truong Son**

Contributed by implementing the Q8.8 fixed-point format and developing the decoder application. He worked on converting numerical values into the Q8.8 representation and ensured accurate decoding functionalities. His contributions strengthened the project's practical computation and data-handling capabilities.

### **Tran Vu Gia Huy**

Focused on the theoretical explanation of Booth's Algorithm and created the corresponding Verilog demonstration. He provided a clear analysis of how Booth's multiplication algorithm works and implemented a functional hardware-level model. His work bridged theory and practice, allowing the team to validate algorithmic behavior through simulation. Prepare final report for a good format.

### **Tran Dang Khoa**

Responsible for simulating quantum gates. He studied the operational characteristics of fundamental quantum logic gates and developed simulations that visualize their behavior. His contribution added a modern computing perspective and expanded the technical scope of the project.

### **Mai Tran Hung**

Implemented the addition and subtraction modules. He designed the logic for arithmetic operations, tested their correctness, and ensured compatibility with other components of the system. His work formed a crucial part of the computational pipeline within the project.