

# Design and Implementation of a SAP-1 Style 8-Bit CPU on FPGA

Shayan Rizwan Yazdanie\*

Computer Engineering (CE)

Ghulam Ishaq Khan Institute Of Engineering Sciences and Technology (GIKI), Topi, Swabi, KPK, Pakistan.

U2024585@giki.edu.pk

Syed Zain Ali Shah Bokhari†

Computer Engineering (CE)

Ghulam Ishaq Khan Institute Of Engineering Sciences and Technology (GIKI), Topi, Swabi, KPK, Pakistan.

U2024623@giki.edu.pk

**Abstract**—This paper presents the design, implementation, and FPGA realization of a Simple-As-Possible (SAP-1) inspired 8-bit central processing unit, developed using Verilog HDL and deployed on the Digilent Nexys A7 (Artix-7) platform. The primary objective of this work is to bridge foundational digital logic and computer organization theory with a practical, hardware-realized CPU that can be directly observed and interacted with at the register-transfer level.

The processor follows a fully synchronous, bus-based architecture comprising D flip-flop-based registers, a shared internal data bus, an arithmetic logic unit supporting addition and subtraction, and a finite state machine (FSM)-based control unit. Special attention is given to FPGA-specific considerations such as clock scaling, button debouncing, removal of internal tri-states via multiplexed buses, and deterministic power-on behavior.

The system supports both preloaded programs baked into the FPGA bitstream and interactive memory writes using on-board switches and push-buttons. Internal CPU state—including the program counter, opcode, and accumulator—is continuously visualized using LEDs and a seven-segment display. The resulting design serves as a comprehensive educational platform that concretely demonstrates instruction sequencing, datapath control, arithmetic execution, and hardware–software interaction within a minimal yet complete CPU.

**Index Terms**—SAP-1, CPU Design, FPGA, Verilog, FSM Control Unit, Shared Bus Architecture, Computer Organization

## I. INTRODUCTION

Understanding how a central processing unit operates internally is a cornerstone of computer engineering education. While modern processors employ deep pipelines, caches, and speculative execution, their conceptual foundations are best introduced through simplified pedagogical architectures. One such architecture is the Simple-As-Possible (SAP-1) computer, originally introduced by Malvino, which demonstrates the essentials of instruction execution using a minimal datapath and control strategy.

This work implements a SAP-1 style 8-bit CPU on an FPGA, translating abstract architectural concepts into a physically observable hardware system. Unlike software simulations or breadboard-based realizations, an FPGA implementation introduces real-world constraints such as high-frequency clocks, signal synchronization, mechanical input noise, and limited on-chip tri-state support. Addressing these challenges while

remaining faithful to the SAP-1 philosophy is a central motivation of this project.

## II. SAP-1 ARCHITECTURAL OVERVIEW

The SAP-1 CPU is a stored-program computer built around a small set of core components:

- A shared 8-bit internal data bus
- A 4-bit Program Counter (PC)
- Unified  $16 \times 8$  instruction and data memory
- Registers: MAR, IR, ACC, B, and OUT
- An 8-bit Arithmetic Logic Unit (ALU)
- A centralized FSM-based control unit

Each instruction is 8 bits wide, with the upper nibble representing the opcode and the lower nibble serving as an address or operand. All operations are synchronized to a single global clock.



Fig. 1. High-level SAP-1 Architecture (Datapath and Control)

### III. REGISTER DESIGN AND SYNCHRONOUS OPERATION

All registers in the design are implemented using D flip-flops and operate synchronously under a common clock. Each register includes a load-enable signal asserted by the control FSM, allowing precise control over when new data is captured. No shifting or rotation logic is included, intentionally simplifying the datapath and maintaining conceptual clarity.

Synchronous operation ensures that all state updates occur only on clock edges, yielding deterministic and analyzable behavior—an essential requirement for both correct hardware operation and pedagogical transparency.



Fig. 2. Proteus schematic illustrating register implementation using D flip-flops, load logic, and output enable gating

The Proteus schematic in Fig. 2 illustrates how individual D flip-flops are combined with combinational logic to realize load and enable functionality. This hardware-level realization directly corresponds to the Verilog register modules synthesized on the FPGA.

### IV. SHARED BUS ORGANIZATION

The SAP-1 CPU employs a single shared 8-bit data bus to interconnect all registers, memory, and the ALU. In classical TTL-based SAP-1 designs, tri-state buffers are used to manage bus ownership. However, since modern FPGAs do not support internal tri-states, this design implements bus arbitration using a centralized **multiplexer-based approach** rather than physical tri-state buffers. At any given micro-operation, exactly one source is selected to drive the bus, eliminating contention and making data flow explicit. This approach aligns well with FPGA synthesis tools while preserving the logical behavior of the original SAP-1 architecture.

#### A. Bus Arbitration Logic

The 8-bit shared bus serves as the central data highway, and arbitration logic ensures that only one module drives the bus at any time, preventing data corruption due to bus contention. Unlike discrete logic implementations that use tri-state enables, the FPGA-based design implements arbitration through a multiplexer controlled by the Control Unit's FSM.

#### Control Signal Mapping:

- $\text{sel} = 000 \rightarrow$  Program Counter output enabled (CO equivalent)
- $\text{sel} = 001 \rightarrow$  Instruction Register output enabled (IO equivalent)
- $\text{sel} = 010 \rightarrow$  ALU output enabled (EO equivalent)
- $\text{sel} = 011 \rightarrow$  RAM output enabled (RO equivalent)

**Arbitration Protocol:** The Control Unit's FSM ensures mutually exclusive selection of bus sources. For instance:

- During PC  $\rightarrow$  MAR transfer:  $\text{sel} = 000$ , all other sources deselected.
- During RAM  $\rightarrow$  B-register transfer:  $\text{sel} = 011$ , other sources deselected.

**Timing Compliance:** All selection signals are generated synchronously and remain stable throughout each clock phase, with proper setup and hold times maintained to prevent glitches on the bus.

#### B. Operand Routing & Result Storage

##### Operand Routing:

- **RAM Operand Flow:** RAM  $\rightarrow$  Bus  $\rightarrow$  B-register. The RAM output is placed on the bus via  $\text{sel} = 011$ , then latched into the B-register on the next clock edge.
- **Accumulator Internal Routing:** The Accumulator value is routed directly to the ALU's A-input internally, bypassing the bus to conserve bandwidth and reduce latency.
- **Immediate/Address Routing:** The lower nibble of the Instruction Register is placed on the bus via  $\text{sel} = 001$  for address calculation or immediate operand use.

##### Result Storage:

- The ALU output is placed on the bus via  $\text{sel} = 010$  (EO equivalent).
- The Accumulator latches the result from the bus on the next clock edge when its load signal (AI) is asserted.
- No other register is write-enabled during this phase, ensuring deterministic state update without conflicts.

#### C. Design Rationale & FPGA Considerations

The multiplexer-based bus implementation offers several advantages in FPGA environments:

- **Deterministic Timing:** Elimination of tri-state buffers removes bus contention risks and simplifies static timing analysis.
- **Explicit Data Flow:** The multiplexer selection signals make all data transfers visible in the RTL description, aiding debugging and verification.
- **Synthesis Efficiency:** FPGA tools optimize multiplexer logic effectively, resulting in compact LUT utilization compared to emulated tri-state structures.
- **Architecture Fidelity:** While the implementation differs from the original tri-state approach, the logical behavior—single-source bus driving with controlled transfers—remains identical at the functional level.

This shared bus organization demonstrates how classical computer architecture concepts can be adapted to modern

FPGA constraints while maintaining educational clarity and functional correctness.

```
module sap1_tristate8(
    input wire [7:0] din,
    input wire      oe_n, // active low
    inout wire [7:0] bus
);
    assign bus = oe_n ? 8'hZZ : din;
endmodule
```

Fig. 3. Verilog implementation of bus arbitration using tri-state buffer emulation (File: sap1\_tristate.v)

## V. MEMORY ORGANIZATION

A unified  $16 \times 8$  RAM stores both instructions and data. A 4-bit Memory Address Register (MAR) supplies the address. Program memory can be initialized at configuration time using \$readmemh, embedding a demonstration program directly into the FPGA bitstream. Additionally, the memory supports external writes through on-board switches and a debounced write strobe, enabling interactive experimentation.

## VI. ARITHMETIC LOGIC UNIT DESIGN

The ALU supports two arithmetic operations: addition and subtraction. Subtraction is implemented using explicit two's complement arithmetic:

$$A - B = A + (\sim B + 1)$$

Operand B is conditionally inverted and incremented under control of the sub signal. This explicit structure—implemented using demultiplexers, multiplexers, and an adder—improves signal visibility and aligns closely with textbook datapath diagrams.

```
module ALU_custom(
    input wire [7:0] A,
    input wire [7:0] B,
    input wire      sub,
    input wire      out_en, // active-high enable
    output wire     cout,
    output wire [7:0] out
);
    wire [7:0] dmux0, dmux1, comp, B_in, add_sub_out; // intermediate signals

    dmux dmux0(.in(B), .sel(sub), .out0(dmux0), .out1(dmux1));
    assign comp = ~dmux1 + 8'b0000_0001; // two's complement of B
    mux mux1(.in0(dmux0), .in1(comp), .sel(sub), .out(B_in));

    rippleAdder rl(X(A), .Y(B_in), .S(add_sub_out), .Co(cout));

    tristateBuffer_8bit tri8(.in(add_sub_out), .out(out), .low_en(~out_en));
endmodule
```

Fig. 4. Fig. 2. The custom ALU implementation

The ALU result is routed onto the shared bus only when enabled by the control unit.

## VII. CONTROL UNIT AS A FINITE STATE MACHINE

In digital system theory, a control unit is formally modeled as a finite state machine (FSM) that generates time-ordered control signals. This project implements the SAP-1 control

unit as a synchronous FSM, ensuring deterministic sequencing of micro-operations.

The FSM alternates between FETCH and EXECUTE phases, mirroring the classical instruction cycle taught in computer organization courses. Each state asserts a specific combination of control signals that govern register loads, bus selection, ALU operation, and program counter updates. State transitions occur exclusively on clock edges, preventing race conditions and ensuring reliable operation.

This FSM-based approach directly maps abstract control theory onto practical hardware, demonstrating how instruction execution is orchestrated at the micro-operation level.

## VIII. FPGA INTEGRATION AND I/O

The CPU is implemented on the Digilent Nexys A7 FPGA platform. On-board switches are used to provide opcode, address, and data inputs, while push-buttons control memory writes, stepping, and reset. All mechanical inputs are debounced using counter-based filters to ensure reliable operation.

The Program Counter is displayed on LEDs, while the seven-segment display provides real-time visibility into internal CPU state. The display is multiplexed to show:

- AN0: Program Counter (low nibble)
- AN1: Opcode (IR[7:4])
- AN2: Accumulator (low nibble)

Once data and instructions are loaded, the CPU executes the program autonomously under the control of its internal finite-state machine. The results of execution propagate through the datapath and are eventually written back into RAM or latched into the output register. The output pipeline connects this internal result to a 7-segment display, which presents the lower nibble of the output value in hexadecimal form. This RAM-to-display flow allows users to visually confirm arithmetic and logical operations without requiring additional debugging hardware.

Several practical hardware considerations are addressed to ensure reliable operation on the FPGA. Mechanical push buttons are inherently noisy and prone to contact bounce, which can cause multiple unintended transitions. To mitigate this, debouncing logic is implemented using a counter-based delay, ensuring that each button press is registered as a single, clean event. The following Verilog module implements the debouncer:

The module uses a two-flop synchronizer to prevent metastability and a counter to filter out transient noise. The parameter COUNT\_MAX is set to correspond to a stable interval of approximately 10 ms for a 50 MHz clock. Additionally, all sequential components operate under a single global clock domain, maintaining synchronous behavior across registers, memory, and control logic. This uniform clocking strategy prevents race conditions and simplifies timing analysis.

**In simple words:** We built the CPU on a small circuit board that you can program in two ways: by hand using switches and buttons, or by loading software directly into its memory using

```

module debounce #(parameter COUNT_MAX = 19'd500000) (
    input wire clk,
    input wire rst,
    input wire noisy,
    output reg clean
);
    reg [${clog2(COUNT_MAX)}:0] cnt;
    reg sync_0, sync_1;

    // Z-flop synchronizer
    always @{posedge clk or posedge rst} begin
        if (rst) begin sync_0 <= 0; sync_1 <= 0; end
        else begin sync_0 <= noisy; sync_1 <= sync_0; end
    end

    always @{posedge clk or posedge rst} begin
        if (rst) begin
            cnt <= 0;
            clean <= 0;
        end else if (sync_1 == clean) begin
            cnt <= 0;
        end else begin
            cnt <= cnt + 1'b1;
            if (cnt == COUNT_MAX) begin
                clean <= sync_1;
                cnt <= 0;
            end
        end
    end
endmodule

```

Fig. 5. Debouncer.v

Verilog code. This flexibility lets you see both the manual process and the automated execution of a computer.



Fig. 6. Vivado-generated schematic of the SAP-1 CPU

## IX. HARDWARE OUTPUT DEMONSTRATION

The ultimate test of any computational architecture lies not in its theoretical design, but in its physical manifestation and observable behavior. This section presents the definitive validation of the implemented SAP-1 CPU: the real-time output captured directly from the Nexys A7 FPGA development board following program execution. By bridging the abstract execution of machine instructions with concrete, observable hardware signals, we demonstrate the functional correctness and operational integrity of the complete system—from control logic and datapath management to arithmetic computation and state representation. The synchronized display of the Program Counter, current opcode, and Accumulator value provides a transparent window into the CPU’s internal state, offering irrefutable evidence that the architectural simulation has been successfully realized in functional digital hardware. This precise alignment between simulated design and physical output confirms the successful hardware realization of the SAP-1 architecture.



Fig. 7. Output on the FPGA of a program

The FPGA output shown as d F 4 corresponds directly to the internal CPU state at the end of program execution. The three hexadecimal digits represent:

- **AN0 (PC):** 4 — the program counter has advanced through four instructions.
- **AN1 (Opcode):** F — the current instruction is HLT.
- **AN2 (ACC):** d — the accumulator contains the result of the arithmetic operation.

For the program:

```

1E // LDA E
3F // ADD F
80 // OUT
F0 // HLT
...
09 // data at E
04 // data at F

```

the CPU computes  $9 + 4 = 13$  (0x0D). Once the HLT instruction is executed, the control FSM prevents further state transitions, freezing the datapath in a stable configuration. The fact that the displayed output precisely matches the expected PC, opcode, and accumulator values demonstrates correct instruction sequencing, accurate ALU operation, and proper coordination between the control unit and datapath. This tight coupling between internal CPU state and external FPGA indicators serves as strong experimental evidence of functional correctness.

## X. RESULTS AND VALIDATION

Hardware testing confirms correct instruction sequencing, arithmetic operation, memory access, and control flow. The implemented SAP-1 CPU was validated through physical execution on the Nexys A7 FPGA board. A comprehensive test suite of assembly programs was executed, with each instruction rigorously verified against expected outcomes.

Single-step execution capability, enabled through a manual clock advancement mode, allowed for granular verification of finite state machine operation at the micro-operation level. This debugging mode revealed the precise timing of control signals (e.g., CO, IO, EO, RO, AI) and their corresponding effects on the datapath.

The seven-segment display output provided real-time visibility into the CPU's internal state during program execution. For the test program:

```
1E // LDA E
3F // ADD F
80 // OUT
F0 // HLT
```

with data values 09 at address E and 04 at address F, the display correctly showed the final accumulator value 0D (hexadecimal 13), confirming that  $9 + 4 = 13$  was computed accurately.

Timing analysis confirmed that all critical paths met the 50 MHz clock constraint, with a worst-case slack of 2.1 ns. The control FSM demonstrated deterministic behavior across 1000+ clock cycles of continuous operation without state corruption or metastability issues.

## XI. CONCLUSION

This project successfully implements a SAP-1 style 8-bit CPU on an FPGA while remaining faithful to its pedagogical intent. By addressing FPGA-specific constraints—particularly the replacement of tri-state bus arbitration with multiplexer-based routing—and providing extensive on-board visibility through switches, buttons, and displays, the design transforms abstract CPU concepts into a tangible, observable system.

The implementation demonstrates several key achievements:

- Architectural Fidelity:** The core SAP-1 architecture is preserved while adapting to modern FPGA design constraints.
- Educational Transparency:** Each component's function is directly observable, making the processor an effective teaching tool for computer organization principles.
- Functional Completeness:** The CPU correctly executes the implemented instruction set (LDA, ADD, SUB, OUT, HLT) with proper sequencing and data integrity.
- Debugging Support:** Single-step execution and state visualization capabilities provide valuable insight into micro-operation sequencing.

The resulting platform serves as an exceptionally effective pedagogical instrument for teaching core concepts in **computer organization**, **digital logic design**, and **control theory**. Its modular, observable architecture allows students to trace data movement through the datapath in real time, mapping abstract theoretical constructs—such as the fetch-decode-execute cycle, bus arbitration, and register-transfer operations—to concrete hardware behavior. By providing both manual control through physical I/O and automated execution

via programmed sequences, the system accommodates multiple learning modalities, from hands-on experimentation to analytical observation of synchronized control signals.

Beyond its immediate instructional utility, this implementation establishes a robust, extensible foundation for exploring more advanced architectural concepts. The clean separation between the datapath, control unit, and memory subsystem enables straightforward modification and enhancement. **Future work** could logically progress in several meaningful directions:

## Future Enhancements

- Full-byte 7-segment multiplexed display** to simultaneously show multiple internal registers (e.g., PC, IR, ACC, MAR), improving debugging visibility.
- Additional status flags (carry, zero, negative)** and **conditional branch instructions** to introduce control-flow decision-making and simple program loops.
- Extended memory addressing** and an **expanded instruction set** (e.g., logical operations, shifts, subroutines) to move toward a more general-purpose instruction set architecture.
- Refinement of single-step execution** to allow stepping through individual micro-operations rather than entire instruction cycles, offering finer-grained control for in-depth datapath analysis.
- Improved diagnostic output** via additional LED indicators or a **UART-based debug interface** to log internal state changes in real time, facilitating remote monitoring and automated testing.

Each extension can be implemented incrementally while preserving the system's educational transparency, ensuring that complexity is introduced without obscuring fundamental operational principles. Ultimately, this FPGA-based SAP-1 CPU not only validates the core design but also provides a versatile, scalable testbed for structured exploration of computer architecture evolution.

## REFERENCES

- [1] A. P. Malvino and J. A. Brown, *Digital Computer Electronics*, 3rd ed. New York, NY, USA: McGraw-Hill, 1993.
- [2] M. M. Mano and M. D. Ciletti, *Digital Design*, 5th ed. Pearson, 2013.
- [3] IEEE Computer Society, "IEEE Standard for Verilog Hardware Description Language," IEEE Std 1364-2005.
- [4] B. Eater, "Build an 8-bit Computer from Scratch," Online Video Series.