

# RISC-4 ISA Specification

Version 0.1.0  
Pre-Silicon

Jeremy King

January 21, 2026

## **Abstract**

RISC-4 is a 4-bit load/store RISC architecture designed to demonstrate that RISC principles are independent of datapath width. This specification defines the instruction set architecture (ISA), programmer's model, memory organization, and calling conventions for the RISC-4 processor.

This document represents an alternate history: “What if the RISC revolution occurred during the 4-bit era of the early 1970s?”

# Contents

# List of Figures

# List of Tables

# Chapter 1

## Introduction

### 1.1 Overview

RISC-4 is a 4-bit Reduced Instruction Set Computer (RISC) architecture that combines the simplicity of early microprocessors like the Intel 4004 with modern RISC design principles. It serves as both an educational tool and a demonstration that RISC concepts apply at any scale.

### 1.2 Design Principles

The RISC-4 architecture adheres to the following principles:

1. **Fixed-length instructions** – All instructions are 16 bits (4 nibbles)
2. **Load/store architecture** – Only load and store instructions access memory
3. **Simple instruction formats** – Four orthogonal formats: R, I, J, M
4. **Regular encoding** – Consistent field placement across formats
5. **Pipeline-friendly** – Designed for 5-stage pipeline from inception
6. **Compiler-friendly** – Simple, orthogonal instruction set

### 1.3 Key Specifications

| Parameter                 | Value                   |
|---------------------------|-------------------------|
| Datapath width            | 4 bits                  |
| Instruction width         | 16 bits (fixed)         |
| Address space             | 12 bits (4096 nibbles)  |
| General purpose registers | $16 \times 4$ bits      |
| Pipeline stages           | 5 (IF, ID, EX, MEM, WB) |
| Endianness                | Little-endian           |

Table 1.1: RISC-4 Architecture Parameters

### 1.4 Comparison to Historical Architectures

Table 1.2 compares RISC-4 to its historical inspirations.

| Feature           | Intel 4004               | RISC-4                   | RISC-I                     |
|-------------------|--------------------------|--------------------------|----------------------------|
| Year (design)     | 1971                     | 2026                     | 1982                       |
| Datapath          | 4-bit                    | 4-bit                    | 32-bit                     |
| Instruction width | 8-16 bits                | 16 bits                  | 32 bits                    |
| Architecture      | Accumulator              | Load/store               | Load/store                 |
| Registers         | $16 \times 4\text{-bit}$ | $16 \times 4\text{-bit}$ | $138 \times 32\text{-bit}$ |
| Pipeline          | None                     | 5-stage                  | 2-stage                    |
| CPI (average)     | 8.0                      | 1.2                      | 1.3                        |

Table 1.2: Architectural Comparison

# Chapter 2

## Programmer's Model

### 2.1 Register File

RISC-4 provides 16 general-purpose registers, each 4 bits wide. While all registers are architecturally equivalent, software conventions assign specific roles to certain registers.

| Register | Name | Usage (Convention)        |
|----------|------|---------------------------|
| r0       | zero | Hardwired to zero         |
| r1       | ra   | Return address            |
| r2       | a0   | Argument 0 / return value |
| r3       | a1   | Argument 1                |
| r4       | a2   | Argument 2                |
| r5       | a3   | Argument 3                |
| r6       | v0   | Return value              |
| r7       | t0   | Temporary (caller-saved)  |
| r8       | t1   | Temporary (caller-saved)  |
| r9       | t2   | Temporary (caller-saved)  |
| r10      | s0   | Saved (callee-saved)      |
| r11      | s1   | Saved (callee-saved)      |
| r12      | s2   | Saved (callee-saved)      |
| r13      | s3   | Saved (callee-saved)      |
| r14      | sp   | Stack pointer             |
| r15      | fp   | Frame pointer             |

Table 2.1: Register File Organization

### 2.1.1 Register r0 (zero)

Register **r0** is hardwired to the constant value zero:

- Reads always return 0x0
- Writes are silently discarded
- Useful for: comparisons, clearing registers, no-op destinations

### 2.1.2 Value Ranges

Each register holds values in the range:

- Unsigned: 0 to 15
- Signed (two's complement): -8 to +7

## 2.2 Effective Addressing & Register Pairs

To address memory beyond the 4-bit limit of a single register, RISC-4 utilizes **Register Pairing**. When a register is used as a base address for memory operations, the processor implicitly accesses the specified register and its neighbor to form an 8-bit address.

| Pointer Name       | High Nibble (4-bit) | Low Nibble (4-bit) |
|--------------------|---------------------|--------------------|
| Pair 0             | r0                  | r1                 |
| Pair 2             | r2                  | r3                 |
| ...                | ...                 | ...                |
| Frame Pointer (fp) | r12                 | r13                |
| Stack Pointer (sp) | r14                 | r15                |

Table 2.2: Register Pair Organization

### Rules for Pairing:

1. **Alignment:** Base registers in load/store instructions must be even-numbered (e.g., r14, r12).
2. **Formation:**  $\text{Address} = (\text{High\_Reg} \ll 4) \mid \text{Low\_Reg}$ .
3. **Range:** This allows for a Data Address Space of 256 nibbles (0x00 to 0xFF).

# Chapter 3

## Instruction Formats

All RISC-4 instructions are exactly 16 bits wide, divided into four formats based on instruction type.

### 3.1 Format Summary

The RISC-4 ISA uses fixed 16-bit instructions aligned on 16-bit boundaries. The primary opcode is always located in the most significant 4 bits [15:12].

Figure 3.1 illustrates how the 16 bits are utilized across the different instruction types.



Figure 3.1: RISC-4 Instruction Format Summary

#### Field Definitions:

- **Opcode:** Primary instruction operation code [15:12].
- **rd:** Destination register operand [11:8].
- **rs:** Source register 1 operand [7:4] (or [11:8] for store).
- **rt:** Source register 2 operand [3:0] (R-Type only).
- **imm4/offset4:** 4-bit immediate value or memory offset [3:0].
- **base:** Register pair base address pointer [7:4].

- **cond:** Branch condition code [11:8] (replaces rd).
- **offset8:** 8-bit signed branch offset [7:0].
- **target12:** 12-bit absolute jump address [11:0].

# Chapter 4

## Status Flags

### 4.1 Quick Reference: Flag Behavior

| Instruction                 | C   | Z   | Notes                             |
|-----------------------------|-----|-----|-----------------------------------|
| <b>Arithmetic</b>           |     |     |                                   |
| ADD, ADDI                   | Set | Set | C = carry out, Z = result is zero |
| SUB                         | Set | Set | C = borrow occurred               |
| ADC                         | Set | Set | Includes previous carry           |
| SBB                         | Set | Set | Includes previous borrow          |
| NEG                         | Set | Set | C = 0 only when negating zero     |
| <b>Logical</b>              |     |     |                                   |
| AND, ANDI                   | 0   | Set | Logical AND, carry cleared        |
| OR, ORI                     | 0   | Set | Logical OR, carry cleared         |
| XOR                         | 0   | Set | Logical XOR, carry cleared        |
| <b>Comparison</b>           |     |     |                                   |
| SLT, SLTI                   | 0   | Set | Z = comparison is false           |
| <b>Shift</b>                |     |     |                                   |
| SHF                         | Set | Set | C = last bit shifted out          |
| <b>Memory &amp; Control</b> |     |     |                                   |
| LW, SW                      | —   | —   | Flags preserved (not modified)    |
| BEQ, BNE, BCS, BCC          | —   | —   | Flags read but not modified       |
| J                           | —   | —   | Flags preserved                   |

Table 4.1: Flag Update Summary

#### Legend:

- **Set:** Flag updated based on result
- **0:** Flag cleared (forced to 0)
- **—:** Flag preserved (unchanged)

### 4.2 Overview

RISC-4 maintains a 2-bit status flag register that captures the results of arithmetic and logical operations. While pure academic RISC designs avoid condition codes, RISC-4 follows the commercial RISC tradition (ARM, SPARC, PowerPC) of using flags for efficient comparison and branching.

This design choice reflects the historical context: in the 4-bit microprocessor era of the early 1970s, all architectures used condition codes. RISC-4 demonstrates that RISC principles are compatible with flag-based conditional execution.

## 4.3 Flag Register

The status register contains two flags:



### 4.3.1 Carry Flag (C) - Bit 0

The Carry flag indicates unsigned arithmetic overflow or underflow.

#### Set Conditions:

- **Addition (ADD, ADDI, ADC):** Set if result overflows 4 bits (carry out of bit 3)
  - Example:  $0xF + 0x1 = 0x10 \rightarrow$  Result  $0x0$ , C=1
- **Subtraction (SUB, SBB):** Set if borrow occurs (inverted borrow)
  - Example:  $0x3 - 0x5 = -2 \rightarrow$  Result  $0xE$ , C=1
- **Shift (SHF):** Contains the last bit shifted out
  - Shift left: C  $\leftarrow$  original bit 3
  - Shift right: C  $\leftarrow$  original bit 0

#### Used By:

- **BCS (Branch if Carry Set):** Branch if C=1
- **BCC (Branch if Carry Clear):** Branch if C=0
- **ADC (Add with Carry):** Adds previous carry into result
- **SBB (Subtract with Borrow):** Subtracts previous borrow from result

#### Multi-Precision Arithmetic:

The Carry flag enables arithmetic on values larger than 4 bits:

Listing 4.1: 8-bit Addition: r0:r1 = r2:r3 + r4:r5

```

1 ADD r1, r3, r5      # Add low nibbles: r1 = r3 + r5, sets C
2 ADC r0, r2, r4      # Add high nibbles with carry: r0 = r2 + r4 + C

```

### 4.3.2 Zero Flag (Z) - Bit 1

The Zero flag indicates whether the result of an operation is zero.

**Set Conditions:**

- Set (Z=1) if result equals 0x0
- Clear (Z=0) if result is non-zero

**Updated By:**

- All arithmetic operations (ADD, SUB, ADDI, ADC, SBB, NEG)
- All logical operations (AND, OR, XOR, ANDI, ORI)
- Comparison (SLT, SLTI)
- Shift (SHF)

**Used By:**

- **BEQ** (Branch if Equal): Branch if Z=1
- **BNE** (Branch if Not Equal): Branch if Z=0

**Comparison Pattern:**

Testing equality requires a subtraction followed by a zero test:

Listing 4.2: If (r1 == r2) goto equal

```

1  SUB r3, r1, r2      # r3 = r1 - r2, sets Z if equal
2  BEQ equal           # Branch if Z=1 (difference was zero)

```

## 4.4 Flag Update Rules

### 4.4.1 Instructions That Set Flags

| Instruction | C   | Z   | Notes                             |
|-------------|-----|-----|-----------------------------------|
| ADD, ADDI   | Yes | Yes | C = carry out, Z = result is zero |
| SUB         | Yes | Yes | C = borrow, Z = result is zero    |
| AND, ANDI   | No  | Yes | Logical ops don't affect carry    |
| OR, ORI     | No  | Yes |                                   |
| XOR         | No  | Yes |                                   |
| SLT, SLTI   | No  | Yes | Z reflects comparison result      |
| SHF         | Yes | Yes | C = last bit shifted out          |
| ADC, SBB    | Yes | Yes | Extended arithmetic               |
| NEG         | Yes | Yes | C set unless negating zero        |

Table 4.2: Flag Update Behavior

#### 4.4.2 Instructions That Preserve Flags

The following instructions **do not modify** flags:

- **Memory operations:** LW, SW
- **Control flow:** BEQ, BNE, BCS, BCC, J
  - Note: Branches *read* flags but do not modify them

**Critical Implication:** Load/store operations between an ALU instruction and a branch will preserve flags:

Listing 4.3: Flags preserved across load

```

1 SUB r3, r1, r2      # Sets Z flag
2 LW   r4, 0(r14)    # Does NOT change Z flag
3 BEQ equal           # Still uses Z from SUB

```

### 4.5 Comparison Operations

#### 4.5.1 Equality (==, !=)

Listing 4.4: Test if r1 == r2

```

1 SUB r3, r1, r2      # r3 = r1 - r2
2 BEQ equal           # Branch if Z=1 (r1 == r2)
3 BNE not_equal       # Branch if Z=0 (r1 != r2)

```

#### 4.5.2 Unsigned Comparison (<, >=)

For unsigned values (0 to 15):

Listing 4.5: Test if r1 < r2 (unsigned)

```

1 SUB r3, r1, r2      # r3 = r1 - r2
2 BCS less             # Branch if C=1 (borrow occurred, r1 < r2)
3 BCC greater_equal   # Branch if C=0 (no borrow, r1 >= r2)

```

#### 4.5.3 Signed Comparison (<, >=)

For signed values (-8 to +7), use SLT:

Listing 4.6: Test if r1 < r2 (signed)

```

1 SLT r3, r1, r2      # r3 = (r1 < r2) ? 1 : 0
2 BNE less             # Branch if r3 != 0 (comparison true)
3 BEQ greater_equal   # Branch if r3 == 0 (comparison false)

```

**Rationale:** Signed comparison requires checking both the sign bit and magnitude. The SLT instruction implements the full signed comparison logic, outputting a boolean result that can be tested with the zero flag.

### 4.6 Pipeline Considerations

### 4.6.1 Flag Hazards

Since flags are updated by instructions in the Execute (EX) stage but may be needed by branches in the Instruction Decode (ID) stage, a **flag hazard** can occur:

|   | Cycle: | 1  | 2  | 3  | 4   | 5   |                            |
|---|--------|----|----|----|-----|-----|----------------------------|
| 1 | SUB    | IF | ID | EX | MEM | WB  | <- Sets Z flag in cycle 3  |
| 2 | BEQ    |    | IF | ID | EX  | MEM | <- Needs Z flag in cycle 3 |

**Problem:** BEQ in ID stage (cycle 3) needs the Z flag, but SUB only produces it in EX stage (also cycle 3). The flag value must be *forwarded* from the EX stage output to the ID stage input in the same cycle.

### 4.6.2 Solutions

#### Flag Forwarding (Performance)

Forward flag values directly from EX stage to branch evaluation logic in ID stage:

- **Hardware:** 2-bit bypass path (C and Z flags)
- **Delay:** 0 cycles (resolved within cycle)
- **Complexity:** Low (simpler than register forwarding)

#### Pipeline Stall (Simplicity)

Detect flag hazard and insert 1-cycle bubble:

- **Detection:** If (previous instruction sets flags) AND (current instruction is branch)
- **Action:** Stall IF and ID stages for 1 cycle
- **Delay:** 1 cycle per flagged branch

**Compiler Optimization:** Insert independent instruction between comparison and branch:

Listing 4.7: Avoiding stall with instruction scheduling

|   |                |                                   |
|---|----------------|-----------------------------------|
| 1 | SUB r3, r1, r2 | # Sets flags                      |
| 2 | LW r4, 0(r14)  | # Independent load (fills bubble) |
| 3 | BEQ equal      | # Flags ready by this point       |

## 4.7 Design Rationale

### 4.7.1 Why RISC-4 Uses Flags

While some RISC designs (MIPS, RISC-V) avoid condition codes, RISC-4 adopts flags for several reasons:

#### 1. Historical Authenticity

- Every 4-bit and 8-bit microprocessor of the 1970s used flags
- Intel 4004 (1971) had Carry and Accumulator Zero flags
- RISC-4 represents "RISC in 1971" flags were the standard then

#### 2. Commercial RISC Precedent

- ARM (1985): NZCV flags, most successful RISC architecture
- SPARC (1987): Integer condition codes
- PowerPC (1990): Condition register
- All three use flags despite being RISC

#### 3. Code Density

- Saves instructions: SUB + BEQ vs SUB + SLT + BNE
- Critical for 4-bit architecture with limited memory
- Typical comparison requires 2 instructions instead of 3

#### 4. Pipeline Equivalence

- Flag hazards and register hazards have identical solutions
- Both require forwarding or stalling
- No performance disadvantage compared to flagless design

#### 5. Simplicity

- Only 2 flags (C and Z), not 8+ like x86
- Clear update rules: ALU sets, Load/Store preserves
- Intuitive for programmers familiar with assembly

### 4.7.2 What RISC-4 Avoids

RISC-4's flag implementation avoids the complexity criticized in CISC designs:

- **No implicit flag updates:** Load/store don't set flags (unlike x86's MOV)
- **No complex flags:** No overflow (V), parity (P), or auxiliary carry (AC)
- **No every-instruction updates:** Only ALU operations affect flags
- **No condition code register:** Flags are hardware state, not GPRs

This keeps the design simple, regular, and pipeline-friendly — core RISC principles.

# Chapter 5

## Instruction Set Reference

### 5.1 Instruction Summary

Table 5.1 lists all RISC-4 instructions.

| Opcode | Mnemonic | Format | Description                   |
|--------|----------|--------|-------------------------------|
| 0x0    | ADD      | R      | Add                           |
| 0x1    | SUB      | R      | Subtract                      |
| 0x2    | AND      | R      | Bitwise AND                   |
| 0x3    | OR       | R      | Bitwise OR                    |
| 0x4    | XOR      | R      | Bitwise XOR                   |
| 0x5    | SLT      | R      | Set less than                 |
| 0x6    | SHF      | I      | Shift (Left/Right) Immediate  |
| 0x7    | EXT      | R      | Extended Math (ADC, SBB, NEG) |
| 0x8    | ADDI     | I      | Add immediate                 |
| 0x9    | ANDI     | I      | AND immediate                 |
| 0xA    | ORI      | I      | OR immediate                  |
| 0xB    | SLTI     | I      | Set less than immediate       |
| 0xC    | LW       | M      | Load word                     |
| 0xD    | SW       | M      | Store word                    |
| 0xE    | Bxx      | I      | Conditional Branch            |
| 0xF    | J        | J      | Unconditional Jump            |

Table 5.1: RISC-4 Opcode Map

### 5.2 Arithmetic Instructions

### 5.2.1 ADD – Add

```
1 ADD rd, rs, rt
```

| Field     | Value                    |
|-----------|--------------------------|
| Opcode    | 0x0                      |
| Format    | R-type                   |
| Encoding  | 0000 dddd ssss tttt      |
| Operation | $rd \leftarrow rs + rt$  |
| Flags     | C, Z                     |
| Cycles    | 4 (single-cycle impl: 1) |

**Description:** Adds the contents of registers **rs** and **rt**, storing the 4-bit result in **rd**. The carry-out is stored in the **Carry** flag. If result is zero then the **Zero** flag is set.

**Flags:**

- **C**  $\leftarrow$  Carry out from bit 3 (unsigned overflow)
- **Z**  $\leftarrow$  1 if result == 0, else 0

**Examples:**

```
1 ADD r1, r2, r3      # r1 = r2 + r3
2 # If r2 = 0xF, r3 = 0x1: result = 0x0, C = 1, Z = 1
3 # If r2 = 0x7, r3 = 0x2: result = 0x9, C = 0, Z = 0
```

### 5.2.2 SUB – Subtract

```
1 SUB rd, rs, rt
```

| Field     | Value                    |
|-----------|--------------------------|
| Opcode    | 0x1                      |
| Format    | R-type                   |
| Encoding  | 0001 dddd ssss tttt      |
| Operation | $rd \leftarrow rs - rt$  |
| Flags     | C, Z                     |
| Cycles    | 4 (single-cycle impl: 1) |

**Description:** Subtracts the contents of registers **rs** and **rt**, storing the 4-bit result in **rd**. The carry-out is stored in the **Carry** flag. If the result is Zero the **Zero** Flag is set

**Flags:**

- **C**  $\leftarrow$  1 if borrow occurred (unsigned underflow), 0 otherwise
- **Z**  $\leftarrow$  1 if result == 0, else 0

**Examples:**

```
1 SUB r4, r5, r6      # r4 = r5 - r6
2 # If r5 = 0x3, r6 = 0x5: result = 0xE (-2), C = 1, Z = 0
3 # If r5 = 0x8, r6 = 0x8: result = 0x0, C = 0, Z = 1
```

**Note:** The **Z** flag is set when operands are equal, making SUB useful for comparisons.

### 5.2.3 AND – Bitwise AND

```
1 AND rd, rs, rt
```

| Field     | Value                    |
|-----------|--------------------------|
| Opcode    | 0x2                      |
| Format    | R-type                   |
| Encoding  | 0010 dddd ssss tttt      |
| Operation | $rd \leftarrow rs \& rt$ |
| Flags     | Carry, Zero              |
| Cycles    | 4 (single-cycle impl: 1) |

**Description:** Bitwise AND compare operation on registers **rs** and **rt**, results stored in **rd**. **Carry** flag cleared, **Zero** flag set if result == 0

**Flags:**

- **C**  $\leftarrow$  0 (cleared)
- **Z**  $\leftarrow$  1 if result == 0, else 0

**Examples:**

```
1 AND r7, r8, r9      # r7 = r8 & r9
2 # If r8 = 0b1010, r9 = 0b1100: result = 0b1000, C = 0, Z = 0
3 # If r8 = 0b1010, r9 = 0b0101: result = 0b0000, C = 0, Z = 1
```

**Use Case:** Test if specific bits are set:

```
1 AND r1, r2, r0      # Test if r2 is zero
2 BEQ is_zero          # Branch if Z = 1
```

### 5.2.4 OR – Bitwise OR

```
1 OR rd, rs, rt
```

| Field     | Value                    |
|-----------|--------------------------|
| Opcode    | 0x3                      |
| Format    | R-type                   |
| Encoding  | 0011 dddd ssss tttt      |
| Operation | $rd \leftarrow rs   rt$  |
| Flags     | –                        |
| Cycles    | 4 (single-cycle impl: 1) |

**Description:** Bitwise OR compare operation on registers **rs** and **rt**, results store in **rd**. **Carry** flag cleared, **Zero** flag set if result == 0

**Flags:**

- **C**  $\leftarrow$  0 (cleared)
- **Z**  $\leftarrow$  1 if result == 0, else 0

**Examples:**

```
1 OR r2, r3, r0          # r2 = r3 | 0 = r3 (MOV pseudo-op)
2 # Z = 1 if r3 was 0, else Z = 0
3 # C = 0 always
4
5 OR r1, r1, r1          # Explicit way to clear C, update Z
```

### 5.2.5 XOR – Bitwise Exclusive OR

```
1 XOR rd, rs, rt
```

| Field     | Value                        |
|-----------|------------------------------|
| Opcode    | 0x4                          |
| Format    | R-type                       |
| Encoding  | 0100 dddd ssss tttt          |
| Operation | $rd \leftarrow rs \Delta rt$ |
| Flags     | –                            |
| Cycles    | 4 (single-cycle impl: 1)     |

**Description:** Bitwise eXclusive OR compare operation on registers **rs** and **rt**, results store in **rd**.  
**Carry** flag cleared, **Zero** flag set if result == 0

**Flags:**

- C  $\leftarrow$  0 (cleared)
- Z  $\leftarrow$  1 if result == 0, else 0

**Examples:**

```
1 XOR r1, r1          # Clear r1 (result = 0)
2 # r1 = 0, C = 0, Z = 1
3
4 XOR r2, r3, r4      # r2 = r3 ^ r4
5 # Z = 1 if r3 == r4, else Z = 0
6 # C = 0 always
```

**Use Case:** XOR with self is an idiomatic way to clear a register and set Z flag.

### 5.2.6 SLT – Set Less Than (Signed)

```
1 SLT rd, rs, rt
```

| Field     | Value                                                        |
|-----------|--------------------------------------------------------------|
| Opcode    | 0x5                                                          |
| Format    | R-type                                                       |
| Encoding  | 0101 dddd ssss tttt                                          |
| Operation | $rd \leftarrow (rs < rt)?1 : 0$ ( <i>signed comparison</i> ) |
| Flags     | –                                                            |
| Cycles    | 4 (single-cycle impl: 1)                                     |

**Description:** Set Less Than performs a signed comparison of **rs** and **rt** if **rs** is less than **rt** then **rd** is set to 1, if not **rd** is set to 0. **Carry** flag cleared, **Zero** flag set if result == 0

**Flags:**

- **C**  $\leftarrow$  0 (cleared)
- **Z**  $\leftarrow$  1 if result == 0 (comparison false), else 0

**Examples:**

```
1 SLT r1, r2, r3      # r1 = (r2 < r3) ? 1 : 0
2 # If r2 = 0xF (-1), r3 = 0x1 (1): r1 = 1, C = 0, Z = 0
3 # If r2 = 0x5 (5),   r3 = 0x3 (3): r1 = 0, C = 0, Z = 1
4
5 # Use with branch:
6 SLT r1, r2, r3      # Test r2 < r3
7 BNE less_than        # Branch if Z = 0 (comparison true)
8 BEQ greater_equal    # Branch if Z = 1 (comparison false)
```

## 5.3 Shift Instruction

### 5.3.1 SHF – Shift

<sup>1</sup> SHF rd, rs, imm4

| Field     | Value                           |
|-----------|---------------------------------|
| Opcode    | 0x6                             |
| Format    | I-type                          |
| Encoding  | 0110 dddd ssss iii              |
| Operation | $rd \leftarrow shift(rs, imm4)$ |
| Flags     | C, Z                            |
| Cycles    | 4                               |

**Description:** Shifts the value in rs by the amount and direction specified in imm4.

**Immediate Encoding:**

- **Bit 3:** Direction (0 = left, 1 = right)
- **Bits 2-0:** Shift amount (0-7)

**Operation:**

- **Shift Left:**  $rd \leftarrow rs \ll amount$ , fill with 0
- **Shift Right:**  $rd \leftarrow rs \gg amount$ , fill with 0 (logical)

**Flags:**

- **C**  $\leftarrow$  Last bit shifted out
  - Shift left: C  $\leftarrow$  bit 3 of original value (before shift)
  - Shift right: C  $\leftarrow$  bit 0 of original value (before shift)
  - For multi-bit shifts, C contains the last bit shifted out
- **Z**  $\leftarrow$  1 if result == 0, else 0

**Examples:**

```

1 SHF r1, r2, 0x1      # Shift left by 1: r1 = r2 << 1
2 # If r2 = 0b1010: r1 = 0b0100, C = 1, Z = 0
3
4 SHF r1, r2, 0x9      # Shift right by 1: r1 = r2 >> 1
5 # If r2 = 0b1010: r1 = 0b0101, C = 0, Z = 0
6
7 SHF r1, r2, 0xA      # Shift right by 2: r1 = r2 >> 2
8 # If r2 = 0b1100: r1 = 0b0011, C = 0, Z = 0
9
10 SHF r1, r2, 0x4      # Shift left by 4 (clear register)
11 # r1 = 0b0000, C = ?, Z = 1

```

**Use Cases:**

```

1 # Multiply by 2 (shift left 1)
2 SHF r1, r2, 0x1      # r1 = r2 * 2
3
4 # Divide by 2 (shift right 1, unsigned)
5 SHF r1, r2, 0x9      # r1 = r2 / 2
6
7 # Multi-precision shift: shift 8-bit value r0:r1 left
8 SHF r1, r1, 0x1      # Shift low nibble left, C = old bit 3
9 # (Need to implement rotate through carry for high nibble)

```

**Note:** RISC-4 does not have arithmetic right shift (sign-extension). For signed division by 2, additional instructions are needed to handle the sign bit.

## 5.4 Extended Arithmetic Instructions

These instructions share Opcode 7 and use a 2-operand destructive format (**rd** is both source and destination) to allow for a 4-bit function code.

### 5.4.1 ADC – Add with Carry

```
1 ADC rd, rs
```

| Field     | Value                           |
|-----------|---------------------------------|
| Opcode    | 0x7                             |
| Format    | R-type                          |
| Encoding  | 0111 dddd ssss 0000             |
| Operation | $rd \leftarrow rd + rs + Carry$ |
| Flags     | Carry                           |
| Cycles    | 4                               |

**Description:** Adds **rs** and the current **Carry** flag to **rd**. **Zero** flag set if result == **0**. Used for multi-nibble addition.

**Flags:**

- **C**  $\leftarrow$  Carry out from bit 3
- **Z**  $\leftarrow$  1 if result == 0, else 0

**Multi-Precision Example:**

```
1 # 8-bit addition: r0:r1 = r2:r3 + r4:r5
2 ADD r1, r3, r5      # Low nibble: r1 = r3 + r5, sets C
3 ADC r0, r2, r4      # High nibble: r0 = r2 + r4 + C
4 # Final Z flag reflects whether entire 8-bit result is zero
```

### 5.4.2 SBB – Subtract with Borrow

```
1 SBB rd, rs
```

| Field     | Value                           |
|-----------|---------------------------------|
| Opcode    | 0x7                             |
| Format    | R-type                          |
| Encoding  | 0111 dddd ssss 0001             |
| Operation | $rd \leftarrow rd - rs - Carry$ |
| Flags     | Carry                           |
| Cycles    | 4                               |

**Description:** Subtracts **rs** and the Carry flag (borrow) from **rd**. **Zero** flag set if result == 0. Used for multi-nibble subtraction.

**Flags:**

- **C**  $\leftarrow$  1 if borrow occurred, else 0
- **Z**  $\leftarrow$  1 if result == 0, else 0

**Multi-Precision Example:**

```
1 # 8-bit subtraction: r0:r1 = r2:r3 - r4:r5
2 SUB r1, r3, r5      # Low nibble: r1 = r3 - r5, sets C if borrow
3 SBB r0, r2, r4      # High nibble: r0 = r2 - r4 - C
4 # Z flag reflects whether entire 8-bit result is zero
```

### 5.4.3 NEG – Negate (Two’s Complement)

```
1 NEG rd, rs
```

| Field     | Value                  |
|-----------|------------------------|
| Opcode    | 0x7                    |
| Format    | R-type                 |
| Encoding  | 0111 dddd ssss 0010    |
| Operation | $rd \leftarrow 0 - rs$ |
| Flags     | C, Z                   |
| Cycles    | 4                      |

**Description:** Computes the two’s complement of **rs** and stores it in **rd**. (Pseudo-equivalent to **SUB rd, r0, rs**). **Carry** flag set to 0 if operand was zero else **Carry** set to 1. **Zero** flag set if result == 0

**Flags:**

- **C**  $\leftarrow$  0 if operand was zero, else 1
- **Z**  $\leftarrow$  1 if result == 0 (only when negating zero), else 0

**Examples:**

```
1 NEG r1, r2          # r1 = -r2 (two's complement)
2 # If r2 = 0x5 (5): r1 = 0xB (-5), C = 1, Z = 0
3 # If r2 = 0x0 (0):  r1 = 0x0 (0),   C = 0, Z = 1
4 # If r2 = 0x8 (-8): r1 = 0x8 (-8), C = 1, Z = 0 (overflow!)
```

**Note:** Negating -8 produces -8 due to two’s complement overflow in 4-bit signed arithmetic.

## 5.5 Immediate Arithmetic Instructions

### 5.5.1 ADDI – ADD Immediate

```
1 ADDI rd,rs, imm4
```

| Field     | Value                                    |
|-----------|------------------------------------------|
| Opcode    | 0x8                                      |
| Format    | I-type                                   |
| Encoding  | 1000 dddd ssss iiiii                     |
| Operation | $rd \leftarrow rs + sign_{extend}(imm4)$ |
| Flags     | C, Z                                     |
| Cycles    | 4 (single-cycle impl: 1)                 |

**Description:** ADD Immediate adds the sign-extended 4-bit immediate `imm4` to the contents of register `rs`, storing the result in `rd`. This allows for incrementing and decrementing (using negative immediates).

**Flags:**

- **C**  $\leftarrow$  Carry out from bit 3
- **Z**  $\leftarrow$  1 if result == 0, else 0

**Examples:**

```
1 ADDI r1, r2, 5      # r1 = r2 + 5
2 # If r2 = 0xB (11): r1 = 0x0 (0), C = 1, Z = 1
3
4 ADDI r3, r3, -1     # r3 = r3 - 1 (decrement)
5 # If r3 = 0x1: r3 = 0x0, C = 0, Z = 1
6 # If r3 = 0x0: r3 = 0xF (-1), C = 1, Z = 0
```

**Note on Negative Immediates:** Since the immediate field is 4 bits wide, it can represent values from -8 to +7. The processor treats the binary value as a signed two's complement number.

- **Subtraction:** To subtract 1, use the immediate 0xF (0b1111), which represents -1 in two's complement.
- **Calculation:** 0b0011 (3) + 0b1111 (-1) = 0b10010 → Result 0b0010 (2), Carry Set.

### 5.5.2 ANDI – AND Immediate

```
1 ANDI rd, rs, imm4
```

| Field     | Value                                    |
|-----------|------------------------------------------|
| Opcode    | 0x9                                      |
| Format    | I-type                                   |
| Encoding  | 1001 dddd ssss iiiii                     |
| Operation | $rd \leftarrow rs \& zero\_extend(imm4)$ |
| Flags     | C, Z                                     |
| Cycles    | 4                                        |

**Description:** Performs a bitwise AND between register **rs** and the 4-bit immediate, storing the result in **rd**. **Note:** Since the register is 4 bits, the mask **imm4** covers all bits. **Carry** flag is cleared, **Zero** flag is set if result == 0

**Flags:**

- C  $\leftarrow$  0 (cleared)
- Z  $\leftarrow$  1 if result == 0, else 0

**Examples:**

```
1 ANDI r1, r2, 0x1      # Extract bit 0 of r2
2 # If r2 = 0b1010: r1 = 0b0000, C = 0, Z = 1
3 # If r2 = 0b1011: r1 = 0b0001, C = 0, Z = 0
4
5 ANDI r1, r2, 0x8      # Extract bit 3 (sign bit)
6 BNE negative          # Branch if Z = 0 (bit was set)
```

### 5.5.3 ORI – OR Immediate

```
1 ORI rd, rs, imm4
```

| Field     | Value                                      |
|-----------|--------------------------------------------|
| Opcode    | 0xA                                        |
| Format    | I-type                                     |
| Encoding  | 1010 dddd ssss iii                         |
| Operation | $rd \leftarrow rs \mid zero\_extend(imm4)$ |
| Flags     | C, Z                                       |
| Cycles    | 4                                          |

**Description:** Performs a bitwise OR between register **rs** and the 4-bit immediate, storing the result in **rd**. **Carry** flag cleared, **Zero** flag set if result == 0

**Flags:**

- C  $\leftarrow$  0 (cleared)
- Z  $\leftarrow$  1 if result == 0, else 0

**Examples:**

```
1 ORI r1, r0, 0x5      # Load immediate: r1 = 0x5
2 # r1 = 5, C = 0, Z = 0
3
4 ORI r2, r3, 0x8      # Set bit 3 (sign bit)
5 # Z = 0 (result always non-zero since bit 3 set)
```

### 5.5.4 SLTI – Set Less Than Immediate

```
1 SLTI rd, rs, imm4
```

| Field     | Value                                             |
|-----------|---------------------------------------------------|
| Opcode    | 0xB                                               |
| Format    | I-type                                            |
| Encoding  | 1011 dddd ssss iiiii                              |
| Operation | $rd \leftarrow (rs < sign\_extend(imm4)) ? 1 : 0$ |
| Flags     | C, Z                                              |
| Cycles    | 4                                                 |

**Description:** Compares **rs** to the signed 4-bit immediate. If **rs** is strictly smaller, **rd** is set to 1, otherwise 0. **Carry** flag cleared, **Zero** flag set if result == 0

**Flags:**

- C  $\leftarrow$  0 (cleared)
- Z  $\leftarrow$  1 if result == 0 (comparison false), else 0

**Examples:**

```
1 SLTI r1, r2, 5      # r1 = (r2 < 5) ? 1 : 0
2 # If r2 = 0x3 (3): r1 = 1, C = 0, Z = 0 (true)
3 # If r2 = 0x7 (7): r1 = 0, C = 0, Z = 1 (false)
4 # If r2 = 0xF (-1): r1 = 1, C = 0, Z = 0 (true, signed)
```

## 5.6 Memory Instructions

### 5.6.1 LW – Load Word

```
1 LW rd, offset(base)
```

| Field     | Value                                                       |
|-----------|-------------------------------------------------------------|
| Opcode    | 0xC                                                         |
| Format    | M-type                                                      |
| Encoding  | 1100 dddd bbbb oooo                                         |
| Operation | $rd \leftarrow mem[reg\_pair(base) + sign\_extend(offset)]$ |
| Flags     | –                                                           |
| Cycles    | 5                                                           |

**Description:** Loads a 4-bit nibble from data memory into **rd**. **Addressing:** The effective address is calculated by taking the 8-bit value in the register pair specified by **base** (must be even) and adding the sign-extended 4-bit **offset**.

**Example:**

```
1 # Assume r14=0x8, r15=0x0 (Stack Pointer = 0x80)
2 LW r1, 4(r14)      # Address = 0x80 + 4 = 0x84
3                      # r1 gets value at mem[0x84]
```

### 5.6.2 SW – Store Word

```
1 SW rs, offset(base)
```

| Field     | Value                                                       |
|-----------|-------------------------------------------------------------|
| Opcode    | 0xD                                                         |
| Format    | M-type                                                      |
| Encoding  | 1101 ssss bbbb oooo                                         |
| Operation | $mem[reg\_pair(base) + sign\_extend(offset)] \leftarrow rs$ |
| Flags     | –                                                           |
| Cycles    | 5                                                           |

**Description:** Stores the 4-bit value from **rs** into data memory. **Note:** **base** must be an even-numbered register representing a pair.

**Example:**

```
1 # Push r2 to stack
2 # 1. Decrement SP (r14:r15) - Requires multi-step math or software
   convention
3 # 2. Store
4 SW r2, 0(r14)      # mem[sp] = r2
```

## 5.7 Control Flow Instructions

### 5.7.1 Bxx – Conditional Branch

```
1 BEQ offset8    # Branch if Equal (Z=1)
2 BNE offset8    # Branch if Not Equal (Z=0)
3 BCS offset8    # Branch if Carry Set (C=1)
4 BCC offset8    # Branch if Carry Clear (C=0)
```

| Field     | Value                                              |
|-----------|----------------------------------------------------|
| Opcode    | 0xE                                                |
| Format    | I-type                                             |
| Encoding  | 1110 cccc oooooooooo                               |
| Operation | $if(cond)pc \leftarrow pc + sign\_extend(offset8)$ |
| Flags     | –                                                  |
| Cycles    | 3                                                  |

**Description:** Branches relative to the current PC if the condition specified by the 4-bit **cond** field is met.

- **Encoding (cccc):** 0000 (EQ), 0001 (NE), 0010 (CS/LT), 0011 (CC/GE).
- **Offset:** 8-bit signed offset allows jumping forward/backward by 128 nibbles.

### 5.7.2 J – Jump (Unconditional)

```
1 J target12
```

| Field     | Value                           |
|-----------|---------------------------------|
| Opcode    | 0xF                             |
| Format    | J-type                          |
| Encoding  | 1111 0ttt tttt tttt             |
| Operation | $PC \leftarrow \text{target12}$ |
| Flags     | –                               |
| Cycles    | 1                               |

**Description:** Unconditional jump to the 12-bit absolute address specified by `target12`. The full instruction memory address space (0x000-0xFFFF) is reachable.

**Flags:** No flags are modified.

**Examples:**

```
1 J 0x234      # Jump to address 0x234
2 J 0xFFE      # Jump to address 0xFFE (near end of memory)
3
4 # Infinite loop:
5 loop:
6     # ...
7     J loop
```

**Use Cases:**

- Infinite loops
- Far jumps (across pages)
- End of non-returning functions
- Jump tables (with computed addresses)

### 5.7.3 JAL – Jump And Link

```
1 JAL target11
```

| Field     | Value                                                                                                          |
|-----------|----------------------------------------------------------------------------------------------------------------|
| Opcode    | 0xF                                                                                                            |
| Format    | J-type (variant)                                                                                               |
| Encoding  | 1111 ttt tttt tttt                                                                                             |
| Operation | $r1 \leftarrow PC[11 : 8], r2 \leftarrow PC[7 : 4], r3 \leftarrow PC[3 : 0], PC \leftarrow \{1'b0, target11\}$ |
| Flags     | –                                                                                                              |
| Cycles    | 1                                                                                                              |

**Description:** Jump and Link performs an unconditional jump to the 11-bit target address while saving the return address across three registers ( $r1:r2:r3$ ). This instruction is used for function calls.

**Register Convention:** The return address is split across three 4-bit registers:

- $r1$  receives  $PC[11:8]$  (high nibble)
- $r2$  receives  $PC[7:4]$  (middle nibble)
- $r3$  receives  $PC[3:0]$  (low nibble)

**Address Range:** JAL can reach addresses 0x000-0x7FF (2048 nibbles, 1024 bytes). For functions beyond this range, use J (unconditional jump) or reorganize code placement.

**Flags:** No flags are modified.

**Examples:**

```
1 # Simple function call:
2 main:
3     ORI r4, r0, 5          # Argument in r4
4     JAL add_one           # Call function, r1:r2:r3 <- return address
5     # Execution continues here after return
6
7 add_one:
8     ADDI r4, r4, 1         # r4 = r4 + 1
9     JR r1                  # Return (see JR instruction)
```

**Nested Function Call Example:**

```
1 caller:
2     JAL func_a             # r1:r2:r3 <- return address (e.g., 0x010)
3
4 func_a:
5     # Save return address to stack before calling another function
6     ADDI r15, r15, -3      # Allocate stack space
7     BCC no_borrow
8     ADDI r14, r14, -1
9 no_borrow:
10    SW r1, 0(r14)          # Save r1 (high nibble)
11    SW r2, 1(r14)          # Save r2 (mid nibble)
12    SW r3, 2(r14)          # Save r3 (low nibble)
13
```

```

14  JAL  func_b          # Call another function (overwrites r1:r2:r3)
15
16  # Restore return address
17  LW   r1, 0(r14)
18  LW   r2, 1(r14)
19  LW   r3, 2(r14)
20  ADDI r15, r15, 3
21  BCC no_carry
22  ADDI r14, r14, 1
23 no_carry:
24
25  JR   r1              # Return to original caller

```

**Pipeline Behavior:**

JAL executes in the ID stage:

- **Cycle 1 (IF):** Fetch JAL instruction
- **Cycle 2 (ID):**
  - Decode JAL
  - Write PC+1 to r1:r2:r3 (three register writes)
  - Update PC to target
  - Flush IF stage (wrong instruction was speculatively fetched)
- **Cycle 3 (IF):** Fetch instruction at target address

Total latency: 2 cycles (1 for JAL + 1 for branch penalty)

**Design Notes:**

- JAL performs three simultaneous register writes. Implementation must support writing r1, r2, and r3 in the same cycle, or execute JAL over multiple cycles.
- The target address has bit [11] forced to 0, limiting range to the lower half of instruction memory.
- For calls to addresses beyond 0x7FF, use a two-instruction sequence: J to the target, then save return address manually.

### 5.7.4 JR – Jump Register

```
1 JR r1
```

| Field     | Value                          |
|-----------|--------------------------------|
| Opcode    | 0x7                            |
| Format    | R-type (Extended)              |
| Encoding  | 0111 xxxx xxxx 0011            |
| Operation | $PC \leftarrow \{r1, r2, r3\}$ |
| Flags     | –                              |
| Cycles    | 1                              |

**Description:** Jump Register performs an indirect jump to the 12-bit address formed by concatenating registers r1, r2, and r3. This instruction is used to return from function calls.

**Register Convention:** JR always reads from r1:r2:r3 (the return address registers). The encoding fields for rd and rs are unused (set to 0 by assembler).

#### Address Formation:

```
1 PC <- {r1[3:0], r2[3:0], r3[3:0]}
2      |       |
3      High     Mid      Low
```

**Flags:** No flags are modified.

#### Examples:

```
1 # Simple return from function:
2 add_one:
3     ADDI r4, r4, 1          # Do work
4     JR    r1                 # Return to caller
5
6 # Return after nested call (r1:r2:r3 restored from stack):
7 func_a:
8     # ... work ...
9     LW    r1, 0(r14)        # Restore return address from stack
10    LW    r2, 1(r14)
11    LW    r3, 2(r14)
12    JR    r1                 # Return
```

#### Pipeline Behavior:

JR executes in the ID stage with dedicated return address path:

- **Cycle 1 (IF):** Fetch JR instruction
- **Cycle 2 (ID):**
  - Decode JR
  - Read r1, r2, r3 via dedicated 12-bit concatenation path
  - Update PC to {r1, r2, r3}
  - Flush IF stage
- **Cycle 3 (IF):** Fetch instruction at return address

Total latency: 2 cycles (1 for JR + 1 for branch penalty)

#### Hazard Handling:

If a previous instruction modifies r1, r2, or r3, forwarding logic provides the updated values:

```

1 # Hazard example:
2 ADD r1, r4, r5      # r1 written in WB stage
3 JR  r1              # r1 needed in ID stage
4 # Forwarding resolves this with 0 stalls

```

The dedicated return address path includes forwarding muxes for r1, r2, and r3 from the EX/MEM and MEM/WB pipeline registers.

#### Design Notes:

- JR uses a dedicated 12-bit read path that always accesses r1:r2:r3, separate from the normal register file read ports.
- This dedicated path allows JR to execute in a single cycle without requiring a 3-port register file.
- Hardware cost: 50 gates (ASIC) or 10 LUTs (FPGA) for the concatenation and forwarding logic.

# Chapter 6

## Pipeline Architecture

### 6.1 Overview

RISC-4 implements a classic 5-stage RISC pipeline designed for simplicity and efficiency. The pipeline enables an ideal throughput of one instruction per cycle (CPI = 1.0) while maintaining a straightforward control path suitable for educational implementation and FPGA synthesis.

### 6.2 Pipeline Stages

#### 6.2.1 Stage 1: Instruction Fetch (IF)

**Duration:** 1 cycle

**Operations:**

- Fetch 16-bit instruction from instruction memory at address PC
- Increment PC by 1 (RISC-4 uses nibble addressing, instructions are 4 nibbles = 16 bits)
- Forward instruction to IF/ID pipeline register

**Critical Path:**

```
1 PC \rightarrow Instruction Memory \rightarrow Instruction Register \
    \rightarrow IF/ID Register
```

**Stall Conditions:**

- Instruction memory not ready (cache miss in advanced implementations)
- Pipeline stall propagated from later stages

### 6.2.2 Stage 2: Instruction Decode (ID)

**Duration:** 1 cycle

**Operations:**

- Decode opcode [15:12] to determine instruction type
- Read up to two source registers from register file
- Sign-extend or zero-extend immediate values
- Generate control signals for subsequent stages
- Evaluate branch conditions (for conditional branches)

**Register File Access:**

```

1 Read Port 1: rs [7:4]
2 Read Port 2: rt [3:0] (R-type) or imm4 [3:0] (I-type)

```

**Branch Evaluation:**

- For conditional branches (BEQ, BNE, BCS, BCC), condition is checked in ID stage
- Branch target address calculated: PC + sign\_extend(offset8)
- If branch taken, flush IF stage and redirect PC

**Stall Conditions:**

- Load-use hazard: Previous instruction is LW and current instruction uses the loaded register
- Flag hazard: Previous instruction sets flags and current instruction is a conditional branch

### 6.2.3 Stage 3: Execute (EX)

**Duration:** 1 cycle

**Operations:**

- Perform ALU operation (ADD, SUB, AND, OR, XOR, SLT, shifts)
- Calculate memory address for load/store (base + offset)
- Update Carry (C) and Zero (Z) flags based on ALU result
- Forward result to ID stage (for forwarding) and to EX/MEM register

**ALU Operations:**

| Instruction Type                | ALU Operation                  |
|---------------------------------|--------------------------------|
| Arithmetic (ADD, SUB, ADC, SBB) | 4-bit adder/subtractor         |
| Logical (AND, OR, XOR)          | Bitwise operations             |
| Comparison (SLT, SLTI)          | Signed comparison              |
| Shift (SHF)                     | Barrel shifter (0-7 positions) |
| Memory (LW, SW)                 | Address calculation            |

**Flag Generation:**

- Carry flag: Set by arithmetic/shift operations
- Zero flag: Set if ALU result == 0x0
- Flags forwarded to ID stage for immediate use by branches

### 6.2.4 Stage 4: Memory Access (MEM)

**Duration:** 1 cycle

**Operations:**

- Load: Read 4-bit nibble from data memory at calculated address
- Store: Write 4-bit nibble to data memory at calculated address
- Non-memory instructions: Pass ALU result through unchanged

**Memory Interface:**

```
1 Address: 8-bit (from register pair + offset)
2 Data: 4-bit nibble
3 Control: Read enable, Write enable
```

**Memory Map:**

- Data memory: 256 nibbles (0x00 - 0xFF)
- Accessed via 8-bit addresses formed by register pairs

### 6.2.5 Stage 5: Write Back (WB)

**Duration:** 1 cycle

**Operations:**

- Select result source (ALU result or memory data)
- Write result to destination register rd
- Register r0 writes are ignored (hardwired to zero)

**Write Sources:**

| Instruction Type           | Write Source               |
|----------------------------|----------------------------|
| Arithmetic/Logical         | ALU result from EX stage   |
| Load (LW)                  | Memory data from MEM stage |
| Store (SW), Branches, Jump | No write (rd unused)       |

## 6.3 Pipeline Registers

Data is passed between stages via pipeline registers:

| Register | Contents                                                               |
|----------|------------------------------------------------------------------------|
| IF/ID    | Instruction[15:0], PC+1                                                |
| ID/EX    | Control signals, Register values (rs, rt), Immediate, PC+1, rd address |
| EX/MEM   | Control signals, ALU result, rt value (for stores), rd address, Flags  |
| MEM/WB   | Control signals, Memory data or ALU result, rd address                 |

## 6.4 Hazards and Solutions

### 6.4.1 Data Hazards

#### RAW (Read After Write) Hazards

**Problem:** Instruction needs a register that hasn't been written yet.

**Example:**

```

1 ADD r3, r1, r2    # r3 written in WB stage (cycle 5)
2 SUB r4, r3, r5    # r3 needed in EX stage (cycle 3)
3                      # Hazard: r3 not ready when needed

```

#### Solution 1: Forwarding (Bypassing)

Forward data from later pipeline stages to earlier stages:

- **EX-to-EX forwarding:** Forward ALU result from EX/MEM register back to ALU inputs
- **MEM-to-EX forwarding:** Forward memory result from MEM/WB register to ALU inputs

#### Forwarding Paths:

```

1. EX/MEM.ALUResult $→ ID/EX (ALU operand)
2. MEM/WB.Result $→ ID/EX (ALU operand)

```

#### Hardware Cost:

- 2 multiplexers per ALU input (4 total)
- Comparators to detect hazard conditions
- Minimal: 50 gates

### Load-Use Hazard

**Problem:** Load instruction followed immediately by use of loaded data.

**Example:**

```

1 LW r1, 0(r14)    # r1 available in MEM stage (cycle 4)
2 ADD r3, r1, r2    # r1 needed in EX stage (cycle 3)
3                      # Hazard: r1 not ready - 1 cycle gap

```

#### Solution: 1-Cycle Stall

- Detect hazard in ID stage: (previous instruction is LW) AND (LW.rd == current.rs or current.rt)
- Insert 1-cycle bubble (NOP) into pipeline
- Allow load to complete before use
- Performance: CPI increases by 0.1 (assuming 10% loads with immediate use)

#### Compiler Optimization:

Insert independent instruction between load and use:

```

1 LW r1, 0(r14)    # Load r1
2 ORI r4, r0, 5      # Independent operation (fills bubble)
3 ADD r3, r1, r2    # r1 now ready (no stall)

```

## Flag Hazards

**Problem:** Conditional branch needs flags that haven't been computed yet.

**Example:**

```

1 ADD r1, r2, r3    # Sets C and Z flags in EX stage (cycle 3)
2 BEQ target        # Needs Z flag in ID stage (cycle 3)
3                      # Hazard: flags not ready

```

### Solution 1: Flag Forwarding

Forward flags from EX stage to branch evaluation logic in ID stage:

- Hardware: 2-bit bypass path (C and Z flags)
- Timing: Flags available same cycle they're computed
- Delay: 0 cycles (resolved within cycle)

### Solution 2: 1-Cycle Stall

- Detect hazard: (previous instruction sets flags) AND (current instruction is conditional branch)
- Insert 1-cycle stall
- Simpler hardware than forwarding
- Performance cost: 0.05 CPI increase

## 6.4.2 Control Hazards

**Problem:** Branch outcome not known until ID stage, but next instruction already fetched.

### Branch Penalty

**Timeline:**

| Cycle: | 1  | 2  | 3  | 4   | 5  |                                        |
|--------|----|----|----|-----|----|----------------------------------------|
| BEQ    | IF | ID | EX | MEM | WB | <- Branch taken/not decided in ID      |
| next   | IF | XX | -- | --  | -- | <- Fetched but may be wrong            |
|        | -- | IF | -- | --  | -- | <- Correct instruction if branch taken |

**Penalty:** 1 cycle for taken branches (IF stage fetched wrong instruction)

### Solution 1: Branch Prediction (Static)

**Predict Not Taken:**

- Assume all branches fall through
- Continue fetching sequentially
- If branch actually taken: Flush IF stage, redirect PC
- Cost: 1 cycle penalty only when branch taken

**Performance:**

- Best case (branch not taken): 0 penalty
- Worst case (branch taken): 1 cycle penalty
- Average: 0.5 cycles per branch (assuming 50% taken rate)

## Solution 2: Branch Delay Slot

### Architecture Change:

- Instruction immediately after branch **always** executes
- Compiler fills delay slot with useful instruction or NOP
- No penalty if slot filled productively

### Example:

```

1   SUB r3, r1, r2    # Sets flags
2   BEQ equal        # Branch
3   ADD r4, r5, r6    # DELAY SLOT: Always executes
equal:
5   ...

```

**RISC-4 Decision:** Not implemented in v0.1 (simpler pipeline), but reserved for future versions.

## 6.5 Performance Analysis

### 6.5.1 Ideal Performance

- **Ideal CPI:** 1.0 (one instruction per cycle)
- **Clock frequency:** 100 MHz (estimated for SKY130 process)
- **Ideal throughput:** 100 MIPS

### 6.5.2 Realistic Performance

#### Hazard Impact:

| Hazard Type                   | Frequency                      | CPI Impact    |
|-------------------------------|--------------------------------|---------------|
| Load-use stall                | 10% of instructions            | +0.10         |
| Branch misprediction          | 15% of instructions, 50% taken | +0.075        |
| Flag stall (if no forwarding) | 5% of instructions             | +0.05         |
| <b>Total</b>                  |                                | <b>+0.225</b> |

**Realistic CPI:** 1.2 to 1.3

#### Comparison:

- Intel 4004: CPI = 8.0 (8-16 cycles per instruction)
- RISC-I: CPI = 1.3
- RISC-4: CPI = 1.2 - 1.3

**Speedup vs. 4004:** 6× to 7× at same clock frequency

### 6.5.3 Optimization Opportunities

#### Compiler Techniques:

- Instruction scheduling to avoid load-use hazards
- Loop unrolling to reduce branch frequency
- Register allocation to minimize memory access

#### Hardware Enhancements (Future):

- Branch delay slots (reduce branch penalty to 0)
- Dynamic branch prediction (reduce misprediction rate)
- Dual-ported register file (eliminate some forwarding delays)

## 6.6 Implementation Considerations

### 6.6.1 Critical Paths

Longest paths in each stage:

| Stage | Critical Path                                                   |
|-------|-----------------------------------------------------------------|
| IF    | PC → Instruction Memory → IF/ID Register                        |
| ID    | Register File Read → Sign Extend → ID/EX Register               |
| EX    | ALU Input Mux → 4-bit Adder → Flag Generation → EX/MEM Register |
| MEM   | Address → Data Memory → MEM/WB Register                         |
| WB    | Result Mux → Register File Write                                |

**Design Goal:** Balance stage delays to maximize clock frequency.

### 6.6.2 Forwarding Logic

#### EX Hazard Detection:

```

1 if (EX/MEM.RegWrite && EX/MEM.rd != 0 && EX/MEM.rd == ID/EX.rs)
2     ForwardA = 01 // Forward from EX/MEM
3 if (EX/MEM.RegWrite && EX/MEM.rd != 0 && EX/MEM.rd == ID/EX.rt)
4     ForwardB = 01 // Forward from EX/MEM

```

#### MEM Hazard Detection:

```

1 if (MEM/WB.RegWrite && MEM/WB.rd != 0 && MEM/WB.rd == ID/EX.rs)
2     ForwardA = 10 // Forward from MEM/WB
3 if (MEM/WB.RegWrite && MEM/WB.rd != 0 && MEM/WB.rd == ID/EX.rt)
4     ForwardB = 10 // Forward from MEM/WB

```

### 6.6.3 Stall Logic

#### Load-Use Hazard Detection:

```

1 if (ID/EX.MemRead &&
2     ((ID/EX.rd == IF/ID.rs) || (ID/EX.rd == IF/ID.rt)))
3     Stall = 1 // Insert bubble, freeze IF and ID stages

```

## 6.7 Design Rationale

### 6.7.1 Why 5 Stages?

- **Classic RISC:** Proven design from MIPS, ARM
- **Balanced stages:** Each stage equal delay
- **Simple control:** Straightforward forwarding and hazard detection
- **Educational:** Easy to understand and implement

### 6.7.2 Alternatives Considered

#### 3-Stage Pipeline (IF, EX, WB):

- Simpler, fewer hazards
- Longer critical path (lower clock frequency)
- Less instructive (too simple for learning)

#### 7-Stage Pipeline (split EX into multiple stages):

- Higher clock frequency potential
- More forwarding complexity
- Overkill for 4-bit datapath

**Decision:** 5 stages balances performance, complexity, and educational value.

# Chapter 7

## Addressing

### 7.1 Overview

RISC-4 faces a fundamental challenge: a 4-bit datapath cannot directly address more than 16 locations ( $2^4$ ), yet the architecture requires access to:

- **Instruction memory:** 4096 nibbles (12-bit address space)
- **Data memory:** 256 nibbles (8-bit address space)

To bridge this gap, RISC-4 employs **Register Pairing**, where consecutive registers combine to form wider addresses. This approach maintains the simplicity of a 4-bit datapath while enabling practical memory access.

### 7.2 Memory Organization

#### 7.2.1 Memory Map

RISC-4 uses separate instruction and data memories (Harvard architecture):

| Memory Type              | Address Range  | Size                      |
|--------------------------|----------------|---------------------------|
| Instruction Memory (ROM) | 0x000 - 0xFFFF | 4096 nibbles (2048 bytes) |
| Data Memory (RAM)        | 0x00 - 0xFF    | 256 nibbles (128 bytes)   |

Table 7.1: RISC-4 Memory Map

#### 7.2.2 Address Formation

##### Instruction Addresses:

- Formed by 12-bit Program Counter (PC)
- Automatically incremented by fetch unit
- Modified by jumps and branches

##### Data Addresses:

- Formed by combining two consecutive 4-bit registers
- Explicitly specified in load/store instructions
- Calculated as:  $\text{Address} = (\text{High\_Reg} \ll 4) \mid \text{Low\_Reg}$

## 7.3 Register Pairs

### 7.3.1 Pair Formation Rules

Register pairs combine two consecutive registers to form an 8-bit value:

1. **Base register must be even-numbered** (r0, r2, r4, ..., r14)
2. **High nibble:** Base register (e.g., r14)
3. **Low nibble:** Base register + 1 (e.g., r15)
4. **Formation:** Value =  $(r[n] \ll 4) | r[n+1]$

### 7.3.2 Standard Register Pairs

| Pair    | High Nibble | Low Nibble | Conventional Use                    |
|---------|-------------|------------|-------------------------------------|
| r0:r1   | r0          | r1         | General purpose (always reads 0x0n) |
| r2:r3   | r2          | r3         | General purpose / argument pointer  |
| r4:r5   | r4          | r5         | General purpose                     |
| r6:r7   | r6          | r7         | General purpose                     |
| r8:r9   | r8          | r9         | General purpose                     |
| r10:r11 | r10         | r11        | General purpose                     |
| r12:r13 | r12         | r13        | Frame pointer (fp)                  |
| r14:r15 | r14         | r15        | Stack pointer (sp)                  |

Table 7.2: Register Pair Organization

**Note:** Pair r0:r1 always has high nibble = 0 (since r0 is hardwired to zero), limiting it to addresses 0x00-0x0F.

### 7.3.3 Pair Manipulation

#### Loading a Constant Address

To set a register pair to a specific address (e.g., 0x84):

```

1 ORI r14, r0, 0x8      # r14 = 0x8 (high nibble)
2 ORI r15, r0, 0x4      # r15 = 0x4 (low nibble)
3                         # r14:r15 now = 0x84

```

#### Incrementing a Pointer

To increment r14:r15 by 1:

```

1 ADDI r15, r15, 1      # r15 = r15 + 1
2 # If no carry (r15 didn't overflow):
3 #   Done - pointer incremented
4 # If carry (r15 overflowed from 0xF to 0x0):
5 #   Need to increment high nibble:
6 ADDI r14, r14, 0      # Add 0 to propagate previous carry
7 # Or use ADC if carry is set:
8 # ADC r14, r0          # r14 = r14 + 0 + Carry

```

#### Proper implementation with carry handling:

```

1 ADDI r15, r15, 1      # Increment low nibble, sets C if overflow
2 # Check if we need to increment high nibble:
3 # Option 1: Conditional (if branches supported):
4 BCC no_carry          # Skip if no carry
5 ADDI r14, r14, 1      # Increment high nibble
6 no_carry:
7     # Continue...
8
9 # Option 2: Always add carry (simpler):
10 ADC r14, r0, r0       # r14 = r14 + 0 + C (adds carry unconditionally)

```

#### Decrementing a Pointer

To decrement r14:r15 by 1:

```

1 ADDI r15, r15, -1     # r15 = r15 - 1, sets C if borrow
2 # If borrow (r15 underflowed from 0x0 to 0xF):
3 BCC no_borrow         # Skip if no borrow
4 ADDI r14, r14, -1     # Decrement high nibble
5 no_borrow:
6     # Continue...

```

#### Adding an Offset to a Pointer

To add a multi-nibble offset (e.g., 0x12 = 18 decimal) to r14:r15:

```

1 # Add 0x12 to r14:r15
2 ADDI r15, r15, 0x2    # Add low nibble (2), sets C if overflow
3 ADC r14, r14, 0x1     # Add high nibble (1) + carry
4                         # Result: r14:r15 = original + 0x12

```

## 7.4 Load/Store Addressing Modes

### 7.4.1 Base + Offset Addressing

RISC-4 load/store instructions use base-plus-offset addressing:

```
1 LW rd, offset(base)
2 SW rs, offset(base)
```

#### Address Calculation:

```
1 Effective Address = reg_pair(base) + sign_extend(offset)
```

Where:

- **base**: Even-numbered register (specifies register pair)
- **offset**: 4-bit signed immediate (-8 to +7)
- **reg\_pair(base)**: 8-bit value from registers **base** and **base+1**

### 7.4.2 Addressing Examples

#### Stack Access

Assuming stack pointer sp (r14:r15) = 0x90:

```

1  LW  r1, 0(r14)      # Load from [0x90 + 0] = 0x90
2  LW  r2, 1(r14)      # Load from [0x90 + 1] = 0x91
3  LW  r3, -1(r14)     # Load from [0x90 + (-1)] = 0x8F
4  SW  r4, 2(r14)      # Store to [0x90 + 2] = 0x92

```

#### Array Access

Accessing elements of an array starting at 0x40:

```

1  # Set up base pointer to array start (0x40)
2  ORI r10, r0, 0x4        # r10 = 0x4 (high nibble)
3  ORI r11, r0, 0x0        # r11 = 0x0 (low nibble)
4                  # r10:r11 = 0x40
5
6  # Access array elements
7  LW  r1, 0(r10)         # array[0] from 0x40
8  LW  r2, 1(r10)         # array[1] from 0x41
9  LW  r3, 2(r10)         # array[2] from 0x42
10
11 # Access with negative offset (before base)
12 LW  r4, -1(r10)        # Address 0x3F (before array)

```

#### Structure Field Access

Accessing fields in a structure at address 0x60:

```

1  # Structure layout:
2  #   +0: status (1 nibble)
3  #   +1: value  (1 nibble)
4  #   +2: flags   (1 nibble)
5
6  ORI r8, r0, 0x6        # r8 = 0x6
7  ORI r9, r0, 0x0        # r9 = 0x0
8                  # r8:r9 = 0x60 (structure base)
9
10 LW  r1, 0(r8)          # Load status from 0x60
11 LW  r2, 1(r8)          # Load value from 0x61
12 LW  r3, 2(r8)          # Load flags from 0x62
13
14 # Modify a field
15 ADDI r2, r2, 1          # Increment value
16 SW  r2, 1(r8)          # Store back to 0x61

```

## 7.5 Multi-Precision Arithmetic

Register pairs enable arithmetic on values larger than 4 bits by chaining operations through the Carry flag.

### 7.5.1 8-Bit Addition

Add two 8-bit values:  $(r0:r1) = (r2:r3) + (r4:r5)$

```

1 ADD r1, r3, r5      # Add low nibbles: r1 = r3 + r5
2                                     # Sets C if sum >= 16
3 ADC r0, r2, r4      # Add high nibbles with carry:
4                                     # r0 = r2 + r4 + C
5 # Result: r0:r1 = (r2:r3) + (r4:r5)

```

**Example:**

```

1 r2:r3 = 0x9F (159 decimal)
2 r4:r5 = 0x23 (35 decimal)
3
4 Step 1: r1 = r3 + r5 = 0xF + 0x3 = 0x12 $\rightarrow$ r1 = 0x2, C = 1
5 Step 2: r0 = r2 + r4 + C = 0x9 + 0x2 + 1 = 0xC
6 Result: r0:r1 = 0xC2 (194 decimal) $\checkmark$

```

### 7.5.2 8-Bit Subtraction

Subtract two 8-bit values:  $(r0:r1) = (r2:r3) - (r4:r5)$

```

1 SUB r1, r3, r5      # Subtract low nibbles: r1 = r3 - r5
2                                     # Sets C if borrow occurred
3 SBB r0, r2, r4      # Subtract high nibbles with borrow:
4                                     # r0 = r2 - r4 - C
5 # Result: r0:r1 = (r2:r3) - (r4:r5)

```

**Example:**

```

1 r2:r3 = 0x84 (132 decimal)
2 r4:r5 = 0x37 (55 decimal)
3
4 Step 1: r1 = r3 - r5 = 0x4 - 0x7 = -3 $\rightarrow$ r1 = 0xD (13), C = 1 (borrow)
5 Step 2: r0 = r2 - r4 - C = 0x8 - 0x3 - 1 = 0x4
6 Result: r0:r1 = 0x4D (77 decimal) $\checkmark$

```

### 7.5.3 16-Bit Operations

For 16-bit (4-nibble) values, chain four registers:

```

1 # 16-bit addition: (r0:r1:r2:r3) = (r4:r5:r6:r7) + (r8:r9:r10:r11)
2
3 ADD r3, r7, r11     # Nibble 0 (LSB)
4 ADC r2, r6, r10    # Nibble 1
5 ADC r1, r5, r9     # Nibble 2
6 ADC r0, r4, r8     # Nibble 3 (MSB)
7 # Result: r0:r1:r2:r3 contains 16-bit sum

```

## 7.6 Stack Management

### 7.6.1 Stack Organization

The stack grows downward from high memory toward low memory:

```

1 High Memory (0xFF)
2 $\uparrow
3 | Stack grows downward
4 | (SP decrements on push)
5 $\downarrow
6 Low Memory (0x00)
```

#### Convention:

- Stack pointer ( $sp = r14:r15$ ) points to **next free location**
- Push: Decrement SP, then store
- Pop: Load, then increment SP

### 7.6.2 Push Operation

Push register  $r1$  onto stack:

```

1 # Decrement stack pointer (r14:r15)
2 ADDI r15, r15, -1      # Decrement low nibble
3 BCC no_borrow          # If no borrow, skip high nibble decrement
4 ADDI r14, r14, -1      # Decrement high nibble due to borrow
5 no_borrow:
6
7 # Store value
8 SW r1, 0(r14)          # mem[sp] = r1
```

#### Optimized (assuming no overflow):

```

1 ADDI r15, r15, -1      # SP--
2 SW r1, 0(r14)          # Push r1
3 # Note: This assumes SP low nibble doesn't underflow
4 # Full implementation needs borrow handling
```

### 7.6.3 Pop Operation

Pop from stack into register  $r1$ :

```

1 # Load value
2 LW r1, 0(r14)          # r1 = mem[sp]
3
4 # Increment stack pointer
5 ADDI r15, r15, 1        # Increment low nibble
6 BCC no_carry            # If no carry, done
7 ADDI r14, r14, 1        # Increment high nibble due to carry
8 no_carry:
```

### 7.6.4 Stack Frame Example

Typical function stack frame layout:

```

1 High addresses
2   [local variable n]    $\leftarrow$ FP + n
3   [local variable 2]    $\leftarrow$ FP + 2
4   [local variable 1]    $\leftarrow$ FP + 1
5   [saved r12]          $\leftarrow$ FP + 0 (old frame pointer)
6   [return address]     $\leftarrow$ FP - 1
7   [saved registers]   $\leftarrow$ FP - 2, FP - 3, ...
8   [free space]         $\leftarrow$ SP (current stack pointer)
9 Low addresses

```

## 7.7 Addressing Limitations and Workarounds

### 7.7.1 Limited Offset Range

Load/store offsets are only 4 bits signed (-8 to +7). To access beyond this range:

**Problem:** Access memory at base + 20

**Solution:** Adjust base pointer

```

1 # Want to access: base_ptr + 20 (0x14)
2 # But offset limited to -8 to +7
3
4 # Option 1: Adjust pointer temporarily
5 ADDI r11, r11, 0x4 # Add 0x14 to low nibble (will overflow)
6 ADC r10, r10, 0x1 # Add 0x1 to high nibble + carry
7 LW r1, 0(r10)      # Load from adjusted address
8
9 # Restore pointer if needed
10 ADDI r11, r11, -0x4
11 SBB r10, r10, 0x1
12
13 # Option 2: Use separate pointer
14 ORI r8, r0, <high> # Calculate new base
15 ORI r9, r0, <low>   # = original + 20
16 LW r1, 0(r8)        # Load from offset address

```

### 7.7.2 Pointer Arithmetic Complexity

Multi-nibble pointer arithmetic requires careful carry handling:

```

1 # Add 0x2F to pointer r14:r15
2
3 # Naive (WRONG - ignores carry):
4 ADDI r15, r15, 0xF # Low nibble
5 ADDI r14, r14, 0x2 # High nibble (WRONG - missed carry)
6
7 # Correct (handles carry):
8 ADDI r15, r15, 0xF # Low nibble, may set carry
9 ADC r14, r14, 0x2 # High nibble + any carry from low

```

### 7.7.3 r0:r1 Pair Limitation

Since r0 is hardwired to zero, pair r0:r1 can only address 0x00-0x0F:

```

1 # r0:r1 can only form addresses 0x00 through 0x0F
2 ORI r1, r0, 0x7      # r1 = 0x7
3 LW   r2, 0(r0)       # Loads from 0x07 (since r0 = 0)
4
5 # Cannot use r0:r1 for addresses >= 0x10
6 # Must use a different register pair

```

**Recommendation:** Avoid using r0 as base register for load/store instructions.

## 7.8 Design Rationale

### 7.8.1 Why Register Pairs?

**Alternatives Considered:**

1. **Dedicated address registers:** Add special 8-bit address registers
  - CONS: Increases ISA complexity, requires new instructions
  - CONS: Breaks load/store architecture simplicity
2. **Implicit addressing:** Use accumulator-style addressing
  - CONS: Not RISC (accumulator architecture)
  - CONS: Limits parallelism and compiler optimization
3. **Register pairs (chosen):**
  - PROS: Uses existing general-purpose registers
  - PROS: Maintains RISC load/store simplicity
  - PROS: Enables multi-precision arithmetic with same mechanism
  - CONS: Requires even-numbered base registers

### 7.8.2 Why Even-Numbered Base Registers?

**Hardware Simplification:**

- Base register [7:4] determines pair: (base, base+1)
- If odd base allowed: Would require (base, base+1) OR (base-1, base)
- Alignment simplifies decode logic and register file access

**Software Impact:**

- 8 usable pairs (r0:r1 through r14:r15)
- Sufficient for: SP, FP, and 6 general-purpose pointers
- Compiler must track register alignment

# Chapter 8

## Calling Convention

### 8.1 Overview

A calling convention defines the rules for how functions call each other, pass parameters, preserve state, and return values. RISC-4's calling convention balances the constraints of a 4-bit architecture with the need for practical, efficient function calls.

#### Key Challenges:

- Limited register count (16 registers, but r0 is zero)
- 4-bit values limit parameter and return value sizes
- 8-bit stack pointer requires multi-instruction manipulation
- No dedicated link register or call instruction (must use software convention)

### 8.2 Register Usage Convention

### 8.2.1 Register Roles

| Register | Name     | Role                       | Preserved?                    |
|----------|----------|----------------------------|-------------------------------|
| r0       | zero     | Constant zero              | N/A (always zero)             |
| r1       | ra       | Return address             | Caller saves                  |
| r2       | ra_hi    | Return address HIGH        | Caller Saves (if nested call) |
| r3       | ra_mid   | Return address MID         | Caller Saves (if nested call) |
| r4       | ra_low   | Return address LOW 2       | Caller Saves (if nested call) |
| r5       | a3       | Argument 3                 | No                            |
| r6       | v0       | Return value (alternate)   | No                            |
| r7       | t0       | Temporary                  | No (caller-saved)             |
| r8       | t1       | Temporary                  | No (caller-saved)             |
| r9       | t2       | Temporary                  | No (caller-saved)             |
| r10      | s0       | Saved register             | Yes (callee-saved)            |
| r11      | s1       | Saved register             | Yes (callee-saved)            |
| r12      | s2/fp_hi | Saved / Frame pointer high | Yes (callee-saved)            |
| r13      | s3/fp_lo | Saved / Frame pointer low  | Yes (callee-saved)            |
| r14      | sp_hi    | Stack pointer high         | Yes (callee-saved)            |
| r15      | sp_lo    | Stack pointer low          | Yes (callee-saved)            |

Table 8.1: Register Calling Convention

### 8.2.2 Return Address Handling

(For leaf functions)(don't call other functions): The return address in r1:r2:r3 is automatically preserved. Simply end with JR r1  
 (For non-leaf functions)(call other functions): Save r1:r2:r3 to the stack before making calls, restore before returning.

### 8.2.3 Caller-Saved vs Callee-Saved

#### Caller-Saved Registers (r1-r9):

- Function may freely modify these
- Caller must save before calling if values are needed afterward
- Used for: arguments, return values, temporary computations

#### Callee-Saved Registers (r10-r15):

- Function must preserve these if used
- Callee must save on entry and restore on exit
- Used for: local variables that must survive across function calls

## 8.3 Parameter Passing

### 8.3.1 Basic Parameter Passing (1-4 nibbles)

Parameters are passed in registers r2-r5:

| Register | Parameter                                             |
|----------|-------------------------------------------------------|
| r4 (a0)  | First argument (or bits [3:0] of multi-nibble value)  |
| r5 (a1)  | Second argument (or bits [7:4] of multi-nibble value) |
| r6 (a2)  | Third argument                                        |
| r7 (a3)  | Fourth argument                                       |

#### Example: Single 4-bit parameter

```

1 # Call: result = add_three(5, 3, 2)
2
3 ORI r4, r0, 0x5      # Set argument: a0 = 5
4 ORI r5, r0, <ret_addr_high>  # Load return address
5 ORI r6, r0, <ret_addr_low>   # (simplified; real implementation uses
                                computed address)
6 JAL add_three        # Call function
7
8 add_three:
9     ADD r4, r4, r5 # a0 = a0 + a1
10    ADD r4, r4, r6 # a0 = a0+a2
11    JR r1           # Return
12 # Return here

```

### 8.3.2 Multi-Nibble Parameters

For 8-bit (2-nibble) parameters, use register pairs:

```

1 # Call: result = multiply_by_two(value)
2 # Input: r2:r3 = value (8-bit)
3 # Output: r2:r3 = result (8-bit)
4
5 ORI r2, r0, 0x1      # High nibble = 0x1
6 ORI r3, r0, 0x7      # Low nibble = 0x7
7                         # r2:r3 = 0x17 (23 decimal)
8 # Call multiply_by_two
9 # r2:r3 = 0x2E (46 decimal) on return

```

### 8.3.3 More Than 4 Parameters

When more than 4 nibble parameters are needed, use the stack:

```

1 # Call: result = sum_six(a, b, c, d, e, f)
2 # First 4 in registers: r2=a, r3=b, r4=c, r5=d
3 # Remaining 2 on stack: e, f
4
5 # Caller:
6 ADDI r15, r15, -2      # Allocate stack space for 2 params
7 BCC no_carry1
8 ADDI r14, r14, -1      # Handle carry
9 no_carry1:
10
11 SW   r6, 0(r14)        # Push e (5th parameter)
12 SW   r7, 1(r14)        # Push f (6th parameter)
13
14 ORI  r2, r0, <a>      # Load first 4 params
15 ORI  r3, r0, <b>
16 ORI  r4, r0, <c>
17 ORI  r5, r0, <d>
18
19 # Call function
20 # ...
21
22 # Callee (sum_six):
23 sum_six:
24     # Parameters: r2, r3, r4, r5, [sp+0], [sp+1]
25     ADD  r2, r2, r3      # a + b
26     ADD  r2, r2, r4      # + c
27     ADD  r2, r2, r5      # + d
28     LW   r7, 0(r14)      # Load e from stack
29     ADD  r2, r2, r7      # + e
30     LW   r7, 1(r14)      # Load f from stack
31     ADD  r2, r2, r7      # + f
32     # Result in r2
33     # Return...

```

## 8.4 Return Values

### 8.4.1 Single Nibble Return

Return values are placed in r2 (a0):

```

1 # Function: returns 4-bit value
2 get_status:
3     # ... computation ...
4     ORI r2, r0, 0x5      # Return value = 5
5     # Jump to return address (in r1)
6     # (Pseudo-instruction: RET would be J r1, but RISC-4 v0.1 lacks
      indirect jump)

```

**Note:** RISC-4 v0.1 lacks indirect jump, so return requires storing r1 in a known location or using a software convention.

### 8.4.2 Multi-Nibble Return

For 8-bit return values, use r2:r3:

```

1 # Function: returns 8-bit value
2 compute_large:
3     # ... computation ...
4     ORI r2, r0, 0xA      # High nibble = 0xA
5     ORI r3, r0, 0x7      # Low nibble = 0x7
6             # Return value = 0xA7 (167 decimal)
7     # Return to caller

```

### 8.4.3 Returning Multiple Values

For multiple return values (unusual but possible):

```

1 # Function: returns quotient and remainder
2 # Output: r2 = quotient, r3 = remainder
3 divide:
4     # ... division logic ...
5     ORI r2, r0, 0x5      # quotient = 5
6     ORI r3, r0, 0x2      # remainder = 2
7     # Return

```

## 8.5 Stack Frame Structure

### 8.5.1 Stack Frame Layout

A typical stack frame contains:

```

1 High addresses
2     [Caller's frame]
3     [.....]
4     [Argument n]          $\leftarrow$ FP + n (arguments beyond 4)
5     [Argument 5]          $\leftarrow$ FP + 5
6     [Return address]       $\leftarrow$ FP + 4
7     [Saved FP high]       $\leftarrow$ FP + 3
8     [Saved FP low]        $\leftarrow$ FP + 2
9     [Saved r10]           $\leftarrow$ FP + 1
10    [Saved r11]           $\leftarrow$ FP + 0 (FP points here)
11    [Local var 0]          $\leftarrow$ FP - 1
12    [Local var 1]          $\leftarrow$ FP - 2
13    [Temp storage]         $\leftarrow$ SP (current stack pointer)
14 Low addresses

```

### 8.5.2 Frame Pointer (FP)

The frame pointer (r12:r13) provides a stable reference point for accessing:

- Local variables (negative offsets from FP)
- Saved registers (small positive offsets from FP)
- Stack parameters (larger positive offsets from FP)

Without FP, all references would be relative to SP, which changes during the function.

## 8.6 Function Prologue and Epilogue

### 8.6.1 Complete Function Prologue

The prologue sets up the stack frame:

```

1  function_name:
2      # Step 1: Allocate stack space
3      # Allocate N nibbles (e.g., 6 nibbles for saved regs + locals)
4      ADDI r15, r15, -6      # Decrement SP low nibble
5      BCC no_borrow1        # Check for borrow
6      ADDI r14, r14, -1      # Decrement SP high nibble
7  no_borrow1:
8
9      # Step 2: Save callee-saved registers
10     SW    r10, 0(r14)      # Save r10 at [sp+0]
11     SW    r11, 1(r14)      # Save r11 at [sp+1]
12
13     # Step 3: Save old frame pointer
14     SW    r12, 2(r14)      # Save FP high at [sp+2]
15     SW    r13, 3(r14)      # Save FP low at [sp+3]
16
17     # Step 4: Save return address
18     SW    r1, 4(r14)       # Save RA at [sp+4]
19
20     # Step 5: Set new frame pointer
21     # FP = SP + 2 (points to saved r11)
22     ADD   r13, r15, r0      # FP_lo = SP_lo
23     ADD   r12, r14, r0      # FP_hi = SP_hi
24     ADDI r13, r13, 2        # FP += 2
25     BCC  no_carry1
26     ADDI r12, r12, 1        # Handle carry
27  no_carry1:
28
29      # Function body starts here
30      # Local variables at [FP-1], [FP-2], etc.

```

### 8.6.2 Complete Function Epilogue

The epilogue tears down the stack frame and returns:

```

1   # Function body ends here
2
3   # Step 1: Restore callee-saved registers
4   LW    r10, 0(r14)      # Restore r10
5   LW    r11, 1(r14)      # Restore r11
6
7   # Step 2: Restore old frame pointer
8   LW    r12, 2(r14)      # Restore FP high
9   LW    r13, 3(r14)      # Restore FP low
10
11  # Step 3: Load return address
12  LW    r1, 4(r14)       # Load RA into r1
13
14  # Step 4: Deallocate stack space
15  ADDI r15, r15, 6       # Increment SP low nibble by 6
16  BCC no_carry2          # Check for carry
17  ADDI r14, r14, 1       # Increment SP high nibble
18 no_carry2:
19
20  # Step 5: Return (pseudo-instruction)
21  # RISC-4 v0.1 lacks indirect jump, so this is simplified
22  # In real implementation, r1 would contain return address
23  # and we'd need a software-managed return mechanism
24  # RET  (pseudo: JR r1, not supported in v0.1)

```

## 8.7 Complete Function Example

### 8.7.1 Leaf Function (No Calls)

A leaf function doesn't call other functions, so it doesn't need to save RA:

```

1  # Function: add_three(a, b, c) -> a + b + c
2  # Input: r4 = a, r5 = b, r6 = c
3  # Output: r4 = result
4
5 add_three:
6   # No prologue needed (leaf function, no locals)
7
8   # Function body
9   ADD r4, r5, r6           # r4 = a + b
10  ADD r4, r5, r6           # r4 = a + b + c
11
12  # No epilogue needed
13  JR r1                   # return to r1 address

```

### 8.7.2 Non-Leaf Function (Makes Calls)

A function that calls other functions must save RA:

```

1 # Function: add_and_double(x, y) -> (x + y) * 2
2 # Calls another function: double(n)
3
4 add_and_double:
5     # Prologue: Save return address
6     ADDI r15, r15, -3    # Allocate stack space
7     BCC no_borrow_p
8     ADDI r14, r14, -1
9 no_borrow_p:
10    SW r1, 0(r14)        # Save ra_hi
11    SW r2, 1(r14)        # Save ra_mid
12    SW r3, 2(r14)        # Save ra_low
13
14    # Function body
15    ADD r4, r4, r5      # x = x + y
16    JAL double          # Call double(x+y), overwrites r1:r2:r3
17    # Result in r4
18
19    # Epilogue: Restore return address
20    LW r1, 0(r14)        # Restore ra_hi
21    LW r2, 1(r14)        # Restore ra_mid
22    LW r3, 2(r14)        # Restore ra_low
23    ADDI r15, r15, 3     # Dealllocate stack
24    BCC no_carry_e
25    ADDI r14, r14, 1
26 no_carry_e:
27
28    JR r1                 # Return

```

### 8.7.3 Recursive Function

```

1 # Function: factorial(n) -> n!
2 # Input: r2 = n
3 # Output: r2 = n!
4
5 factorial:
6     # Prologue
7     ADDI r15, r15, -4      # Allocate space
8     BCC no_borrow_fp
9     ADDI r14, r14, -1
10    no_borrow_fp:
11
12    SW   r1, 0(r14)        # Save RA
13    SW   r10, 1(r14)       # Save r10 (for n)
14    SW   r2, 2(r14)        # Save n
15
16    # Base case: if (n <= 1) return 1
17    SLTI r7, r2, 2         # r7 = (n < 2) ? 1 : 0
18    BEQ  base_case        # If r7 == 0 (n >= 2), skip to recursive
19
20    # Base case: return 1
21    ORI  r2, r0, 1         # result = 1
22    J    factorial_return
23
24 base_case:
25     # Recursive case: return n * factorial(n-1)
26     ADD  r10, r2, r0       # Save n in r10
27
28     ADDI r2, r2, -1        # r2 = n - 1
29     # Recursive call to factorial(r2)
30     # (Pseudo-call, would need real call mechanism)
31
32     # After return, r2 = factorial(n-1)
33     # Multiply by n (in r10)
34     # For 4-bit, multiplication is complex, simplified here:
35     # result = n * factorial(n-1)
36     # (Assume multiply routine exists)
37
38 factorial_return:
39     # Epilogue
40     LW   r10, 1(r14)       # Restore r10
41     LW   r1, 0(r14)        # Restore RA
42
43     ADDI r15, r15, 4        # Deallocate
44     BCC no_carry_fe
45     ADDI r14, r14, 1
46    no_carry_fe:
47
48    RET

```

## 8.8 Advanced Topics

### 8.8.1 Accessing Stack Parameters

When more than 4 parameters are passed:

```

1 # Function: sum_six(a, b, c, d, e, f)
2 # Params: r2=a, r3=b, r4=c, r5=d, stack[sp+4]=e, stack[sp+5]=f
3
4 sum_six:
5     # After prologue, stack layout:
6     # [sp+0] = saved registers
7     # [sp+1] = saved FP
8     # [sp+2] = saved FP
9     # [sp+3] = saved RA
10    # [sp+4] = caller's stack param: e
11    # [sp+5] = caller's stack param: f
12
13    # To access e and f, need to account for our frame
14    # FP points to saved r11, so:
15    # [FP+3] = saved RA
16    # [FP+4] = parameter e
17    # [FP+5] = parameter f
18
19    ADD r2, r2, r3      # a + b
20    ADD r2, r2, r4      # + c
21    ADD r2, r2, r5      # + d
22    LW  r7, 4(r12)      # Load e from [FP+4]
23    ADD r2, r2, r7      # + e
24    LW  r7, 5(r12)      # Load f from [FP+5]
25    ADD r2, r2, r7      # + f
26    # Result in r2

```

### 8.8.2 Variable-Length Argument Lists

For functions like printf with variable arguments:

```

1  # Convention: First parameter is count, rest are values
2  # varargs(count, arg0, arg1, ...)
3
4  varargs:
5      # r2 = count
6      # r3 = arg0
7      # r4 = arg1
8      # r5 = arg2
9      # [FP+4] = arg3
10     # [FP+5] = arg4
11     # ...
12
13     # Loop through stack parameters if count > 4
14     SLTI r7, r2, 5          # if (count < 5), no stack params
15     BNE  no_stack
16
17     # Process stack parameters
18     # Calculate number of stack params: count - 4
19     ADDI r6, r2, -4        # r6 = count - 4
20     ORI   r8, r0, 4        # r8 = offset (starting at FP+4)
21
22 loop:
23     LW    r9, 0(r12)       # Load param at [FP + r8]
24                     # (Simplified: would need to add r8 to r12:r13)
25     # Process r9...
26
27     ADDI r8, r8, 1         # offset++
28     ADDI r6, r6, -1        # remaining--
29     BNE  loop              # Continue if more params
30
31 no_stack:
32     # Done processing

```

### 8.8.3 Nested Function Calls

Each function call adds a new frame:

```

1 # Call chain: main -> func_a -> func_b
2
3 main:
4     # main's frame
5     # SP = 0xF0, FP = 0xF0
6     # Call func_a
7     # ...
8
9 func_a:
10    # Prologue creates frame
11    # SP = 0xE8, FP = 0xE8
12    # Saved registers, RA point to main
13
14    # Call func_b
15    # ...
16
17 func_b:
18    # Prologue creates frame
19    # SP = 0xE0, FP = 0xE0
20    # Saved registers, RA point to func_a
21
22    # Work...
23
24    # Epilogue restores func_a's frame
25    # SP back to 0xE8
26    # Return to func_a
27
28    # func_a epilogue restores main's frame
29    # SP back to 0xF0
30    # Return to main

```

## 8.9 Limitations and Workarounds

### 8.9.1 No Indirect Jump (v0.1)

RISC-4 v0.1 lacks JALR (jump and link register), making returns difficult.

#### Workaround Options:

##### Option 1: Fixed Return Addresses

```

1 # Caller knows where it will return to
2 call_site_1:
3     ORI r1, r0, <ret1_high>
4     ORI r1, r0, <ret1_low> # r1 = address of ret1
5     J function
6
7 ret1:
8     # Returned here
9
10 function:
11     # Do work...
12     # Jump to address in r1 (but can't: no JALR)
13     # Instead, jump to fixed return trampoline

```

##### Option 2: Return Trampoline

```

1 # Shared return mechanism at fixed address
2 return_trampoline:
3     # Load RA from stack
4     LW r1, -1(r14)      # Assuming RA is at [SP-1]
5     # Dispatch based on r1 value
6     # (Requires jump table or series of comparisons)

```

**Recommendation:** For v1.0, add JALR rd, rs instruction:

- $rd \leftarrow PC + 1$  (save return address)
- $PC \leftarrow rs$  (jump to register value)

### 8.9.2 Limited Parameter Count

Only 4 register parameters forces stack usage for larger signatures.

**Workaround:** Pass pointer to structure:

```

1 # Instead of: func(a, b, c, d, e, f, g, h)
2 # Use: func(params_ptr)
3
4 # Caller:
5 # Build parameter structure at 0x80:
6 ORI r10, r0, 0x8
7 ORI r11, r0, 0x0      # r10:r11 = 0x80
8
9 SW  r2, 0(r10)        # params[0] = a
10 SW  r3, 1(r10)        # params[1] = b
11 # ... etc
12
13 # Pass pointer
14 ADD r2, r10, r0       # Only need low byte if < 16 params
15 J   func
16
17 # Callee:
18 func:
19     # r2 = params pointer (low byte)
20     # Reconstruct full pointer (assuming high byte is 0x8)
21     ORI r10, r0, 0x8
22     ADD r11, r2, r0      # r10:r11 = params pointer
23
24 LW  r3, 0(r10)        # Load params[0]
25 LW  r4, 1(r10)        # Load params[1]
26 # ...

```

## 8.10 Performance Considerations

### 8.10.1 Call Overhead

Typical function call overhead:

| Operation                            | Cycles              |
|--------------------------------------|---------------------|
| Allocate stack (with carry handling) | 2-4                 |
| Save registers (4 registers)         | 4                   |
| Save FP (2 nibbles)                  | 2                   |
| Save RA                              | 1                   |
| Update FP (with carry handling)      | 2-4                 |
| <b>Total Prologue</b>                | <b>11-15 cycles</b> |
| Restore registers (4 registers)      | 4                   |
| Restore FP                           | 2                   |
| Restore RA                           | 1                   |
| Deallocate stack                     | 2-4                 |
| Return (future JALR)                 | 1                   |
| <b>Total Epilogue</b>                | <b>10-12 cycles</b> |
| <b>Total Call Overhead</b>           | <b>21-27 cycles</b> |

### 8.10.2 Optimization Strategies

#### 1. Leaf Function Optimization

- Don't save RA if function doesn't call others
- Don't save callee-saved registers if not used
- Reduces overhead to 0 cycles for trivial functions

#### 2. Inline Small Functions

- If function body < 10 instructions, consider inlining
- Eliminates call overhead entirely
- Trade-off: code size increases

#### 3. Tail Call Optimization

```

1 # Instead of:
2 func_a:
3     # Work...
4     # Call func_b
5     # Epilogue
6     # Return
7
8 # Use tail call:
9 func_a:
10    # Work...
11    # Setup args for func_b
12    # Epilogue (restore our frame)
13    # Jump directly to func_b (not call)
14    J func_b      # func_b returns directly to our caller

```

#### 4. Register Allocation

- Prefer caller-saved registers (r7-r9) for temporaries
- Only use callee-saved registers (r10-r11) when values must survive calls
- Reduces save/restore overhead

## 8.11 Future Enhancements

### 8.11.1 Proposed v1.0 Additions

#### 1. JALR Instruction

```
1 JALR rd, rs      # rd <- PC+1, PC <- rs
```

Enables true function calls with return addresses.

#### 2. Hardware Stack Support

- Dedicated SP register (auto-increment/decrement)
- PUSH/POP instructions
- Simplifies stack management

#### 3. Link Register Convention

- Use r1 as dedicated link register
- JAL instruction:  $r1 \leftarrow PC+1$ ,  $PC \leftarrow \text{target}$
- RET instruction:  $PC \leftarrow r1$

## 8.12 Design Rationale

### 8.12.1 Why This Convention?

#### Register Allocation:

- 4 argument registers matches typical function signatures
- 3 temporaries (t0-t2) sufficient for most expressions
- 2 saved registers (s0-s1) for critical locals
- Dedicated FP and SP essential for stack management

#### Stack Frame Design:

- FP provides stable reference despite SP changes
- Saved RA at predictable offset enables debuggers
- Compatible with C calling conventions (adapted)

#### Caller vs Callee Saved:

- Caller-saved for values not needed after call (avoids unnecessary saves)
- Callee-saved for long-lived locals (saves once, use many times)
- Balance minimizes total save/restore operations

# Appendix A

# Assembly Language Syntax

## A.1 Instruction Format

RISC-4 assembly follows a simple, consistent syntax:

```
1 [label:] mnemonic [operands] [# comment]
```

### Components:

- **label:** Optional identifier followed by colon
- **mnemonic:** Instruction name (ADD, SUB, LW, etc.)
- **operands:** Comma-separated registers, immediates, or addresses
- **comment:** Optional text following # character

## A.2 Operand Types

### A.2.1 Registers

Registers are specified by name or number:

```
1 # By number:
2 ADD r1, r2, r3
3
4 # By conventional name:
5 ADD ra, a0, a1      # Equivalent to above
6
7 # Zero register:
8 ADD r4, r0, r5      # r4 = 0 + r5 (copy r5 to r4)
9 ORI r1, zero, 0x5    # r1 = 5 (load immediate)
```

### A.2.2 Immediate Values

Immediates can be decimal, hexadecimal, or binary:

```

1 ADDI r1, r2, 5      # Decimal: 5
2 ADDI r1, r2, 0x5    # Hexadecimal: 5
3 ADDI r1, r2, 0b0101 # Binary: 5
4
5 ADDI r3, r3, -1     # Negative immediate (two's complement)
6 ORI  r4, r0, 0xF    # Hexadecimal: 15

```

### A.2.3 Memory Operands

Load/store instructions use offset(base) syntax:

```

1 LW  r1, 0(r14)      # Load from [r14:r15 + 0]
2 SW  r2, 4(r14)      # Store to [r14:r15 + 4]
3 LW  r3, -1(r12)     # Load from [r12:r13 - 1]
4
5 # Base must be even-numbered register (pair)
6 LW  r1, 0(r14)      # Valid: r14 is even
7 LW  r1, 0(r15)      # INVALID: r15 is odd

```

### A.2.4 Branch Targets

Branches use labels or offsets:

```

1 loop:
2     ADDI r1, r1, 1
3     SLTI r2, r1, 10
4     BNE  loop          # Branch to label
5
6     BEQ  +5            # Branch forward 5 instructions (rare)
7     BCS  -3            # Branch backward 3 instructions (rare)

```

### A.2.5 Jump Targets

Jumps use labels or absolute addresses:

```

1 J start             # Jump to label
2 J 0x100              # Jump to absolute address (rare)
3
4 start:
5     # Program begins here

```

## A.3 Pseudo-Instructions

Assemblers often provide pseudo-instructions that expand to one or more real instructions:

### A.3.1 NOP - No Operation

```

1 NOP                  # Pseudo-instruction
2 # Expands to:
3 ADD r0, r0, r0       # Does nothing (writes to r0 are ignored)

```

### A.3.2 MOV - Move Register

```

1 MOV rd, rs          # Pseudo-instruction: rd = rs
2 # Expands to:
3 ADD rd, rs, r0      # rd = rs + 0
4 # Or:
5 OR rd, rs, r0       # rd = rs | 0

```

### A.3.3 LI - Load Immediate

```

1 LI rd, imm4        # Pseudo-instruction: rd = immediate
2 # Expands to:
3 ORI rd, r0, imm4    # rd = 0 | imm4

```

### A.3.4 NOT - Bitwise NOT

```

1 NOT rd, rs         # Pseudo-instruction: rd = ~rs
2 # Expands to:
3 XOR rd, rs, 0xF     # rd = rs ^ 0b1111 (flip all bits)

```

### A.3.5 CLR - Clear Register

```

1 CLR rd             # Pseudo-instruction: rd = 0
2 # Expands to:
3 XOR rd, rd, rd      # rd = rd ^ rd = 0
4 # Or:
5 AND rd, rd, r0       # rd = rd & 0 = 0

```

### A.3.6 INC/DEC - Increment/Decrement

```

1 INC rd             # Pseudo-instruction: rd++
2 # Expands to:
3 ADDI rd, rd, 1
4
5 DEC rd             # Pseudo-instruction: rd--
6 # Expands to:
7 ADDI rd, rd, -1

```

### A.3.7 RET - Return from Function

```

1 RET                  # Pseudo-instruction: return to caller
2 # In v0.1 (no JALR):
3 # Requires software convention (see Chapter 10)
4
5 # In future v1.0 with JALR:
6 # Expands to:
7 JALR r0, r1          # PC = r1, r0 = PC+1 (r0 write ignored)

```

### A.3.8 PUSH/POP - Stack Operations

```

1 PUSH rs           # Pseudo-instruction: push rs to stack
2 # Expands to:
3 ADDI r15, r15, -1 # SP--
4 SW   rs, 0(r14)   # mem[SP] = rs
5 # Note: Simplified; real implementation needs carry handling
6
7 POP rd           # Pseudo-instruction: pop from stack to rd
8 # Expands to:
9 LW   rd, 0(r14)   # rd = mem[SP]
10 ADDI r15, r15, 1  # SP++
11 # Note: Simplified; real implementation needs carry handling

```

### A.3.9 BLT/BGE/BGT/BLE - Signed Comparison Branches

```

1 BLT rs, rt, label # Pseudo: branch if rs < rt (signed)
2 # Expands to:
3 SLT r7, rs, rt    # r7 = (rs < rt) ? 1 : 0
4 BNE r7, label      # Branch if r7 != 0
5
6 BGE rs, rt, label # Pseudo: branch if rs >= rt (signed)
7 # Expands to:
8 SLT r7, rs, rt    # r7 = (rs >= rt) ? 1 : 0
9 BEQ r7, label      # Branch if r7 == 0
10
11 BGT rs, rt, label # Pseudo: branch if rs > rt (signed)
12 # Expands to:
13 SLT r7, rt, rs    # r7 = (rt < rs) ? 1 : 0
14 BNE r7, label      # Branch if r7 != 0
15
16 BLE rs, rt, label # Pseudo: branch if rs <= rt (signed)
17 # Expands to:
18 SLT r7, rt, rs    # r7 = (rt <= rs) ? 1 : 0
19 BEQ r7, label      # Branch if r7 == 0

```

## A.4 Assembler Directives

Common assembler directives for controlling assembly:

### A.4.1 .org - Set Origin Address

```

1 .org 0x000      # Start assembling at address 0x000
2 start:
3     ORI r14, r0, 0xF
4     ORI r15, r0, 0xF      # Initialize SP = 0xFF

```

### A.4.2 .data - Data Section

```

1   .data          # Switch to data section
2 array:
3     .nibble 0x1, 0x2, 0x3, 0x4  # Define nibble array
4 value:
5     .nibble 0xA      # Single nibble
6
7   .text          # Switch back to code section
8   LW r1, array    # Load first element

```

### A.4.3 .equ - Define Constant

```

1   .equ STACK_TOP, 0xFF
2   .equ MAX_COUNT, 10
3
4   ORI r14, r0, (STACK_TOP >> 4)
5   ORI r15, r0, (STACK_TOP & 0xF)

```

### A.4.4 .align - Alignment

```

1   .align 4          # Align to next 4-nibble boundary
2 function:
3   ADD r1, r2, r3

```

## A.5 Commenting Conventions

### A.5.1 Inline Comments

```

1 ADD r1, r2, r3      # r1 = r2 + r3
2 ADDI r4, r4, 1       # Increment counter

```

### A.5.2 Block Comments

```

1 =====
2 # Function: multiply
3 # Description: Multiply two 4-bit numbers
4 # Input: r2 = multiplicand, r3 = multiplier
5 # Output: r2:r3 = 8-bit product
6 # Clobbers: r4, r5, r6
7 =====
8 multiply:
9   # Implementation...

```

### A.5.3 Section Headers

```

1 #-----
2 # Initialization
3 #-----
4 init:
5     ORI r14, r0, 0xF      # Set up stack pointer
6     ORI r15, r0, 0xF
7
8 #-----
9 # Main Loop
10 #-----
11 main_loop:
12     # Loop body...

```

## A.6 Register Naming Conventions

| Number | Name                                   | Usage                        |
|--------|----------------------------------------|------------------------------|
| r0     | zero                                   | Constant zero                |
| r1     | ra_hi/ra(or full return address alias) |                              |
| r2     | ra_mid                                 | Return address MID           |
| r3     | ra_low                                 | Return address LOW           |
| r4     | a2                                     | Argument 2                   |
| r5     | a3                                     | Argument 3                   |
| r6     | v0                                     | Return value (alternate)     |
| r7     | t0                                     | Temporary 0                  |
| r8     | t1                                     | Temporary 1                  |
| r9     | t2                                     | Temporary 2                  |
| r10    | s0                                     | Saved register 0             |
| r11    | s1                                     | Saved register 1             |
| r12    | s2/fp_hi                               | Saved 2 / Frame pointer high |
| r13    | s3/fp_lo                               | Saved 3 / Frame pointer low  |
| r14    | sp_hi                                  | Stack pointer high           |
| r15    | sp_lo                                  | Stack pointer low            |

Table A.1: Register Name Aliases

## A.7 Example Assembly Program

```

1 #=====
2 # Program: Sum Array
3 # Description: Sum elements of a 4-element array
4 #=====
5
6     .org 0x000
7
8 #-----
9 # Initialization
10 #-----
11 start:

```

```

12      # Initialize stack pointer to 0xFF
13      LI    sp_hi, 0xF
14      LI    sp_lo, 0xF
15
16      # Set up array pointer
17      LI    r10, 0x8          # Array at 0x80
18      LI    r11, 0x0
19
20      # Initialize sum to zero
21      CLR   r2              # r2 = sum
22
23      # Initialize counter
24      LI    r3, 4            # r3 = count
25
26      #-----
27      # Sum Loop
28      #-----
29      sum_loop:
30      LW    r4, 0(r10)       # Load array[i]
31      ADD   r2, r2, r4       # sum += array[i]
32
33      # Increment pointer
34      INC   r11             # ptr++
35
36      # Decrement counter
37      DEC   r3              # count--
38      BNE   sum_loop         # Continue if count != 0
39
40      #-----
41      # Store Result
42      #-----
43      SW    r2, 0(r10)       # Store sum at array[4]
44
45      done:
46      J     done             # Infinite loop
47
48      #-----
49      # Data Section
50      #-----
51      .org 0x80
52      array:
53      .nibble 0x3, 0x5, 0x2, 0x7  # Array elements
54      result:
55      .nibble 0x0              # Result location

```

## Appendix B

# Example Programs

### B.1 Fibonacci Sequence

```
1 # Compute fib(n)
2 # Input: r2 (n)
3 # Output: r6 (result)
4 fib:
5     # Base Case: if n < 2, return n
6     SLTI r7, r2, 2          # r7 = (n < 2) ? 1 : 0
7     ADDI r7, r7, -1         # r7 = (n < 2) ? 0 : -1
8     BEQ r7, return_n       # If r7 == 0 (meaning n < 2), jump
9
10    # Recursive step would go here...
11    # (Omitted for brevity due to complexity of
12    # saving context in 4-bit mode)
13
14 return_n:
15     ADD r6, r2, r0          # return n
16     RET                      # Pseudo for J r1 (if implemented)
```

## Appendix C

## Revision History

| Version | Date       | Changes                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
|---------|------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0.0.1   | 2026-01-20 | <p>Initial draft</p> <ul style="list-style-type: none"><li>- Basic ISA definition</li><li>- 16 instructions documented</li><li>- Register model established</li><li>- Preliminary calling convention</li></ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| 0.1.0   | 2026-01-21 | <p>Major documentation expansion</p> <ul style="list-style-type: none"><li>- Added Chapter 4: Status Flags (comprehensive)</li><li>- Added Chapter 8: Pipeline Architecture (5-stage pipeline)</li><li>- Expanded Chapter 9: Addressing (register pairs, multi-precision)</li><li>- Expanded Chapter 10: Calling Convention (complete prologue/epilogue)</li><li>- Expanded Appendix A: Assembly Syntax (pseudo-instructions, directives)</li><li>- Fixed all instruction flag documentation (C, Z flags)</li><li>- Reorganized instruction order (opcode sequence)</li><li>- Improved document formatting and spacing</li><li>- Added comprehensive examples throughout</li></ul> |

Table C.1: Document Revision History

### C.1 Version 0.0.1 (Initial Draft)

**Date:** January 20, 2026

**Status:** Solidifying

**Major Components:**

- Defined 4-bit RISC architecture with 16 instructions
- Established 5-stage pipeline design (concept)
- Documented register file ( $16 \times 4$ -bit registers)

- Defined four instruction formats (R, I, M, J)
- Basic calling convention outlined
- Fibonacci example program

**Known Limitations:**

- Pipeline documentation incomplete
- Flag behavior not fully specified
- Calling convention needs expansion
- Missing examples for common operations
- No JALR instruction (returns difficult)

## C.2 Version 0.1.0 (Documentation Expansion)

**Date:** January 21, 2026

**Status:** Pre-Silicon

**Major Additions:**

### C.2.1 Chapter 4: Status Flags

- Complete flag register specification (Carry and Zero)
- Flag update rules for all instructions
- Comparison operation patterns (equality, signed, unsigned)
- Pipeline hazard analysis (flag forwarding vs stalls)
- Design rationale (why RISC-4 uses flags)
- Comparison to MIPS, ARM, SPARC flag implementations

### C.2.2 Chapter 8: Pipeline Architecture

- Complete 5-stage pipeline description (IF, ID, EX, MEM, WB)
- Pipeline register definitions
- Data hazard analysis (RAW, load-use, flag hazards)
- Control hazard solutions (branch prediction, delay slots)
- Forwarding logic specifications
- Performance analysis (realistic CPI estimation)
- Implementation considerations

### C.2.3 Chapter 9: Addressing (Expanded)

- Memory organization (Harvard architecture)
- Register pair formation rules
- Pair manipulation techniques (increment, decrement, offset addition)
- Complete addressing mode examples (stack, arrays, structures)
- Multi-precision arithmetic (8-bit, 16-bit operations)
- Stack management (push, pop with carry handling)
- Limitations and workarounds
- Design rationale for register pairing

#### C.2.4 Chapter 10: Calling Convention (Expanded)

- Complete register usage convention
- Parameter passing (1-4 parameters, stack parameters)
- Return value conventions
- Stack frame structure and layout
- Complete function prologue and epilogue
- Example functions (leaf, non-leaf, recursive)
- Advanced topics (variable arguments, nested calls)
- Performance analysis (call overhead)
- Optimization strategies
- Future enhancements (JALR proposal)

#### C.2.5 Appendix A: Assembly Syntax (Expanded)

- Complete operand type documentation
- Pseudo-instruction catalog (NOP, MOV, LI, NOT, etc.)
- Assembler directives (.org, .data, .equ, .align)
- Commenting conventions
- Register naming conventions
- Complete example program

### C.2.6 All Instructions Updated

- Comprehensive flag documentation (C and Z)
- Consistent formatting across all instruction descriptions
- Detailed examples for each instruction
- Flag behavior explicitly stated

#### Bug Fixes:

- Fixed AND instruction: corrected operation to  $rd \leftarrow rs \& rt$
- Fixed ADDI instruction: corrected parameter order in encoding
- Corrected flag naming inconsistencies throughout
- Fixed instruction ordering (SHF moved to correct position)

#### Formatting Improvements:

- Added spacing optimizations (reduced excessive whitespace)
- Implemented raggedbottom (no vertical stretching)
- Added subsection grouping (improved page breaks)
- Standardized table formatting
- Improved code listing presentation

#### Remaining for v1.0:

- Add JALR instruction for proper function returns
- Additional example programs (bubble sort, string operations)
- Hardware implementation guide
- Verilog/VHDL reference implementation
- Assembler specification
- Complete instruction timing diagrams
- Test suite definition

## C.3 Future Versions

### C.3.1 Planned for v1.0 (Silicon)

- JALR instruction (indirect jump with link)
- Hardware verification complete
- Tapeout-ready RTL
- Complete test suite
- Assembler and toolchain
- Reference implementation in Verilog

### C.3.2 Potential v2.0 Enhancements

- Hardware multiply/divide
- Interrupt support
- Memory-mapped I/O specification
- DMA controller
- Cache hierarchy
- Floating-point coprocessor (4-bit mantissa experiments)



## Appendix D

# Revision History

| Version | Date       | Changes                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|---------|------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 0.0.1   | 2026-01-20 | <p>Initial draft</p> <ul style="list-style-type: none"> <li>- Basic ISA definition</li> <li>- 16 instructions documented</li> <li>- Register model established</li> <li>- Preliminary calling convention</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| 0.1.0   | 2026-01-21 | <p>Major documentation expansion</p> <ul style="list-style-type: none"> <li>- Added Chapter 4: Status Flags (comprehensive)</li> <li>- Added Chapter 6: Pipeline Architecture (5-stage pipeline)</li> <li>- Expanded Chapter 7: Addressing (register pairs, multi-precision)</li> <li>- Expanded Chapter 8: Calling Convention (complete prologue/epilogue)</li> <li>- Expanded Appendix A: Assembly Syntax (pseudo-instructions, directives)</li> <li>- Fixed all instruction flag documentation (C, Z flags)</li> <li>- Reorganized instruction order (opcode sequence)</li> <li>- Improved document formatting and spacing</li> <li>- Added comprehensive examples throughout</li> </ul>                                                                                                                                            |
| 0.2.0   | 2026-01-22 | <p>Function call mechanism and ISA refinement</p> <ul style="list-style-type: none"> <li>- Added JAL (Jump And Link) instruction</li> <li>- Added JR (Jump Register) instruction</li> <li>- Implemented 12-bit return address using r1:r2:r3 register triple</li> <li>- Updated J-type format to support JAL variant (bit [11] as link flag)</li> <li>- Extended EXT opcode with JR function (code 0x3)</li> <li>- Revised calling convention (r4-r7 now argument registers)</li> <li>- Updated register file documentation (r1:r2:r3 reserved for returns)</li> <li>- Added dedicated return address path in pipeline</li> <li>- Documented leaf vs non-leaf function prologue requirements</li> <li>- Added compiler guidance for JAL range limitation (11-bit)</li> <li>- Added documentation for floating-point support</li> </ul> |

## D.1 Version 0.0.1 (Initial Draft)

**Date:** January 20, 2026

**Status:** Solidifying

**Major Components:**

- Defined 4-bit RISC architecture with 16 instructions
- Established 5-stage pipeline design (concept)
- Documented register file ( $16 \times 4$ -bit registers)
- Defined four instruction formats (R, I, M, J)
- Basic calling convention outlined
- Fibonacci example program

**Known Limitations:**

- Pipeline documentation incomplete
- Flag behavior not fully specified
- Calling convention needs expansion
- Missing examples for common operations
- No JALR instruction (returns difficult)

## D.2 Version 0.1.0 (Documentation Expansion)

**Date:** January 21, 2026

**Status:** Pre-Silicon

**Major Additions:**

### D.2.1 Chapter 4: Status Flags

- Complete flag register specification (Carry and Zero)
- Flag update rules for all instructions
- Comparison operation patterns (equality, signed, unsigned)
- Pipeline hazard analysis (flag forwarding vs stalls)
- Design rationale (why RISC-4 uses flags)
- Comparison to MIPS, ARM, SPARC flag implementations

### D.2.2 Chapter 6: Pipeline Architecture

- Complete 5-stage pipeline description (IF, ID, EX, MEM, WB)
- Pipeline register definitions
- Data hazard analysis (RAW, load-use, flag hazards)
- Control hazard solutions (branch prediction, delay slots)
- Forwarding logic specifications
- Performance analysis (realistic CPI estimation)
- Implementation considerations

### D.2.3 Chapter 7: Addressing (Expanded)

- Memory organization (Harvard architecture)
- Register pair formation rules
- Pair manipulation techniques (increment, decrement, offset addition)
- Complete addressing mode examples (stack, arrays, structures)
- Multi-precision arithmetic (8-bit, 16-bit operations)
- Stack management (push, pop with carry handling)
- Limitations and workarounds
- Design rationale for register pairing

### D.2.4 Chapter 8: Calling Convention (Expanded)

- Complete register usage convention
- Parameter passing (1-4 parameters, stack parameters)
- Return value conventions
- Stack frame structure and layout
- Complete function prologue and epilogue
- Example functions (leaf, non-leaf, recursive)
- Advanced topics (variable arguments, nested calls)
- Performance analysis (call overhead)
- Optimization strategies
- Future enhancements (JALR proposal)

### D.2.5 Appendix A: Assembly Syntax (Expanded)

- Complete operand type documentation
- Pseudo-instruction catalog (NOP, MOV, LI, NOT, etc.)
- Assembler directives (.org, .data, .equ, .align)
- Commenting conventions
- Register naming conventions
- Complete example program

### D.2.6 All Instructions Updated

- Comprehensive flag documentation (C and Z)
- Consistent formatting across all instruction descriptions
- Detailed examples for each instruction
- Flag behavior explicitly stated

#### Bug Fixes:

- Fixed AND instruction: corrected operation to  $rd \leftarrow rs \& rt$
- Fixed ADDI instruction: corrected parameter order in encoding
- Corrected flag naming inconsistencies throughout
- Fixed instruction ordering (SHF moved to correct position)

#### Formatting Improvements:

- Added spacing optimizations (reduced excessive whitespace)
- Implemented raggedbottom (no vertical stretching)
- Added subsection grouping (improved page breaks)
- Standardized table formatting
- Improved code listing presentation

#### Remaining for v1.0:

- Add JALR instruction for proper function returns
- Additional example programs (bubble sort, string operations)
- Hardware implementation guide
- Verilog/VHDL reference implementation
- Assembler specification
- Complete instruction timing diagrams
- Test suite definition

## D.3 Version 0.2.0 (Function Call Mechanism)

**Date:** January 22, 2026

**Status:** Pre-Silicon

**Major Additions:**

### D.3.1 New Instructions

- **JAL (Jump And Link):** Unconditional jump with return address save
  - Encoding: 1111 1ttt tttt tttt (J-type variant, bit [11] = 1)
  - Operation:  $r1 \leftarrow PC[11:8]$ ,  $r2 \leftarrow PC[7:4]$ ,  $r3 \leftarrow PC[3:0]$ ,  $PC \leftarrow \text{target11}$
  - Range: 0x000–0x7FF (2048 nibbles, 11-bit target)
  - Uses register triple (r1:r2:r3) for 12-bit return address storage
- **JR (Jump Register):** Indirect jump to address in r1:r2:r3
  - Encoding: 0111 xxxx xxxx 0011 (EXT function code 0x3)
  - Operation:  $PC \leftarrow \{r1, r2, r3\}$  (concatenate to form 12-bit address)
  - Dedicated hardware path for return address read
  - Single-cycle execution with forwarding support

### D.3.2 Calling Convention Overhaul

- **Return address mechanism:** r1:r2:r3 form 12-bit return address register
  - r1 (ra\_hi): High nibble [11:8]
  - r2 (ra\_mid): Middle nibble [7:4]
  - r3 (ra\_low): Low nibble [3:0]
- **Argument registers shifted:** r4–r7 now used for parameters (was r2–r5)
- **Return value registers:** r4, r6, r7 (was r2, r6)
- **Leaf function optimization:** No prologue/epilogue required for functions that don't call others
- **Non-leaf function convention:** Must save r1:r2:r3 to stack before nested calls

### D.3.3 Pipeline Enhancements

- Dedicated 12-bit return address read path in ID stage
- Forwarding logic for r1, r2, r3 to handle hazards on JR
- JAL executes in ID stage with three simultaneous register writes
- JR hazard handling documented (forwarding from EX/MEM/WB stages)

### D.3.4 Instruction Format Updates

- J-type format split into two variants based on bit [11]:
  - Bit [11] = 0: J (unconditional jump, 12-bit range)
  - Bit [11] = 1: JAL (jump and link, 11-bit range)
- EXT opcode function code allocation updated:
  - 0x0: ADC (existing)
  - 0x1: SBB (existing)
  - 0x2: NEG (existing)
  - 0x3: JR (new)
  - 0x4–0xF: Reserved for future use

### D.3.5 Documentation Additions

- Complete JAL/JR instruction specifications with examples
- Leaf vs non-leaf function prologue/epilogue decision tree
- Nested function call examples with stack management
- Tail call optimization patterns
- Register usage convention table updated
- Pseudo-instruction RET added (expands to JR r1)

### D.3.6 Compiler Guidance

- JAL range limitation documented (11-bit = 2048 nibble max)
- Trampoline strategy for calling functions beyond 0x7FF
- Code layout optimization recommendations
- Hot function placement in low memory (JAL-reachable range)
- Linker responsibilities for trampoline generation

**Breaking Changes from v0.1.0:**

- Calling convention incompatible with v0.1.0 (argument registers changed)
- r1, r2, r3 now reserved for return addresses (not general-purpose)
- Function prologue requirements changed (leaf functions need no prologue)

**Known Limitations:**

- JAL limited to 11-bit range (first 2048 nibbles of instruction memory)
- Functions beyond 0x7FF require trampoline mechanism
- Return address uses 3 registers (reduces general-purpose register count)
- No PC-relative call instruction (all calls are absolute addressing)

**Remaining for v1.0:**

- Consider JALR with PC-relative addressing ( $\pm 128$  nibble range)
- Investigate full 12-bit JAL encoding options
- Additional example programs demonstrating call/return patterns
- Hardware verification of dedicated return address path
- Compiler implementation with trampoline generation
- Linker specification for code layout optimization

## D.4 Future Versions

### D.4.1 Planned for v1.0 (Silicon)

- JALR instruction with PC-relative addressing (under consideration)
- Full 12-bit JAL range if encoding space found
- Hardware verification complete
- Tapeout-ready RTL
- Complete test suite
- Assembler and toolchain with trampoline support
- Reference implementation in Verilog

#### D.4.2 Potential v2.0 Enhancements

- Hardware multiply/divide
- Interrupt support with context save/restore
- Memory-mapped I/O specification
- DMA controller
- Cache hierarchy
- Floating-point coprocessor (4-bit mantissa experiments)
- Branch delay slots for zero-cycle branches