

# Intel Pentium (P5) Microarchitecture

## First Superscalar x86 - Dual-Issue Pipeline (1993)

**Specifications:** 60-200 MHz, 3.1-3.3M transistors (0.8-0.35 µm), 273-pin PGA, 8 KB L1 I-cache + 8 KB L1 D-cache, 64-bit data bus, 32-bit address bus (4 GB physical), Superscalar (2-way), Branch prediction, APIC, FPU integrated, RDTSC instruction

### Dual-Issue Superscalar Pipeline



### Execution Units

- **Integer ALU** (U-pipe)
  - ADD, SUB, AND, OR, XOR
  - DEC, NEG, NOT
  - CMP, TEST
  - 1 cycle latency
- **Integer ALU** (V-pipe)
  - Subset of U-pipe ops
  - Simple ops only
  - No shifts, no multiply
- **Floating Point Unit** (U-pipe only)
  - FMUL: 3 cycles
  - FDIV: 39 cycles (FDIV bug!)
  - Pipelined (3-stage)
  - 64-bit data bus
- **Load/Store Unit**
  - Aligned access: 1 cycle
  - Misaligned: 2 cycles
  - Max 1 memory op per clock
- **Branch Unit**
  - Dynamic branch prediction
  - 256-entry BTB (Branch Target Buffer)
  - 2-bit saturating counter
  - Misprediction penalty: 3-4 cycles
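
The 2-bit saturating counter listed above can be illustrated with a small simulation. This is a minimal sketch of a single counter, not a model of the P5's BTB: the state encoding (0-3), the "predict taken at state 2 or higher" threshold, and the names `TwoBitCounter` and `simulate` are textbook conventions assumed for illustration.

```python
# Sketch of one 2-bit saturating counter (textbook scheme, assumed encoding).

class TwoBitCounter:
    def __init__(self, state=0):
        self.state = state  # 0 = strongly not-taken ... 3 = strongly taken

    def predict(self):
        """Predict taken when the counter is in one of the two upper states."""
        return self.state >= 2

    def update(self, taken):
        """Move one step toward the actual outcome, saturating at 0 and 3."""
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def simulate(outcomes, counter=None):
    """Return the number of correct predictions over a branch history."""
    if counter is None:
        counter = TwoBitCounter()
    correct = 0
    for taken in outcomes:
        if counter.predict() == taken:
            correct += 1
        counter.update(taken)
    return correct


# A loop-closing branch: taken 9 times, then not taken once, repeated 3 times.
history = ([True] * 9 + [False]) * 3
hits = simulate(history)
print(f"{hits}/{len(history)} predicted correctly")
```

After warm-up, the counter mispredicts only the loop exit: a single not-taken outcome drops it from state 3 to state 2, so the next loop entry is still predicted taken.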

### Cache Hierarchy



### Register File



- 8 general-purpose registers (GPRs), 32 bits each

**MMX Registers:** (added in Pentium MMX, 1997)

- MM0-MM7 (64 bits wide)
- Aliased to FPU ST(0-7)
- SIMD operations (Single Instruction, Multiple Data)

**RDTSC Instruction:** (new in Pentium)

- Reads cycle counter
- High-res timing
- Ring 3 accessible

### MMX Instructions (P5 MMX)

#### Data Types:

- 8x 8-bit integers (packed bytes)
- 4x 16-bit integers (packed words)
- 2x 32-bit integers (packed dwords)
- 1x 64-bit integer (quadword)

#### Instructions:

- PADDB/W/D: Packed add
- PSUBB/W/D: Packed subtract
- PMULLW/PMULHW: Packed multiply (low/high half of each 16-bit product)
- PAND/POR/PXOR: Packed logical
- PSLL/PSRL/PSRA: Packed shift
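
To make the packed-add semantics concrete, the sketch below emulates a wraparound lane-wise add on a 64-bit value in plain Python. It only illustrates the arithmetic of PADDB/W/D-style instructions; the helper names (`packed_add`, `pack_words`, `unpack_words`) are hypothetical and no real MMX instruction is executed.

```python
MASK64 = (1 << 64) - 1


def packed_add(a, b, lane_bits):
    """Lane-wise wraparound add of two 64-bit values (PADDB/W/D-style)."""
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 64, lane_bits):
        lane_a = (a >> shift) & lane_mask
        lane_b = (b >> shift) & lane_mask
        # Each lane wraps on overflow; no carry leaks into the next lane.
        result |= ((lane_a + lane_b) & lane_mask) << shift
    return result & MASK64


def pack_words(words):
    """Pack four 16-bit values, lowest lane first, into one 64-bit integer."""
    value = 0
    for i, w in enumerate(words):
        value |= (w & 0xFFFF) << (16 * i)
    return value


def unpack_words(value):
    """Split a 64-bit integer into four 16-bit lanes, lowest lane first."""
    return [(value >> (16 * i)) & 0xFFFF for i in range(4)]


a = pack_words([1, 2, 0xFFFF, 40000])
b = pack_words([10, 20, 1, 30000])
print(unpack_words(packed_add(a, b, 16)))
```

The lane width decides where carries stop: adding `0xFF` and `0x01` gives `0x100` with 16-bit lanes but `0x00` with 8-bit lanes.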

#### Use Cases:

- Graphics (pixel operations)
- Video encoding/decoding
- Audio processing
- Image filtering

### APIC (Advanced Programmable Interrupt Controller)

#### Features:

- Local APIC (on-chip)
  - Per-CPU interrupt handling
  - Inter-Processor Interrupts (IPI)
  - Local timer interrupt
  - Performance monitoring counter interrupt (local vector entry added in the P6 family)
  - Thermal sensor interrupt (local vector entry added in Pentium 4)
  - Registers (MMIO): mapped at 0xFEE00000 by default
- I/O APIC (External)
  - External chip (on motherboard)
  - Routes IRQs to Local APICs
  - Supports up to 24 IRQs
  - Programmable routing
  - Edge or level-triggered
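
The programmable routing above is configured through one 64-bit redirection entry per IRQ input. The sketch below builds such an entry, assuming the field layout of the 82093AA I/O APIC (vector in bits 0-7, polarity in bit 13, trigger mode in bit 15, mask in bit 16, destination in bits 56-63). The function name and parameters are hypothetical, and only fixed delivery with physical destination mode is modeled.

```python
def redirection_entry(vector, dest_apic_id, level_triggered=False,
                      active_low=False, masked=False):
    """Build a 64-bit redirection entry (fixed delivery, physical destination).

    Field positions follow the assumed 82093AA layout described above.
    """
    if not 0x10 <= vector <= 0xFE:
        raise ValueError("vector must be in the range 0x10-0xFE")
    entry = vector                           # bits 0-7: interrupt vector
    # Bits 8-10 (delivery mode) and bit 11 (destination mode) stay 0:
    # fixed delivery to a physical APIC ID.
    entry |= int(active_low) << 13           # bit 13: pin polarity
    entry |= int(level_triggered) << 15      # bit 15: 0 = edge, 1 = level
    entry |= int(masked) << 16               # bit 16: interrupt masked
    entry |= (dest_apic_id & 0xFF) << 56     # bits 56-63: destination field
    return entry


# Edge-triggered, active-high input routed to vector 0x21 on APIC ID 0.
print(hex(redirection_entry(0x21, 0)))
# Level-triggered, active-low input routed to vector 0x30 on APIC ID 1.
print(hex(redirection_entry(0x30, 1, level_triggered=True, active_low=True)))
```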

### Pentium FDIV Bug (1994)

**Description:** A floating-point division bug in early Pentiums (60-100 MHz) caused incorrect results for certain inputs.

**Example:** $4195835.0 / 3145727.0$ should give $1.333820449136241002$, but affected Pentiums returned $1.333739068902037589$.

**Root Cause:** The lookup table used by the FPU's SRT division algorithm was missing 5 of its 1066 entries.

**Impact:** Intel initially downplayed the issue ("affects 1 in 9 billion divisions"). Public outcry forced a full recall.

**Cost:** \$475 million USD replacement program. Major PR disaster for Intel.

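**Detection:** cpuid returning family=5, model=1 or 2 identifies the affected product lines, but only early steppings carry the flaw, so a known-bad division is the reliable test. Linux runs such a test at boot and reports the result as `fdiv_bug` in /proc/cpuinfo.

**Workaround:** Affected systems can fall back to software floating-point emulation (on Linux, the `no387` boot option) or use compilers that emit guarded division sequences.

**Fixed:** Later steppings (C1 and beyond for the 75-100 MHz parts) and all 120 MHz and faster models have the corrected lookup table.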
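
The division example above can be turned into a quick numeric check. The sketch below computes the quotient on the host machine, evaluates the classic residual `x - (x / y) * y`, and measures how far the flawed value quoted above is from the correct one. The names are illustrative, and on any modern CPU this only demonstrates the arithmetic; it does not detect anything.

```python
X, Y = 4195835.0, 3145727.0
FLAWED_QUOTIENT = 1.333739068902037589  # value quoted above for affected chips


def fdiv_residual(x=X, y=Y):
    """x - (x / y) * y: essentially zero when the divider is correct."""
    return x - (x / y) * y


def relative_error(bad, good):
    """Relative distance between a wrong result and the reference result."""
    return abs(bad - good) / abs(good)


quotient = X / Y
print(f"quotient             = {quotient:.15f}")
print(f"residual (this CPU)  = {fdiv_residual()}")
# Plugging the flawed quotient into the same formula gives a residual of
# roughly 256, the figure widely reported for affected chips.
print(f"residual (flawed)    = {X - FLAWED_QUOTIENT * Y:.3f}")
print(f"flawed relative err. = {relative_error(FLAWED_QUOTIENT, quotient):.2e}")
```

The relative error of about 6 parts in 100,000 shows why the flaw mattered: the wrong result is off in the fifth significant digit, far above normal rounding error.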

### System Integration:

- Socket 4 (Pentium 60/66, 273-pin PGA)
- Socket 5 (Pentium 75-200, 320-pin SPGA)
- Socket 7 (Pentium MMX, 321-pin SPGA)
- Chipsets: 430FX (Triton), 430HX, 430VX

### Operating Systems:

- DOS (16-bit real mode)
- Windows 3.x (16-bit protected mode)
- Windows 95/98/NT (32-bit)
- Linux 1.x+ (full 32-bit support)
- MINIX 3 (uses protected mode, paging)

### Compilers:

- GCC: -march=pentium
- MSVC: /G5
- ICC: -mtune=pentium

**IPC (Instructions Per Cycle):**

- Single-issue code: 0.5-0.8 IPC

### Performance Characteristics

#### Latencies:

- Integer ADD/SUB: 1 cycle
- Integer MUL: 10 cycles (non-pipelined)
- FP ADD: 3 cycles (pipelined)
- FP MUL: 3 cycles (pipelined)
- FP DIV: 39 cycles (non-pipelined)
- L1 cache hit: 1 cycle
- L1 cache miss (to memory): 30-50 cycles
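
The hit and miss latencies above combine into an average cost per memory access once a hit rate is chosen. The sketch below uses the simple weighted-average model; the hit rates are hypothetical values picked for illustration, not measurements, and a miss is charged the full 30-50 cycle figure from the list.

```python
L1_HIT_CYCLES = 1
MISS_CYCLES_RANGE = (30, 50)  # from the latency list above


def avg_access_cycles(hit_rate, miss_cycles, hit_cycles=L1_HIT_CYCLES):
    """Weighted average latency of one memory access, in cycles."""
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


for hit_rate in (0.90, 0.95, 0.99):  # hypothetical hit rates
    low = avg_access_cycles(hit_rate, MISS_CYCLES_RANGE[0])
    high = avg_access_cycles(hit_rate, MISS_CYCLES_RANGE[1])
    print(f"hit rate {hit_rate:.0%}: {low:.2f}-{high:.2f} cycles per access")
```

Even at a 95% hit rate the average access costs two to three times the 1-cycle hit latency, which is why cache-friendly data layout mattered on this generation.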

#### Branch Prediction:

- Prediction rate: 80-90% (depends on code)
- Misprediction penalty: 3-4 cycles
- No speculative execution (unlike P6)
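
The prediction rate and penalty above translate into an expected cost per branch. A minimal sketch of that arithmetic follows; the 20% branch fraction used for the CPI estimate is an assumed, illustrative workload figure, and the function names are hypothetical.

```python
def mispredict_cost_per_branch(prediction_rate, penalty_cycles):
    """Expected cycles lost per executed branch."""
    return (1.0 - prediction_rate) * penalty_cycles


def added_cpi(branch_fraction, prediction_rate, penalty_cycles):
    """Extra cycles per instruction caused by mispredicted branches."""
    return branch_fraction * mispredict_cost_per_branch(
        prediction_rate, penalty_cycles)


best = mispredict_cost_per_branch(0.90, 3)   # 90% accuracy, 3-cycle penalty
worst = mispredict_cost_per_branch(0.80, 4)  # 80% accuracy, 4-cycle penalty
print(f"cycles lost per branch: {best:.2f}-{worst:.2f}")

BRANCH_FRACTION = 0.20  # assumption: one instruction in five is a branch
print(f"added CPI: {added_cpi(BRANCH_FRACTION, 0.90, 3):.2f}-"
      f"{added_cpi(BRANCH_FRACTION, 0.80, 4):.2f}")
```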

### Pentium vs 486:

- 2x performance on integer code (dual-issue)
- 3-5x performance on FP code (pipelined FPU)
- Better branch prediction
- Split caches (8 KB instruction + 8 KB data vs 8 KB unified on 486)
- Faster bus (60-66 MHz vs 33 MHz)

### Pentium Variants:

- Pentium 60/66: 0.8 µm, 5V, 3.1M transistors
- Pentium 75-200: 0.6-0.35 µm, 3.3V
- Pentium MMX: 0.35 µm, 2.8V core / 3.3V I/O (split voltage), 4.5M transistors
- Pentium OverDrive: Socket upgrade for 486