



# Lecture 2: ARM Processor Example

# Outline

---

- Design Partitioning
- ARM Processor Example
  - Architecture
  - Microarchitecture
  - Logic Design
  - Circuit Design
  - Physical Design
- Fabrication, Packaging, Testing

# Activity 2

---

- Sketch a stick diagram for a 4-input NOR gate

# Coping with Complexity

---

- How to design System-on-Chip?
  - Many millions (even billions!) of transistors
  - Tens to hundreds of engineers
- Structured Design
- Design Partitioning

# Structured Design

---

- **Hierarchy:** Divide and Conquer
  - Recursively system into modules
- **Regularity**
  - Reuse modules wherever possible
  - Ex: Standard cell library
- **Modularity:** well-formed interfaces
  - Allows modules to be treated as black boxes
- **Locality**
  - Physical and temporal

# Design Partitioning

---

- **Architecture:** User's perspective, what does it do?
    - Instruction set, registers
    - ARM, MIPS, x86, Alpha, PIC, RISC-V, ...
  - **Microarchitecture**
    - Single cycle, multicycle, pipelined, superscalar?
  - **Logic:** how are functional blocks constructed
    - Ripple carry, carry lookahead, carry select adders
  - **Circuit:** how are transistors used
    - Complementary CMOS, pass transistors, domino
  - **Physical:** chip layout
    - Datapaths, memories, random logic
-

# Gajski Y-Chart



# ARM Architecture

---

- Example: subset of ARM processor architecture
  - Drawn from DDCA ARM Ed. (Harris & Harris)
- ARM is a 32-bit architecture with 16 registers
  - Consider 8-bit subset using 8-bit datapath
  - Only implement 8 registers (R0 - R7)
  - 8-bit program counter in R7
- You'll build this processor in the labs
  - Illustrate the key concepts in VLSI design

# Instruction Set

---

- Data Processing
  - ADD, SUB, AND, OR
  - Each accepts register / immediate 2nd sources
- Memory
  - LDR, STR
  - Positive immediate offset
- Branch
  - B

# Instruction Encoding

- 32-bit instruction encoding
  - Requires four cycles to fetch on 8-bit datapath

## Data-processing



## Memory



## Branch



| I = 1 Immediate | I = 0 Register |
|-----------------|----------------|
| 11:8<br>0000    | 11:7<br>00000  |
| 7:0<br>imm8     | 6:5<br>00      |

| cmd  | instr |
|------|-------|
| 0100 | ADD   |
| 0010 | SUB   |
| 0000 | AND   |
| 1100 | OR    |

# Fibonacci (C)

$$f_0 = 1; f_{-1} = -1$$

$$f_n = f_{n-1} + f_{n-2}$$

$$f = 1, 1, 2, 3, 5, 8, 13, \dots$$

```
int fib(void)
{
    int n = 8;           /* compute nth Fibonacci number */
    int f1 = 1, f2 = -1; /* last two Fibonacci numbers */

    while (n != 0) {     /* count down to n = 0 */
        f1 = f1 + f2;
        f2 = f1 - f2;
        n = n - 1;
    }
    return f1;
}
```

# Fibonacci (Assembly)

```
# fib.asm
# Register usage: R0 = 0, R3 = n, R4 = f1, R5 = f2
# Put result in address 255

fib    add R3, R0, #8          # initialize n = 8
       add R4, R0, #1          # initialize f1 = 1
       add R5, R0, #-1         # initialize f2 = -1

loop   subs R3, R0, end       # n = 0?
       beq done                # then done
       add R4, R4, R5          # f1 = f1 + f2
       sub R5, R4, R5          # f2 = f1 – f2
       add R3, R3, #-1         # n = n-1
       b loop                 # repeat until done

done   str R4, [R0, #255]     # store result in adr 255
```

# ARM Microarchitecture

- Multicycle µarchitecture ( Harris DDCA ARM Ed.)



# Multicycle Controller



# Multicycle FSM



# Logic Design

- Start at top level
  - Hierarchically decompose ARM into units



# Hierarchical Design



# HDLs



- Hardware Description Languages
  - Widely used in logic design
  - Verilog and VHDL
- Describe hardware using code
  - Document logic functions
  - Simulate logic before building
  - Synthesize code into gates and layout
    - Requires a library of standard cells

# Verilog Example

```
module fulladder(input a, b, c,
                  output s, cout);
    sum      s1(a, b, c, s);
    carry    c1(a, b, c, cout);
endmodule
```



```
module carry(input a, b, c,
              output cout)
    assign cout = (a&b) | (a&c) | (b&c);
endmodule
```

# Circuit Design

---

- How should logic be implemented?
  - NANDs and NORs vs. ANDs and ORs?
  - Fan-in and fan-out?
  - How wide should transistors be?
- These choices affect speed, area, power
- Logic synthesis makes these choices for you
  - Good enough for many applications
  - Hand-crafted circuits are still better

# Example: Carry Logic

☐ **assign** cout = (a&b) | (a&c) | (b&c) ;



Transistors? Gate Delays?

# Gate-level Netlist

```
module carry(input a, b, c,  
             output cout)  
  
    wire x, y, z;  
  
    and g1(x, a, b);  
    and g2(y, a, c);  
    and g3(z, b, c);  
    or  g4(cout, x, y, z);  
  
endmodule
```



# Transistor-Level Netlist

```
module carry(input a, b, c,
              output cout)

    wire      i1, i2, i3, i4, cn;

    tranif1 n1(i1, 0, a);
    tranif1 n2(i1, 0, b);
    tranif1 n3(cn, i1, c);
    tranif1 n4(i2, 0, b);
    tranif1 n5(cn, i2, a);
    tranif0 p1(i3, 1, a);
    tranif0 p2(i3, 1, b);
    tranif0 p3(cn, i3, c);
    tranif0 p4(i4, 1, b);
    tranif0 p5(cn, i4, a);
    tranif1 n6(cout, 0, cn);
    tranif0 p6(cout, 1, cn);

endmodule
```



# SPICE Netlist

```
.SUBCKT CARRY A B C COUT VDD GND
MN1 I1 A GND GND NMOS W=1U L=0.18U AD=0.3P AS=0.5P
MN2 I1 B GND GND NMOS W=1U L=0.18U AD=0.3P AS=0.5P
MN3 CN C I1 GND NMOS W=1U L=0.18U AD=0.5P AS=0.5P
MN4 I2 B GND GND NMOS W=1U L=0.18U AD=0.15P AS=0.5P
MN5 CN A I2 GND NMOS W=1U L=0.18U AD=0.5P AS=0.15P
MP1 I3 A VDD VDD PMOS W=2U L=0.18U AD=0.6P AS=1P
MP2 I3 B VDD VDD PMOS W=2U L=0.18U AD=0.6P AS=1P
MP3 CN C I3 VDD PMOS W=2U L=0.18U AD=1P AS=1P
MP4 I4 B VDD VDD PMOS W=2U L=0.18U AD=0.3P AS=1P
MP5 CN A I4 VDD PMOS W=2U L=0.18U AD=1P AS=0.3P
MN6 COUT CN GND GND NMOS W=2U L=0.18U AD=1P AS=1P
MP6 COUT CN VDD VDD PMOS W=4U L=0.18U AD=2P AS=2P
CI1 I1 GND 2FF
CI3 I3 GND 3FF
CA A GND 4FF
CB B GND 4FF
CC C GND 2FF
CCN CN GND 4FF
CCOUT COUT GND 2FF
.ENDS
```

# Physical Design

---

- Floorplan
- Standard cells
  - Place & route
- Datapaths
  - Slice planning
- Area estimation

# ARM Floorplan



# ARM Layout



# Standard Cells

- Uniform cell height
- Uniform well height
- M1  $V_{DD}$  and GND rails
- M2 Access to I/Os
- Well / substrate taps
- Exploits regularity



# Synthesized Controller

- Synthesize HDL into gate-level netlist
- Place & Route using standard cell library



# Pitch Matching

- Synthesized controller area is mostly wires
  - Design is smaller if wires run through/over cells
  - Smaller = faster, lower power as well!
- Design snap-together cells for datapaths and arrays
  - Plan wires into cells
  - Connect by abutment
    - Exploits locality
    - Takes lots of effort

|   |   |   |   |   |
|---|---|---|---|---|
| A | A | A | A | B |
| A | A | A | A | B |
| A | A | A | A | B |
| A | A | A | A | B |
| C |   | C |   | D |

# ARM Datapath

- 8-bit datapath built from wordlices
- Zipper at top drives control signals to datapath



# Slice Plans

- Slice plan for bitslice
  - Cell ordering, dimensions, wiring tracks
  - Arrange cells for wiring locality



# Area Estimation

- Need area estimates to make floorplan
  - Compare to another block you already designed
  - Or estimate from transistor counts
  - Budget room for large wiring tracks
  - Your mileage may vary; derate by 2x for class.

**Table 1.10** Typical layout densities

| Element                              | Area                                                                                 |
|--------------------------------------|--------------------------------------------------------------------------------------|
| random logic (2-level metal process) | $1000 - 1500 \lambda^2$ / transistor                                                 |
| datapath                             | $250 - 750 \lambda^2$ / transistor<br>or $6 \text{ WL} + 360 \lambda^2$ / transistor |
| SRAM                                 | $1000 \lambda^2$ / bit                                                               |
| DRAM (in a DRAM process)             | $100 \lambda^2$ / bit                                                                |
| ROM                                  | $100 \lambda^2$ / bit                                                                |

# Design Verification

- ❑ Fabrication is slow & expensive
  - MOSIS 0.6μm: \$1000, 3 months
  - 65 nm: \$3M, 1 month
- ❑ Debugging chips is very hard
  - Limited visibility into operation
- ❑ Prove design is right before building!
  - Logic simulation
  - Ckt. simulation / formal verification
  - Layout vs. schematic comparison
  - Design & electrical rule checks
- ❑ Verification is > 50% of effort on most chips!



# Fabrication & Packaging

- Tapeout final layout
- Fabrication
  - 6, 8, 12" wafers
  - Optimized for throughput, not latency (10 weeks!)
  - Cut into individual dice
- Packaging
  - Bond gold wires from die I/O pads to package



# Testing



- Test that chip operates
  - Design errors
  - Manufacturing errors
- A single dust particle or wafer defect kills a die
  - Yields from 90% to < 10%
  - Depends on die size, maturity of process
  - Test each part before shipping to customer

# Custom vs. Synthesis

## □ 8-bit Implementations



# ARM1 Processor

- ❑ 1<sup>st</sup> generation ARM chip (1985)
- ❑ 24,800 transistors
- ❑ 3-stage pipeline
- ❑ 8 MHz
- ❑ 3  $\mu$ m process
- ❑ 49 mm<sup>2</sup>
- ❑ 82-pin PLCC package
- ❑ Average power = 0.1 W
- ❑ 2-phase nonoverlapping clocks
- ❑ Built for Acorn Computer



© 1985 ARM Ltd.