

## Single-cycle MIPS processor

1 exe / clk  $\xrightarrow{\text{future}}$  more

### 1. state holding elems

Prog. counter (register) holds addr of curr exe inst.

Inst. mem. read-only for prog. inst.

add  $\rightarrow$  32-bit inst.

Data mem. place for data during exe

$1R A^{(32)} \rightarrow RD^{(32)}$

$1W A^{(32)} + WD^{(32)}$



Reg. file (new!) holds 32 mips registers

$2R A1^{(5)} \rightarrow RD1^{(32)}, A2^{(5)} \rightarrow RD2^{(32)}$

$1W A3^{(5)} + WD^{(32)}$

$\hookrightarrow$  2 for R-type inst., sw, beg (2 source reg.)



### 2. build path to exe 1 inst.

$lw \quad b \quad 5 \quad 5 \quad 16$   
op rs rt imm

1. Fetch instr. get out 32b

2. read rs (25:21)

3. add offset to imm

4. compute mem. addr (base + offset)

5. read dat. from mem and write to regfile.

6. go to next instr. (+4b)



### 3. extend datapath for more inst. (what's same/diff?)

E. SW  $\rightarrow$  now write to data mem.  $op \quad b \quad 5 \quad 5 \quad 16$   
read rs, rt simul

E. R-type class (diff: ALU op, same: source reg. reading from)

$op \quad rs \quad rt \quad rd \quad shamt \quad funct$   
 $(I) \quad 6 \quad 5 \quad 5 \quad 5 \quad 16 \quad +4+4n$

E. beg op rs rt imm (offset from fallthru, not curr.) Reuse addr. calc and sgn extend.

#### 4. add controller



#### 1. Main decoder

Opcode 5:0

| Inst    | r | opcode | Reg W | Reg Dst | ALU Src | Branch | Mem W | Mem2Reg | ALUop   | j |
|---------|---|--------|-------|---------|---------|--------|-------|---------|---------|---|
| R-type  |   | 000000 |       |         | 0       | 0      | 0     | 0       | 0 funct | 0 |
| lw      |   | 100011 |       | 0       |         | 0      | 0     | 0       | 00 (+)  | 0 |
| sw      |   | 101011 | 0     | X       |         | 0      |       | X       | 00 (+)  | 0 |
| beg     |   | 000100 | 0     | X       | 0       |        | 0     | X       | 01 (-)  | 0 |
| E.addiu |   | 001000 |       | 0       |         | 0      | 0     | 0       | 00      | 0 |
| E.j     |   | 000010 | 0     | X       | X       | X      | 0     | X       | XX      |   |

#### 2. ALU decoder

Funct 5:0 (if needed)

| ALUOp | funct  | ALUControl   |
|-------|--------|--------------|
| 00    | xxxxxx | 010 Add      |
| 01    | xxxxxx | 110 Subtract |
| 10    | 100001 | 010 Add      |
| 10    | 100011 | 110 Subtract |
| 10    | 100100 | 000 And      |
| 10    | 100101 | 001 Or       |
| 10    | 101010 | 111 Slt      |



$$\frac{\text{sec}}{\text{prog}} = \frac{\text{instr.}}{\text{prog}} \cdot \frac{\text{clk cycles}}{\text{instr.}} \cdot \frac{\text{sec}}{\text{clk cycle}}$$

CPI (1 for us) ↳ clk period (crit. path, ...)

E. longest: lw



Add all delays

Take longest

# 117 Pipeline (not 1-cycle)

S-S is sequential. Pipeline is parallel



Use pipereg to store partial results (ana shift register)

Problem: Write reg wrong stage! Need to go all the way!



Control: same unit. Each pipereg stores ctrl. signals of its stage

Shorthand



Hazards Result of instr. needed before it's produced → Add hazard unit

1. data result needed before available

E. add \$s0, \$s2, \$s3

and \$t0, \$s0, \$s1 → not yet ready! OK if same cycle

2. ctrl next addr not known at fetch E beg.

Sln. 1. Insert nops to delay later instr. Can put useful work there.

: bad abstraction, what if not run on piped?

2. Forwarding (bypassing)

pick value before write back

bypassing register

ALU input can come from:



(00) 1. register file (general)

M→E

(10) 2. memory stage (instr. in E reads a reg. that the instr. in M will write)

W→E

(01) 3. writeback stage (W ~ )



FwdAE

$M \rightarrow E$  if  $(RegWriteM) \text{ and } (rsE \neq 0) \text{ and } (rsE = WriteRegM)$

$W \rightarrow E$  if  $(RegWriteW) \text{ and } (W \sim) \text{ and } (W \sim W)$

FwdBE

01

$rtE \sim$

$rtE \sim$

X if producer is a load

Can't fwd, must stall!

Detect at runtime, insert nop.

Stall tells pipeline and stuff behind to wait!

disable pipe reg and clear E pipe reg → nop

Stall F, Stall D, Flush E

lw?

lw dest reg = either D src. reg?

$\text{lw} = \text{MemToReg } E \text{ and } ((\text{rsD} = \text{rtE}) \text{ or } (\text{rtD} = \text{rtE}))$



Control hazard: branch

1. Early branch resolution → decide in D, before E

New hazard: may need value  $M \rightarrow D$ . No need for  $E \rightarrow D$

Fwd AD  $M \rightarrow D$  if  $(\text{RegWrite } M) \text{ and } (\text{rsD} \neq 0) \text{ and } (\text{rsD} = \text{WriteReg } M)$  (usual)

Fwd BD



C2 resolves branch

instr. fetched in C3.

speculate whether branch is taken (guess +4)

✓: good, already in pipeline

X: flush wrong instr., zero reg. → bubble

| branch | cycle                  | 1                 | 2 | 3 | 4 | 5 |
|--------|------------------------|-------------------|---|---|---|---|
| Stall  | lw \$t0, 4(\$0)        | F   D   E   M   W |   |   |   |   |
|        | add \$t1, \$t1, \$t2   | F   D   E   M     |   |   |   |   |
|        | beq \$t0, \$t1, target | F   D   E         |   |   |   |   |

\$t0, t1 not available. Need stall

Stall F, Stall D, Stall E = ldstall + branchstall (if instr. in F. produce)

? branch in decode

? write to mem

is dest reg = either branch source?

branchstall = Branch D and RegWrite E and  $((\text{WriteReg } E = \text{rsD}) \text{ or } (\text{WriteReg } E = \text{rtD}))$

OR ~

MemToReg M

~ M ~

~ M ~



## Summary 4 types

producer

(M) later

2

4



1. ✓, fwd  $M \rightarrow E$

F D E M W

detect → fwd

2.  $w$

— 11

M-E fwd

### 3. beg

bee Tw

## branchstat()

4. xx

beg.      lw

M-D fwd

beg X /w

branch staff

beg X X

✓ <register file will handle it>

## Performance Analysis

CPI (cycles-per-instr.), ideal = 1, > 1 for bubbles

E. 54% R, 25% load, 10% store, 11% branch (exe dynamic count, loops --- included)

if 40% loads used by nxt. instr.; 25% branches taken

$$\text{load CPI } 0.4 \cdot 2 + 0.6 \cdot 1 = 1.4$$

$$\text{b CPI } 0.25 \cdot 2 + 0.75 \cdot 1 = 1.25$$

$$\Sigma = 0.54 \cdot 1 + 0.25 \cdot 1.4 + 0.1 \cdot 1 + 0.11 \cdot 1.25 = 1.1275!$$

Crit path: longest thru any stage



| Element             | Delay          |
|---------------------|----------------|
| Register clk-to-Q   | $t_{pcq}$      |
| Register setup      | $t_{setup}$    |
| Multiplexer         | $t_{mux}$      |
| ALU                 | $t_{ALU}$      |
| Memory Read         | $t_{memread}$  |
| Register file read  | $t_{RFread}$   |
| Register file setup | $t_{RFsetup}$  |
| Equality            | $t_{eq}$       |
| AND gate            | $t_{AND}$      |
| Memory Write        | $t_{memwrite}$ |
| Register file write | $t_{RFwrite}$  |

$$T_c = \max \left\{ \begin{array}{l} t_{pcq} + t_{memread} + t_{setup} \\ 2(t_{RFread} + t_{mux} + t_{eq} + t_{AND} + t_{ALU} + t_{setup}) \\ t_{pcq} + t_{mux} + t_{memwrite} + t_{setup} \\ 2(t_{pcq} + t_{mux} + t_{RFwrite}) \end{array} \right\}$$

Fetch  
Decode  
Execute  
Memory  
Writeback

$$= 2(150 + 25 + 40 + 15 + 25 + 20) \text{ ps} = 550 \text{ ps}$$

Decode

CPI ↑, delay ↓, product ↓  
chopped 5 ports!