

Chapter 4The ProcessorIntroduction

\* CPU performance factors

→ instruction count (ISA & compiler)

→ CPI & cycle time (CPU hardware)

NEUTRAL: SIMD SIMD <— IEEE 803.13 standard for SIMD instructions

\* simplified MIPS instructions (simoset)

memory reference: lw, sw

arithmetic/logical: add, sub, and, or, sll, srl

control transfer: beq, j

datapath

Q: How does an instruction affect parts of the hardware?

→ pipelined

→ non-pipelined

Q: Sequential  
(one after another)

\* MIPS → any instruction boils down to 3 formats.

Instruction Formats

R → op, rs, rt, rd, shamt, funct → arithmetic/logical op

I → op, rs, rt, addn | const immediate → when operands are always registers

J → op, target addn → branch, memory

R  
add \$t0, \$t1, \$t2  
 $\$t0 = \$t1 + \$t2$

I  
lw \$t0, 80(\$t0) → base address, offset immediate

sw \$t0, 80(\$t0) → base, offset immediate, memory

→ save

MIPS has only one type of branching

→ equality

→ bne, beq

generally MIPS operations have destination on the left.  
→ sw is an exception as it moves

lw \$t0, 80(\$t0)  
rs rt rd ns

sw \$t0, 80(\$t0)  
rs rt ns

ori \$t0, \$t1, 10  
rt rs immediate

beq \$t0, \$t1, L10  
rt rs

check equality using subtraction

$Z=1 \rightarrow \text{equal}$   
 $Z=0 \rightarrow \text{not equal}$

④ Is the branch taken?

→ check if  $Z$  is 1 → need to check the instruction

⑤ If the instruction is **beq** &  $Z=1 \rightarrow$  branch is taken

→ other ways can also cause  $Z=1$ , even R-formations.

means → jump → updates PC

↳ shift the immediate 2 bits.

→ Why 4 bit jump?

2 types of memory

→ flat memory → consecutive memory

→ band memory → blocks & breaks

→ loss of bit width → flat memory has to store all the data in one place

→ each location has an address

→ only possible to take 4 bytes from hardware

→ memory

→ hardware limitation



\* gets 4 location values at once.

→ can't get immediates.

→ abstraction

→ 000100 → 1 } cut off 2 bits in 8 bit instruction  
→ 001000 → 2 }

while operating in hardware → shift by 2.

paragraph → must start from start

→ can't enter in the middle.

→ index the paragraphs

→ shift 2 bits to get the address.

J address → 26 bit → shift by 2 → to get actual address.  
↳ paragraph address

## Combinational Elements

Wabba re-

clock rate  $\rightarrow \dots$  GHz

↳ cycle length  $\rightarrow$  nsp

→ hardware delay → 1 ns

~~big problem~~

- 2 types of ckt
  - combinational
  - sequential

→ clock 30  
no clocks in combinational

## necessity of the clock

## Storage Element: Reg File

stone 32 negs. ~~negisten site~~ in MIPS terms

## memory snap

The diagram illustrates a dual port RAM architecture. It features a central rectangular box labeled "32 32-bit registers". Above this box, three control lines are shown: "Write enable" with an arrow pointing to the top edge, "busw" with an arrow pointing to the left edge, and "Clk" with an arrow pointing to the bottom-left corner. Three arrows point from the top of the box down to the labels "RW", "RA", and "RB" respectively. On the right side of the box, two output lines are labeled "bus A" and "bus B".

④ read is a combinational operation → no clock needed (although delay occurs)

~~6~~ can read 28  
write 1 in  
one cycle

put  
on  
wind

Black-white

**RW → write** (read/write) Non-discriminative  
types (read/write)

busA, busB → output

6 new inmate in bat

WMT:  $\text{softmax}(\text{linear}(x))$   $\rightarrow$  always has some value.

write enable → controls write

$\hookrightarrow$  1  $\rightarrow$  registers are valid  $\rightarrow$  exec

→ 1. Register and make account  
→ 2. Logging in and in



refers to add. R → read from 2, write to 1  
of memory → lw, sw → separate → not possible  
to do both together  
→ supports one address at a time

same clock cycle → n/w → concurrency issues.

↳ shared memory access  
→ all cores use same hard disk.

Some Logic Design...



Latches

latch → si bus  
→ no one → contention  
→ power → bus

clk-to-Q → clock trigger to valid output time  
delay

combinational time → longest time of all parts in combinational part. → fastest

F/F → needs setup time of fastest, others

↳ data must be present before clk trigger.

after clk trigger → some delay → then ops.

quantz  
clock skew → one physical clock

cycle time = clk-to-Q + Longest Delay + Setup + Clock Skew

## Processor

Clock

different chips are at different distances from the clock.  
 ↳ different clock delays  
 consider all these

longer resistance

### Datapath: IF Unit

fetch → bring instruction from memory  
 using instruction address.

tasks of processor

- fetch
- decode
- execute

Instruction Memory & Data Memory.

Readonly

⊗ if code writes in instruction memory → viruses (must be)

→ address to memory → get instruction

↳ update PC.



control  
signals

shouldn't always  
update PC

↳ not always PC + 4

↳ branch/jump  
may happen

one of these 3  
addresses will  
be connected



CSE-209

Assignment 2: Implement a 32-bit adder with two 16-bit inputs. The result is 32-bit. The inputs are denoted by  $a$  and  $b$ . The output is denoted by  $c$ .

combinational vs sequential  $\rightarrow$  clock

memory vs register

Leithen n on w  $\rightarrow 2n, 1w$

registering to extend

cycle time = CLK-to-Q + minimum word access time + pipeline delay

Datapath Instruction Fetching (IF Unit)



PC  $\rightarrow$  dedicated register  $\rightarrow$  program counter

④ Write requires CLK  $\rightarrow$  sequential.

④ assume ~~no~~ CLK isn't needed. PC is already available.

should it be written?  $\rightarrow$  NO

Add RTL

add rd, rs, rt

R-Format  
Arithmetic / logical  
op with both ops from reg

|          |      |      |      |         |         |
|----------|------|------|------|---------|---------|
| $op = 0$ | $rs$ | $rt$ | $rd$ | $shamt$ | $funct$ |
|----------|------|------|------|---------|---------|

$$\begin{aligned} & rd, rs, rt \\ & add, \$t_0, \$t_1, \$t_2 \\ & \$t_0 = \$t_1 + \$t_2 \end{aligned}$$



first peripheral

↳ always writes neg addn

next 2 → read neg → RS, RT

$busa, busb \rightarrow O/P$

$\downarrow$   
 $RS$

$\downarrow$   
 $RT$

↳ goes to ALU

data input  $\rightarrow busw \rightarrow$  from ALU op

$WE \rightarrow 1, clk \rightarrow$  writes

↳ doesn't always write

↳ ensures if the  $RD, busw \rightarrow$  can't be written / is in writeable config

↳ from control.



app code of arithmetic / logical op.  
from control

for I-type

memory transfer, immediate transfer

OR I

$\$t_1, \$t_2, 10$

RT RS Immediate value

Lyndon II

has register write to RT

but in write peripheral of RF, there is RD

you → use MUX



control - RD, RT, RS, RT



get immediate from IM  $\rightarrow$  last 16 bit<sup>+</sup>

ALU  $\rightarrow$  32 bit

2 ways

$\rightarrow$  sign extension  $\rightarrow$  sign 0  $\rightarrow$  first 16  $\rightarrow$  0 else 1

$\rightarrow$  0 extension  $\rightarrow$  first 16  $\rightarrow$  0

logical  $\rightarrow$  first 16  $\rightarrow$  meaningless  $\rightarrow$  0 extension

arithmetic  $\rightarrow$  signed  $\rightarrow$  signed extension



extension module.

control  
↳ 0/sign  
for determining extension

op of ALU goes to  $(RT)$

Load       $Rt$        $Rs$   
 lw       $\$t1, 80(\$a0)$   
 $\$t1 = MEM [80 + \$a0]$

has neg write  
to  $Rt$   
already in write  
peripheral

ALU  $\rightarrow$  addition  
of immediate

$Rs$   
already there

Signed extension  
Cohere.

Addition op not final op

$\hookrightarrow$  addn  $\rightarrow$  goes to memory





④ swap RS, RT  $\rightarrow$  need 2 more mux  $\rightarrow$  instruction bits of RS, RT are fixed.

~~RT RS~~  $\rightarrow$  according to instruction format  
~~beq \$t1, \$t2, 100~~

comparison  $\rightarrow$  use sub

PC  $\rightarrow$  PCty

update

$\rightarrow$  no neg write

$\rightarrow$  ALU  $\rightarrow$  sub

RS  $\xrightarrow{\text{mux}}\text{RT}$

} ALU of ~~not~~  
needed  
operations

mux already

$\rightarrow$  check Z flag & check if operation was beq.

{ Z  $\leftarrow$  0 or 1  
depends on flags



$\rightarrow$  address PC + 1, immediate





Full Datapath

ALU operation  $\rightarrow$  4 bit control

load/store : F = add

branch : F = subtract

R-type : F depends on funct

| ALU control | function |
|-------------|----------|
| 0000        | AND      |
| 0001        | OR       |
| 0010        | add      |
| 0110        | sub      |
| 0111        | slt      |
| 1100        | NOR      |

→ ALU only needs to know the op  
not the instruction.



→ opcode  $\rightarrow$  6 bits  
↳ using function can  
get 4 bits of control  
↳ but gets complex

use 2-level decoding.  
multilevel decoding

→ derive 2 bits from 6 bits of opcode

propagated forward to ALU control

→ all controls are obtained from main control unit.  
→ can get 4 bits of ALU as well  $\rightarrow$  but gets complex  
→ use instruction opcode.

ALUOP  $\rightarrow$  input to ALU control  $\rightarrow$  4 bits

funct  $\rightarrow$  input to  
6 bits

easier.



PCSrc control can't be used directly. → AND with Zend

① RegDst → 0 → RD RT  
    L → 1 → RD

② Branch → 0/1 → from opcode [if instr. is branch]

③ MemRead → not needed

Memory Access → costly

↳ slower than register → close proximity

↳ high performance

read memory always → ops become slow.

→ read only when necessary → MemRead

assert → performance

and not cancellation for

④ MemtoReg → Mem op on ALU if p

⑤ MemWrite → write enable MEM

⑥ ALUSrc → RT on Imm.

⑦ RegWrite → write enable RF

⑧ PCsrc → 0 → nextaddr; PC+4

    1 → a n ; BA

### R-Type Instruction

write reg → RD → RegDst : ①

Imm → not needed.

funct → ALU control

ALUOp → ALU Op

ALU → Ifp → RS

    L → RT → ALU Src: 0

    z → not needed

MemtoReg → 0 → ALU Op → valid

## Load

ReadReg 1 → RS

ReadReg 2 → X

WriteReg → RT

RegDst → 0

Imm → sign extend

ALUSrc → MUX

func → Don't care

control from ALU op

ALUi/p → RS, Imm

MemRead → 1

MemtoReg → 1 → O/p → Mem.

element gathering for write

abit 0 → 0

abit 1 → 1

XOR - Branch

XOR - equal

XOR - diff

branch prediction

branch decide - forward value

ALU i/p → RS, Imm

abit 0 → 0

abit 1 → 1

Control & MUX - Write

## Beq

RS, ~~RT~~ RR & WR → X

Imm

write Enable → ~~deassert~~

ALUSrc → Imm

Imm → shift left 2

→ MUX → PCt4, Imm

forwarding paths

Fig

YUM EXE 2nd

2nd & 3rd extensions

## Control Values

| Instruction | RegDst | ALUSrc | MemtoReg | Reg Write | Mem Read | Mem Write | Branch | ALU op1 | ALU op2 |
|-------------|--------|--------|----------|-----------|----------|-----------|--------|---------|---------|
| R-format    | 1      | 0      | 0        | 1         | 0        | 0         | 0      | 1       | 0       |
| lw          | 1      | 0      | 0        | 0         | 1        | 0         | 0      | 0       | 0       |
| sw          | 1      | 0      | 0        | 0         | 1        | 0         | 0      | 0       | 0       |
| beq         | 1      | 0      | 0        | 0         | 0        | 1         | 1      | 0       | 0       |

## Implementing Jumps

2 addn.  
31:26 25:0

Reg write  $\rightarrow$  X

ALU op  $\rightarrow$  X

ALUSrc  $\rightarrow$  X

no memory func.

only change  $\rightarrow$  ins br fetching.

take last 26 bits  $\rightarrow$  shift left 2  
prev addn.

MUX  $\rightarrow$  i/p  $\rightarrow$  Jump addn

$\hookrightarrow$  BA / PC<sup>t4</sup>

$\hookrightarrow$  sequential next instn.

10 controls

additional for jump.

Concatenation unit

X number of bits of  
append first 4 bits of  
original PC  
PCUJA most 4 bits  
 $\rightarrow$  second MUX b

PC<sup>t4</sup>

PCB

## MIPS Pipeline

5 stages, one steppenstage

|     |          |
|-----|----------|
| IF  | → 200 ps |
| ID  | → 100 ps |
| EX  | → 200 ps |
| MEM | → 200 ps |
| WB  | → 100 ps |

} why nonrevenue slice?

memory ops need more length

why not slice memory op?

↳ atomic op.

↳ if possible → 8 stages → speedup = 8

tradeoff { stage ↑  
speedup ↑  
problems ↑

## Pipeline performance

Total → 800 ps



$$\text{Speedup} = \frac{0.8}{0.2} = 4$$



## Pipelining & ISA design

MIPS ISA → designed for pipelining.

↳ fixed instr size 4 byte

x86 → 1-15 byte variable size.

fewer addressing formats:

MIPS → 3

x86 → 7 for memory access

Load / store addressing

MIPS → memory & register op not possible

ISA  
family of  
86 processor  
x86  
specific  
processor

ISA  
family of  
86 processor  
x86  
specific  
processor

arithmetic on mem & neg not possible

→ load first

x86 → one of the operands can be memory  
not both

↳ Memory needed both before & after ALU

→ memory is never needed before ALU op. → better

Alignment of memory operands:

MIPS → per cycle 32 bits

8086 → 16 bit processor

↳ brings 16 bit data

↳ may arrive in 1 or 2 cycles.

stage splitting / difficulty ↑ → use 2 → waste 50%

④ MIPS write back at last

→ no intermediate neg write.

# MIPS → 5 stages

↳ pipelining can run into errors,

### Hazards

→ structural → when a single time slice / instance has the same resource requested by two instructions.

→ Data

→ control

instruction from memory  
load/store to memory



uses Instruction memory & Data Memory  
↳ solves structural hazard using  
resource splitting.



if  $I_1, I_2$  were not present  $\rightarrow$  RAW hazard.

Read after Write  $\rightarrow$  head immediately after write

add \$30, \$t0, \$t1

sub \$t2, \$30, \$t3

naive solution:  $\rightarrow$  NOP operation (stalling / bubble)

$\hookrightarrow$  32-bit 0  $\rightarrow$  in MIPS

$\oplus$  insert 2 instructions in between

$\hookrightarrow$  trade off  $\rightarrow$  more time

each RAW hazard needs 2x waiting  $\rightarrow$  penalty.

any hazard can be solved by waiting.



$\hookrightarrow$  forwarding / bypassing  $\rightarrow$  not always possible

② Load-use data hazard.

lw \$30, 20(\$31)

smo \$t2, \$30, \$t3



use 1 NOP + forwarding

→ New & better

branch and B

order &  
data?



→ forward to bottom EX  
(forwarding hazard removed)

→ try to add some instruction in between  
for stalling.  
↳ done by compiler.

fixed size

-0, 1, 2, 3 inc  
compiler  
optimizations  
are significant

→ checks if there is any independent  
instruction that can be inserted in between.

→ Control Hazard to avoid it

→ Branching, no per

stall after branch

fast mapping

→ branch taken or not taken → decided after MEM

→ branch taken or not taken → decided after MEM

add \$4, \$5, \$6

log \$1, \$2, \$0

still one instruction.

as it is known that  
the op is a branch  
one only after ID.

3 stalls needed.

→ IF after MEM



solution → branch prediction

diff. schemes.

9110  
times  
stall  
not  
required

9 will  
connect,  
1 time discards  
3 ins by  
stall

→ predict branch is not taken.

(static branch prediction)

→ 90% 99% time connect.

dynamic branch prediction

### Delayed Branch

beq \$1, \$2, \$4 → max write

add \$4, \$5, \$6

if none of the branch

neg one needed, insert  
instruction in between

up to 3 → then mem

has decision  
at 1st the bba

anywhere in the  
program can go any  
place

branch always

means add to PC

multi-cycle pipeline diagram



ID  
instruction decode.

PC+4 needs to be stored throughout

↳ branch decision is taken in MEM

all stages update PC+4

but is needed for branch. → branching

→ after 3 stages,

why not use PC+4-12

PC+16?

→ not guaranteed.

(can't have connect values.)

branching may happen midway.

if PC+12 & PC+16

(PC+12, PC+16)

MEM → needs RT for store.

convention

write back

memory

when write

write



④ long in (wd)

→ write reg value gets updated.



⑤ 3 clock cycles back  
some even in any  
pipelinedness.

scn → store & carry the write reg.



Load, Store

Pipelined Control

Control

9 controls excluding jump.

when is the  
control signal  
generated?

(IF) no control

regwrite → forward

(ID) no control

(EX) ALU src, ALUOp, RegDst

1

2

1 → 4 bits

(ID)  
each clk  
has new  
insts.

(MEM) branch, MemRead, MemWrite

(WB) MemtoReg, RegWrite (st), RegWrite (ld)

(IF) generation not possible. → instr. not known

(ID) instr. op code known → generate Control  
6 bits of opcode to 9 control

↳ store previous control signals.



Why not EX, MEM, WB in only the pipelines



→ gets overwritten

→ use carrying forward stores values valid for  
and forward that point.

### Data Hazards in ALU Instructions

sub \$2,\$1,\$3

read from

and \$12,\$2,\$5

or \$13,\$6,\$2

add \$14,\$2,\$2

sw \$15,\$0(\$2)

first write to \$2

four reads from \$2

performs (not being flushed)

changes from executing another



$\$2$   
 written on  
 rising edge.

- ↳ number of write-backs
- how to forward?
- how to tell if it is needed?
- how to detect hazard?

check RD of prev instr. (immed)



if matches (RS, RT)  
 of current,  
 when we get this?

Check EX/MEM.RD

$$\begin{aligned} \text{EX/MEM.RD} &= \text{IF ID} \\ &\quad \text{IF ID} \\ &= \text{EX/MEM.RS} \\ \text{EX/MEM.RD} &= \text{EX/MEM.RT} \end{aligned}$$

if

HAZARD

Known

immediate instruction

1st type RD is saved

also variant of RAW

② How to detect the third one? → check MEM/WB

$$\begin{aligned} \text{EX/MEM} & \quad \text{IF ID.RS} = \text{MEM/WB.RD} \\ & \quad \text{MEM/WB.RD} = \text{IF ID.RT} \end{aligned}$$

HAZARD  
2nd type

detection condition.

previous 2 + ...

RD not always written. eg. bea → nouwrite

↳ write controlled using RegWrite

→ ALU gets calculated value which was not written.

→ wrong forwarding

→ check control bits

check control of RF → if writeEnB → ..

[EXIMEM, Regwrite & condition 1]

[~~MEM/WB, Regwrite & condition 2~~]

① what if \$RD is \$ZERO? if not write.

→ can be used in instr.

→ not allowed in hardware.

→ wasted instr.

add \$0, \$1, \$2, ..., \$n

[EXIMEM, RD ≠ 0]

• MEM/WB-RD ≠ 0

→ detected forward fill now

forwarding

latch mem

regfile = latch mem

forwarding

how to forward?

select one of 3 values.



RS → detect → FwdA

RT → detect → FwdB

EX/MEM, RD == ID/EX, RS →  $\begin{cases} \text{FwdA} \rightarrow 10 \\ \text{FwdB} \rightarrow 00 \end{cases}$

bind forward - avoid fix

MEM/WB, RD == ID/EX, RT →  $\begin{cases} \text{FwdA} \rightarrow 00 \\ \text{FwdB} \rightarrow 01 \end{cases}$

Double Data Hazard

add \$1,\$2,\$3

add \$2,\$1,\$3

add \$1,\$4,\$4

both hazards  
read after write  
read after after write

double data hazard

Double Data Hazard

add \$1, \$1, \$2  
 add \$1, \$1, \$3  
 add \$1, \$1, \$4

double hazard

how & which value to take?

both hazards  
EX & MEM

CT till 115

Ch-4  
 intro to  
 4.3



|        | cc:  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | ABus |
|--------|------|----|----|----|----|----|----|----|------|
| \$2:30 | \$1: | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10   |
| \$3:20 |      |    |    |    |    |    |    |    | 10   |
| \$4:4  |      |    |    |    |    |    |    |    | 10   |

{ if → ABus  
 EX & MEM → 5th → fwd EX  
 when checking MEM/WB, check EX/IMEM

{ if → ABus ← EX, XE, QE = 0s from MEM/WB.  
 If ABus = 10, → don't fwd from MEM/WB.

MEM hazard

if (MEM/WB.Regwrite & (MEM/WB.RD ≠ 0))

& !(EX/IMEM.Regwrite) & (MEM/EX/IMEM.RD ≠ 0))

→ if true → don't fwd from MEM/WB.

→ if false → fwd from MEM/WB.

## Load → Store (Mem to Mem Copy)

no register to memory connection  
register same

lw \$21 20(\$1)

sw \$21 15(\$3)

Reg to Reg → possible → some RF

Reg to MEM → no connection

must go via ALU



→ use forwarding. (④. 4E n buffers and branch bypassing)

④ → how to detect this & how to forward? HW  
 ↳ abstract path  
 tries max TS very early  
 using MUX, source DM  
 → not full datapath  
 ↳ truth table for mux  
 $[R_{A, D1, D2} = R_{B, X1, X2} \text{ no exception}]$

## Load-Use data Hazard

↳ stall + finding on code schedule  
 (branch) ↳ not guaranteed

↳ must handle using hardware.

↳ RAW / RAAM  
 ↳ 3pm: fwding

④ How to detect?

④ How to stall?

DEFL-STRG (DEP-DEFL-TRXH) bim forwarding rule

↳ Branch detection

↳ branches under EPC

↳ XMMT-SWAP (readable prior branching)

Raw hazard is detected in EX stage

Stall → stop values & instructions pushed to pipeline

↳ use control value 0 → virtually no write

↳ all write signals are deasserted.

virtually  
garbage instruction

steps:

① control signal = 0

control is generated in ID

? How to stop?

hazard is detected in EX

↳ hazard must be detected in ID. ②

how?

beginning of work & end of work

match with prior RT curr RS/RT.

[ $ID/EX.RT == IF/ID.RS$  or  $ID/EX.RT == IF/ID.RT$ ]

↳ hazard

check control

② how to detect a load? → check MemRead

only asserted for Load

Forward

$ID/EX.MemRead$  and ( $ID/EX.RT == IF/ID.RS$  or  $ID/EX.RT == IF/ID.RT$ )

IF/ID → no controls.

→ Load-use hazard.

→ forward using additional value in MUX.

\* the operations still take place but no write happens.

But a new instruction enters in the meantime & the current instruction needs to be reviewed.

→ other values are reserved. PC gets updated.  $\Rightarrow PC + 4$

→ new instruction loaded

→ how to resume? Next pipelines are already preserving values.

\* need to pass IFID to next IFID & IM word to next IM

→ ~~next~~  $\Rightarrow PC + 8 \rightarrow$  stop this. (2)

→  $PC + 4$  is written at the end of CC3.

→ set PC write signal = 0.

next one also  $PC + 4$

→ stop IFID write. → also overwritten @ end of CC3.

→ also set [IFID write = 0]

\* set next controls = 0, PC write = 0, IFID write = 0

One stall → no more hazard. → other problems solved using fwd.

(Stall) NOP push → some activity