

# Lec 14-3-22

Page No. \_\_\_\_\_

# 15A

- 1 Limited memory
  - 2 Slow memory
  - 3 Ease of Programming



for Limited Memory  
Implicit operands

## Address format

Leave at form

18

Can we construct a machine? Yes!!

|    |                |   |
|----|----------------|---|
| P. |                | M |
| R. | AM             | E |
| O. |                | M |
| C. |                | O |
| S. | Day/<br>Instal | R |
| S. |                | Y |

(Processor takes data of instr from memory & does comput.)



## Lifd Semantics

temp memory  
(stack)

take 1 or 2  
operands from  
top of stack,  
& return to  
top of stack.

1 address for each

39

- cannot throw out 'open'

<sup>↑</sup>these are "explicit" (explicitly mentioned where the operands, dest<sup>"</sup> are...)



# Zec 14-3-22

Page No. \_\_\_\_\_  
Date \_\_\_\_\_

### # TSA

- 1 Limited memory
- 2 Slow memory
- 3 Ease of Programming

| Opn           | Operand 1 | Operand 2 | dest <sup>n</sup> |
|---------------|-----------|-----------|-------------------|
| $\rightarrow$ | 32        | 32        | 32                |

↳ 103 bits.

1. → try to minimize no. of bits per instruction.

$$f = B + C$$

9000 3000 5000 3 addresses  
ADD 3000, 5000, 9000

- cannot throw out 'opns'

↑ there are "explicit" (explicitly mentioned where the operands, dest<sup>n</sup> are...)

for Limited Memory

Implicit Operands

0-address format

Instruction

↑

Can we construct a machine? Yes!!



(Processor takes data & data comput...)

LIFO Semantics

temp memory (stack)  
Memory takes 1 or 2 operand  
from top of stack,  
& returns to top of stack.

1 address format:

↓

39 bits

A = B + C \* D  
PUSH E 39 Memory address.  
PUSH D 39  
MUL 7  
PUSH B 39  
ADD 7  
POP A 39



Top Counter

$$A = B + C * D + E / F \rightarrow$$

just 2 want work



E/F not ...

PUSH C

PUSH D

MUL

PUSH B

ADD

POP T1

PUSH E

PUSH F

DIV

PUSH T1

ADD

extra (pair)  
due to limited  
temp. stack

F/F

E/F

B/C

$$A = B + C$$

+ BC

BC +

Postfix/Polish notation

\* Big pages — Algo-  
Using  
Stack ]

$$A = B + C * D$$

$$= B C D * +$$

NB

PUSH B  
PUSH C  
PUSH D  
MUL  
ADD  
POPA

(convert to Polish notation  
just seen L  $\rightarrow$  R)

AM  
to write  
code direct ✓

1-address  
[ opnd | operand ]  
7 32

eg ADD A  
SUB B



( $\leftarrow$  was PUSH, PUF)  
(LOAD STORE)  
A = B + C  
LOAD B  
ADD C  
STORE A

#lec 15-3-22

TSA



$$A = B + C + D$$

LOAD C

MUL D

ADD B

STORE A

$$A = B + C + D + E$$

LOAD B

MUL C

STORE T1  
LOAD D  
MUL E  
ADD T1  
STORE A

(- only 1 space in Temp (accumulator)  
in main memory.)

Accumulator is  
still the bottleneck

$$\text{e.g. } A = B + C$$

$D = A + E$   
used multiple times

$P = A + R$   
Store A in  
some registers



a) LOAD B 40  
ADD C 40  
STORE R0 10  
LOAD E 40  
ADD R0 10  
STORE D

① 32 → 2

+ no need of making memory reference

b) MUL C :  
10000000000000000000000000000000  
+ 10000000000000000000000000000000  
+ 10000000000000000000000000000000  
10000000000000000000000000000000  
cycles



3 address

op[1] op[2] dest[1]  
 $7 + 2 + 2 + 2 = 13 \text{ bit}$

Load Store

op[1] op[2] 7 + 2 + 32

In Load, reg → mem  
In Store : reg → mem  
(given address)

LOAD R1, B (Load into R1)

LOAD R2, C

ADD R1, R2, R0  $7+2\times 3=13$

LOAD R3, E

ADD R3, R0, R1

Overwritten (e.g. P.)

(No performance  
if C later used again.)



Stats : 32 reg.  $\rightarrow$  80-85% live var. held ✓  
64 reg.  $\rightarrow$  ~95% .. ..

- Spill Code | e.g.  $x_0$  var.  
↓  
load-store pair..

# Instructions need to be accessed from memory  
e.g. 2010 cycles for accessing the instruction

$$A = B + C$$

$$D = A + E$$

$$R3$$

Page No. 1

$$A = B * C + D$$

$$MUL R0, R1, R4$$

$$ADD R2, R4, R3$$

$$R1, R0, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

$$R0 * R2 + R3 [Lab]$$

$$R0, R1, R2, R3$$

# Lec 17-3-22

ISA Constraints

- 1. Limited Memory : minimize #bytes
- 2. Slow memory : complex instrn
- 3. Ease of Programming

for ( $i=0$ ;  $i<1000$ ;  $i++$ )  
     $A(i) = B(i) + C(i);$

$A(0) = B(0) + C(0);$   
    :  
    :  
 $A(999) = \dots + \dots;$

Can use (ptrs to addresses) (store these in registers)

R1 = 5000

R2 = 9000

R3 = 1000

ADD R1, R2, R3 , i.e add things at where addresses  
are contents of R1, R2

& store at loc with address = content  
of R3

OPCODE opr1 opr2 Dest.

AM  
Address mode

1. Direct Memory
2. Register direct (content = opr.)
3. Register indirect (i.e. address of opr.)
4. Base + offset (S.I.)

Next Iter^n :

5004 (1 operand : 4 bytes)  
9004  
1004 (ptrs)

AB[0] addr. M3 is A + 40 (#bytes)  $\frac{[(R_1)40]}{\text{offset}}$

4. Base + offset  
const.

5. Base + Index (ex. top of stack)  
i. in another register { 2 registers  
    one R1 (base add. (A))  
    other RS (i) }  
    R1 R5

6. Double ptr.p.

MEMORY INDIRECT



7. AUTO INCREMENT / DECREMENT

e.g.

PUSH R1, R2  
R1 contains  
address of top of stack

M(R2) = R1

R2 = R2 + 1

automatically  
by AUTO. INC (ptr of top of stack  
incremented..)



(P. 8 on 182)  
One new A/M (for arithmetic/logical  
instructions)

### - IMMEDIATE



i.e. no need to access anything from reg/memory just that is bits are the operands

To distinguish,  
ADD vs ADD I

→ now 6 bits # 32 op's only.

(ys) ↓ but I @



see from

I type  
Immediates:

OPR1

#263 ✓  
---  
000 000 00  
for R-type

↓

127 op's

R  
I  
B1  
J

### Memory Comm<sup>t</sup> (DLX..)

Memory to Reg : LOAD  
Reg to Mem. : STORE

B+D : Base + Displacement.

B+I : Base + Immediate

(R type) OPR1 DEST IMMEDIATE

R1 → word size bytes for planning

DISPLACEMENT

this reg has to be loaded.

base 8x10 imm cause this value directly

LOAD

↓

R2 = M(R1+80)

(load)

STORE

M(R1+80) = R2

DN word (4 bytes)

SH half word → least sig. 16 bits will be loaded

SB

LB (Byte → " " 1 byte " )

BB

⑩ Base + Index

(~ R type :)

R2

- e.g.  $i=0$  ( $< 1000$ ,  $i=i+1$ )  
 constants |  $M_1$  is: treat as var., but  
 don't change value.  
 but we've limited memory

### (8) IMMEDIATE (VALUE)

For Cache  
 16 bits: ~85% constants we use are  
 32": ~99% "



e.g. MAC  $\Rightarrow$  3 OPR, 1 DEST



OPCODE OPCODE OPCODE OPCODE  
 Connect to FU Connect to MUXES, DEMUX

- John Cocke (IBM) -  $\rightarrow$  fast b/c memory access b/w  
 RISC  
 Reduced Instruction Set Computer  
 CISC  
 Complex/Complete

### # Encoding



Considering MAX:  $7 + (3 \times 32) \times 4 = 147$

for fixed len encoding (RISC)

for CISC: var. len encoding



# 2021-3-22

### ISA

1. Arithmetical Logic

2. Memory Command Unit (189)  
→ LOAD/STORE

3. Control Flow Change.  
Unconditional conditional

Conditional

[Oper? ] ADDRESS

$$f(i=j)$$

SUB R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub>

BEQZ R<sub>3</sub>, ADDRESS

(Branch if equality with zero) probably it is: branch if... flag is set..

FLAG

CONDITION CODE REG  
STATUS REGISTER



CISC

RISC

- will focus on RISC  
- small instr.  
J# A/M2  
fixed len.  
Latency

I R1

# DLX / MIPS → RISC-V

- 32 bits ISA (every instr in 4 bytes) (or RISC)

- 32 Registers: R<sub>0</sub> - R<sub>31</sub>

- Load/Store architecture (only these instr's can access main memory)

- Arithmetic / Logical instr's



(\* can access only from reg. p.) arithmetic, logical

64 opcodes.

ADD

SUB

AND

OR

XOR

SLL : shift left (\*2)

SRL : " right (/2)

SLT : Set on less than

SLE : " " " or equal to

SGT : " " " greater than

SGE : " " " or equal to

SEQ : Set on equal

$$ADD R_1, R_2, R_3 \Rightarrow (R_3 = R_1 + R_2)$$

if ( $i < j$ )  $\equiv$  [content of R<sub>i</sub> is less than j?]

SET R<sub>1</sub>, R<sub>2</sub>

SET R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub>

if ( $R_1 < R_2$ )  $R_3 = 0 \dots 0$

(each reg. size is 32 bits)

else  $R_3 = 0 \dots 1$

(not bit reversed)

(if  $R_1 > R_2$  then  $R_3 = 1 \dots 0$ )

Page No. 1 Date 1

A() ->  
 B() -> JAL B  
 C() -> JAL C  
 R31 = 104 (current)  
 takes to sc4.  
 JR R31  
 again to 504!!  
 104 vanished

ADD R5, R0, 1000H  
 R5 = 10001000H (32 bits)  
 "nStack ✓

SWI R31, (R5)0 (Base + Displ.)  
 ADD R5, R5, 4 (Stack ptr.)  
 ← Code #2 line ↑ & then SUB R5, R5, 4  
 LOAD R31, R5

Example:  
 Base A[100] Add all elements of A.  
 R1 = Counter = 100  
 R2 = #Elements = 0  
 R3 = 00001000

Initialize: (0)  
 ADD R1, R0, 100  
 ADD R2, R0, 0  
 ADD R3, R0, 1000H  
 LOAD R4, (R3)0 [0 offset]  
 ADD R2, R2, R4 ] add to sum  
 ADD R3, R3, 4 ] move to next loc  
 SUB R1, R1, 1  
 BNEQ R1 Loop ] if ≠ 0, loop (goto label Loop)  
 (-5) [ 20-52 = #Elements ]  
 STORE R2, (R3)0  
 HLT

```

    # word
    int main() {
        :
        push r10, r11 → fn(4)
        push r10, r11 → fn(4)
        push those into stack
        :
        fn( ) → pop those ←
        R10
        R11
    }

```



(pushed left  
to right...)  
(can also  
push R-2L)

Lec 24-3-22

# ISA

184 Intel : ISA indirection

# ARM ISA

- all conditional instrns

IF (X == 0)

THEN

A = A + 1  
↑ R1

ADD R1, R1, #1 (no branch req.)

- R15 is PC (Program Counter). [ARM].

R15 = R15 + R1

R15 = R1 + R2 ↳ i.e. go to addr R1+R2

- Load, Store multiple registers.

- Simple Instrns

- A few instrns format

- A few A/M

- fixed (len) encoding.



- ① ALL (Arith. & Log.)
- ADD
  - SUB
  - AND
- let only R-type

- ② Memory Comm.
- LW (Load Word)
  - SW
- B+D (Base + Disp) I-type

- ③ Branch
- |               |                                                                  |
|---------------|------------------------------------------------------------------|
| control - J   | J-type                                                           |
| control - BEQ | I-type (e.g. BEQ R <sub>1</sub> , R <sub>2</sub> , LOC)<br>Imm16 |



Controller instructs,  
Data Path performs (workhorse)

N.B.

### DATA PATH

① FATAL UNITS → ALU

② STORAGE →

- Programmer's registers (aka architecture reg.)
- Program Counter / Instrptr [pointing to the next instr to be executed]

- Temporary Storage

### STEERING LOGIC



### CONTROLLER

Comb'nal logic  
Finite State Machine (FSM)



Single go NOT possible in CISC. (not even know simple instr or net...)  
(e.g. we fetch 4 bytes)

vs

RISC

### # Control Flow Change: (DLX)

# BEQZ, REG, LOC ← I-type

[opn: | ORR | DEST" | IMM 16]

→ if R2 is 0  
more ( $80 \times 4 = 320$  bytes)  
#instr's

\* - Instr<sup>n</sup> PTR. (Inv to keep 1 Reg. for this...)  
IP = IP + IMM16 × 4  
(i.e., every instr<sup>n</sup> is 4 bytes) (DLX - fixed instr<sup>n</sup> size...)

branch # (REG) = 0  
→ BNEQ REG, LOC

### # Uncond'l

# J: Jump  
+ LOC

[opn: | Imm 26 ]

IP = IP + IMM26 × 4

(i.e., this many instr<sup>n</sup>'s away from curr.)

# f'calls: Inv to come back to that loc".

JAL LOC

Jump & Link  
the curr.  
addr. from  
where jumping/leaving

: ~ store into R31 (return addr.)

(R31 gives info  
of returning addr.-)

JR Reg. (nR type)

### Lec 22-3-22

# DLX

|   |      |      |      |      |   |   |   |   |
|---|------|------|------|------|---|---|---|---|
| R | OPR0 | OPR1 | OPR2 | Dest | 5 | 5 | 5 | 6 |
|   | imm  |      |      |      |   |   |   |   |

I [opn: | OPR4 | DEST | IMM16 ]

T [opn: | IMM26 ]

Also in  
pdf DLX - Inst<sup>n</sup> set  
(folder)

Base + Index: R-type  
Base + Disp.: I-type

BEQZ RX, LOC  
relative offset  
branch  
☒ #instr's (not #bytes).

BNEQ

J (Jump) → ↑ IMM26 +  $2^{26} \times 4 = 2^{29}$  bytes jump possible.

~256 MB (+ fast:  $2^{29} \times 256$ )

JR (jump to register) : JR RX

register

(i.e., moved to loc given in RX  
only ~26 bits.. vs 32 bits for 4 GB  
memory)

(fixed for?)

- R31 contains return addr.

(JAL) LOC → Jump & Link

JALR RX

Reg

- return can be encoded by JR R31

→ R31 overwritten

→ We had  $\# \text{instr} \times \text{Op} \times \text{cycle time}$   
 program ~fixed  
 for RISC



Can read  
only 2,  
& write atmost 1.



will  
select  
out reg  
based  
on A3.  
Reg-WRITE (~EN signal).

Prof: Demux gives  
 $\begin{cases} F_0 \\ F_1 \\ F_2 \end{cases}$  & I did not D3 written  
 (contradict starting A demux)  
 it was  
 $\begin{cases} F_0 \\ F_1 \\ F_2 \end{cases}$  outputting 0 0A0 0

Page No. / Date /

↓ Same as DLX

Page No. / Date /

R-type      open DFR1 DFR2 DFR3 FUNC

e.g. ADD R1, R2, R3       $R1 = R2 + R3$

|        |       |       |       |           |        |
|--------|-------|-------|-------|-----------|--------|
| st     | 26 25 | 21 20 | 16 15 | 11 10 9 8 | 0      |
| 000000 | 00010 | 00011 | 00001 | R2        | R1 ADD |

- PC will tell from where to fetch the instr. [ & update PC ]

- PC → Mem-Address



RT#1 Load, J I imm16x4 (if not zero e.g. R1+39 loc 39 load  
 VS  
 BEQ J I imm16x4 (if not zero, then there's branching to another instruction - multiples of 4)

### # Memory Comm B+D J-type



R2 = M[R1 + 1000] [Load R2 with things at address R1 + 1000]



IMM16: pad to make it 32 bits

(+ pad with MSB, 0 or signed offset)

sign Extender

(SE16 means 16 bits of 32 bits)

(SE26)



OPC : M[R1 + 1000] = R2

BEQ R1, R2, 100

|     |      |      |    |       |    |
|-----|------|------|----|-------|----|
| 31  | 25   | 21   | 20 | 16    | 15 |
| OPC | OPR1 | OPR2 | I  | MM 16 | 0  |
|     | R1   | R2   |    | 100   |    |

If (R1 == R2) checked by SUB if zero flag set →  
 PC = PC + 4 + (100x4)

else

PC = PC + 4



J: PC = PC + 4 + I imm16x4 (unconditional)

(J-type)

[JE26]

& no MUX (unconditional unless BEQ)  
 (see NP for "complete" --)

26 bits to 32.

NP1: Load  $\#1mm16 \times 4$  at  $R1 + 39 loc$  e.g.  $R1 + 39 loc \#1mm16 \times 4$   
 VS  
 BEQ  $R1 + 39 loc \#1mm16 \times 4$  branching to another  $\#1mm16 \times 4$  multiple of 4

# Memory Comm B+D I-type



$R2 = M[R1 + 1000]$  [Load R2 with things at address  $R1 + 1000$ ]



BEQ  $R1, R2, 100$

OPR1 21 25 21 20 16 15  
 R1 R2 100

if  $(R1 == R2)$  checked by SUB if zero flag set  $\Rightarrow$   
 $PC = PC + 4 + (100 \times 4)$

else  $PC = PC + 4$



I:  $PC = PC + 4 + \#1mm16 \times 4$  (unconditional)

(I-type)

(SE16)

& no MUX ( $\Rightarrow$  unconditional)  
 without BEQ  
 (See NB for "complete" --).

26 bits to 32..

# Lec 28-3-22

ARM  
# All combined



|   | R      | I   | S                 | D                   | C     | F     | J | K | L | Mem Read                             | Mem Write | RF | PC |
|---|--------|-----|-------------------|---------------------|-------|-------|---|---|---|--------------------------------------|-----------|----|----|
| R | 000000 | 1 1 | 1 $\Rightarrow$ X | 1 1 0               | 0 0 1 | 1 0   |   |   |   | (h,l : no data mem in R)             |           |    |    |
| I | 000000 |     |                   | 1 1 X 0             | 1 1   | 1 0 1 |   |   |   |                                      |           |    |    |
| S | 000000 |     |                   | 1 1 X 0 0           | 1     | 0 1 0 |   |   |   |                                      |           |    |    |
| D | 000000 |     |                   | 1 1 $\Rightarrow$ X | 0 0   | 0 0 0 |   |   |   |                                      |           |    |    |
| C | 000000 |     |                   | 1 1 0 1 X 0 X       | 0 0 0 |       |   |   |   | (-WIFC 32 bit)<br>(J: unconditional) |           |    |    |

Fill table & use K-maps on col's to get ckt for control vars  
(control logic).

CPI = 1 (every instrn: 1 cycle). (single go # Ø : can see)

$$\frac{\text{Time}}{\text{Prog.}} = \frac{IC \times CPI \times T}{T} = \frac{\text{Dynamic}(2)}{\text{Count}}$$

Mem read/write :  $2n_p$

RF " " :  $1n_p$

ALU :  $2n_p$

target path: load instr



$$2n_p + 1 + 2 + 2 + 1 \xrightarrow{\text{RF}} \text{write RF}$$

assume (MUXes time!!)

$\Rightarrow 8n_p$  (max)

$$\text{Freq of opns} = \frac{1}{8n_p} = 125\text{MHz}$$

- for ( $i=0$  ;  $i < 1000$  ;  $i++$ ) } static count of instrn = 1  
A(i) = A(i)+1 } dynamic ... = 1000

#  $\uparrow$  was S  $\Rightarrow$  S' C: single cycle if  $\checkmark$ , e.g.  $\uparrow$  load.  
(can do S  $\Rightarrow$  P  $\Rightarrow$  P'  $\Rightarrow$  S' too.)

$\rightarrow$  Trans. system.

$\rightarrow$  Sub-tasks

1. Instr Fetch | 'housekeeping' tasks
2. Update PC (Inv to do)
3. Operand read
4. Instr Comput or Comput of addr. of memory
5. Read/Write Memory
6. Update the state (write to RF).

# For All Instr

1. Instr fetch
2. Update PC
3. Opr read
4. Instr Comput
5. Write back to RF (read/write mem. Ø)

### States

R-type (ALU) Arithmetic

1. Fetch instruction
2. Update PC
3. Read operand
4. Execute
5. Write back result



Page No. 257  
Date

to be written...



S<sub>0</sub> → S<sub>1</sub> → S<sub>2</sub> → S<sub>3</sub>



(R-type)

no more reg. for this  
use control signal PC-WRK1-  
or PC-WRK2





ESM↑:  
1st clock tick -- job  
2nd " " -- job  
:

# Lec 29-3-22

S → S'

CPI = 1

T = 8 ns

$$\text{Time} = \text{IC} \times \text{CPI} \times T$$

$$\text{Prog} \quad \times 1 \times 8 \text{ ns}$$

$$NAPS = 125 \quad (\because 125 \text{ MHz} \rightarrow \frac{1}{125 \times 10^6})$$

Page No. 245  
Date

S → P → P' → S'

Load: 8 ns

but, Jump: 8 ns  $\Rightarrow$  8 ns idle  $\Rightarrow$  break down as ( )

1. Instrn fetch
2. Update PC
3. external read
4. Execute / Compute the addr

5. Mem. access

6. Write back result

Load = 8 ns

AL = 6 ns

(Mem, log) ST = 7 ns

Branch = 5 ns

J = 4 ns

(?) Values (vs later 214..)

# Data Path



# Load B+D (Base + Displ.)

LW R<sub>1</sub>, (R<sub>2</sub>), 100 : R<sub>2</sub> = M[R<sub>1</sub> + Imm16]

I-type [opn] [R<sub>2</sub>] [Disp] Imm16  
6 5 5 16

reg. into which values loaded

1. Fetch instr
2. Update PC
3. Read operand
4. Compute addr.
5. Read memory
6. Write back result into RF

PC → mem. addr., alu-A  
Mem. Data → SFR  
+4 → alu-B  
aluC → PC (PC+write).

Same as S4

S4 IR<sub>25-21</sub> → RF-A1 (only 10 bits)  
RF-D1 → T1 <

S5 T1 → alu A  
IR<sub>20-16</sub> → SE16 → alu-B  
alu-C → T3 (contains the mem loc...)

→ pt: no need of 2 separate memories now - (vs single go @T1..)

S6 T3 → Mem-Ad  
Mem. Data → T3 < has the load result  
(write into RF)  
(loaded ✓)

S7 T3 → RF-D3 (demux)  
IR<sub>20-16</sub> → RF-A3 (loaded into register)  
Dest<sup>4</sup>  
Control  
for Load oper.



→ write oper occurs at end of cycle (at this rising edge)

# Store SW R<sub>1</sub>, 100  
1. Instr fetch  
2. Update PC  
3. Read operands  
4. Compute addr.  
5. Write to memory

PC → Mem. Addr, aluA  
Mem. Data → IR  
+4 → alu-B  
alu-C → PC

S9 → S<sub>10</sub> → S<sub>11</sub> → S<sub>12</sub>  
4 steps/states

S10 IR<sub>25-21</sub> → RF-A1  
IR<sub>20-16</sub> → RF-A2  
RF-D1 → T1  
RF-D2 → T2  
("base")  
+ reading of dest "data in advance"  
dest "data in advance" → M[R<sub>1</sub> + Imm16] = R2  
("this part the value to be stored")  
reg

S11 T1 → alu A  
IR<sub>15-0</sub> → SE16 → aluB  
alu C → T3  
T3 now has R<sub>1</sub> + Imm16  
(mem. addr.)

S12 T3 → Mem-Ad  
T2 → Mem D  
Storage

# Lec . 31-3-22

~~BEQ R<sub>g1</sub>, R<sub>g2</sub>, Imm~~

J-type [OPn | OPR1 | T<sub>dest</sub> | IMM16 ]  
 31 26-25 31-20 16-15  
 PC = PC + 4 + IMM16 \* 4  
 PC = PC + 4 + 0

IF (R<sub>g1</sub> == R<sub>g2</sub>) then  
 $PC = PC + 4 + IMM16 \times 4$

\*? IMM16 is #instr<sup>n</sup> (BEQ, J)  
 (+ 200 top: 64)

Else

$PC = PC + 4$

1. Fetch instr<sup>n</sup> { "background" tasks.

2. Update PC

3. Read operands (R<sub>g1</sub> & R<sub>g2</sub>)

4. Compare (Subtraction) ~~PC~~  
 5. If zero then  $PC = PC + 4 + IMM16 \times 4$

PC → Mem → A<sup>16</sup>, alu-A  
 t4 → alu-B  
 alu-C → PC  
 Mem. Data → IR (i.e. instr<sup>n</sup> fetched from memory, put into IR).

A → C  
 B → C  
 zero flag set if zero result

S13

IR<sub>25-21</sub> → RF-A1  
 IR<sub>20-16</sub> → RF-A2  
 RF-D1 → T1 (temp reg)  
 RF-D2 → T2

PC → alu-A  
 IR<sub>15-0</sub> → SE<sub>16</sub> → 2S → alu-B

alu-C → T3

loaded in S13...

T1 → alu-A  
 T2 → alu-B  
 PC → alu2-A  
 IR<sub>15-0</sub> → SE<sub>16</sub> → 2S → alu2-B  
 If (zero == 1) then  
 alu2-C → PC  
 T3 → PC

Lpt. 4.5 can do n || busy PC +  
 + mem. req.  
 commutes  
 data then if  
 busy → stall  
 load  
 (otherwise nothing)

back PC ←

PC + 4

Aim: minimize # states  
 so that C.R.L.

& here: 2 ALU's req...  
 BUT can do other tasks  
 in S14 itself!!

Optimized in pencil. If PC pre-computed

S13 → S14 → S15. & only 1 alu.

& "memory"

vs in single  
 cycle  
 2 memory &  
 2 ALU's



S13



# Jump inst: J-type format.



1. Fetch instr

2. Update PC  $\leftarrow \text{PC} + 4$
3. Compute  $\text{PC} + 4 + \text{Imm}26 \times 4$  (no "Read OPR" here, op)
4. Update PC

(when fetching instr, don't know which PC it is...)



So,  $S_1, S_2, S_3, S_4$  same: in terms of operations they're (not "equivalent," b/c next states states not same).

Can make  $S_1 = S_5 = S_{10} = S_{14} = S_7$   
 ↓ (redundant oper); extra for jump)

Can make these equivalent by using a decoder, that decodes next state based on oper field.



Can combine  
 S3, S8 tho  
 IR  $\rightarrow$  diff: IR  
 as:  
 IR  $\rightarrow$  25  
 IR  $\rightarrow$  control (oper core of the job...)

⇒ 9 states  $\emptyset$ .

- sharing of resources ✓ (add'l mem, ALU, but reg & T...  $\emptyset$ )
- Control signals for each state need to be decided





Page No. 3.17  
Date 1/1

IF ID EX MEM WB

gets R<sub>1</sub> = 50

$20 + 30 = 50$  computed & stored in pipeline register (E/MEM)

if dest<sup>thick</sup> of I<sub>1</sub> = src of I<sub>2</sub>, then problem.

If R<sub>M1</sub><sup>ID</sup> = R<sub>E2</sub><sup>EX</sup> or R<sub>M2</sub><sup>ID</sup> = R<sub>E1</sub><sup>EX</sup>

if so, do:



e.g. 10% dep.  $\Rightarrow 1 \times 0.9 + 3 \times 0.1 = 1.2$  (CPI)  
 penalty for 10%.

$\Rightarrow$  Can do: read from pipeline reg.  
 L k/a Data forwarding / Data bypassing.



Hv to detect..

### ① Data Dependency

$$I_1: r_1 = r_2 + r_3$$

$$I_2: r_4 = r_1 + r_5 \quad \text{dependency}$$

$$\begin{array}{l} r_1: 10 \\ r_2: 20 \\ r_3: 30 \\ r_4: 40 \\ r_5: 50 \end{array}$$

### # Characteristics of pipeline:

1. fetches 1 instr at a time
2. Unifid pipelines
3. lock step advancement (can't advance until 3rd stage advanced)

# Clock Pd : max time taken by any state

ALU: 8ns  
Mem: 2ns  
RF: 1ns

$$t_{max} = 8ns \text{ vs earlier: } 8ns$$

~~Execution~~: without R → 4 cycles

|     |   |      |
|-----|---|------|
| LH  | 5 | -10  |
| SW  | 4 | -8   |
| BEQ | 3 | -4   |
| J   | 3 | -6ns |

in curr. M.

$$\text{Time} = IC \times CPI \times t_{max}$$

depends on freq. of opnys  
e.g. R 50%, LH 20%, SW 10%, BEQ 10%, J 10%

But, if -

|                       |                              |
|-----------------------|------------------------------|
| 50%                   | $CPI = 4 \times 0.5 + \dots$ |
| 10%                   | $+ 3 \times 0.1$             |
| 10% $\Rightarrow 3.8$ | (weighted mean) = 4          |
| 20%                   |                              |
| 10%                   |                              |

↓  
performed better  
Time same as single cycle!!

$$IC \times t_{max} = 4 \times 2ns = 8ns$$

Single cycle:  $CPI = 1$  (best)  
multi " :  $t = 2ns$  (~)

If Assume

- all instr's are indep. of each other
- no change in control flow (i.e. top-bottom exec.), then  $CPI = 1$   
(bridge, car assembly, bricks, M)

pipeline archi

Fec 3-4-22.

Pipelined Architecture

Chaining of Instructions

Single Cycle : CPI = 1

Multicycle: T ↓

Subtasks

1. Instruction fetch (IF)

2. Operand read (ID = Direct decode?)

3. Execute (EX)

4. Memory Access (Mem)

5. Write back result (WB)

A. If ID EX Mem WB

B. If ID EX - Mem - WB

C. " " " "

D. " " " "

E. " " " "

Assembly line



Controllers (for control signals)  
(Combinational logic: e.g. K-maps..)

NP

IF ID EX MEM WB



2 cycles wasted if  $R_1 = R_2$

- Can predict: do branch if last time taken (i.e. branched), then branch this time too

| PC  | BTB | HB | History Bit |
|-----|-----|----|-------------|
| 100 | 200 | 1  |             |
| 300 | 400 | 1  |             |

uncond'nat  $\Rightarrow$  always branched

don't need cond'nat

last time branched...

$\rightarrow$



isn't prediction

$\rightarrow$  I2

to fetch this, br to use PC+4  
but we already changed it  
to PC+4 + (IMM16x4) for I20...  
 $\rightarrow$  use this.

## Branch / Smith Predictor



Dependency  $\rightarrow 2$

$$I_1: r_3 = r_2 + r_3$$

$$I_2: r_4 = r_3 + r_4$$

$$I_3: r_7 = r_4 + r_8$$

If  $R_{RD}^{ID} = R_{RF}$  or  $R_{RD}^{ID} = R_{Mem}^{Mem}$



$I_1: r_1 \leftarrow Mem(r_1 + r_2)$  (i.e.  $M[r_1 + r_2]$ ) (Load)

 $r_2 = r_1 + r_3$ 

1

2

3

4  $I_3, I_2 - I_1$

5  $I_1, I_2, I_3 - I_1$

If  $\alpha$  fn load & 0.1  $\alpha$  (load+dependency)

$$\downarrow$$

$$CPI = 1 + \frac{(0.1\alpha) \times 1}{10}$$

Compiler can help:

e.g.  $a = b + c$   
 $d = a + e$

Load  $r_1, b$   
 Load  $r_2, c$   
 Add  $r_3, r_1, r_2$   
 Store  $r_3, d$

Shift it  $\otimes$

now  
 third dep  
 does not cause stalling  
 (cancel data by reading)

k/a Code Reordering

Page No. 2019  
 Date

## # Lec 5-4-22

### Pipelined Architecture (contd.)

216 Chapt.

Assumptions

- ① Independence of Inst's
- ② No change in control flow

I<sub>1</sub>: J loc  
I<sub>2</sub>: ADD R<sub>1</sub>, R<sub>2</sub>, R<sub>3</sub>

I<sub>3</sub>  
I<sub>4</sub>  
I<sub>5</sub>  
I<sub>6</sub>

I<sub>7</sub>  
I<sub>8</sub>  
I<sub>9</sub>  
I<sub>10</sub>

I<sub>11</sub>  
I<sub>12</sub>  
I<sub>13</sub>  
I<sub>14</sub>

I<sub>15</sub>  
I<sub>16</sub>  
I<sub>17</sub>  
I<sub>18</sub>

I<sub>19</sub>  
I<sub>20</sub>  
I<sub>21</sub>  
I<sub>22</sub>

I<sub>23</sub>  
I<sub>24</sub>  
I<sub>25</sub>  
I<sub>26</sub>

I<sub>27</sub>  
I<sub>28</sub>  
I<sub>29</sub>  
I<sub>30</sub>

I<sub>31</sub>  
I<sub>32</sub>  
I<sub>33</sub>  
I<sub>34</sub>

I<sub>35</sub>  
I<sub>36</sub>  
I<sub>37</sub>  
I<sub>38</sub>

I<sub>39</sub>  
I<sub>40</sub>  
I<sub>41</sub>  
I<sub>42</sub>

I<sub>43</sub>  
I<sub>44</sub>  
I<sub>45</sub>  
I<sub>46</sub>

I<sub>47</sub>  
I<sub>48</sub>  
I<sub>49</sub>  
I<sub>50</sub>

I<sub>51</sub>  
I<sub>52</sub>  
I<sub>53</sub>  
I<sub>54</sub>

I<sub>55</sub>  
I<sub>56</sub>  
I<sub>57</sub>  
I<sub>58</sub>

I<sub>59</sub>  
I<sub>60</sub>  
I<sub>61</sub>  
I<sub>62</sub>

I<sub>63</sub>  
I<sub>64</sub>  
I<sub>65</sub>  
I<sub>66</sub>

I<sub>67</sub>  
I<sub>68</sub>  
I<sub>69</sub>  
I<sub>70</sub>

I<sub>71</sub>  
I<sub>72</sub>  
I<sub>73</sub>  
I<sub>74</sub>

I<sub>75</sub>  
I<sub>76</sub>  
I<sub>77</sub>  
I<sub>78</sub>

I<sub>79</sub>  
I<sub>80</sub>  
I<sub>81</sub>  
I<sub>82</sub>

I<sub>83</sub>  
I<sub>84</sub>  
I<sub>85</sub>  
I<sub>86</sub>

I<sub>87</sub>  
I<sub>88</sub>  
I<sub>89</sub>  
I<sub>90</sub>

I<sub>91</sub>  
I<sub>92</sub>  
I<sub>93</sub>  
I<sub>94</sub>

I<sub>95</sub>  
I<sub>96</sub>  
I<sub>97</sub>  
I<sub>98</sub>

I<sub>99</sub>  
I<sub>100</sub>  
I<sub>101</sub>  
I<sub>102</sub>

I<sub>103</sub>  
I<sub>104</sub>  
I<sub>105</sub>  
I<sub>106</sub>

I<sub>107</sub>  
I<sub>108</sub>  
I<sub>109</sub>  
I<sub>110</sub>

I<sub>111</sub>  
I<sub>112</sub>  
I<sub>113</sub>  
I<sub>114</sub>

I<sub>115</sub>  
I<sub>116</sub>  
I<sub>117</sub>  
I<sub>118</sub>

I<sub>119</sub>  
I<sub>120</sub>  
I<sub>121</sub>  
I<sub>122</sub>

I<sub>123</sub>  
I<sub>124</sub>  
I<sub>125</sub>  
I<sub>126</sub>

I<sub>127</sub>  
I<sub>128</sub>  
I<sub>129</sub>  
I<sub>130</sub>

I<sub>131</sub>  
I<sub>132</sub>  
I<sub>133</sub>  
I<sub>134</sub>

I<sub>135</sub>  
I<sub>136</sub>  
I<sub>137</sub>  
I<sub>138</sub>

I<sub>139</sub>  
I<sub>140</sub>  
I<sub>141</sub>  
I<sub>142</sub>

I<sub>143</sub>  
I<sub>144</sub>  
I<sub>145</sub>  
I<sub>146</sub>

I<sub>147</sub>  
I<sub>148</sub>  
I<sub>149</sub>  
I<sub>150</sub>

I<sub>151</sub>  
I<sub>152</sub>  
I<sub>153</sub>  
I<sub>154</sub>

I<sub>155</sub>  
I<sub>156</sub>  
I<sub>157</sub>  
I<sub>158</sub>

I<sub>159</sub>  
I<sub>160</sub>  
I<sub>161</sub>  
I<sub>162</sub>

I<sub>163</sub>  
I<sub>164</sub>  
I<sub>165</sub>  
I<sub>166</sub>

I<sub>167</sub>  
I<sub>168</sub>  
I<sub>169</sub>  
I<sub>170</sub>

I<sub>171</sub>  
I<sub>172</sub>  
I<sub>173</sub>  
I<sub>174</sub>

I<sub>175</sub>  
I<sub>176</sub>  
I<sub>177</sub>  
I<sub>178</sub>

I<sub>179</sub>  
I<sub>180</sub>  
I<sub>181</sub>  
I<sub>182</sub>

I<sub>183</sub>  
I<sub>184</sub>  
I<sub>185</sub>  
I<sub>186</sub>

I<sub>187</sub>  
I<sub>188</sub>  
I<sub>189</sub>  
I<sub>190</sub>

I<sub>191</sub>  
I<sub>192</sub>  
I<sub>193</sub>  
I<sub>194</sub>

I<sub>195</sub>  
I<sub>196</sub>  
I<sub>197</sub>  
I<sub>198</sub>

I<sub>199</sub>  
I<sub>200</sub>  
I<sub>201</sub>  
I<sub>202</sub>

I<sub>203</sub>  
I<sub>204</sub>  
I<sub>205</sub>  
I<sub>206</sub>

I<sub>207</sub>  
I<sub>208</sub>  
I<sub>209</sub>  
I<sub>210</sub>

I<sub>211</sub>  
I<sub>212</sub>  
I<sub>213</sub>  
I<sub>214</sub>

I<sub>215</sub>  
I<sub>216</sub>  
I<sub>217</sub>  
I<sub>218</sub>

I<sub>219</sub>  
I<sub>220</sub>  
I<sub>221</sub>  
I<sub>222</sub>

I<sub>223</sub>  
I<sub>224</sub>  
I<sub>225</sub>  
I<sub>226</sub>

I<sub>227</sub>  
I<sub>228</sub>  
I<sub>229</sub>  
I<sub>230</sub>

I<sub>231</sub>  
I<sub>232</sub>  
I<sub>233</sub>  
I<sub>234</sub>

I<sub>235</sub>  
I<sub>236</sub>  
I<sub>237</sub>  
I<sub>238</sub>

I<sub>239</sub>  
I<sub>240</sub>  
I<sub>241</sub>  
I<sub>242</sub>

I<sub>243</sub>  
I<sub>244</sub>  
I<sub>245</sub>  
I<sub>246</sub>

I<sub>247</sub>  
I<sub>248</sub>  
I<sub>249</sub>  
I<sub>250</sub>

I<sub>251</sub>  
I<sub>252</sub>  
I<sub>253</sub>  
I<sub>254</sub>

I<sub>255</sub>  
I<sub>256</sub>  
I<sub>257</sub>  
I<sub>258</sub>

I<sub>259</sub>  
I<sub>260</sub>  
I<sub>261</sub>  
I<sub>262</sub>

I<sub>263</sub>  
I<sub>264</sub>  
I<sub>265</sub>  
I<sub>266</sub>

I<sub>267</sub>  
I<sub>268</sub>  
I<sub>269</sub>  
I<sub>270</sub>

I<sub>271</sub>  
I<sub>272</sub>  
I<sub>273</sub>  
I<sub>274</sub>

I<sub>275</sub>  
I<sub>276</sub>  
I<sub>277</sub>  
I<sub>278</sub>

I<sub>279</sub>  
I<sub>280</sub>  
I<sub>281</sub>  
I<sub>282</sub>

I<sub>283</sub>  
I<sub>284</sub>  
I<sub>285</sub>  
I<sub>286</sub>

I<sub>287</sub>  
I<sub>288</sub>  
I<sub>289</sub>  
I<sub>290</sub>

I<sub>291</sub>  
I<sub>292</sub>  
I<sub>293</sub>  
I<sub>294</sub>

I<sub>295</sub>  
I<sub>296</sub>  
I<sub>297</sub>  
I<sub>298</sub>

I<sub>299</sub>  
I<sub>300</sub>  
I<sub>301</sub>  
I<sub>302</sub>

I<sub>303</sub>  
I<sub>304</sub>  
I<sub>305</sub>  
I<sub>306</sub>

I<sub>307</sub>  
I<sub>308</sub>  
I<sub>309</sub>  
I<sub>310</sub>

I<sub>311</sub>  
I<sub>312</sub>  
I<sub>313</sub>  
I<sub>314</sub>

I<sub>315</sub>  
I<sub>316</sub>  
I<sub>317</sub>  
I<sub>318</sub>

I<sub>319</sub>  
I<sub>320</sub>  
I<sub>321</sub>  
I<sub>322</sub>

I<sub>323</sub>  
I<sub>324</sub>  
I<sub>325</sub>  
I<sub>326</sub>

I<sub>327</sub>  
I<sub>328</sub>  
I<sub>329</sub>  
I<sub>330</sub>

I<sub>331</sub>  
I<sub>332</sub>  
I<sub>333</sub>  
I<sub>334</sub>

I<sub>335</sub>  
I<sub>336</sub>  
I<sub>337</sub>  
I<sub>338</sub>

I<sub>339</sub>  
I<sub>340</sub>  
I<sub>341</sub>  
I<sub>342</sub>

I<sub>343</sub>  
I<sub>344</sub>  
I<sub>345</sub>  
I<sub>346</sub>

I<sub>347</sub>  
I<sub>348</sub>  
I<sub>349</sub>  
I<sub>350</sub>

I<sub>351</sub>  
I<sub>352</sub>  
I<sub>353</sub>  
I<sub>354</sub>

I<sub>355</sub>  
I<sub>356</sub>  
I<sub>357</sub>  
I<sub>358</sub>

I<sub>359</sub>  
I<sub>360</sub>  
I<sub>361</sub>  
I<sub>362</sub>

I<sub>363</sub>  
I<sub>364</sub>  
I<sub>365</sub>  
I<sub>366</sub>

I<sub>367</sub>  
I<sub>368</sub>  
I<sub>369</sub>  
I<sub>370</sub>

I<sub>371</sub>  
I<sub>372</sub>  
I<sub>373</sub>  
I<sub>374</sub>

I<sub>375</sub>  
I<sub>376</sub>  
I<sub>377</sub>  
I<sub>378</sub>

I<sub>379</sub>  
I<sub>380</sub>  
I<sub>381</sub>  
I<sub>382</sub>

I<sub>383</sub>  
I<sub>384</sub>  
I<sub>385</sub>  
I<sub>386</sub>

I<sub>387</sub>  
I<sub>388</sub>  
I<sub>389</sub>  
I<sub>390</sub>

I<sub>391</sub>  
I<sub>392</sub>  
I<sub>393</sub>  
I<sub>394</sub>

I<sub>395</sub>  
I<sub>396</sub>  
I<sub>397</sub>  
I<sub>398</sub>

I<sub>399</sub>  
I<sub>400</sub>  
I<sub>401</sub>  
I<sub>402</sub>

I<sub>403</sub>  
I<sub>404</sub>  
I<sub>405</sub>  
I<sub>406</sub>

I<sub>407</sub>  
I<sub>408</sub>  
I<sub>409</sub>  
I<sub>410</sub>

I<sub>411</sub>  
I<sub>412</sub>  
I<sub>413</sub>  
I<sub>414</sub>

I<sub>415</sub>  
I<sub>416</sub>  
I<sub>417</sub>  
I<sub>418</sub>

I<sub>419</sub>  
I<sub>420</sub>  
I<sub>421</sub>  
I<sub>422</sub>

I<sub>423</sub>  
I<sub>424</sub>  
I<sub>425</sub>  
I<sub>426</sub>

I<sub>427</sub>  
I<sub>428</sub>  
I<sub>429</sub>  
I<sub>430</sub>

I<sub>431</sub>  
I<sub>432</sub>  
I<sub>433</sub>  
I<sub>434</sub>

I<sub>435</sub>  
I<sub>436</sub>  
I<sub>437</sub>  
I<sub>438</sub>

I<sub>439</sub>  
I<sub>440</sub>  
I<sub>441</sub>  
I<sub>442</sub>

I<sub>443</sub>  
I<sub>444</sub>  
I<sub>445</sub>  
I<sub>446</sub>

I<sub>447</sub>  
I<sub>448</sub>  
I<sub>449</sub>  
I<sub>450</sub>

I<sub>451</sub>  
I<sub>452</sub>  
I<sub>453</sub>  
I<sub>454</sub>

I<sub>455</sub>  
I<sub>456</sub>  
I<sub>457</sub>  
I<sub>458</sub>

I<sub>459</sub>  
I<sub>460</sub>  
I<sub>461</sub>  
I<sub>462</sub>

I<sub>463</sub>  
I<sub>464</sub>  
I<sub>465</sub>  
I<sub>466</sub>

I<sub>467</sub>  
I<sub>468</sub>  
I<sub>469</sub>  
I<sub>470</sub>

I<sub>471</sub>  
I<sub>472</sub>  
I<sub>473</sub>  
I<sub>474</sub>

I<sub>475</sub>  
I<sub>476</sub>  
I<sub>477</sub>  
I<sub>478</sub>

I<sub>479</sub>  
I<sub>480</sub>  
I<sub>481</sub>  
I<sub>482</sub>

I<sub>483</sub>  
I<sub>484</sub>  
I<sub>485</sub>  
I<sub>486</sub>

I<sub>487</sub>  
I<sub>488</sub>  
I<sub>489</sub>  
I<sub>490</sub>

I<sub>491</sub>  
I<sub>492</sub>  
I<sub>493</sub>  
I<sub>494</sub>

I<sub>495</sub>  
I<sub>496</sub>  
I<sub>497</sub>  
I<sub>498</sub>

I<sub>499</sub>  
I<sub>500</sub>  
I<sub>501</sub>  
I<sub>502</sub>

I<sub>503</sub>  
I<sub>504</sub>  
I<sub>505</sub>  
I<sub>506</sub>

I<sub>507</sub>  
I<sub>508</sub>  
I<sub>509</sub>  
I<sub>510</sub>

I<sub>511</sub>  
I<sub>512</sub>  
I<sub>513</sub>  
I<sub>514</sub>

I<sub>515</sub>  
I<sub>516</sub>  
I<sub>517</sub>  
I<sub>518</sub>

I<sub>519</sub>  
I<sub>520</sub>  
I<sub>521</sub>  
I<sub>522</sub>

I<sub>523</sub>  
I<sub>524</sub>  
I<sub>525</sub>  
I<sub>526</sub>

I<sub>527</sub>  
I<sub>528</sub>  
I<sub>529</sub>  
I<sub>530</sub>

I<sub>531</sub>  
I<sub>532</sub>  
I<sub>533</sub>  
I<sub>534</sub>

I<sub>535</sub>  
I<sub>536</sub>  
I<sub>537</sub>  
I<sub>538</sub>

I<sub>539</sub>  
I<sub>540</sub>  
I<sub>541</sub>  
I<sub>542</sub>

I<sub>543</sub>  
I<sub>544</sub>  
I<sub>545</sub>  
I<sub>546</sub>

I<sub>547</sub>  
I<sub>548</sub>  
I<sub>549</sub>  
I<sub>550</sub>

I<sub>551</sub>  
I<sub>552</sub>  
I<sub>553</sub>  
I<sub>554</sub>

I<sub>555</sub>  
I<sub>556</sub>  
I<sub>557</sub>  
I<sub>558</sub>

I<sub>559</sub>  
I<sub>560</sub>  
I<sub>561</sub>  
I<sub>562</sub>

I<sub>563</sub>  
I<sub>564</sub>  
I<sub>565</sub>  
I<sub>566</sub>

I<sub>567</sub>  
I<sub>568</sub>  
I<sub>569</sub>  
I<sub>570</sub>

I<sub>571</sub>  
I<sub>572</sub>  
I<sub>573</sub>  
I<sub>574</sub>

I<sub>575</sub>  
I<sub>576</sub>  
I<sub>577</sub>  
I<sub>578</sub>

I<sub>579</sub>  
I<sub>580</sub>  
I<sub>581</sub>  
I<sub>582</sub>

I<sub>583</sub>  
I<sub>584</sub>  
I<sub>585</sub>  
I<sub>586</sub>

I<sub>587</sub>  
I<sub>588</sub>  
I<sub>589</sub>  
I<sub>590</sub>

I<sub>591</sub>  
I<sub>592</sub>  
I<sub>593</sub>  
I<sub>594</sub>

I<sub>595</sub>  
I<sub>596</sub>  
I<sub>597</sub>  
I<sub>598</sub>

I<sub>599</sub>  
I<sub>600</sub>  
I<sub>601</sub>  
I<sub>602</sub>

I<sub>603</sub>  
I<sub>604</sub>  
I<sub>605</sub>  
I<sub>606</sub>

I<sub>607</sub>  
I<sub>608</sub>  
I<sub>609</sub>  
I<sub>610</sub>

I<sub>611</sub>  
I<sub>612</sub>  
I<sub>613</sub>  
I<sub>614</sub>

I<sub>615</sub>  
I<sub>616</sub>  
I<sub>617</sub>  
I<sub>618</sub>

I<sub>619</sub>  
I<sub>620</sub>  
I<sub>621</sub>  
I<sub>622</sub>

I<sub>623</sub>  
I<sub>624</sub>  
I<sub>625</sub>  
I<sub>626</sub>

I<sub>627</sub>  
I<sub>628</sub>  
I<sub>629</sub>  
I<sub>630</sub>

I<sub>631</sub>  
I<sub>632</sub>  
I<sub>633</sub>  
I<sub>634</sub>

I<sub>635</sub>  
I<sub>636</sub>  
I<sub>637</sub>  
I<sub>638</sub>

I<sub>639</sub>  
I<sub>640</sub>  
I<sub>641</sub>  
I<sub>642</sub>

I<sub>643</sub>  
I<sub>644</sub>  
I<sub>645</sub>  
I<sub>646</sub>

I<sub>647</sub>  
I<sub>648</sub>  
I<sub>649</sub>  
I<sub>650</sub>

I<sub>651</sub>  
I<sub>652</sub>  
I<sub>653</sub>  
I<sub>654</sub>

I<sub>655</sub>  
I<sub>656</sub>  
I<sub>657</sub>  
I<sub>658</sub>

I<sub>659</sub>  
I<sub>660</sub>  
I<sub>661</sub>  
I<sub>662</sub>

I<sub>663</sub>  
I<sub>664</sub>  
I<sub>665</sub>  
I<sub>666</sub>

I<sub>667</sub>  
I<sub>668</sub>  
I<sub>669</sub>  
I<sub>670</sub>

I<sub>671</sub>  
I<sub>672</sub>  
I<sub>673</sub>  
I<sub>674</sub>

I<sub>675</sub>  
I<sub>676</sub>  
I<sub>677</sub>  
I<sub>678</sub>

I<sub>679</sub>  
I<sub>680</sub>  
I<sub>681</sub>  
I<sub>682</sub>

I<sub>683</sub>  
I<sub>684</sub>  
I<sub>685</sub>  
I<sub>686</sub>

I<sub>687</sub>  
I<sub>688</sub>  
I<sub>689</sub>  
I<sub>690</sub>

I<sub>691</sub>  
I<sub>692</sub>  
I<sub>693</sub>  
I<sub>694</sub>

I<sub>695</sub>  
I<sub>696</sub>  
I<sub>697</sub>  
I<sub>698</sub>

I<sub>699</sub>  
I<sub>700</sub>  
I<sub>701</sub>  
I<sub>702</sub>

I<sub>703</sub>  
I<sub>704</sub>  
I<sub>705</sub>  
I<sub>706</sub>

I<sub>707</sub>  
I<sub>708</sub>  
I<sub>709</sub>  
I<sub>710</sub>

I<sub>711</sub>  
I<sub>712</sub>  
I<sub>713</sub>  
I<sub>714</sub>

I<sub>715</sub>  
I<sub>716</sub>  
I<sub>717</sub>  
I<sub>718</sub>

I<sub>719</sub>  
I<sub>720</sub>  
I<sub>721</sub>  
I<sub>722</sub>

I<sub>723</sub>  
I<sub>724</sub>  
I<sub>725</sub>  
I<sub>726</sub>

I<sub>727</sub>  
I<sub>728</sub>  
I<sub>729</sub>  
I<sub>730</sub>

I<sub>731</sub>  
I<sub>732</sub>  
I<sub>733</sub>  
I<sub>734</sub>

I<sub>735</sub>  
I<sub>736</sub>  
I<sub>737</sub>  
I<sub>738</sub>

I<sub>739</sub>  
I<sub>740</sub>  
I<sub>741</sub>  
I<sub>742</sub>

I<sub>743</sub>  
I<sub>744</sub>  
I<sub>745</sub>  
I<sub>746</sub>

I<sub>747</sub>  
I<sub>748</sub>  
I<sub>749</sub>  
I<sub>750</sub>

I<sub>751</sub>  
I<sub>752</sub>  
I<sub>753</sub>  
I<sub>754</sub>

I<sub>755</sub>  
I<sub>756</sub>  
I<sub>757</sub>  
I<sub>758</sub>

I<sub>759</sub>  
I<sub>760</sub>  
I<sub>761</sub>  
I<sub>762</sub>

I<sub>763</sub>  
I<sub>764</sub>  
I<sub>765</sub>  
I<sub>766</sub>

I<sub>767</sub>  
I<sub>768</sub>  
I<sub>769</sub>  
I<sub>770</sub>

I<sub>771</sub>  
I<sub>772</sub>  
I<sub>773</sub>  
I<sub>774</sub>

I<sub>775</sub>  
I<sub>776</sub>  
I<sub>777</sub>  
I<sub>778</sub>

I<sub>779</sub>  
I<sub>780</sub>  
I<sub>781</sub>  
I<sub>782</sub>

I<sub>783</sub>  
I<sub>784</sub>  
I<sub>785</sub>  
I<sub>786</sub>

I<sub>787</sub>  
I<sub>788</sub>  
I<sub>789</sub>  
I<sub>790</sub>

I<sub>791</sub>  
I<sub>792</sub>  
I<sub>793</sub>  
I<sub>794</sub>

I<sub>795</sub>  
I<sub>796</sub>  
I<sub>797</sub>  
I<sub>798</sub>

I<sub>799</sub>  
I<sub>800</sub>  
I<sub>801</sub>  
I<sub>802</sub>

I<sub>803</sub>  
I<sub>804</sub>  
I<sub>805</sub>  
I<sub>806</sub>

I<sub>807</sub>  
I<sub>808</sub>  
I<sub>809</sub>  
I<sub>810</sub>

I<sub>811</sub>  
I<sub>812</sub>  
I<sub>813</sub>  
I<sub>814</sub>

I<sub>815</sub>  
I<sub>816</sub>  
I<sub>817</sub>  
I<sub>818</sub>

I<sub>819</sub>  
I<sub>820</sub>  
I<sub>821</sub>  
I<sub>822</sub>

I<sub>823</sub>  
I<sub>824</sub>  
I<sub>825</sub>  
I<sub>826</sub>

I<sub>827</sub>  
I<sub>828</sub>  
I<sub>829</sub>  
I<sub>830</sub>

I<sub>831</sub>  
I<sub>832</sub>  
I<sub>833</sub>  
I<sub>834</sub>

I<sub>835</sub>  
I<sub>836</sub>  
I<sub>837</sub>  
I<sub>838</sub>

I<sub>839</sub>  
I<sub>840</sub>  
I<sub>841</sub>  
I<sub>842</sub>

I<sub>843</sub>  
I<sub>844</sub>  
I<sub>845</sub>  
I<sub>846</sub>

I<sub>847</sub>  
I<sub>848</sub>  
I<sub>849</sub>  
I<sub>850</sub>

I<sub>851</sub>  
I<sub>852</sub>  
I<sub>853</sub>  
I<sub>854</sub>

I<sub>855</sub>  
I<sub>856</sub>  
I<sub>857</sub>  
I<sub>858</sub>

I<sub>859</sub>  
I<sub>860</sub>  
I<sub>861</sub>  
I<sub>862</sub>

I<sub>863</sub>  
I<sub>864</sub>  
I<sub>865</sub>  
I<sub>866</sub>

I<sub>867</sub>  
I<sub>868</sub>  
I<sub>869</sub>  
I<sub>870</sub>

I<sub>871</sub>  
I<sub>872</sub>  
I<sub>873</sub>  
I<sub>874</sub>

I<sub>875</sub>  
I<sub>876</sub>  
I<sub>877</sub>  
I<sub>878</sub>

I<sub>879</sub>  
I<sub>880</sub>  
I<sub>881</sub>  
I<sub>882</sub>

I<sub>883</sub>  
I<sub>884</sub>  
I<sub>885</sub>  
I<sub>886</sub>

I<sub>887</sub>  
I<sub>888</sub>  
I<sub>889</sub>  
I<sub>890</sub>

I<sub>891</sub>  
I<sub>892</sub>  
I<sub>893</sub>  
I<sub>894</sub>

I<sub>895</sub>  
I<sub>896</sub>  
I<sub>897</sub>  
I<sub>898</sub>

I<sub

if no indep instr found  $\rightarrow$  put NOP (no op) after branch  
 DLX / MIPS / SPARC :  $\xrightarrow{\text{Branch Delay Slot}}$

So, if  $r_1 = r_0 + r_2$   
 $\text{beq } r_4, r_5, \text{skip}$   $\xrightarrow{\text{shuffle}}$   
 to compute skip addr,  
 do  $\text{PC} + 4 + 3MM \times 4$   
 then "next" addr.  
 $\therefore$  programmer coded well  
 $r_1 = \dots$   
 $\text{beq} \dots$   
 but addrs not  
 $\text{beq} \dots$   
 $r_1 = \dots$



$$\frac{\text{Data}}{\text{Second}} = \frac{\text{Data}}{\text{Instn}} \times \frac{\text{Instn's}}{\text{Cycle}} \times \frac{\text{Cycle}}{\text{Second}}$$

BW Bandwidth  
 $4B$  (32 bits)  
 $\xrightarrow{1}$  (CPI) $^{-1}$   
 $\xrightarrow{10^9}$   
 $\xrightarrow{\text{max IPC}}$   
 In case 20% load/store  
 $(4 + 0.2 \times 4)B$

$$= 12 \text{ Gbps}$$

$$= 36 \text{ Gbps}$$

→ 4-8x3 Gbps  
 c.g. core #7  
 4 cores, each takes 16B  
 $\downarrow$   
 $4.8 \times 3 \times 16 \text{ Gbps}$   $\xrightarrow{\text{this kind of BW needed}}$

# for ( $i=0; i < 1000; i++$ ) {

$I_1$  :  $\xrightarrow{\text{(Locality of reference)}}$

$I_2$  :  $\xrightarrow{\text{16B boundary}}$

$I_3$  :  $\xrightarrow{\text{another step (shuf + assembly line)}}$

    time overlapped by  $I_1$   
 $\xrightarrow{200+1}$   $\xrightarrow{1+1}$   $\xrightarrow{1+1}$   $\xrightarrow{1+1}$

$\downarrow$   $\xrightarrow{204}$

    Next time  
 $\xrightarrow{1+1}$   $\xrightarrow{1+1}$   $\xrightarrow{1+1}$   $\xrightarrow{1+1}$

$\downarrow$   $\xrightarrow{4}$  overlapped by next...

$\downarrow$   $\xrightarrow{204+4 \times 999}$

otherwise:  $\frac{200 \times 4 \times 1000}{4000} \xrightarrow{400}$

CPI:  
 $\frac{4200}{4000} = 1.05$

```
# for(i=0; i<1000; i++) {
    if (i%2 == 0) {
        I20
        I21
    }
}
```

```
else {
    I0
    I1
}
```

3



→ do 7 : separate chunks. (8Bytes  $\frac{1}{2}$  here)



$$CPI = \frac{201 + 201 + 2 + \dots + 2}{2000} = \frac{402 + 2 \times 999}{2000}$$

$$= \frac{399 + 4 + 2 \times 998}{2000}$$

$$= \frac{2398}{2000} \approx 1.2 \text{ (close to 1)}$$

⇒ bring instr's w/ 2 branches.



32 kB cache mem.

Questions ↑ :



1. Where to place?

2. How to identify? — base I<sub>21</sub> etc.

3. When to evict? — after memory access

4. How to handle write? — (data in cache, memory)

1024 B  
Assume 16B memory & 16B chunk/Blocks





but

e.g.  $I_0 - I_1$ ,  $I_{20} - I_{21}$  (example)

same chunk  $\leftarrow B_1$   $\leftarrow B_5$

$\Rightarrow$  performance ↓ +. ( $\because$  will kick-out the other). if hit bring  $B_1$ , pending  $B_5$ .

So, allow to go anywhere on the 4 chunks

now hv to use 6 bits to identify  
TAG (6 bits).

But now need to match 6 bits.



#### 4 questions 229

- ①: mis(ion)  
Problem: e.g. for ( )  
if ( $i_2 = 0$ )  
 $B_1 \leftarrow \{I_0, \dots\}$

$\leftarrow$  else  $I_{20} \leftarrow \{I_0, \dots\}$

then after every iteration  $CBAQ$  content replaced.

②: M2; e.g. 4GB RAM, 64B blocks, 32KB cache  $\rightarrow$  512 things to compare in II.

"Fully Asynchronous"

③: Middle Way: make sets of chunks of Cache Mem.  
& direct mapping to sets, but inside set, can be placed anywhere.



$\hookrightarrow$  1 bit reg. (LSB = 1/2)  
& 5 bits to identify within set.  
(TAG)

# Lec 6-4-92

## # Branch Predictor



e.g. of loop

→ 1110 2 bits (History)

use MCB for pred" ( $O_1 : \frac{NT}{T}$ ) not taken  
taken



$\Rightarrow$  Now only 1 wrong decision (at exit)  
( $\neq$  time of all: 3 wrong decisions)

2 bits :  $3 + 999 \times 1$  out of  $10 \times 1000 \rightarrow 85-90\%$  accuracy  
 Single bit : ~80-85%.

Example  $\chi^2_{n-1}$ , 90% accuracy

$$\downarrow$$

$$CPI = 1 + \alpha \times \frac{1}{10} \times 2$$

$$= (1 + 0.2\alpha)$$

$$\text{if } \alpha = 0.2 \rightarrow \frac{1+0.04}{\cancel{1.04}} = \underline{\underline{1.04}}$$

# Correlation b/w branches

( $\leftarrow$  2-bit branch predictor considering all branches as independent)

$$\# \tau_1 = r_1 + r_2$$

12 : beg.  $r_4, r_5$ , skip

$$I_3 : \lambda_7 = \lambda_8 + h_9$$

Can put a rule in ISA:  
an instr<sup>n</sup> is always executed  
after branch.  
↓

skip:  $r_7 = r_6 + h_7$

125

can do : reshuffle  $I_2$   
must be indep of branch  
in reg A...  
↓  
1 cycle saved

Even if  $f = 100\text{ Hz}$  (bit rate), then CPI  $\sim 1.15$   
data dep, control flow



IF1 - IF2 - ID - EX - MEM1 - MEM2 - WB

$$\# \quad \text{IPC} = \frac{1}{\text{CPI}}$$

Inst's per cycle

CPI  $\approx 1$  (1.3 / 1.4) for pipelined.

\* can pick instr's from multiple Src's  $\Rightarrow$  independent



✓ CPI = 5 for each src  
But combined, CPI = 1

Context (each process has)

1. Memory (Core segment/Data segment/Stack segment)
2. RF
3. PC
4. Files

Need multiple PCs



& multiple RFs



# Blocks in a set: determined by how many concurrent matches can be done  
e.g. 5 bits in prev ex.

Qs 1, 2 done

Q3: Whom to Evict? [no question in Direct Mapping, bcoz e.g. [B1 B5 B9...]  
→ kick out existing.]

- can do: remove blocks by FIFO

  
But not an efficient M, bcoz B1 might be used...

- LRU: least recently used

e.g. 

Now B5 used → do 

then B1 → [B1 B5 B20 B7] →  
then B21 → [B21 B1 B5 B20]

then :

hence to maintain queues for each set (for set associative mapping)

hence to do lot of work.

approx'g: FIFO BUT NOT MRU

Q4: How to handle writers:

- different images in cache & memory
- single processor → first see of int in Cache (multi-hw?)

If hit in system  
the miss → read from memory

to signify that a CB is modified → maintain 1 bit



- if modified, then if evicting,  
write back to memory

→ can use a temp. loc.  
& then write back from it (to save time)

but memory traffic ↑

Write-back policy

- Write-through policy

→ while modifying cache → write into memory too  
(✓ also for multiple processors). ↓

while evicting, no dirty bit reg...

Q5: How much gains?



hit rate: h (~90-95%)

misses": m = 1 - h

$$t_A = t_C + m \cdot t_M$$

: always cache is accessed at first

$$= 1 + 0.1 \times 200 = 21$$

✓ for 2-level cache

$$t_A = t_{L1} + m_1 t_{L2} + m_1 m_2 t_M$$

$$= 1 + 0.1 \times 10 + 0.01 \times 200$$

$$= \boxed{4}$$

10 cycles to access L2 cache

- 3-level cache 32 KB 512 KB 2 MB

1 10 50

↓ ( $t_{L_i} = 10, i=1,2,3$ )

$$CPI = \cancel{1} + t_{L1} + m_1 t_{L2} + m_1 m_2 t_{L3} + m_1 m_2 m_3 t_M$$

$$= 1 + 0.1 \times 10 + 0.1 \times 0.1 \times 50$$

$$+ (0.1)^3 \times 200$$

$$= 1 + 1 + 0.5 + 0.2$$

$$= \boxed{2.7}$$

(diminishing returns!)

Ques: (Ghz, 1GHz) -> Single cycle  
Main access

$$M_2 = 3 \times 10^9 + 2 \times 32 + 25 < 2^2$$

$t_{L2} = 2$

$t_M = 1$

$t_A = 1 + 10 + 2 + 1 = 14$

Sec 11-4-21

Hierarchy of Memory



$$t_A = t_{L1} + m_1 t_{L2} + m_1 m_2 t_{L3} + m_1 m_2 m_3 t_M$$

3C-model

1: Compulsory Miss (cold start) (~ first time brought the block)

2: Capacity miss

3: Conflict miss — max in the direct mapping, min in fully assoc.

further capacity miss  
(cannot afford edge detection)



# Dec. 12-4-22

Multiple RFs for : e.g.  $a'_1 = b'_1 + b'_2$   
and  $b'_2 = b'_2 + b'_3$

To improve CPI of individual src,b 112e  
e.g. matrices.

- Threads (theoretically,  $10^4$  for  $B$ ) :: indep
- execute same instr's : 100 mult + 99 add's
- private PC, PAF, STACK, A
- Actv. RF
- SRA (from ZF, G, EX, MEM, WB)

gives CPI  $\approx 1$  for the matrix ex.

### Parallel Programming

e.g.  $\text{for } (i=0; i<1000; i++)$  → pThread  
 $\quad \quad \quad A(i) = B(i) + C(i)$  ←  
 $\quad \quad \quad X A(i) = A(i-1) + B(i)$  → threads not indep.  
 Dependency

Logical cores (my: 2x Physical cores)  
 can provide indep. Stream of instr's.

- Example:



so  $A \cdot B \Rightarrow$  store  $B$  as  $B^T$  to prevent this.

Graphs: CSR/OSC

To low CPI < 1 : Fetch multiple instr's at once (use multiple pipes)

| TF  | ZP  |
|-----|-----|
| ID  | ID  |
| EX  | EX  |
| MEM | MEM |
| WB  | WB  |

✓ if indep. instr's  
BUT ↗

- I<sub>1</sub>:  $a = b + c$
- I<sub>2</sub>:  $d = a + e$
- I<sub>3</sub>:  $p = q + r$
- I<sub>4</sub>:  $s = p + t$
- I<sub>5</sub>:  $i = j + k$

Data flow / control graphs: DFG



Theoretically, I<sub>1</sub>, I<sub>3</sub>, I<sub>5</sub> in || → "1 cycle" (just three)  
 Then I<sub>2</sub>, I<sub>4</sub> " " → 2 cycles  
 $\Rightarrow I_1, I_3, I_5$   
 $I_2$   
 $I_4$

$$IPC = 5/2 = 2.5 (> 1!)$$

To produce the DFG.



### Superscalar

Paradigm shift: Program ~~exec~~ "guided exec" (PC, ...)  
to  
Data guided exec.

Issues :- debugging difficult. (e.g. just a shred has been updated, but in 1 cycle, a, p, i updated)

- use temporary storage (don't write p, i into RF (so that program can debug), but store in temp ✓)



SMT, P4 (Hyperthreading)  
(can use different inst streams)

Page No. 241  
Date

Size, Associativity  
of Cache.