

# Syllabus

May 3, 2022 9:01 PM

Endsem planning

Computer Architecture

CA Basics (1-6)

MIPS ISA (7-10)



Datapath design (11-14)

Pipelining (15-19)

Memory hierarchy + Caching (19-22)

Data level parallelism

Advanced CA

Shared + Gate

smashers

①

③

Shared



# TODO

May 3, 2022 11:20 PM

Syllabus:

- ★ 1) Memory Interfacing
  - Mem. Hierarchy
  - 2 level, 3 level
  - Cache Mapping
  - Replacement
  - Secondary Mem.
- 2) I/O Interfacing
  - Memory Mapped, Isolated
  - Data Transfer Modes
  - Programmed I/O, Interrupt driven, DMA, Synchronous/Asynchronous.
  - Serial / Parallel

- 3) Machine Instruction & Addressing Modes, Interrupt
- 4) Control Unit Design → Handwired, Microprogrammed  
Timing signals, Microinstruction format

- 5) ALU and Data Path
- 6) Number System and Conversions
- 7) Data Representation
  - fixed
  - floating point
- 8) Pipelining
  - Synchronous
  - Asynchronous
  - Instruction Pipe.
  - etc.

Qx3

++

# Basics

March 10, 2022 2:37 PM

- Computer architech time
- Moore Law  
x 2 transistors, 2 yrs
- Execution time

$$T = I \times CPI \times T$$

↓      ↓      ↓  
Instruction    cycles    clock  
count           per       period

- Datapath

$$\text{Speedup} = \frac{\text{perf}_A}{\text{perf}_B} = \frac{T_B}{T_A}$$

A | B



$$T_f = \frac{E \left( \frac{I}{S} \right) + (1-E) T}{F + (1-F)}$$

$$T_f = t(\bar{s}) \cancel{\rightarrow} = E/s + (1-E)$$

- $S \rightarrow \infty \propto$  Speedup

$$T' = \frac{1}{(1-E)} \rightarrow \text{limited by unspeed part}$$

- Amdeahl's Law (fixed problem size)



Enhanced



- Gustafson Law

$w \rightarrow$  workload,  $P \propto$

$w \rightarrow \underline{\text{workload}}, P \}$



(same execution time)

$$\text{Speedup} = (1-P) + SP$$



• other  $\rightarrow$  FLOPS

• parallel processing

• pipelining

? 10



- Derived concurrency

IF → Fetch

ID → Decode

EX - Execute

MEM - Mem op

WB - Write Back

5 steps

- Power



Dynamical  
(operations)

Static

(leakage  
current)



$$P_{dynam} = A C V^2 f$$

• independent

- $P_{dyn} = AC\sqrt{f}$
- o independent of  $f$

- o  $\propto$  Temp

- $P = \sqrt{I_{leak}}$

- $P_T = P_{Stat} + P_{dyn}$
- $= V^T I_{leak} + AC\sqrt{2}f$

$$f_{max} \propto [V - V_{th}]$$

- $E = \int_0^T P_{avg} T = \underline{\underline{P_0 T}}$

- Reduction



own  
unused  
parts

switching

data movement /

- $f = 500 \times 10^6 \text{ Hz}$

$$P_{\text{dy}} = 80 \text{ W} = A c v^2 f$$

$$P_{\text{stat}} = V I_{\text{leak.}} = 10 \text{ kJ}$$

$$E = 540 \text{ kJ}, \quad T$$

$$T = \frac{540}{90} = 6 \text{ sec}$$

$$P'_{\text{stat}} = 0.9 P_{\text{stat}}$$

$$\underbrace{f \rightarrow f/2}_{T \rightarrow T \times 2}, \quad P = P_{\text{dy}} + P_s$$

$$= \cancel{\frac{P_{\text{dy}}}{2}} + \frac{9 P_s}{10}$$

$$E = 49 \times \frac{P}{40} + 9$$

$$P'_f = 49 \text{ kW}$$

$$r_f = 49 \text{ PW}$$

## MIPS Theory

### ISA → Instruction set Architecture

- Introduction to ISA
- MIPS ISA: A design example
  - Instruction format
    - R-format, I-format, and J-format
  - Addressing modes
    - Register addressing
    - Indirect addressing
    - Immediate addressing
    - Base addressing
    - PC-relative addressing
    - Pseudo-direct addressing
  - Functionality
    - ALU instructions
    - Data transfer instructions
    - Control instructions
- MIPS procedure call
- Design issues and relative advantages of RISC and CISC

### MIPS (Microprocessor without interlocked pipeline)

### RISC Arch (Reduced instruction set computer)





- 32 Register
- Program counter (PC)
  - ↳ next ins
- Accumulator (Acc)
- ④
  - Each register has 32-bit (1 word) - 4 Bytes (1 byte = 8 bits)

| Name      | Register number | Usage                                          |
|-----------|-----------------|------------------------------------------------|
| \$zero    | 0               | the constant value 0                           |
| \$at      | 1               | reserved for assembler                         |
| \$v0-\$v1 | 2-3             | return values of functions and expression eval |
| \$a0-\$a3 | 4-7             | function arguments/parameters                  |
| \$t0-\$t7 | 8-15            | temporaries, caller saved register             |
| \$s0-\$s7 | 16-23           | saved values, calle saved registers            |
| \$t8-\$t9 | 24-25           | more temporaries                               |
| \$k0-\$k1 | 26-27           | Kernel use : reserved for OS                   |
| \$gp      | 28              | global pointer                                 |
| \$sp      | 29              | stack pointer                                  |
| \$fp      | 30              | frame pointer                                  |
| \$ra      | 31              | return address                                 |

1 byte = 8 bits  
1 word = 4 bytes  
= 32 bits



Temporaries are not preserved across procedure calls,  
while saved values are preserved across procedure calls.

- 
- CS 211 - Computer Architecture
- 1
- word → 4bytes
  - MIPS instruction
  - format



R      I      J  
register    Imm    Jump  
               val

- **R type**
- add      \$t<sub>0</sub>, \$t<sub>1</sub>, \$t<sub>2</sub>  
 ALU      G



- **Imm**
- add      \$t<sub>0</sub>, \$t<sub>1</sub>, 1  
 $(t_0 = t_1 + 1)$

- lw, sw      array  
 lw      \$t<sub>1</sub>, p(\$s)

lw \$t\_1 , <sub>UL TO U</sub>  
offset  
 by multiple  
 of 4

- beq → branch equal to
- beqz → branch equal to zero
- J type (Jump type)
- Syscall
- A → 

len → \$s<sub>2</sub>  
 longest , \$t<sub>0</sub> ✓

- la , \$v<sub>2</sub> , 4  
 ↳ Load address  
 ↳ n , immediate

- ↳ load immediate
- move \$f,\$t<sub>2</sub>  
 $f_1 \rightarrow t_2$
- function / sub routines /  
 methods / procedure (same)
- fun(\$a<sub>0</sub>, \$a<sub>1</sub>, \$a<sub>2</sub>, \$a<sub>3</sub>)  
 $\downarrow$   
 return \$v<sub>0</sub> → return (v<sub>0</sub>)  
 ↳ arguments (\$a<sub>0</sub>-a<sub>3</sub>)

# Datapath Design

March 10, 2022 3:54 PM

- MUX



Datapath ✓

4:1

(Flow of info)

address

through

bus

- Decoder (Reverse of Mux)

- D-FF (1bit)



| Ip | Out |
|----|-----|
| 0  | 0   |
| 1  | 1   |

ak(x) → Memory

- add \$f<sub>0</sub>, \$f<sub>1</sub>, \$f<sub>2</sub>

RA





- Register (Register working)
- ALU
- PC / IR (Instruction Register)  
(Program counter)      ↳ current instruction

Next instruction address

$$PC = IR + 4$$

- Datapath (R Type - Register addressing)
  - ~~Slides~~
  - Data in Register
  - Waiting / Reading op  $\rightarrow$  Registers

- add \$t<sub>0</sub>, \$t<sub>1</sub>, \$t<sub>2</sub> ( $t_0 = t_1 + t_2$ )
  - write
  - read register file



- Datapath (1 type - Base addressing)

- lw uses longest path
- Min clock period (single cycle Datapath)

| Class    | Fetch | Decode | ALU | Memory | Write Back | Total |
|----------|-------|--------|-----|--------|------------|-------|
| R-format | 2     | 1      | 2   | 0      | 1          | 6ns   |
| LW       | 2     | 1      | 2   | 2      | 1          | 8ns   |
| SW       | 2     | 1      | 2   | 2      |            | 7ns   |
| Branch   | 2     | 1      | 2   |        |            | 5ns   |
| Jump     | 2     |        |     |        |            | 2ns   |

Consider the frequency of operation as shown in table below. What will be the average time per instruction?

| Instruction | Frequency |
|-------------|-----------|
| Arithmetic  | 50%       |
| Loads       | 20%       |
| Stores      | 10%       |
| Branches    | 20%       |

$N$  instructions

$$\text{Total time} = 8N$$

$$\text{Actual usage time} = \frac{6 \times 0.5}{+ 7 \times 0.4 + 5 \times 0.2}$$

$$= 6 \cdot 3 N$$

~~5.6  
+ 0.8~~

Idle time =  $8N - 6 \cdot 3 N = 1.7 N$

% idle time =  $\frac{1.7}{8} \times 100 = 21.25\%$

• Hardware utilization =  $100 - 21.25$   
 $= 78.75\%$

• Multi cycle

Data path

→ split the steps

# Pipelining

April 7, 2022 2:44 PM

- Out of order execution (No straight order of exec)
  - Branch predictions  
Branch statements take more time
- Superscalar processors
  - Multicycle
  - VLIW (Very long instruction word)

|                    | 1              | 2              | 3              | 4              | 5              | 6 |
|--------------------|----------------|----------------|----------------|----------------|----------------|---|
| Instruction Fetch  | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>5</sub> |   |
| Instruction Decode | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> | I <sub>5</sub> |   |
| Execute            |                | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> | I <sub>4</sub> |   |
| Memory Access      |                |                | I <sub>1</sub> | I <sub>2</sub> | I <sub>3</sub> |   |
| Write back         |                |                |                | I <sub>1</sub> | I <sub>2</sub> |   |



this is called a reservation table

- Stage delay, speedup, efficiency
- $T_{n+1}$



Calculation

$$\left( \frac{T_{WP}}{T_P} \right)$$



- Hazards in Pipelining

1) Data Hazard  $\rightarrow$  Read after write (RAW)

2) Structural Hazard  $\rightarrow$  multiple instruction same resource

3) Control Hazard  $\rightarrow$  Control type jump instructions, flushed immediate instructions

- Stall  $\rightarrow$  doing nothing in pipeline



- Control Hazard

In ... the intermediate step

- Flushing the intermediate steps in branching type.
- Highly performance deplete
- Solution? → decider in If stage  
create stall for next instruction  
↳ Scoring intermediate steps



$$\begin{array}{l} \checkmark I_1: R_1 \leftarrow R_2 * R_3 \\ \checkmark I_2: R_2 \leftarrow R_4 + R_5 \end{array}$$

write



$$I_1(\text{Domain}) \cap I_2(\text{Range}) \neq \emptyset$$

• write after write (WAW)

$$\begin{array}{l} I_1: R_3 \leftarrow R_1 * R_2 \\ I_2: R_3 \leftarrow R_4 + R_5 \end{array}$$

(2 diff values  
at same time)



$$\text{Range}(I_1) \cap \text{Range}(I_2) \neq \emptyset$$

• for RAW, WAW, WAR (Anti)



Register Renaming  
Take last value

WAR Hazard (2) 20/50

Storing Temp values but not overriding registers  
 ✓ WAW Hazard  
 I1:  $R3 \leftarrow R1/R2$   
 I2:  $R3 \leftarrow R4+R5$

WAR Hazard  
 I1:  $R3 \leftarrow R1/R2$   
 I2:  $R1 \leftarrow R4+R5$

- Remove stalls → Data forwarding

- In-order / Out of order execution



Compiler generated order

→ compiler generated order not followed.

→ dynamic changing of order

- Instruction reordering

- For WAW, WAR

Run-time Dominance /

For ~~W1~~ | ~~V1~~ & Register Renaming /  
location changes → Reallocation  
but final structure  
Same

- Architecture, OS, Compilers

- Loop unrolling

- Example



more wasted  
clock cycles

→ runs faster  
→ More register pressure

How data hazards can be eliminated?

## How data hazards can be eliminated?

### ■ Summary

- ■ Detect and wait (stalling unnecessarily)
- Data forwarding
- Reordering (out of order)
- Renaming (to remove WAR and WAW hazard)
- Loop unrolling and reordering

## Branch Prediction

Branch prediction deals with branch penalty by continuing executing the instructions following the branch **speculatively**.

- If the branch does not occur then there was no pipeline stall.
- If, however, the branch does occur, the pipeline must be flushed of the speculative instructions.
- On average, this reduces pipeline stalls by 50%.

✓ **Static branch prediction** techniques:- The actions for the branch are fixed for each branch during the entire execution. (when behaviour is highly predictable)

✓ **Dynamic branch prediction** techniques:- The prediction decision may change depending on the execution history. (when behaviour is not predictable)

## Verilog in 2 hours [English]

# Decrementer



21:42 / 2:21:16 • Design Example: Decrementer &gt;

Scroll for details



# FIFO



Review FIFO

## concatenation

```
data_in = {a,b,4'd0};
```



**MEMORY**



# Modeling Registers

```
always@(posedge clk)
begin
    if(reset) reg_out <= 0;
    else if(en) reg_out <= reg_in;
end
```



48:00 / 2:21:16 • Verilog code for Registers &gt;

Scroll for details



# Memory organization

May 4, 2022 9:37 PM

- Independent / hierarchical



$$T_{avg} = H_1 T_1 + (1-H_1) H_2 (T_1 + T_2)$$

---

3 Level





$$T_{avg} = H_1 T_1 + (1-H_1) H_2 T_2 + (1-H_1)(1-H_2) \frac{T_3}{T_3}$$



$$T_{avg} = H_1 T_1 + (1-H_1) H_2 (T_1 + T_2) + (1-H_1)(1-H_2) \frac{(T_1 + T_2 + T_3)}{T_3}$$

Hit ratio =

## \* Cache Mapping

a) Direct Mapping -

Cache  $\rightarrow$  Lines,  
                

Main Mem  $\rightarrow$  Blocks



1c mod n

0-127 (128 words)



Block  
number      Block offset



$\omega$   $\rightarrow$  Line number  $\leftarrow$  Byte size

$$\log \left( \frac{\text{size Main Memory}}{\text{size cache memory}} \right)$$

b) Fully associative

- More hits.

No fixed position

• Ad / Pisad - Direct Mapping



## U Line number

- easy access
- Disadvantage

↓  
conflict Miss

### c) Set association

- (Direct + Fully associa~~tion~~)



- Locality of Reference

- Locality of Reference



Cache

↳ More hits

- Higher block size  $\uparrow$

More  
spatial locality

- Temporal locality  
(time)



LRU → Least recently used

Cache replacing policies

- FIFO
- ↓
- LRU
- MRU
- Random

(Set associative Mapping)

- Main Memory - RAM

- HDD
- LRU

- FIFO