

SIMD single instr multiple data

SSE .

AVX adv vector extension



same instr on multiple data points together.  $\Rightarrow$  hand coded library implementation to make use of direct instrs: specialised for FFT, convl etc.

$\rightarrow$  Stack Archi: only few of the stack operands <sub>top ones</sub>  
 $\hookrightarrow$  Bring operands to stack first. Top 2 are operands and ans written on ToS (overwrite B)

$\rightarrow$  Accum. Archi: Accum. registers  
 Load A  $\rightarrow$  accum. reg.  
 Add B  $\rightarrow$  B in memory (can use)  
 Store C  $\rightarrow$  write result on accum. reg of A

$\rightarrow$  Reg-Mem Arc: uses both reg. and mem  
~~saves no. of instr wrt ISA~~ in ALU. (intel x86)

$\rightarrow$  Reg-Reg Arc: all ALU ops only in reg.  
 "Load - Store reg archi". (ARM, MIPS, RISC-V)  
 $\hookrightarrow$  more no. of instr typically. coz we have to fetch into reg. Uniform latency of ALU ops.

$\rightarrow$  No. of registers

## No. of registers

more registers  $\rightarrow$  better performance.



but more reg  $\Rightarrow$  bitlines length / inc $\uparrow$   
and hence delay (RC) inc

(16) regs. found optimal  $\begin{cases} 32 \text{ bit} \\ 64 \text{ bit} \end{cases}$  today



Instn Size: reg-reg :  $5 \begin{cases} (3) \\ \text{regs} \end{cases} + 6 \begin{cases} (\text{opcode}) \\ = 21 \end{cases}$   
mem-mem  $\begin{cases} 32(3) = 96 \end{cases}$

mem-mem  $\rightarrow$  r-r  
Code Density:  $\frac{\# \text{ of insts}}{\text{total size of code in B}}$

Execution Time: mem-mem  $\Rightarrow$  ALU non-deterministic  
 $\Rightarrow$  implement $n$  diff  $\Rightarrow$   
reg-reg  $<$  time than mem-mem

Pipelining: reg-reg better, since uniform latency of ALU

No. of bits for reg encoding: reg-reg needs more reg than mem-mem.



for backward compatibility and changing parts of a reg.  
what to do to change an len

takes same amt of time to change any of subparts



double word

000 - 111

LE  $\Rightarrow$  LSB  $\Rightarrow$  000

BE  $\Rightarrow$  MSB  $\Rightarrow$  000

} same in both



$\rightarrow$  Addressing modes (VAX left = destn)

N reg  $\Rightarrow$  each log N bits

Indexed — one is like base and other like displacement e.g. array start in R1 & offset in R2,  $R2 += 4$  each loop.

Mem Indirect



Linked list better

Take R3 as an addr and go there. then treat it as a mem and go there

Auto inc/dec for loops.

when R3=n  
then reset to 0.

$(R3) +$   $\Rightarrow$  go then  $R3++$

$-(R3)$   $\Rightarrow R3--$  then go

(last is  $(n-1)$  so R3 loaded with n)

amt of inc/dec can be prog

Scaled :  $100 + R2 + d \times R3$

scaling const, hidden.  
can be programmed using instr.