

$$\begin{aligned}
 P\{X \in A\} &= P\left\{X \in A, Y \in (-\infty, \infty)\right\} \\
 &= \int_{-\infty}^{+\infty} \int_A f(x, y) dy dx \\
 &= \int_A f_x(x) dx
 \end{aligned}$$

CREATE USER 'bhadresh'@'localhost' IDENTIFIED BY 'bhadresh'.

GRANT ALL PRIVILEGES ON '<name>.\*' TO 'bhadresh'@'localhost'

FLUSH PRIVILEGES

## THE PROCESSOR

- \* brain
- \* Compute, decide, control flow

### Performance :

- \* inst. count (IC) - n(inst. exec. for prgrm)
  - ISA  $\rightarrow$  IC, richer ISA  $\rightarrow$   $\downarrow$  IC,  $\uparrow$  inst. complexity
  - compilers  $\downarrow$  IC by optimizing code
- \* CPI (cycles / inst) - mean(clk. cycles / inst)
  - inst. types  $\rightarrow$  n(cycles) {arithmetic, memory acc., branch, etc}
  - hardware design  $\rightarrow$  CPI

\* Cycle Time (CT) - duration (each CLK cycle)

processor's CLK freq.  $\rightarrow$  CT ( $3\text{GHz} \equiv 3 \times 10^9 \text{ inst/sec}$ )

longest delay path  $\rightarrow$  CT

(seq. of components that takes the longest time for signals to propagate through - slowest route.)

for the processor to fn., it must accommodate this as well, hence longest path delay  $\rightarrow$  max CLK freq, hence CT as well)

CPU Time  $\rightarrow$  time taken to exec all inst

$$= n(\text{inst}) \times t_{\text{per inst}} = n(\text{inst}) \times \frac{\text{cycles}}{\text{per inst}} \times \frac{\text{time}}{\text{per cycle}}$$
$$= IC \times CPI \times CT$$

$$\therefore \boxed{\text{CPU Time} = IC \times CPI \times CT}$$

MIPS Implementation approaches:

\* two implementations

single-cycle ( $1 \text{ inst} \equiv 1 \text{ cycle}$ )

→ pipelined (multiple inst. processed simult. in diff stages)

\* MIPS inst. subset focus:  $l|w$  (load word),  $sw$  (store word)

↳ add, sub, and, or, slt (set less than)

↳ beq (branch if eq.),  $j$  (jump)

R-format -

a) reg-reg       $\uparrow$        $\uparrow$        $\uparrow$        $\uparrow$        $\uparrow$        $\uparrow$        $\uparrow$        $\uparrow$        $\uparrow$   
 $\text{add, sub, lw, ...}$       source      dest      shift amt      operation / code

b) form: [opcode] [rs] [rt] [rd] [sh. amt] [funct] of fn.

# Instruction Execution Process

- \* Fetch: Use prog. counter (PC) to get inst. from mem.
- \* Decode: Det. type of inst. & read registers
  - (det. type → using some binary pattern, opcode - based on this, diff. control signals will configure datapath, activated by processor)
  - Read register → 32 reg.s - 5-bits req'd to specify which to use - this number is sent to reg. file during decode - file o/p the values stored in those reg's, & vals are sent to ALU for operations)
- \* Execute - ALU → arithmetic/ logic, calc. mem. addr, calc. branch target.
- \* Memory ACCESS - for lw/sw inst.
- \* Update PC - move to next inst. at (pc+4) | branch target.

## Load / Store instruction

example -

take lw \$t0, 12(\$s1) → load whatever's at addr. (\$s1 + 12) into \$t0 reg.

- 1) get base addr (say \$s1 = 1000 is addr.)
- 2) calc. eff. addr. ( $1000 + 12 = 1012 \rightarrow$  val at this addr. to be fetched).
- 3) acc. memory (go to  $addr = 1012$ , & read the 32 bit val there, say 42)
- 4) put value into \$t0 (i.e.,  $\$t0 = 42$ )

load:



Store:

sw \$ to, 12(\$S1)



16 bit offset  
+ values → everything can cover almost

## Branch Instruction

\* focus - beq.

\* beq, \$t0, \$t1, offset

1) IF - fetch from I-M at addr. in PC  
- PC + 4 done for next inst.

2) ID - opcode = 4 for branch  
- Reg nos gotten ( $\$t_0, \$t_1$ , basically addr)  
- 16 bit offset gotten

3) CMP -  $\$t_0, \$t_1$  values gotten from reg-file  
- ALU subtracts them.  
- flag = zero, if  $res = 0$ . (equal)

4) Branch - 16 bit offset extended to 32 bits  
- Shifted left by 2 bits, as:  
\*  $inst \equiv \text{words}$ ,  $word \equiv 4 \text{ bytes}$   
\* Offset  $\rightarrow$  unit of words, not bytes.  
- this is added to PC + 4 to form target address  
(Offset  $\rightarrow$  relative to inst. following the branch. So, its  $PC + 4 + \text{sign-extended offset} \ll 2$ )

5) Branch Decision - if  $ALU \equiv \text{Zero}$   $\Rightarrow$  Set then branch target  
- else,  $PC = PC + 4$  remains  
 $\hookrightarrow$  MUX is fed with ALU zero val.

# Composing Datapath Elements

- ? Why compose - single design supports all inst-  
(R-format, Load/store, Branch)  
- to consider:  
\* Shared resources  
\* MUX  
\* control signals

Key design decisions -

1. Separate Inst. & Data Memories :

- \* can't use same mem. for both inst in single-cycle design
- \* IM during IF, DM during EX/MEM (no conflicts here) (harvard architecture)

2. MUX placement:

- \* ALU input - select b/w 2nd reg. value (for R-type) & sign-extended offset (load/store) {lw/sw \$t0, xc(\$s)}
- \* Reg Write (Data) - select b/w ALU o/p (R-type) & mem data (load)
- \* Reg Write (Addr) - select b/w rd field (R-type) & rt field (load)  
consider: add \$t0, \$s1, \$s2  
lw \$t0, 12(\$s1).

design problem:  
1 write i/p, but  
2 diff. fields.  
↳ dep. on type of inst.

in add - \$t0 is specified in  
rd field  
in lw - \$t0 is specified in  
rt field.

Reg. Write Addr  $\rightarrow$   $2i|P$ :  $r_d$  field in inst (bits 15-11)  
 $r_t$  field in inst (bits 20-16)



→ Select WHERE to write, based on  
 control signal (Reg DST = 1  $\rightarrow r_d$ ,  
 else  $r_t$ )

\* PC-input: selects b/w PC+4, branch target  
 & jump target  $\leftarrow$  direct location

### 3. Control Signal Distribution:

- ↳ RegWrite
  - ↳ ALU Op
  - ↳ MemRead
  - ↳ MemWrite
  - ↳ Reg-ALU MUX
  - ↳ D-Mem-Reg MUX
  - ↳ Branch mux
- ? jump not included.

### COMPOSED DATAPATH



இதுவரை நம் காற்றுக்கொண்டது (whatever we've done till now):

## Full Datapath



Chapter 4 — The Processor — 20

## SIMPLE IMPLEMENTATION SCHEME

\* Control logic: to know what the inst. is & what each component shd. do for that inst.

\* 2 control units → Main, ALU

↳ Main : \* takes opcode (bits 31-26)

\* generate control signals for regWrite, memRead/write, ALUOp, RegDst, MemtoReg, ALUSrc, PCSrc & jump/branch

↳ ALU : \* needs to know what op to perform (add, sub, ...)  
\* this control is not directly from opcode,

as R-type = SAME OPCODE !

↑ MODULARITY

\* so, ALU control gets ALUOp & funct field from inst  
funct → decides op.

↓  
(5-0 bits for R-type)

| opcode | ALUOp | Operation        | funct  | ALU function     | ALU control |
|--------|-------|------------------|--------|------------------|-------------|
| lw     | 00    | load word        | XXXXXX | add              | 0010        |
| sw     | 00    | store word       | XXXXXX | add              | 0010        |
| beq    | 01    | branch equal     | XXXXXX | subtract         | 0110        |
| R-type | 10    | add              | 100000 | add              | 0010        |
|        |       | subtract         | 100010 | subtract         | 0110        |
|        |       | AND              | 100100 | AND              | 0000        |
|        |       | OR               | 100101 | OR               | 0001        |
|        |       | set-on-less-than | 101010 | set-on-less-than | 0111        |

R-type inst : \* opcode = 000000

- (e.g. add \$t0,\$t1,\$t2) \* RegDst = 1 (dest is rd)  
 \* ALUSrc = 0 (use reg-value, not offset)  
 \* MemtoReg = 0 (no write from memtoReg)  
 \* RegWrite = 1 (writing to rd, so yes)  
 \* memRead | Write = 0  
 \* ALUOp = 10 (let ALU decide based on funct)

\* ALU control gets funct.  
 funct = 100000 → ALU Control o/p = 0010  
 = 00010 → sub → control o/p = 0110  
 = 00100 → and → 0000  
 & so on ...

\* result = 2 vals added + stored at rd.

Load inst : \* opcode = 100011  
 (lw rt, off(rs)) \* main control:

↳ memtoReg = 1

↳ readMem = 1

↳ writeMem = 0

↳ ALUSrc = 1 → use offset-

↳ ALUOp = 00 → basic addr-addn.

↳ RegWrite = 1

↳ RegDst = 0 (write to rt)

\* ALU control: 00 → just add, so base + offset done & fetch word from memory, writes into rt.

Branch-On-Equal : \* opcode = 001000  
 (beq \$r<sub>1</sub>, \$r<sub>2</sub>, label) \* main control :

↳ WriteReg = 0

↳ Write / Read Mem = 0

↳ ALUSrc = 0

↳ ALUOp = 01 → subtract

↳ Branch = 1

\* ALU : ALUOp = 01 → sub.

if  $r_1 - r_2 = 0$ , then PC → branch target.

## Datapath With Control



Chapter 4 — The Processor — 24

| Signal     | Purpose                                                     |
|------------|-------------------------------------------------------------|
| RegDst     | Selects destination register: rd (R-type) or rt (load)      |
| RegWrite   | Enable writing to register file                             |
| ALUSrc     | Use immediate or register as second ALU input               |
| MemRead    | Enable reading from memory (load)                           |
| MemWrite   | Enable writing to memory (store)                            |
| MemtoReg   | Selects value to write to register: from memory or from ALU |
| Branch     | True if instruction is a beq                                |
| Jump       | True if instruction is a j → Shall address                  |
| ALUOp[1:0] | Sent to the ALU control unit                                |

? How are jumps handled

\* take upper bits of PC + 4

\* concatenate them with 26-bit addr. left shifted by 2. (again, inst. are words, so ye)

\* PC needs a 3-way MUX for:

↳ PC + 4

↳ jump addr.

↳ branch addr.

\* New control signal for jump.

} addrs. to the structure.

# Datapath With Jumps Added



## PERFORMANCE ISSUES :

- \* Slowest instruction determines clock cycle time  
(e.g.)  $Clw : I \cdot M \rightarrow \text{Reg-Read} \rightarrow \text{ALU} \rightarrow \text{MEM} \rightarrow \text{Reg-Write}$   
↳ LONGEST DELAY path!  
∴ simple insts. like add shd. also wait.

Hence, INEFFICIENT & VIOLATES RISC principle  $\rightarrow$  make the common case fast.

## PIPELINING

- \* Overlapping of multiple inst.

Analogy: say if 1 washer, 1 dryer, 1 folder - instead of waiting one full load to go thru all before starting next, you overlap:

| Time   | Washer | Dryer  | Folder |
|--------|--------|--------|--------|
| 0-30   | Load 1 |        |        |
| 30-60  | Load 2 | Load 1 |        |
| 60-90  | Load 3 | Load 2 | Load 1 |
| 90-120 | Load 4 | Load 3 | Load 2 |

You finish 4 loads in the time it would've taken to do 1.  
→ That's pipelining!

w/o pipelining → CPU waits  $\dagger$  5 stages before next one  
 → 1 inst  $\equiv$  5 cycles  
 → other units idle while not being used  
 → waste of hardware.

with pipeline → each stage of datapath is  
 ALWAYS DOING USEFUL WORK.  
 → means, 1 inst  $\equiv$  1 cycle, after initial  
 ramp up!!!

Applying to MIPS :

| Cycle | IF      | ID      | EX      | MEM     | WB      |
|-------|---------|---------|---------|---------|---------|
| 1     | Instr 1 |         |         |         |         |
| 2     |         | Instr 2 | Instr 1 |         |         |
| 3     | Instr 3 |         | Instr 2 | Instr 1 |         |
| 4     |         | Instr 4 | Instr 3 | Instr 2 | Instr 1 |
| 5     | Instr 5 |         | Instr 4 | Instr 3 | Instr 2 |

Stages → IF, ID, EX, MEM, WB  
 imagine pipelining this → each inst  $\equiv$  5 cycles,  
 but new inst starts EVERY cycle

this works because each inst. uses different part of  
 the datapath at each stage. So, we can:

- ↳ Split datapath with pipeline registers b/w stages.
- ↳ feed new insts. into pipeline as prev. ones move forward.

to keep insts. from overwriting each other, we insert pipeline registers. essentially these are clk. pts.



→ pipeline registers.

\* hold o/p of prev stage

\* feeds i/p to next stage in the NEXT CLK cycle

\* keeps diff. insts. data separate.

∴, EACH INST moves forward one stage/clock, like water flowing step by step down a stair stepped waterfall

## Performance Intuition

w/o pipeline → 1 inst ≡ 5 cycles

∴ throughput = 0.2 inst/cycle pipeline.

with pipeline → 1 inst ≡ 1 cycle (after ramp-up)

∴ throughput = 1 inst/cycle  
⇒ ~ 5x the speed.

\* throughput -  
 $S_n(\text{comp-inst} / \text{unit time})$

↳ speed of

| Instr    | Instr fetch | Register read | ALU op | Memory access | Register write | Total time |
|----------|-------------|---------------|--------|---------------|----------------|------------|
| lw       | 200ps       | 100 ps        | 200ps  | 200ps         | 100 ps         | 800ps      |
| sw       | 200ps       | 100 ps        | 200ps  | 200ps         |                | 700ps      |
| R-format | 200ps       | 100 ps        | 200ps  |               | 100 ps         | 600ps      |
| beq      | 200ps       | 100 ps        | 200ps  |               |                | 500ps      |



## HAZARDS

\* stall / disrupt the flow

STRUCTURAL HAZARDS :

- \* two stages accessing same hardware  
(e.g. IM & DM)
- \* Sol: use diff. mems for both (IM, DM)

DATA HAZARDS :

- \* an inst. depends on the result of prev one.

(e.g.) add \$t<sub>1</sub>, \$t<sub>2</sub>, \$t<sub>3</sub>  
sub \$t<sub>4</sub>, \$t<sub>1</sub>, \$t<sub>5</sub> ← needs \$t<sub>1</sub> before it's ready.

CONTROL HAZARDS :

- \* next inst. depends on result of branch
- \* won't know branch status until EX or MEM.

## DESIGN CONSIDERATIONS

↳ all inst: 32 bits → easy to decode

↳ few inst formats → uniformity

↳ reg. operands in fixed positions → faster decode

↳ load/store - only memory access - memory stage is predictable

# Pipelined Datapath & Control



## What Each Pipeline Register Stores

Let's look at what data needs to be passed from one stage to the next:

### ► IF/ID (after Instruction Fetch):

- Instruction (to decode)
  - PC+4 (used for branch calculation later)
- w/o this, we won't know what was decoded

### ► ID/EX (after Decode):

- Register read values: ReadData1, ReadData2
- Immediate value (sign-extended)
- Destination register numbers (rt, rd)
- Control signals: ALUOp, ALUSrc, etc.

in EX stage:

- \* do ALU
- \* where reg. to go
- \* write to Mem/Reg

w/o this, dk which operands to compute (what to do)

### ► EX/MEM (after Execute):

- ALU result
- Register value for sw
- Destination register number
- Control signals: MemRead, MemWrite, MemtoReg, RegWrite

in MEM, we need:

- \* addr & data for sw
- \* addr to read for lw
- \* almost nil for R-type

w/o diz, mem. b like: whose res.

is this? (sd. I even write to Mem.)

decide what & where to write

### ► MEM/WB (after Memory access):

- Data read from memory
- ALU result (for instructions that don't access memory)
- Destination register number
- Control signals: MemtoReg, RegWrite

Control Signals are stored to tell the datapath how to handle the instruction at that stage

each inst → in diff. stage. So,

- \* Keep its OWN DATA (regval/add)
- \* Carry control signals (know what to do)
- \* Not mess with data from inst before/after it.

pipeline regs → backpacks

pipeline regs are like a baton in a relay race.

- Once an instruction finishes a stage (say, ID), it hands off its baton (data + control signals) into ID/EX.
  - That baton is only used by the EX stage during the next clock cycle.
  - After that, it's no longer needed by EX, and gets replaced with the next instruction's baton.
- \* So you're not relying on reuse of old data—you're ensuring that every instruction has a fresh and private copy of what it needs, just in time.

## Assurance Comes from the Clocking Discipline

The datapath is designed using a strict synchronous clocking model:

- At every rising edge of the clock:
  - Each stage writes its output into the next stage's pipeline register.
  - Each stage reads its input from its own pipeline register.

This provides a kind of "lockstep" movement, where:

- Stage N writes → Stage N+1 reads → Register overwritten
- But the read happens **before** the overwrite—guaranteed by the clock cycle boundaries.

## IF for Load, Store, ...



## ID for Load, Store, ...



## EX for Load



## MEM for Load



## WB for Load



## Corrected Datapath for Load



this write won't conflict as R/W happens at diff times, f W

Chapter 4 — The Processor — 57

those places happen only after MEM/WB

the same datapath probm - for R-type too...



## Pipelined Control



~~add \$r<sub>1</sub>, \$r<sub>2</sub>, \$r<sub>3</sub>~~  
~~add \$r<sub>4</sub>, \$r<sub>1</sub>, \$r<sub>2</sub>~~

SOLVED BY  
FORWARDING

IF      ID      EX      MEM      WB

I<sub>1</sub>

I<sub>2</sub>

I<sub>1</sub>  
I<sub>2</sub>

I<sub>1</sub>  
I<sub>2</sub> → I<sub>1</sub>

exec. with  
older \$r<sub>1</sub>

