

## Defining Comp Arch.

ISA : → the programmable instruction set of a processor → the lang to communicate with a spf.

Can be of diff types →  
 reg - mem ISA → x86  
 Load - store ISA → RISC  
 (arms, mips).

### Mem addressing

→ typically byte aligned.  
 a mem object of size  $\Rightarrow$  bytes residing at address A will imply →  
 $A \bmod s = 0$

### Addressing modes

Ways to specify mem objects.

MIPS modes → Register, immediate,  
 displacement / offset based

ARM modes → MIPS modes + PC relative addr.  
 sum of 2 regs.

Type & size of operands → 8, 16, 32, 64 → bits.  
 char | int/float | double.  
 unicode

### Operations

Data transfer, Arithmetic, Logical,  
 Control,

### Encoding

R format: opcode rs rt rd

Program Data transfer      opcode rs rt immediate.

J forward jump →      opcode address

## Energy & Power within MP

Dynamic Power consumption in clocked circuits

$$P_{\text{dynamic}} \propto \frac{1}{2} \cdot C \cdot V_{dd}^2 \cdot f$$

clk rate  $\sim 3.3 \text{ GHz}$   $^{203}$  towards saturated.

Techniques ① clock gating  $\rightarrow$  turn off non-inactive units.

② DVFS.

③ On chip temp sensor  
 $\hookrightarrow$  turn off units based on thermal scenario.

DRAM freq., spin disk at low rate.

### Cost trend

$$\text{IC cost} = \frac{\text{die cost} + \text{testing} + \text{packaging cost}}{\text{Final test yield}}$$

### Dependability

$$\text{Availability} = \frac{MTTF}{MTTF + MTTR}$$

MTBF,

$$MTTF_{\text{system}} = \frac{1}{\text{failure rate}}$$

System with multiple ~~units~~  $\rightarrow$  <sup>multiple</sup> failure rates add up.

$$MTTF_{\text{power supply board}} = \frac{\frac{MTTF_{\text{power supply}} / 2}{MTTR_{\text{power supply}}}}{MTTF_{\text{power supply}}}$$

## Performance index

- ① response time / exec time
- ② total work done inside some  $T \rightarrow$  throughput

executive  $\propto \frac{1}{\text{performance}}$ .

## Benchmarks

set of representative workloads in a domain.

Desktop  $\rightarrow$  SPEC standard Perf Eval Conf

SPECINT

SPECFP

Server  $\rightarrow$  SPECFS file server  
 SPECWeb web server  
 SPECvirt virtualized services  
 TP (Transaction processing  
 benchmarks)

## Emb

EEMBC

electronic design News Embedded  
 Microprocessor Benchmark  
 Consortium.

## Quantitative principles of Computer Design

- ① Take advantage of parallelism  
 Moore's law does not hold any more.
- ② Principle of Locality.
- ③ Optimize the common case first.

## Amdahl's law

Speedup =  $\frac{\text{Performance of entire task using the enhancement when possible}}{\text{Performance of entire task w/o any enhancement}}$

$$Ex_{\text{new}} = Ex_{\text{old}} \times \left[ \left( 1 - \frac{\text{fraction enhanced}}{\text{speedup enhanced}} \right) + \frac{\text{fraction enhanced}}{\text{speedup enhanced}} \right]$$

$$\begin{aligned} \therefore \text{speedup overall} &= \frac{Ex_{\text{new}}}{Ex_{\text{old}}} = \frac{Ex_{\text{old}}}{Ex_{\text{new}}} \\ &= \frac{1}{\left( 1 - \frac{\text{fraction enhanced}}{\text{speedup enhanced}} \right) + \frac{\text{fraction enhanced}}{\text{speedup enhanced}}} \end{aligned}$$

$$CPI = \frac{CPV \text{ cycles for program}}{\text{Instruction count}}$$

$$\begin{aligned} CPV \text{ Time} &= \text{Instr count} \times CPI \times \text{clk cycle time} \\ &= \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{clock cycle}} \end{aligned}$$

$$CPV \text{ clock cycles} = \sum_i^{n_{\text{instr}}} IC_i \times CPI; \quad \rightarrow \text{non-specific intr. instr } i$$

$$CPI = \frac{\sum_{i=1}^n IC_i \times CPI_i}{\text{Instr count}} = \sum_{i=1}^n \frac{IC_i}{\text{Instr count}} \times CPI_i$$

JLP,

hazard discussion — directly from  
Dynamic scheduling [rop small] → brok.

Tomasulo starting discussion — directly from  
brok.

→ notes

## Dynamic Schedule

- (1) → can handle binary execution, even if the compiler has not performed inter scheduling.
- (2) can exploit dynamic dependencies unknown at compile time.
- (3) tolerate execution delay due to events like mem access, cache miss, - - -

key idea → allow instructions later to go forward if hazards are resolved (structural / data hazards).

separate the issue process to two stages  
check structural hazards,

wait for ~~next~~ operand availability.

issue is in-order, execution & completion is out-of-order.

Ex:

|       |                |                 |                      |     |                |       |
|-------|----------------|-----------------|----------------------|-----|----------------|-------|
| DIV.D | F <sub>0</sub> | F <sub>2</sub>  | F <sub>4</sub>       |     |                |       |
| ADD.D | F <sub>6</sub> | F <sub>8</sub>  | <u>F<sub>8</sub></u> | S.D | F <sub>6</sub> | O(R1) |
| SUB.D | F <sub>8</sub> | F <sub>10</sub> | F <sub>14</sub>      |     |                |       |
| MUL.D | F <sub>6</sub> | F <sub>10</sub> | F <sub>8</sub>       |     |                |       |

name dep
 

|                                                |
|------------------------------------------------|
| anti-dependence (WAR) → (ADD, SUB), (S.D, MUL) |
| o/p-dependence WAW → ADD, MUL.                 |

data dependence ( DIV → ADD ,  
ADD → S.D ,  
SUB → MUL )

Eliminate all name dep by renaming.

|       |                |                 |                 |  |  |  |
|-------|----------------|-----------------|-----------------|--|--|--|
| DIV.D | F <sub>0</sub> | F <sub>2</sub>  | F <sub>4</sub>  |  |  |  |
| ADD.D | <u>S</u>       | F <sub>0</sub>  | F <sub>8</sub>  |  |  |  |
| S.D   | <u>S</u>       | O(R1)           |                 |  |  |  |
| SUB.D | T              | F <sub>10</sub> | F <sub>14</sub> |  |  |  |
| MUL.D | F <sub>6</sub> | F <sub>10</sub> | T               |  |  |  |

// also, need to replace future F<sub>8</sub> reads by T

## Tomasulo

Register renaming performed by reservation stations. (Rs)

Rs → buffer insts operands waiting issue.

a. If will fetch the operand as soon as it is computed & available in FU o/p.

b. pending insts → will designate Rn for their o/p.

More Rs than registers.

eliminates hazards (WAR, WAW)

CDB → common data bus.

update all units waiting for o/p in parallel → FU o/p → Reg file  
→ Rs.

Load/store buffer

hold data/addr coming from/going to Rs.

Tomasulo scheme → uses Rn as extended virtual register.  
other approaches may use additional registers.  
ROB etc.

1 cycle latency → data computed in FU  
& in reg file.

Load/store instr → step 1 → compute effective addr when bnd reg's available & it is placed in load/store buffer.

Loads exec with availability of memory.

Stores exec --- value being stored in n.  
→ in program order.



## Register status tag.

The tag for any register says which AS contains the instn that will produce result for this register.

Each RS has seven fields.

$O_p \rightarrow$  operation to perform on source operands  $S_1$  &  $S_2$

$Q_j, Q_k \rightarrow$  which RS is due to produce results.  
 $=0 \Rightarrow$  result already available.

$V_j, V_k \rightarrow$  operand values when available

$A \rightarrow$  hold info for mem addr calculation  $1d/1t$ .

Barry  $\rightarrow$  status field.

Regfile  $Q_i \rightarrow$  which RS  $/p$  comes here.  
 for each reg

Ex:  $\rightarrow$

|        |                    |   |
|--------|--------------------|---|
| 1. L.D | $F_6, 32(R_2)$     | 1 |
| L.D    | $F_2, 44(R_3)$     | 1 |
| MUL.D  | $F_0, F_2, F_4$    | 1 |
| SUB.D  | $F_8, F_2, F_6$    | 1 |
| DIV.D  | $F_{10}, F_0, F_6$ | 1 |
| ADD.D  | $F_6, F_8, F_2$    | 1 |

from book (b-177)

all instns issued.

first load completed. result in CDB.

$L D_2 \rightarrow$  waiting on memory.

can avoid WAW, WAR hazard,

ex:  $\rightarrow$  DIV, ADD both issued

2 possibilities  $\rightarrow$  i) instn to update  $F_6$  before DIV is complete.  
 $L D \rightarrow$  then  $V_K$  gets result  $\rightarrow$  DIV exec  $\rightarrow$  ADD exec in II.

(ii)  $L D$  not complete,  $Q_K \rightarrow$  Load, RS  
 Let ADD exec & update  $F_6$ .

## Problems on Thought

You want to build a transaction system.  
 If you use 120 processors to achieve this capacity,  
 where each processor transaction would take 10 ms on a  
 processor, how many ms each extra processor need  
 work ??

$$\cancel{\text{single}} \quad \text{desired capacity gain} = \frac{1000}{100} = 10.$$

$$10 = \frac{120}{1 + (120-1) \times x}$$

$$10 = \frac{120}{1 + (120-1) \times x}$$

$$100 + (120 - 1) \times n = 1000$$

$$\Rightarrow x = \frac{900}{119} = 7.69 \text{ m/sec.}$$

~~case~~ -  $\frac{900}{419}$  from - / rec.

$$\Rightarrow 1 \text{ transaction} = \frac{119 \times 100}{99} \text{ ms.}$$

computing average

throughput workloads, each requiring no operations,

→  $n$  workloads, each requiring  $m$  operations;  
consider that thought for a precursor for these workloads are  
 $m$        $m$        $m$        $m$        $m$        $m$

$$\frac{m}{t_1}, \frac{m}{t_2}, \dots, \frac{m}{t_n}. \therefore \text{average throughput} = \frac{m \cdot n}{\sum t_i}$$

$$\text{diff n. of operations} = \frac{m_1 + \dots + m_n}{t_1 + t_2 + \dots + t_n}$$

$\sim$  harmonic mean.

Compute average speed if

(1000 ops) workload 1

$$\frac{\text{proc A}}{10 \text{ op/sec}} \quad \frac{\text{proc B}}{20 \text{ op/sec}}$$

HM of speed up

A/B

$$= \frac{1}{\frac{1}{2} \left( \frac{\frac{12}{10}}{\frac{12}{10}} + \frac{\frac{12}{5}}{\frac{12}{5}} \right)}$$

$$= \frac{1}{\frac{1}{2} \left( \frac{20}{10} + \frac{10}{20} \right)} = 0.8 \text{ m.e.}$$

AM of speedups B/A

$$= \frac{1}{2} \left( \frac{10}{20} + \frac{20}{10} \right) = 0.8.$$

confusing, both can not be true.

Geometric mean makes sense =

$$\frac{A}{B} = \sqrt{\frac{\frac{100}{10}}{\frac{100}{20}}} = 1$$

Consider latencies →

|     |   |          |
|-----|---|----------|
| LD  | — | 1 cycle  |
| ADD | — | 2 cycle  |
| MUL | — | 6 cycle  |
| DIV | — | 12 cycle |

C1K

1 LD F6 ... issue

2 ~~LD~~, ~~LD~~<sup>ex</sup>, LD<sub>2</sub> issue

3 LD<sub>1</sub> com, LD<sub>2</sub> ~~commit~~, MUL issue

4. ~~LD<sub>2</sub>~~, MUL ex, ~~SUB~~ issue →

5. MUL ex, SUB ex, DIV issue

MUL can  
execute  
w/o LD  
during WB  
to F2  
using RS

6. n , n ex, Add issue

7. , n , SUB commit, n , Add ex

8. n , , , , n ,

9. n , , , , n , Add com

10. MUL commit, , DIV ex,

$ILP \neq IPC$

= in perfect tree with perfect BP.

Ex → 2 issue, 0-0-0,  
2 instr  
/cycle

1 MUL  
2 ADD/SUB/XOR FU

|          |     |    |    | cycle |   |
|----------|-----|----|----|-------|---|
| ALU1     | ADD | R1 | A2 | 1     |   |
| use ALU2 | SUB | R4 | R1 | R5    | 2 |
|          | XOR | R6 | R7 | R8    | 1 |
|          | MUL | A5 | R8 | R9    | 1 |
|          | ADD | R9 | A8 | R9    | 1 |

$\frac{5}{2} = 2.5 \text{ ILP}$

IPC

ILP in program prospects → no info of processor resources.

| <u>IPC</u> | <u>cycle</u> | <u>FU</u>      |  |
|------------|--------------|----------------|--|
| ADD        | 1            | ADD            |  |
| SUB        | 2            | SUB            |  |
| XOR        | 1            | <del>XOR</del> |  |
| MUL        | 1            | MUL            |  |
| ADD        | 2            | ADD            |  |

$\frac{5}{2}$

if only 1 ADD/SUB/XOR →  $IPC = \frac{5}{4}$

$IPC \leq ILP$

# Tomasulo back

## Instruction

Issue

Wait until  
station r empty

FP op

Action

```

if (Regstat[rs].Q; 1 0)
{ Rr[r].Qj ← Regstat[rs].Q;
  Rr[r].Dj ← Regstat[rs].D }
```

else
{ Rr[r].Vj ← Reg[rs];
 Rr[r].Qj ← 0; }

if (Regstat[rt].Q; 1 0)
{ Rr[r].Qk ← Regstat[rt].Q; }

else
{ Rr[r].Vk ← Reg[rt];
 Rr[r].Qk ← 0; }

Rr[r].Busy ← yes;
 Regstat[rd].Q ← r;

if (Regstat[rs].Q; 1 0)
{ Rr[r].Qj ← Regstat[rs].Q; }

else
{ Rr[r].Vj ← Regs[rs];
 Rr[r].Qj ← 0; }

Rr[r].A ← imm

Rr[r].Busy ← yes

Regstat[rt].Q; ← r;

if (Regstat[rt].Q; 1 0)

{ Rr[r].Qk ← Regstat[rt].Q; }

else,

{ Rr[r].Vk ← Reg~~rt~~[rt].Q;
 Rr[r].Qk ← 0; }

Load  
/ store

Buffer r  
empty

Load only



store only



Book example  
181 page

Doing ld & st 0-0-0,

If ld & st access same address,

if ld before st in program order,

interchange will result in WAR

i.e. if st is before ld, interchange will lead to RAW.

Similarly, interchanging two stores → WAW,

i.e. for executing a ld, check if it incomplete st with same addr.

similarly, check for store.

one option is to calculate effective addresses in program order.

If ld address, match the addr of any active st, then ld instruction is not sent to load buffer until the conflicting store completes.

stores operate similarly, only that you need to check for same address in both ld & st buffers.

Drawbacks of Tomaru's

→ WAW support, RS & CDB.

Renaming architectural registers to larger set of physical registers using RAT  
(Reg allocation table)

& buffering of ~~reg~~ operands from reg file.

RAT → ..... translate architectural registers to physical registers.

## HW branch speculation

A wide issue processor may need to execute a branch every cycle to maintain max perf.

## HW branch speculation

- ① dynamic branch prediction, → choose which inst to execute;
  - ② speculation to allow inst exec before control dep are resolved.
  - ③ dynamic scheduling (to deal with diff combination of basic blocks.)
- w/o speculation → dynamic sched works only inside a BB.

Extend Tomarnto to support speculation.

Separate off the two parts

↳ bypassing of results among insts for executing them speculatively & the actual completion of insts.

introduce an 'insta commit' phase, to allow inst to exec 0-0-0 but commit in-order.

adding a phase means introducing additional data buffers. → ROB commit stat.  
↑

ROB → 8 fields → op, destination, value, ready field.

↓  
Br,  
st,  
ld,  
ALU

<MEM,  
Reg>

↓  
result before  
commit

ROB replaces store buffers.

Every instr has a position in ROB  
before commit.

Hence, we tag update results with  
ROB entry no, rather than RS no.

### Instr exec steps

- ① Issue : get instr from queue. Issue instr if there is an empty slot in a matching RS & empty slot in ROB.  
Send operands to RS if available in Reg/ROB.  
Do book-keeping → i.e. ROB/RS entry etc.  
stall if no matching RS free / ROB full.

- ② Execute : wait in RS for operands.  
L.D → effective addr calc in int pipeline.  
& load from mem to ROB  
S.D → only eff. addr calc.

- ③ Write Res :



- ④ Commit

if branch instr  
if branch instr  
reach ROB head  
with incorrect  
prediction status,  
ROB is flushed.

If store  
update  
Mem  
rather  
than  
Reg  
file.

if others  
(normal commit)  
when instr  
reach ROB head  
update Reg  
file from  
ROB entry  
Rem instr  
from ROB.

ROB ensure → ILP across BB.  
o-o-o exec, in-order commit.

## Ex. from Book

add 2 cycles.  
mul 6 cycles.  
div 12 cycles.

|       |             |
|-------|-------------|
| L.D   | F6, 32(R2)  |
| L.D   | F2, 44(R3)  |
| MUL.D | F0, F2, F4  |
| SUB.D | F8, F2, F6  |
| DIV.D | F10, F0, F6 |
| ADD.D | F6, F8, F2  |

### Modifications

RS  $\rightarrow$  8j. R<sub>k</sub> fields now have  
ROB entry no. in place of RS no.

Register status field  $\rightarrow$  ROB entry

We compute diff entries when MUL.D is under exec.  
SUB. has completed  
but cannot commit.

Fig from Book

Key observation  $\rightarrow$  no. instr after the earliest uncompleted instr is allowed to complete.

Implication of ROB  $\rightarrow$  we can execute code dynamically (supporting speculation) while maintaining "precise interrupt model".

State consistency  $\rightarrow$  "precise interrupt model".  
ex: if MUL.D creates an interrupt, we can wait (before handling the interrupt)  
fill it makes ROB head.

$\hookrightarrow$  this yields a precise exception since instr commits are in order.

In this ex, SUB & ADD can finish before MUL/DIV.  
W/O ROB, then F8 & F6 get changed &  
interrupt due to MUL will lead to inconsistent state.

Common case in some arch  $\rightarrow$   
floating point exceptions are allowed to be  
imprecise  $\rightarrow$  as program will terminate.

Page fault exception  $\rightarrow$  need precise handling.

Fig from Book

Then a few imp points: →

exception is raised → record in ROB.

if branch mispredict → exception flushed with all instrs after branch.

if intra branch ROB head & no mispredict  
↳ exception taken.

general: handle exception as soon as earlier branches get resolved.

handling stores in speculative system  
store into update mem only when ROB head reached.  
mem update happens only when it is no longer speculative.

store op → V<sub>k</sub> field of atom reservation  
↳ ROB

In reality, atom result can arrive late, just before some ROB reaches head.

directly from CDB → ROB entry.

HW tracks if any completing atom has any dependency on atom with pending update.

effect → allow store to pass through write result stage w/o result available.

WAK/WAW → in-order commits.

RAW → i) L.D cannot load value (2nd step of L-D exec)  
if an ROB entry of S.D  
has matching destination field.

ii) L.D eff address calculation in program order w.r.t all earlier stores.

extend these ideas to → INT pipeline, multithread proc.

## More examples

### Branch Misprediction Recovery.

LD R<sub>1</sub>, 0(R<sub>1</sub>) → consider a cache miss here  
 BNE R<sub>1</sub>, R<sub>2</sub>, Label → NT  
 ADD R<sub>2</sub>, R<sub>1</sub>, R<sub>1</sub> → stall  
 MUL R<sub>3</sub>, R<sub>3</sub>, R<sub>4</sub> → h<sub>0</sub>  
 DIV R<sub>2</sub>, R<sub>3</sub>, R<sub>7</sub> → h<sub>0</sub>  
 ... → Will domain make wrong updates??

### Phantom exceptions:

BEQ R<sub>1</sub>, R<sub>2</sub>, Label  
 DIV R<sub>0</sub> R<sub>0</sub> R<sub>5</sub>

consider BEQ predicted as NT & subsequently DIV throwing an exception.

exception result (marked as E<sup>x</sup>) in ROB as separate entry.

### quiz

|  |                                                      |           | New state               |
|--|------------------------------------------------------|-----------|-------------------------|
|  | ADD R <sub>2</sub> , R <sub>2</sub> , R <sub>1</sub> | committed | committed               |
|  | LW R <sub>1</sub> , 0(R <sub>2</sub> )               | executing | "                       |
|  | ADD R <sub>3</sub> , R <sub>4</sub> , R <sub>5</sub> | done      | "                       |
|  | DIV R <sub>3</sub> , R <sub>2</sub> , R <sub>3</sub> | executing | unexecuted. (Exception) |
|  | ADD R <sub>1</sub> , R <sub>4</sub> , R <sub>4</sub> | done      | "                       |
|  | ADD R <sub>3</sub> , R <sub>2</sub> , R <sub>2</sub> | done      | "                       |

### ROB update game:

|    |                |                                   |
|----|----------------|-----------------------------------|
| 1. | R <sub>1</sub> | R <sub>2</sub> + R <sub>3</sub>   |
| 2. | R <sub>3</sub> | R <sub>5</sub> + R <sub>6</sub>   |
| 3. | R <sub>1</sub> | ROB <sub>1</sub> + R <sub>7</sub> |
| 4. | R <sub>1</sub> | R <sub>4</sub> + R <sub>8</sub>   |
| 5. | R <sub>2</sub> | R <sub>9</sub> + ROB <sub>2</sub> |

RAT  
R<sub>1</sub> ROB<sub>4</sub>

R<sub>2</sub> ROB<sub>5</sub>

R<sub>3</sub> ROB<sub>2</sub>

R<sub>7</sub>

latest  
renames

fp exception  
handler  
can also  
put a new  
small value  
& let execution  
continue.

ROB renaming

In original TOMarnto,

Regfile update happened based on RAT.

If a matching FU o/p with same RS Tag is ready to update.

Here, it is not the case, ROB entries get committed in sequence for correct exception handling.

Regfile update →  $R_1 R_2 + R_3 \rightarrow ROB_1 + R_7 \rightarrow R_4 + R_8$   
 $R_2 R_9 + ROB_2$   
 $R_3 R_5 + R_6$   
 $R_4$

ROB example (OM page 134)

### Unified RS

3RS      2RS  
ADD      MUL  
/sub      /div

Issue is in/orden.  
if add is free but  
RS for MUL/div is  
occupied no add instr  
cannot be used if add  
RS is free

5RS  
↓  
ADD, MUL

Logic complicated.



3rd option is pretty aggressive.

if store produce match add them  
Ld. has to recover, it has loaded  
state data & supplied to future  
instns.

All that does not but occurs  
infrequently.

### 000 LS execution

Load R3 = O(R6) diff, miss

Add R7 = R5 + R9 X

Store R9 → O(R7) X

done. ✓ Sub R1 = R1 - R2 ✓ R1 o/p

✓ Load R8 = O(R1) ✓ ← cache hit

all fetched & decoded, dispatched first  
load & cache miss.

what if,

~~Mem[R1]~~ = ~~Mem~~ R7 after R7 is updated  
by add.

the Load is to be invalidated.

options don't do o/o/o      Id → st → Id  
do in-order for LS instns.

### Ld to st fwd

Ld → multiple others going in, inc only one matching,  
st → whom to give which Id to go & give  
value to,  
can be multiple Ld.

~~Ex~~

| L/S | PC     | seq addrs | Value                      |
|-----|--------|-----------|----------------------------|
| L   | 0XF098 | 0x 3290   | — 42 → commit              |
| S   | 4C     | 3410      | — 25 → still not committed |
| S   | 54     | 3290      | — -17                      |
| L   | 60     | 3418      | — 1234                     |
| L   | F840   | 3290      | — -17 ← fwd                |
| L   | F858   | 3380      | — 1                        |
| S   | F85C   | 3290      | — 0                        |
| L   | F870   | 3410      | — 25 ←                     |
| L   |        | 3290      | — 0 ← not from previous.   |
| L   |        | 3380      | — 1 ← bnf interest;        |
|     |        | ██████    | cache still not updated    |

### Data cache

|         |        |
|---------|--------|
| 0x 3290 | — 42   |
| 3380    | — 1    |
| 3410    | — 38   |
| 3418    | — 1234 |

Q how Ld / st commit ??

## Relation between LSQ, ROB & RS

ISSUE Ld/st

- ROB entry
- LSQ entry — kind of RS for Ld/st } cannot issue Ld/st unless and ROB & LS has free space.

Issue Non Ld/st

- ROB entry } can not issue unless ROB &
- RS matching RS is available.

Execute Ld/st → compute addrs. ::>  
produce value ::>

for Ld → ::> followed by ::> as value comes from mem add  
for st → ::>, ::> in any order as value comes from register.

Write-Result → only for Ld  
not st.

st just ~~keeps~~ keeps value in LSQ  
LSQ will fwd to matching successor loads & also send to memory on commit.

Ld gets results & immediately broadcasts to matching RS so that dependent instruction can now proceed.

Commit Ld → from LSQ & ROB entry.

... st → ..... & also send write to mem.