

# **High Performance Computer Architecture (CS60003)**

**Dept. of Computer Science & Engineering  
Indian Institute of Technology Kharagpur**

**Spring 2023**



# Load Store Queue (LSQ)

# What is really Out-of-Order?

- Fetch
  - Decode
  - Issue
  - Exec
  - Broadcast
  - Commit
- } Inorder
- } Out of Order respecting true dependencies
- } Inorder
- (RETIRED)

# Memory Access Ordering

- We have seen OOO processors track when one instruction uses register values produced by another instruction.
- However loads and store instructions can access same memory locations, without using the same register!
  - How do we handle them in the OOO processor?
  - Does loads and stores need to be done strictly in program order?
  - Can we reorder them?

# Handling Dependencies

- Control Dependencies: Branch Prediction
- False Dependencies: Register Renaming
- True Dependencies: Respected by Tomasulo's Algorithm
- What about memory dependencies?

sw R1, 0(\$R3)  
 Lw R2, 0(\$R4)

} there could be a dependency  
if the addresses are same  
⇒ in which case we have to  
do inorder.

g: How do we handle them?

# When do we allow stores to write?

- During Commit:
  - It is unsafe to write memory before it is committed.
  - Because in case of exception/branch miss if it is cancelled, we have to undo the effect of a memory write.
  - This is unstable, as we have seen.
- But then does the load have to wait till the stores are committed?
  - That would be very late.
  - Ideally, we would like to get data loaded into registers as soon as possible, to be used in other following instructions.
  - For this we shall be introducing another hardware unit, called the load-store queue (LSQ).

# Load Store Queue (LSQ)

- We need this unit to get the load instructions to get data as soon as possible, while the store does not write to memory until commit.

*Structure just like ROB*

- we put things in-order,
- remove also at commit in-order.

| L/S | Address | Value | C |
|-----|---------|-------|---|
|     |         |       |   |
|     |         |       |   |
|     |         |       |   |
|     |         |       |   |

# Load Store Queue (LSQ)

- We need this unit to get the load instructions to get data as soon as possible, while the store does not write to memory until commit.

|              |
|--------------|
| Iw R1, 0(R1) |
| sw R2, 0(R3) |
| Iw R4, 0(R4) |
| sw R5, 0(R0) |
| Iw R5, 0(R8) |

| L/S | Address | Value | C |
|-----|---------|-------|---|
|     |         |       |   |
|     |         |       |   |
|     |         |       |   |
|     |         |       |   |

# Load Store Queue (LSQ)

- We need this unit to get the load instructions to get data as soon as possible, while the store does not write to memory until commit.

|              |
|--------------|
| Iw R1, 0(R1) |
| sw R2, 0(R3) |
| Iw R4, 0(R4) |
| sw R5, 0(R0) |
| Iw R5, 0(R8) |

Program  
Order  
↓

| L/S | Address | Value | C |
|-----|---------|-------|---|
| L   | 100     |       |   |
| S   | 200     | 50    | ✓ |
| L   | 200     | 50K   |   |

Memory

store to load forwarding

For every load, we check for in-flight stores with same address. If no match, we go to memory, else just forward.

# Load Store Queue (LSQ)

- We need this unit to get the load instructions to get data as soon as possible, while the store does not write to memory until commit.

|              |
|--------------|
| Iw R1, 0(R1) |
| sw R2, 0(R3) |
| Iw R4, 0(R4) |
| sw R5, 0(R0) |
| Iw R5, 0(R8) |

Program  
Order  
↓

We assumed previous store's  
Value & address are  
ready. It may happen  
the store isn't complete.  
What do we do?



# Options

- In-order: Make loads and stores in-order
  - But that means say a load takes a lot of time due to a cache miss all the following load/store instructions wait.
- Wait for all Previous Store Addresses:
  - We check for if address matches with any previous store.
  - If none, load address memory,
  - if yes it matches but store is not ready we wait for it to complete.
- Go Anyway (most aggressive!)
  - When we have the address of the load, we check if it matches with any of the previous stores we wait.
  - If none, we ignore that some may match and just go for the memory.

# Recovery

- Two situations can occur:
  - Store address is some address, but not 176 (the load address).
    - In this case, the load was correct.
- What if the store address is 176?
  - In this case, the load can possibly get wrong result
  - It needs to recover!
  - Thus, after the store we check for all following load instructions and matching addresses.
  - We redo the loads.

# Most Modern Processors do Go Anyway!

- Based on the fact that most often the address will not match, this strategy is best, performance wise.



# 000 Load and Store

LOAD R3, O(RG)

ADD R7, R3 + R9

STORE R4, O(R7)

SUB R1, R1, R2

LOAD R8, O(R1)

# 000 Load and Store

LOAD R3, O(RG)

ADD R7, R3 + R9 →

STORE R4, O(R7)

SUB R1, R1, R2

LOAD R8, O(R1)

Dispatched, but cache miss

Cannot be dispatched  
cannot be dispatched

Dispatched, R1 is written in ROB (can compute in-order).  
Then this LOAD can proceed

Eventually, first load is done.

First load result is a value in R3, hence

⇒ ADD gets completed writing to R7

If store address,  $0+R7 \neq 0+R1$ , all is well.

But if they are, Load has loaded stale data.

# In order loads & stores

- The store is considered done when it knows the address and the value of R4 is ready (not when it commits).
- At this point the load can proceed!**
- It checks for address match with any prior store.
  - Thus, most instructions are Out of order, but loads and stores are in-order!
  - Clearly, not an optimal strategy.
  - Most often  $(0+R7) \neq (0+R1)$ .



→ We may needlessly delay the load  
It can result also in a cache miss, and  
increase the penalty!

# Memory Ordering Quiz

|              | In Memory | Out Memory |
|--------------|-----------|------------|
| LW R1, 0(R2) |           |            |
| SW R1, 4(R2) |           |            |
| LW R1, 0(R3) |           |            |
| SW R1, 4(R3) |           |            |
| LW R1, 0(R4) |           |            |
| SW R1, 4(R4) |           |            |

- We allow 0-0 load-store execution.
- All loads are say cache misses, needing 40 cycles.
- Stores are waiting for preceding loads to write to the registers.

# Memory Ordering Quiz

If 2<sup>nd</sup> LW say,  $(4 + R2) = (0 + R3)$   
needs to be redone.

|              | In Memory | Out Memory  |
|--------------|-----------|-------------|
| LW R1, 0(R2) | 1         | 41          |
| SW R1, 4(R2) |           | → 42 (done) |
| LW R1, 0(R3) | 2         | 42          |
| SW R1, 4(R3) |           | → 43 (done) |
| LW R1, 0(R4) | 3         | 43          |
| SW R1, 4(R4) |           | → 44 (done) |

- We allow 0-0 load-store execution.
- All loads are say cache misses, needing 40 cycles.
- Stores are waiting for preceding loads to write to the registers.

# Memory Ordering Quiz

(store cycles in an  
in-order execution)

|              | In Memory                                                   | Out Memory |
|--------------|-------------------------------------------------------------|------------|
| LW R1, 0(R2) | 1                                                           | 41         |
| SW R1, 4(R2) | In cycle 42, this store knows<br>R1 can be considered done. |            |
| LW R1, 0(R3) | 43                                                          | 83         |
| SW R1, 4(R3) | Likewise for cycle 84.                                      |            |
| LW R1, 0(R4) | 85                                                          | 125        |
| SW R1, 4(R4) | 126                                                         |            |

# Store to Load Forwarding

- Load: Which earlier store do we get the value from?
- Store: Which later load do I give my value to?
- Lets consider an example.

# LSQ Example

*oldest*



| L/S | PC     | Seq   | Addr   | Value |
|-----|--------|-------|--------|-------|
| L   | 0xF048 | 41773 | 0x3290 |       |
| S   | 0xF04C | 41774 | 0x3410 |       |
| S   | 0xF054 | 41775 | 0x3290 |       |
| L   | 0xF060 | 41776 | 0x3418 |       |
| L   | 0xF840 | 41777 | 0x3290 |       |
| L   | 0xF858 | 41778 | 0x3300 |       |
| S   | 0xF85C | 41779 | 0x3290 |       |
| L   | 0xF870 | 41780 | 0x3410 |       |
| L   | 0xF628 | 41781 | 0x3290 |       |
| L   | 0xF63C | 41782 | 0x3300 |       |

|        |      |
|--------|------|
| 0x3290 | 42   |
| 0x3300 | 1    |
| 0x3410 | 38   |
| 0x3418 | 1234 |

*youngest*



# LSQ Example

Commit of load : Writing to Reg file  
 Commit of store : Writing to cache.

oldest →

| L/S | PC     | Seq   | Addr   | Value                        |
|-----|--------|-------|--------|------------------------------|
| L   | 0xF048 | 41773 | 0x3290 | 42                           |
| S   | 0xF04C | 41774 | 0x3410 | 25 [Does not write to cache] |
| S   | 0xF054 | 41775 | 0x3290 | -1 ↗                         |
| L   | 0xF060 | 41776 | 0x3418 |                              |
| L   | 0xF840 | 41777 | 0x3290 |                              |
| L   | 0xF858 | 41778 | 0x3300 |                              |
| S   | 0xF85C | 41779 | 0x3290 |                              |
| L   | 0xF870 | 41780 | 0x3410 |                              |
| L   | 0xF628 | 41781 | 0x3290 |                              |
| L   | 0xF63C | 41782 | 0x3300 |                              |

|        |      |
|--------|------|
| 0x3290 | 42   |
| 0x3300 | 1    |
| 0x3410 | 38   |
| 0x3418 | 1234 |

→ youngest

# LSQ Example

Commit of load: Writing to Reg file  
 Commit of store: Writing to cache.

oldest →

| L/S | PC     | Seq   | Addr   | Value                        |
|-----|--------|-------|--------|------------------------------|
| L   | 0xF048 | 41773 | 0x3290 | 42                           |
| S   | 0xF04C | 41774 | 0x3410 | 25 [Does not write to cache] |
| S   | 0xF054 | 41775 | 0x3290 | -17                          |
| L   | 0xF060 | 41776 | 0x3418 | 1234                         |
| L   | 0xF840 | 41777 | 0x3290 | -17 (from L)                 |
| L   | 0xF858 | 41778 | 0x3300 | 1                            |
| S   | 0xF85C | 41779 | 0x3290 |                              |
| L   | 0xF870 | 41780 | 0x3410 |                              |
| L   | 0xF628 | 41781 | 0x3290 |                              |
| L   | 0xF63C | 41782 | 0x3300 |                              |

|        |      |
|--------|------|
| 0x3290 | 42   |
| 0x3300 | 1    |
| 0x3410 | 38   |
| 0x3418 | 1234 |

before loading  
 check if  
 address  
 matches  
 with any  
 previous  
 store. If no  
 so go to  
 memory.

youngest →

# LSQ Example

Commit of load: Writing to Reg file  
 Commit of store: Writing to cache.

oldest →

| L/S | PC     | Seq   | Addr   | Value                        |
|-----|--------|-------|--------|------------------------------|
| L   | 0xF048 | 41773 | 0x3290 | 42                           |
| S   | 0xF04C | 41774 | 0x3410 | 25 [Does not write to cache] |
| S   | 0xF054 | 41775 | 0x3290 | -17                          |
| L   | 0xF060 | 41776 | 0x3418 | 1234                         |
| L   | 0xF840 | 41777 | 0x3290 | -17 (Fwd from)               |
| L   | 0xF858 | 41778 | 0x3300 | 1                            |
| S   | 0xF85C | 41779 | 0x3290 | 0 [Does not write to cache]  |
| L   | 0xF870 | 41780 | 0x3410 | 25                           |
| L   | 0xF628 | 41781 | 0x3290 | 0                            |
| L   | 0xF63C | 41782 | 0x3300 | 1                            |

|        |      |
|--------|------|
| 0x3290 | 42   |
| 0x3300 | 1    |
| 0x3410 | 38   |
| 0x3418 | 1234 |

before loading  
 check if  
 address  
 matches  
 with any  
 previous  
 store. If no  
 so go to  
 memory.

youngest →

# LSQ Example-Commits

Commit of load : Writing to Reg File  
 Commit of store : Writing to Cache.

*oldest*

The diagram illustrates the execution of a sequence of memory operations (loads and stores) in an LSQ (Load/Store Queue). The table below tracks the operations by sequence number, PC, address, and value. Handwritten annotations show the flow of data from the LSQ to the Register File (RF) and Cache, with red lines indicating the commit of stores and green lines indicating the commit of loads.

*youngest*

**Register File**

|        |           |
|--------|-----------|
| 0x3290 | 42 - 17/0 |
| 0x3300 | 1         |
| 0x3410 | 3825      |
| 0x3418 | 1234      |

**Write to Cache**

**RF**

**Commit of store only allows write**  
 $\Rightarrow$  if more is an exception pre-exception

**Commit of load : Writing to Reg File**

**Commit of store : Writing to Cache.**

**LSQ Data:**

| L/S | PC     | Seq   | Addr   | Value |
|-----|--------|-------|--------|-------|
| L   | 0xF048 | 41773 | 0x3290 | 42    |
| S   | 0xF04C | 41774 | 0x3410 | 25    |
| S   | 0xF054 | 41775 | 0x3290 | -17   |
| L   | 0xF060 | 41776 | 0x3418 | 1234  |
| L   | 0xF840 | 41777 | 0x3290 | -17   |
| L   | 0xF858 | 41778 | 0x3300 | 1     |
| S   | 0xF85C | 41779 | 0x3290 | 0     |
| L   | 0xF870 | 41780 | 0x3410 | 25    |
| L   | 0xF628 | 41781 | 0x3290 | 0     |
| L   | 0xF63C | 41782 | 0x3300 | 1     |

# LSQ, ROB, and RS

- Issue Load/Store:
  - A ROB entry
  - A LSQ entry
- Issue Non\_Load/Store:
  - A ROB entry
  - A RS
- Execute Load/Store:
  - Compute Address
  - Produce Value
- Write Result (Broadcast)
  - Only Load. Its value is used by following instructions.
  - Store does not write, except in the LSQ
- Commit Load/Store:
  - Free ROB and LSQ entries
- Commit Store:
  - Write data to memory
- Commit Load:
  - Write data to Register File

# Quiz

sw R1, 0(R2)

lw R2, 0(R2)

Does lw access Cache or memory?

# Quiz

sw R1, 0(R2)

lw R2, 0(R2)

Does lw access Cache or memory?

No!

# Quiz

sw R1, 0(R2)

lw R2, 0(R2)

It gets its value from:

- A Result Broadcast
- A Reservation Station
- A ROB entry
- A LSQ entry.

# Quiz

sw R1, 0(R2)

lw R2, 0(R2)

It gets its value from:

- A Result Broadcast
- A Reservation Station
- A ROB entry (as store is not a register producing instruction, it does not write the value in the ROB. It is written only in LSQ for forwarding, and later commit to memory.)
- A LSQ entry

# Almost a Modern Processor!





# **Thank You!**