

# Lecture 07 - Dependences and Pipelining

기반  
설계

→→ 짧은 시간에 완료...

Jaejin Lee

Dept. of Computer Science and Engineering, College of Engineering

Dept. of Data Science, Graduate School of Data Science

Seoul National University

<http://aces.snu.ac.kr>

# Processor, CPU, and Core

- Poorly defined terms in these days
  - A processor or CPU typically refers to the physical chip (package)
- Core
  - A hardware unit that fetches instructions and executes them
  - Contains hardware components involved in executing instructions
    - ALU, FPU, (private) L1/L2 caches, etc.
- Processor or CPU
  - The combination of one or more cores with shared resources between cores and some supporting hardware
  - A processor sometimes refers to a processor that contains a single core or the core itself depending on the context

- A term used by Intel  
; 코어 밖의 하드웨어
- Hardware components that are not in the core
  - QPI controllers, last level caches (e.g., L3 cache), on-chip memory controllers, etc.

64비트 x86 CPU?  
Nowadays: El.  
2C 32G  
1A (15<sup>9</sup>)

# Compilers and Compilation Process

- A compiler is a program that automatically translates another program from some programming language to machine code

pyun 872222 812.; ; **컴파일러 번역**  
(번역)

```
#include <stdio.h>

int main()
{
    printf("hello, world\n");
}
```



# Typical Compilation Phases



# Interpreters and Interpretation Process

- An **interpreter** is a program that **takes as input** a specification of a program in some language and the **output** of the input program is its output  
*이해하는 것*
- It translates the source code into some **intermediate representation**
  - A parser in the interpreter takes a string representation of a program and produces a structural parse of the input program
  - An **evaluator** in the interpreter executes the intermediate representation
  - The interpreter may execute stored precompiled code made by a compiler
- Python, Java **bytecode interpreter** → *python → interpreter → interpreter execute*  
*bytecode* ↗ *intermediate rep*

# Compilation vs. Interpretation



# Techniques to Improve a Single Core Performance

- As VLSI technology improves, more room becomes available on a chip  
VLSI 기술 → 허가운

- Two major techniques
  - Instruction pipelines
  - On-chip caches

Very Large Scale Integrated.

(v) dependence  $\Rightarrow$  依存関係  
dependency

## Dependences

- An ordering relationship between two computations
  - Any ordering of execution that obeys all dependences will produce the same result as that of the original program

A는 예전에 B를 먼저 사용함.

이후에 A는 B의 결과를 사용함.

### Data dependences

- Flow (true) dependence
- Anti dependence
- Output dependence
- Input dependence (not really a dependence)



Res  
on

$\therefore A \rightarrow B$

CU

### Control dependences

fetch from where?  
 $\Rightarrow$  jump, if



THUNDER  
Research Group  
서울대학교 천동 연구실



# Flow Dependence

- True dependence
- Instruction P writes a memory location that instruction Q later reads (read after write)



## Anti Dependence

- False dependence (can be removed)
- Instruction P reads a memory location that instruction Q later writes (write after read)

P: ... = a;

...

Q: a = ...;



## Output Dependence

- False dependence (can be removed)
- Instruction P writes a memory location that instruction Q later writes (write after write)

기존의 location  
+ 그 뒤쓰기

P: a = ... ;

...

Q: a = ... ;



## Input Dependence

- Not really a dependence
  - For caches
- Instruction P reads a memory location that instruction Q later reads (read after read)

앞고, 뒤고

P: ... = a;

...

Q: ... = a;



앞고 메모리에서  
여기 빌드, 뒤에 힐  
쓰쓰쓰!

**//cache hit for the access to a**

→ 캐시 충돌



## Basic Blocks

- A sequence of statements that is always entered at the beginning and exited at the end without halt or possibility of branching except at the end
- Two consecutive instructions are in the same basic block if only if the execution of the first instruction guarantees the execution of the next instruction

```

read m
t0 = 1
t1 = 1
if m <= 1 goto L2
i = 2
L0: if i <= m goto L1
return t0
L1: t2 = t0 + t1
t0 = t1
t1 = t2
i = i + 1
goto L0
L2: return m

```

2 번째 실행  
basic block 01번  
? 1번의 B Block인가.  
여기서는 A.

## Identifying Basic Blocks

- The first instruction is a leader
- Any instruction that is the target of a conditional or unconditional jump is a leader
- Any instruction that immediately follows a conditional or unconditional jump is a leader
- For each leader, its basic block consists of itself and all instructions up to but not including the next leader or the end of the intermediate program



# Control-Flow Graphs

- A control-flow graph (CFG) is a directed-graph representation of all paths that might be traversed through a program during its execution
- Flow-of-control information
- A directed graph
  - Each node in the graph represents a basic block
  - Directed edges are used to represent the potential flow of control

```

read m
t0 = 1
t1 = 1
if m <= 1 goto L2
i = 2
L0: if i <= m goto L1
return t0
L1: t2 = t0 + t1
t0 = t1
t1 = t2
i = i + 1
goto L0
L2: return m
  
```



Situ &amp;

## Control Dependences

- An instruction  $Q$  has a control dependence on a preceding instruction  $P$  if the outcome of  $P$  determines whether  $Q$  should be executed or not
- An instruction  $Q$  is said to be control dependent on another statement  $P$  if and only if
  - There exists a path from  $P$  to  $Q$  such that every instruction  $I \neq P$  within the path will be followed by  $Q$  in each possible path to the end of the program, and
  - $P$  will not necessarily be followed by  $Q$ , i.e., there is an execution path from  $P$  to the end of the program that does not go through  $Q$

**P:** `if (x > 3)`  
**Q:** `a = ...;`  
**R:** `a = ...;`

$\xrightarrow{P \text{ be true (and)}}$   
 $\xrightarrow{\text{Depend}}$



Instruction Pipeline

- Pipelining is a hardware technique that increases instruction throughput (e.g., the number of instructions executed in a CPU clock cycle)
- Assume
  - Each instruction takes 3 cycles to complete
  - Each pipeline stage takes a single cycle



~~10^n - 1 + 2^n \* 2^e~~

## Pipeline Hazards

What are pipeline hazards?

- Typical five-stage pipeline

- IF: instruction fetch
- ID: instruction decode and register fetch
- EX: execute
- MEM: memory access
- WB: register write back

High clock frequency ( $> 3$ )



# Pipeline Hazards (cont'd)

- Occur when the next instruction does not execute in the following clock cycle
- Data hazards
- Structural hazards
- Control hazards

# Data Hazards

- Occur when a result is needed in the pipeline before it is available



# Resolving Data Hazards

- Stalling the pipeline

- A bubble represents a delay in execution of an instruction in the pipeline to resolve a hazard



# Resolving Data Hazards (cont'd)

- Software
  - Insert independent instructions (or no-ops)

$r_1 \leftarrow O(ho \quad op)$

```
add r1, r2, r3
add r4, r2, r1
sub r5, r3, r6
xor r5, r1, r6
```



```
add r1, r2, r3
no-op
no-op
add r4, r2, r1
sub r5, r3, r6
xor r5, r1, r6
```

Handwritten annotations:

- A blue bracket groups the two "no-op" instructions.
- A blue arrow points from the first "no-op" to the second "no-op".
- The text "Add" is written above the first "no-op".
- The text "r1 = r1" is written next to the second "no-op".
- The text "or" is written below the second "no-op".

# Resolving Data Hazards (cont'd)

- Hardware

- Stalling the pipeline (i.e., insert bubbles)
- Transparent register file
  - Resolves data hazard at the **WB** stage
  - If the register file is asked to read and write the same register in the same cycle, the register file allows the written data forwarded to the stage in which the data is needed

**C**

- First half cycle: the register is written with the data
  - Second half cycle: the register is read into the pipeline (in the ID stage)
- Data forwarding
  - Feeding output data into a previous stage of the pipeline
  - Resolves data hazard at the **EX** and **MEM** stages



# Structural Hazards

- Occur when a part of the processor's hardware is needed by two or more instructions at the same time
- Due to resource conflicts
  - A single memory unit that is accessed both in the IF stage and the MEM stage
- Resolving structural hazards



# The Program Counter

*spec' reg'ister*

- The program counter (PC) (a.k.a. the instruction pointer, the instruction address register, and the instruction counter) is a processor register that indicates where a computer is in its program sequence
- The PC is incremented after fetching an instruction, and holds the memory address of ("points to") the next instruction that would be executed



있을까 좋다.

## Control Hazards

- Occur because of branches
- The pipeline does not know the branch target until the instruction reaches the MEM stage
  - Assume PC is set in the MEM stage
- The pipeline continues fetching instructions sequentially
  - These fetched instructions cannot be allowed to execute because the programmer has diverted control



# Resolving Control Hazards

- Software

- Delayed branch

- Perform instruction scheduling into branch delay slots with instructions before the branch instructions
    - From the target address (when it is known to be taken)
    - From fall through (when it is known to be not-taken)



# Resolving Control Hazards (cont'd)

- Hardware
  - Stalling the pipeline until the branch target is known
  - Branch predictors (complicated) to prevent stalling the pipeline



- Programs written for a pipelined processor deliberately avoid branching to avoid performance degradation

→ 는 어떤 풀이  
→ 는 어떤 풀이

# In-order Execution

- Steps

- D → EX : JISPRHL
- HNU 32Y21  
△ HKU
- hkt  
high  
↓  
no block  
⇒ printed
- Fetch and decode the next instruction
  - If its operands are available, the instruction is dispatched to the appropriate functional unit
  - Otherwise, the processor stalls until they are available
  - The result is written back to register file
    - A register file is an array of processor registers in a CPU



Assume that the EX stage for load/store instructions performs memory operations (i.e., MEM)

# Out of Order Execution (OoO)

- Dynamic instruction scheduling by hardware
- Steps
  - Fetch and decode the next instruction
  - Issue it to the appropriate reservations station
  - It waits in the reservation station until its operands are available
  - The instruction is dispatched to the appropriate functional unit and executes
  - The execution completes, and the result is queued in the reorder buffer in the commit unit
  - The commit unit write results to registers and memory in program fetch order

Inst 013  
 Alt 013  
 → 01213 to RS



# Issuing and Dispatching an Instruction

- Once the instruction has passed the ID stage, we say that the instruction is issued
- When all operands are available, we say that the instruction is dispatched from a reservation station to the execution (functional) unit

작업을 넘어서는 순간 "issue 수구"  
Execution // "dispatch"

# Tomasulo's Algorithm

- IBM 360/91
    - ○ ↗
  - Single issue ↗
  - In-order issue → ↗  
in-order
  - Out-of-order dispatch
  - Out-of-order execution
    - Issue & Dispatch
  - In-order commit
- Best strategy는 적을 한 번에 구하는.

# Retirement (Graduation)

committ 5/13/27

- An instruction retires when the reorder buffer slot of an instruction is freed either
  - because the instruction commits (the result is made permanent) or
  - because the instruction is removed (without making permanent changes)

Inst committed → pipeline 품목

ex. D-opton.



# Single Superscalar Processors

Only inst. can

①

- Dynamically issue multiple instructions in each clock cycle
  - A typical superscalar processor fetches and decodes several instructions at a time
  - In-order or out-of-order issue



## VLIW Processors

- Use a long instruction word that contains a fixed number of instructions that are fetched, decoded, issued, and executed synchronously



- Static instruction scheduling by a compiler

→ word이 어떻게 됩니까?

○ 몇 개의 instruction을 fetch하는지

(word가 몇 개인가 알 수 있으면!) 그게 가능합니다.