

1

S1: Load R1, A      /  $R1 \leftarrow \text{mem}(A)$   
 S2: Add R2, R1      /  $R2 \leftarrow R1 + R2$   
 S3: Mov R1, R3      /  $R1 \leftarrow R3$   
 S4: Store B, R1      /  $\text{mem}(B) \leftarrow R1$



- S2 **(FD)** on S1      • S2 is FD on itself
- S3 **(AD)** on S2
- S3 **(OJ)** on S1
- S2 & S4 are II

## Example 2.2 . - page 55

(2)

$$\left. \begin{array}{l} P_1: C = D \times E \\ P_2: M = G + C \\ P_3: A = B + C \\ P_4: C = L + M \\ P_5: F = G \div E \end{array} \right\} \text{Assume:}$$

- each statement needs one step to execute
- No pipelining

### Dependence graph



Note:  $I_2 \cap I_5 \neq \emptyset \rightarrow$  does not affect exploiting the parallelism

(3)



Sequential execution  
in 5 steps



Parallel execution

(assuming 2 adders are  
available per step)  
simultaneously

# Example 2.3 - page 58

Cycle 1



Cycle 2

Cycle 3

$L_i$ : Load Operations  $\rightarrow 4$

Arithmetic operations  $\rightarrow 4$

# of instructions 8

- # of instructions per cycle  
 $8/3 = \underline{2.67}$

Software Parallelism

This is a program graph.

## Example 2.3 (cont'd)

Cycle 1



Cycle 2

Cycle 3

Cycle 4

Cycle 5

Cycle 6

Cycle 7

- 2-issue processor is given
- { One load/write [one memory access] and One arithmetic (+, ×, ÷, -) simultaneously.

$$\Rightarrow \frac{8}{7} = 1.14 \text{ instructions per cycle.}$$



HARDWARE PARALLELISM

### Example 2.3 Cont'd

Attempting to match the SP !!

- dual-processors, but single-issue processors are given



Cycle 1



Cycle 2

⇒ 6 cycles are needed  
to execute 12 instructions.

Cycle 3

Cycle 4

Cycle 5

Cycle 6





(a) Fine-grain program graph before packing



زن

(b) Coarse-grain program graph  
after packing

Figure 2.6 A program graph before and after grain packing in Example 2.4. (Modified from Kruatrachue and Lewis, *IEEE Software*, Jan. 1988)

Begin

|    |          |
|----|----------|
| 1. | $a := 1$ |
| 2. | $b := 2$ |
| 3. | $c := 3$ |
| 4. | $d := 4$ |
| 5. | $e := 5$ |
| 6. | $f := 6$ |
| 7. | $g := a$ |
| 8. | $h := c$ |
| 9. | $i := d$ |

10.  $j := e \times f$   
11.  $k := d \times f$  ←  
12.  $l := j \times k$   
13.  $m := 4 \times l$   
14.  $n := 3 \times m$  ←  
15.  $o := n \times i$   
16.  $p := o \times h$   
17.  $q := p \times q$  ←  
q

End



(a) Fine grain (Fig. 2.6a)



(b) Coarse grain (Fig. 2.6b)

Figure 2.7 Scheduling of the fine-grain and coarse-grain programs. (I: idle time; shaded areas: communication delays)

## Node Duplication



Figure 2.8 Node-duplication scheduling to eliminate communication delays between processors. (I: idle time; shaded areas: communication delays)

$$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \times \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} = \begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix}$$



(b) Calculation of communication delay  $d$



(c) Fine-grain program graph

Figure 2.9 Calculation of grain size and communication delay for the program graph in Example 2.5. (Courtesy of Krutachue and Lewis; reprinted with permission from IEEE Software, 1988)

Fig 2.10 "523" → "533"

2.11 (b).