

# SIMD Supercomputer

8/8/19



(Operational model of SIMD computer)

Proc  $i$ : Processor  $i$

Mem  $i$ : Memory  $i$

Control unit has direct control on PE.

## SIMD Mlc model

$$Mlc \leftarrow M = (N, C, I, M, R)$$

N: No. of PEs

C: set of instruc. directly executed by co

I: " " " broadcasted " "

M: set of masking schemes

R: set of data routing func?

• Hence, N is no. of PEs in the mlc.

e.g. 8 Illiac has 64 PEs,

and CM-2 has 65,536 PEs.

- C is set of instruc. directly executed by CU including scalar and program flow instructions.
- I is set of instruc. broadcasted by CU to all PEs for parallel processing. These includes arithmetic, logic, data routing, masking, and other local operations. executed by each active PE over the data within the PE.
- M is the masking schemes where each mask partitions the set of PEs into enabled & disabled subsets.
- R is the set of data routing func. specifying various patterns to be set up in the interconnection Netw for interc PE comm.

### PRAM Models

↓  
~~Parallel~~ Parallel Random Access Mc.



(PRAM model of multiprocessor system)

- PRAM Model can be used for parallel algorithm development and for scalability & complexity analysis.
- An  $n$ -processors PRAM has a globally addressable memory. The shared memory can be distributed among processors or centralized at one place.
- The PCs operate on a synchronised read memory, compute, and write memory cycle.
- With shared memory, the model must specify how concurrent read & concurrent write of memory are handled. (bcz many  $P_i$  wants to read or write, i.e., a mechanism must be provided).
- There are 4 memory update options are available :-

### i) Exclusive Read (ER)

It allows atmost 1 processor to read from memory loc. in each cycle.

### ii) Exclusive Write (EW)

It allows atmost 1 processor to write into memory loc. at a time.

### iii) Concurrent Read (CR)

It allows multiple processors to read from same memory loc.

### iii) Concurrent write (CW)

- It allows multiple processors to write into same memory loc.
- Some policies are required to avoid the write conflicts

Prepare note for below topics:-

- a) EREW - PRAM Model
  - b) CREW
  - c) ERCW
  - d) CRCW
- } PRAM Variants.

09/08/19

#### a) EREW

- most restrictive model as it forbids more than one processor from reading or writing same memory cell simultaneously.

#### b) CREW

→ conflicts are avoided by mutual exclusion. Concurrent reads to same mem loc are allowed.

#### c) ERCW

exclusive read & concurrent write.

#### d) CRCW:

concurrent read & write are allowed.

In what way, we can define parallelism, is discussed here.

## i) Data dependency: (DD)

Ordering relationship b/w statements is given by data dependency graph. (DDG)

The nodes of dependency graph represents the prog. statements (instructions).

The directed edges with diff. labels represents the relation b/w diff. statements.

The data dependency graph is used to exploit ~~parallel~~ b/w parallelism of the prog.

5 classes of DD are:

### a) Flow dependency:

The statement  $s_2$  is flow dependent on statement  $s_1$ , if an execution path exist from  $s_1$  to

$$s_1 \xrightarrow{\text{path}} s_2 \quad \text{OR} \quad s_1 \rightarrow s_2$$

and atleast one o/p of  $s_1$  feeds in as IP to  $s_2$ .

### b) Anti dependency:

$s_2$  is anti dependent on  $s_1$ , if  $s_2$  follows  $s_1$  in program order and if o/p of  $s_2$  overlaps the IP to  $s_1$ .

cond<sup>n</sup> 1

cond<sup>n</sup> 2

09/08/19

$S_1 \rightarrow S_2$

discussed hence.

by data  
prog.

nts the

start ~~end~~

+ on

from  $S_1$  to  $S_2$

11P to  $S_2$ .

condP 1  
2 follows  $S_1$   
overlaps

c) O/P dependence:

$S_1 \rightarrow S_2$

2 statements are O/P dependent from  $S_1$  to  $S_2$ , if the produced (write) the same O/P value variable.

d) I/O dependence:

$S_1 \xrightarrow{I/O} S_2$

Read and write are I/O statements. I/O dependence occurs not bcz same variable is involved, but bcz the same file is referenced by both I/O statements.

e) Unknown dependence:

The dependence b/w 2 statements cannot be determined in some situations. for eg -  
→ subscript of a variable is ~~itself~~ itself subscript  
→ " does not contain the loop index variable.  
→ A variable appears more than once with subscripts having diff. coefficients of loop variable.  
→ Subscript is non-linear in loop index variable.

eg

$S_1: \text{Load } R_1, A$

$S_2: \text{ADD } R_2, R_1$

$S_3: \text{MOV } R_1, R_3$

$S_4: \text{STORE } B, R_1$

1.  $R_1 \leftarrow M[A]$  /

2.  $R_2 \leftarrow (R_1) + (R_2)$  /

3.  $R_1 \leftarrow (R_3)$  /

4.  $M[B] \leftarrow (R_1)$  /

↑ contents of  $R_1$

- $s_1$  and  $s_2$  is flow dependent bcz of registers
- $s_2$  &  $s_3$  are anti " potential conflict.

→  $s_3$  &  ~~$s_1$~~   $s_1$  is o/p dependence.

→  $s_1$  &  $s_4$  are flow dependence.

→  $s_2$  &  $s_4$  is indepen.

→  $s_3$  &  $s_4$  is flow dependent.



→  $s_2$  is flow dependent on itself bcz (R1) is I/O as well as o/p in same statement.

| <del>eg. Code fragment with I/O operation</del> | $s_1$ : Read(4), A(I) <del>Read</del> | / Real array A from file 4 /  |
|-------------------------------------------------|---------------------------------------|-------------------------------|
|                                                 | $s_2$ : process                       | / Process data /              |
|                                                 | $s_3$ : write(4), B(I)                | / write array B into file 4 / |
|                                                 | $s_4$ : close(4)                      | / close file " " /            |



## 2) Control Dependency (CD)

14/8/19

A situation where order of execution of statements cannot be determined before runtime. (because of some branch statements)

e.g. if statement of FORTRAN will not be resolved until runtime.

Diff branches of IF, may introduce or eliminate data dependencies among instruc.

### Eg 1 Control independent loop

```
DO 10 I=1,N  
      A(I)=C(I)  
      IF (A(I).LT.0)  
          A(I)=1
```

10 CONTINUE

### Control dependent loop

```
DO 20 I=1,N  
      IF (A(I-1).EQ.0)  
          A(I)=0
```

20 CONTINUE

In this, where control dependency exists, parallelism is not exploited.

- Better compilers are needed to exploit more parallelism.

## 3) Resource Dependency (RD)

It means the conflicts that arise when using shared resources like int units, floating pt units, registers & memory area.

\* If conflicting resource is ALU, it is known as ALU dependence.

\* If the conflicts involve workplace settings then it is called as interference dependence.

→ The transformation of a sequentially coded program into a parallel executable form is done either manually by preprogrammers using explicit parallelism or by compiler detecting implicit parallelism auto.

However, some prog are inherently sequential in nature.

So, they cannot be decomposed into n branches.

### H/w and S/w parallelism

#### H/w parallelism

- 1) It is built into the m/c's architecture & h/w multiplicity.
- 2) It is known as m/c parallelism.

#### S/w

- 1) It is exploited by concurrent execution of m/c lang instru. in a prog as coded by a preprogrammers or as generated by HLL compiler.

- 2) It is a func of algo, programming style & compiler optimization.

- 3) It displays what is resource utilization.

- 4) It can also indicate peak performance of processor resources.

- 3) It displays patterns of simultaneously executable operations.

- a) Grain size or granularity: It is measure of amount of computation involved in glo process.
- b) Grain: It is basic segment that is chosen for II processing.
- c) Latency: It is time measure of the comm' overhead incurred b/w m/c subsys.
- e) time required by PE to access the memory is known as memory latency. e) time required by 2 processes to synchronize with each other is called synchronization latency.

## Levels of Parallelism

Parallelism has been explored at 5 levels of processing.



developing in hdded Cpt :-

### Level-1: Instruction level / statement level parallelism

At this level of parallelism, a typical grain (or prog's segment) consists of less than 20 instruc? So, it is called fine grain parallelism.

- \* The lower the level, finer will be granularity of s/w process.

Adv:

→ At this level there will be huge amount of parallelism available. for which we need optimizing compiler.

It is compiler which automatically detects parallelism.

It converts source code to a || form which can be understood by the run-time system.

- \* Detecting instruc? level parallelism by an ordinary programmer in source code is difficult.

### Level-2: Loop level parallelism

Hence grain size will have less than 500 instruc?

If some loop operations are non-recursive & independent in successive iterations, then we can vectorize them for pipelined execution.

- \* It is difficult to parallelize recursive loops.
- \* This level of parallelism is better & finer as compared to Level-3 & Level-5.
- \* This type of grain is very suitable for || comp.

### Level-3: Procedure Level Parallelism

- At this level, a typical grain corresponds to medium size at task, procedural, subroutine or function levels.
- Grain size < 2000 instructions.
- Detection of parallelism at this level is much more difficult than at fine-grain levels.
- Programmers must restructure a program at this level. They need compilers.

### Level-4: Subprogram Level parallelism

- Typical grain corresponds to subprog or job steps.
- Grain size in thousands of instruc.
- Earlier, parallelism at this level has been exploited by algo designers or programmers & not by compiler.

### Level-5: Job / Program Level Parallelism

- At this level, grain corresponds to independent jobs or programs for a II execution.
- Grain size: from ~~1000~~<sup>million</sup> to of instruc in a single prog.  
∴ it is called coarse grain level parallelism.
- To achieve this level of prog level IIism, we need prog loaders & an OS or space-sharing sys.
- Time sharing sys explores this level of parallelism only.

## Conclusions

1. Fine grain parallelism is exploited at intra- or loop level. It requires a parallelizing or vectorizing compiler.
2. Medium grain parallelism is exploited at task or job step. It requires compilers as well as significant programming rules.
3. Coarse grain parallelism is exploited at procg. level.  
It requires effort of os & efficient algo. Message-passing multiproc have been used for medium & coarse-grain computations.

\* The finer the grain size, the higher is the potential for parallelism & higher is comm and scheduling overhead.

Fine grain provides a higher degree of parallelism but more will be "comm" overhead. This may or may not be true with coarse grain computations.

Massive parallelism is often found at fine grain levels like data parallelism on SIMD or MIMD comp.

→ Shared variable comm is often used to support fine grain & medium grain computations.

## Grain Packing?

- Introduced by Lewis in 1988 for II programming.
- The idea of grain packing is to apply fine grain first in order to achieve a higher degree of parallelism.
- Then, we combine (pack) multiple fine-grain nodes into a coarse-grain node if it can eliminate unnecessary comm<sup>n</sup> delays or reduce scheduling overhead.
- Usually, all fine-grain operations within a single coarse-grain node are assigned to same CPU for execution.
- Fine grain partition of prog<sup>n</sup> demands more IPC than that required in coarse grain partition.
- Thus, grain packing offers a trade off b/w parallelism & comm<sup>n</sup> overhead. Grain packing removes comm<sup>n</sup> delays.
- Internal delays among fine grain operations within the same coarse-grain node are negligible. This is due to fact that comm<sup>n</sup> delay is contributed mainly by inter PE delays and not by the delays within same processor.
- Hence, the choice of optimal grain size is must. It is only then we can achieve the shortest schedule for nodes on II comp.

Eg 1 Consider following program -

Begin

- |                  |                  |                                                             |
|------------------|------------------|-------------------------------------------------------------|
| 1) $a := 1$      | 2) $b := 2$      | grain size = 1<br>as there is one variable / const on       |
| 3) $c := 2$      | 4) $d := 4$      |                                                             |
| 5) $e := 6$      | 6) $f := 6$      | grain size = 2<br>as there are 2 variables / cons<br>on RHS |
| 7) $g := a * b$  | 8) $h := c * d$  |                                                             |
| 9) $i := d * e$  | 10) $j := e * f$ |                                                             |
| 11) $k := d * f$ | 12) $l := j * k$ |                                                             |
| 13) $m := u * l$ | 14) $n := 3 * m$ |                                                             |
| 15) $o := n * i$ | 16) $p := o * h$ |                                                             |
| 17) $q := p + q$ |                  |                                                             |

End.

Show that fine grain prog graph before packing.

Convert it to coarse grain prog graph after packing.

Soln Prog graph shows structure of prog. Each node is measured by no. corresponds to computational unit in prog.

Grain size is measured by no. of basic m/c cycles both processor & memory cycles that are needed to execute all operations within node.



|         |                       |
|---------|-----------------------|
| $(n,s)$ | denote?               |
| $(n,i)$ | (node-no, grain size) |
| $(z,k)$ | (lfp, delay)          |
| $(y,j)$ | (olp, delay)          |
| $(g,h)$ | lfp<br>olp            |

So, grain size reflects the no. of computations involved in program segment.

\* fine grain nodes have a smaller grain size & coarse grain nodes have larger grain size.



(Fine grain parallelism)

Here,

- Nodes 1, 2, 3, 4, 5 and 6 are data fetch operations from memory that takes one cycle.
- All other nodes (7 to 17) are CPD operations, each requiring 2 cycles to complete.

→ for developing embedded software



(coarse grain parallelism)

Here,

→ we do grain packings which forms coarse grain node with larger grain sizes (from 4, to, 8).

→ Node (A, 8) is combination of nodes (1, 1), (2, 1),  
 $(3, 1)$ , (4, 1), (5, 1), (6, 1) & (1, 2) of previous fig.

→ Grain size of node-A is summation of all grain sizes  
of all these nodes i.e.

$$1+1+1+1+1+1+2=8$$

So, we get (A, 8).

→ Similarly, others are done.

Eg-2 Draw dependence graph for foll.

a)  $S_1 : A, B+D$

$S_2 : C \rightarrow A * 3$

$S_3 : A, A+C$

~~S4~~  $E = A/2$



- b)
- $$S1 : x = \sin(y)$$
- $$S2 : z = x + w$$
- $$S3 : y = -2.5 * w$$
- $$S4 : x = \cos(z)$$



Cg 3 Consider FORTRAN Code:

```

DO 10 I = 1, N
S1: A(I+1) = B(I-1) + C(I)
S2: B(I) = A(I) * K
S3: C(I) = B(I) - I
  10 Continue.
  
```

Determine data dependencies of this.



- Cg 4
- $$S1: \text{load } R1, 1024$$
- $$S2: \text{load } R2, M(10)$$
- $$S3: \text{Add } R1, R2$$
- $$S4: \text{Store } M(1024), R1$$
- $$S5: \text{Store } M(1024), 1024.$$

$I(R_i)$  - content of  $R_i$

Memory  $(10)$  contains 64 initially.

→ For developing embedded software

- a) Draw dependence graph to show all dependencies



Prepared

i) New size

ii) Graph:

switching

iii) Node de

a node

iv) Dictionary

→ A ref

→ A add

to rule

v) Parallel

data block

→ A low

complex

pair of

vi) Bisection

edges that

into 2 half

→ The high

require

data set

- Q. 5  
Ans  
Sols  
are  
the  
long  
for  
com  
  
a) b) Are there any resource dependences of only on copy of each functional unit in available in CPU?

→ Yes, there are storage dependences b/w instruc pairs - S2, S3 & S4, S5.

→ There is a resource dependence b/w S1 & S2 on load unit and another b/w S4 & S5 on store unit.

→ There is an ALU dependence b/w S3 & S4 & a storage dependence b/w S1 & S5.

- c) Repeat above for full code:

S1 : Load R1, M[100]

S4 : Add R2, R1

S2 : Move R2, R1

S5 : Store M[100], R1

S3 : Inc R1



## Properties of Nlw

- i) Nlw size: No. of nodes present in the graph.
- ii) Graph: A graph of a nlw consists of nodes that represent switching pts and edges that represents comm links.
- iii) Node degree (d): The no. of edges or links that are incident on a node is called node degree. ( $\text{node deg} = \text{in deg} + \text{out deg}$ )
- iv) Diameter (D):
  - d reflects no. of no. ports required for node.
  - d should be kept as constant as small as possible to reduce dev cost & to achieve modularity.
- v) Diameter (D): The diameter of a nlw is the largest dist b/w 2 nodes.
- vi) A low diameter is better bcz it puts lower bound on complexity of an algo. requiring comm b/w arbitrary pair of nodes.
- vii) Bisection width (b): It is defined as the min. no. of edges that must be removed in order to divide the nlw into 2 half.
- viii) The higher bisection width is better bcz in algo requiring large amount of data movement, the size of data set divided by "b" puts a lower bound on the

## developing embedded Software

complexity of "algo.

- (iii) No. of edges per node: A node represents a PE and an edge represents comm path b/w 2 PE. So, it represents the no. of comm paths per processor.
- (iv) Max edge(path) length: It is based on the nodes(PE) and edges(path) represent a 3D space Nlw so that this max edge length is a constant, independent of Nlw size.

## Routing

A data routing Nlw is used for inter-PE data comm. There are data routing func (DRF) that are implemented on an inter-PE routing Nlw.



(Role of DRF)

## Some of DRFs

1. Permutations
2. Perfect Shuffle & exchange
3. Hypercube routing func
4. Broadcast & multicast

## I) Permutation

If we have small  $t$  objects, then there are  $t!$  permutations by which  $t$  obj can be reordered.

We use a cycle notation to specify a permutation func.

The permutation

$\pi = (a,b,c), (d,e)$  stands for

$a \rightarrow b, b \rightarrow c, c \rightarrow a, d \rightarrow e, e \rightarrow d$  in a circular fashion.

Also note that the cycle  $(a,b,c)$  has a period of 3 and the other  $(d,e)$  has period of 2.

So, total no. of permutation,  $\pi$  has a period of

$$3 \times 2 = 6$$

i.e.  $\pi$  has  $3 \times 2 = 6$  period.

Now, if we apply permutation 6 times, then this many times data routing can be performed and Identity mapping,  $I = (a), (b), (c), (d), (e)$  is obtained. Permutations can be implemented by using a crossbar switch, a multistage n/w, a shifting or broadcast operations.

→ permutation capability of a n/w is often used to indicate data routing capability.

When  $n$  is large, permutation speed often dominates the performance of data routing n/w.

### 2) Perfect shuffle and exchange

In 1971, Harold Stone suggested that if we are given a binary string as

$$a_0 a_{n-1} \dots a_2 a_1 \text{ then}$$

$$\text{shuffle } (a_0 a_{n-1} \dots a_2 a_1) = (a_{n-1} a_{n-2} \dots a_2 a_1 a_n)$$

eg shuffle (001) = 010

shuffle (101) = 011

- \* The shuffle transform is a left shifting of binary strings by 1 bit

### 3) Hypercube routing function



We consider a 3D binary cube as shown in fig.

Any node address is 3 bit long ( $c_2, c_1, c_0$ )

Since, there are 3 bits in node add, we can have 3 routing func.  $\Theta$ ,

These 3 routing func' will be on the basis of the routing of 3 bits i.e. either c<sub>0</sub> or c<sub>1</sub> or c<sub>2</sub>.

### Case-I : Routing on c<sub>0</sub> (LSB)



### Case-II : Routing on c<sub>1</sub> (middle bit)

20/8/19

This type of routing is possible iff the middle bit i.e. c<sub>1</sub> differs on their node address.



### Case-III : Routing of c<sub>2</sub> (MSB)

This type of routing is possible iff MSB bit i.e. c<sub>2</sub> differs on their node add.



\* In general, an n-dimensional hypercube will have n routing func', defined by each bit in n-bit add.

These data exchange can be used in routing msg in hypercube multicomps.

#### 4) Broadcast and Multicast

Broadcast means from one node data may be send simultaneously to all other nodes in the nw.

Multicast means from many nodes data may be sent to all other nodes in the nw.

#### Factors affecting performance of interconn. Nw

→ Functionality: How the nw supports the func<sup>n</sup> like data routing, interrupts etc.

<sup>max</sup> <sup>NW</sup>

→ Latency: it means <sup>worst case</sup> time delay for a unit message to be transferred through the nw.

<sup>min</sup> <sup>max</sup> → Bandwidth: it means the max? data transfer rate in a ~~unit~~ nw. (unit MB/s or GB/s)

<sup>min</sup> → H/w cost: it refers to h/w cost like wires, switches, connectors etc.

<sup>max</sup> → Scalability: it means ability of nw to expand. with a scalable performance with increasing m/c resources.

## Mesh Netw

A 2D ( $3 \times 3$ ) mesh netw is shown below.



The Illiac-IV, DAP, 600 multiprocessors, CM-2 & Intel Paragon use diff types of mesh archt.

In general, a  $K$ -dim mesh has  $n^k$  nodes

$n$  - # nodes in each direction with each node

having an internal node degree of  $2k$ .

Dally's J-mlc in 3D-mesh.



(Illiac IV mesh)

Properties:

node deg,  $d=4$

diameter,  $D = 2(\sqrt[3]{n}-1)$  where  $(\sqrt[3]{n})$  is mesh size &  $\sqrt[3]{n}$

no. of links,  $L = 2n - 2\delta$

Bisection width,  $b = \delta$

→ for developing embedded software

For grid mesh, properties are:

$$d = 4$$

$$\Delta = \delta - 1$$

$$l = 2n$$

$$b = 2\delta$$

- data process
- transform

③

### Torus topology

- variant of mesh archt.
- It combines ring & mesh & extends to higher dimensions.
- has ring connec<sup>n</sup> along each row & along each column of array.
- a  $(n \times n)$  binary torus has node deg of 4.
- symmetric topology.
- Properties:

$$d = 4$$

$$\Delta = 2 \left\lfloor \frac{\pi}{2} \right\rfloor$$

$$l = 2n$$

$$b = \infty \quad (\infty = \sqrt{n})$$



torus of size  $(3 \times 3)$

### Systolic Arrays

- electronic ckt.
- It is so called bcz data from memory unit is sent in rhythmic fashion through an array similar to way in which human heart regularly pumps blood via arteries.

- data processed by PEs
- transformed ILP as returned to memory.



(A symbolic GA arch.)

Primary characteristics of GA are :-

- flow of data is rhythmic & regular.
- data can be bidirectional
- each cell or PE performs identical operation.
- processing time of all cells are equal.
- comm' b/w cells is serial & periodic.
- individual cells are connected to nearest neighbours.
- cells except those at boundary of array do not communicate with outside world.



(GA topology)

→ for developing embedded sys

### Adv of SAS

- deg of pipelining is high
- simple I/O system
- high speed & low cost sys.

- synchronised multiprocess
- modular design

③

### Disadv of SAS

- signal delays result in global synchronization limits
- higher BW is required for inter PE comm.
- poor run time fault tolerance.

### Chordal Ring

- If we incr node degree from 2 to 3 or 4, we get 2 chordal rings.
- more the links are added, higher is node degree & shorter is n/w diameter.



### Barrel Shifter

- It is obtained from ring only by adding extra links.
- A node is connected to another node if they are at dist of power of 2.

#### Properties:-

Nodo size,  $N = ?$

node degree,  $d = ?$

Diameter,  $D = ?$

### Hypercube

- can be of diff we define it
- symmetric
- mag transit when two nod then mag tra
- $\approx$  no. of in hyper transfer

embedding is  
to another  
hypercube  
ring

since, hyper  
can be taken

## Hypercube

- can be of diff dimension.  
we define its " " as node degree of each node
- symmetric topology. So, node degree is same for all nodes.
- msg transit in hops b/w such node pairs.  
When two nodes or processors are not directly connected  
then msg transit b/w them in multiple hops.
- max<sup>n</sup> no. of hops necessary under the worst case for  
n-dim hypercube is n.
- msg transfer can be done either manually or auto.

Embedding is defined as process of transformation of one  
N/w to another N/w.

e.g. hypercube can embed a mesh

" " " tree  
ring " embedded in torus (called perfect embedding)

\* since, hypercube is symmetric archi, any processor  
can be taken as root of tree.



Hyperrcube  
embedding a mesh



hypercube embedding tree.



- Quality of embedding is indicated by term dilation
- Dilation is defined as max<sup>n</sup> no. of links in embedding  
n/w corresponding to one link in embedded n/w

## Dynamic Parallel Interconn Netw

### T. Digital bus

- It is collection of wires & connectors for doing comm among PEs, memories & I/O devices. These devices are attached to bus.
- Bus contention may arise when multiple reqt are active. So, bus arbitration logic must be able to allocate or deallocate bus service reqt one by one. That is why this bus is also called contention bus or time sharing bus among multiple functional modules.
- not costlier to implement
- Disadv: limited BW.



(Shows a multiprocessor sys with bus topology)

How this bus is practically implemented?

→ The sys bus is often implemented on backplane of a PCB (printed ckt board).

→ Other boards for processors, memories or device interface are plugged into backplane board via connection or cables.

- The active devices (or master devices) like PEs etc. generate request to address memory. The passive devices (or slave devices) like memories or peripherals respond to request.
- This digital bus is used on time sharing basis.
- Issues like bus arbitrarion, interrupt handling, coherence protocols & transaction processing must be handled.

## II. Switch Modules / Switches

- $(m \times n)$  switch module
- $m = \text{IP}$                           eg.  $(4 \times 4)$ ,  $(8 \times 8)$ , ~~32~~ $(4 \times 8)$
- $n = \text{OP}$ .

- binary switch ( $m = n = 2$ )

- practically,  $m$  &  $n$  are powers of 2.

$$m = n = 2^k$$

| Switch sizes | legitimate states | Permutation conn? |
|--------------|-------------------|-------------------|
|--------------|-------------------|-------------------|

|              |                   |        |
|--------------|-------------------|--------|
| $2 \times 2$ | $4 (m = 2^2 = 4)$ | 2      |
| $4 \times 4$ | 256               | 24     |
| $8 \times 8$ | 16,777,216        | 40,320 |
| $n \times n$ | $n^n$             | $n!$   |

- Hence, each up one to one & one to many mappings are allowed but many to one mappings are not allowed due to conflicts at OP terminal.

## Dynamic Network Switches (DNW)

1. Single stage NW
2. Multistage NW

### 1. Single stage NW

- It is switching NW with

$N$  IIP switches &  $N$  OIP switches  
( $I \leq N$ )                                    ( $O \leq N$ )

- Each IS is  $1:D$  mux.

Each OS is  $D:M:1$  MUX

where  $1 \leq D \leq N$

$1 \leq M \leq N$

- eg circulant ( $D=M=N$ ).

- Single-stage NW also called as recirculating NW.

- higher is NW connectivity, less no. of recirculations.

(Single stage Interconnect  
NW - a conceptual view)

### 2. Multistage NW

- also called MINS

- used in SIMD & MIMD comp.

- uses multiple ( $a \times b$ ) switches in each stage.

- $a \times b$  switches

$a = IIP$

$b = OIP$

eg:  $(i \leftarrow 1 \text{ to } n)$ : Interstage connection

eg:  $(i \leftarrow 1 \text{ to } n)$ : stages from 1 to  $n$ .

# Pipelining

To increase throughput of a system, pipelining is used.

Let an instruction has 2 operations

Fetch(F)      } let them take 1 CPU cycle  
 & execute(E)    }

a) If we execute sequentially,



Let there are  $n$  instructions

∴ no. of clk cycles needed =  $2n$  (no. of stages)

b) On pipelining.



interstage buffer

used to synchronise b/w the 2 operations so that no conflict occurs

Let  $n = 4$

|                | 1              | 2              | 3              | 4              | 5              | 6 | 7 |
|----------------|----------------|----------------|----------------|----------------|----------------|---|---|
| I <sub>1</sub> | F <sub>1</sub> | E <sub>1</sub> |                |                |                |   |   |
| I <sub>2</sub> | -              | F <sub>2</sub> | E <sub>2</sub> |                |                |   |   |
| I <sub>3</sub> | -              | -              | F <sub>3</sub> | E <sub>3</sub> |                |   |   |
| I <sub>4</sub> | -              | -              | -              | F <sub>4</sub> | E <sub>4</sub> |   |   |

# clk cycles  
needed = 5.

8.19  
Let instruc has 4 segments

fetch (F)  
decode (D)  
execute (E)  
write back (W)

|                | 1              | 2              | 3              | 4              | 5              | 6 | 7 | 8 |
|----------------|----------------|----------------|----------------|----------------|----------------|---|---|---|
| I <sub>1</sub> | F <sub>1</sub> | D <sub>1</sub> | E <sub>1</sub> | W <sub>1</sub> |                |   |   |   |
| I <sub>2</sub> | F <sub>2</sub> | D <sub>2</sub> | E <sub>2</sub> | W <sub>2</sub> |                |   |   |   |
| I <sub>3</sub> |                | F <sub>3</sub> | D <sub>3</sub> | E <sub>3</sub> | W <sub>3</sub> |   |   |   |
| I <sub>4</sub> |                | F <sub>4</sub> | D <sub>4</sub> | E <sub>4</sub> | W <sub>4</sub> |   |   |   |
| I <sub>5</sub> |                | F <sub>5</sub> | D <sub>5</sub> | E <sub>5</sub> | W <sub>5</sub> |   |   |   |

# interstage buffer required for 1 instruc = 3

$$\text{Speed up} = \frac{\text{total execution time in non-pipeline}}{\text{in pipeline}}$$

(26/08/19)

a An instruc pipeline consists of 3 stages

fetch, D, E, W.

An 5 instruc. in a certain instruc. sequence need these

strategies for the # cks as shown.

Find no. of cks required to perform instruc.

|    | F | D | E | W |
|----|---|---|---|---|
| ✓1 | 1 | 2 | 1 | 1 |
| ✓2 | 1 | 2 | 2 | 1 |
| ✓3 | 2 | 1 | 3 | 2 |
| ✓4 | 1 | 3 | 2 | 1 |
| ✓5 | 1 | 2 | 1 | 2 |

|   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----|
| 1 | F | D | D | E | W |   |   |   |   |    |    |    |    |    |    |
| 2 | F | - | D | D | E | E | W |   |   |    |    |    |    |    |    |
| 3 | F | - | D | - | D | E | E | E | W |    |    |    |    |    |    |
| 4 | - | F | - | D | D | D | - | E | E | W  |    |    |    |    |    |
| 5 | - | F | - | - | - | D | D | - | E | W  | W  |    |    |    |    |

## Hazards

Control/  
Instruction

Data

Structural

↑  
stall.

go bodge

\* Any cond' that causes pipeline to stall is called hazard.

## Data Hazards

Parti. data when needed, is not available at that time - is cause of this hazard.

It is a cond' in which either source or dest'n operands of an instruc. are not available at the time expected in the pipeline.

Due to this some operations has to be delayed & the pipeline stalls.

Control/Instruc

This hazard  
unavoidable  
(eg. It may  
be forced  
to be fetched)

Structural

The cond'n  
gives this  
the next com  
stage).

If the instruc  
then only one  
Hence, it is  
.., many pro  
and thus de

## Control / Instruction Hazards

This hazard is a cond' in which delay occurs due to the unavailability of instruc?

e.g It may occur due to a cache miss, requiring the instruc. to be fetched from memory.

## Structural Hazards

This cond'n occurs when 2 instruc? require the use of a given bus at the same time.

The most common case is to access memory. (execute or write stage).

If the instruc. & data reside in the same cache unit then only one instruc. can proceed, the other will wait. Hence, it is delayed.

∴, many processors use separate instruc & data caches to avoid this delay.