

### Assignment-3 (A.C.A)

- ① Explain briefly the network characteristics & Hierarchical bus system

#### Network characteristics

The networks are designed with many choices like timing switching & control strategy like in case of dynamic interconnections are under program control of the multiprocessor.

#### Timing

- \* Synchronous - Controlled by a global clock which synchronizes all network activity.
- \* Asynchronous - use handshaking or interlock mechanisms for communication & especially for co-ordinating devices with speed.

#### Switching method

- \* Circuit switching - a pair of communicating devices control the path for the entire duration of data transfer
- \* packet switching - large data transfer broken into smaller pieces each of which can complete for use of the path

#### Network Control

- \* Centralized - global controller receives and acts on requests
- \* Distributed - requests handled by local devices independently

## F.P.T Hierarchical Bus System

- \* A Bus System connects various subsystems of a hierarchy of buses and subsystems connecting components in a computer.
- \* Each bus is formed with a number of signal, control, & power lines. Different buses have to perform different interconnection functions.
- \* In general the Hierarchy of bus system are packaged at different levels as depicted



IF (Interface logic)

ROC (ROM controller)

ROB (ROM Processor)

LM (Local memory)

MC (Memory controller)

CC (Communication controller)

\* Local Bus: Buses implemented on printed-circuits boards are called Local Buses.

\* On a processor board one finds a local bus which provides a common communication path among major components (chips) mounted on board.

Q) Explain the Snoopy bus protocol approach for Cache.

- \* Snoopy protocol achieves data consistency among the cache and shared memory through a bus watching mechanism
- \* In following diagram, two Snoopy bus protocols create different results consider 3 processors ( $P_1, P_2, P_3$ ) maintaining consistent copies of block X in their local cache.



a) Consistent copies of block X are in shared memory & three processor caches.



b) After write - invalidate operation by  $P_1$  ..,



- \* Using a write - Invalidate protocol, the processor  $P_i$  modifies (write) its cache from  $X$  to  $X'$  and all other copies are invalidated via the bus. Invalidated blocks are dirty meaning they should not be used.
- \* The write-update protocol demands the new blocks content  $X'$  be broadcasted to all caches via the bus.

## Backplane Bus

- A Backplane is printed circuit on which many connection are used to plug in functional boards.
- A System bus, consisting of shared signal path and utility lines

## EI Bus

Bus devices are connected to a Computer Systems through EI bus such as SCSI (Small Computer System Interface) bus. This bus is made of coaxial cables with taps connecting disk, printer and other devices to a processor through an I/O controller.

## Hierarchical Bus System

This is a multilevel three structure in which the leaf nodes are processor & their private caches (denoted by  $C_{ij}$  &  $P_j$ ) These are divided into several clusters each of them are connected through cluster bus.



3) Show the architectural characteristics of cross bus switch & multiprocessor memory system

### Crossbar Network

Crossbar network connects every input to every output through a crosspoint switch. A crossbar network is a single stage, non-blocking permutation network.

### Crosspoint switch Design

Out of  $n$  crosspoint switches, in each column of  $n \times m$  crossbar mech., only one can be connected at a time. Crosspoint switches must be designed to handle the potential contention for each memory module. A crossbar switch avoids competition for bandwidth by  $O(N)$  switches to connect  $N$  inputs to  $N$  outputs.



Each processor provides a request line, read/write line, a set address line, & a set of datalines lines to a crosspoint switch for a single column.

The crosspoint switch eventually responds with an acknowledgement when the access has been completed.

## Multipoint Memory

Since Crossbar switches are expensive and not suitable for systems with many processor or memory modules, multipoint memory modules may be used instead.

A multipoint memory module has multiple connection ports for processor (or I/O device). The memory controller in the module handles the arbitration & switching that might otherwise have been accomplished by Grouped switch.



A two-function switches can assume only two possible states namely state 0 or exchange. However a four function switch box can be any of four possible states.

A multistage network is capable of connecting any input terminal to any output terminal.

- 4) Illustrate the following terms associated with multihop network.
- i) store & forward routing at packet level.
  - ii) wormhole routing at flat level

### i) Store & forward Routing



a) store-and-forward routing using packet buffer in successive nodes

- \* Packets are the basic unit of information flow in store-and-forward network
- \* Each node is required to use a packet buffer
- \* A packet is transmitted from a source node to a destination node through a sequence of intermediate nodes
- \* When a packet reaches an intermediate node, it is stored in a buffer
- \* Then it is forwarded to the next node if the desired output channel of a packet buffer in the preceding node are both available

### 2. Wormhole Routing



b) wormhole routing using fit buffer in successive routers

- \* Packets are subdivided into smaller flits. Flit buffer are used in the hardware router attached to nodes.
- \* The transmission from the source node to the destination node is done through a sequence of routers.
- \* All the flits in the same packet are transmitted in order as inseparable components in pipelined fashion
- \* Only the header flit knows where the packet is going
- \* All the data flits must follows the header flit
- \* Flits from different packets cannot be mixed up.
- \* Flits from different packets may be forwarded to the wrong destination.

⑤ Explain the vector processing principle with compound vector processing

Vector :- A vector is a set of scalar data items, all of the same type, stored in memory

Vector processor :- A vector processor is an ensemble of hardware resources, including vector registers, functional pipeline processing elements & regular control, for performing vector operations.

Vector processing :- A vector processing occurs when arithmetic or logical operation are applied to vector. It is distinguished from scalar processing, which operates on one or one pair of data. Vector processing is faster and more efficient than scalar processing.

Vectorization: The conversion from scalar code to vector code is called vectorization.

Vectorizing Compiler: A compiler capable of vectorization is called a vectorizing compiler (vectorizer).

### Vector Instruction Types

1. Vector-Vector instructions
2. Vector-Scalar Instructions
3. Vector-Memory instructions
4. Vector-reduction instructions
5. Gather and Scatter instructions

### Compound vector processing

A compound vector function (CVF) is defined as a composite function of vector operations converted from a looping structure of linked scalar operations.

Do 10  $i = 1, N$

Load R1, X(i)

Load R2, Y(i)

Multiply R1, S

Add R2, R1

Store Y(i), R2

10 Continue

Where  $X(i)$  &  $Y(i)$   $i = 1, 2, \dots, N$  are two source vectors originally residing in memory. After the computation, the resulting vector is stored in  $Y(i)$  back to the memory.

S is immediate constant supplied to the multiply instruction.

After vectorization the above scalar SAXPY code is converted to a sequence of five vector instructions.

|                                  |                 |
|----------------------------------|-----------------|
| $M(x:x+N-1) \rightarrow V_1$     | vector load     |
| $M(y:y+N-1) \rightarrow V_2$     | vector load     |
| $S \times V_1 \rightarrow V_1$   | vector multiply |
| $V_2 \times V_1 \rightarrow V_2$ | vector add      |
| $V_2 \rightarrow M(y:y+N-1)$     | vector store    |

*x and y are starting memory addresses of the X & Y vectors, respectively;  $V_1$  &  $V_2$  are two N-element vector registers in the vector processor*

$$CVF : Y(1:N) = S \times (1:N) + V(1:N) \quad \text{or}$$

$$Y(1) = S \times (1) + V(1)$$

where index 1 implies that all vector operations involve N elements.

- \* Typical CVF for one-dimensional arrays are load, store multiply, divide logical & shifting operations
- \* The number of available vector registers & functional pipelines impose some restrictions on how many CVFs can be generated simultaneously.

6) List & explain the various approaches of Latency Hiding technique.

## Latency Hiding Techniques

### Shared Virtual Memory

Single address-space multiprocessor/multi-computer must use shared virtual memory.

### The Architecture Environment



- \* It consists of multiple multiprocessor cluster connected through scalable low latency interconnection network
- \* Physical memory was distributed among the processing nodes in various clusters. The distributed memory formed a global address space
- \* Cache Coherence was maintained using an invalidating distributed directory-based protocol. For each memory block, the directory keeps track of remote nodes caching it
- \* When write occurred, point-to-point messages were sent to invalidate remote copies of the block
- \* Acknowledgment messages were used for originating node when an invalidation was completed

- \* Two levels of local cache were used per processing node. Loads & writes were separated with the use of write buffers for implementing weaker consistency models.
- \* The main memory was shared by all processing nodes in the same cluster. To facilitate prefetching of the directory-based coherence protocols, directory memory & remote-access caches were used for each cluster.
- \* The remote-access cache was shared by all processor in same cluster.

② what is meant by Branch prediction

- \* About 15% to 20% of instruction in a typical program are branch and jump instruction including returns.
- \* Therefore - if hardware resources are to be fully utilized in a superscalar processor - the processor must start working on instructions beyond a branch, even before the branch instruction itself has completed. This is only possible through some form of branch prediction.



- \* A basic Branch prediction techniques uses a so-called two bit predictor. A two bit counter is maintained for every conditional branch instruction in the program.

- \* The two bit Counter has four possible states. These four states & possible transitions b/w these states are shown Fig. 12.15
  - \* When the Counter state is 0 or 1, the respective branch is predicted as taken; when the counter state is 2 or 3, the branch is predicted as not taken.
  - \* When the Conditional branch instruction is executed & the actual branch outcome is known, the state of respective two-bit counters is changed & broken line arrows.
  - \* When two successive predictions come out wrong the prediction is changed from branch taken to branch not taken, vice versa
- (8) What are the different dependencies existing in the instruction level parallelism

### Data Dependencies

A RAW dependence - i.e. true data dependence - will hold up the execution of the dependent instruction if the overall value required as its input operand is not available.

Operand forwarding can be added to this schema to speed up the supply of needed O/P operand as soon as its value have been completed

### WAR and WAW dependencies

Anti-dependence & O/W dependences respectively also hold up the execution of dependent instruction & create a possible pipeline stall.

## 2. Control dependences

The instruction in reorder buffer belong to branch in the program which should not have been taken there has been a mispredicted branch. Clearly then the reordered buffer should be flushed along with other elements of pipeline.

- \* Therefore the performance impact of control dependence in running program is determined by accuracy of branch prediction technique employed
- \* The reorder buffer plays no direct role in handling of control dependences

## 3. Resource Dependences

- If an instruction needs a functional unit to execute, but the unit is not free, then instruction must wait for unit to become free. Clearly no technique in the world can change that.
- \* The processor designer can aim to achieve at least this: if a subsequent instruction needs to use another functional unit which is free then the subsequent instruction can be executed out of order.

Q9 Demonstrate How the Thread level parallelism is implemented to reduce the limitation in the Exploiting Instruction level parallelism.

Ans \* One way to reduce the burden of dependences is to combine with hardware support within the processor instruction from multiple independent threads of execution

\* Such hardware support for multi-threading would provide the processor with pool of instructions in various stages of execution, which have a relatively smaller number of dependences amongst them, since the threads are independent of another

\* Let us consider once again the processor with instruction pipeline of depth eight, & width targeted. Superclass performance of four instruction completed in every clock cycle

\* Now, suppose that these instructions come from four independent threads of execution. Then on average the number of instruction in the processor at any one of time from one thread would be  $4 \times \frac{8}{4} = 8$

\* with the thread being independent of one another there is a smaller total number of data dependences amongst the instruction in processor

\* further, with control dependences also being separated into four threads, less aggressive branch prediction need.

- \* Another major benefit of hardware support is multithreading i.e. that pipeline stall are very effectively utilized.
- \* If one thread runs into pipeline stall for access to main memory, say, then another thread makes use of corresponding processor clock cycles which would otherwise wasted.
- \* Thus hardware support for multi-threading the processor must be designed to switch b/w threads either on the occurrence of pipeline stall.

Dependency that are switching threads, hardware support for multi-threading may be classified as

1. Coarse grain multi-threading
2. Fine grain multi-threading
3. Simultaneous multi-threading

(10) Relate how the Operand buffering techniques are implemented in instruction level parallelism

### Operand Forwarding

It helps in reducing the impact of true data dependences in instructions stream. Consider the following sequence

ADD R1, R2, R3  
SHL R #4, R3, R4

The result of ADD instruction is stored in destination register R3, & then shifted by eight for four bits in the second instruction with shifted value being placed R4.

ADD R1, R2, R3



SHIFTR #4, R3, R4

(Raw dependence b/w two instruction)

- \* In pipeline processor, ideally second instruction should be executed one stage-and therefore one clock cycle behind the first
- \* The second instruction above cannot be executed in one cycle behind the first
- \* When carried out in this order, clearly two data transfer operations take two cycles





- \* Since instructions execute in parallel on multiple functional units, the reorder buffer serves the function of bringing completed instructions back into a order which is consistent with program order.
- \* Note that instruction may complete in an order which is not related to program order, but must be committed in program order.
- \* At any time, program state & processor state are defined in terms of instructions which have been committed. Their results are reflected in appropriate register and/or memory locations.

- \* Entries in the reorder buffer are completed instruction which are queued in program order
- \* However, since instructions do not necessarily complete in program order we also need a flag with each reorder buffer entry to indicate whether the instruction in the position has completed

(ii) Discuss the different vector access memory schemes.

### C-Access memory Organization

The m-way low-order interleaved memory structure allows m-memory words to be accessed together in overlapped structure. This concurrent access has been known as C-access

The access cycles in various memory modules are staggered. The lower-order bits choose the module, & the high order 6 bits select the word within every module, where  $T_n = 2^6$  &  $a_{tb} = n$  is the address length.

### S-access memory Organization

The low-order interleaved memory can be rearranged to enable simultaneous access or S-access. In the method all memory modules are created simultaneously in a synchronized method.

### C/S-access memory organization

A memory organization in which C-access & S-access are combined is called C/S access. The scheme where 1 access bus are applied with m interleaved memory modules connected to every bus.

The C/S access memory as suitable for use in vector multiprocessor configurations

- \* It provides parallel pipelined access to a vector data set with high bandwidth. A particular vector cache design is required with each processor to maintain smooth data movement b/w the memory multiple vector processors.

## (12) Illustrate Processor Consistency Models.

Sln:-

- \* Every write operation can be divided into several sub-write to all memories.
- \* A read from one such memory can happen before the write to this memory completes.
- \* Therefore data can never be stable.
- \* A processor under PC can execute a younger load when an older store needs to be stalled.
- \* Read before write, read after read & write before write ordering is still preserved in this model.
- \* The processor consistency model is similar to PRAM consistency model with a stronger condition that PRAM consistency model requires all writes to same memory locations must be seen in the same sequential order by all processors.
- \* Processor consistency is weaker than Sequential consistency but stronger than PRAM consistency model.

- \* The Stanford DASH multiprocessor system implements the variation of multiprocessor consistency which is incomparable to Goodmon's definitions.
- \* All processor needs to be consistent in order in which they see writes by one processor & in the way they see writes by different processor to the same location.

(13) Define Parallel programming models.  
Discuss any two models.

Programming model → Simplified and transparent view of computer hardware/software system.

Parallel programming model is specifically designed for multiprocessors, multicomputer / vector / SIMD computers

We have 8 programming models

- 1) Shared variable model
- 2) Message-Passing model
- 3) Data parallel model
- 4) Object Oriented model
- 5) Function & Logic model

### Data Parallel Model

- \* Used in SIMD computers. Parallelism handled by Hardware Synchronization & flow control.
- \* Fortran 90 → data parallel lang.

- \* Require pre-distributed data sets

### Data parallelism

- \* This technique used in array processor (SICAS)
- \* Issue → match problem size with machine size

### Array language Extension

- \* Various data parallel ~~use~~ language used
- \* Represented by high level data types
- \* CFD for illac. 4, DAP footer for distributed array processor
- \* Target to make the number of PE's of problem size

### Function & Logic Model

- \* Functional programming Language → Lisp, Sipal & Stand 88

Logic programming languages → Concurrent Prolog & Parlog

### Functional programming model

- Should not produce any side effects
- Non Concept of storage, assignment & branching
- Single assignment & data flow language.
- Functional in nature

### Logic Programming Model

- Used for knowledge processing from large database
- Support implicity search strategy
- And parallel execution and/or parallel reduction technique used.
- Used in artificial intelligence

(14) With neat labelled diagram illustrate different parallelizing compilers.

\* Role of compiler to remove burden of optimization  
↳ generation

3 phases:

- 1) Flow analysis
- 2) Optimization
- 3) Code generation



### 1) Flow analysis

• Reveals design flow patterns to determine data &

control dependencies

• Flow analysis carried at various execution levels.

1) Instruction level  $\rightarrow$  VLSI or Superscalar processor.

2) Loop Level  $\rightarrow$  SIMD & Systolic Computer

3) Task level  $\rightarrow$  Multiprocessor

## Optimization

- Transformation capability of user programs to explore hardware
- Explore better performance
- Goal to maximize speed of code execution
- To minimize code length
- To local & global optimizations

## Parallel Code generation

Compiler directive can be used to generate parallel

Code:

- 2 optimizing compilers:
  - 1) Parafuse & Parallel 2
  - 2) PFC & Parascope

## Parafuse & Parallel 2

- Transforms sequential program of Fortran 77 into parallel programs
- Parafuse consists two programs that are encoded & packed.

## PFC AND Parascope

- Translates Fortran 77 to Fortran 90 code
- PFC package extended to PFC + Fortran parallel code generation on shared memory multiprocessor.
- PFC performs analysis in following steps below
  - 1) Inter-procedure flow analysis
  - 2) Transformation
  - 3) Dependence analysis
  - 4) Vector code Generation

15) Illustrate the Dynamic Scheduling of Pipeline using Tomasulo's Algorithm

- Tomasulo's Algorithm is another method of implementing dynamic scheduling
- This scheme is invented by Robert Tomasulo

Ex:-

MULT D F4, F2, F2  
ADD D F2, F0, F6

Contains anti-dependence since the first instruction reads from F2 & the second instruction writes F2 (a WAR hazard). However, there is no data dependence as is shown by code below

MULT D F4, F2, F2  
ADD D F8, F0, F6.

\* The anti-dependency is removed without changing the semantics of code simply by changing the F2 to an F8. However we don't have to use F8.

\* we can use any available register. In fact suppose there were some physical registers in the chip that instruction set was not aware of ie there were some extra registers

→ we can use those extra registers & leave the rest of the compiler to play with.

\* This is what register renaming is: it allows the hardware to detect the data dependence and eliminate it by storing the result of one instruction somewhere else.

Thus, the register remaining

MULT D F4, F2, F2  
ADD F2, F0, F6

The Add can execute & finish before the multiply  
starts even though working at the code  
suggests that would result in erroneous  
answers.