

USN 

|  |  |  |  |  |  |  |  |
|--|--|--|--|--|--|--|--|
|  |  |  |  |  |  |  |  |
|--|--|--|--|--|--|--|--|

**NMAM INSTITUTE OF TECHNOLOGY, NITTE**  
(An Autonomous Institution affiliated to VTU, Belagavi)

**Seventh Semester B.E. (CSE) (Credit System) Degree Examinations**

November – December 2017

**14CS702 – ADVANCED COMPUTER ARCHITECTURE**

Duration: 3 Hours

Max. Marks: 100

**Note: Answer Five full questions choosing One full question from each Unit.**

**Unit – I**

- |    |                                                                                                             | Marks | BT* |
|----|-------------------------------------------------------------------------------------------------------------|-------|-----|
| 1. | a) Define and discuss Amdahl's law showing the processor performance equation.                              | 7     | L*1 |
|    | b) Illustrate the concept of concurrent and parallel execution with neat diagrams.                          | 7     | L5  |
|    | c) Discuss the types and levels of parallelism that can be exploited while designing computer architecture? | 6     | L3  |
| 2. | a) List and explain in few sentences the different quantitative principles involved in computer design?     | 5     | L2  |
|    | b) Explain different classes of computers.                                                                  | 5     | L2  |
|    | c) Explain two categories of Parallel Computers with neat block diagrams.                                   | 10    | L4  |

**Unit – II**

- |    |                                                                                                                                                                                                                                                             |    |    |
|----|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|----|
| 3. | a) Define Pipelining in the context of computer architecture design mentioning the advantages. Discuss five stage MIPS pipeline data path with a block diagram.                                                                                             | 10 | L4 |
| b) | For the following reservation table of a nonlinear pipeline find the minimal average latency (MAL) for a collision free scheduling and calculate the efficiency of the pipeline taking into account all the necessary steps involved in the design process. |    |    |

|       | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|-------|---|---|---|---|---|---|---|---|
| $S_1$ | X |   |   |   |   | X |   | X |
| $S_2$ |   | X |   | X |   |   |   |   |
| $S_3$ |   |   | X |   | X |   | X |   |

10 L6

- |    |                                                                                                                                                                 |    |    |
|----|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|----|----|
| 4. | a) List and explain the types of hazards encountered while pipelining. Discuss three generic types of hazard that depend on the result of prior instruction.    | 10 | L3 |
| b) | Explain any 3 ways to improve the performance of a pipeline. Discuss the reservation table of a linear and a non-linear pipeline.                               | 7  | L2 |
| c) | For a k-stage pipeline and 'n' tasks give the equations of (i) total time to process 'n' tasks in a pipeline, (ii) speedup factor and (iii) pipeline throughput | 3  | L1 |

P.T.O.

**Unit – III**

5. a) Explain loop level parallelism? Name the technique used to implement loop level parallelism? Explain the technique taking an example of your own. 10  
 b) Define Tomasulo's Algorithm and discuss the importance of different components involved in Tomasulo's Algorithm with a neat block diagram? 10
6. a) What is the significance of branch prediction? Discuss four dynamic branch prediction techniques with a block diagram. 10  
 b) Explain with a neat block diagram the technique of ROB driven hardware based speculation? 10

**Unit – IV**

7. a) Discuss the cache coherence problem? With a block diagram explain different ways (scenarios) in which the cache data becomes inconsistent? 10  
 b) Explain multi-stage networks with a general block diagram? Also discuss the concept of Multi-stage Omega network with a block diagram? 10
8. a) Discuss the working model of Snoopy Bus Protocol? Explain Write-Through Cache and Write-Back Cache Protocols with transition diagrams? 10  
 b) Explain static and dynamic interconnection networks with a neat diagram? List and draw any four specific dynamic interconnection networks. 10

**Unit – V**

9. a) With an example bring out the concept of software pipelining? Give the performance graph of software pipelining and loop unrolling? 10  
 b) Write a brief note on hardware support using predication? Explain the four methods of hardware support for compiler speculation? 10
10. a) Define and give merits of VLIW architecture? Explain the concept of VLIW taking an example. 10  
 b) How loop level parallelism is detected using dependencies? Give example for detecting and eliminating dependencies within the loop? 10

**BT\* Bloom's Taxonomy, L\* Level**

\*\*\*\*\*



USN

|  |  |  |  |  |  |  |  |  |
|--|--|--|--|--|--|--|--|--|
|  |  |  |  |  |  |  |  |  |
|--|--|--|--|--|--|--|--|--|

**NMAM INSTITUTE OF TECHNOLOGY, NITTE**

(An Autonomous Institution affiliated to VTU, Belagavi)

**Seventh Semester B.E. (CSE) (Credit System) Degree Examinations****Supplementary Examination – July 2017****13CS702 – ADVANCED COMPUTER ARCHITECTURE**

Duration: 3 Hours

Max. Marks: 100

**Note: Answer Five full questions choosing One full question from each Unit.****Unit – I****Marks BT\***

- a) State and explain Amdahl's law. Suppose that we want to enhance the processor used for web serving. The new processor is 10 times faster on computation in the web serving application than the original processor. Assuming that the original processor is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overall speedup gained by incorporating the enhancement? 8 L4
- b) Describe UMA and NUMA models for multiprocessor systems with neat diagrams. 8 L2
- c) Explain concurrent execution models. 4 L2
- a) Explain different classes of computers. 6 L2
- b) Distinguish between concurrent and parallel execution models. 8 L4
- c) Describe levels of available functional parallelism. 6 L2

**Unit – II**

- a) Give the structure of 5 stage MIPS pipeline, Explain the operations performed by every pipe stage for the execution of instruction L.D F0,0(R1) 8 L\*4
- b) Suppose that data references constitute 40% of the mix, and that the ideal CPI of the pipelined processor, ignoring the structural hazard, is 1. Assume that the processor with the structural hazard has a clock rate that is 1.05 times higher than the clock rate of the processor without the hazard. Disregarding any other performance losses, is the pipeline with or without the structural hazard faster, and by how much? 6 L5
- c) Explain the 3 types of data hazards with suitable examples 6 L2
- a) Calculate the minimal average latency and the efficiency of the nonlinear pipeline with the following reservation table.

1 2 3 4

|    |   |   |   |
|----|---|---|---|
| S1 | X |   | X |
| S2 |   | X |   |
| S3 |   | X |   |

- b) List the RAW data hazards among different instructions given below. Determine if any of these data hazards can be overcome using forwarding technique. Give reasons if a particular data hazard cannot be overcome even with forwarding. 8 L3

i1: lw r1, 0(r2)  
i2: sub r4,r1,r6  
i3: and r6,r1,r7  
i4: or r8,r1,r9

- i) Explain control hazards in a 5 stage pipeline and suggest solutions to overcome the control hazard. 6 L5

6 L2

- Unit – III
5. a) With suitable examples, explain the functioning of the following branch predictors:
- 2-bit predictors
  - Correlating predictors
- b) In a processor having Tomasulo-based dynamic scheduling and speculation (using ROB) feature, determine the sequence of clock cycles in which the write-back and commit operations are performed for the following instruction sequence.

8 L2

| Instruction | Operands |
|-------------|----------|
| DIV         | R2,R3,R4 |
| MUL         | R1,R5,R6 |
| ADD         | R3,R7,R8 |
| MUL         | R1,R1,R3 |
| SUB         | R4,R1,R5 |
| ADD         | R1,R4,R2 |

The latencies for the functional units are as follows:

| Functional units | Latencies |
|------------------|-----------|
| add              | 2         |
| multiply         | 10        |
| divide           | 40        |
| load             | 2         |

- The number of load buffers is 2, number of FP add reservation stations is 3, and number of FP multiply reservation stations is 2.
- c) Write a short note on VLIW processors
6. a) Show the following loop unrolled so that there are two copies of the loop body. Assume that the number of loop iterations is multiple of 2. Eliminate any obviously redundant computations and do not reuse any of the registers.

8 L5  
4 L1

Loop: L.D F0,0(R1)  
 ADD.D F4,F0,F2  
 S.D F4,0(R1)  
 SUBI R1,R1,#8  
 BNEZ R1, Loop

Consider the following intervening latencies between two dependent instructions:

| Instruction producing result | Instruction using result | Latency |
|------------------------------|--------------------------|---------|
|------------------------------|--------------------------|---------|

|            |            |   |
|------------|------------|---|
| FP ALU op  | FP ALU op  | 3 |
| FP ALU op  | SD         | 2 |
| LD         | FP ALU op  | 1 |
| LD         | SD         | 0 |
| Int ALU op | Branch     | 1 |
| Int ALU op | Int ALU op | 0 |

- b) Discuss how the Branch Target Buffer increases the instruction fetch bandwidth.
- c) Draw the block diagram of basic Tomasulo based processor and explain the three steps involved in the execution of an instruction in this processor.

8 L3  
6 L4  
6 L2

**13CS702**

**Supplementary – July 2017**

**Unit – IV**

- |    |                                                                                                                      |    |    |
|----|----------------------------------------------------------------------------------------------------------------------|----|----|
| 7. | a) Explain any three static connection networks. Explain the factors affecting network performance.                  | 10 | L2 |
|    | b) Describe the three states of a full map directory and explain how eviction process occurs in a limited directory. | 10 | L2 |
| 8. | a) Explain snoopy bus protocols with suitable state transition diagrams.                                             | 10 | L2 |
|    | b) Explain cache inconsistencies with suitable diagram.                                                              | 10 | L2 |

**Unit – V**

- |     |                                                                                                                                                                                                                                           |    |    |
|-----|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|----|
| 9.  | a) What is VLIW? What is the responsibility of compiler for VLIW processor?                                                                                                                                                               | 4  | L1 |
|     | b) Consider a loop like this<br><pre>for(i=1;i&lt;=100;i=i+1) {     a[i]=a[i]+b[i]; /*s1*/     b[i+1]=c[i]+d[i]; /*s2*/ }</pre><br>i) What are the dependences between S1 and S2?<br>ii) If loop is not parallel how to make it parallel? | 6  | L3 |
|     | c) Briefly explain the IA-64 architecture.                                                                                                                                                                                                | 10 | L2 |
| 10. | a) Compare loop unrolling and software pipelining with example.                                                                                                                                                                           | 8  | L4 |
|     | b) Explain functional units and instruction issue of Itanium 2 processor.                                                                                                                                                                 | 6  | L2 |
|     | c) Write short note on copy propagation and tree height reduction.                                                                                                                                                                        | 6  | L4 |

**BT\* Bloom's Taxonomy, L\* Level**

\*\*\*\*\*

# NMAM INSTITUTE OF TECHNOLOGY, NITTE

(An Autonomous Institution affiliated to VTU, Belagavi)

## Seventh Semester B.E. (CSE) (Credit System) Degree Examinations

November - December 2016

Duration: 3 Hours

### 13CS702 – ADVANCED COMPUTER ARCHITECTURE

Max. Marks: 100

Note: Answer Five full questions choosing One full question from each UNIT

#### Unit – I

1. a) Define principle of locality. Marks 2 BT\* 2 L\*2  
 b) Suppose that following measurements are made:

Frequency of FP operations = 25%

Average CPI of FP operations = 4.0

Average CPI of other instructions = 1.33

Frequency of FPSQR = 2%

CPI of FPSQR = 20

Assume that the two design alternatives are to decrease the CPI of FPSQR by 2 or decrease the average CPI of all FP operations to 2.5. Compare these two design alternatives using processor performance equation.

- c) With a neat block diagrams explain UMA, NUMA and COMA models for multiprocessor system. 8 L4  
 2. a) State and explain Amdahl's law. Also represent the speedup gained equation. 10 L4  
 b) List and explain Flynn's classification of parallel computer architectures. 6 L2  
 c) Describe main aspects of scheduling policy in concurrent execution. 4 L2  
 d) What are synthetic benchmarks? 8 L2  
2 L2

#### Unit – II

3. a) Discuss how control hazard affects the performance in a pipelined processor. Explain how the control hazard can be overcome by scheduling the branch delay slot. 8 L5  
 b) For the following reservation table of a nonlinear pipeline find the minimal average latency (MAL) for a collision free scheduling and the efficiency of the pipeline.

|    | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|----|---|---|---|---|---|---|---|---|
| S1 | x |   |   |   |   | x |   | x |
| S2 |   | x |   | x |   |   |   |   |
| S3 |   |   | x |   | x |   | x |   |

- c) Explain the RAW and WAR hazards with suitable examples. 8 L3  
4 L2

4. a) Draw the diagram of classic 5 stage MIPS pipeline. Explain the operations performed in each stage of the pipeline for the execution of R, I, and J types instructions. 10 L4  
 b) An unpipelined processor takes 6 ns to work on one instruction. The pipelined version of the processor has 6 stages with the following lengths: 1.0ns; 0.8ns; 0.4ns; 1.2ns; 1.3ns; 1.3ns. It then takes 0.3 ns to latch its results into latches. Answer the following, assuming that there are no stalls in the pipeline.  
 i) What are the cycle times in both processors?  
 ii) How long does it take (in nano-seconds ) to finish one instruction in both processors? (Note : Ignore the initial fill time in the pipelined processor)

- iii) What is the speedup achieved by the 6 stage pipeline with respect to 'unpipelined processor?
- iv) How long does it take (in nano-seconds) to finish 1000 instructions in both processors? (Note : Do not ignore the initial fill time in the pipelined processor)
- c) Derive an expression for the overall speedup of a pipelined processor.

5 L4

**Unit – III**

5. a) Unroll the following loop such a way that there are four copies of the loop body. Assume that the number of iterations in the given loop is multiple of 4:

```

Up: L.D F0, 0(R1)
      ADD.D F4,F0,F2
      S.D F4, 0(R1)
      SUBI R, R1,#8
      BNEZ R, Up
    
```

Given- The intervening latencies between two dependent instructions is as follows:

| Instruction producing result | Instruction using result | Latency |      |
|------------------------------|--------------------------|---------|------|
| FP ALU op                    | FP ALU op                | 3       |      |
| FP ALU op                    | SD                       | 2       |      |
| LD                           | FP ALU op                | 1       |      |
| LD                           | SD                       | 0       |      |
| Int ALU op                   | Branch                   | 1       |      |
| Int ALU op                   | Int ALU op               | 0       | 8 L3 |

- b) With block diagram briefly explain the structure of basic Tomasulo based processor. Also explain the three steps involved in the execution of an instruction using basic Tomasulo's algorithm.
- c) Give the structure of (2, 2) correlating predictor and explain its working.

6 L2  
6 L6

6. a) For a Tomasulo-based dynamic scheduling (without speculation) processor, what would be the content of reservation stations and register tags after the third instruction ( i.e. SUB.D) in the following code sequence has executed and written its result. How many clock cycles does it take to complete the execution of all instructions?

|       |             |
|-------|-------------|
| L.D   | F6, 32(R2)  |
| MUL.D | F0, F2, F4  |
| SUB.D | F8, F6, F2  |
| DIV.D | F10, F0, F6 |
| ADD.D | F6, F8, F2  |

Given: The number of clock cycles needed for the execution of floating-point add is 2 cycles, multiply is 10 cycles, divide is 40 cycles and load is 2 cycles. Also assume that the number of load buffers is 2, number of FP adder reservation stations is 3, and number of FP multiplier reservation station is 2.

- b) With state diagram discuss the functioning of a 2-bit dynamic branch predictor. What are its predictions in different iterations for the branch instruction of the inner for loop in the following code:

```

for (p=0; p<10; p++)
  for (q=0; q<3; q++)
    // whatever code
  
```

- c) Discuss the role of Reorder Buffer in Hardware-based speculation?

10 L5

6 L4  
4 L1**Unit – IV**

7. a) Explain chained directory and full map protocols.  
 b) Describe write through and write back cache coherence protocols with suitable diagrams.

10 L2

10 L2

**13CS702**

**SEE – November – December 2016**

8. a) Define cache coherence problem? Explain cache inconsistencies caused by data sharing and I/O. 10 L2  
b) Explain any three static connection networks with examples. 10 L2

**Unit – V**

9. a) Write a short note on 12 L4  
i) Predicate instruction  
ii) Copy propagation and Tree height reduction  
b) Write features of itanium – 2 processor. 8 L2
10. a) Use the GCD test to determine whether dependences exist in the following loop. 10 L3  
i) for ( $i=1$ ;  $i \leq 100$ ;  $i=i+1$ ) {  
  
 $X[2*i+3] = X[2*i] * 5.0;$   
}  
ii) for ( $i=0$ ;  $i < 300$ ;  $i++$ )  
  
{  $a[301*i+5] = b[i]*c[i];$   
  
 $d[i] = a[301*i]+e;$  }  
b) Explain hardware support for preserving the exception behavior. 10 L2

BT\* Bloom's Taxonomy, L\* Level

\*\*\*\*\*



**NMAM INSTITUTE OF TECHNOLOGY, NITTE**  
*(An Autonomous Institution affiliated to VTU, Belagavi)*  
**Seventh Semester B.E. (CSE) (Credit System) Degree Examinations**  
**Make up Examinations - January 2017**

**13CS702 – ADVANCED COMPUTER ARCHITECTURE**

Duration: 3 Hours

**Note: Answer Five full questions choosing One full question from each Unit.****Unit – I**

- |                                                                                                                                                                                                                                                                                                                                                                                                                                            |       |     |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|-----|
| 1. a) What are benchmarks? Describe desktop and Server benchmarks.                                                                                                                                                                                                                                                                                                                                                                         | Marks | BT* |
| b) Explain COMA model of multiprocessor system with a neat diagram.                                                                                                                                                                                                                                                                                                                                                                        | 6     | L*2 |
| c) Describe the levels and utilization of functional parallelism.                                                                                                                                                                                                                                                                                                                                                                          | 6     | L4  |
|                                                                                                                                                                                                                                                                                                                                                                                                                                            | 8     | L2  |
| 2. a) Define Flynn's classification of parallel architectures.                                                                                                                                                                                                                                                                                                                                                                             | 4     | L2  |
| b) Explain different classes of computers.                                                                                                                                                                                                                                                                                                                                                                                                 | 6     | L2  |
| c) In a graphics processor, suppose FP square root is responsible for 20% of the execution time of a critical graphics benchmark. One strategy is to enhance the FP square root hardware and speed up this operation by a factor of 10. The other alternative is just to try to make all Fp instructions run faster by a factor of 1.6. FP instructions are responsible for half of execution time. Compare these two design alternatives. | 10    | L5  |

**Unit – II**

- |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |    |    |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|----|
| 3. a) Draw the block diagram of classic 5 stage MIPS pipeline. Discuss the operations performed in each stage of the pipeline for the execution of branch and store instructions.                                                                                                                                                                                                                                                                                                               | 10 | L4 |
| b) Consider an unpipelined processor. Assume that it has a 1ns clock cycle and that it uses 4 cycles for ALU operations and branches, and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that due to clock skew and set up, pipelining the processor adds 0.2ns of overhead to the clock. Ignoring any latency impact, how much speed up in the instruction execution rate will we gain from a pipeline? | 6  | L5 |
| c) What do you mean by structural hazard in a pipeline. Suggest solutions to overcome structural hazards.                                                                                                                                                                                                                                                                                                                                                                                       | 4  | L2 |
| 4. a) Compute the MAL and efficiency of the nonlinear pipeline that has the following reservation table.                                                                                                                                                                                                                                                                                                                                                                                        |    |    |

|    | 1 | 2 | 3 | 4 | 5 |
|----|---|---|---|---|---|
| S1 | x |   |   |   |   |
| S2 |   | x |   |   | x |
| S3 |   |   | x | x |   |

8 L3

- |                                                                                                                                         |   |    |
|-----------------------------------------------------------------------------------------------------------------------------------------|---|----|
| b) Consider a branch that is taken 80% of the time. On average, how many stalls are introduced for this branch for each approach below: |   |    |
| i) Stall the fetch until branch outcome is known.                                                                                       |   |    |
| ii) Assume not-taken and squash if the branch is taken.                                                                                 |   |    |
| iii) Assume a branch delay slot:                                                                                                        |   |    |
| a. No instruction is found to put in the delay slot                                                                                     |   |    |
| b. An instruction before the branch is put in the delay slot                                                                            |   |    |
| c. An instruction from the taken side is put in the delay slot                                                                          |   |    |
| d. An instruction from the not-taken side is put in the slot                                                                            | 6 | L4 |
| c) Explain RAW, WAR, and WAW hazards with suitable examples.                                                                            | 6 | L2 |

**Unit - III**

5. a) Show the following loop unrolled on a dual issue (1 FP instruction and 1 anything else) superscalar processor, so that there are five copies of the loop body. Assume that the number of loop iterations is a multiple of 5.

Loop: L.D F0,0(R1)  
 ADD.D F4,F0,F2  
 S.D F4,0(R1)  
 SUBI R1,R1,#8  
 BNEZ R1, Loop

Consider the following intervening latencies between two dependent instructions:

| <b>Instruction producing result</b> | <b>Instruction using result</b> | <b>Latency</b> |
|-------------------------------------|---------------------------------|----------------|
| FP ALU                              | FP ALU                          | 3              |
| FP ALU                              | SD                              | 2              |
| LD                                  | FP ALU                          | 1              |
| LD                                  | SD                              | 0              |
| Int ALU                             | Branch                          | 1              |
| Int ALU                             | Int ALU                         | 0              |

- b) In a processor having Tomasulo-based dynamic scheduling and speculation (using ROB) capacity, determine the sequence of clock cycles in which the write-back and commit operations are performed for the following instruction sequence.

ADD.D F8, F2, F6  
 MULT.D F10, F0, F6  
 SUB.D F6, F8, F2  
 MULT.D F6, F10, F6

The latencies for the functional units are as follows:

| <b>Functional units</b> | <b>Latencies</b> |
|-------------------------|------------------|
| Add/Sub                 | 2                |
| Multiply                | 10               |
| Divide                  | 40               |

The number of FP add reservation stations is 2, and number of FP multiply reservation stations is 2. Assume that there are enough entries in ROB.

- c) Write a short note on super scalar processors. 8 L5
6. a) Give the structure of a floating point unit using Tomasulo's algorithm to handle speculation. Name the various fields of Reservation stations and Reorder buffer in this hardware and discuss the role of Reorder Buffer. 4 L1
- b) Draw the block diagram of a (3, 2) Correlating predictor. Assuming 4K bits in the branch history table (BHT), calculate the number of entries in the BHT for the following (m,n) predictors: 8 L2
- i) (10,2)
  - ii) (2,2)
- c) Discuss the functioning of branch target buffer. 6 L6
- Unit - IV**
7. a) What is cache coherence problem? Explain cache inconsistencies due to data sharing and process migration. 6 L4
- b) Describe the any two directory based protocols with neat diagrams 10 L2
8. a) Explain write -Invalidate snoopy cache – coherence protocols for write-through and write –back caches. 10 L2

**13CS702**

**Make up - January 2017**

b) Explain the following

i) Linear Array ii) Perfect shuffle and Exchange iii) Digital Buses

10 L2

**Unit – V**

9. a) What is VLIW and EPIC? Explain.

6 L1

b) Explain loop carried dependency and true data dependency with example.

8 L4

c) Explain software pipelining.

6 L2

10. a) How compiler can reduce the impact of dependent computation to achieve more ILP? Explain.

6 L2

b) State the GCD test heuristic to detect the loop dependence. Using GCD test, determine whether dependence exist in the following loop.

```
for (i=1; i<=100; i=i+1) {
```

```
X[2*i+3] = X[2*i] * 5.0;
```

```
}
```

8 L5

c) Explain the Intel IA-64 architecture register state and instruction format.

6 L2

L3 BT\* Bloom's Taxonomy, L\* Level

\*\*\*\*\*



USN 

|  |  |  |  |  |  |  |  |  |  |
|--|--|--|--|--|--|--|--|--|--|
|  |  |  |  |  |  |  |  |  |  |
|--|--|--|--|--|--|--|--|--|--|

**NMAM INSTITUTE OF TECHNOLOGY, NITTE**

(An Autonomous Institution affiliated to VTU, Belagavi)

**Seventh Semester B.E. (CSE) (Credit System) Degree Examinations****Supplementary Examinations - July 2016****12CS702 – ADVANCED COMPUTER ARCHITECTURE**

Duration: 3 Hours

Max. Marks: 100

**Note: Answer Five full questions choosing One full question from each Unit.****Unit – I**

- |                                                                                                                                                            |        |
|------------------------------------------------------------------------------------------------------------------------------------------------------------|--------|
| a) Derive the processor performance equation.                                                                                                              | 06 L*4 |
| b) Define and explain the amdals Law.                                                                                                                      | 06 L2  |
| c) What is functional parallelism? Explain various levels of available and utilized functional parallelism in detail.                                      | 08 L2  |
| a) Explain various shared memory multiprocessor system models and distributed memory multicompilers with neat diagram.                                     | 10 L2  |
| b) Suppose we are on the design team for a new processor. The clock runs at 250 MHz. We assume that the processor only executes one instruction at a time. |        |
- The following table gives instruction frequencies for SPEC Benchmark S<sub>B</sub>.

| Instruction Type        | Frequency | Cycles   |
|-------------------------|-----------|----------|
| Load and Store          | 30%       | 6 Cycles |
| Arithmetic instructions | 50%       | 4 Cycles |
| All others              | 20%       | 3 Cycles |

(i) Calculate average clock cycles per instruction.

(ii) Determine the native MIPS processor speed for the benchmark in millions of instructions per second.

- |                                          |       |
|------------------------------------------|-------|
| c) Explain server and desktop benchmark. | 06 L2 |
|------------------------------------------|-------|

**Unit – II**

- |                                                                                                                                             |       |
|---------------------------------------------------------------------------------------------------------------------------------------------|-------|
| a) List and explain in detail, the three classes of hazards in a pipeline. Identify the various data hazards in the following instructions. | 04 L5 |
|---------------------------------------------------------------------------------------------------------------------------------------------|-------|

LOOP: L.D F0,0(R1)

ADD.D F4,F0,F2

S.D F4,0(R1)

DAADUI R1,R1,#-8

BNE R1,R2,LOOP

14 L2

- |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |       |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|
| b) Consider an un-pipelined processor, which has a 1 ns clock cycle and that uses 4 cycles for ALU operations and branches, and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20% and 40% respectively. Suppose that due to clock skew and setup, pipelining the processor adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution rate will we gain from the pipeline? | 06 L4 |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|

4. a) Consider the following reservation table for a four-stage pipeline with a clock cycle  $\zeta = 20$  ns.

|    | 1 | 2 | 3 | 4 | 5 | 6 |
|----|---|---|---|---|---|---|
| S1 | X |   |   |   |   | X |
| S2 |   | X |   | X |   |   |
| S3 |   |   | X |   |   |   |
| S4 |   |   |   | X | X |   |

- i) What are the forbidden latency and the initial collision vector?
  - ii) Draw the state transition diagram.
  - iii) Determine the MAL associated with the shortest greedy cycle.
  - iv) Determine the pipeline throughput corresponding to the MAL and given  $\zeta$ .
  - v) Determine the lower bound on the MAL.
- b) Differentiate between linear and nonlinear pipeline processor. Also explain how linear pipeline processor is constructed.

14 L5  
06 L2

### Unit – III

5. a) Unroll the following loop such a way that there are three copies of the loop body. Assume that the number of iterations in the given loop is multiple of 3.

Up: L.D F0, 0(R1)  
ADD.D F4,F0,F2  
S.D F4, 0(R1)  
SUBI R, R1,#8  
BNEZ R, Up

Given- The intervening latencies between two dependent instructions is as follows:

| Instruction producing result | Instruction using result | Latency |
|------------------------------|--------------------------|---------|
| FP ALU op                    | FP ALU op                | 3       |
| FP ALU op                    | SD                       | 2       |
| LD                           | FP ALU op                | 1       |
| LD                           | SD                       | 0       |
| Int ALU op                   | Branch                   | 1       |
| Int ALU op                   | Int ALU op               | 0       |

- b) With suitable examples, explain the three types of data hazards. Also suggest any possible solutions to overcome each of these hazards.
- c) With block diagram explain the operation of Branch Target Buffer.
- 08 L3  
06 L2  
06 L2
6. a) Give the structure of a floating point unit using Tomasulo's algorithm to handle speculation. Name the various fields of Reservation stations and Reorder buffer in this hardware and discuss the role of Reorder Buffer.
- b) Design and give the block diagram of a (3, 2) Correlating predictor.
- c) Distinguish between one-bit and two-bit dynamic branch predictors.
- 08 L4  
06 L6  
06 L2

### Unit – IV

7. a) Illustrate Goodman's write once cache coherence protocol.

08 L3

- b) Describe full map directory protocol and how eviction takes place in limited directory protocol with suitable diagram. 12 L2
8. a) Contrast two schemes in a chained directory protocol with neat diagram. 08 L4
- b) Illustrate write invalidate snoopy protocol with suitable state transition diagram. 12 L3

**Unit – V**

9. a) Explain the hardware support for compiler speculation in VLIW architecture. 08 L2
- b) State the GCD test heuristic to detect the loop dependence. Using GCD test, determine whether dependence exists in the following loop:  
`for (i=0; i<300; i++)  
{ a[301*i+5] = b[i]*c[i];  
d[i] = a[301*i]+e; }` 06 L5
- c) List the different types of dependences between the statements of the following loop. Resolve the output and anti dependences in this loop.  
`for (i=1; i<=100; i=i+1)  
{  
Y[i] = X[i] / c; /* S1 */  
X[i] = X[i] + c; /* S2 */  
Z[i] = Y[i] + c; /* S3 */  
Y[i] = c - Y[i]; /* S4 */  
}` 06 L3
10. a) With suitable example explain software pipelining. 08 L4
- b) Consider the following code:  
`if (A==0) {S=T;}`  
Assuming that registers R1, R2 and R3 hold the values of A, S and T, respectively, show the code for this statement with the branch and with the conditional move. 06 L3
- c) Discuss the salient features of Intel IA-64 architecture. 06 L1

BT\* Bloom's Taxonomy, L\* Level

\*\*\*\*\*

C  
USN 

|  |  |  |  |  |  |  |  |  |  |  |  |
|--|--|--|--|--|--|--|--|--|--|--|--|
|  |  |  |  |  |  |  |  |  |  |  |  |
|--|--|--|--|--|--|--|--|--|--|--|--|

**NMAM INSTITUTE OF TECHNOLOGY, NITTE**

(An Autonomous Institution affiliated to VTU, Belagavi)

**Seventh Semester B.E. (CSE) (Credit System) Degree Examinations****Make up Examinations – January 2016****12CS702 – ADVANCED COMPUTER ARCHITECTURE**

Duration: 3 Hours

Max. Marks: 100

**Note: Answer Five full questions choosing One full question from each Unit.****Unit – I****Marks BT\***

1. a) Explain any two useful quantitative principles of computer architecture design. 4 L\*2  
b) What is Amdal's law? Derive the equation for Amdal's law. 6 L4  
c) Explain Desktop and Server benchmarks in detail. 6 L2  
d) A System contains Floating point (FP) and Floating Point Square Root (FPSQR) unit. FPSQR is responsible for 20% of the execution time. One proposal is to enhance the FPSQR hardware and speedup this operation by a factor of 15 second alternate is just to try to make all FP instructions run faster by a factor of 1.6 times faster with the same effort as required for the fast FPSQR. Compare the two design alternative. 4 L4
  
2. a) Explain UMA, COMA and Distributed multicomputers. 8 L2  
b) Explain the scheduling policies of concurrent execution. 6 L2  
c) Discuss about functional and data parallelism with example. 6 L2

**Unit – II**

3. a) With a neat diagram explain how MIPS instruction can be implemented in at most 5 clock cycles. 10 L2  
b) Consider the following pipeline reservation table.

|    |   |   |   |   |
|----|---|---|---|---|
| S1 | X |   |   | X |
| S2 |   | X |   |   |
| S3 |   |   | X |   |

  - a) What are the forbidden latencies.
  - b) Draw the state transition diagram.
  - c) List all the simple cycles and greedy cycles.
  - d) Determine the optimal constant latency cycle and the MAL.
  - e) Determine the pipeline throughput for a clock period of 2ns.10 L4
  
4. a) Explain the five clock cycles of RISC instruction set. 5 L2  
b) What are the bounds on MAL in nonlinear pipeline? Explain with example how to optimize MAL by inserting noncompute delays. 10 L2  
c) Discuss with example data hazards requiring stall. 5 L2

**Unit – III**

5. a) Show the following loop unrolled so that there are two copies of the loop body, assuming the number of loop iterations is a multiple of 2. Eliminate any obviously redundant computations and do not reuse any of the registers.

```

Loop: L.D F0,0(R1)
      ADD.D F4,F0,F2
      S.D F4,0(R1)
      SUBI R1,R1,#8
      BNEZ R1, Loop
  
```

Consider the following intervening latencies between two dependent instructions:

| <b>Instruction producing result</b> | <b>Instruction using result</b> | <b>Latency</b> |      |
|-------------------------------------|---------------------------------|----------------|------|
| FP ALU op                           | FP ALU op                       | 3              |      |
| FP ALU op                           | SD                              | 2              |      |
| LD                                  | FP ALU op                       | 1              |      |
| LD                                  | SD                              | 0              |      |
| Int ALU op                          | Branch                          | 1              |      |
| Int ALU op                          | Int ALU op                      | 0              | 8 L5 |

- b) Explain with diagram the basic structure of MIPS floating point unit using Tomasulo's algorithm. 6 L2
- c) Assuming 4K bits in the branch history table (BHT), calculate the number of entries in the BHT for the following (m,n) predictors:  
 i) (0,2)  
 ii) (2,2)  
 iii) (10,2) 6 L3

6. a) In a Tomasulo-based dynamic scheduling (without speculation) processor, show what the reservation stations and register tags look like for the following code sequence when third instruction ( ie SUB.D) has completed and written its result. How many clock cycles it takes to complete the execution of all instructions?

|       |             |
|-------|-------------|
| L.D   | F6, 32(R2)  |
| MUL.D | F0, F2, F4  |
| SUB.D | F8, F6, F2  |
| DIV.D | F10, F0, F6 |
| ADD.D | F6, F8, F2  |

Assume the following latencies for the floating-point functional units: add is 2 clock cycles, multiply is 10 clock cycles, divide is 40 clock cycles and load is 2 clock cycles.

Also assume that the number of load buffers is 2, number of FP adder reservation stations is 3, and number of FP multiplier reservation station is 2.

- b) Differentiate between true data dependence and name dependence with suitable examples. 8 L4
- c) With the state diagram explain the working of 2-bit dynamic branch predictor. 6 L2
- 6 L1

**Unit – IV**

7. a) What is cache coherence problem? Illustrate the inconsistency in data sharing and process migration
- b) Explain write invalidate snoopy protocol with state transition graphs. 10 L4

10 L2

8. a) Contrast full map and limited directory based protocol. 12 L4  
 b) Explain any 2 static connection networks and discuss the network performance issues. 8 L2

**Unit – V**

9. a) State the GCD test heuristic to detect the loop dependence. Using GCD test, determine whether dependences exist in the following loop:  

$$\begin{aligned} \text{for } & (i=1; i \leq 100; i=i+1) \\ & \{ \\ & \quad X[2*i+3] = X[2*i] * 5.0; \\ & \} \end{aligned}$$
 8 L5
- b) Explain the hardware mechanisms to support compiler speculation. 6 L2  
 c) Discuss the merits and demerits of VLIW architecture. 6 L4
10. a) Discuss the key features of Intel Itanium processor. 8 L1  
 b) Write the software-pipelined version of the following loop:  
 Loop:      L.D            F0, 0(R1)  
              ADD.D        F4, F0, F2  
              S.D            F4, 0(R1)  
              DADDUI       R1, R1, #-8  
              BNE           R1, R2, Loop 8 L3
- c) Compare software pipelining and loop unrolling. 4 L4

BT\* Bloom's Taxonomy, L\* Level

\*\*\*\*\*



USN \_\_\_\_\_

**NMAM INSTITUTE OF TECHNOLOGY, NITTE**

(An Autonomous Institution affiliated to VTU, Belagavi)

**Seventh Semester B.E. (CSE) (Credit System) Degree Examinations**

November - December 2015

**12CS702 – ADVANCED COMPUTER ARCHITECTURE**

Duration: 3 Hours

Max. Marks: 100

**Note: Answer Five full questions choosing One full question from each Unit.****Unit – I****Marks BT\***

1. a) Explain Different Classes of Computers. 08 L2  
b) What is benchmark? Explain different types of benchmark with different significance. 08 L2  
c) With neat diagram explain the UMA, NUMA and COMA multiprocessor systems. 08 L2
2. a) Briefly explain the concept of concurrent and parallel execution. 08 L2  
b) Following measurements are made :  
 $Frequency\ of\ Floating\ point(FP)\ operations = 25\%$   
 $Average\ CPI\ of\ FP\ operations = 4.0$   
 $Average\ CPI\ of\ other\ instructions = 1.33$   
 $Frequency\ of\ Floating\ Point\ Square\ Root(FPSQR) = 2\%$   
 $CPI\ of\ FPSQR = 20$   
Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or decrease the average CPI of all FP operations to 2.5.  
Compare FP and FPSQR design alternatives using processor performance equation 08 L4  
c) Explain the available and utilized levels of functional parallelism. 08 L2

**Unit – II**

3. a) What are data hazards? How to minimize data hazards? Explain with suitable example. 10 L2  
b) Consider the following reservation table for a four stage pipeline with a clock cycle of 20ns.

|    | 1 | 2 | 3 | 4 | 5 | 6 |
|----|---|---|---|---|---|---|
| S1 | X |   |   |   |   | X |
| S2 |   | X |   | X |   |   |
| S3 |   |   | X |   |   |   |
| S4 |   |   |   | X | X |   |

- i) What are the forbidden latencies and the initial collision vector?  
ii) Draw the state transition diagram for scheduling the pipeline.  
iii) Determine the MAL associated with the shortest greedy cycle.  
iv) Determine the pipeline throughput.  
v) Determine the lower bound on the MAL for this pipeline.

10 L4

4. a) Derive the equation for finding the actual speedup from pipelining. 8 L4  
 b) Give a comparison between synchronous and asynchronous model of linear pipeline. 4 L2  
 c) With reference to pipeline processors define the following  
     i) performance cost ratio(PCR)  
     ii) clock skewing  
     iii) clock cycle and throughput  
     iv) efficiency and throughput. 8 L2

**Unit – III**

5. a) Show the following loop unrolled on a dual issue (1 FP instruction and 1 anything else) superscalar processor, so that there are five copies of the loop body. Assume that the number of loop iterations is a multiple of 5.

Loop:   L.D F0,0(R1)  
          ADD.D F4,F0,F2  
          S.D F4,0(R1)  
          SUBI R1,R1,#8  
          BNEZ R1, Loop

Consider the following intervening latencies between two dependent instructions:

| Instruction producing result | Instruction using result | Latency |      |
|------------------------------|--------------------------|---------|------|
| FP ALU                       | FP ALU                   | 3       |      |
| FP ALU                       | SD                       | 2       |      |
| LD                           | FP ALU                   | 1       |      |
| LD                           | SD                       | 0       |      |
| Int ALU                      | Branch                   | 1       |      |
| Int ALU                      | Int ALU                  | 0       | 8 L3 |

- b) Explain how Branch Target Buffer increases the instruction fetch bandwidth. 6 L2  
 c) Assuming 1K bits in the branch history table (BHT), calculate the number of entries in the BHT for the following (m,n) predictors:  
     i) (0,2)  
     ii) (2,2)  
     iii) (8,1) 6 L5

6. a) For a Tomasulo-based dynamic scheduling (without speculation) processor, what will be the content of reservation stations and register tags after the third instruction ( i.e. SUB.D) in the following code sequence has executed and written its result. How many clock cycles does it take to complete the execution of all instructions?

|       |             |
|-------|-------------|
| L.D   | F6, 32(R2)  |
| MUL.D | F0, F2, F4  |
| SUB.D | F8, F6, F2  |
| DIV.D | F10, F0, F6 |
| ADD.D | F6, F8, F2  |

Given: The number of clock cycles needed for the execution of floating-point add is 2 cycles, multiply is 10 cycles, divide is 40 cycles and load is 2 cycles.

Also assume that the number of load buffers is 2, number of FP adder reservation stations is 3, and number of FP multiplier reservation station is 2.

10 L5

12CS702

SEE – November – December 2015

- b) Write the state diagram for 2-bit dynamic branch predictor. Explain the functioning of 2-bit dynamic branch predictor considering the branch of the inner loop in the following code:

```
for ( i=0; i<100; i++)  
    for ( j=0; j<3; j++)  
        // whatever code
```

6 L2

- c) Discuss the role of Reorder Buffer in handling speculative execution on Tomasulo based processor.

4 L4

#### Unit – IV

7. a) Explain perfect shuffle and exchange and hyper cube routing with an example. 10 L2  
b) Contrast limited and chain directory based protocol. 10 L3
8. a) Describe write-through and write back caches with state transition diagram. 10 L4  
b) Illustrate the inconsistency in writable data and process migration 10 L2

#### Unit – V

9. a) List the different types of dependences between S1 and S2 in the loop given below. Is this loop parallel? If not, show how to make it parallel.  
for (i=1; i<=100; i=i+1)  
{  
 A[i] = A[i] + B[i]; /\* S1 \*/  
 B[i+1] = C[i] + D[i]; /\* S2 \*/  
}
- b) Explain how dependence between two references to same array in a loop is detected. 6 L4  
c) Discuss the key features of Intel Itanium processor 6 L2
10. a) Show a software-pipelined version of the following loop, which increments all the elements of an array (whose starting address is in R1) by the content of F2.  
Loop:      L.D      F0, 0(R1)  
             ADD.D    F4, F0, F2  
             S.D      F4, 0(R1)  
             DADDUI   R1, R1, #-8  
             BNE      R1, R2, Loop
- b) Give any two examples to illustrate the significance of predicated instructions in VLIW architecture. 6 L5  
c) Give the overview of the Intel IA-64 architecture. 6 L1

BT\* Bloom's Taxonomy, L\* Level

\*\*\*\*\*

**NMAM INSTITUTE OF TECHNOLOGY, NITTE**

(An Autonomous Institution affiliated to VTU, Belgaum)

**Seventh Semester B.E. (CSE) (Credit System) Degree Examinations****Make up Examinations – January 2015****CS702 – ADVANCED COMPUTER ARCHITECTURE**

Duration: 3 Hours

Max. Marks: 100

**Note: Answer Five full questions choosing One full question from each Unit.****Unit – I**

- |    |                                                                                      |    |
|----|--------------------------------------------------------------------------------------|----|
| 1. | a) Show that ratio of geometric mean is equal to geometric mean of performance ratio | 6  |
|    | b) Explain Amdahl's law                                                              | 4  |
|    | c) Write down the processor performance equation                                     | 5  |
|    | d) Generic model for messaging passing multiprocessor                                | 5  |
| 2. | a) What are the models of shared memory multiprocessors                              | 10 |
|    | b) Explain the types and level of parallelism                                        | 10 |

**Unit – II**

- |    |                                                                                                                     |    |
|----|---------------------------------------------------------------------------------------------------------------------|----|
| 3. | a) Implement the steps of a RISC instruction set with stages of pipeline                                            | 10 |
|    | b) For the following reservation table of a non linear pipeline, compute the MAL for a collision – free scheduling. |    |

|    | 1 | 2 | 3 | 4 |
|----|---|---|---|---|
| S1 | X |   |   | X |
| S2 |   | X |   |   |
| S3 |   |   | X |   |

10

- |    |                                                                 |    |
|----|-----------------------------------------------------------------|----|
| 4. | a) What is a hazard? Explain the classes of hazards.            | 10 |
|    | b) Write the equations for performance of pipeline with stalls. | 10 |

**Unit – III**

- |    |                                                                                          |    |
|----|------------------------------------------------------------------------------------------|----|
| 5. | a) Explain loop unrolling with example                                                   | 10 |
|    | b) Describe how speculative execution is performed using Tomasulo algorithm and ROB.     | 10 |
| 6. | a) Differentiate static and dynamic branch prediction                                    | 10 |
|    | b) Explain the Tomasulo based processor with a block diagram and the steps of execution. | 10 |

**Unit – IV**

- |    |                                                                |   |
|----|----------------------------------------------------------------|---|
| 7. | a) Draw and explain Write through cache and write back cache.  | 4 |
|    | b) What is data parallel computation? Explain with an example. | 8 |
|    | c) Summarize Static and dynamic network characteristics        | 8 |
| 8. | a) With the help of state diagram explain write once protocol  | 6 |
|    | b) Design and explain EMC-R processor                          | 8 |
|    | c) Draw the dataflow graph for computing cosx                  | 3 |
|    | d) Draw evolution tree of dynamic dataflow machine             | 3 |

**P.T.O.**

9. a) Consider a loop like this  
For( $i=1; i \leq 100; i=i+1$ )  
 $A[i]=A[i]+B[i]; /*S1*/$   
 $B[i+1]=C[i]+D[i]; /*S2*/$  what are the dependencies between S1 and S2? Is this loop parallel? If not, show how to make it parallel.  
b) Explain Functional units and instruction issue of Itanium 2 processor with performance
10. a) Show a software pipelined version of this loop, which increments in all the elements of an array whose starting address is in R1 by the content of F2:  
Loop: LD F0,O(R1)  
      ADDD F4,F0,F2  
      SD F4,O(R1)  
      DADDUI R1,R1,#-8  
      BNE R1,R2,Loop  
You may omit the startup and clean up code  
b) Draw and explain the execution pattern for software pipelined loop and unrolled loop.  
c) How do you eliminate dependent computation?

\*\*\*\*\*



**NMAM INSTITUTE OF TECHNOLOGY, NITTE**

(An Autonomous Institution affiliated to VTU, Belgaum)

Seventh Semester B.E. (CSE) (Credit System) Degree Examinations

December - 2014

CS702 – ADVANCED COMPUTER ARCHITECTURE

**Duration: 3 Hours**

Max. Marks: 100

**Note: Answer Five full questions choosing One full question from each Unit.**

Unit - I

1. a) What is a benchmark suite? Explain. Discuss the Desktop benchmarks and Server benchmarks  
b) Derive processor performance equation.  
c) Suppose the following measurements have been made

10  
5

Frequency of FP operations = 25%  
Average CPI of FP operations = 4.0  
Average CPI of other instructions = 1.33  
Frequency of FPSQR= 2%  
CPI of FPSQR = 20

Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the average CPI of all FP operations to 2.5. Compare these two design alternatives using the processor performance equation.

2. a) Explain the concepts of concurrent and parallel execution.  
b) With block diagrams explain the UMA, NUMA and COMA multiprocessor systems.  
c) State Amdahl's Law. How it defines speed up? Explain. On what factors the speed up depends? Derive an expression for speed up in terms of execution time.

5

Unit - II

3. a) Consider the three staged pipeline processor specified by the following reservation table:

|    | 1 | 2 | 3 | 4 |
|----|---|---|---|---|
| S1 | X |   |   | X |
| S2 |   | X |   |   |
| S3 |   |   | X |   |

- i) List the set of forbidden and permissible latencies and collision vector.

ii) Draw a state transition diagram showing all possible cycles without causing the collision in the pipeline.

iii) Identify all the simple cycles and greedy cycles from the transition diagram. What is the MAL of this pipeline? What is the efficiency of this pipeline?

iv) If pipeline clock period is 20ns find the throughput of this pipeline.

b) An examination paper has 6 questions to be answered and there are 5000 answer books. Each answer takes 3 minutes to correct. If 3 teachers are employed to correct the papers in pipeline mode, how much time will be taken to complete the job of correcting 5000 answer papers? What is the efficiency of processing? The clock rate of the pipeline is 25MHZ. Calculate the throughput.

c) List the 3 classes of Hazards. Find out the various data hazards in the following instruction stream. Also remove the hazards if possible.

-10

6

```

I1: i=j-k;
I2: m=k+2;
I3: n=k+j;
I4: j=10;
I5: k=j+10;
I6: k=n/j;
I7: m=k-n;

```

4. a) Explain how MIPS pipeline can be extended to handle floating point operations  
b) Write the difference between Static and Dynamic pipeline.  
c) Show that a pipelined processor with  $k$  stages can be  $k$  times faster (at the maximum) than an equivalent non pipelined processor

4

10

P.I.O.

## Unit – III

5. a) What is Instruction Level Parallelism? Explain. What are the different types of dependences? Explain  
 b) What do you mean by loop unrolling? Consider the following code and find out any improvement in the code.

```
void main()
{
    int A[1000][1000];
    int i, j;
    int sum=0;
    for(i=0; i<1000; i++)
        for(j=0; j<1000; j++)
            sum += A[i][j];
}
```

- c) Explain how to overcome data hazards with dynamic scheduling with example.  
 6. a) What is a branch prediction buffer? Compare 1 bit and 2 bit branch predictors.  
 b) With the help of a neat diagram, explain the structure of a floating point processor unit based on extended Tomasulo's algorithm to handle speculation. Explain the four steps involved in instruction execution.  
 c) Determine the total branch penalty for a branch-target buffer assuming the penalty cycles for individual mispredictions from Figure Q6.C. Make the following assumptions about the prediction accuracy and hit rate:  
  - Prediction accuracy is 90% (for instructions in the buffer).
  - Hit rate in the buffer is 90% (for branches predicted taken).

| Instruction in buffer | Prediction | Actual branch | Penalty cycles |
|-----------------------|------------|---------------|----------------|
| yes                   | taken      | taken         | 0              |
| yes                   | taken      | not taken     | 2              |
| no                    |            | taken         | 2              |
| no                    |            | not taken     | 0              |

Figure Q6.C

## Unit – IV

7. a) With an example explain cache coherence problem. How write invalidate and write update snoopy protocols maintain cache coherence for write through caches?  
 b) What do you mean by synchronization? Explain the hardware synchronization mechanisms used in multiprocessor systems.  
 c) Distinguish between static and dynamic dataflow computers  
 8. a) Answer the following related to multistage networks:
  - How many states are there in a 4X4 switch module including both broadcast and permutations?
  - How many permutations can be implemented directly in a single pass through the network without blocking, if a 64 input Omega network using 4X4 switch modules is constructed in multiple stages?
  - What is the percentage of one pass permutations compared with the total no. of permutations available in one or more passes through the network?
 b) Explain perfect shuffle and exchange. What are the factors that affect the performance of an interconnection network?  
 c) Explain the following interconnection networks with figures:  
 i) Chordal ring of degree 3 ii) Illiac mesh iii) 4 cube

## Unit – V

9. a) Explain hardware support for preserving exception behavior  
 b) Briefly explain the features of Intel IA-64 architecture  
 10. a) How does the compiler detect dependences in general?  
 b) Write short notes on
  - Software Pipelining
  - Global Code Scheduling

\*\*\*\*\*


**NMAM INSTITUTE OF TECHNOLOGY, NITTE**  
 (An Autonomous Institution affiliated to VTU, Belgaum)  
**Seventh Semester B.E. (CSE) (Credit System) Degree Examinations**  
**Supplementary Examinations - July 2014**

USN

Duration: 3 Hours

Max. Marks: 100

**Note: Answer Five full questions choosing One full question from each Unit.**

**Unit – I**

1. a) Discuss about SPEC benchmarks in detail. Explain how SPEC ratio is calculated. 08  
b) Differentiate between concurrent and parallel execution models. 04  
c) With a neat Block diagram, explain working of different shared memory multiprocessor systems. 08
2. a) Mention the factors that affect the speedup of a processor according to Amdahl's law. Using Amdahl's law find the overall speedup of enhanced system for the following data: Assume that a processor needs to be enhanced for the database transactions. The new processor is 10 times faster on the database transactions. Assume that original processor is busy 40% of time with computation on transactions and is waiting for I/O 60% of the time. 10  
b) What do you mean by functional parallelism? Explain various levels of available and utilized functional parallelism in detail. 10

**Unit – II**

3. a) What do you mean by pipeline hazard? Mention different types of hazards. With a valid example explain different methods of scheduling the branch delay slot. 08  
b) Define speedup, efficiency and throughput of linear pipeline processors. 04  
c) With a neat block diagram, explain classic 5 stage pipeline for a RISC processor 08
4. a) With a neat diagram, explain the events on every pipe stage of MIPS pipeline for processing R, I and J types of instructions 08  
b) For the following reservation table of non linear pipeline find the minimal average latency (MAL) for a collision free scheduling.

|                | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|----------------|---|---|---|---|---|---|---|---|
| S <sub>1</sub> | X |   |   |   | X |   | X |   |
| S <sub>2</sub> |   | X | X |   |   |   |   |   |
| S <sub>3</sub> |   |   | X | X |   | X |   |   |

- c) Discuss data hazard taking a valid example. Explain a technique to overcome data hazard 06

**Unit – III**

5. a) For the loop given below, show how a basic pipelining technique can minimize stalls.  

$$\text{for}(i=1000; i>0; i--)$$

$$x[i]=x[i] + s;$$

Latencies for the dependent units are as follows:

| Instruction producing result | Instruction using result | Latency |
|------------------------------|--------------------------|---------|
| FP Add/Sub                   | FP Add/Sub               | 3       |
| FP Add/Sub                   | SD                       | 2       |
| LD                           | FP Add/Sub               | 1       |
| LD                           | SD                       | 0       |
| Int Add/Sub                  | Int Add/Sub              | 0       |
| Int Add/Sub                  | Branch                   | 1       |

06  
06

06

P.T.O.

**CS702**

- b) If the following instructions are scheduled dynamically using Tomasulo's algorithm, what are the contents of reservation stations and register tags when all the instructions are issued in order.

|      |            |
|------|------------|
| LD   | F6, 34(R2) |
| LD   | F2,45(R2)  |
| MULD | F0,F2,F4   |
| SUBD | F8,F6,F2   |
| DIVD | F10,F0,F6  |
| ADDD | F6,F8,F2   |

Latencies for the functional units are as follows:

| Functional Unit | Latencies |
|-----------------|-----------|
| Add             | 2         |
| Multiply        | 10        |
| Divide          | 40        |
| Load            | 2         |

Consider number of load buffers as 3, number of FP add reservation station is 3, number of FP multiply reservation station is 2.

- c) Giving the Hardware structure of Branch History Table, explain 1 bit prediction scheme in detail
6. a) What do you mean by loop unrolling? Explain with an example how compiler does the loop unrolling statically to expose ILP.
- b) With a neat block diagram, explain four steps involved in execution of an instruction using extended Tomasulo's algorithm to handle speculation. What are the roles of reorder buffer?
- c) Briefly explain tournament predictor scheme.

**Unit – IV**

7. a) Explain in detail about write-through cache snoopy bus protocol with a neat state transition diagram
- b) Draw a 16X16 Omega network using 2X2 switches.
- c) Write a short note on Data flow architecture
8. a) Write the Data flow Graph for the sample instructions given below:  
 Input d,e,f  
 $C_0=0$   
 for i=1 to 8 do  
 begin  
 $a_i = d_i + e_i$   
 $b_i = a_i * f_i$   
 $c_i = b_i + c_{i-1}$   
 end  
 output a,b,c
- b) Briefly explain three generations of multicomputer systems
- c) Explain the construction of 16X16 baseline network using 2X2 switches .

**Unit – V**

9. a) Explain loop carried dependence and non loop carried dependence with a valid example for each one of them
- b) What do you mean by predication in VLIW architecture? give an example to illustrate the importance of predicated instructions
- c) Compare software pipelining and loop unrolling technique
10. a) Explain in detail the concept of software pipelined loop taking a relevant example.
- b) Explain in detail, the hardware mechanisms to support compiler speculation. Also highlight the key methods involved in hardware support for preserving exception behavior

\*\*\*\*\*

USN 

|  |  |  |  |  |  |  |
|--|--|--|--|--|--|--|
|  |  |  |  |  |  |  |
|--|--|--|--|--|--|--|



# NMAM INSTITUTE OF TECHNOLOGY, NITTE

(An Autonomous Institution affiliated to VTU, Belgaum)

## Seventh Semester B.E. (CSE) (Credit System) Degree Examinations

December - 2013

### CS702 – ADVANCED COMPUTER ARCHITECTURE

Duration: 3 Hours

Note: Answer Five full questions choosing One full question from each Unit.

- I. a) What is the need of standard benchmark programs? Explain how overall SPEC ratio is calculated for a benchmark suit. 6  
 b) Explain with block diagrams the UMA, NUMA and COMA multiprocessor systems. 6  
 c) For the purpose of solving a given application problem, a benchmark program is run on two computer systems. On system A, the object code executed 80 million ALU operations, 40 million load instructions, and 25 million branch instructions. On system B, the object code executed 50 million ALU ops, 50 million loads, and 40 million branch instructions. In both systems, each ALU operation takes 1 clock cycles, each load takes 3 clock cycles, and each branch takes 5 clock cycles. Assuming that the clock on system B is 10% faster than the clock on system A, which system is faster for the given application problem and by how much percent? 8
- a) Compute the overall CPI of a computer which executes a program with following instruction mix:
- | Operation | Frequency | No. of Clock cycles |
|-----------|-----------|---------------------|
| ALU ops   | 35%       | 1                   |
| Loads     | 25%       | 2                   |
| Stores    | 15%       | 2                   |
| Branches  | 25%       | 3                   |
- b) With block diagrams and examples explain the SIMD and MIMD multiprocessor architectural classifications. 6  
 c) Suppose that a system contains a special processor for doing floating-point operations and it is determined that 50% of the computations can use the floating-point processor. The speedup of the floating point processor is 15.  
 i) What is the overall speedup achieved by using the floating-point processor?  
 ii) What is the overall speedup achieved if the compiler is modified so that 75% of the computations can use the floating-point processor?  
 iii) What fraction of the computations should be able to use the floating-point processor in order to achieve an overall speedup of 2.25? 8

### Unit – II

- a) Explain how control hazard affects the performance in a 5 stage pipeline and discuss the solutions to control hazard. 8  
 b) Compute the permissible latency, forbidden latency and collision vector for the following reservation table

|    | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|----|---|---|---|---|---|---|---|
| s1 | x |   | x |   | x |   |   |
| s2 |   | x |   |   |   |   |   |
| s3 |   |   |   | x |   |   |   |
| s4 |   |   | x |   |   | x |   |

- c) Derive an expression for overall speedup of a pipelined processor visualizing the pipeline as a mechanism to improve performance by decreasing the CPI. 6

4. a) An unpipelined processor takes 6 ns to work on one instruction. The pipelined version of the processor has 6 stages with the following lengths: 1.0ns; 0.8ns; 0.4ns; 1.2ns; 1.3ns; 1.3ns. It then takes 0.3 ns to latch its results into latches. Answer the following, assuming that there are no stalls in the pipeline.
1. What are the cycle times in both processors?
  2. How long does it take (in nano-seconds) to finish one instruction in both processors?  
(Note : Ignore the initial fill time in the pipelined processor)
  3. What is the speedup achieved by the 6 stage pipeline with respect to unpipelined processor?
  4. How long does it take (in nano-seconds) to finish 1000 instructions in both processors?  
(Note : Do not ignore the initial fill time in the pipelined processor)
- b) Consider an unpipelined processor. Assume that it has a 1ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that due to clock skew and set up, pipelining the processor adds 0.2ns of overhead to the clock. Ignoring any latency impact, how much speed up in the instruction execution rate will we gain from a pipeline?
- c) Draw the block diagram of 5 stage MIPS pipeline. Explain the operations performed by every pipe stage for the execution of instruction ADD R1, R2, R3.

### Unit – III

5. a) Give the structure of a floating point processor unit based on extended Tomasulo's algorithm to handle speculation. Briefly explain the four steps involved in the execution of any instruction in this processor.
- b) Show the following loop unrolled on a dual issue (1 FP instruction and 1 anything else) superscalar processor, so that there are five copies of the loop body. Assume that the number of loop iterations is a multiple of 5.

```

Loop:   L.D F0,0(R1)
        ADD.D F4,F0,F2
        S.D F4,0(R1)
        SUBI R1,R1,#8
        BNEZ R1, Loop
    
```

Consider the following intervening latencies between two dependent instructions:

| Instruction producing result | Instruction using result | Latency |
|------------------------------|--------------------------|---------|
| FP ALU                       | FP ALU                   | 3       |
| FP ALU                       | SD                       | 2       |
| LD                           | FP ALU                   | 1       |
| LD                           | SD                       | 0       |
| Int ALU                      | Branch                   | 1       |
| Int ALU                      | Int ALU                  | 0       |

- c) With block diagram describe the structure of (2, 2) correlating predictor
6. a) Show the contents of reservation stations and register tags for the first 4 clock cycles when the instructions in the following example code are scheduled dynamically using basic Tomasulo's algorithm (without speculation).

```

SUB.D F8, F2, F6
ADD.D F6, F8, F2
DIV.D F10, F0, F6
    
```

The latencies for the functional units are as follows:

| Functional units | Latencies |
|------------------|-----------|
| Add/Sub          | 2         |
| Multiply         | 10        |
| Divide           | 40        |

The number of FP add reservation stations is 2, and number of FP multiply reservation stations is 1.

- b) Explain how Branch Target Buffer increases the instruction fetch bandwidth. 6  
 c) Compare the operation of one-bit and two-bit branch predictors with suitable examples. 6


 Unit - IV

7. a) Explain the following static connection networks with figures. 8  
 i) chordal ring of degree 3  
 ii) binary tree (of level 4)  
 iii) torus  
 iv) a cube
- b) Draw a 16\*16 omega network using 2\*2 switches and perfect shuffle as inter stage connection 6
- c) List out the causes of cache inconsistency. Explain any one cause in detail. 6
8. a) Give the overview of snoopy cache coherence protocols. Write the state transition diagram for snoopy protocol among write-through caches. 8  
 b) Write a short note on dataflow architectures and data parallel architectures 8  
 c) Distinguish between static and dynamic interconnection networks 4

## Unit - V

9. a) State the GCD test heuristic to detect the loop dependence. Using GCD test, determine whether dependence exists in the following loop:  

$$\text{for } (i=1; i \leq 100; i=i+1) \\ \{ \\ \quad X[2*i+3] = X[2*i] * 5.0; \\ \}$$
 8
- b) Briefly explain the following:  
 c) Hardware mechanisms to support compiler speculation  
 ii) Features of Itanium 2 processor 12
10. a) Give the overview of the Intel IA-64 architecture. 8  
 b) With example distinguish between loop-carried dependence and non-loop-carried dependence 6  
 c) Explain the software pipelining 6

-3-

\*\*\*\*\*



# NMAM INSTITUTE OF TECHNOLOGY, NITTE

(An Autonomous Institution affiliated to VTU, Belgaum)

## Seventh Semester B.E. (CSE) (Credit System) Degree Examinations

Supplementary Examinations – June 2013

CS702 – ADVANCED COMPUTER ARCHITECTURE

Note: Answer **Five full** questions choosing **One full** question from each **Unit**

### Unit – I

1. a) What are Benchmarks? Explain the types and significance of it. 4  
b) Explain the levels of parallelism 4  
c) Derive an equation for CPI for using Processor Performance Equation 6  
d) Distinguish between UMA, NUMA and COMA models 6
2. a) Define distributed memory multicomputers 4  
b) Explain the concept of concurrent and parallel executions 4  
c) How do you explain speedup factor in Amdahl's Law? Explain 6  
d) Suppose we have made the following measurements:

Frequency of FP operations = 25%  
Average CPI of FP operations = 4.0  
Average CPI of other instructions = 1.33  
Frequency of FPSQR=2%  
CPI of FPSQR=20

Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the average CPI of all FP operations to 2.5. Compare these two design alternatives using the processor performance equation

6

### Unit – II

3. a) What is pipelining? Explain a classic five-stage pipeline for a RISC processor along with simple implementation of RISC instruction sets. Also draw the simplified version of a RISC data path in pipeline fashion. 10  
b) Explain the three classes of pipeline hazards with neat diagrams. 10
- a) Explain the steps involved in simple implementation of MIPS with diagram. 10  
b) Consider the five-stage pipelined processor specified by the following reservation table.

|       | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|-------|---|---|---|---|---|---|---|---|
| $S_1$ | X |   |   |   | X |   | X |   |
| $S_2$ |   | X | X |   |   |   |   |   |
| $S_3$ |   | X | X | X |   | X |   |   |

- i. List all the set of forbidden latencies and collision vector.  
ii. Draw a state transition diagram.  
iii. List all the simple cycles  
iv. List all the greedy cycles.  
v. What is MAL & MCL of this pipeline?

10

### Unit – III

4. a) Explain control dependences with example 6  
b) What is loop unrolling? Explain with example 6  
c) Describe dynamic scheduling using Tomasulo's Approach 8

P.T.O.

6. a) Explain the different types of dependences ILP with examples  
b) How to increase instruction fetch bandwidth? Explain

#### Unit – IV

7. a) What is data parallel computation? Explain  
b) Describe the three types of cache directory protocols  
c) Explain the various static connection networks with necessary parameters
8. a) Differentiate between Static vs Dynamic dataflow computers  
b) Give the features of fine-grained SIMD architecture with help of MPP system architecture  
c) Explain snoopy bus protocols

#### Unit – V

9. a) Consider the following a loop statements :

```
for (i=1; i<=100; i=i+1)
{
    A[i] = A[i] + B[i]; /* S1 */
    B[i+1] = C[i] + D[i]; /* S2 */
}
```

What are the dependences between S1 and S2? Is this loop parallel? If not, show how to make it parallel.

- b) Explain the Intel IA-64 Instruction Set Architecture
10. a) Explain Trace Scheduling of global code scheduling approach  
b) Explain Functional Units and Instruction Issue of The Itanium 2 Processor

\*\*\*\*\*