

Q.1)

You are given a non-pipelined processor design which has a cycle time of 10ns and average CPI of 1.4. If the 5 stages are 1ns, 1.5ns, 4ns, 3ns, and 0.5ns, what is the best speedup you can get compared to the original processor?

Max Marks: 1

Correct Answer

**Solution:** (3.5)

Cycle time for the pipelined processor is = 4 ns (Max of all the stage delays).

CPI for the pipelined processor is = 1

Cycle time for the non-pipelined processor is = 10 ns

CPI for the Non-pipelined processor = 1.4

Speed up achieved due to pipeline is =  $1.4 * 10 / (1 * 4) = 3.5$

Q.2)

Consider a 64KB cache with 256 sets. Addresses are 32 bits. Tags are 19 bits.

Max Marks: 1

What is the number of bytes per block for this cache?

A 8

B 16

C 24

D 32

Correct Option

**Solution:** (D)

Given that 256 sets are in the cache and tag bits are, 32 bits are in the Address

$\log_2 256 = 8$  set index bits

Now we can figure out how many offset bits there are:

32 bits total - (19 tag bits + 8 index bits) = 5 offset bits

Now we know how many bytes per block there are:

$2^5 = 32$  bytes per block

Q.3)

Given a 32-bit address space and a 32Kbytes cache, determine the number of bits for each field of a memory reference used to access the cache i.e., the cache index field, the tag field, and the byte-select field respectively. for the following cache organizations 8-way Set associative cache with 16 byte blocks.

Max Marks: 1

A 20,8,4

B 8, 20, 4

Correct Option

**Solution:** (B)

Given a 32-bit address space and a 32Kbytes cache,  
each cache line size is 16B  $\Rightarrow$  4 bits for offset

Number of cache lines =  $32\text{KB}/16\text{B} = 2\text{K} = 2^{11}$

Number of sets =  $2^{11} / 8$  (8 way set associative) =  $2^8 = 8$  bits for set index

TAG bits =  $32 - 8 - 4 = 20$

| Tag | Index | offset |
|-----|-------|--------|
| 20  | 8     | 4      |

C 17, 11, 4

D 11, 17 4

Q.4)

Number of 512K  $\times$  8 memory chips and the size of the decoder required to realize a 8M  $\times$  32 RAM

Max Marks: 1

A 64, 6x64

B 64, 4x16

Correct Option

**Solution:** (B)

Number of memory chips required to realize 8Mx32 RAM  
 $= (8\text{M} \times 32) / (512\text{K} \times 8) = 64$

One word =32  
We need 4 512x8 Memory chips are realize one word.  
One row contains 4

C 32, 3x32

D None of these

Q.5)

Max Marks: 1

Suppose that in 1000 memory accesses, 40 misses in the first-level cache and 20 misses in the second-level cache. Assume the miss penalty from L2 cache to memory is 100 cycles, hit time of L2 cache is 10 cycles, the hit time of L1 cache is 1 cycle. Then the average memory access time is\_\_\_\_\_ (Assume that the local miss rate is considered).



Correct Answer

**Solution:** (3.4)

$$\text{Miss rate\_L1} = 40/1000 = 4\% = 0.04$$

$$\text{Local Miss rate\_L2} = 20/40 = 50\% = 0.5$$

$$\begin{aligned}\text{Average memory access time} &= \text{Hit time\_L1} + \text{Miss rate\_L1} * (\text{hit time\_L2} + \text{miss rate\_L2} * \text{miss penalty\_L2}) \\ &= 1 + 0.04 * (10 + 0.5 * 100) = 1 + (0.04 * 60) = 3.4 \text{ clock cycles}\end{aligned}$$

Q.6)

Max Marks: 1

The time delay of the four segments in the pipeline are as follows:

t1=50 ns, t2=30 ns, t3= 95 ns, and t4= 45 ns. The interface registers delay time t= 5 ns. How long would it take to add 100 pairs of numbers in the pipeline? (Measure the time in terms of Microseconds)



Correct Answer

**Solution:** (10.3)

The clock cycle time for the pipeline is the cycle time of the segment taking the longest time i.e. Segment 3

Therefore, Clock cycle = 95 + 5 = 100 ns (time for segment 3)

For n = 100, k = 4, tp = 100 ns.

Time to add 100 numbers =  $(k + n - 1) tp = (4 + 99) 100 = 10,300 \text{ ns} = 10.3 \mu\text{s}$

Q.7)

Max Marks: 1

Assume that you have a typical 5-stage pipelined processor that resolves branch instructions in the memory stage. Assume that the CPI of branch instructions is 4 cycles and their instruction frequency is 20%. The rest 80% instructions have an average CPI of 1.2 cycles. What is the average CPI?



Correct Answer

**Solution:** (1.76)

$$\begin{aligned}\text{Average CPI} &= 0.2 \times 4 + 0.8 \times 1.2 \\ &= 0.8 + 0.96 \\ &= 1.76 \text{ cycles}\end{aligned}$$

Q.8)

Max Marks: 1

Given the following memory values and a one-address machine with an accumulator: Word 20 contains 40

Word 30 contains 50

Word 40 contains 60

Word 50 contains 70

What value do the following instruction load into the accumulator?

Load INDIRECT 20



40



50



60

Correct Option

**Solution:** (c)

| Word | Data |
|------|------|
| 20   | 40   |
| 30   | 50   |
| 40   | 60   |
| 50   | 70   |

Given instruction Load INDIRECT 20

Memory location 20 contains 40 the address of the data.

M[40] = 60

70

Q.9)

On receiving an interrupt from an I/O device, the CPU

Max Marks: 1

A

Halts for a predetermined time

B

Branches off to the interrupt service routine after completion of the current instruction

Correct Option

Solution: (B)

CPU checks the status bit of interrupt at the completion of each current instruction running when there is an interrupt it service the interrupt using ISR

C

Branches off to the interrupt service routine immediately

D

Hands over control of address bus and data bus to the interrupting device

Q.10)

Max Marks: 1

The memory access time is 1 nsec for a read operation with a hit in cache, 5 nsec for a read operation with a miss in cache, 2 nsec for a write operation with a hit in cache, and 10 nsec for a write operation with a miss in cache. The execution of a sequence of instructions involves 100 instruction fetch operations, 60 memory operand read operations, and 40 memory operand write operations. The cache hit ratio is 0.9. The average memory access time (in nanoseconds) in executing the sequence of instructions is:

A

1.26

B

1.68

Correct Option

Solution: (B)

Total number of read =  $100 + 60 = 160$  Total number of write = 40. So, fraction of reads =  $160 / (160 + 40) = 0.8$  And, fraction of writes =  $40 / (160 + 40) = 0.2$  Average access time =  $0.8(0.9 \times 1 + 0.1 \times 5) + 0.2(0.9 \times 2 + 0.1 \times 10) = 1.68$

C

2.46

D

4.52

Q.11)

Max Marks: 2

Assume that:

- a) Pipeline contains 5 stages: IF, ID, EX, M and W;
- b) Each stage requires one clock cycle;
- c) All memory references hit in cache;
- d) Following program segment should be processed:

// ADD TWO INTEGER ARRAYS

```
LW R4 # 400  
L1: LW R1, 0 (R4) ; Load first operand  
     LW R2, 400 (R4) ; Load second operand  
     ADDI R3, R1, R2 ; Add operands  
     SW R3, 0 (R4) ; Store result  
     SUB R4, R4, #4 ; Calculate address of next element  
     BNEZ R4, L1 ; Loop if (R4) != 0
```

Calculate the number of clock cycles for one iteration of the loop, on the simple pipeline without forwarding or bypassing when result of the branch instruction (new PC content) is available after WB stage.

A

14

B

15

C

16

Correct Option

Solution: (C)

|                     | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---------------------|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----|----|
| I1: LW R1, 0 (R4)   | F | D | E | M | W |   |   |   |   |    |    |    |    |    |    |    |
| I2: LW R2, 400 (R4) |   | F | D | E | M | W |   |   |   |    |    |    |    |    |    |    |
| I3: ADDI R3, R1, R2 |   |   | F | - | - | D | E | M | W |    |    |    |    |    |    |    |
| I4: SW R3, 0 (R4)   |   |   |   |   | F | - | - | D | E | M  | W  |    |    |    |    |    |
| I5: SUB R4, R4, #4  |   |   |   |   |   |   | F | D | E | M  | W  |    |    |    |    |    |
| I6: BNEZ R4, L1     |   |   |   |   |   |   |   | F | - | -  | D  | E  | M  | W  |    |    |

D None of the above

Q.12)

Max Marks: 2

There are three computers A, B and C. A program P1 takes time 5, 20 and 100 respectively to run on the three computers. Similarly, another program P2 takes times 750, 75 and 15 respectively to run on the same three computers. Which computer is the fastest based on weighted average mean assuming the weights of programs P1 and P2 to be 40% and 60% respectively?

A Computer A

B Computer B

C Computer C

Correct Option

Solution: (C)

|    | Computer A | Computer B | Computer C |
|----|------------|------------|------------|
| P1 | 5          | 20         | 100        |
| P2 | 750        | 75         | 15         |

$$\text{Weighted Arithmetic Mean (WAM)} = \sum_{i=1}^n w_i t_i$$

$$\text{Computer A} = ((0.4 \times 5) + (0.6 \times 750)) / 2 = 226$$

$$\text{Computer B} = ((0.4 \times 20) + (0.6 \times 75)) / 2 = 26.5$$

$$\text{Computer C} = ((0.4 \times 100) + (0.6 \times 15)) / 2 = 24.5$$

D Cannot say.

Q.13)

Max Marks: 2

You have a 2-way set associative L1 cache that is 8KB, with 4-word cache lines. Writing and reading data to L2 takes 10 cycles. The following 32-bit address in hexadecimal will refer to the cache set number \_\_\_\_\_. Assume one word is 4B

0x3400

Solution: (64)

32-bit address  $\Rightarrow$  One word = 4B

Cache line size = 4 words = 16 bytes.

Offset size =  $\log_2(\text{cache line size}) = \log_2(16) = 4$  bits.

Given cache size is 2-way set associative and each cache line will store 16B

Number of Cache lines =  $8\text{KB} / 16\text{B} = 512$

Number of sets =  $512 / 2 = 256$

Index size =  $\log(\text{Number of sets}) = \log(256) = 8$  bits.

Tag size =  $32 - 4 - 8 = 20$  bits

| Tag  | Index | offset |
|------|-------|--------|
| 20   | 8     | 4      |
| 9-31 | 4-8   | 0-3    |

Given address in Hexadecimal 0x3400

00000000 00000000 0011 0100 0000 0000

First 4 bits are offset

Next 8 bits are Index 0100 0000 = 64th cache set

Q.14)

Max Marks: 2

Assume that you have a typical 5-stage pipelined processor that uses forwarding and stalls to solve data hazards. Assume also that the processor resolves branch instructions in the decode stage.

```
Loop: lw $t0, 0($s1)      # $t0=array element
      addu $t0, $t0, $s2    # add scalar in $s2
      sw $t0, 0($s1)        # store result
      addi $s1, $s1,-4       # decrement pointer
      bne $s1, $zero, Loop   # branch $s1!=0
```

Number of clock cycles are needed by this processor to execute one iteration of this loop

is \_\_\_\_\_

Correct Answer

Solution: (11)

|                       | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|-----------------------|---|---|---|---|---|---|---|---|---|----|----|----|
| lw \$t0, 0(\$s1)      | F | D | E | M | W |   |   |   |   |    |    |    |
| addu \$t0, \$t0, \$s2 | F | D | - | E | M | W |   |   |   |    |    |    |

|                           |  |   |   |   |   |   |   |   |   |  |
|---------------------------|--|---|---|---|---|---|---|---|---|--|
| sw \$t0, (\$s1)           |  | F | - | D | E | M | W |   |   |  |
| addi \$s1, \$s1,-4        |  |   |   | F | D | E | M | W |   |  |
| bne \$s1, \$zero,<br>Loop |  |   |   | F | D | - | E | M | W |  |

Q.15)

Max Marks: 2

Consider a computer with the following characteristics: total of 1 MB of main memory; word size of 1 byte; block size of 16 bytes; and cache size of 64 kB. For the main memory addresses of 01234 (in Hexa decimal), the corresponding cache line number (in decimal) is

Correct Answer

**Solution:** (291)Block size = 16 bytes  $\Rightarrow$  4 bits to represent offset/word

Cache size = 64KB

Number of cache lines =  $64\text{KB} / 16\text{B} = 4\text{K} = 2^{12} \Rightarrow 12$  bits are cache lines/IndexMain memory = 1MB  $\Rightarrow$  Number of bits in the Address is 20Number of Tag bits are =  $20 - 12 - 4 = 4$ 

| Tag   | Index | Offset |
|-------|-------|--------|
| 4     | 12    | 4      |
| 16-19 | 4-15  | 0-3    |

Given main memory address is 01234

Binary equivalent is (0000)  $\Rightarrow$  Tag (0001 0010 0011)  $\Rightarrow$  Index (0100)  $\Rightarrow$  Offset

Cache line number is 291

Q.16)

Max Marks: 2

The table below shows instruction-type breakdown for different programs. Using this data, you will be exploring the performance tradeoffs with different changes made to a MIPS processor.

| Program   | Compute | Load | Store | Branch |
|-----------|---------|------|-------|--------|
| Program 4 | 1500    | 300  | 100   | 100    |

Assuming that computes take 1 cycle, load and store instructions take 10 cycles, and branches take 3 cycles, find the execution time of this program on a 3 GHz MIPS processor. (Execution time in Microseconds)

A 1.93

Correct Option

**Solution:** (A)Program 4 cycles:  $1500 \times 1 + 300 \times 10 + 100 \times 10 + 100 \times 3 = 5800$  cyclesProgram 4 execution time:  $5800 \text{ cycles} * 1 \text{ second} / (3 \times 10^9) \text{ cycles} = 1.93 \times 10^{-6} \text{ s}$   
 $= 1.93 \mu\text{s}$ 

B 2

C 2.4

D None of these

Q.17)

Max Marks: 2

A computer has 256 K word memory. The instruction format has 4 fields i.e., Opcode, register field to represent one of the 60 processor registers, mode field represent one of 7 addressing modes and memory address field. How many instructions the system supports when a 32-bit instruction is placed in the one memory cell.

Correct Answer

**Solution:** (32)

instruction format (32-bit instruction):

| Opcode                                                                                                                                                                                                                                                                                                                             | Register | Addressing Mode | Memory Address |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|-----------------|----------------|
| A computer has 256 K word memory i.e. memory is word addressable.<br>60 processor registers will require 6 bits ( $60 < 2^6$ )<br>7 addressing modes will require 3 bits ( $7 < 2^3$ )<br>memory address field will require 18 bits<br>so opcode will have $32 - (6+3+18) = 5$<br>so maximum instructions the system supports = 32 |          |                 |                |

Q.18)

Max Marks: 2

Consider a two-level memory hierarchy with separate instruction and data caches in

**Consider a two-level memory hierarchy with separate instruction and data caches in level 1, and main memory in level 2. The clock cycle time in 1 ns. The miss penalty is 20 clock cycles for both read and write. 2% of the instructions are not found in I-cache, and 10% of data references not found in D-cache. 25% of the total memory accesses are for data, and cache access time (including hit detection) is 1 clock cycle. The average access time of the memory hierarchy will be ..... nanoseconds.**

Correct Answer

**Solution:** (1.76)

The clock cycle time in 1 ns.  
 The miss penalty is 20 clock cycles for both read and write.  
 2% of the instructions are not found in I-cache,  
 and 10% of data references not found in D-cache.  
 25% of the total memory accesses are for data, and cache access time (including hit detection) is 1 clock cycle.  
 75% of the memory accesses for the instruction  
 $\text{Average access time} = 0.75 (0.98 \times 1 + 0.02 \times 20) + 0.25 (0.90 \times 1 + 0.10 \times 20) = 1.76 \text{ ns}$

Q.19)

Max Marks: 2

Assume the classic 5-stage pipeline, with registers being written in the first half of WB and read in the last half of ID. There are two memory ports, one for instructions and one for data. Consider the following code:

L R9, 0(R5)  
 DADD R1, R9, R8  
 DSUB R2, R1, R9  
 DADD R3, R1, R2

How many cycles does the code take, from first fetch to last write-back, assuming no forwarding

Correct Answer

**Solution:** (14)

|                 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|-----------------|---|---|---|---|---|---|---|---|---|----|----|----|----|----|
| L R9, 0(R5)     | F | D | E | M | W |   |   |   |   |    |    |    |    |    |
| DADD R1, R9, R8 |   | F | D | - | - | E | M | W |   |    |    |    |    |    |
| DSUB R2, R1, R9 |   |   | F | - | - | D | - | - | E | M  | W  |    |    |    |
| DADD R3, R1, R2 |   |   |   | F | - | - | - | - | D | -  | -  | E  | M  | W  |

Q.20)

Max Marks: 2

. The 5 stages of the processor have the following latencies:

| Fetch | Decode | Execute | Memory | Write Back |
|-------|--------|---------|--------|------------|
| 300ps | 400ps  | 350ps   | 550ps  | 100ps      |

Assume that when pipelining, each pipeline stage costs 20ps extra for the registers between pipeline stages. Then the cycle time and the latency for the pipelined processor is in terms of picoseconds

A

570, 2850

Correct Option

**Solution:** (A)

Pipelining reduces the cycle time to the length of the longest stage plus the register delay. Latency becomes  $CT \times N$  where N is the number of stages as one instruction will need to go through each of the stages and each stage takes one cycle.

$$CT = 550 + 20 = 570 \text{ ps}$$

$$\text{Latency} = 5 \times 570 = 2850 \text{ ps}$$

B

1700, 1700

C

570, 570

D

None of these

close