

4.7. (5)

Para a Adição se fizer as 2 unidades de Add quantas novas máscaras de Zeroput?

~~Add1~~ ~~Reg1~~  
~~Add2~~ ~~Reg2~~

~~Sign ALU = sign extend~~ 8 - 3

~~Reg1~~ ~~Reg2~~

ADD Branch = Reg 4

Shift left 2

ADDPe = Reg 4

4.7.6

Pol. das Reg. do

R R1 - 3

R R2 - 2

Writeback - entra o valor zero logo a 0 (2)

~~Reg Write~~ (1) & control dos regitros a 0

Write Data:

Pode ser uma caixa ou outra por o valor que vem do multiplexor

e' Don't Care na maioria das SW

4.7.5

4.7.6



## Exercícios 4ª aula



**FIGURE 4.17 The simple datapath with the control unit.** The input to the control unit is the 6-bit opcode field from the outputs of the control unit consist of three 1-bit signals that are used to control multiplexors (RegDst, ALUSrc, and MemWrite), signals for controlling reads and writes in the register file and data memory (RegWrite, MemRead, and MemWrite), a 1-bit determining whether to possibly branch (Branch), and a 2-bit control signal for the ALU (ALUOp). An AND gate is used to branch control signal and the Zero output from the ALU; the AND gate output controls the selection of the next PC. Notice that a derived signal, rather than one coming directly from the control unit. Thus, we drop the signal name in subsequent figures.

**4.3** When processor designers consider a possible improvement to the processor datapath, the decision usually depends on the cost/performance trade-off. In the following three problems, assume that we are starting with a datapath from Figure 4.2, where I-Mem, Add, Mux, ALU, Regs, D-Mem, and Control blocks have latencies of 400 ps, 100 ps, 30 ps, 120 ps, 200 ps, 350 ps, and 100 ps, respectively, and costs of 1000, 30, 10, 100, 200, 2000, and 500, respectively.

Consider the addition of a multiplier to the ALU. This addition will add 300 ps to the latency of the ALU and will add a cost of 600 to the ALU. The result will be 5% fewer instructions executed since we will no longer need to emulate the MUL instruction.

- 4.3.1** [10] <§4.1> What is the clock cycle time with and without this improvement?
- 4.3.2** [10] <§4.1> What is the speedup achieved by adding this improvement?



**FIGURE 4.24** The simple control and datapath are extended to handle the jump instruction. An additional multiplexor is used to choose between the jump target and either the branch target or the sequential instruction following! multiplexor is controlled by the jump control signal. The jump target address is obtained by shifting the lower 26 bits of the jump left 2 bits, effectively adding 00 as the low-order bits, and then concatenating the upper 4 bits of PC + 4 as the high-order bits, 32-bit address.

**4.7** In this exercise we examine in detail how an instruction is executed in a single-cycle datapath. Problems in this exercise refer to a clock cycle in which the processor fetches the following instruction word:

101011000110001000000000000010100.

Assume that data memory is all zeros and that the processor's registers have the following values at the beginning of the cycle in which the above instruction word is fetched:

| r0 | r1 | r2 | r3 | r4 | r5 | r6 | r8 | r12 | r31 |
|----|----|----|----|----|----|----|----|-----|-----|
| 0  | -1 | 2  | -3 | -4 | 10 | 6  | 8  | 2   | -16 |

**4.7.1** [5] <§4.4> What are the outputs of the sign-extend and the jump "Shift left 2" unit (near the top of Figure 4.24) for this instruction word?

**4.7.2** [10] <§4.4> What are the values of the ALU control unit's inputs for this instruction?

**4.7.3** [10] <§4.4> What is the new PC address after this instruction is executed? Highlight the path through which this value is determined.

**4.7.4** [10] <§4.4> For each Mux, show the values of its data output during the execution of this instruction and these register values.

**4.7.5** [10] <§4.4> For the ALU and the two add units, what are their data input values?

**4.7.6** [10] <§4.4> What are the values of all inputs for the "Registers" unit?

- 4.X.** Considerando a figura 4.17, assuma que vai implementar duas novas instruções
- ja.l (jump and link)
  - seq (set if equal)

Para cada uma, diga:

- a) quais a unidades funcionais existentes usadas pela instrução
- b) que unidades funcionais adicionais são necessárias
- c) que novos sinais de controlo são necessários
- d) quais os valores de todos os sinais de controlo (para a saída do ALU control, bas a operação efectuada pela ALU)

## Exercícios 5<sup>a</sup> Aula

**4.8** In this exercise, we examine how pipelining affects the clock cycle time of the processor. Problems in this exercise assume that individual stages of the datapath have the following latencies:

| IF    | ID    | EX    | MEM   | WB    |
|-------|-------|-------|-------|-------|
| 250ps | 350ps | 150ps | 300ps | 200ps |

Also, assume that instructions executed by the processor are broken down as follows:

| alu | beq | lw  | sw  |
|-----|-----|-----|-----|
| 45% | 20% | 20% | 15% |

**4.8.1** [5] <§4.5> What is the clock cycle time in a pipelined and non-pipelined processor?

**4.8.2** [10] <§4.5> What is the total latency of an LW instruction in a pipelined and non-pipelined processor?

**4.8.3** [10] <§4.5> If we can split one stage of the pipelined datapath into two new stages, each with half the latency of the original stage, which stage would you split and what is the new clock cycle time of the processor?

**4.8.4** [10] <§4.5> Assuming there are no stalls or hazards, what is the utilization of the data memory?

**4.8.5** [10] <§4.5> Assuming there are no stalls or hazards, what is the utilization of the write-register port of the “Registers” unit?

Calcular o CPI da implementação pipelined (5 andares) para

- 1 instrução
- $10^6$  instruções (sem atrasos)
- $10^6$  instruções, com uma em cada 5 instruções a ser atrasada um ciclo.

Para o código abaixo:

+ Identifique todas as dependências (de dados) existentes.

(Cuidado com os saltos.)

+ Identifique os conflitos de dados.

+ Introduza os *nop* necessários para eliminar os conflitos num *pipeline* sem *forwarding*.

+ Introduza os *nop* necessários para eliminar conflitos no *pipeline* com *forwarding*.

```
1:      or $3, $0, $0
2:      or $8, $5, $0
3:  início: beq $8, $0, fim
4:      addiu $4, $4, -4
5:      lw $9, 0($4)
6:      add $3, $3, $9
7:  beq $0, $0, início
8:      addi $8, $8, -1
9:  fim: sub $3, $7, $3
```

Última alteração: Segunda, 12 Outubro 2015, 20:13

Exercícios

Exercícios

SLUA 2 Sóloci



4.8

|        | IF     | ID     | Ex | MEM | WB |
|--------|--------|--------|----|-----|----|
| 250 Ps | 350 Ps | 150 Ps |    |     |    |
|        |        |        | WB |     |    |

 $\rightarrow$  Pipelined

| IF | ID | Ex | MEM | WB |
|----|----|----|-----|----|
| IF | ID | Ex | MEM | WB |

| IF | ID | Ex | MEM | WB |
|----|----|----|-----|----|
|    |    |    |     |    |

| IF | ID | Ex | MEM | WB |
|----|----|----|-----|----|
|    |    |    |     |    |

4.8.1 non pipelined :  $250 + 350 + 150 + 300 + 200 = 1250 \text{ Ps}$

pipelined : 350

4.8.2

non pipelined : 1250 Ps

pipelined :  $5 \times 350 = 1750$ 

4.8.3

| IF  | ID  | Ex  | MEM | WB  |
|-----|-----|-----|-----|-----|
| 250 | 175 | 175 | 150 | 200 |
|     |     |     |     | 200 |

6 stages de clocking, non overlapping  
frame come

some fan out de 300 Ps

4.8.4.

| ALU | Req | LW  | SW  |
|-----|-----|-----|-----|
| 45% | 20% | 20% | 15% |

$LW + SW = \boxed{35\%}$   
 $\rightarrow$  memoria + memoria  $\rightarrow$   $LW + SW$

operar en que utilizaran memoria?  $R: > 20\% \rightarrow LW + SW$

4.8.5 operar en instrucción que tienen más register?  
 $R: \cancel{ALU} \rightarrow ALU \& LW = \boxed{65\%}$

### Exercício para estimar o EPI

Exercício para estimar o EPI da implementação filhada (Sandes)

Kona:

- ① - 1 instruções
- ② - 16 instruções (nem achar)
- ③ - 16 instruções / com uma nova cada S inturções a ser acharada num c-cdor

$$EPI = \frac{\text{m c-cdor}}{\text{m instruções}}$$



$$A) = EPI = \frac{5}{1} = 5 // \text{ necessitas } 5 \text{ c-cdor para executar } 1 \text{ instruções}$$

$$B) . EPI = \frac{10^6 + 4}{10^6} \approx 1 //$$

Organiza que é sólido de  
pelo lado do lado de  
toda os outros só iniciam

$$C) EPI = \frac{10^6 + 4 + \left(\frac{10^6}{5}\right)}{10^6} \approx 1,2 //$$

Para o código seguinte, identifique todos os dependências de dados

existentes. (o resultado deve ser 01 lista)

Dependências:

Código

on \$3, f0, f0

on \$8, \$5, f0

3: ~~add~~

[Início]

add \$4, f4, -4

lw \$9, 0(\$4)

add \$3, f3, \$9

beq \$0, f0, Início

add \$8, f8, -1

sub \$3, f2, f3

End:

a)  $3 \rightarrow 2 : \$8$

dependência de registro \$8 para \$9 instâncias

3 -> 2 : \$8

5 -> 4 : \$4

6 -> 5 : \$9

8 -> 2 : \$8

3 -> 8 : \$8

6 -> 1 : \$3

Dados e referências locais existem num

dependência dentro da própria 6 -> 6 = \$3

8 -> 8 : \$8

9 -> 1 : \$3

9 -> 6 : \$2

4 -> 4 : \$4

| 3 -> 2 em Cálculos 2 Nop |     |     |     |     |
|--------------------------|-----|-----|-----|-----|
| IF                       | ID  | EX  | MEM | WB  |
| Not                      |     |     |     |     |
|                          | Not |     |     |     |
|                          |     | Beq |     | IF  |
|                          |     |     |     | \$8 |

Q) Identifique os conflictos de dados  
há -

$5 \rightarrow 4$  mechanism of Not

addition



$6 \rightarrow 5$

2 Not

QW



~~3 → 8 2 Not~~

Req.



addi

(c) introduction or not mechanism have eliminated or completed  
num pipeline from forwarding

Q) Apresente a estrutura do endereço de e-mail com forwarding durante a execução de código, sendo a instrução executada 2 vezes. Indique todos os valores individuais e todos os valores que são necessários para formar o argumento de argumento.



Considerando 100% de os cíclios São 12 da imunidade festa ou  
área anterior / calcule a percentagem dos cíclos de religião em  
que todos os andares do prédio São efetivamente usados por  
algum intrusão.

o speellogen hi erdo quo min  
o sifk bane m titel huk el gno%

~~Contra o men cedentia de  
p a 112 Sotofia Comitatu de  
f i sime logo re  
h 5 a h 100 Sotofia ga neg.  
andando a foz que ha 1.21  
h 100 re ad logo 8~~

(g)

Nosso gato colo de modo a que a sua cauda no topo das  
nas a sardinha só praqueles olhos.

Reprise

+ robar a Linda 6 filh. Linda 8

4.8

Tolerância - Tensão de operação da porta de saída para dr.

Ripple



- 16 bits Tensão de saída 350 PS logo só o bloco de saída precisa ser um implementado Ripple

→ Para implementações não Ripple não é normal 1250 PS é só usar o bloco de saída

$$\text{Iw} / \text{máx Ripple} = \cancel{\text{Iw}}$$

e Ripple

Máx resistência de curto circuito

$\leq 350 \text{ PS}$  é uma exceção

Para melhoria só o Ripple só é necessário o ID que é 2 x 10 mAh tensão ~~maior~~ a de Ripple e menor a de HCM melhora a performance

Só é utilizada de ID que é na festa com hardware

$$\text{Speedup} = \frac{\text{Performance of CPU}_A}{\text{Performance of CPU}_B}$$

$$T = \frac{n \cdot \text{cycles}}{f} = \frac{n \cdot \text{instructions} \times \text{CPI}}{\text{CPI}}$$

$\Rightarrow n \cdot \text{instructions} \times \text{CPI} \propto T$

| abs | freq | W   | SW  |
|-----|------|-----|-----|
| 45% | 20%  | 20% | 15% |

$$\text{CPI} = 1 \text{ instruction} = 5 \text{ cycles}$$

$$\text{CPI} = 10^6 + 4 \text{ cycles} = \text{CPI} = 1$$

$$\text{CPI} = 10^6 \times 1 \text{ instruction} = 5 \text{ instructions} \text{ at once} = 1 \text{ cycle} =$$

$$10^6 \times 10^2 + 4 =$$

$\therefore \text{CPI} = 1,2$

| 1  | 1 m<br>Instructions | cycles                 |
|----|---------------------|------------------------|
| 1  |                     | 5 cycles               |
| 2  |                     | $5 + 1 \text{ cycle}$  |
| 3  |                     | $5 + 2 \text{ cycles}$ |
| 4  |                     | $5 + 3 \text{ cycles}$ |
| 5  |                     | $5 + 4 \text{ cycles}$ |
| 6  |                     | $5 + 4 \text{ cycles}$ |
| 7  |                     | $5 + 4 \text{ cycles}$ |
| 8  |                     | $5 + 4 \text{ cycles}$ |
| 9  |                     | $5 + 4 \text{ cycles}$ |
| 10 |                     | $5 + 4 \text{ cycles}$ |



Algoritmo 2

Null-forwarding met de informatie dat we starten van A

A

|   | 1  | 2  | 3  | 4   | 5  | 6 | 7 | 8 | 9 | 10 |
|---|----|----|----|-----|----|---|---|---|---|----|
| A | IF | ID | EX | MEM | WB |   |   |   |   |    |
| B | IF | ID | EX | MEM | WB |   |   |   |   |    |
| C | IF | ID | EX | MEM | WB |   |   |   |   |    |

Algoritmo 3

So' Mem - EX (on Alu - AL)

So' forwarding met

Mem & EX

|   | 1  | 2  | 3  | 4   | 5  | 6 | 7 | 8 |
|---|----|----|----|-----|----|---|---|---|
| A | IF | ID | EX | MEM | WB |   |   |   |
| B | IF | ID | EX | MEM | WB |   |   |   |
| C | IF |    |    |     |    |   |   |   |

Note forwarding

maar om conflicterende dingen

## Exercícios 6ª Aula

(Continua da aula anterior)

Na resolução deste exercício, assuma que a decisão sobre se um salto condicional é ou não efectuado é tomada no andar EX e que é usado um delay slot. Para o código abaixo:

1. or \$3, \$0, \$0
2. or \$8, \$5, \$0
3. inicio: beq \$8, \$0, fim
4. addiu \$4, \$4, -4
5. lw \$9, 0(\$4)
6. add \$3, \$3, \$9
7. beq \$0, \$0, inicio
8. addi \$8, \$8, -1
9. fim: sub \$3, \$7, \$3

- a) Identifique todas as dependências (de dados) existentes. (Cuidado com os saltos.)
- b) Identifique os conflitos de dados.
- c) Introduza os nops necessários para eliminar os conflitos num pipeline sem forwarding.
- d) Introduza os nops necessários para eliminar conflitos no pipeline com forwarding.
- e) Apresente a evolução do estado do pipeline com forwarding durante a execução do código, sendo a instrução 3 executada 2 vezes. Indique todos os atrasos introduzidos e todos os pontos onde foi necessário fazer forwarding de algum valor.
- f) Considerando somente os ciclos 5 a 12 da simulação feita na alínea anterior, calcule a percentagem dos ciclos de relógio em que todos os andares do pipeline são efectivamente usados por alguma instrução.
- g) Para o ciclo da execução em que o conteúdo do pipeline é o indicado abaixo, qual é o valor dos sinais de controlo usados em cada um dos andares? (ver figura-04-51 dos recursos)

| Andar | Instrução          |
|-------|--------------------|
| IF    | add \$3, \$3, \$9  |
| ID    | lw \$9, 0(\$4)     |
| EX    | addiu \$4, \$4, -4 |
| MEM   | beq \$8, \$0, fim  |
| WB    | or \$8, \$5, \$0   |

- h) Modifique o código de modo a que a sua execução se possa dar sem a introdução de qualquer atraso.

Última alteração: Terça, 20 Outubro 2015, 19:30

### Exercícios 7a Aula

**4.10** In this exercise, we examine how resource hazards, control hazards, and Instruction Set Architecture (ISA) design can affect pipelined execution. Problems in this exercise refer to the following fragment of MIPS code:

```
sw    r16, 12(r6)
lw    r16, 8(r6)
beq r5, r4, label1 # Assume r5!=r4
add r5, r1, r4
slt r5, r15, r4
```

Assume that individual pipeline stages have the following latencies:

| IF    | ID    | EX    | MEM   | WB    |
|-------|-------|-------|-------|-------|
| 200ps | 120ps | 150ps | 100ps | 100ps |

**4.10.1** [10] <§4.5> For this problem, assume that all branches are perfectly predicted (this eliminates all control hazards) and that no delay slots are used. If we only have one memory (for both instructions and data), there is a structural hazard every time we need to fetch an instruction in the same cycle in which another instruction accesses data. To guarantee forward progress, this hazard must always be resolved in favor of the instruction that accesses data. What is the total execution time of this instruction sequence in the 5-stage pipeline that only has one memory? We have seen that data hazards can be eliminated by adding `nops` to the code. Can you do the same with this structural hazard? Why?

**4.10.2** [20] <§4.5> For this problem, assume that all branches are perfectly predicted (this eliminates all control hazards) and that no delay slots are used. If we change load/store instructions to use a register (without an offset) as the address, these instructions no longer need to use the ALU. As a result, MEM and EX stages can be overlapped and the pipeline has only 4 stages. Change this code to accommodate this changed ISA. Assuming this change does not affect clock cycle time, what speedup is achieved in this instruction sequence?

**4.10.3** [10] <§4.5> Assuming stall-on-branch and no delay slots, what speedup is achieved on this code if branch outcomes are determined in the ID stage, relative to the execution where branch outcomes are determined in the EX stage?

**4.10.4** [10] <§4.5> Given these pipeline stage latencies, repeat the speedup calculation from 4.10.2, but take into account the (possible) change in clock cycle time. When EX and MEM are done in a single stage, most of their work can be done in parallel. As a result, the resulting EX/MEM stage has a latency that is the larger of the original two, plus 20 ps needed for the work that could not be done in parallel.

**4.10.5** [10] <§4.5> Given these pipeline stage latencies, repeat the speedup calculation from 4.10.3, taking into account the (possible) change in clock cycle time. Assume that the latency ID stage increases by 50% and the latency of the EX stage decreases by 10ps when branch outcome resolution is moved from EX to ID.

Pipeline MIPS 2-issue, resolução dos branches no EX, previsão perfeita de saltos e sem delay slot uma instrução do issue packet pode ser atrasada em relação à outra, mas entram sempre juntas instruções de cada vez no pipeline

única restrição no issue packet: não podem ser duas instruções de acesso à memória

1. executar uma iteração do ciclo no código (ie, evolução do pipeline e indicação dos forwardings):

