

## II RISC DESIGN

- \* Application specific instruction processor :-  
 - States a balance between flexibility of general UP and speed of a circuit.
- Application-specific :- eg - Fast implementation of Fourier Transform for image processing.

### → RISC

- Small no. of instructions
- Only few addressing modes
- Simple design
- All ~~arithmetic~~ instructions other than <sup>register</sup> arithmetic, logical and LOAD/STORE instructions must use direct addressing ~~or~~ immediate.
- Only Load and Store instructions can access memory
- We can afford to have fixed length encoding.

eg - If there are 32 registers (5 bits),

| OP | OPR1 | OPR2 | DEST |
|----|------|------|------|
| 7  | 5    | 5    | 5    |

- Arithmetic & logical instructions can also use immediate addressing  
 Load/Store can use any addressing mode.
- If 32-bit ~~dt~~ instructions are decided,

Register direct (R)

| OP | OPR1 | OPR2 | DEST | Wasted |
|----|------|------|------|--------|
| 6  | 5    | 5    | 5    | .      |

Immediate (I)

| OP | OPR1 | DEST | 16-bit number |
|----|------|------|---------------|
| 6  | 5    | 5    |               |

No memory direct in RISC because too many bits required for memory address  
 Say all registers are ~~32-bit~~ n-bit. n could be 32

classmate

Date \_\_\_\_\_  
 Page \_\_\_\_\_

- Load/Store supports Base + Index and Base + Displacement.

$\text{LOAD } Rd, (\text{RSI}) \text{ RS2}$

 $\equiv Rd \leftarrow [\text{RSI} + \text{RS2}]$ 

'R-type instruction'

$\text{LOAD } Rd, (\text{RSI}) \text{ Imm}$

 $\equiv Rd \leftarrow [\text{RSI} + \text{Imm}]$ 

'I-type instruction'

Imm is 16-bit. Max displacement = 65,535

- 3-types of load/store instructions :- LW :- Load word (32-bits)  
 LH :- half-word (16-bits)  
 LB :- byte (8-bits)

### → Change in control flow instructions

We have studied branching on flag values

Other possibility - No flags at all. #different flavor

Condition is computed by same instruction and decision is taken accordingly

e.g Branch if not equal :-

bneq RSI, RS2, offset

 $\equiv \text{if } (\text{RSI} == \text{RS2}) \{$ 

PC = PC + 4

\} one instruction is 4 bytes

else \{

PC = ~~PC + offset \* 4~~ \$ PC + 4 + offset \* 4 \}

Jump 'offset' no. of instructions ahead.

offset is corresponding to next instruction, not current instruction

Similarly ~~the~~ Branch if equal beg.

SLE

Jump

J

loc  $\swarrow$  26 bit

Design new type of instruction :- J-type

|         |              |
|---------|--------------|
| OP<br>6 | Imm (26-bit) |
|---------|--------------|

$$PC = PC + 4 + loc * 4$$

- Register direct jumping is also possible :- JR R1 ..... R-type instruction  
 $PC \leftarrow R1$

- Calling a subroutine :- 'Jump and link' :- JAL loc  
- Uses J-type instruction.



$$R31 \leftarrow PC + 4$$

- Store location of next instruction in  
R31 - Define R31 as return register

$$PC = PC + 4 + loc * 4$$

- All jumping, offsetting is done w.r.t next instruction

- Jump and link to register :- JALR R1.

$$\equiv R31 \leftarrow PC + 4 \\ PC \leftarrow R1$$

- SLE :- Set if less than equal to

SLE R1 R2 R3

if ( $R1 \leq R2$ ) :  $R3 = 01$ else  $R3 = 00$ 

|                    |    |    |   |
|--------------------|----|----|---|
| SLE I              | R1 | R2 | I |
| if $R1 \leq I$ --- |    |    |   |

Similarly we have SLT, SLTI, SGE~~E~~, SGET, SGT, SGTE

- These instructions compensate for lack of flags.

→ Structured Instruction format.



- If  $x = 000000$  → R-type instruction.

OP is stored as last 6 bits (bits 27 to 32)

∴ 64 R-type instructions possible.

e.g.

- If  $x \neq 000000$  → I or J type instruction.

Total 63 I or J type instructions possible

4/10

- Add/Sub/NAND/Shift :- R-type

LW, SW, BEQ

J

:- I-type

→

:- J-type.

|        |     |      |        |    |
|--------|-----|------|--------|----|
| 000000 | RS1 | RS2  | DEST   | OP |
| OP     | RS1 | DEST | Imm 16 |    |
| OP     |     |      | Imm 26 |    |

- In RISC, there is almost no possibility of merging HKT with operational tasks, because all instructions do similar stuff and resources are already utilized.

- Either of HKT or OT can be performed first.

- We will do HKT first, unlike what we did in CISC

# different flavour



### Minimal Datapath

- 32 registers (programmer-accessible)
- Steering Logic can be point-to-point or broadcast
  - Point-to-point transport # different flavour



- Add/Subtract :-  $C_n = 0 \therefore$  Add
- $C_{in} = 1 \therefore$  Subtract (by 2's complement)

\* Add with carry is not supported

- Shifter :- 5 LSBs of B will tell  $\Rightarrow$  how much to shift A by.
- NOR :- Checks if output is 0 ( $\therefore NOR = 1$ )  
Z bit is NOT visible to programmer



All 'writing to registers' happens at clock edge



Demux writes to registers  
 $\text{if } F_n = 1$

Two muxes to read registers



### ► Flow chart



8/10

- Time-consuming tasks (only one possible (serially) in one state) :-  
ALU use, Register File RD/WR, Memory RD/WR.

clock

Cycle time =  $\text{Clock} \max(\text{time for individual states})$

- Hence we do not want to perform time-consuming tasks serially in one state.



Now, RS is 32-bit. Use sign extender to extend Imm|G to 32 bits  $\Rightarrow$  If MSB is 0, pre-append sixteen '0's  
'1's



→ STORE

HKT



S10

S1

OT



Store

S11

S2

RS in t1 and RD in t2



S12

S5



S13

S7

→ BEQ R1, R2, Imm16

if ( $R1 == R2$ ) {  $PC = PC + 4 + 4 * Imm16$  }

else {  $PC = PC + 4$  }

HKT

S14

Stores  $PC + 4$  in PC

S1

OT

S15



S2

S16



S8

S17



S9

Use some concoction of MUX to execute if statements

Shift left twice to multiply by 4

→ JUMP



HKT



OT



One state is blank for time to decode the instruction  
In other instructions, decoding was done in parallel



9/10

→ Level 2 flowchart :- Memory Access, ALU operation, current state, next state.  
• Merge ~~task~~ states doing similar tasks

Q: Note that second state of LOAD is a subset of second state of ADD  
which is same as second states of STORE and BEQ. ~~and JUMP~~  
Merge all these (S2, S6, ...)  
Even for LOAD, use same second state as ADD

- States 1 and 2 have become common to all instructions.
- The last state of each instruction has S1 as next state, which has S2 as next state.  
After S2, next state is different.
- We can merge S4 and S9, and choose between ir<sub>15-11</sub> and ir<sub>16-20</sub> using a MUX, at the cost of an additional control signal.
- Merge S12 in STORE and S7 in LOAD.

- Wherever there are multiple possible 'next states' ( $S_2, S_5, \text{etc}$ ), we require a decoder that decodes current instruction.
- We can even merge  $S_{10}$  and  $S_2$ , with a modification that PC will be updated only if the instruction decoder says that current instruction is indeed JUMP.  
This can be done because ALU is idle in  $S_2$ .



We will use a non-microcode style FSM #different flavour  
→ Point to point & communication links



Basically → Go over all states. Draw the links that you need.  
Use MUXes at whichever inputs you need.

11/10

→ Control signals

- 1 Memory :-  $\overline{RD}$ ,  $\overline{WR}$  (Active low)
  - 2 Register File :-  $WR$
  - 3 IR :- Enable bit
  - 4 MUX at memory input :- 1 bit
  - 5 " " T1 " :- 2 bits
  - 6 " " ALU input 1 :- 1 bit
  - 7 2 :- 1 bit
  - 8 SE output :- 1 bit
  - 9 Choose ALU operation :- 2 bits
  - 10 PC :- Enable bit ?
  - 11 Updating Z flag :- 1 bit
- 10 states encoded in 4 bits



→ Next-state logic

process (current state, decoder 1, decoder 2)

when ( $S_1$ )

next state =  $S_2$

when ( $S_2$ )

next state = decoder 1

:

:

when (others)

to ensure no sequential circuits. (tutorial)

next state =  $S_1$

→ Output Logic

process (

)

when ( $S_1$ )

// Specify ALL control signals.

,

,

,

→ 3<sup>rd</sup> process :- To instantiate flip flops.

15/01/10

\* Quiz

- Average CPI = Weighted mean of CPIs for different type of instructions

where weight = frequency of occurrence of each type of instructions,

- \* Say, memory R/W takes 3 ns
- ALU operation                                  2.5 ns
- RF R/W                                          2 ns

Then, clock cycle chosen should be at least 3 ns.

### → Single cycle implementation

- Wasn't possible in CISC (complex & variable-length instructions)
- Possible in RISC, because state transitions are simple and similar.

$$S \rightarrow T \rightarrow T' \rightarrow T'' \rightarrow S' \Rightarrow S \rightarrow S'$$

- Single cycle implementation :- CPI = 1

e.g. Arithmetic / Logical instructions

When we do entire instruction in single cycles, there cannot be storage in between because writing happens at clock edge

$$\begin{aligned} pc \rightarrow alu1\_a, mem\_a \\ +4 \rightarrow alu1\_b \end{aligned}$$

$$alu1\_out \rightarrow pc$$

$$mem\_d_{21-25} \rightarrow rf\_d1$$

$$mem\_d_{16-20} \rightarrow rf\_d2$$

$$rf\_d1 \rightarrow alu2\_a$$

$$rf\_d2 \rightarrow alu2\_b$$

$$alu2\_out \rightarrow rf\_d3$$

$$mem\_d_{11-15} \rightarrow rf\_d3$$

- Controller is no longer an FSM. It is a combinational logic that generates appropriate control signals based on value of opcode.



- Now, cycle time would be  $3 + 2 + 2.5 + 2 = 9.5 \text{ ns}$

Mem + alu 1 → RF read → alu-2 → RF write

- All signals will stabilize by  $9.5 \text{ ns}$ .
  - PC and RF will be updated at clock edge at the end of  $9.5 \text{ ns}$ .

eg LOAD

$\text{pc} \rightarrow \text{alu1\_a}, \text{mem1\_a}$

$+4 \rightarrow \text{alu1\_b}$

$\text{alu1\_out} \rightarrow \text{pc}$

$\text{mem1\_d}_{21-25} \rightarrow \text{rf\_a1}$

$\text{mem1\_d}_{0-15} \rightarrow \text{SE16} \rightarrow \text{alu2\_b}$

$\text{rf\_d1} \rightarrow \text{alu2\_a}$

$\text{alu2\_out} \rightarrow \text{mem2\_a}$

$\text{mem2\_d} \rightarrow \text{rf\_d3}$

$\text{mem1\_d}_{16-20} \rightarrow \text{rf\_a3}$

ALU3 is adder only

classmate

Date \_\_\_\_\_

Page \_\_\_\_\_

## STORE

pc → alu1\_a, mem1\_a

+4 → alu1\_b

alu1\_out → pc

mem1\_d<sub>21-25</sub> → rf\_a1

mem1\_d<sub>0-15</sub> → SF16 → alu2\_b

rf\_d1 → alu2\_a

alu2\_out → mem2\_a

mem1\_d<sub>16-20</sub> → rf\_a2

rf\_d2 → mem2\_d

## \* Control

eg BEQ R1, R2, Imm16

pc → mem1\_a, alu2\_a

+4 → alu1\_b

mem1\_d<sub>20-25</sub> → rf\_a1

mem1\_d<sub>16-20</sub> → rf\_a2

rf\_d1 → alu2\_a

rf\_d2 → alu2\_b

alu1\_out → alu3\_a

mem1\_d<sub>0-15</sub> → SF16 → alu3\_b → alu3\_out

if (alu3\_out == 0) { alu3\_out → pc }

else { alu1\_out → pc }

P.T.O.

eg JUMP

$pc \rightarrow mem1\_a,alu2\_a$

$+4 \rightarrow alu1\_b$

$alu1\_out \rightarrow alu2\_a$

$mem1\_d_{0-25} \rightarrow SE\ 26 \rightarrow 2S \rightarrow alu2\_b$

$alu2\_out \rightarrow pc$ .

→ Control Signals.

|                  | A | B | C | D | E | F | G | H | I | J | K | L | M |
|------------------|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0000000 (R type) | 0 | 0 | 0 | 1 | 1 |   |   |   | 0 | 1 |   |   |   |
| !                |   |   |   |   |   |   |   |   |   |   |   |   |   |

Simplify ~~to fast~~

eg D is always 1

→ Cycle Time

- LOAD takes most time (costliest path)

$$3 + 2 + 2.5 + 3 + 2 = 12.5\text{ ns}$$

↓      ↓      ↓      ↓      ↓  
 mem1  $\rightarrow$  rf      alu      mem2  $\rightarrow$  rf

→ Performance = # of instructions  $\times \underbrace{CPI}_{=1} \times$  Cycle time

- In single cycle implementation, performance does not depend on fractions of different type of instructions

- Performance :-
- |                      |       |
|----------------------|-------|
| Arithmetic / Logical | - 50% |
| Load                 | - 20% |
| Store                | - 10% |
| Branch BEq           | - 15  |
| Jump                 | - 5%  |

For multicycle, Performance =  $N \times \frac{1}{\text{Time}}$

$$\begin{aligned} &= N \times \frac{1}{12.45} \quad 4.15 \times 3 \text{ ns} \\ &= 12.45 \\ \text{single cycle} &= 12.5 \end{aligned}$$

- Depending on fractions of instruction types, either of single cycle or multicycle implementation.
- In spite of fewer registers, single-cycle is more expensive overall in terms of hardware.
- Single cycle :- Lowest possible CPI (1)
- Multi cycle cycle time.
- Possible to strike a balance :- eg - 2 cycle implementation.

### → Pipelining.

- Begin next instruction before previous is over, using the idle resources.
- In RISC, subtasks in every instruction are very similar - fetch instruction, fetch operands, alu operation, etc.
- It is possible to get least possible cycle time (like multicycle)
  - If we assume all instructions to be independent, then we can do pipelining and then we will get cycle time equal to  $\max \{ \text{sub-task time} \}$

## → Subtasks

- 1 R-type :- Instruction fetch, Instruction Decode / Operand Read, Execute, Write Back
- 2 LOAD :- IF, ID/OR, EX, Memory Read, WB
- 3 Store :- IF, ID/OR, EX, Mem
- 4 BEQ :- IF, ID/OR, EX
- 5 JUMP :- IF, ID

↓  
exhaustive set of  
sub-tasks

- First 5 cycles, no instruction will be completed.  
After that, one instruction will be completed, every cycle.  
(for independent instructions and if instructions are consecutive from top to bottom (no control flow changes in between))

$$\rightarrow \text{Performance} = N \times \underbrace{1}_{\text{CPI}} \times \underbrace{3\text{ns}}_{\text{Cycle time}}$$