

# Introduction to Computer Organization \_ Basic Concept

## Computer Organization and Architecture

- Five classic components of a computer – input, output, memory, datapath, and control



## The Instruction Set: a Critical Interface



## ISA (Instruction Set Architecture)

- ISA, or simply architecture – the abstract interface between the hardware and the lowest level software that encompasses all the information necessary to write a machine language program, including instructions, registers, memory access, I/O, ...
  - Enables implementations of varying cost and performance to run identical software
- The combination of the basic instruction set (the ISA) and the operating system interface is called the application binary interface (ABI)
  - ABI – The user portion of the instruction set plus the operating system interfaces used by application programmers. Defines a standard for binary portability across computers.

- "... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flow and controls, the logic design, and the physical implementation."  
- Amdahl, Blaauw, and Brooks, 1964
- ISA Includes
  - Organization of storage
  - Data types
  - Encoding and representing instructions
  - Instruction Set (or opcodes)
  - Modes of addressing data items/instructions
  - Program visible exception handling
- Specifies requirements for binary compatibility across implementations (ABI)

- A very important abstraction
  - interface between hardware and low-level software
  - standardizes instructions, machine language bit patterns, etc
  - advantage: *different implementations of the same architecture*
  - disadvantage: *sometimes prevents using new innovations*
- Common instruction set architectures:
  - IA-32, PowerPC, MIPS, SPARC, ARM, and others

# Introduction to Computer Organization \_ MIPS Instruction and Datapath

## MIPS Instruction and Format

| MIPS instructions                | Name  | Format | Pseudo MIPS         | Name  | Format |
|----------------------------------|-------|--------|---------------------|-------|--------|
| add                              | add   | R      | move                | move  | R      |
| subtract                         | sub   | R      | multiply            | mult  | R      |
| add immediate                    | addi  | I      | multiply immediate  | multi | I      |
| load word                        | lw    | I      | load immediate      | li    | I      |
| store word                       | sw    | I      | branch less than    | blt   | I      |
| load half                        | lh    | I      | branch less than    |       |        |
| load half unsigned               | lhu   | I      | or equal            | ble   | I      |
| store half                       | sh    | I      | branch greater than | bgt   | I      |
| load byte                        | lb    | I      | branch greater than |       |        |
| load byte unsigned               | lbu   | I      | or equal            | bge   | I      |
| store byte                       | sb    | I      |                     |       |        |
| load linked                      | ll    | I      |                     |       |        |
| store conditional                | sc    | I      |                     |       |        |
| load upper immediate             | lui   | I      |                     |       |        |
| and                              | and   | R      |                     |       |        |
| or                               | or    | R      |                     |       |        |
| nor                              | nor   | R      |                     |       |        |
| and immediate                    | andi  | I      |                     |       |        |
| or immediate                     | ori   | I      |                     |       |        |
| shift left logical               | sll   | R      |                     |       |        |
| shift right logical              | srl   | R      |                     |       |        |
| branch on equal                  | beq   | I      |                     |       |        |
| branch on not equal              | bne   | I      |                     |       |        |
| set less than                    | slt   | R      |                     |       |        |
| set less than immediate          | slti  | I      |                     |       |        |
| set less than immediate unsigned | sltiu | I      |                     |       |        |
| jump                             | j     | J      |                     |       |        |
| jump register                    | jr    | R      |                     |       |        |
| jump and link                    | jal   | J      |                     |       |        |

## Instruction and Datapath

|                   | Instruction name              | ARM                | MIPS         |
|-------------------|-------------------------------|--------------------|--------------|
| Register-register | Add                           | add                | addu, addiu  |
|                   | Add (trap if overflow)        | adds; swivs        | add          |
|                   | Subtract                      | sub                | subu         |
|                   | Subtract (trap if overflow)   | subs; swivs        | sub          |
|                   | Multiply                      | mul                | mult, multu  |
|                   | Divide                        | —                  | div, divu    |
|                   | And                           | and                | and          |
|                   | Or                            | orr                | or           |
|                   | Xor                           | eor                | xor          |
|                   | Load high part register       | —                  | lui          |
|                   | Shift left logical            | lsl <sup>1</sup>   | sllv, sll    |
|                   | Shift right logical           | lsr <sup>1</sup>   | srlv, srl    |
| Data transfer     | Shift right arithmetic        | asr <sup>1</sup>   | sra, sra     |
|                   | Compare                       | cmp, cmn, tst, teq | slt/i,slt/iu |
|                   | Load byte signed              | ldrsb              | lb           |
|                   | Load byte unsigned            | ldrb               | lbu          |
|                   | Load halfword signed          | ldrsh              | lh           |
|                   | Load halfword unsigned        | ldrh               | lhu          |
|                   | Load word                     | ldr                | lw           |
|                   | Store byte                    | strb               | sb           |
|                   | Store halfword                | strh               | sh           |
|                   | Store word                    | str                | sw           |
|                   | Read, write special registers | mrs, msr           | move         |
|                   | Atomic Exchange               | swp, swpb          | ll;sc        |

**FIGURE 2.32 ARM register-register and data transfer instructions equivalent to MIPS core.** Dashes mean the operation is not available in that architecture or not synthesized in a few instructions. If there are several choices of instructions equivalent to the MIPS core, they are separated by commas. ARM includes shifts as part of every data operation instruction, so the shifts with superscript 1 are just a variation of a move instruction, such as  $lsl^1$ . Note that ARM has no divide instruction.

# Introduction to Computer Organization \_ MIPS (RISC) Design Principle

1. Simplicity favors regularity

## Arithmetic Operations

- Add and subtract, three operands

- | Two sources and one destination

```
add a, b, c # a gets b + c
```

- All arithmetic operations have this form

- *Design Principle 1: Simplicity favors regularity*

- | Regularity makes implementation simpler
  - | Simplicity enables higher performance at lower cost

2. Smaller is faster

## Register Operands

- Arithmetic instructions use register operands

- MIPS has a  $32 \times 32$ -bit register file
  - | Use for frequently accessed data
  - | Numbered 0 to 31
  - | 32-bit data called a "word"

- Assembler names

- | \$t0, \$t1, ..., \$t9 for temporary values
  - | \$s0, \$s1, ..., \$s7 for saved variables

- *Design Principle 2: Smaller is faster*

- | c.f. main memory: millions of locations

4. Good Design demands good compromises

## Branch Instruction Design

- Why not include `blt`, `bge`, etc in the MIPS ISA?

- Hardware for  $<$ ,  $\geq$ , ... slower than  $=$ ,  $\neq$

- | Combining with branch involves more work per instruction, requiring a slower clock
  - | All instructions would be penalized!

- `beq` and `bne` are the common case (Principle #3)

- This is a good design compromise (Principle #4)

3. Make the common case fast

## Immediate Operands

- Constant data specified in an instruction

```
addi $s3, $s3, 4
```

- No subtract immediate instruction

- | Just use a negative constant

```
addi $s2, $s1, -1
```

- *Design Principle 3: Make the common case fast*

- | Small constants are common (50% of MIPS arithmetic instructions in SPEC2006 use constants!)
  - | Immediate operand avoids a load instruction

# Introduction to Computer Organization \_ MIPS Addressing Mode

## MIPS Instruction Format

| Name     | Bit Fields |                     |        |                        |        |                 | Notes                         |
|----------|------------|---------------------|--------|------------------------|--------|-----------------|-------------------------------|
|          | 6 bits     | 5 bits              | 5 bits | 5 bits                 | 6 bits | (32 bits total) |                               |
| R-Format | op         | rs                  | rt     | rd                     | shmt   | funct           | Arithmetic, logic             |
| I-format | op         | rs                  | rt     | address/immediate (16) |        |                 | Load/store, branch, immediate |
| J-format | op         | target address (26) |        |                        |        |                 | Jump                          |

## Popularity

| Instruction class  | MIPS examples                     | HLL correspondence                                   | Frequency |         |
|--------------------|-----------------------------------|------------------------------------------------------|-----------|---------|
|                    |                                   |                                                      | Integer   | Ft. pt. |
| Arithmetic         | add, sub, addi                    | Operations in assignment statement s                 | 16%       | 48%     |
| Data transfer      | lw, sw, lb, lh, lhu, sb, lui      | References to data structures, such as arrays        | 35%       | 36%     |
| Logical            | and, or, nor, andi, ori, sll, srl | Operations in assignment statement s                 | 12%       | 4%      |
| Conditional branch | beq, bne,slt, slti, sltiu         | If statements and loops                              | 34%       | 8%      |
| Jump               | j, jr, jal                        | Procedure calls, returns, and case/switch statements | 2%        | 0%      |

## MIPS Addressing Mode

### 1. Immediate addressing



### 2. Register addressing



### 3. Base addressing



### 4. PC-relative addressing



### 5. Pseudodirect addressing



## MIPS Instruction Category

| Category           | Instruction                      | Example             | Meaning                                 | Comments                              |
|--------------------|----------------------------------|---------------------|-----------------------------------------|---------------------------------------|
| Arithmetic         | add                              | add \$s1,\$s2,\$s3  | \$s1 = \$s2 + \$s3                      | Three register operands               |
|                    | subtract                         | sub \$s1,\$s2,\$s3  | \$s1 = \$s2 - \$s3                      | Three register operands               |
|                    | add immediate                    | addi \$s1,\$s2,20   | \$s1 = \$s2 + 20                        | Used to add constants                 |
|                    | load word                        | lw \$s1,20(\$s2)    | \$s1 = Memory[\$s2 + 20]                | Word from memory to register          |
|                    | store word                       | sw \$s1,20(\$s2)    | Memory[\$s2 + 20] = \$s1                | Word from register to memory          |
|                    | load half                        | lh \$s1,20(\$s2)    | \$s1 = Memory[\$s2 + 20]                | Halfword memory to register           |
|                    | load half unsigned               | lhu \$s1,20(\$s2)   | \$s1 = Memory[\$s2 + 20]                | Halfword memory to register           |
|                    | store half                       | sh \$s1,20(\$s2)    | Memory[\$s2 + 20] = \$s1                | Halfword register to memory           |
|                    | load byte                        | lb \$s1,20(\$s2)    | \$s1 = Memory[\$s2 + 20]                | Byte from memory to register          |
|                    | load byte unsigned               | lbu \$s1,20(\$s2)   | \$s1 = Memory[\$s2 + 20]                | Byte from memory to register          |
| Data transfer      | store byte                       | sb \$s1,20(\$s2)    | Memory[\$s2 + 20] = \$s1                | Byte from register to memory          |
|                    | load linked word                 | l1 \$s1,20(\$s2)    | \$s1 = Memory[\$s2 + 20]                | Load word as 1st half of atomic swap  |
|                    | store condition. word            | sc \$s1,20(\$s2)    | Memory[\$s2+20]=1:\$s1=0 or 1           | Store word as 2nd half of atomic swap |
|                    | load upper immmed.               | lui \$s1,20         | \$s1 = 20 * 2 <sup>16</sup>             | Loads constant in upper 16 bits       |
|                    | and                              | and \$s1,\$s2,\$s3  | \$s1 = \$s2 & \$s3                      | Three reg. operands; bit-by-bit AND   |
|                    | or                               | or \$s1,\$s2,\$s3   | \$s1 = \$s2   \$s3                      | Three reg. operands; bit-by-bit OR    |
|                    | nor                              | nor \$s1,\$s2,\$s3  | \$s1 = ~(\$s2 & \$s3)                   | Three reg. operands; bit-by-bit NOR   |
|                    | and immediate                    | andi \$s1,\$s2,20   | \$s1 = \$s2 & 20                        | Bit-by-bit AND reg with constant      |
|                    | or immediate                     | ori \$s1,\$s2,20    | \$s1 = \$s2   20                        | Bit-by-bit OR reg with constant       |
|                    | shift left logical               | sll \$s1,\$s2,10    | \$s1 = \$s2 << 10                       | Shift left by constant                |
| Logical            | shift right logical              | srl \$s1,\$s2,10    | \$s1 = \$s2 >> 10                       | Shift right by constant               |
|                    | branch on equal                  | beq \$s1,\$s2,25    | if(\$s1 == \$s2) go to PC + 4 + 100     | Equal test; PC-relative branch        |
|                    | branch on not equal              | bne \$s1,\$s2,25    | if(\$s1 != \$s2) go to PC + 4 + 100     | Not equal test; PC-relative           |
|                    | set on less than                 | slt \$s1,\$s2,\$s3  | if(\$s2 < \$s3) \$s1 = 1; else \$s1 = 0 | Compare less than; for beq, bne       |
|                    | set on less than unsigned        | sltu \$s1,\$s2,\$s3 | if(\$s2 < \$s3) \$s1 = 1; else \$s1 = 0 | Compare less than unsigned            |
|                    | set less than immediate          | slti \$s1,\$s2,20   | if(\$s2 < 20) \$s1 = 1; else \$s1 = 0   | Compare less than constant            |
|                    | set less than immediate unsigned | sltiu \$s1,\$s2,20  | if(\$s2 < 20) \$s1 = 1; else \$s1 = 0   | Compare less than constant unsigned   |
| Conditional branch | set on less than                 | slt \$s1,\$s2,\$s3  | if(\$s2 < \$s3) \$s1 = 1; else \$s1 = 0 | Compare less than; for beq, bne       |
|                    | set on less than immediate       | slti \$s1,\$s2,20   | if(\$s2 < 20) \$s1 = 1; else \$s1 = 0   | Compare less than constant            |
|                    | set less than immediate          | sltiu \$s1,\$s2,20  | if(\$s2 < 20) \$s1 = 1; else \$s1 = 0   | Compare less than constant unsigned   |
|                    | set less than immediate unsigned | sltiu \$s1,\$s2,20  | if(\$s2 < 20) \$s1 = 1; else \$s1 = 0   | Compare less than constant unsigned   |

# Introduction to Computer Organization \_ MIPS Program Execution

## MIPS Instruction Executions

- Data operations**
  - Arithmetic (add, subtract, ...)
  - Logical (and, or, not, xor, ...)
- Data transfer**
  - Load (memory  $\rightarrow$  register)
  - Store (register  $\rightarrow$  memory)
- Sequencing**
  - Branch (conditional, e.g.,  $<$ ,  $>$ ,  $==$ )
  - Jump (unconditional, e.g., goto)

| Function                | Instruction      | Effect                             |
|-------------------------|------------------|------------------------------------|
| add                     | add R1, R2, R3   | R1 = R2 + R3                       |
| sub                     | sub R1, R2, R3   | R1 = R2 - R3                       |
| add immediate           | addi R1, R2, 145 | R1 = R2 + 145                      |
| multiply                | mult R1, R2, R3  | remainder                          |
| divide                  | div R1, R2, R3   | R1 = R2 / R3                       |
| and                     | and R1, R2, R3   | R1 = R2 & R3                       |
| or                      | or R1, R2, R3    | R1 = R2   R3                       |
| and immediate           | andi R1, R2, 145 | R1 = R2 & 145                      |
| or immediate            | ori R1, R2, 145  | R1 = R2   145                      |
| shift left logical      | sll R1, R2, 7    | R1 = R2 << 7                       |
| shift right logical     | srl R1, R2, 7    | R1 = R2 >> 7                       |
| load word               | lw R1, 145(R2)   | R1 = [R2 + 145]                    |
| store word              | sw R1, 145(R2)   | [R2 + 145] = R1                    |
| load upper immediate    | lui R1, 145      | R1 = 145                           |
| branch on equal         | beq R1, R2, 145  | If (R1 == R2) go to PC + 4 + 145*4 |
| branch on not equal     | bne R1, R2, 145  | If (R1 != R2) go to PC + 4 + 145*4 |
| set on less than        | slt R1, R2, R3   | If (R2 < R3) R1 = 1, else R1 = 0   |
| set less than immediate | slti R1, R2, 1   | R1 = 1, else R1 = 0                |
| jump                    | jr R1, 145       | R31 = PC + 4; go to 145            |
| jump register           | j R1             | R31 = PC + 4; go to R1             |
| jump and link           | jal 145          | R31 = PC + 4; go to 145            |

(Complete table is printed in the book for reference.)

1. Data operations:  
add/sub

2. Data transfers:  
load word/store word

3. Sequencing:  
branch equal/jump

## 3. Sequencing

- ALU compares registers
- Result tells the Control whether to branch
- If the branch is taken, then the Control adds a constant from the instruction to the Program Counter
- The Control always adds 4 to the Program Counter



## 1. Data Operations

- Program Counter holds the instruction address.
- Instructions are fetched from memory into the Instruction Register.
- Control logic decodes the instruction and tells the ALU and Register File what to do.
- ALU executes the instruction and results flow back to the Register File.
- The Control logic updates the Program Counter for the next instruction.



## 2. Data Transfer

- ALU generates address
- Address goes to the Memory Address Register
- Results to/from memory are stored in the Memory Data Register
- Data from memory can now be stored back into the Register File or to memory can be written.



# Introduction to Computer Organization \_ MIPS Branch and Jump

## Execution Review

1. Program Counter holds the instruction address.
2. Instructions are **fetched** from memory into the Instruction Register.
3. Control logic **decodes** the instruction and tells the ALU and Register File what to do.
4. ALU **executes** the instruction and results flow back to the Register File.
5. The Control logic **updates** the Program Counter for the next instruction.
6. The Memory Address register and Data Register are used to load and store to/from Memory.



## Constants(immediate)

- Small **constants (immediates)** are used all over code (~50%)

```
if (a==b) c=1;
else c=2;
```

- How can we support this in the processor?

- Put the “typical constants” in memory and load them (slow)
- Create hard-wired registers (like R0) for them (how many?)

Need lots of bits to choose among lots of constant registers.

- MIPS does something in between:

- Some instructions can have **constants inside the instruction**
- The control logic then sends the constants to the ALU
- **addi R29, R29, 4** ← value 4 is inside the instruction

Store the constant data in the instruction, not the register file.

- But there's a problem:

- Instructions have only 32 bits. Need some for opcode and registers.



- How do we tradeoff space for constants and instructions?

## Branch and Jump

### Branch instructions

- bne/beq I-format      **16 bit immediate**
- j J-format              **26 bit immediate**

### But addresses are 32 bits! How do we handle this?

- Treat bne/beq as **relative offsets (add to current PC)**
- Treat j as an **absolute value (replace 26 bits of the PC)**



## Loading immediate

### Loading immediate values (constants)

14

Control tells the ALU to take one operand from the Register File and the other from the Instruction.



# Introduction to Computer Organization \_ MIPS ISA (Instruction Set Architecture)

| MIPS instructions       | Name  | Format | Pseudo MIPS                  | Name  | Format |
|-------------------------|-------|--------|------------------------------|-------|--------|
| add                     | add   | R      | move                         | move  | R      |
| subtract                | sub   | R      | multiply                     | mult  | R      |
| add immediate           | addi  | I      | multiply immediate           | multi | I      |
| load word               | lw    | I      | load immediate               | li    | I      |
| store word              | sw    | I      | branch less than             | blt   | I      |
| load half               | lh    | I      | branch less than or equal    | ble   | I      |
| load half unsigned      | lhu   | I      | branch greater than          | bgt   | I      |
| store half              | sh    | I      | branch greater than or equal | bge   | I      |
| load byte               | lb    | I      |                              |       |        |
| load byte unsigned      | lbu   | I      |                              |       |        |
| store byte              | sb    | I      |                              |       |        |
| load linked             | ll    | I      |                              |       |        |
| store conditional       | sc    | I      |                              |       |        |
| load upper immediate    | lui   | I      |                              |       |        |
| and                     | and   | R      |                              |       |        |
| or                      | or    | R      |                              |       |        |
| nor                     | nor   | R      |                              |       |        |
| and immediate           | andi  | I      |                              |       |        |
| or immediate            | ori   | I      |                              |       |        |
| shift left logical      | sll   | R      |                              |       |        |
| shift right logical     | srl   | R      |                              |       |        |
| branch on equal         | beq   | I      |                              |       |        |
| branch on not equal     | bne   | I      |                              |       |        |
| set less than           | slt   | R      |                              |       |        |
| set less than immediate | slti  | I      |                              |       |        |
| set less than immediate | sltiu | I      |                              |       |        |
| unsigned                |       |        |                              |       |        |
| jump                    | j     | J      |                              |       |        |
| jump register           | jr    | R      |                              |       |        |
| jump and link           | jal   | J      |                              |       |        |

| MIPS Instruction Format |            |                     |        |                        |        |                 |                               |
|-------------------------|------------|---------------------|--------|------------------------|--------|-----------------|-------------------------------|
| Name                    | Bit Fields |                     |        |                        |        |                 | Notes                         |
|                         | 6 bits     | 5 bits              | 5 bits | 5 bits                 | 6 bits | (32 bits total) |                               |
| R-Format                | op         | rs                  | rt     | rd                     | shmt   | funct           | Arithmetic, logic             |
| I-format                | op         | rs                  | rt     | address/immediate (16) |        |                 | Load/store, branch, immediate |
| J-format                | op         | target address (26) |        |                        |        |                 | Jump                          |

| Logic Operation    |             |                |                   |
|--------------------|-------------|----------------|-------------------|
| Logical operations | C operators | Java operators | MIPS instructions |
| Shift left         | <<          | <<             | sll               |
| Shift right        | >>          | >>>            | srl               |
| Bit-by-bit AND     | &           | &              | and, andi         |
| Bit-by-bit OR      |             |                | or, ori           |
| Bit-by-bit NOT     | ~           | ~              | nor               |

**FIGURE 2.8 C and Java logical operators and their corresponding MIPS instructions.** MIPS implements NOT using a NOR with one operand being zero.

**Basic Operations**

- NOT
- AND
- OR
- NOR
- NAND
- XOR
- XNOR

Basic gates we will use in this class.

| AND |   |     | OR |   |     | XOR |   |     |
|-----|---|-----|----|---|-----|-----|---|-----|
| A   | B | Out | A  | B | Out | A   | B | Out |
| 0   | 0 | 0   | 0  | 0 | 0   | 0   | 0 | 0   |
| 0   | 1 | 0   | 0  | 1 | 1   | 0   | 1 | 1   |
| 1   | 0 | 0   | 1  | 0 | 1   | 1   | 0 | 1   |
| 1   | 1 | 1   | 1  | 1 | 1   | 1   | 1 | 0   |

  

| NAND |   |     | NOR |   |     | XNOR |   |     |
|------|---|-----|-----|---|-----|------|---|-----|
| A    | B | Out | A   | B | Out | A    | B | Out |
| 0    | 0 | 1   | 0   | 0 | 1   | 0    | 0 | 1   |
| 0    | 1 | 0   | 0   | 1 | 0   | 0    | 1 | 0   |
| 1    | 0 | 1   | 1   | 0 | 0   | 1    | 0 | 0   |
| 1    | 1 | 0   | 1   | 1 | 0   | 1    | 1 | 1   |

# Introduction to Computer Organization \_ MIPS Assembly and Registers

## MIPS Assembly Category

| Category           | Instruction                      | Example             | Meaning                                 | Comments                              |
|--------------------|----------------------------------|---------------------|-----------------------------------------|---------------------------------------|
| Arithmetic         | add                              | add \$s1,\$s2,\$s3  | \$s1 = \$s2 + \$s3                      | Three register operands               |
|                    | subtract                         | sub \$s1,\$s2,\$s3  | \$s1 = \$s2 - \$s3                      | Three register operands               |
|                    | add immediate                    | addi \$s1,\$s2,20   | \$s1 = \$s2 + 20                        | Used to add constants                 |
| Data transfer      | load word                        | lw \$s1,20(\$s2)    | \$s1 = Memory[\$s2 + 20]                | Word from memory to register          |
|                    | store word                       | sw \$s1,20(\$s2)    | Memory[\$s2 + 20] = \$s1                | Word from register to memory          |
|                    | load half                        | lh \$s1,20(\$s2)    | \$s1 = Memory[\$s2 + 20]                | Halfword memory to register           |
|                    | load half unsigned               | lhu \$s1,20(\$s2)   | \$s1 = Memory[\$s2 + 20]                | Halfword memory to register           |
|                    | store half                       | sh \$s1,20(\$s2)    | Memory[\$s2 + 20] = \$s1                | Halfword register to memory           |
|                    | load byte                        | lb \$s1,20(\$s2)    | \$s1 = Memory[\$s2 + 20]                | Byte from memory to register          |
|                    | load byte unsigned               | lbu \$s1,20(\$s2)   | \$s1 = Memory[\$s2 + 20]                | Byte from memory to register          |
|                    | store byte                       | sb \$s1,20(\$s2)    | Memory[\$s2 + 20] = \$s1                | Byte from register to memory          |
|                    | load linked word                 | ll \$s1,20(\$s2)    | \$s1 = Memory[\$s2 + 20]                | Load word as 1st half of atomic swap  |
|                    | store condition. word            | sc \$s1,20(\$s2)    | Memory[\$s2+20]= \$s1; \$s1=0 or 1      | Store word as 2nd half of atomic swap |
| Logical            | load upper immmed.               | lui \$s1,20         | \$s1 = 20 * 2 <sup>16</sup>             | Loads constant in upper 16 bits       |
|                    | and                              | and \$s1,\$s2,\$s3  | \$s1 = \$s2 & \$s3                      | Three reg. operands; bit-by-bit AND   |
|                    | or                               | or \$s1,\$s2,\$s3   | \$s1 = \$s2   \$s3                      | Three reg. operands; bit-by-bit OR    |
|                    | nor                              | nor \$s1,\$s2,\$s3  | \$s1 = ~(\$s2   \$s3)                   | Three reg. operands; bit-by-bit NOR   |
|                    | and immediate                    | andi \$s1,\$s2,20   | \$s1 = \$s2 & 20                        | Bit-by-bit AND reg with constant      |
|                    | or immediate                     | ori \$s1,\$s2,20    | \$s1 = \$s2   20                        | Bit-by-bit OR reg with constant       |
|                    | shift left logical               | sll \$s1,\$s2,10    | \$s1 = \$s2 << 10                       | Shift left by constant                |
|                    | shift right logical              | srl \$s1,\$s2,10    | \$s1 = \$s2 >> 10                       | Shift right by constant               |
| Conditional branch | branch on equal                  | beq \$s1,\$s2,25    | if(\$s1 == \$s2) go to PC + 4 + 100     | Equal test; PC-relative branch        |
|                    | branch on not equal              | bne \$s1,\$s2,25    | if(\$s1 != \$s2) go to PC + 4 + 100     | Not equal test; PC-relative           |
|                    | set on less than                 | slt \$s1,\$s2,\$s3  | if(\$s2 < \$s3) \$s1 = 1; else \$s1 = 0 | Compare less than; for beq, bne       |
|                    | set on less than unsigned        | sltu \$s1,\$s2,\$s3 | if(\$s2 < \$s3) \$s1 = 1; else \$s1 = 0 | Compare less than unsigned            |
|                    | set less than immediate          | slti \$s1,\$s2,20   | if(\$s2 < 20) \$s1 = 1; else \$s1 = 0   | Compare less than constant            |
|                    | set less than immediate unsigned | sltiu \$s1,\$s2,20  | if(\$s2 < 20) \$s1 = 1; else \$s1 = 0   | Compare less than constant unsigned   |

## MIPS Registers and Usage Convention



| Name                         | Example                                                                          | Comments                                                                                                                                                                                      |
|------------------------------|----------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 32 registers                 | \$s0-\$s7, \$t0-\$t9, \$zero, \$a0-\$a3, \$v0-\$v1, \$gp, \$fp, \$sp, \$ra, \$at | Fast locations for data. In MIPS, data must be in registers to perform arithmetic. Register \$zero always equals 0, and register \$at is reserved by the assembler to handle large constants. |
| 2 <sup>30</sup> memory words | Memory[0], Memory[4], ..., Memory[4294967292]                                    | Accessed only by data transfer instructions. MIPS uses byte addresses, so sequential word addresses differ by 4. Memory holds data structures, arrays, and spilled registers.                 |

# Introduction to Computer Organization \_ MIPS Procedure Calls 1

## Procedure Calls

- Procedures need to:
  - Know where to find their **arguments** and put their **results**
  - Save and restore** registers to avoid overwriting the **caller's** registers
  - Return to the right place when done
- To accomplish this we:
  - Have **conventions** for who saves registers
  - Save them in memory on the **stack**
  - Use **jal** and **jr \$ra** to enter and exit procedures
- As long as everyone follows the convention we get interoperability

## Terminology



- The **Caller** calls the procedure
- The **procedure** is the **Callee**
- The **Caller** gives the **Callee Arguments** (data)
- The **Callee** returns **Results** (data) to the **Caller**

## Saving and Restoring Registers



## How to do Procedure Call

- Transfer** control to the **callee** to start the procedure:

**jal ProcedureAddress**; jump-and-link to the procedure

- Keeps track of the instruction *after* the **jal** so we can continue in the right place when we are done with the procedure.
- Stores the return address (PC+4) in \$ra (R31)**

- Return** control to the **caller** when the procedure is done:

**jr \$ra**; jump-return to the address in \$ra

- Jumps back to the address stored in \$ra (R31)
- This is why you need to store the return address so you know where to go back to!

# Introduction to Computer Organization \_ MIPS Procedure Calls 1

## 6 Steps of Procedure Call

### Six Steps in Execution of a Procedure

1. Main routine (caller) places parameters in a place where the procedure (callee) can access them
  - \$a0 - \$a3: four argument registers
2. Caller transfers control to the callee (`jal Dest`)
3. Callee acquires the storage resources needed
4. Callee performs the desired task
5. Callee places the result value in a place where the caller can access it
  - \$v0 - \$v1: two value registers for result values
6. Callee returns control to the caller (`jr $ra`)
  - \$ra: one return address register to return to the point of origin

## Register Usage of Procedure Call

- \$a0 - \$a3: arguments (reg's 4 - 7)
- \$v0, \$v1: result values (reg's 2 and 3)
- \$t0 - \$t9: temporaries
  - Can be overwritten by callee
- \$s0 - \$s7: saved
  - Must be saved/restored by callee !
- \$gp: global pointer for static data (reg 28)
- \$sp: stack pointer (reg 29)
- \$fp: frame pointer (reg 30)
- \$ra: return address (reg 31)
- \$gp, \$sp, \$fp, \$ra must be saved/restored by callee !

## Using Registers to Implement

- Caller
  - Save caller-saved registers \$t0-\$t3, \$s0-\$s7
  - Load arguments in \$a0-\$a3, rest on stack above \$fp
  - Execute `jal` instruction
- Callee Setup
  - Allocate memory in frame ( $$sp = ssp - \text{frame}$ )
  - Save callee-saved registers \$s0-\$s7, \$fp, \$ra
  - Create frame ( $$fp = ssp + \text{frame size} - 4$ )
- Callee Return
  - Place return value in \$v0 and \$v1
  - Restore any callee-saved registers (\$fp, \$ra, \$s0-\$s7...)
  - Pop stack ( $ssp = ssp + \text{frame size}$ )
  - Return by `jr $ra`

## Instructions of Procedure Call

- Procedure call: jump and link
  - `jal ProcedureLabel`
    - Address of following instruction put in \$ra
    - Jumps to target address
- Procedure return: jump register
  - `jr $ra`
    - Copies \$ra to program counter
- Can also be used for computed jumps
  - e.g., for case/switch statements

# Introduction to Computer Organization \_ MIPS Organization

## Processor and Memory



## Instruction Classes Distribution

| Instruction class  | MIPS examples                     | HLL correspondence                                   | Frequency |         |
|--------------------|-----------------------------------|------------------------------------------------------|-----------|---------|
|                    |                                   |                                                      | Integer   | Ft. pt. |
| Arithmetic         | add, sub, addi                    | Operations in assignment statement s                 | 16%       | 48%     |
| Data transfer      | lw, sw, lb, lbu, lh, lhu, sb, lui | References to data structures, such as arrays        | 35%       | 36%     |
| Logical            | and, or, nor, andi, ori, sll, srl | Operations in assignment statement s                 | 12%       | 4%      |
| Conditional branch | beq, bne, slt, slti, sltiu        | If statements and loops                              | 34%       | 8%      |
| Jump               | jr, jal                           | Procedure calls, returns, and case/switch statements | 2%        | 0%      |

## Addressing

### 1. Immediate addressing

op rs rt Immediate

### 2. Register addressing

op rs rt rd ... funct

Registers  
Register

### 3. Base addressing

op rs rt Address

Register



### 4. PC-relative addressing

op rs rt Address

PC



### 5. Pseudodirect addressing

op Address

PC



## MIPS Fields

MIPS fields are given names to make them easier to discuss:

|        |        |        |        |        |        |
|--------|--------|--------|--------|--------|--------|
| op     | rs     | rt     | rd     | shamt  | funct  |
| 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits |

Here is the meaning of each name of the fields in MIPS instructions:

- **op:** Basic operation of the instruction, traditionally called the **opcode**.
- **rs:** The first register source operand.
- **rt:** The second register source operand.
- **rd:** The register destination operand. It gets the result of the operation.
- **shamt:** Shift amount. (Section 2.6 explains shift instructions and this term; it will not be used until then, and hence the field contains zero in this section.)
- **funct:** Function. This field, often called the *function code*, selects the specific variant of the operation in the op field.

# Introduction to Computer Organization \_ MIPS Logic Operations

## Logic Operation

| Logical operations | C operators | Java operators | MIPS instructions |
|--------------------|-------------|----------------|-------------------|
| Shift left         | <<          | <<             | sll               |
| Shift right        | >>          | >>>            | srl               |
| Bit-by-bit AND     | &           | &              | and, andi         |
| Bit-by-bit OR      |             |                | or, ori           |
| Bit-by-bit NOT     | ~           | ~              | nor               |

### Basic Operations

|        |                        |  |
|--------|------------------------|--|
| - NOT  | $\text{!A}$            |  |
| - AND  | $A \cdot B$            |  |
| - OR   | $A + B$                |  |
| - NOR  | $\text{!}(A + B)$      |  |
| - NAND | $\text{!}(A \cdot B)$  |  |
| - XOR  | $A \oplus B$           |  |
| - XNOR | $\text{!}(A \oplus B)$ |  |

Basic gates we will use in this class.

| AND |   |     | OR |   |     | XOR |   |     |
|-----|---|-----|----|---|-----|-----|---|-----|
| A   | B | Out | A  | B | Out | A   | B | Out |
| 0   | 0 | 0   | 0  | 0 | 0   | 0   | 0 | 0   |
| 0   | 1 | 0   | 0  | 1 | 1   | 0   | 1 | 1   |
| 1   | 0 | 0   | 1  | 0 | 1   | 1   | 0 | 1   |
| 1   | 1 | 1   | 1  | 1 | 1   | 1   | 1 | 0   |

  

| NAND |   |     | NOR |   |     | XNOR |   |     |
|------|---|-----|-----|---|-----|------|---|-----|
| A    | B | Out | A   | B | Out | A    | B | Out |
| 0    | 0 | 1   | 0   | 0 | 1   | 0    | 0 | 1   |
| 0    | 1 | 0   | 0   | 1 | 0   | 0    | 1 | 0   |
| 1    | 0 | 1   | 1   | 0 | 0   | 1    | 0 | 0   |
| 1    | 1 | 0   | 1   | 1 | 0   | 1    | 1 | 1   |

## NOT Operation

- Useful to invert bits in a word
  - Change 0 to 1, and 1 to 0
- MIPS has NOR 3-operand instruction
  - $a \text{ NOR } b = \text{NOT} (a \text{ OR } b)$

nor \$t0, \$t1, \$zero

Register \$t0 always treated as zero

\$t1 0000 0000 0000 0000 0011 1100 0000 0000

\$t0 1111 1111 1111 1111 1100 0011 1111 1111

## Shift Operation

| op     | rs     | rt     | rd     | shamt  | funct  |
|--------|--------|--------|--------|--------|--------|
| 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits |

- shamt: how many positions to shift
- Shift left logical
  - Shift left and fill with 0 bits
  - $sll$  by  $i$  bits multiplies by  $2^i$
- Shift right logical
  - Shift right and fill with 0 bits
  - $srl$  by  $i$  bits divides by  $2^i$  (unsigned only)

## AND and OR Operations

- Useful to mask bits in a word
  - Clear some bits to 0, leave others unchanged
  - or \$t0, \$t1, \$t2 # \$t1 is the "AND mask"

\$t2 0000 0000 0000 0000 ... 01 1100 0000

\$t1 0000 0000 0000 0000 ... 00 0000 0000

\$t0 0000 0000 0000 0000 ... 00 0000 0000

- Useful to include bits in a word

$\text{or } \$t0, \$t1, \$t2 \# \$t1 \text{ is the "OR mask"}$

\$t2 0000 0000 0000 0000 ... 01 1100 0000

\$t1 0000 0000 0000 0000 ... 00 1100 0000

\$t0 0000 0000 0000 0000 ... 00 01 1100 0000

# Introduction to Computer Organization \_ Multiplication Algorithm and Hardware



**FIGURE 3.4** The first multiplication algorithm, using the hardware shown in Figure 3.3. If the least significant bit of the multiplier is 1, add the multiplicand to the product. If not, go to the next step. Shift the multiplicand left and the multiplier right in the next two steps. These three steps are repeated 32 times.



**FIGURE 3.3** First version of the multiplication hardware. The Multiplicand register, ALU, and Product register are all 64 bits wide, with only the Multiplier register containing 32 bits. (Appendix B describes ALUs.) The 32-bit multiplicand starts in the right half of the Multiplicand register and is shifted left 1 bit on each step. The multiplier is shifted in the opposite direction at each step. The algorithm starts with the product initialized to 0. Control decides when to shift the Multiplicand and Multiplier registers and when to write new values into the Product register.



**FIGURE 3.5** Refined version of the multiplication hardware. Compare with the first version in Figure 3.3. The Multiplicand register, ALU, and Multiplier register are all 32 bits wide, with only the Product register left at 64 bits. Now the product is shifted right. The separate Multiplier register also disappeared. The multiplier is placed instead in the right half of the Product register. These changes are highlighted in color. (The Product register should really be 65 bits to hold the carry out of the adder, but it's shown here as 64 bits to highlight the evolution from Figure 3.3.)

# Introduction to Computer Organization \_ Division Algorithm and Hardware



**FIGURE 3.9 A division algorithm, using the hardware in Figure 3.8.** If the remainder is positive, the divisor did go into the dividend, so step 2a generates a 1 in the quotient. A negative remainder after step 1 means that the divisor did not go into the dividend, so step 2b generates a 0 in the quotient and adds the divisor to the remainder, thereby reversing the subtraction of step 1. The final shift, in step 3, aligns the divisor properly, relative to the dividend for the next iteration. These steps are repeated 33 times.



**FIGURE 3.8 First version of the division hardware.** The Divisor register, ALU, and Remainder register are all 64 bits wide, with only the Quotient register being 32 bits. The 32-bit divisor starts in the left half of the Divisor register and is shifted right 1 bit each iteration. The remainder is initialized with the dividend. Control decides when to shift the Divisor and Quotient registers and when to write the new value into the Remainder register.



**FIGURE 3.11 An improved version of the division hardware.** The Divisor register, ALU, and Quotient register are all 32 bits wide, with only the Remainder register left at 64 bits. Compared to Figure 3.8, the ALU and Divisor registers are halved and the remainder is shifted left. This version also combines the Quotient register with the right half of the Remainder register. (As in Figure 3.5, the Remainder register should really be 65 bits to make sure the carry out of the adder is not lost.)

# Introduction to Computer Organization \_ Floating Point

## Floating-Point Representation - single precision

|       |          |    |    |    |    |    |    |    |         |    |    |    |    |    |    |    |    |    |    |    |    |   |   |   |   |   |   |   |   |   |   |
|-------|----------|----|----|----|----|----|----|----|---------|----|----|----|----|----|----|----|----|----|----|----|----|---|---|---|---|---|---|---|---|---|---|
| 31    | 30       | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22      | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
| s     | exponent |    |    |    |    |    |    |    |         |    |    |    |    |    |    |    |    |    |    |    |    |   |   |   |   |   |   |   |   |   |   |
| 1 bit | 8 bits   |    |    |    |    |    |    |    | 23 bits |    |    |    |    |    |    |    |    |    |    |    |    |   |   |   |   |   |   |   |   |   |   |

In general, floating-point numbers are of the form

$$(-1)^s \times F \times 2^E$$

F involves the value in the fraction field and E involves the value in the exponent field; the exact relationship to these fields will be spelled out soon.

## Floating-Point Representation - double precision

|                      |          |    |    |    |    |    |    |    |    |    |    |         |    |    |    |    |    |    |    |    |    |   |   |   |   |   |   |   |   |   |   |
|----------------------|----------|----|----|----|----|----|----|----|----|----|----|---------|----|----|----|----|----|----|----|----|----|---|---|---|---|---|---|---|---|---|---|
| 31                   | 30       | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19      | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
| s                    | exponent |    |    |    |    |    |    |    |    |    |    |         |    |    |    |    |    |    |    |    |    |   |   |   |   |   |   |   |   |   |   |
| 1 bit                | 11 bits  |    |    |    |    |    |    |    |    |    |    | 20 bits |    |    |    |    |    |    |    |    |    |   |   |   |   |   |   |   |   |   |   |
| fraction (continued) |          |    |    |    |    |    |    |    |    |    |    |         |    |    |    |    |    |    |    |    |    |   |   |   |   |   |   |   |   |   |   |

MIPS double precision allows numbers almost as small as  $2.0_{\text{ten}} \times 10^{-308}$  and almost as large as  $2.0_{\text{ten}} \times 10^{308}$

## IEEE Floating-Point Format



## MIPS Floating-Point

| MIPS floating-point operands |                                                     |                                                                                                                                                                                                                               |
|------------------------------|-----------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Name                         | Example                                             | Comments                                                                                                                                                                                                                      |
| 32 floating-point registers  | \$F0, \$F1, \$F2, ..., \$F11                        | MIPS floating-point registers are used in pairs for double precision numbers.                                                                                                                                                 |
| 2 <sup>32</sup> memory words | Memory(\$0), Memory(\$1), ..., Memory(\$4294967292) | Accessed only by data transfer instructions. MIPS uses byte addresses, so sequential word addresses differ by 4. Memory holds data structures, such as arrays, and spilled registers, such as those saved on procedure calls. |

## MIPS floating-point assembly language

| Category           | Instruction        | Example              | Meaning                             | Comments                               |
|--------------------|--------------------|----------------------|-------------------------------------|----------------------------------------|
| Arithmetic         | FP add single      | add.s \$F2,\$F3,\$F6 | \$F2 - FP add + \$F6                | FP add (single precision)              |
|                    | FP subtract single | sub.s \$F2,\$F4,\$F6 | \$F2 - FP sub - \$F4 + \$F6         | FP sub (single precision)              |
|                    | FP multiply single | mul.s \$F2,\$F3,\$F6 | \$F2 - FP multiply * \$F6           | FP multiply (single precision)         |
|                    | FP divide single   | div.s \$F2,\$F3,\$F6 | \$F2 - FP divide / \$F6             | FP divide (single precision)           |
|                    | FP add double      | add.d \$F2,\$F3,\$F6 | \$F2 - FP add + \$F6                | FP add (double precision)              |
|                    | FP subtract double | sub.d \$F2,\$F4,\$F6 | \$F2 - FP sub - \$F4 + \$F6         | FP sub (double precision)              |
|                    | FP multiply double | mul.d \$F2,\$F3,\$F6 | \$F2 - FP multiply * \$F6           | FP multiply (double precision)         |
|                    | FP divide double   | div.d \$F2,\$F3,\$F6 | \$F2 - FP divide / \$F6             | FP divide (double precision)           |
| Data Transfer      | load word coop. t  | lwcl \$F1,100(\$s2)  | \$F1 - Memory(\$s2 + 100)           | 32-bit data to FP register             |
|                    | store word coop. t | swcl \$F1,100(\$s2)  | Memory(\$s2 + 100) = \$F1           | 32-bit data to memory                  |
| Conditional Branch | branch on FP true  | bclt \$F             | If (\$cond == 1) go to PC + 4 + 100 | PC-relative branch if FP cond          |
|                    | branch on FP false | bclf \$F             | If (\$cond == 0) go to PC + 4 + 100 | PC-relative branch if not cond         |
|                    | FP compare single  | c.lt.s \$F2,\$F4     | #(\$F2 < \$F4)                      | FP compare less than: single precision |
|                    | FP compare double  | c.lt.d \$F2,\$F4     | #(\$F2 < \$F4)                      | FP compare less than: double precision |

## MIPS floating-point machine language

| Name       | Format | Example | Comments |                                                    |
|------------|--------|---------|----------|----------------------------------------------------|
| add.s      | R      | 17      | 16       | 6 4 2 0 add.s \$F2,\$F4,\$F6                       |
| sub.s      | R      | 17      | 16       | 6 4 2 1 sub.s \$F2,\$F4,\$F6                       |
| mul.s      | R      | 17      | 16       | 6 4 2 3 mul.s \$F2,\$F4,\$F6                       |
| div.s      | R      | 17      | 16       | 6 4 2 3 div.s \$F2,\$F4,\$F6                       |
| add.d      | R      | 17      | 17       | 6 4 2 0 add.d \$F2,\$F4,\$F6                       |
| sub.d      | R      | 17      | 17       | 6 4 2 1 sub.d \$F2,\$F4,\$F6                       |
| mul.d      | R      | 17      | 17       | 6 4 2 2 mul.d \$F2,\$F4,\$F6                       |
| div.d      | R      | 17      | 17       | 6 4 2 3 div.d \$F2,\$F4,\$F6                       |
| lwcl       | I      | 49      | 20       | 2 100 lwcl \$F2,100(\$s4)                          |
| swcl       | I      | 57      | 20       | 2 100 swcl \$F2,100(\$s4)                          |
| bclt       | I      | 17      | B        | 1 25 bclt 25                                       |
| bclf       | I      | 17      | B        | 0 25 bclf 25                                       |
| c.lt.s     | R      | 17      | 16       | 4 2 0 60 c.lt.s \$F2,\$F4                          |
| c.lt.d     | R      | 17      | 17       | 4 2 0 60 c.lt.d \$F2,\$F4                          |
| Field size |        | 6 bits  | 5 bits   | 5 bits 5 bits 6 bits All MIPS instructions 32 bits |

FIGURE 3.17 MIPS floating-point architecture revealed thus far. See Appendix A, Section A.10, for more detail. This information is also found in column 2 of the MIPS Reference Data Card at the front of this book.

# Introduction to Computer Organization \_ Floating-Point Addition

## Floating-Point - Multiplication Algorithm



FIGURE 3.14 Floating-point addition. The normal path is to execute steps 3 and 4 once, but if rounding causes the sum to be unnormalized, we must repeat step 3.

## Floating-Point - Addition Hardware



FIGURE 3.15 Block diagram of an arithmetic unit dedicated to floating-point addition. The steps of Figure 3.14 correspond to each block, from top to bottom. First, the exponent of one operand is subtracted from the other using the small ALU to determine which is larger and by how much. This difference controls the three multiplexors from left to right, they select the larger exponent, the significand of the smaller number, and the significand of the larger number. The smaller significand is shifted right, and then the significands are added together using the big ALU. The normalization step then shifts the sum left or right and increments or decrements the exponent. Rounding then creates the final result, which may require normalizing again to produce the actual final result.

# Introduction to Computer Organization \_ Floating-Point Multiplication

## Floating-Point - Multiplication Algorithm



## Floating-Point Summary

The major difference between computer numbers and numbers in the real world is that computer numbers have limited size and hence limited precision; it's possible to calculate a number too big or too small to be represented in a word. Programmers must remember these limits and write programs accordingly.

| C type       | Java type | Data transfers | Operations                                                           |
|--------------|-----------|----------------|----------------------------------------------------------------------|
| int          | int       | lw, sw, lui    | addu, addiu, subu, mult, div, AND, ANDi, OR, ORi, NOR, slt, slti     |
| unsigned int | -         | lw, sw, lui    | addu, addiu, subu, multu, divu, AND, ANDi, OR, ORi, NOR, sltu, sltiu |
| char         | -         | lb, sb, lui    | add, addi, sub, mult, div AND, ANDi, OR, ORi, NOR, slt, slti         |
| -            | char      | lh, sh, lui    | addu, addiu, subu, multu, divu, AND, ANDi, OR, ORi, NOR, sltu, sltiu |
| float        | float     | lwcl, swcl     | add.s, sub.s, mult.s, div.s, c.eq.s, c.lt.s, c.le.s                  |
| double       | double    | ld, sd         | add.d, sub.d, mult.d, div.d, c.eq.d, c.lt.d, c.le.d                  |

FIGURE 3.16 Floating-point multiplication. The normal path is to execute steps 3 and 4 once, but if rounding causes the sum to be unnormalized, we must repeat step 3.

# Introduction to Computer Organization \_ Performance Thinking

## Performance Big View

- ❑ Measure, Report, and Summarize
- ❑ Make intelligent choices
- ❑ See through the marketing hype
- ❑ Key to understanding underlying organizational motivation

*Why is some hardware better than others for different programs?*

*What factors of system performance are hardware related?  
(e.g., Do we need a new machine, or a new operating system?)*

- ❑ Response Time (latency)
  - How long does it take for my job to run?
  - How long does it take to execute a job?
  - How long must I wait for the database query?
- ❑ Throughput
  - How many jobs can the machine run at once?
  - What is the average execution rate?
  - How much work is getting done?
- ❑ If we upgrade a machine with a new processor what do we increase?
- ❑ If we add a new machine to the lab what do we increase?

## Performance Metrics

- ❑ Purchasing perspective
  - given a collection of machines, which has the
    - best performance ?
    - least cost ?
    - best cost/performance?
- ❑ Design perspective
  - faced with design options, which has the
    - best performance improvement ?
    - least cost ?
    - best cost/performance?
- ❑ Both require
  - basis for comparison
  - metric for evaluation
- ❑ Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors



# Introduction to Computer Organization \_ Performance

## Performance

$$\text{Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$



## MIPS

(million instructions per second)

$$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^6}$$

$$\text{MIPS} = \frac{\text{Instruction count}}{\frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \times 10^6} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6}$$

# Introduction to Computer Organization \_ Processor Design \_ Logic Unit

## Logic Operation

### Basic Operations

|        |                                                    |
|--------|----------------------------------------------------|
| - NOT  | $\neg A$                                           |
| - AND  | $A \cdot B$                                        |
| - OR   | $A + B$                                            |
| - NOR  | $\neg(A + B) = \neg(A \text{ OR } B)$              |
| - NAND | $\neg(A \cdot B) = \neg(A \text{ AND } B)$         |
| - XOR  | $A \oplus B = (A \cdot \neg B) + (\neg A \cdot B)$ |
| - XNOR | $\neg(A \oplus B) = \neg(A \text{ XOR } B)$        |

Basic gates we will use in this class

| AND |   |     | OR |   |     | XOR |   |     |
|-----|---|-----|----|---|-----|-----|---|-----|
| A   | B | Out | A  | B | Out | A   | B | Out |
| 0   | 0 | 0   | 0  | 0 | 0   | 0   | 0 | 0   |
| 0   | 1 | 0   | 0  | 1 | 1   | 0   | 1 | 1   |
| 1   | 0 | 0   | 1  | 0 | 1   | 1   | 0 | 1   |
| 1   | 1 | 1   | 1  | 1 | 1   | 1   | 1 | 0   |

  

| NAND |   |     | NOR |   |     | XNOR |   |     |
|------|---|-----|-----|---|-----|------|---|-----|
| A    | B | Out | A   | B | Out | A    | B | Out |
| 0    | 0 | 1   | 0   | 0 | 1   | 0    | 0 | 1   |
| 0    | 1 | 1   | 0   | 1 | 0   | 0    | 1 | 0   |
| 1    | 0 | 1   | 1   | 0 | 0   | 1    | 0 | 0   |
| 1    | 1 | 0   | 1   | 1 | 0   | 1    | 1 | 1   |

## Logic Blocks



### Processor

### Memory

## Logic Blocks

- MUXes** select an input from many
  - The input may be a bus (multiple bits)
- DEMUXes** do the opposite: output to one output
- DECODERS** take a binary valued input and activate one output (1-hot)
  - E.g., binary in 010 will activate 1-hot output #2
- ENCODERS** take a 1-hot input and output its binary value
  - E.g., 1-hot input #3 in will output binary 011
- ADDERS** take in A and B and output the sum and the carry
  - Sum is the same number of bits as A or B
  - Carry is one additional bit in case you have an overflow



## ALU

- Must support the Arithmetic/Logic operations of the ISA
- add, addi, addiu, addu
- sub, subu
- mult, multu, div, divu
- sqrt
- and, andi, nor, or, ori, xor, xorl
- beq, bne,slt, slti, sltiu, sltiu



| Operation | Operand A | Operand B | Result indicating overflow |
|-----------|-----------|-----------|----------------------------|
| $A + B$   | $\geq 0$  | $\geq 0$  | $< 0$                      |
| $A + B$   | $< 0$     | $< 0$     | $\geq 0$                   |
| $A - B$   | $\geq 0$  | $\geq 0$  | $< 0$                      |
| $A - B$   | $< 0$     | $\geq 0$  | $\geq 0$                   |

FIGURE 3.2 Overflow conditions for addition and subtraction.

## A MIPS ALU Implementation

- Zero detect (slt, slti, sltiu, sltu, beq, bne)



- Enable overflow bit setting for signed arithmetic (add, addi, sub)

# Introduction to Computer Organization \_ Processor Design \_ Datapath

## Elements of Datapath



FIGURE 4.5 Two state elements are needed to store and access instructions, and an adder is needed to compute the next instruction address.



FIGURE 4.7 The two elements needed to implement R-format ALU operations are the register file and the ALU.



FIGURE 4.8 The two units needed to implement loads and stores, in addition to the register file and ALU of Figure 4.7, are the data memory unit and the sign extension unit.

## Design of Datapath



FIGURE 4.11 The simple datapath for the core MIPS architecture combines the elements required by different instruction classes. The components come from Figures 4.6, 4.9, and 4.10. This datapath can execute the basic instructions (load-store word, ALU operations, and branches) in a single clock cycle. Just one additional multiplexor is needed to integrate branches. The support for jumps will be added later.

Introduction to Computer Organization \_ Processor Design \_ Control



**FIGURE B.5.14** The symbol commonly used to represent an ALU, as shown in Figure B.5.12. This symbol is also used to represent an adder, so it is normally labeled either with ALU or Adder.



**FIGURE B.5.1** The 1-bit logical unit for AND and OR.



**FIGURE B.5.5** Adder hardware for the CarryOut signal



**FIGURE B.5.6** A 1-bit ALU that performs AND, OR, and addition (see Figure B.5.5).



**FIGURE B.5.12** The final 32-bit ALU. This adds a Zero detector to Figure B.5.11.

| Instruction<br>opcode | ALUOp | Instruction<br>operation | Funct field | Desired<br>ALU action | ALU control<br>input |
|-----------------------|-------|--------------------------|-------------|-----------------------|----------------------|
| LW                    | 00    | load word                | XXXXXX      | add                   | 0010                 |
| SW                    | 00    | store word               | XXXXXX      | add                   | 0010                 |
| Branch equal          | 01    | branch equal             | XXXXXX      | subtract              | 0110                 |
| R-type                | 10    | add                      | 100000      | add                   | 0010                 |
| R-type                | 10    | subtract                 | 100010      | subtract              | 0110                 |
| R-type                | 10    | AND                      | 100100      | AND                   | 0000                 |
| R-type                | 10    | OR                       | 100101      | OR                    | 0001                 |
| R-type                | 10    | set on less than         | 101010      | set on less than      | 0111                 |

**FIGURE 4.12** How the ALU control bits are set depends on the ALUOp control bits and the different function codes for the R-type instruction. See  Appendix B.

| ALUOp  |        | Funct field |    |    |    |    |    | Operation |
|--------|--------|-------------|----|----|----|----|----|-----------|
| ALUOp1 | ALUOp0 | F5          | F4 | F3 | F2 | F1 | F0 |           |
| 0      | 0      | X           | X  | X  | X  | X  | X  | 0010      |
| X      | 1      | X           | X  | X  | X  | X  | X  | 0110      |
| 1      | X      | X           | X  | 0  | 0  | 0  | 0  | 0010      |
| 1      | X      | X           | X  | 0  | 0  | 1  | 0  | 0110      |
| 1      | X      | X           | X  | 0  | 1  | 0  | 0  | 0000      |
| 1      | X      | X           | X  | 0  | 1  | 0  | 1  | 0001      |
| 1      | X      | X           | X  | 1  | 0  | 1  | 0  | 0111      |

**FIGURE 4.13 The truth table for the 4 ALU control bits (called Operation).** The inputs are the ALUOp and function code field. Only the entries for which the ALU control is asserted are shown. Some don't-care entries have been added. For example, the ALUOp does not use the encoding 11, so the truth table can contain entries 1X and X1, rather than 10 and 01. Note that when the function field is used, the first 2 bits (F5 and F4) of these instructions are always 10, so they are don't-care terms and are replaced with XX in the truth table.

# Introduction to Computer Organization \_ Processor Design \_ ALL

## Five Classic Components of a Computer



## Put it together

### Putting it All Together



## Processor Design Steps

- Analyze instruction set => datapath requirements**
  - the meaning of each instruction is given by the register transfers (ISA model  $\rightarrow$  RTL model)
  - datapath must include storage element for ISA registers
    - possibly more
  - datapath must support each register transfer
- Select set of datapath components and establish clocking methodology**
- Assemble datapath meeting the RTL requirements**
- Analyze implementation of each instruction to determine setting of control points that effect the assembly**
- Assemble the control logic**
- RTL datapath and control design are refined to track physical design and functional validation**
  - Changes made for timing and errata (a.k.a. "bug") fixes
  - Amount of work varies with capabilities of CAD tools and degree of optimization for cost/performance

## Logic Design

### Combinational Elements



### 1 bit ALU

- Using a MUX we can add the AND, OR, and adder operations into a single ALU



### 4 bit ALU



### D Latches

- Modified SR Latch
- Latches value when C is asserted



### D Flip Flops

- Uses Master/Slave D Latches



### Storage Element: Register



### Storage Element: Reg File



### Storage Element: Memory



# Introduction to Computer Organization \_ Processor Design \_ ALL

## Five Classic Components of a Computer



## Operand of the Instruction

| Instruction | RegDst | ALUSrc | MemtoReg | RegWrite | MemRead | MemWrite | Branch | ALUOp1 | ALUOp0 |
|-------------|--------|--------|----------|----------|---------|----------|--------|--------|--------|
| R-format    | 1      | 0      | 0        | 1        | 0       | 0        | 0      | 1      | 0      |
| lw          | 0      | 1      | 1        | 1        | 1       | 0        | 0      | 0      | 0      |
| sw          | X      | 1      | X        | 0        | 0       | 1        | 0      | 0      | 0      |
| beq         | X      | 0      | X        | 0        | 0       | 0        | 1      | 0      | 1      |

**FIGURE 4.18** The setting of the control lines is completely determined by the opcode fields of the instruction. The first row of the table corresponds to the R-format instructions (add, sub, AND, OR, and slt). For all these instructions, the source register fields are rs and rt, and the destination register field is rd; this defines how the signals ALUSrc and RegDst are set. Furthermore, an R-type instruction writes a register (RegWrite = 1), but neither reads nor writes data memory. When the Branch control signal is 0, the PC is unconditionally replaced with PC + 4; otherwise, the PC is replaced by the branch target if the Zero output of the ALU is also high. The ALUOp field for R-type instructions is set to 10 to indicate that the ALU control should be generated from the funct field. The second and third rows of this table give the control signal settings for lw and sw. These ALUSrc and ALUOp fields are set to perform the address calculation. The MemRead and MemWrite are set to perform the memory access. Finally, RegDst and RegWrite are set for a load to cause the result to be stored into the rt register. The branch instruction is similar to an R-format operation, since it sends the rs and rt registers to the ALU. The ALUOp field for branch is set for a subtract (ALU control = 01), which is used to test for equality. Notice that the MemtoReg field is irrelevant when the RegWrite signal is 0; since the register is not being written, the value of the data on the register data write port is not used. Thus, the entry MemtoReg in the last two rows of the table is replaced with X for don't care. Don't cares can also be added to RegDst when RegWrite is 0. This type of don't care must be added by the designer, since it depends on knowledge of how the datapath works.

## The Implementation of the MIPS subset



**FIGURE 4.24** The simple control and datapath are extended to handle the jump instruction.

| Signal name | Effect when deasserted                                                                       | Effect when asserted                                                                                    |
|-------------|----------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|
| RegDst      | The register destination number for the Write register comes from the rt field (bits 20:16). | The register destination number for the Write register comes from the rd field (bits 15:11).            |
| RegWrite    | None.                                                                                        | The register on the Write register input is written with the value on the Write data input.             |
| ALUSrc      | The second ALU operand comes from the second register file output (Read data 2).             | The second ALU operand is the sign-extended, lower 16 bits of the instruction.                          |
| PCSrc       | The PC is replaced by the output of the adder that computes the value of PC + 4.             | The PC is replaced by the output of the adder that computes the branch target.                          |
| MemRead     | None.                                                                                        | Data memory contents designated by the address input are put on the Read data output.                   |
| MemWrite    | None.                                                                                        | Data memory contents designated by the address input are replaced by the value on the Write data input. |
| MemtoReg    | The value fed to the register Write data input comes from the data memory.                   | The value fed to the register Write data input comes from the ALU.                                      |

**FIGURE 4.16** The effect of each of the seven control signals.

| Input or output | Signal name | R-format | lw | sw | beq |
|-----------------|-------------|----------|----|----|-----|
| Inputs          | Op5         | 0        | 1  | 1  | 0   |
|                 | Op4         | 0        | 0  | 0  | 0   |
|                 | Op3         | 0        | 0  | 1  | 0   |
|                 | Op2         | 0        | 0  | 0  | 1   |
|                 | Op1         | 0        | 1  | 1  | 0   |
|                 | Op0         | 0        | 1  | 1  | 0   |
| Outputs         | RegDst      | 1        | 0  | X  | X   |
|                 | ALUSrc      | 0        | 1  | 1  | 0   |
|                 | MemtoReg    | 0        | 1  | X  | X   |
|                 | RegWrite    | 1        | 1  | 0  | 0   |
|                 | MemRead     | 0        | 1  | 0  | 0   |
|                 | MemWrite    | 0        | 0  | 1  | 0   |
|                 | Branch      | 0        | 0  | 0  | 1   |
|                 | ALUOp1      | 1        | 0  | 0  | 0   |
|                 | ALUOp0      | 0        | 0  | 0  | 1   |

**FIGURE 4.22** The control function for the simple single-cycle implementation is completely specified by this truth table.

# Introduction to Computer Organization \_ Processor Design \_ Pipeline \_ Basic

## Why?

### Performance

$$\text{Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$



### MIPS (million instructions per second)

$$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time}} \times 10^6$$

$$\text{MIPS} = \frac{\text{Instruction count}}{\frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \times 10^6} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6}$$

## MIPS Pipeline 5 Stages

MIPS instructions classically take five steps:

1. Fetch instruction from memory.
2. Read registers while decoding the instruction. The regular format of MIPS instructions allows reading and decoding to occur simultaneously.
3. Execute the operation or calculate an address.
4. Access an operand in data memory.
5. Write the result into a register.

| Instruction class                 | Instruction fetch | Register read | ALU operation | Data access | Register write | Total time |
|-----------------------------------|-------------------|---------------|---------------|-------------|----------------|------------|
| Load word (lw)                    | 200 ps            | 100 ps        | 200 ps        | 200 ps      | 100 ps         | 800 ps     |
| Store word (sw)                   | 200 ps            | 100 ps        | 200 ps        | 200 ps      |                | 700 ps     |
| R-format (add, sub, AND, OR, slt) | 200 ps            | 100 ps        | 200 ps        |             | 100 ps         | 600 ps     |
| Branch (beq)                      | 200 ps            | 100 ps        | 200 ps        |             |                | 500 ps     |

**FIGURE 4.26 Total time for each instruction calculated from the time for each component.**  
This calculation assumes that the multiplexors, control unit, PC accesses, and sign extension unit have no delay.

## Pipeline Hazards

### structural hazard

When a planned instruction cannot execute in the proper clock cycle because the hardware does not support the combination of instructions that are set to execute.

### data hazard

When a planned instruction cannot execute in the proper clock cycle because data that is needed to execute the instruction is not yet available.

### control hazard

When the proper instruction cannot execute in the proper pipeline clock cycle because the instruction that was fetched is not the one that is needed; that is, the flow of instruction addresses is not what the pipeline expected.

# Introduction to Computer Organization \_ Processor Design \_ Pipeline \_ Control

## Pipeline Design



**FIGURE 4.46** The pipelined datapath of Figure 4.41 with the control signals identified. This datapath borrows the control logic for PC source, register destination number, and ALU control from Section 4.4. Note that we now need the 6-bit funct field (function code) of the instruction in the EX stage as input to ALU control, so these bits must also be included in the ID/EX pipeline register. Recall that these 6 bits are also the 6 least significant bits of the immediate field in the instruction, so the ID/EX pipeline register can supply them from the immediate field since sign extension leaves these bits unchanged.

## 32-bit ALU

| Instruction opcode | ALUOp | Instruction operation | Function code | Desired ALU action | ALU control input |
|--------------------|-------|-----------------------|---------------|--------------------|-------------------|
| LW                 | 00    | load word             | XXXXXX        | add                | 0010              |
| SW                 | 00    | store word            | XXXXXX        | add                | 0010              |
| Branch equal       | 01    | branch equal          | XXXXXX        | subtract           | 0110              |
| R-type             | 10    | add                   | 100000        | add                | 0010              |
| R-type             | 10    | subtract              | 100010        | subtract           | 0110              |
| R-type             | 10    | AND                   | 100100        | AND                | 0000              |
| R-type             | 10    | OR                    | 100101        | OR                 | 0001              |
| R-type             | 10    | set on less than      | 101010        | set on less than   | 0111              |

**FIGURE 4.47** A copy of Figure 4.12. This figure shows how the ALU control bits are set depending on the ALUOp control bits and the different function codes for the R-type instruction.

| Signal name | Effect when deasserted (0)                                                                   | Effect when asserted (1)                                                                                |
|-------------|----------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|
| RegDst      | The register destination number for the Write register comes from the rt field (bits 20:16). | The register destination number for the Write register comes from the rd field (bits 15:11).            |
| RegWrite    | None.                                                                                        | The register on the Write register input is written with the value on the Write data input.             |
| ALUSrc      | The second ALU operand comes from the second register file output (Read data 2).             | The second ALU operand is the sign-extended, lower 16 bits of the instruction.                          |
| PCSrc       | The PC is replaced by the output of the adder that computes the branch target.               | The PC is replaced by the output of the adder that computes the branch target.                          |
| MemRead     | None.                                                                                        | Data memory contents designated by the address input are put on the Read data output.                   |
| MemWrite    | None.                                                                                        | Data memory contents designated by the address input are replaced by the value on the Write data input. |
| MemtoReg    | The value fed to the register Write data input                                               | The value fed to the register Write data input comes from the data memory.                              |

**FIGURE 4.48** A copy of Figure 4.16. The function of each of seven control signals is defined. The ALU control lines (ALUOp) are defined in the second column of Figure 4.47. When a 1-bit control to a 2-way multiplexer is asserted, the multiplexer selects the input corresponding to 1. Otherwise, if the control is deasserted, the multiplexer selects the 0 input. Note that PCSrc is controlled by an AND gate in Figure 4.46. If the Branch signal and the ALU Zero signal are both set, then PCSrc is 1; otherwise, it is 0. Control sets the Branch signal only during a beq instruction; otherwise, PCSrc is set to 0.

## final three stages



**FIGURE 4.50** The control lines for the final three stages. Note that four of the nine control lines are used in the EX phase, with the remaining five control lines passed on to the EX/MEM pipeline register extended to hold the control lines; three are used during the MEM stage, and the last two are passed to MEM/WB for use in the WB stage.

## Execution/address calculation stage control lines

## Memory access stage control lines

## Write-back stage control lines

| Instruction | Execution/address calculation stage control lines |        |        |        | Memory access stage control lines |          |           | Write-back stage control lines |           |
|-------------|---------------------------------------------------|--------|--------|--------|-----------------------------------|----------|-----------|--------------------------------|-----------|
|             | RegDst                                            | ALUOp1 | ALUOp0 | ALUSrc | Branch                            | Mem-Read | Mem-Write | Reg-Write                      | Memto-Reg |
| R-format    | 1                                                 | 1      | 0      | 0      | 0                                 | 0        | 0         | 1                              | 0         |
| lw          | 0                                                 | 0      | 0      | 1      | 0                                 | 1        | 0         | 1                              | 1         |
| sw          | X                                                 | 0      | 0      | 1      | 0                                 | 0        | 1         | 0                              | X         |
| beq         | X                                                 | 0      | 1      | 0      | 1                                 | 0        | 0         | 0                              | X         |

**FIGURE 4.49** The values of the control lines are the same as in Figure 4.18, but they have been shuffled into three groups corresponding to the last three pipeline stages.

# Introduction to Computer Organization \_ Processor Design \_ Pipeline \_ Datapath

## Pipeline Design



**FIGURE 4.33** The single-cycle datapath from Section 4.4 (similar to Figure 4.17). Each step of the instruction can be mapped onto the datapath from left to right. The only exceptions are the update of the PC and the write-back step, shown in color, which sends either the ALU result or the data from memory to the left to be written into the register file. (Normally we use color lines for control, but these are data lines.)



**FIGURE 4.35** The pipelined version of the datapath in Figure 4.33. The pipeline registers, in color, separate each pipeline stage. They are labeled by the stages that they separate; for example, the first is labeled IF/ID because it separates the instruction fetch and instruction decode stages. The registers must be wide enough to store all the data corresponding to the lines that go through them. For example, the IF/ID register must be 64 bits wide, because it must hold both the 32-bit instruction fetched from memory and the incremented 32-bit PC address. We will expand these registers over the course of this chapter, but for now the other three pipeline registers contain 128, 97, and 64 bits, respectively.

## Pipeline Datapath



**FIGURE 4.41** The corrected pipelined datapath to handle the load instruction properly. The write register number now comes from the MEM/WB pipeline register along with the data. The register number is passed from the ID pipe stage until it reaches the MEM/WB pipeline register, adding five more bits to the last three pipeline registers. This new path is shown in color.



**FIGURE 4.42** The portion of the datapath in Figure 4.41 that is used in all five stages of a load instruction.

# Introduction to Computer Organization \_ Processor Design \_ Pipeline \_ Datapath Hazard

## Datapath with Control



**FIGURE 4.51** The pipelined datapath of Figure 4.46, with the control signals connected to the control portions of the pipeline registers. The control values for the last three stages are created during the instruction decode stage and then placed in the ID/EX pipeline register. The control lines for each pipe stage are used, and remaining control lines are then passed to the next pipeline stage.

## Forwarding



**FIGURE 4.56** The datapath modified to resolve hazards via forwarding. Compared with the datapath in Figure 4.51, the additions are the multiplexors to the inputs to the ALU. This figure is a more stylized drawing, however, leaving out details from the full datapath, such as the branch hardware and the sign extension hardware.

## Forwarding with Detection



**FIGURE 4.60** Pipelined control overview, showing the two multiplexors for forwarding, the hazard detection unit, and the forwarding unit. Although the ID and EX stages have been simplified—the sign-extended immediate and branch logic are missing—this drawing gives the essence of the forwarding hardware requirements.

| Mux control   | Source | Explanation                                                                    |
|---------------|--------|--------------------------------------------------------------------------------|
| ForwardA = 00 | ID/EX  | The first ALU operand comes from the register file.                            |
| ForwardA = 10 | EX/MEM | The first ALU operand is forwarded from the prior ALU result.                  |
| ForwardA = 01 | MEM/WB | The first ALU operand is forwarded from data memory or an earlier ALU result.  |
| ForwardB = 00 | ID/EX  | The second ALU operand comes from the register file.                           |
| ForwardB = 10 | EX/MEM | The second ALU operand is forwarded from the prior ALU result.                 |
| ForwardB = 01 | MEM/WB | The second ALU operand is forwarded from data memory or an earlier ALU result. |

**FIGURE 4.55** The control values for the forwarding multiplexors in Figure 4.54. The signed immediate that is another input to the ALU is described in the *Elaboration* at the end of this section.

# Introduction to Computer Organization \_ Processor Design \_ Pipeline \_ Control Hazard



# Introduction to Computer Organization \_ Processor Design \_ Pipeline \_ Exception

## Type of Event

| Type of event                                 | From where? | MIPS terminology       |
|-----------------------------------------------|-------------|------------------------|
| I/O device request                            | External    | Interrupt              |
| Invoke the operating system from user program | Internal    | Exception              |
| Arithmetic overflow                           | Internal    | Exception              |
| Using an undefined instruction                | Internal    | Exception              |
| Hardware malfunctions                         | Either      | Exception or interrupt |

## Exception type

## Exception vector address (in hex)

|                       |                          |
|-----------------------|--------------------------|
| Undefined instruction | 8000 0000 <sub>hex</sub> |
| Arithmetic overflow   | 8000 0180 <sub>hex</sub> |

## Example

### Exception in a Pipelined Computer

Given this instruction sequence,

```

40hex sub $11, $2, $4
44hex and $12, $2, $5
48hex or $13, $2, $6
4Chex add $1, $2, $1
50hex slt $15, $6, $7
54hex lw $16, 50($7)
...
    
```

## Forwarding



**FIGURE 4.67** The result of an exception due to arithmetic overflow in the add instruction. The overflow is detected during the EX stage of clock 6, saving the address following the add in the EPC register (4C + 4 = 50<sub>hex</sub>). Overflow causes all the flush signals to be set near the end of this clock cycle, deasserting control values (setting them to 0) for the add. Clock cycle 7 shows the instructions converted to bubbles in the pipeline plus the fetching of the first instruction of the exception routine—sw \$25, 1000(\$0)—from instruction location 8000 0180<sub>hex</sub>. Note that the AND and OR instructions, which are prior to the add, still complete. Although not shown, the ALU overflow signal is an input to the control unit.

# Introduction to Computer Organization \_ Processor Design \_ Enhancement

## Five Classic Components of a Computer



FIGURE 4.1 An abstract view of the implementation of the MIPS subset showing the major functional units and the major connections between them.



FIGURE 4.2 The basic implementation of the MIPS subset, including the necessary multiplexors and control lines.



FIGURE 4.11 The simple datapath for the core MIPS architecture combines the elements required by different instruction classes.

## Datapath



FIGURE 4.15 The datapath of Figure 4.11 with all necessary multiplexors and all control lines identified. The control lines are shown in color. The ALU control block has also been added. The PC does not require a write control, since it is written on the end of every clock cycle; the branch control logic determines whether it is written with the incremented PC or the branch target address.

## Datapath with Control



FIGURE 4.17 The simple datapath with the control unit. The input to the control unit is the 6-bit opcode field from the instruction. The outputs of the control unit consist of three 1-bit signals that are used to control multiplexors (RegDst, ALUSrc, and MemtoReg), three signals for controlling reads and writes in the register file and data memory (RegWrite, MemRead, and MemWrite), a 1-bit signal used in determining whether to possibly branch (Branch), and a 2-bit control signal for the ALU (ALUOp). An AND gate is used to combine the branch control signal and the Zero output from the ALU; the AND gate output controls the selection of the next PC. Notice that PCSrc is now a derived signal, rather than one coming directly from the control unit. Thus, we drop the signal name in subsequent figures.

# Introduction to Computer Organization \_ Processor Design \_ R/L/B/J

## R-type



**FIGURE 4.19** The datapath in operation for an R-type instruction, such as add \$t1,\$t2,\$t3. The control lines, datapath units, and connections that are active are highlighted.

## Load



**FIGURE 4.20** The datapath in operation for a load instruction. The control lines, datapath units, and connections that are active are highlighted. A store instruction would operate very similarly. The main difference would be that the memory control would indicate a write rather than a read, the second register value read would be used for the data to store, and the operation of writing the data memory value to the register file would not occur.

## Jump



**FIGURE 4.24** The simple control and datapath are extended to handle the jump instruction. An additional multiplexer (at the upper right) is used to choose between the jump target and either the branch target or the sequential instruction following this one. This multiplexer is controlled by the jump control signal. The jump target address is obtained by shifting the lower 26 bits of the jump instruction left 2 bits, effectively adding 00 as the low-order bits, and then concatenating the upper 4 bits of PC + 4 as the high-order bits, thus yielding a 32-bit address.

## Branch



**FIGURE 4.21** The datapath in operation for a branch-on-equal instruction. The control lines, datapath units, and connections that are active are highlighted. After using the register file and ALU to perform the compare, the Zero output is used to select the next program counter from between the two candidates.

# Introduction to Computer Organization \_ Cache Memory \_Basic Concept

## Cach Basic

### Cache Basics

- Two questions to answer (in hardware):
  - Q1: How do we know if a data item is in the cache?
- Direct mapped
  - Each memory block is mapped to exactly one block in the cache
    - lots of lower level blocks must **share** blocks in the cache
  - Address mapping (to answer Q2):  
**(block address) modulo (# of blocks in the cache)**
  - Have a **tag** associated with each cache block that contains the address information (the upper portion of the address) required to identify the block (to answer Q1)

### MIPS Direct Mapped Cache Example

- One word blocks, cache size = 1K words (or 4KB)



### Multiword Block Direct Mapped Cache

- Four words/block, cache size = 1K words



## Cache Hit and Miss

### Handling Cache Hits

- Read hits (IS and DS)
  - this is what we want!
- Write hits (DS only)
  - require the cache and memory to be **consistent**
    - always write the data into both the cache block and the next level in the memory hierarchy (**write-through**)
    - writes run at the speed of the next level in the memory hierarchy – so slow! – or can use a **write buffer** and stall only if the write buffer is full
  - allow cache and memory to be **inconsistent**
    - write the data only into the cache block (write-back the cache block to the next level in the memory hierarchy when that cache block is “evicted”)
    - need a **dirty** bit for each data cache block to tell if it needs to be written back to memory when it is evicted – can use a **write buffer** to help “buffer” write-backs of dirty blocks

### Sources of Cache Misses

- Compulsory (cold start or process migration, first reference):
  - First access to a block, “cold” fact of life, not a whole lot you can do about it. If you are going to run “millions” of instruction, compulsory misses are insignificant
  - Solution: increase block size (increases miss penalty; very large blocks could increase miss rate)
- Capacity:
  - Cache cannot contain all blocks accessed by the program
  - Solution: increase cache size (may increase access time)
- Conflict (collision):
  - Multiple memory locations mapped to the same cache location
  - Solution 1: increase cache size
  - Solution 2: increase associativity (stay tuned) (may increase access time)

### Handling Cache Misses (Single Word Blocks)

- Read misses (IS and DS)
    - stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache and send the requested word to the processor, then let the pipeline resume
  - Write misses (DS only)
    - stall the pipeline, fetch the block from next level in the memory hierarchy, install it in the cache (which may involve having to evict a dirty block if using a write-back cache), write the word from the processor to the cache, then let the pipeline resume
- or
- Write allocate** – just write the word into the cache updating both the tag and data, no need to check for cache hit, no need to stall
- or
- No-write allocate** – skip the cache write (but must invalidate that cache block since it will now hold stale data) and just write the word to the write buffer (and eventually to the next memory level). 094 Spring 2014 NEED TO STALL IF THE WRITE BUFFER ISN'T FULL

# Introduction to Computer Organization \_ Cache Memory \_ Reducing Cache Miss Rate

## Improving Cache Performance

### 0. Reduce the time to hit in the cache

- smaller cache
- direct mapped cache
- smaller blocks
- for writes
  - no write allocate – no “hit” on cache, just write to write buffer
  - write allocate – to avoid two cycles (first check for hit, then write) pipeline writes via a delayed write buffer to cache

### 1. Reduce the miss rate

- bigger cache
- more flexible placement (increase associativity)
- larger blocks (16 to 64 bytes typical)
- victim cache – small buffer holding most recently discarded blocks

### 2. Reduce the miss penalty

- smaller blocks
- use a write buffer to hold dirty blocks being replaced so don't have to wait for the write to complete before reading
- check write buffer (and/or victim cache) on read miss – may get lucky
- for large blocks fetch critical word first
- use multiple cache levels – L2 cache not tied to CPU clock rate
- faster backing store/improved memory bandwidth
  - wider buses
  - memory interleaving, DDR SDRAMs

### Summary: The Cache Design Space

#### □ Several interacting dimensions

- cache size
- block size
- associativity
- replacement policy
- write-through vs write-back
- write allocation

#### □ The optimal choice is a compromise

- depends on access characteristics
  - workload
  - use (I-cache, D-cache, TLB)
- depends on technology / cost

• Simplicity often wins



## Reducing Cache Miss Rate

### Reducing Cache Miss Rates #1

1. Allow more flexible block placement
  - In a direct mapped cache a memory block maps to exactly one cache block
  - At the other extreme, could allow a memory block to be mapped to any cache block – fully associative cache
  - A compromise is to divide the cache into sets each of which consists of n “ways” (n-way set associative). A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices)

(block address) modulo (# sets in the cache)

### Reducing Cache Miss Rates #2

#### 1. Use multiple levels of caches

- With advancing technology have more than enough room on the die for bigger L1 caches or for a second level of caches – normally a unified L2 cache (i.e., it holds both instructions and data) and in some cases even a unified L3 cache
- For our example, CPI<sub>ideal</sub> of 2, 100 cycle miss penalty (to main memory) and a 25 cycle miss penalty (to UL2S), 36% load/stores, a 2% (4%) L1 IS (DS) miss rate, add a 0.5% UL2S miss rate

$$\begin{aligned} \text{CPI}_{\text{stalls}} = & 2 + .02 \times 25 + .36 \times .04 \times 25 + .005 \times 100 + \\ & .36 \times .005 \times 100 = 3.54 \end{aligned}$$

(as compared to 5.44 with no L2\$)

# Introduction to Computer Organization \_ TLB \_ Basic Concept

## Translation Lookaside Buffers

### Translation Lookaside Buffers (TLBs)

- Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped

| V | Virtual Page # | Physical Page # | Dirty | Ref | Access |
|---|----------------|-----------------|-------|-----|--------|
|   |                |                 |       |     |        |

- TLB access time is typically smaller than cache access time (because TLBs are much smaller than caches)
  - TLBs are typically not more than 512 entries even on high-end machines



**FIGURE 5.29** The TLB acts as a cache of the page table for the entries that map to physical pages only. The TLB contains a subset of the virtual-to-physical page mappings that are in the page table. The TLB mappings are shown in color. Because the TLB is a cache, it must have a tag field. If there is no matching entry in the TLB for a page, the page table must be examined. The page table either supplies a physical page number for the page (which can then be used to build a TLB entry) or indicates that the page resides on disk, in which case a page fault occurs. Since the page table has an entry for every virtual page, no tag field is needed; in other words, unlike a TLB, a page table is not a cache.

## TLB in Memory Hierarchy

### A TLB in the Memory Hierarchy



- A TLB miss – is it a page fault or merely a TLB miss?
  - If the page is loaded into main memory, then the TLB miss can be handled (in hardware or software) by loading the translation information from the page table into the TLB
    - Takes 10's of cycles to find and load the translation info into the TLB
  - If the page is not in main memory, then it's a true page fault
    - Takes 1.000.000's of cycles to service a page fault
- TLB misses are much more frequent than true page faults.

### TLB Event Combinations

| TLB  | Page Table | Cache    | Possible? Under what circumstances?                                        |
|------|------------|----------|----------------------------------------------------------------------------|
| Hit  | Hit        | Hit      | Yes – what we want!                                                        |
| Hit  | Hit        | Miss     | Yes – although the page table is not checked if the TLB hits               |
| Miss | Hit        | Hit      | Yes – TLB miss, PA in page table                                           |
| Miss | Hit        | Miss     | Yes – TLB miss, PA in page table, but data not in cache                    |
| Miss | Miss       | Miss     | Yes – page fault                                                           |
| Hit  | Miss       | Miss/Hit | Impossible – TLB-translation-not-possible if page is not present in memory |
| Miss | Miss       | Hit      | Impossible – data-not-allowed-in-cache-if-page is not in memory            |

### Reducing Translation Time

- Can overlap the cache access with the TLB access
  - Works when the high order bits of the VA are used to access the TLB while the low order bits are used as index into cache



# Introduction to Computer Organization \_ TLB \_ SW&HW Boundary

## Boundary of S/W and H/w

- What parts of the virtual to physical address translation is done by or assisted by the hardware?
  - Translation Lookaside Buffer (TLB) that caches the recent translations
    - TLB access time is part of the cache hit time
    - May allow an extra stage in the pipeline for TLB access
  - Page table storage, fault detection and updating
    - Page faults result in interrupts (precise) that are then handled by the OS
    - Hardware must support (i.e., update appropriately) Dirty and Reference bits (e.g., ~LRU) in the Page Tables
  - Disk placement
    - Bootstrap (e.g., out of disk sector 0) so the system can service a limited number of page faults before the OS is even loaded

## Summary

### The Principle of Locality:

- Program likely to access a relatively small portion of the address space at any instant of time.
  - Temporal Locality: Locality in Time
  - Spatial Locality: Locality in Space

### Caches, TLBs, Virtual Memory all understood by examining how they deal with the four questions

- Where can entry be placed?
- How is entry found?
- What entry is replaced on miss?
- How are writes handled?

### Page tables map virtual address to physical address

TLBs are important for fast translation

## Issues

### Q1&Q2: Where can a entry be placed/found?

|                   | # of sets                     | Entries per set                   |
|-------------------|-------------------------------|-----------------------------------|
| Direct mapped     | # of entries                  | 1                                 |
| Set associative   | (# of entries)/ associativity | Associativity (typically 2 to 16) |
| Fully associative | 1                             | # of entries                      |

|                   | Location method                                           | # of comparisons        |
|-------------------|-----------------------------------------------------------|-------------------------|
| Direct mapped     | Index                                                     | 1                       |
| Set associative   | Index the set; compare set's tags                         | Degree of associativity |
| Fully associative | Compare all entries' tags<br>Separate lookup (page) table | # of entries            |

### Q3: Which entry should be replaced on a miss?

- Easy for direct mapped – only one choice
- Set associative or fully associative
  - Random
  - LRU (Least Recently Used)
- For a 2-way set associative, random replacement has a miss rate about 1.1 times higher than LRU
- LRU is too costly to implement for high levels of associativity (> 4-way) since tracking the usage information is costly

### Q4: What happens on a write?

**Write-through** – The information is written to the entry in the current memory level and to the entry in the next level of the memory hierarchy

- Always combined with a write buffer so write waits to next level in memory can be eliminated (as long as the write buffer doesn't fill)

**Write-back** – The information is written only to the entry in the current memory level. The modified entry is written to next level of memory only when it is replaced.

- Need a dirty bit to keep track of whether the entry is clean or dirty
- Virtual memory systems always use write-back of dirty pages to disk

### Pros and cons of each?

- Write-through: read misses don't result in writes (so are simpler and cheaper), easier to implement

**Write-back**: writes run at the speed of the cache; repeated writes require only one write to lower level



FIGURE 5.31 Processing a read or a write-through in the Intrinsity FastMATH TLB and cache.

| Design change           | Effect on miss rate                                                         | Possible negative performance effect                              |
|-------------------------|-----------------------------------------------------------------------------|-------------------------------------------------------------------|
| Increases cache size    | Decreases capacity misses                                                   | May increase access time                                          |
| Increases associativity | Decreases miss rate due to conflict misses                                  | May increase access time                                          |
| Increases block size    | Decreases miss rate for a wide range of block sizes due to spatial locality | Increases miss penalty. Very large block could increase miss rate |

FIGURE 5.38 Memory hierarchy design challenges.



FIGURE 5.40 Four states of the simple controller.

# Introduction to Computer Organization \_ Storage Management \_ NAS and SAN

## Storage-device Hierarchy



Figure 1.6 Storage-device hierarchy.



## A Typical Memory Hierarchy

- Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology at the speed offered by the fastest technology



## Storage Attachement \_ NAS vs SAN

- Host-attached storage accessed through I/O ports talking to I/O busses
- Attachment technologies and protocols
  - IDE
  - SCSI
  - Fiber Channel
- SCSI itself is a bus, up to 16 devices on one cable, SCSI initiator requests operation and SCSI targets perform tasks
  - Each target can have up to 8 logical units (disks attached to device controller)
- FC (fiber channel) is high-speed serial architecture
  - Can be switched fabric with 24-bit address space – the basis of storage area networks (SANs) in which many hosts attach to many storage units
  - Can be arbitrated loop (FC-AL) of 126 devices



## Network Attached Storage

- Network-attached storage (NAS) is storage made available over a network rather than over a local connection (such as a bus)
- NFS and CIFS are common protocols used for network attached storage
  - We use those protocols to access remote storage that is connected to a network.
- Implemented via remote procedure calls (RPCs) between host and storage
- New iSCSI protocol uses IP network to carry the SCSI protocol
  - SCSI – SCSI bus/cable (local/host attached)
  - iSCSI – TCP/IP network (network attached)



## Storage Area Network

- Common in large storage environments (and becoming more common)
- Multiple hosts attached to multiple storage arrays – flexible
- Uses a different communication infrastructure (SAN) than the common networking infrastructure



# Introduction to Computer Organization \_ Storage Management \_ Disk Management

## Magnetic Disk / Hard Disk

**Overview of Mass Storage Systems: Magnetic Disks**

- Magnetic disks provide bulk of secondary storage of modern computers
  - Drives rotate at 60 to 200 times per second
  - Transfer rate** is rate at which data flow between drive and computer
  - Positioning time (random-access time)** is time to move disk arm to desired cylinder (**seek time**) and time for desired sector to rotate under the disk head (**rotational latency**)
  - Head crash** results from disk head making contact with the disk surface
    - That's bad
- Disks can be removable
- Drive attached to computer via I/O bus
  - Busses vary, including EIDE, ATA, SATA, USB, Fibre Channel, SCSI
  - Host controller** in computer uses bus to talk to **disk controller** built into drive or storage array

**Magnetic Disk Characteristic**

- Disk read/write components
  - Seek time:** position the head over the proper track (3 to 13 ms avg)
    - due to locality of disk references the actual average seek time may be only 25% to 33% of the advertised number
  - Rotational latency:** wait for the desired sector to rotate under the head (1/2 of 1 rotation/RPM, converted to ms)
    - $0.5/5400\text{RPM} = 5.6 \text{ ms}$  to  $0.5/15000\text{RPM} = 2.0 \text{ ms}$
  - Transfer time:** transfer a block of bits (one or more sectors) under the head to the disk controller's cache (70 to 125 MB/s are typical disk transfer rates in 2008)
    - disk controller's cache takes advantage of spatial locality in disk accesses. Cache transfer rates are much faster (e.g. 375 MB/s)
- Controller time: the overhead the disk controller imposes in performing a disk I/O access (typically <0.2 ms)



Figure 11.1 HDD moving-head disk mechanism.

**Disk Structure**

- Disk drives are addressed as large 1-dimensional arrays of logical blocks, where the logical block is the smallest unit of transfer.
- The 1-dimensional array of logical blocks is mapped into the sectors of the disk sequentially.
  - Sector 0 is the first sector of the first track on the outermost cylinder.
  - Mapping proceeds in order through that track, then the rest of the tracks in that cylinder, and then through the rest of the cylinders from outermost to innermost.

## Disk Scheduling



Figure 11.6 FCFS disk scheduling.



Figure 11.8 C-SCAN disk scheduling.



Figure 11.7 SCAN disk scheduling.



## Selecting a Disk-Scheduling Algorithm

- SSTF is common and has a natural appeal
- SCAN and C-SCAN perform better for systems that place a heavy load on the disk.
- Performance depends on the number and types of requests.
- Requests for disk service can be influenced by the file-allocation method.
- The disk-scheduling algorithm should be written as a separate module of the operating system, allowing it to be replaced with a different algorithm if necessary.
- Either SSTF or LOOK is a reasonable choice for the default algorithm.

# Introduction to Computer Organization \_ Storage Management \_ DAID

## Mass Storage

### Dependability, Reliability, Availability

- Reliability – measured by the mean time to failure (MTTF). Service interruption is measured by mean time to repair (MTTR)
- Availability – a measure of service accomplishment  
 $\text{Availability} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}$
- To increase MTTF, either improve the quality of the components or design the system to continue operating in the presence of faulty components
  - Fault avoidance: preventing fault occurrence by construction
  - Fault tolerance: using redundancy to correct or bypass faulty components (hardware)
    - Fault detection versus fault correction
    - Permanent faults versus transient faults

### RAIDs: Disk Arrays

#### Redundant Array of Inexpensive Disks



#### Arrays of small and inexpensive disks

- Increase potential throughput by having many disk drives
  - Data is spread over multiple disk
  - Multiple accesses are made to several disks at a time

#### Reliability is lower than a single disk

#### But availability can be improved by adding redundant disks (RAID)

Lost information can be reconstructed from redundant information  
 MTTR: mean time to repair is in the order of hours  
 MTTF: mean time to failure of disks is tens of years

### Summary

- Four components of disk access time:
  - Seek Time: advertised to be 3 to 14 ms but lower in real systems
  - Rotational Latency: 5.6 ms at 5400 RPM and 2.0 ms at 15000 RPM
  - Transfer Time: 30 to 80 MB/s
  - Controller Time: typically less than .2 ms
- RAIDS can be used to improve availability
  - RAID 1 and RAID 5 – widely used in servers, one estimate is that 80% of disks in servers are RAIDS
  - RAID 0+1 (mirroring) – EMC, HP/Tandem, IBM
  - RAID 3 – Storage Concepts
  - RAID 4 – Network Appliance
- RAIDS have enough redundancy to allow continuous operation, but hot swapping (replacement while system is running) is challenging



Figure 11.15 RAID levels.

## RAID Level

### RAID: Level 0 (No Redundancy, but Striping)



- Multiple smaller disks as opposed to one big disk
  - Spreading the sector over multiple disks – **striping** – means that multiple blocks can be accessed in parallel increasing the performance (i.e. throughput)
  - this 4 disk system gives four times the throughput of a 1 disk system
  - Same cost as one big disk – assuming 4 small disks cost the same as one big disk
- No redundancy, so what if one disk fails?
  - Failure of one or more disks is more likely as the number of disks in the system increases

### RAID: Level 0+1 (Striping with Mirroring)



- Combines the best of RAID 0 and RAID 1, data is striped across four disks and mirrored to four disks
  - Four times the throughput (due to striping)
  - # redundant disks = # of data disks, so reliability doubles the cost
  - writes have to be made to both sets of disks, so writes would be only 1/2 the performance of a RAID 0 with 8 disks
- What if one disk fails?
  - If a disk fails, the system just goes to the "mirror" for the data

### RAID: Level 3 (Bit-Interleaved Parity)



- Cost of higher availability is reduced to 1/N where N is the number of disks in a protection group
  - # redundant disks = 1 × # of protection groups
  - Reads and writes must access all disks
  - writing new data to the data disk as well as computing and writing the parity disk, means reading the other disks, so that the parity disk can be updated
- Can tolerate limited (single) disk failure, since the data can be reconstructed
  - Reads require reading all the operational data disks as well as the parity disk to calculate the missing data stored on the failed disk

### RAID: Level 4 (Block-Interleaved Parity)



- Cost of higher availability still only 1/N but the parity is stored as blocks associated with sets of data blocks
  - Four times the throughput (blocks are striped)
  - # redundant disks = 1 × # of protection groups
  - Supports "small reads" and "small writes" (reads and writes that go to just one (or a few) data disk in a protection group, not to all)
    - by watching which bits change when writing new information, need only to change the corresponding bits on the parity disk (read-modify-write)
    - the parity disk must be updated on every write, so it is a bottleneck for back-to-back writes
- Tolerate limited disk failure, since the data can be reconstructed

### Small Writes



- RAID 3 writes
  - New D1 data
  - 3 reads and 2 writes involving all the disks



- RAID 4 small writes
  - New D1 data
  - 2 reads and 2 writes involving just two disks

### RAID: Level 5 (Distributed Block-Interleaved Parity)



- Cost of higher availability still only 1/N but the parity block can be located on any of the disks so there is no single bottleneck for writes
  - Still four times the throughput (block striping)
  - # redundant disks = 1 × # of protection groups
  - Supports "small reads" and "small writes" (reads and writes that go to just one (or a few) data disk in a protection group, not to all)
    - Allows multiple simultaneous writes as long as the accompanying parity blocks are not located on the same disk
- Tolerate limited disk failure, since the data can be reconstructed

# Introduction to Computer Organization \_ I/O System \_ Basic Concept

## I/O System



Figure 12.1 A typical PC bus structure.

## I/O Concept



# Introduction to Computer Organization \_ I/O System \_ Performance

## I/O Performance

### Input and Output Devices

- I/O devices are incredibly diverse with respect to
  - Behavior – input, output or storage
  - Partner – human or machine
  - Data rate – the peak rate at which data can be transferred between the I/O device and the main memory or processor

| Device           | Behavior        | Partner | Data rate (Mb/s)    |
|------------------|-----------------|---------|---------------------|
| Keyboard         | input           | human   | 0.0001              |
| Mouse            | input           | human   | 0.0038              |
| Laser printer    | output          | human   | 3.2000              |
| Magnetic disk    | storage         | machine | 800.0000-3000.0000  |
| Graphics display | output          | human   | 800.0000-8000.0000  |
| Network/LAN      | input or output | machine | 100.0000-10000.0000 |

↑ rates of transfer

### I/O Performance Measures

- I/O bandwidth (throughput) – amount of information that can be input (output) and communicated across an interconnect (e.g., a bus) to the processor/memory (I/O device) per unit time
  - How much data can we move through the system in a certain time?
  - How many I/O operations can we do per unit time?
- I/O response time (latency) – the total elapsed time to accomplish an input or output operation
  - An especially important performance metric in real-time systems
- Many applications require both high throughput and short response times

### I/O System Performance

- Designing an I/O system to meet a set of bandwidth and/or latency constraints means
  - Finding the weakest link in the I/O system – the component that constrains the design
    - The processor and memory system ?
    - The underlying interconnection (i.e., bus) ?
    - The I/O controllers ?
    - The I/O devices themselves ?
  - (Re)configuring the weakest link to meet the bandwidth and/or latency requirements
  - Determining requirements for the rest of the components and (re)configuring them to support this latency and/or bandwidth

### Disk I/O System Example



### I/O System Performance Example

- A disk workload consisting of 64KB reads and writes where the user program executes 200,000 instructions per disk I/O operation and a processor that sustains 3 billion instr/sec and averages 100,000 OS instructions to handle a disk I/O operation

The maximum disk I/O rate (# I/O/sec) of the processor is

$$\frac{\text{Instr execution rate}}{\text{Instr per I/O}} = \frac{3 \times 10^9}{(200 + 100) \times 10^3} \approx 10,000 \text{ I/O/sec}$$

a memory-I/O bus that sustains a transfer rate of 1000 MB/s

$$\text{Each disk I/O reads/writes 64 KB so the maximum I/O rate of the bus is } \frac{\text{Bus bandwidth}}{\text{Bytes per I/O}} = \frac{1000 \times 10^6}{64 \times 10^3} \approx 15.625 \text{ I/O/sec}$$

SCSI disk I/O controllers with a DMA transfer rate of 320 MB/sec that can accommodate up to 7 disks per controller

disk drives with a read/write bandwidth of 75 MB/sec and an average seek plus rotational latency of 6 ms

What is the maximum sustainable I/O rate and what is the number of disks and SCSI controllers required to achieve that rate?

### I/O System Performance Example, Con't

So the processor is the bottleneck, not the Memory-I/O bus

disk drives with a read/write bandwidth of 75 MB/sec and an average seek plus rotational latency of 6 ms

$$\text{Disk I/O read/write time} = \text{seek} + \text{rotational time} + \text{transfer time} = 6\text{ms} + 64\text{KB}/(75\text{MB/sec}) = 6.9\text{ms}$$

Thus each disk can complete 1000ms/6.9ms or 146 I/O's per second. To satisfy the processor requires 10,000 I/O's per second.

$$10,000/146 = 69 \text{ disks}$$

To calculate the number of SCSI disk controllers, we need to know the average transfer rate per disk, in order to ensure we can put the maximum of 7 disks per SCSI controller and that a disk controller won't saturate the memory-I/O bus during a DMA transfer

$$\text{Disk transfer rate} = (\text{transfer size} / \text{transfer time}) = 64\text{KB}/6.9\text{ms} = 9.56 \text{ MB/sec}$$

Thus 7 disks won't saturate either the SCSI controller (with a maximum transfer rate of 320 MB/sec) or the memory-I/O bus (1000 MB/sec). This means we will need 6/7 or 10 SCSI controllers

## I/O Interface

### I/O Transactions

- An I/O transaction is a sequence of operations over the interconnect that includes a request and may include a response either of which may carry data. A transaction is initiated by a single request and may take many individual bus operations. An I/O transaction typically includes two parts
  - Sending the address
  - Receiving or sending the data

### Bus transactions are defined by what they do to memory

- A **read** transaction reads data from memory (to either the processor or an I/O device)
- A **write** transaction writes data to the memory (from either the processor or an I/O device)

### Asynchronous Bus Handshaking Protocol

- Output (read) data from memory to an I/O device



I/O device signals a request by raising ReadReq and putting the addr on the data lines

Memory sees ReadReq, reads addr from data lines, and raises Ack

I/O device sees Ack and releases the ReadReq and data lines

Memory sees ReadReq go low and drops Ack

When memory has data ready, it places it on data lines and raises DataRdy

I/O device sees DataRdy, reads the data from data lines, and raises Ack

Memory sees Ack, releases the data lines, and drops DataRdy

I/O device sees DataRdy go low and drops Ack

### A Typical I/O System



### Interfacing I/O Devices to the Processor, Memory, and OS

- The operating system acts as the interface between the I/O hardware and the program requesting I/O since

- Multiple programs using the processor share the I/O system
- I/O systems usually use interrupts which are handled by the OS
- Low-level control of an I/O device is complex and detailed

- Thus OS must handle interrupts generated by I/O devices and supply routines for low-level I/O device operations, provide equitable access to the shared I/O resources, protect those I/O devices/activities to which a user program doesn't have access and schedule I/O requests to enhance system throughput

OS must be able to give commands to the I/O devices

I/O device must be able to notify the OS about its status

Must be able to transfer data between the memory and the I/O device

### Communication of I/O Devices and Processor

- How the processor directs the I/O devices

- Special I/O instructions

- Must specify both the device and the command

#### Memory-mapped I/O

- Portions of the high-order memory address space are assigned to each I/O device
- Read and writes to those memory addresses are interpreted as commands to the I/O devices

- Load/stores to the I/O address space can only be done by the OS

#### How I/O devices communicate with the processor

- Polling – the processor periodically checks the status of an I/O device (through the OS) to determine its need for service

- Processor is totally in control – but does all the work

- Can waste a lot of processor time due to speed differences

- Interrupt-driven I/O – the I/O device issues an interrupt to indicate that it needs attention