

CS2340 DAE201, Spring 2023

Five main components of a computer system:

CPU, Memory, Storage, Input, Output

Software: Application software and System software



Application software: written by HLL

System software: OS, compiler...

These are called programs: Instruction & Data

Abstraction: help deals with complexity, and it hide low level details.

ISA: Instruction set Architecture: The hardware / software interface

Application binary interface: ISA plus system software interface

Implementation: details underlying the interface

Level of program code: HLL  $\rightarrow$  Assembly  $\rightarrow$  Machine code

Inside CPU: Data Path: performs operations on info; Control: control data path; Cache memory; memory

Moore's Law: The number of transistors on an integrated chip doubles every two years.

All data and programs are represented as binary

Volatile main memory: normal memory      Non-volatile secondary memory: disk, Flash memory, CD/DVD

Kilobyte:  $2^{10}/1024 \rightarrow 10^3$       Megabyte:  $2^{20}/1048576 \rightarrow 10^6$       Gigabyte:  $2^{30}/1073,741,824 \rightarrow 10^9$

Terabyte:  $2^{40} \rightarrow 10^{12}$       Petabyte:  $2^{50} \rightarrow 10^{15}$       Exabyte:  $2^{60} \rightarrow 10^{18}$

Abstract levels: Problem-oriented language  $\rightarrow$  Assembly  $\rightarrow$  OS  $\rightarrow$  ISA  $\rightarrow$  Microstructure  $\rightarrow$  Digital logic

From High level language:

HLL  $\rightarrow$  Compiler  $\rightarrow$  Assembler  $\rightarrow$  Linker  $\rightarrow$  Loader  $\rightarrow$  Executable

Linker: merge all the object files and program library together

Loader: Load the program and library to memory

Assembly:

two types: native and pseudo

native instruction: can be understood by machine directly; Ex: add, RS, RD, RT

pseudo instruction: sugar-coated instructions for programmers.

Assembler can convert pseudo to native

Assembler can understand instructions but not CPU

- start with ":"

- Executed by assembler at assembly time, not at run-time

Directives for allocating data items: .word, .half, .bytes, .ascii

Directives for segment information: .data, .text

Symbol related directive: global

MIPS syntax: number are base 10, Hex start with "0x"

Labels are followed by ":"

Identifiers begin with letter and may contain alphanumeric, underscore or dots.

Key words and instructions can not be used as identifier.

Comments start with "#"

Assembly language statements can not be split across multiple lines

[label:] [optional] [operand1] [, operand2 [, operand3]] ] [# comment]

Syscall: A pseudo instruction request for a OS service.

put requested service at \$V0

put input value at \$A0 (or \$F12 if it is float)

Get out put value from \$V0 after called

Example:

li \$V0, 4 # syscall 4, print string

la \$A0, hello # load label hello's string to \$A0

syscall # call

| Service           | Code | Arguments                                                                                 | Return                  |
|-------------------|------|-------------------------------------------------------------------------------------------|-------------------------|
| print int         | 1    | \$a0 = value                                                                              |                         |
| print float       | 2    | \$f12 = value                                                                             |                         |
| print double      | 3    | \$f12 = value                                                                             |                         |
| print str         | 4    | \$a0 = address of str                                                                     |                         |
| read int          | 5    |                                                                                           | \$v0 = value read       |
| read float        | 6    |                                                                                           | \$f12 = value read      |
| read double       | 7    |                                                                                           | \$f12 = value read      |
| read str          | 8    | \$a0 = address of str(label)<br>\$a1 = number of chars<br>\$a2 = bytes of storage desired |                         |
| memory allocation | 9    |                                                                                           | \$v0 = address of block |
| exit              | 10   |                                                                                           |                         |

Ex:

.Text

global main

main:

la \$a0, Hello # load address  $\rightarrow$  load value in label Hello to \$a0

li \$v0, 4 # sel request to 4

syscall # request print string

li \$v0, 5 # read a int

syscall # execute

sw \$v0, x # save word, save \$v0 value to label X

lw \$t0, x # load word, load X's value to \$t0

sub \$t2, \$t0, \$t1 # subtract, save value of t0 - t1 to t2

Data

X: space + # allocate 4 bytes of memory for label X

Hello: .asciz "Hello world" # store string in memory

Performance of computer program: speed and cost

Standard for performance: response time and throughput

response time: how long does it take to do a task

how many seconds does it take to compile a program

throughput: total work done in a unit of time

Both are determined by execution time, time to execute a program.

$$\text{Performance} = \frac{1}{\text{execution time}}$$

How to calculate execution time:

elapsed time: total response time, including CPU, I/O, OS, all aspects

CPU time: time used by CPU to execute a task

Elapsed time determine system performance,

CPU time compromise user CPU time and system CPU time.

cpu time is very important parameter of execution time

when we say X is n times higher than Y in performance:  $\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{execution time}_Y}{\text{execution time}_X} = n$

Ques: CPU clocking: operation of digital hardware governed by a constant-rate clock

clock period (Time)

Time      clock cycle: an up and down

clock period: the time required to finish a clock period

$$\text{ex: } 250\text{ops} = 0.25\text{ns} = 250 \times 10^{-12}\text{ sec}$$

clock frequency: the number of clock cycles it can do in 1 second

$$\text{ex: } 4\text{GHz} = 4000\text{MHz} = 4.0 \times 10^9\text{Hz} (4 \times 10^9 \text{clock cycles per s})$$

$$\text{CPU Time} = \text{CPU clock cycles} \times \text{Clock cycle time} = \frac{\text{CPU clock cycles}}{\text{clock Rate}} = \frac{\text{number of clock cycles}}{\text{clock frequency}}$$

(number of clock cycles)      (clock period)

CPU performance can be improved by:

reducing number of clock periods

increasing clock rate

Hardware design must trade off between clock rate against cycle count

Example: Computer A: 2GHz clock, 10s CPU time

Design Computer B aim for 6s and can do faster clock, but result in 1.2x clock cycles

How fast must computer B clock be

$$\text{Clock Rate}_B = \frac{\text{clock cycles of B}}{\text{CPU time of B}} = \frac{1.2 \times \text{clock cycles of A}}{6s}$$

$$\text{clock cycles of A} = \text{CPU time} \times \text{Clock Rate}_A = 10s \times 2\text{GHz} = 20 \times 10^9$$

$$\text{Clock Rate}_B = \frac{1.2 \times 20 \times 10^9}{6s} = \frac{24 \times 10^9}{6s} = 4\text{GHz}$$

Clock cycles = Instruction Count  $\times$  Cycles per instruction

CPI: cycle per instruction; clock cycles needed for each instruction

Clock cycles = number of instructions  $\times$  cycles needed per instruction

CPU Time = number of instruction  $\times$  CPI  $\times$  Clock cycle time =  $\frac{\text{number of instruction} \times \text{CPI}}{\text{clock frequency}}$

Instruction count for a program: Determined by program, ISA, and compiler.

Average cycles per instruction: Determined by CPU hardware and different instructions have different CPI

CPI example

Computer A: Cycle time = 250 ps, CPI = 2

Computer B: Cycle time = 500 ps, CPI = 1.2

$$\text{CPU Time}_A = 1 \times 2 \times 250 = 500$$

$$\text{CPU Time}_B = 1 \times 1.2 \times 500 = 600$$

$$\frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{600}{500} = 1.2 \rightarrow A \text{ is 1.2 times faster than B}$$

$$\text{CPI} = \frac{\text{total clock cycles}}{\text{number of instructions}}$$

$$\text{CPU Time} = \frac{\text{number of instructions}}{\text{program}} \times \frac{\text{total clock cycles}}{\text{instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$

Performance depends on:

Algorithm: other instruction count, possibly average CPI

Programming language: other instruction count, average CPI

Compiler: Same as programming language

Instruction set architecture: All above, and Tc

$$\text{Power} = \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency}$$

Pitfall: MIPS as a performance metric

MIPS: Millions of Instructions Per Second

- Don't account for

Different computers' difference in ISA

Different instructions' difference in complexity

$$\text{MIPS} = \frac{\text{Instruction Count}}{\text{Execution Time} \times 10^6} = \frac{IC}{\frac{Clock\ Frequency}{CPI} \times 10^6} = \frac{\text{Clock Frequency}}{CPI \times 10^6}$$

$$\text{Amdahl's Law: } T_{\text{improve}} = \frac{T_{\text{target}}}{T_{\text{improved}} + T_{\text{unimproved}}}$$

Ex: 80% / 100%, how much improvement in performance needed to get 5x overall?

$$20 = \frac{80}{n} + 20 \rightarrow \text{can't be done}$$

The power is always a limit on performance

Ex: hex to dec (If first number of hex is greater than 7, then it is negative)

a) 0xFAD E

negative something  
VK

F: 15 → 1111

1111 1010 1101 1110

$$2^{10} + 2^8 + 2^5 + 2^1 = 1314$$

A: 10 → 1010

- 0000 0000 0000 0001

So this is -1314

D: 13 → 1101

1111 1010 1101 1101

flip sign

E: 14 → 1110

0000 0101 0010 0010

↓ ↓ ↓ ↓

2<sup>10</sup> 2<sup>8</sup> 2<sup>5</sup> 2<sup>1</sup>

b) 0x4BEEF

4: 0100

2<sup>14</sup> 2<sup>12</sup> 2<sup>10</sup>

0100 1011 1100 1111

$$= 19439$$

B: 11 → 1011

1111 1010 1101 1101

2<sup>10</sup> 2<sup>8</sup> 2<sup>5</sup> 2<sup>1</sup>

15

E: 14 → 1110

F: 15 → 1111

Ex: Dec to hex (32 bits)

a) -24312

$$24312 \div 2 = 12156 \dots 0$$

$$94 \div 2 = 47 \dots 0$$

$$12156 \div 2 = 6078 \dots 0$$

$$47 \div 2 = 23 \dots 1$$

$$6078 \div 2 = 3039 \dots 0$$

$$23 \div 2 = 11 \dots 1$$

$$3039 \div 2 = 1519 \dots 1$$

$$11 \div 2 = 5 \dots 1$$

$$1519 \div 2 = 759 \dots 1$$

$$5 \div 2 = 2 \dots 1$$

$$759 \div 2 = 379 \dots 1$$

$$2 \div 2 = 1 \dots 0$$

$$379 \div 2 = 189 \dots 1$$

$$1 \div 2 = 0 \dots 1$$

$$189 \div 2 = 94 \dots 1$$

sign = 0

$$24312 = 0101 \ 1110 \ 1111 \ 1000$$

A  
↑  
1  
↑  
0  
↑  
8

$$-24312 = 1010 \ 0001 \ 0000 \ 0111 + 0001 = 1010 \ 0001 \ 0000 \ 1000$$

A108 → extend to 32 bits → FFFFA108 (Because it is negative, so pack F)

b) 32544

$$32544 \div 16 = 2034 \dots 0 \rightarrow 0$$

$$2034 \div 16 = 127 \dots 2 \rightarrow 2$$

$$127 \div 16 = 7 \dots 15 \rightarrow F$$

$$7 \div 16 = 0 \dots 7 \rightarrow 7$$

7F20  $\rightarrow$  IEEE 32 bits 00007F20

(positive, pad with 0)

Ex: float to IEEE binary

a) 0.375

$$-0.375 = -\frac{3}{8} = -\frac{3}{2^3} = -1 \times 3 \times 2^{-3} = -1 \times 0011_2 \times 2^{-3} = -1 \times 1.1 \times 2^{-2}$$

more decimal of 0 will to the left by 1

to get a 1.1, then the right side  $2^3$  times 2 become  $2^{-2}$

S: sign = 1 (negative), exponent:  $-2 + 127 = 125 = 01111101$

(power at  
the end)

Fraction:  $\underbrace{100000\dots00}_{23 \text{ bits}}$  (save the decimal part of 1. something)

IEEE =  $\underbrace{01111101}_8 \underbrace{100000\dots}_{23}$

Ex: IEEE binary to float

a) IEEE 0 0000

S: 0011

E:  $14 \rightarrow 1110$

$01111101 | 11000000000000000000000000$

positive, S=0

$$01111101 = 125 \rightarrow 125 - 127 = -2, \text{ Exponent} = -2$$

$$11000000\dots \rightarrow 1 + 0.11 = 1.11 \rightarrow \text{Fraction} = 1.11$$

$$1 \times 1.11 \times 2^{-2} = 1 \times 0111 \times 2^{-4} = 1 \times 7 \times 2^{-4} = \frac{7}{16}$$

↑  
S is 0, so positive

↑  
move 2 decimal  
to right, and  
put 0 at front

↑  
moved to 2 decimal  
right, so  
divide by  $2^2 \div 2^2 = 2^{-4}$

## RISC vs CISC:

RISC: reduced instruction set computer

CISC: complex instruction set computer

RISC: single word instruction, fixed-field encoding, load/store architecture, simple operation

CISC: variable length instruction, variable format, memory operands, complex operations.

Why to use RISC: save memory, register, to make chip design simple and regular (universal/common)

A MIPS instruction must be 32 bits long as a word

## MIPS Instructions:

- op: operation code (opcode)

R-format

- rs: first source register number

opcode(6 bits) + rs(5 bits) + rt(5 bits) + rd(5 bits) + shamt(5 bits) + funct(6 bits)

- rt: second source register number

$$6 + 5 + 5 + 5 + 6 = 32 \text{ bits}$$

- rd: destination register number

I-format

- shamt: shift amount

opcode(6 bits) + rs(5 bits) + rt(5 bits) + Imm/late(16 bits)

$$6 + 5 + 5 + 16 = 32 \text{ bits}$$

Logical operation: add (add)

Data transportation: lw (load word)

J-format

chain control: j (jump)

opcode(6 bits) + address(26 bits)

$$6 + 26 = 32 \text{ bits}$$

Ex: Sub \$t1 \$s1 \$s2  
opcode: 0 9 17 18  
funcode: 0x2234

R: opode rs rt rd shamt funcode take 6  
0 17 18 9 0 0x22 → 00100010  
000000 10001 10010 01001 00000 100010

Arithmetic operations use register operands

I format: Immediate arithmetic and Load/Store instructions:

addi \$r0, \$r1, 4 , add 4 and \$r1 and store in r0

Immediate instructions are used to handle constant.

Array operation:

$g = h + A[8]$ , g in \$s1, h in \$s2, base address of A is in \$s3

offset,  $4 \times 8 = 32$  as 4 bytes per data

lw \$t0, 32(\$s3) ← base address

add \$s1, \$s2, \$t0

Register is faster than memory, try use more registers

Access A[i], Base of A is in \$s4, bins1, cins2, i in \$s3

add \$t1, \$t1, \$t1 # \$t1 × 2 → i → 2i

add \$t1, \$t1, \$t1 # \$t1 × 4 → i → 4i

add \$t1, \$t1, \$s4 # add 4i to base address, you get value's address

lw \$t0, 0(\$t1) # local value's address is stored in \$t1 to \$t0 with offset 0

Format Example:

Data transportation:



|        |        |       |           |           |
|--------|--------|-------|-----------|-----------|
| I      | opcode | rs    | rt        | immediate |
| 101011 | 10111  | 01111 | 0000 0000 | 0001 1100 |

Control flow: Von Neumann computer architecture.

Control flow is determined by sequence of executed instructions.

Typically, instructions are executed in sequential order.

executed in sequence of address (4 bytes)

Jump and Branch instruction:

The jump and branch instructions change default flow of chain

if ( $i = j$ )  $f = g + h$       → bne \$s<sub>3</sub>, \$s<sub>4</sub>, Else # if \$s<sub>3</sub> not equal to \$s<sub>4</sub>, go to Else label  
else  $f = g - h$       add \$s<sub>0</sub>, \$s<sub>1</sub>, \$s<sub>2</sub> # default, when \$s<sub>3</sub> = \$s<sub>4</sub>

j Exit # jump to exit label

Else: sub \$s<sub>0</sub>, \$s<sub>1</sub>, \$s<sub>2</sub> # Else case

Exit: ...

bne: branch not equal

beq: branch equal

Can't use a register and add to it the 16 bits offset

16 bits  $\rightarrow$  extend to 18 bits by  $\times 4$  ] branching destination address  
program counter  $\rightarrow$  program counter + 4 ] if true, update program counter

Example:

bne \$S\_0, \$S\_1, Lab1  
add \$S\_3, \$S\_0, \$S\_1  
Lab1: ...

op: 0x5 rs: 0x16 rt: 0x11 offset: 0x1  
000101 10000 10001 0000 0000 0000 0001

Jump: j label

j oper code address  
6 bits 26 bits

jump get address: 26 bits address  $\rightarrow$  extend to 28 bits by  $\times 4$ ,

24 bits from program counter + [4  $\times$  address(26 bits)] (28 bits)

While loop: i in \$S\_0, j in \$S\_1, k in \$S\_2  
while (i != k)  
i = i + j

loop: beg \$S\_0 \$S\_2 End  
add \$S\_0 \$S\_0 \$S\_1

j loop

Exit: ...

jal: Jump and Link, jump to address and store where are you from in \$ra

jr: Jump to address stored in register

slt: \$t<sub>0</sub> \$s<sub>0</sub> \$s<sub>1</sub> #if \$s<sub>0</sub> < \$s<sub>1</sub>, set \$t<sub>0</sub> = 1 else \$t<sub>0</sub> = 0

slti: \$t<sub>0</sub> \$s<sub>0</sub> constant #if \$s<sub>0</sub> < constant, set \$t<sub>0</sub> = 1, else \$t<sub>0</sub> = 0

slt/slti: for signed, sltu/slhi for unsigned

Branch far away:

If target is farther than 16 bits, the assembler will re-write the code

beg \$s<sub>0</sub> \$s<sub>1</sub> L<sub>1</sub> #old

bne \$s<sub>0</sub> \$s<sub>1</sub> L<sub>2</sub> #new

j L<sub>1</sub> #j can jump much farther  
L<sub>2</sub>: ...

shift left by i bits: multiply by 2<sup>i</sup> shift right by i bits: divide by 2<sup>i</sup>

There is no not as: A NOR 0 = NOT A Nor: not or, reverse of or, return 1 with 2^m - 1

Bytes/half word operation:

lb/lbu lh/lhu load byte, load byte unsigned, load half word, load half word unsigned

How to load 32 bits constant to register

lui rt, constant

ori rt, constant

1010 1010 0101 0101

lui \$t0 10101010 # load upper half to register

ori \$t0 0101 0101 # load lower part to register

## Memory Organization:

main memory can be view as a large one-dimensional array of bytes.

a memory address is an index into the memory array

Byte is the smallest address unit, but most data use larger space (word)

A word in MIPS architecture is a 32 bits or 4 bytes

Endianness: how are data stored in the memory

### Memory

| little endian | a   0A     | big endian | a   0A   |
|---------------|------------|------------|----------|
|               | → a+1   0C |            | a+1   0B |
|               | a+2   0B   |            | a+2   0C |
|               | a+3   0A   |            | a+3   0D |

Storage: A D B C O D

Ex: saving Woosh! in little end → SOOW \o\o\h

Bit: 0 or 1, Nibble: 4 bits, bytes: 8 bits, Word: may be 16, 32, 64 bits, depends on size of register

HLI: char 8 bits, short: 16 bits, int: 32, 64 bits, long: 32 bits.

Hex              bin

A

1010

C

1100

E

1110

B

1011

D

1101

F

1111