

# CS330 - Computer Organization and Assembly Language Programming

Lecture 20

-Review-

Professor : Mahmut Unan – UAB CS

# Agenda

## Review

*You are responsible for all the topics*

*This is a rough review*

## Midterm 2 Exam

2nd Exam: Thursday 03/31/2022-

Lecture Time

# Hello World !

A typical *C program* basically consists of the following parts;

- Preprocessor Commands
- Functions
- Variables
- Statements & Expressions
- Comment

```
#include <stdio.h>
int main()
{
    /* the first program in CS330 */
    printf("hello, world\n");
    return 0;
}
```

# Information is Bits + Context

- “hello.c” is a source code
  - Sequence of bits (0 or 1)
  - 8-bit data chunks are called bytes
  - Each byte represents some text character in the program

|     |      |      |      |      |     |     |      |      |     |     |     |     |     |     |     |
|-----|------|------|------|------|-----|-----|------|------|-----|-----|-----|-----|-----|-----|-----|
| #   | i    | n    | c    | l    | u   | d   | e    | <sp> | <   | s   | t   | d   | i   | o   | .   |
| 35  | 105  | 110  | 99   | 108  | 117 | 100 | 101  | 32   | 60  | 115 | 116 | 100 | 105 | 111 | 46  |
| h   | >    | \n   | \n   | i    | n   | t   | <sp> | m    | a   | i   | n   | (   | )   | \n  | {   |
| 104 | 62   | 10   | 10   | 105  | 110 | 116 | 32   | 109  | 97  | 105 | 110 | 40  | 41  | 10  | 123 |
| \n  | <sp> | <sp> | <sp> | <sp> | p   | r   | i    | n    | t   | f   | (   | "   | h   | e   | 1   |
| 10  | 32   | 32   | 32   | 32   | 112 | 114 | 105  | 110  | 116 | 102 | 40  | 34  | 104 | 101 | 108 |
| l   | o    | ,    | <sp> | w    | o   | r   | l    | d    | \   | n   | "   | )   | ;   | \n  | }   |
| 108 | 111  | 44   | 32   | 119  | 111 | 114 | 108  | 100  | 92  | 110 | 34  | 41  | 59  | 10  | 125 |

Figure 1.2 The ASCII text representation of hello.c.

# Programs translated by other programs

- unix> gcc -o hello hello.c



## Compilation System



- **Pre-processing**
  - E.g., `#include<stdio.h>` is inserted into `hello.i`
- **Compilation (.s)**
  - Each statement is an assembly language program
- **Assembly (.o)**
  - A binary file whose bytes encode mach. language instructions
- **Linking**
  - Get `printf()` which resides in a separate precompiled object file

# Virtual memory

- Illusion that each process has exclusive use of a large main memory
  - Example: Virtual address space for Linux
- **Files:** A sequence of bytes



# Data Measurement Chart

## Data Measurement Size

|                      |                                 |
|----------------------|---------------------------------|
| <b>Bit</b>           | Single Binary Digit<br>(1 or 0) |
| <b>Byte</b>          | 8 bits                          |
| <b>Kilobyte (KB)</b> | 1,024 Bytes                     |
| <b>Megabyte (MB)</b> | 1,024 Kilobytes                 |
| <b>Gigabyte (GB)</b> | 1,024 Megabytes                 |
| <b>Terabyte (TB)</b> | 1,024 Gigabytes                 |
| <b>Petabyte (PB)</b> | 1,024 Terabytes                 |
| <b>Exabyte (EB)</b>  | 1,024 Petabytes                 |

# The Decimal System

- System based on decimal digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9) to represent numbers
- For example the number 83 means eight tens plus three:

$$83 = (8 * 10) + 3$$

- The number 4728 means four thousands, seven hundreds, two tens, plus eight:

$$4728 = (4 * 1000) + (7 * 100) + (2 * 10) + 8$$

- The decimal system is said to have a **base**, or **radix**, of 10. This means that each digit in the number is multiplied by 10 raised to a power corresponding to that digit's position:

$$83 = (8 * 10^1) + (3 * 10^0)$$

$$4728 = (4 * 10^3) + (7 * 10^2) + (2 * 10^1) + (8 * 10^0)$$

# The Binary System

- Only two digits, 1 and 0
- Represented to the base 2
- The digits 1 and 0 in binary notation have the same meaning as in decimal notation:

$$0_2 = 0_{10}$$

$$1_2 = 1_{10}$$

- To represent larger numbers each digit in a binary number has a value depending on its position:

$$10_2 = (1 * 2^1) + (0 * 2^0) = 2_{10}$$

$$11_2 = (1 * 2^1) + (1 * 2^0) = 3_{10}$$

$$100_2 = (1 * 2^2) + (0 * 2^1) + (0 * 2^0) = 4_{10}$$

For representing numbers in base 2, there are two possible digits (0, 1) in which column values are a power of two:

|       |       |       |       |       |       |       |       |
|-------|-------|-------|-------|-------|-------|-------|-------|
| $2^7$ | $2^6$ | $2^5$ | $2^4$ | $2^3$ | $2^2$ | $2^1$ | $2^0$ |
| 128   | 64    | 32    | 16    | 8     | 4     | 2     | 1     |

---

$$\begin{array}{cccccccccc} 0 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\ 0 + 64 + 32 + 0 + 0 + 0 + 2 + 1 = 99 \end{array}$$

Although values represented in base 2 are significantly longer than those in base 10, **binary representation is used in digital computing because of the resulting simplicity of hardware design**

# Encoding Byte Values

- Byte = 8 bits
  - Binary  $00000000_2$  to  $11111111_2$
  - Decimal:  $0_{10}$  to  $255_{10}$
  - Hexadecimal  $00_{16}$  to  $FF_{16}$ 
    - Base 16 number representation
    - Use characters '0' to '9' and 'A' to 'F'
    - Write  $FA1D37B_{16}$  in C as
      - `0xFA1D37B`
      - `0xfa1d37b`

| Hex | Decimal | Binary |
|-----|---------|--------|
| 0   | 0       | 0000   |
| 1   | 1       | 0001   |
| 2   | 2       | 0010   |
| 3   | 3       | 0011   |
| 4   | 4       | 0100   |
| 5   | 5       | 0101   |
| 6   | 6       | 0110   |
| 7   | 7       | 0111   |
| 8   | 8       | 1000   |
| 9   | 9       | 1001   |
| A   | 10      | 1010   |
| B   | 11      | 1011   |
| C   | 12      | 1100   |
| D   | 13      | 1101   |
| E   | 14      | 1110   |
| F   | 15      | 1111   |

$1100\ 1001\ 0111\ 1011 \rightarrow 0xC97B$

# Machine Words

- Machine has “word size”
  - Nominal size of integer-valued data
  - More importantly – a virtual address is encoded by such a word
    - Hence, it determines max size of virtual address space
  - Most current machines are 32 bits (4 bytes)
    - Limits addresses to 4GB
    - Becoming too small for memory-intensive applications
  - Newer systems are 64 bits (8 bytes)
    - Potentially address  $\approx 1.84 \times 10^{19}$  bytes
  - Machines support multiple data formats
    - Fractions or multiples of word size
    - Always integral number of bytes

# Data Sizes

- Each computer has a word size
  - For a machine with  $w$ -bit word size
  - The virtual address can range from 0 to  $2^w - 1$
  - The program access to at most  $2^w$  bytes
- 32 bit vs 64 bit

# Addressing and Byte Ordering

- For objects that span multiple bytes (e.g. integers), we need to agree on two things
  - what would be the address of the object?
  - how would we order the bytes in memory?

# Byte Ordering

- How to order bytes within multi-byte word in memory
- Conventions
  - (most) Sun's, IBMs are “Big Endian” machines
    - Least significant byte has highest address (comes last)
  - (most) Intel's are “Little Endian” machines
    - Least significant byte has lowest address (comes first)
- Example
  - Variable x has 4-byte representation 0x01234567
  - Address given by &x is 0x100 0x100 0x101

## BigEndian



## LittleEndian



# Boolean Variables and Operations

- Developed by George Boole in 19th Century
  - Algebraic representation of logic
    - Encode “True” as 1 and “False” as 0
  - $\langle \{0,1\}, |, \&, \sim, 0, 1 \rangle$
  - | is “sum” operation, & is “product” operation
  - $\sim$  is “complement” operation (not additive inverse)
  - 0 is identity for sum, 1 is identity for product
- Makes use of variables and operations
  - Are logical
  - A variable may take on the value 1 (TRUE) or 0 (FALSE)
  - Basic logical operations are AND, OR, XOR and NOT

# Boolean Variables and Operations / 2

- AND
  - Yields true (binary value 1) if and only if both of its operands are true
  - In the absence of parentheses the AND operation takes precedence over the OR operation
  - When no ambiguity will occur the AND operation is represented by simple concatenation instead of the dot operator
- OR
  - Yields true if either or both of its operands are true
- NOT
  - Inverts the value of its operand

# Table: Boolean Operators

## (a) Boolean Operators of Two Input Variables

| P | Q | NOT P<br>( $\bar{P}$ ) | P AND Q<br>( $P \bullet Q$ ) | P OR Q<br>( $P + Q$ ) | P NAND Q<br>( $\overline{P \bullet Q}$ ) | P NOR Q<br>( $\overline{P + Q}$ ) | P XOR Q<br>( $P \oplus Q$ ) |
|---|---|------------------------|------------------------------|-----------------------|------------------------------------------|-----------------------------------|-----------------------------|
| 0 | 0 | 1                      | 0                            | 0                     | 1                                        | 1                                 | 0                           |
| 0 | 1 | 1                      | 0                            | 1                     | 1                                        | 0                                 | 1                           |
| 1 | 0 | 0                      | 0                            | 1                     | 1                                        | 0                                 | 1                           |
| 1 | 1 | 0                      | 1                            | 1                     | 0                                        | 0                                 | 0                           |

## (b) Boolean Operators Extended to More than Two Inputs (A, B, ...)

| Operation | Expression                             | Output = 1 if                                       |
|-----------|----------------------------------------|-----------------------------------------------------|
| AND       | $A \bullet B \bullet \dots$            | All of the set {A, B, ...} are 1.                   |
| OR        | $A + B + \dots$                        | Any of the set {A, B, ...} are 1.                   |
| NAND      | $\overline{A \bullet B \bullet \dots}$ | Any of the set {A, B, ...} are 0.                   |
| NOR       | $\overline{A + B + \dots}$             | All of the set {A, B, ...} are 0.                   |
| XOR       | $A \oplus B \oplus \dots$              | The set {A, B, ...} contains an odd number of ones. |

# Table: Basic Identities of Boolean Algebra

| <b>Basic Postulates</b>                              |                                                      |                    |
|------------------------------------------------------|------------------------------------------------------|--------------------|
| $A \cdot B = B \cdot A$                              | $A + B = B + A$                                      | Commutative Laws   |
| $A \cdot (B + C) = (A \cdot B) + (A \cdot C)$        | $A + (B \cdot C) = (A + B) \cdot (A + C)$            | Distributive Laws  |
| $1 \cdot A = A$                                      | $0 + A = A$                                          | Identity Elements  |
| $A \cdot \overline{A} = 0$                           | $A + \overline{A} = 1$                               | Inverse Elements   |
| <b>Other Identities</b>                              |                                                      |                    |
| $0 \cdot A = 0$                                      | $1 + A = 1$                                          |                    |
| $A \cdot A = A$                                      | $A + A = A$                                          |                    |
| $A \cdot (B \cdot C) = (A \cdot B) \cdot C$          | $A + (B + C) = (A + B) + C$                          | Associative Laws   |
| $\overline{A \cdot B} = \overline{A} + \overline{B}$ | $\overline{A + B} = \overline{A} \cdot \overline{B}$ | DeMorgan's Theorem |

# Exercise 1

Evaluate the following expression when A =0, B =1, and C= 1

$$F = B + \bar{C}A + B\bar{A} + A\bar{B}$$

$$1 + 0 + 1 + 0$$

$$1 + 1 = 1$$

Simplify the following functions;

$$F = AB + BC + \bar{B}C$$

$$F = A + \bar{A} B$$

# General Boolean Algebras

- Boolean operations can be extended to work on bit vectors
  - Operations applied bitwise

$$\begin{array}{rcl} \begin{array}{c} 01101001 \\ \& \underline{01010101} \end{array} & \begin{array}{c} 01101001 \\ \mid \underline{01010101} \end{array} & \begin{array}{c} 01101001 \\ \wedge \underline{01010101} \end{array} \\ \begin{array}{c} 01000001 \\ \textcolor{red}{01111101} \end{array} & \begin{array}{c} 01111101 \\ \textcolor{red}{00111100} \end{array} & \begin{array}{c} 00111100 \\ \textcolor{red}{10101010} \end{array} \\ & & \begin{array}{c} \sim \underline{01010101} \\ \textcolor{red}{10101010} \end{array} \end{array}$$

- All of the properties of Boolean algebra apply
- Now, Boolean  $\mid$ ,  $\&$  and  $\sim$  correspond to set union, intersection and complement

# Bit-Level Operations in C

- Operations **&**, **|**, **~**, **^** Available in C
  - Apply to any “integral” data type
    - long, int, short, char, unsigned
  - View arguments as bit vectors
  - Arguments applied bit-wise
- Examples (Char data type)
  - $\sim 0x41 \rightarrow 0xBE$ 
    - $\sim 01000001_2 \rightarrow 10111110_2$
  - $\sim 0x00 \rightarrow 0xFF$ 
    - $\sim 00000000_2 \rightarrow 11111111_2$
  - $0x69 \& 0x55 \rightarrow 0x41$ 
    - $01101001_2 \& 01010101_2 \rightarrow 01000001_2$
  - $0x69 | 0x55 \rightarrow 0x7D$ 
    - $01101001_2 | 01010101_2 \rightarrow 01111101_2$

# Contrast: Logic Operations in C

- Contrast to Logical Operators

- **&&, ||, !**

- View 0 as “False”
    - Anything nonzero as “True”
    - Always return 0 or 1
    - Early termination

- Examples (char data type)

- $\text{!}0x41 \rightarrow 0x00$
  - $\text{!}0x00 \rightarrow 0x01$
  - $\text{!}!\text{0x41} \rightarrow 0x01$
  - $0x69 \&\& 0x55 \rightarrow 0x01$
  - $0x69 \mid\mid 0x55 \rightarrow 0x01$
  - $p \&\& *p$  (avoids null pointer access)



Watch out for **&&** vs. **&** (and **||** vs. **|**)...  
one of the more common oopsies  
in C programming

# Shift Operations

- Left Shift:  $x \ll y$ 
  - Shift bit-vector **x** left **y** positions
    - Throw away extra bits on left
    - Fill with 0's on right
- Right Shift:  $x \gg y$ 
  - Shift bit-vector **x** right **y** positions
    - Throw away extra bits on right
    - Logical shift
      - Fill with 0's on left
    - Arithmetic shift
      - Replicate most significant bit on left
      - Useful in two's compliment
- Undefined Behavior
  - Shift amount  $< 0$  or  $\geq$  word size

|                          |          |
|--------------------------|----------|
| Argument <b>x</b>        | 01100010 |
| <b>&lt;&lt; 3</b>        | 00010000 |
| Log. <b>&gt;&gt; 2</b>   | 00011000 |
| Arith. <b>&gt;&gt; 2</b> | 00011000 |

|                          |          |
|--------------------------|----------|
| Argument <b>x</b>        | 10100010 |
| <b>&lt;&lt; 3</b>        | 00010000 |
| Log. <b>&gt;&gt; 2</b>   | 00101000 |
| Arith. <b>&gt;&gt; 2</b> | 11101000 |

# Unsigned Encodings

$$B2U(X) = \sum_{i=0}^{w-1} x_i \cdot 2^i$$

(Binary To Unsigned)

e.g. B2U ([1011]) =  $1 * 2^3 + 0 * 2^2 + 1 * 2^1 + 1 * 2^0 = 11$

– C short 2 bytes long

```
short int x = 15213;
```

|   | Decimal | Hex   | Binary            |
|---|---------|-------|-------------------|
| x | 15213   | 3B 6D | 00111011 01101101 |

# Examples



# Two's Complement

$$B2T(X) = -x_{w-1} \cdot 2^{w-1} + \sum_{i=0}^{w-2} x_i \cdot 2^i$$

  
**Sign Bit**

- e.g. B2T ([1011]) =  $-1 * 2^3 + 0 * 2^2 + 1 * 2^1 + 1 * 2^0 = -5$
- C short 2 bytes long

```
short int y = -15213;
```

|   | Decimal | Hex   | Binary            |
|---|---------|-------|-------------------|
| x | 15213   | 3B 6D | 00111011 01101101 |
| y | -15213  | C4 93 | 11000100 10010011 |

- **Sign bit**

- For 2's complement, most significant bit indicates sign
  - 0 for nonnegative; 1 for negative

# Two's Compliment

- Invert and add **one**

Suppose we're working with 8 bit quantities and suppose we want to find how **-28** would be expressed in two's complement notation.

- First we write out 28 in **binary form**.

00011100

- Then we **invert the digits**. 0 becomes 1, 1 becomes 0.

11100011

- Then we **add 1**.

11100100

That is how one would write -28 in 8 bit binary.

# Numeric Ranges

## ■ Unsigned Values

- $UMin = 0$   
000...0
- $UMax = 2^w - 1$   
111...1

## ■ Two's Complement Values

- $TMin = -2^{w-1}$   
100...0
- $TMax = 2^{w-1} - 1$   
011...1

## ■ Other Values

- Minus 1  
111...1

## Values for $W = 16$

|      | Decimal | Hex   | Binary            |
|------|---------|-------|-------------------|
| UMax | 65535   | FF FF | 11111111 11111111 |
| TMax | 32767   | 7F FF | 01111111 11111111 |
| TMin | -32768  | 80 00 | 10000000 00000000 |
| -1   | -1      | FF FF | 11111111 11111111 |
| 0    | 0       | 00 00 | 00000000 00000000 |

# Signed vs. Unsigned in C

- Constants
  - By default are considered to be signed integers
  - Unsigned if have “U” as suffix  
`0U, 4294967259U`
- Casting
  - Explicit casting between signed & unsigned same as U2T and T2U

```
int tx, ty;  
unsigned ux, uy;  
tx = (int) ux;  
uy = (unsigned) ty;
```

- Implicit casting also occurs via assignments and procedure calls

```
tx = ux;  
uy = ty;
```

# Sign Extension

- **Task:**
  - Given  $w$ -bit signed integer  $x$
  - Convert it to  $w+k$ -bit integer with same value
- **Rule:**
  - Make  $k$  copies of sign bit:
  - $X' = x_{w-1}, \dots, x_{w-1}, x_{w-1}, x_{w-2}, \dots, x_0$



# Truncating Numbers

- Reduce the number of bits representing the number
- Truncating  $w$ -bit number to a  $k$  bit number, we drop the high order  $w-k$  bits
  - Can alter its value
    - A form of overflow

# Summary: Expanding, Truncating: Basic Rules

- **Expanding (e.g., short int to int)**
  - Unsigned: zeros added
  - Signed: sign extension
  - Both yield expected result
- **Truncating (e.g., unsigned to unsigned short)**
  - Unsigned/signed: bits are truncated
  - Result reinterpreted
  - Unsigned: mod operation
  - Signed: similar to mod
  - For small numbers yields expected behavior

# **OVERFLOW RULE:**

- If two numbers are added, and they are both positive or both negative, then overflow occurs if and only if the result has the opposite sign.

# Multiplication

- Goal: Computing Product of  $w$ -bit numbers  $x, y$ 
  - Either signed or unsigned
- But, exact results can be bigger than  $w$  bits
  - Unsigned: up to  $2w$  bits
    - Result range:  $0 \leq x * y \leq (2^w - 1)^2 = 2^{2w} - 2^{w+1} + 1$
  - Two's complement min (negative): Up to  $2w-1$  bits
    - Result range:  $x * y \geq (-2^{w-1}) * (2^{w-1}-1) = -2^{2w-2} + 2^{w-1}$
  - Two's complement max (positive): Up to  $2w$  bits, but only for  $(TMin_w)^2$ 
    - Result range:  $x * y \leq (-2^{w-1})^2 = 2^{2w-2}$
- So, maintaining exact results...
  - would need to keep expanding word size with each product computed
  - is done in software, if needed
    - e.g., by “arbitrary precision” arithmetic packages

# Unsigned Binary Multiplication

|                                                                                                                     |                                                                                                                          |
|---------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|
| $\begin{array}{r} 1011 \\ \times 1101 \\ \hline 1011 \\ 0000 \\ 1011 \\ \hline 1011 \\ \hline 10001111 \end{array}$ | <p><b>Multiplicand (11)</b></p> <p><b>Multiplier (13)</b></p> <p><b>Partial products</b></p> <p><b>Product (143)</b></p> |
|---------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|

**Figure 10.7 Multiplication of Unsigned Binary Integers**

# Power-of-2 Multiply with Shift

## ■ Operation

- $u \ll k$  gives  $u * 2^k$
- Both signed and unsigned

Operands:  $w$  bits

True Product:  $w+k$  bits

Discard  $k$  bits:  $w$  bits



## ■ Examples

- $u \ll 3 == u * 8$
- $(u \ll 5) - (u \ll 3) == u * 24$
- Most machines shift and add faster than multiply
  - Compiler generates this code automatically

# Unsigned Power-of-2 Divide with Shift

## ■ Quotient of Unsigned by Power of 2

- $u \gg k$  gives  $\lfloor u / 2^k \rfloor$
- Uses logical shift



|        | Division   | Computed | Hex   | Binary            |
|--------|------------|----------|-------|-------------------|
| x      | 15213      | 15213    | 3B 6D | 00111011 01101101 |
| x >> 1 | 7606.5     | 7606     | 1D B6 | 00011101 10110110 |
| x >> 4 | 950.8125   | 950      | 03 B6 | 00000011 10110110 |
| x >> 8 | 59.4257813 | 59       | 00 3B | 00000000 00111011 |

# Arithmetic: Basic Rules

## ■ Addition:

- Unsigned/signed: Normal addition followed by truncate, same operation on bit level
- Unsigned: addition mod  $2^w$ 
  - Mathematical addition + possible subtraction of  $2^w$
- Signed: modified addition mod  $2^w$  (result in proper range)
  - Mathematical addition + possible addition or subtraction of  $2^w$

## ■ Multiplication:

- Unsigned/signed: Normal multiplication followed by truncate, same operation on bit level
- Unsigned: multiplication mod  $2^w$
- Signed: modified multiplication mod  $2^w$  (result in proper range)

# Fractional Binary Numbers: Examples

| Value  | Representation      |
|--------|---------------------|
| 5 3/4  | 101.11 <sub>2</sub> |
| 2 7/8  | 10.111 <sub>2</sub> |
| 1 7/16 | 1.0111 <sub>2</sub> |

## Observations

- Divide by 2 by shifting right (unsigned)
- Multiply by 2 by shifting left
- Numbers of form 0.111111...<sub>2</sub> are just below 1.0
  - $1/2 + 1/4 + 1/8 + \dots + 1/2^i + \dots \rightarrow 1.0$
  - Use notation  $1.0 - \varepsilon$

# IEEE Floating Point

- **IEEE Standard 754**
  - Established in 1985 as uniform standard for floating point arithmetic
    - Before that, many idiosyncratic formats
  - Supported by all major CPUs
- **Driven by numerical concerns**
  - Nice standards for rounding, overflow, underflow
  - Hard to make fast in hardware
    - Numerical analysts predominated over hardware designers in defining standard

# Floating Point Representation

- Numerical Form:

$$(-1)^s M \cdot 2^E$$

- Sign bit **s** determines whether number is negative or positive
- Significand **M** normally a fractional value in range [1.0,2.0).
- Exponent **E** weights value by power of two

- Encoding

- MSB **S** is sign bit **s**
- exp field encodes **E** (but is not equal to E)
- frac field encodes **M** (but is not equal to M)



# Precision options

- Single precision: 32 bits



- Double precision: 64 bits



- The value encoded by a given bit representation can be divided into three different cases, depending on the value of **exp.**
- Case 1: Normalized Values
- Case 2: Denormalized Values
- Case 3: Special Values

# Rounding

- Rounding Modes (illustrate with \$ rounding)

|                            | \$1.40 | \$1.60 | \$1.50 | \$2.50 | -\$1.50 |
|----------------------------|--------|--------|--------|--------|---------|
| – Towards zero             | \$1    | \$1    | \$1    | \$2    | -\$1    |
| – Round down ( $-\infty$ ) | \$1    | \$1    | \$1    | \$2    | -\$2    |
| – Round up ( $+\infty$ )   | \$2    | \$2    | \$2    | \$3    | -\$1    |
| – Nearest Even (default)   | \$1    | \$2    | \$2    | \$2    | -\$2    |

# Our Coverage

- IA32
  - The traditional x86
  - For ???: RIP, Fall 2018
- x86-64
  - The standard
  - moat.cis.uab.edu
  - gcc hello.c
  - gcc -m64 hello.c
- Presentation
  - Book covers x86-64
  - So, we will cover x86-64

- **C, assembly, machine code**

## Definitions

- **Architecture:** (also ISA: instruction set architecture) The parts of a processor design that one needs to understand or write assembly/machine code.
  - Examples: instruction set specification, registers.
- **Microarchitecture:** Implementation of the architecture.
  - Examples: cache sizes and core frequency.
- Code Forms:
  - **Machine Code:** The byte-level programs that a processor executes
  - **Assembly Code:** A text representation of machine code
- Example ISAs:
  - Intel: x86, IA32, Itanium, x86-64
  - ARM: Used in almost all mobile phones

# Assembly/Machine Code View



## Programmer-Visible State

- **PC: Program counter**
  - Indicates the address of next instruction
  - Called "%rip" (x86-64)
- **Register file**
  - Heavily used program data
  - 16 named locations storing 64bit values
- **Condition codes**
  - Store status information about most recent arithmetic or logical operation
  - Used for conditional branching
- **Memory**
  - Byte addressable array
  - Code and user data
  - Stack to support procedures

# Turning C into Object Code

- Code in files **p1.c p2.c**
- Compile with command: **gcc -Og p1.c p2.c -o p**
  - Use basic optimizations (**-Og**) [New to recent versions of GCC]
  - Put resulting binary in file **p**



# x86-64 Integer Registers

|      |      |
|------|------|
| %rax | %eax |
|------|------|

|     |      |
|-----|------|
| %r8 | %r8d |
|-----|------|

|      |      |
|------|------|
| %rbx | %ebx |
|------|------|

|     |      |
|-----|------|
| %r9 | %r9d |
|-----|------|

|      |      |
|------|------|
| %rcx | %ecx |
|------|------|

|      |       |
|------|-------|
| %r10 | %r10d |
|------|-------|

|      |      |
|------|------|
| %rdx | %edx |
|------|------|

|      |       |
|------|-------|
| %r11 | %r11d |
|------|-------|

|      |      |
|------|------|
| %rsi | %esi |
|------|------|

|      |       |
|------|-------|
| %r12 | %r12d |
|------|-------|

|      |      |
|------|------|
| %rdi | %edi |
|------|------|

|      |       |
|------|-------|
| %r13 | %r13d |
|------|-------|

|      |      |
|------|------|
| %rsp | %esp |
|------|------|

|      |       |
|------|-------|
| %r14 | %r14d |
|------|-------|

|      |      |
|------|------|
| %rbp | %ebp |
|------|------|

|      |       |
|------|-------|
| %r15 | %r15d |
|------|-------|

- Can reference low-order 4 bytes (also low-order 1 & 2 bytes)

# Moving Data

- Moving Data

***movq Source, Dest:***

- Operand Types

– ***Immediate:*** Constant integer data

- Example: `$0x400`, `$-533`
- Like C constant, but prefixed with '\$'
- Encoded with 1, 2, or 4 bytes

– ***Register:*** One of 16 integer registers

- Example: `%rax`, `%r13`
- But `%rsp` reserved for special use
- Others have special uses for particular instructions

– ***Memory:*** 8 consecutive bytes of memory at address given by register

- Simplest example: (`%rax`)
- Various other “address modes”



# movq Operand Combinations

|      | Source     | Dest       | Src,Dest            | C Analog       |
|------|------------|------------|---------------------|----------------|
| movq | <i>Imm</i> | <i>Reg</i> | movq \$0x4, %rax    | temp = 0x4;    |
|      |            | <i>Mem</i> | movq \$-147, (%rax) | *p = -147;     |
|      | <i>Reg</i> | <i>Reg</i> | movq %rax, %rdx     | temp2 = temp1; |
|      | <i>Reg</i> | <i>Mem</i> | movq %rax, (%rdx)   | *p = temp;     |
|      | <i>Mem</i> | <i>Reg</i> | movq (%rax), %rdx   | temp = *p;     |

*Cannot do memory-memory transfer with a single instruction*

# Example of Simple Addressing Modes

```
void swap
    (long *xp, long *yp)
{
    long t0 = *xp;
    long t1 = *yp;
    *xp = t1;
    *yp = t0;
}
```

```
swap:
    movq    (%rdi), %rax
    movq    (%rsi), %rdx
    movq    %rdx, (%rdi)
    movq    %rax, (%rsi)
    ret
```

# Pushing and Popping Stack Data

- Stack plays a vital role in the handling of procedure calls.
- Stack = a data structure where values can be added or deleted → according to last in first out discipline
- Add data to stack → push
- Remove data from stack → pop

|      |             |                                   |
|------|-------------|-----------------------------------|
| Push | Source      | Push source onto stack            |
| Pop  | Destination | Pop top of stack into destination |

EXAMPLES:

pushq %rbp

Popq M[R[%rsp]]

- The stack pointer %rsp holds the address of the top stack element.
- Pushing a quad word value onto the stack involves first decrementing the stack pointer by 8 and then writing the value at the new top of stack address.
- `pushq %rbp` is equal to  
`subq &8, %rsp`  
`movq %rbp, (%rsp)`

# Address Computation Instruction

- **leaq** Source, Destination
    - Source is address mode expression
    - Set Destination to address denoted by expression
- leaq S, D    D ← &S    Load effective address**

- Uses
  - Computing addresses without a memory reference
    - E.g., translation of **p = &x[i];**
  - Computing arithmetic expressions of the form  $x + k*y$ 
    - $k = 1, 2, 4, \text{ or } 8$
- Example

```
long m12(long x)
{
    return x*12;
}
```

Converted to ASM by compiler:

```
leaq (%rdi,%rdi,2), %rax # t <- x+x*2
salq $2, %rax           # return t<<2
```

# More on LEAQ

- The destination operand must be a register

```
leaq 7(%rdx, %rdx, 4), %rax
```

→ set register %rax to  $5x + 7$

If %rax has the value x, and %rcx holds the value y

```
leaq 6(%rax), %rdx → 6 + x
```

```
leaq (%rax, %rcx), %rdx → x + y
```

# ADD

**ADDQ %rbx, %rax**

- adds %rbx to %rax, and overwrites the result in %rax

**ADDQ &8, %rsp**

adds 8 to the stack pointer %rsp, (incrementing)

# Some Arithmetic Operations

- Two Operand Instructions:

## Format    Computation

|       |          |                                            |
|-------|----------|--------------------------------------------|
| addq  | Src,Dest | Dest = Dest + Src                          |
| subq  | Src,Dest | Dest = Dest – Src                          |
| imulq | Src,Dest | Dest = Dest * Src                          |
| salq  | Src,Dest | Dest = Dest << Src <b>Also called shlq</b> |
| sarq  | Src,Dest | Dest = Dest >> Src <b>Arithmetic</b>       |
| shrq  | Src,Dest | Dest = Dest >> Src <b>Logical</b>          |
| xorq  | Src,Dest | Dest = Dest ^ Src                          |
| andq  | Src,Dest | Dest = Dest & Src                          |
| orq   | Src,Dest | Dest = Dest   Src                          |

# Processor State (x86-64, Partial)

- Information about currently executing program
  - Temporary data ( `%rax`, ... )
  - Location of runtime stack ( `%rsp` )
  - Location of current code control point ( `%rip`, ... )
  - Status of recent tests ( CF, ZF, SF, OF )

Current stack top

Registers

|                   |                   |
|-------------------|-------------------|
| <code>%rax</code> | <code>%r8</code>  |
| <code>%rbx</code> | <code>%r9</code>  |
| <code>%rcx</code> | <code>%r10</code> |
| <code>%rdx</code> | <code>%r11</code> |
| <code>%rsi</code> | <code>%r12</code> |
| <code>%rdi</code> | <code>%r13</code> |
| <code>%rsp</code> | <code>%r14</code> |
| <code>%rbp</code> | <code>%r15</code> |

`%rip`

Instruction pointer

CF

ZF

SF

OF

Condition codes

# Condition Codes (Implicit Setting)

- **Single bit registers**
  - CF Carry Flag (for unsigned) SF Sign Flag (for signed)
  - ZF Zero Flag OF Overflow Flag (for signed)

- **Implicitly set (think of it as side effect) by arithmetic operations**

Example: **addq Src,Dest**  $\leftrightarrow t = a+b$

**CF set** if carry out from most significant bit (unsigned overflow) (For ex: unsigned  $t < \text{unsigned } a$ )

**ZF set** if  $t == 0$

**SF set** if  $t < 0$  (as signed)

**OF set** if two's-complement (signed) overflow

$(a>0 \ \&\& \ b>0 \ \&\& \ t<0) \ || \ (a<0 \ \&\& \ b<0 \ \&\& \ t>=0)$

- **Not set by leaq instruction**

# Condition Codes (Explicit Setting: Compare)

- Explicit Setting by Compare Instruction
  - cmpq** Src2, Src1
  - cmpq b, a** like computing  $a - b$  without setting destination
  - CF set** if carry out from most significant bit (used for unsigned comparisons)
  - ZF set** if  $a == b$
  - SF set** if  $(a - b) < 0$  (as signed)
  - OF set** if two's-complement (signed) overflow  
 $(a > 0 \ \&\& \ b < 0 \ \&\& \ (a - b) < 0) \ \|\ (a < 0 \ \&\& \ b > 0 \ \&\& \ (a - b) > 0)$

# Condition Codes (Explicit Setting: Test)

- **Explicit Setting by Test instruction**
  - testq Src2, Src1**
    - **testq b, a** like computing **a&b** without setting destination
  - Sets condition codes based on value of Src1 & Src2
  - Useful to have one of the operands be a mask
- ZF set when **a&b == 0**
- SF set when **a&b < 0**

# Accessing the condition code

- Rather than reading the condition codes directly, there are three common ways of using the conditions codes;
  - Set a single byte 0 or 1 depending on the condition code
  - Conditionally jump to some other parts of the code
  - Conditionally transfer data

# Reading Condition Codes

- **SetX Instructions**
  - Set low-order byte of destination to 0 or 1 based on combinations of condition codes
  - Does not alter remaining 7 bytes

| SetX         | Condition                 | Description               |
|--------------|---------------------------|---------------------------|
| <b>sete</b>  | <b>ZF</b>                 | Equal / Zero              |
| <b>setne</b> | <b>~ZF</b>                | Not Equal / Not Zero      |
| <b>sets</b>  | <b>SF</b>                 | Negative                  |
| <b>setns</b> | <b>~SF</b>                | Nonnegative               |
| <b>setg</b>  | <b>~(SF^OF) &amp; ~ZF</b> | Greater (Signed)          |
| <b>setge</b> | <b>~(SF^OF)</b>           | Greater or Equal (Signed) |
| <b>setl</b>  | <b>(SF^OF)</b>            | Less (Signed)             |
| <b>setle</b> | <b>(SF^OF)   ZF</b>       | Less or Equal (Signed)    |
| <b>seta</b>  | <b>~CF &amp; ~ZF</b>      | Above (unsigned)          |
| <b>setb</b>  | <b>CF</b>                 | Below (unsigned)          |

# Reading Condition Codes (Cont.)

- **SetX Instructions:**
  - Set single byte based on combination of condition codes
- **One of addressable byte registers**
  - Does not alter remaining bytes
  - Typically use **movzbl** to finish job
    - 32-bit instructions also set upper 32 bits to 0

```
int gt (long x, long y)
{
    return x > y;
}
```

| Register | Use(s)            |
|----------|-------------------|
| %rdi     | Argument <b>x</b> |
| %rsi     | Argument <b>y</b> |
| %rax     | Return value      |

```
cmpq    %rsi, %rdi    # Compare x:y
setg    %al             # Set when >
movzbl  %al, %eax     # Zero rest of %rax
ret
```

# Conditional Branching: Jumping

- **jX Instructions**
  - Jump to different part of code depending on condition codes

| jX               | Condition                       | Description               |
|------------------|---------------------------------|---------------------------|
| <code>jmp</code> | <code>1</code>                  | Unconditional             |
| <code>je</code>  | <code>ZF</code>                 | Equal / Zero              |
| <code>jne</code> | <code>~ZF</code>                | Not Equal / Not Zero      |
| <code>js</code>  | <code>SF</code>                 | Negative                  |
| <code>jns</code> | <code>~SF</code>                | Nonnegative               |
| <code>jg</code>  | <code>~(SF^OF) &amp; ~ZF</code> | Greater (Signed)          |
| <code>jge</code> | <code>~(SF^OF)</code>           | Greater or Equal (Signed) |
| <code>jl</code>  | <code>(SF^OF)</code>            | Less (Signed)             |
| <code>jle</code> | <code>(SF^OF)   ZF</code>       | Less or Equal (Signed)    |
| <code>ja</code>  | <code>~CF &amp; ~ZF</code>      | Above (unsigned)          |
| <code>jb</code>  | <code>CF</code>                 | Below (unsigned)          |

# “Do-While” Loop Compilation

Goto Version

```
long pcount_goto
(unsigned long x) {
    long result = 0;
loop:
    result += x & 0x1;
    x >>= 1;
    if(x) goto loop;
    return result;
}
```

| Register | Use(s)            |
|----------|-------------------|
| %rdi     | Argument <b>x</b> |
| %rax     | <b>result</b>     |

```
        movl    $0, %eax      # result = 0
.L2:                           # loop:
        movq    %rdi, %rdx
        andl    $1, %edx      # t = x & 0x1
        addq    %rdx, %rax    # result += t
        shrq    %rdi          # x >>= 1
        jne     .L2          # if (x) goto loop
        rep; ret
```

# General “Do-While” Translation

C Code

```
do  
  Body  
  while (Test);
```

Goto Version

```
loop:  
  Body  
  if (Test)  
    goto loop
```

- Body:

```
{  
  Statement1;  
  Statement2;  
  ...  
  Statementn;  
}
```

# General “While” Translation #1

- “Jump-to-middle” translation
- Used with `-Og`

While version

```
while (Test)
  Body
```



Goto Version

```
goto test;
loop:
  Body
test:
  if (Test)
    goto loop;
done:
```

# While Loop Example #1

C Code

```
long pcount_while
(unsigned long x) {
    long result = 0;
    while (x) {
        result += x & 0x1;
        x >>= 1;
    }
    return result;
}
```

Jump to Middle

```
long pcount_goto_jtm
(unsigned long x) {
    long result = 0;
    goto test;
loop:
    result += x & 0x1;
    x >>= 1;
test:
    if(x) goto loop;
    return result;
}
```

- Compare to do-while version of function
- Initial goto starts loop at test

# “For” Loop → While Loop

For Version

```
for (Init; Test; Update)
```

*Body*



While Version

```
Init;
```

```
while (Test) {
```

*Body*

*Update*;

```
}
```

# For-While Conversion

Init

```
i = 0
```

Test

```
i < WSIZE
```

Update

```
i++
```

Body

```
{  
    unsigned bit =  
        (x >> i) & 0x1;  
    result += bit;  
}
```

```
long pcount_for_while  
(unsigned long x)  
{  
    size_t i;  
    long result = 0;  
    i = 0;  
    while (i < WSIZE)  
    {  
        unsigned bit =  
            (x >> i) & 0x1;  
        result += bit;  
        i++;  
    }  
    return result;  
}
```

# Mechanisms in Procedures

- **Passing control**
  - To beginning of procedure code
  - Back to return point
- **Passing data**
  - Procedure arguments
  - Return value
- **Memory management**
  - Allocate during procedure execution
  - Deallocate upon return
- **Mechanisms all implemented with machine instructions**
- **x86-64 implementation of a procedure uses only those mechanisms required**



# Stack Structure

## x86-64 Stack

- Region of memory managed with stack discipline
- Grows toward lower addresses
- Register **%rsp** contains lowest stack address
  - address of “top” element

Stack Pointer: **%rsp** →



# x86-64 Stack: Push

- **pushq Src**

- Fetch operand at Src
- Decrement **%rsp** by 8
- Write operand at address given by **%rsp**

Stack Pointer: **%rsp**

Stack “Bottom”



# x86-64 Stack: Pop

## ■ **popq Dest**

- Read value at address given by **%rsp**
- Increment **%rsp** by 8
- Store value at Dest (**must be register**)



# Passing control

## Code Examples

```
void multstore  
    (long x, long y, long *dest)  
{  
    long t = mult2(x, y);  
    *dest = t;  
}
```

```
000000000400540 <multstore>:  
400540: push    %rbx      # Save %rbx  
400541: mov     %rdx,%rbx   # Save dest  
400544: callq   400550 <mult2>   # mult2(x,y)  
400549: mov     %rax,(%rbx)  # Save at dest  
40054c: pop     %rbx      # Restore %rbx  
40054d: retq               # Return
```

```
long mult2  
    (long a, long b)  
{  
    long s = a * b;  
    return s;  
}
```

```
000000000400550 <mult2>:  
400550: mov     %rdi,%rax   # a  
400553: imul   %rsi,%rax   # a * b  
400557: retq               # Return
```

# Procedure Control Flow

- Use stack to support procedure call and return
- **Procedure call:** `call label`
  - Push return address on stack
  - Jump to label
- Return address:
  - Address of the next instruction right after call
  - Example from disassembly
- **Procedure return:** `ret`
  - Pop address from stack
  - Jump to address

# Control Flow Example #1

```
0000000000400540 <multstore>:  
.  
.  
400544: callq 400550 <mult2>  
400549: mov    %rax, (%rbx)  
.  
.
```

```
0000000000400550 <mult2>:  
400550: mov    %rdi, %rax  
.  
.  
400557: retq
```



## Control Flow Example #2

```
0000000000400540 <multstore>:  
.  
.  
400544: callq 400550 <mult2>  
400549: mov %rax, (%rbx) ←
```

```
0000000000400550 <mult2>:  
400550: mov %rdi,%rax ←  
. .  
400557: retq
```



## Control Flow Example #3

```
0000000000400540 <multstore>:
```

```
•  
•  
400544: callq 400550 <mult2>  
400549: mov    %rax, (%rbx) ←  
•  
•
```

0x130

0x128

0x120

0x118

0x400549

%rsp 0x118

%rip 0x400557

```
0000000000400550 <mult2>:
```

```
400550: mov    %rdi,%rax  
•  
•  
400557: retq ←
```



## Control Flow Example #4

```
0000000000400540 <multstore>:  
.  
.  
400544: callq 400550 <mult2>  
400549: mov %rax, (%rbx)  
.  
.
```

```
0000000000400550 <mult2>:  
400550: mov %rdi, %rax  
.  
.  
400557: retq
```



# Passing data : Procedure Data Flow

## Registers

- First 6 arguments



## Stack



- Return value



- Only allocate stack space when needed

# Data Flow Examples

```
void multstore
    (long x, long y, long *dest)
{
    long t = mult2(x, y);
    *dest = t;
}
```

```
0000000000400540 <multstore>:
# x in %rdi, y in %rsi, dest in %rdx
...
400541: mov    %rdx,%rbx    # Save dest
400544: callq  400550 <mult2>  # mult2(x,y)
# t in %rax
400549: mov    %rax,(%rbx)   # Save at dest
...
```

```
long mult2
    (long a, long b)
{
    long s = a * b;
    return s;
}
```

```
0000000000400550 <mult2>:
# a in %rdi, b in %rsi
400550: mov    %rdi,%rax    # a
400553: imul   %rsi,%rax    # a * b
# s in %rax
400557: retq   # Return
```

# Managing local data : Stack-Based Languages

- **Languages that support recursion**
  - e.g., C, Pascal, Java
  - Code must be “Reentrant”
    - Multiple simultaneous instantiations of single procedure
  - Need some place to store state of each instantiation
    - Arguments
    - Local variables
    - Return pointer
- **Stack discipline**
  - State for given procedure needed for limited time
    - From when called to when return
  - Callee returns before caller does
- **Stack allocated in Frames**
  - state for single procedure instantiation

# Stack Frames

- **Contents**
  - Return information
  - Local storage (if needed)
  - Temporary space (if needed)
- **Management**
  - Space allocated when enter procedure
    - “Set-up” code
    - Includes push by **call** instruction
  - Deallocated when return
    - “Finish” code
    - Includes pop by **ret** instruction



# Example

```
yoo (...)  
{  
    •  
    •  
    who () ;  
    •  
    •  
}
```



# Example



# Example



# Example



# Stack



# Example



# Example



# Example



Stack



# Example

```
yoo( )  
{    who( ... )  
{  
    . . .  
    amI();  
    . . .  
    amI();  
    . . .  
}
```



Stack



# Example



# Example

```
yoo( )  
{   who (...)  
{  
    . . .  
    amI ();  
    . . .  
    amI ();  
    . . .  
}
```



# Example

```
yoo (...)  
{  
    •  
    •  
    who () ;  
    •  
    •  
}
```



# x86-64/Linux Stack Frame

- **Current Stack Frame (“Top” to Bottom)**

- “Argument build:”  
Parameters for function about to call
- Local variables  
If can’t keep in registers
- Saved register context
- Old frame pointer (optional)



- **Caller Stack Frame**

- Return address
  - Pushed by **call** instruction
- Arguments for this call

# Illustration of Recursion

## Recursive Function

```
/* Recursive popcount */
long pcount_r(unsigned long x) {
    if (x == 0)
        return 0;
    else
        return (x & 1)
            + pcount_r(x >> 1);
}
```

```
pcount_r:
    movl    $0, %eax
    testq   %rdi, %rdi
    je      .L6
    pushq   %rbx
    movq   %rdi, %rbx
    andl   $1, %ebx
    shrq   %rdi # (by 1)
    call   pcount_r
    addq   %rbx, %rax
    popq   %rbx
.L6:
    rep; ret
```

# Recursive Function Terminal Case

```
/* Recursive popcount */
long pcount_r(unsigned long x) {
    if (x == 0)
        return 0;
    else
        return (x & 1)
            + pcount_r(x >> 1);
}
```

pcount\_r:

```
    movl    $0, %eax
    testq   %rdi, %rdi
    je      .L6
    pushq   %rbx
    movq   %rdi, %rbx
    andl   $1, %ebx
    shrq   %rdi # (by 1)
    call   pcount_r
    addq   %rbx, %rax
    popq   %rbx
```

.L6:

rep; ret

| Register | Use(s)       | Type         |
|----------|--------------|--------------|
| %rdi     | x            | Argument     |
| %rax     | Return value | Return value |

# Observations About Recursion

- **Handled Without Special Consideration**
  - Stack frames mean that each function call has private storage
    - Saved registers & local variables
    - Saved return pointer
  - Register saving conventions prevent one function call from corrupting another's data
    - Unless the C code explicitly does so (e.g., buffer overflow in Lecture 9)
  - Stack discipline follows call / return pattern
    - If P calls Q, then Q returns before P
    - Last-In, First-Out
- **Also works for mutual recursion**
  - P calls Q; Q calls P

# Stack Structure

## x86-64 Stack

- Region of memory managed with stack discipline
- Grows toward lower addresses
- Register **%rsp** contains lowest stack address
  - address of “top” element

Stack Pointer: **%rsp** →



# x86-64 Stack: Push

- **pushq Src**

- Fetch operand at Src
- Decrement **%rsp** by 8
- Write operand at address given by **%rsp**

Stack Pointer: **%rsp**

Stack “Bottom”



# x86-64 Stack: Pop

## ■ **popq Dest**

- Read value at address given by **%rsp**
- Increment **%rsp** by 8
- Store value at Dest (**must be register**)



# Procedure Control Flow

- Use stack to support procedure call and return
- **Procedure call:** `call label`
  - Push return address on stack
  - Jump to label
- Return address:
  - Address of the next instruction right after call
  - Example from disassembly
- **Procedure return:** `ret`
  - Pop address from stack
  - Jump to address

# Array Allocation

- Basic Principle

$T \mathbf{A}[L];$

- Array of data type  $T$  and length  $L$
- Contiguously allocated region of  $L * \text{sizeof}(T)$  bytes in memory

```
char string[12];
```



```
int val[5];
```



```
double a[3];
```



```
char *p[3];
```



# recall pointers

- The unary operators `&` and `*` allow generation and dereferencing of pointers
- An expression `Expr` → denoting some object;
  - `&Expr` is a pointer giving the address of the object
- An expression `AExpr` denoting an address;
  - `*AExpr` gives the value at that address
- Thus;
  - $\text{Expr} = * \& \text{AExpr}$

# Array Access

- Basic Principle

$T \mathbf{A}[L]$  ;

- Array of data type  $T$  and length  $L$
- Identifier **A** can be used as a pointer to array element 0: Type  $T^*$



- Reference Type      Value

|                          |                    |          |
|--------------------------|--------------------|----------|
| <code>val[4]</code>      | <code>int</code>   | 3        |
| <code>val</code>         | <code>int *</code> | $x$      |
| <code>val+1</code>       | <code>int *</code> | $x + 4$  |
| <code>&amp;val[2]</code> | <code>int *</code> | $x + 8$  |
| <code>val[5]</code>      | <code>int</code>   | ??       |
| <code>*(val+1)</code>    | <code>int</code>   | 5        |
| <code>val + i</code>     | <code>int *</code> | $x + 4i$ |

- The memory referencing instruction of x86\_64 are designed to simplify array access.
- For ex: suppose E is an array of values of type int and we wish to evaluate E[i], where the **address of E** is stored in register **%rdx** and **i** is stored in register **%rcx**. Then the instruction is:
  - `movl (%rdx, %rcx, 4), %eax`
- which performs the address computation  $xE+4i$ , read that memory location, and copy the result to register **%eax**

# Array Example

```
#define ZLEN 5
typedef int zip_dig[ZLEN];

zip_dig cmu = { 1, 5, 2, 1, 3 };
zip_dig mit = { 0, 2, 1, 3, 9 };
zip_dig ucb = { 9, 4, 7, 2, 0 };
```

zip\_dig cmu;



zip\_dig mit;



zip\_dig ucb;



- Declaration “`zip_dig cmu`” equivalent to “`int cmu[5]`”
- Example arrays were allocated in successive 20 byte blocks
  - Not guaranteed to happen in general

# Array Accessing Example

```
zip_dig cmu;
```



```
int get_digit  
    (zip_dig z, int digit)  
{  
    return z[digit];  
}
```

IA32

```
# %rdi = z  
# %rsi = digit  
movl (%rdi,%rsi,4), %eax # z[digit]
```

- Register `%rdi` contains starting address of array
- Register `%rsi` contains array index
- Desired digit at  $\%rdi + 4 * \%rsi$
- Use memory reference `(%rdi,%rsi,4)`

# Array Loop Example

```
void zincr(zip_dig z) {  
    size_t i;  
    for (i = 0; i < ZLEN; i++)  
        z[i]++;  
}
```

```
# %rdi = z  
movl    $0, %eax          # i = 0  
jmp     .L3               # goto middle  
.L4:  
    addl    $1, (%rdi,%rax,4) # z[i]++  
    addq    $1, %rax          # i++  
.L3:  
    cmpq    $4, %rax          # i:4  
    jbe     .L4               # if <=, goto loop  
rep; ret
```

# Multidimensional Arrays

|  |  |  |
|--|--|--|
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |
|  |  |  |

```
int A[5] [3];
```

is equivalent to the declaration;

```
typedef int row3_t;  
row3_t A[5];
```

# Storing in the memory

- The array elements are ordered in memory in ***row-major*** order;
  - all elements of row 0 ( $A[0][i]$ )
  - followed by all elements of row 1 ( $A[1][i]$ )
  - ....
  - ...
- To access elements of multidimensional arrays, the compiler generates code to compute the offset of the desired element and then uses one of the mov instructions with the start of the array as the base address.

# Row Major – Column Major

Row-major order



Column-major order



# Row Major | | Column Major ???

- **Row major:** C, C++, Objective-C ,Pascal  
....numpy....
- **Column major:** Fortran, MATLAB, GNU Octave, Julia, Scilab....
- Java doesn't have multi-dimensional arrays. It has arrays of arrays....

# Multidimensional (Nested) Arrays

- Declaration

$T \ A[R][C];$

- 2D array of data type  $T$
- $R$  rows,  $C$  columns
- Type  $T$  element requires  $K$  bytes

- Array Size

- $R * C * K$  bytes

- Arrangement

- Row-Major Ordering



```
int A[R][C];
```



# Nested Array Example

```
#define PCOUNT 4
zip_dig pgh[PCOUNT] =
{{1, 5, 2, 0, 6},
 {1, 5, 2, 1, 3 },
 {1, 5, 2, 1, 7 },
 {1, 5, 2, 2, 1 }};
```



- “`zip_dig pgh [ 4 ]`” equivalent to “`int pgh [ 4 ] [ 5 ]`”
  - Variable **pgh**: array of 4 elements, allocated contiguously
  - Each element is an array of 5 **int**’s, allocated contiguously
- “Row-Major” ordering of all elements in memory

# Nested Array Row Access

- Row Vectors
  - $\mathbf{A}[i]$  is array of  $C$  elements
  - Each element of type  $T$  requires  $K$  bytes
  - Starting address  $\mathbf{A} + i * (C * K)$

```
int A[R][C];
```



# C - Structures

Arrays allow to define type of variables that can hold several data items of the same kind. Similarly **structure** is another user defined data type available in C that allows to combine data items of different kinds.

Structures are used to represent a record. Suppose you want to keep track of your books in a library. You might want to track the following attributes about each book –

- Title
- Author
- Subject
- Book ID

# Defining a Structure

To define a structure, you must use the **struct** statement. The struct statement defines a new data type, with more than one member. The format of the struct statement is as follows –

```
struct [structure tag] {  
    member definition;  
    member definition;  
    ...  
    member definition;  
} [one or more structure variables];
```

The **structure tag** is optional and each member definition is a normal variable definition, such as int i; or float f; or any other valid variable definition. At the end of the structure's definition, before the final semicolon, you can specify one or more structure variables but it is optional. Here is the way you would declare the Book structure –

```
struct Books {  
    char title[50];  
    char author[50];  
    char subject[100];  
    int book_id;  
} book;
```

# Structure Representation

```
struct rec {  
    int a[4];  
    size_t i;  
    struct rec *next;  
};
```



- Structure represented as block of memory
  - **Big enough to hold all of the fields**
- Fields ordered according to declaration
  - **Even if another ordering could yield a more compact representation**
- Compiler determines overall size + positions of fields
  - **Machine-level program has no understanding of the structures in the source code**

# Generating Pointer to Structure Member

```
struct rec {  
    int a[4];  
    size_t i;  
    struct rec *next;  
};
```



- Generating Pointer to Array Element

- Offset of each structure member determined at compile time
- Compute as **`r + 4*idx`**

```
int *get_ap  
(struct rec *r, size_t idx)  
{  
    return &r->a[idx];  
}
```

```
# r in %rdi, idx in %rsi  
leaq (%rdi,%rsi,4), %rax  
ret
```

# Following Linked List

- C Code

```
void set_val
    (struct rec *r, int val)
{
    while (r) {
        int i = r->i;
        r->a[i] = val;
        r = r->next;
    }
}
```

```
struct rec {
    int a[4];
    int i;
    struct rec *next;
};
```



| Register | Value      |
|----------|------------|
| %rdi     | <b>r</b>   |
| %rsi     | <b>val</b> |

```
.L11:                                # loop:
    movslq  16(%rdi), %rax      #   i = M[r+16]
    movl    %esi, (%rdi,%rax,4) #   M[r+4*i] = val
    movq    24(%rdi), %rdi      #   r = M[r+24]
    testq   %rdi, %rdi         #   Test r
    jne     .L11                #   if !=0 goto loop
```

# Structures & Alignment

- Unaligned Data



```
struct S1 {  
    char c;  
    int i[2];  
    double v;  
} *p;
```

- Aligned Data

- Primitive data type requires K bytes
- Address must be multiple of K



# Alignment Principles

- Aligned Data
  - Primitive data type requires K bytes
  - Address must be multiple of K
  - Required on some machines; advised on x86-64
- Motivation for Aligning Data
  - Memory accessed by (aligned) chunks of 4 or 8 bytes (system dependent)
    - Inefficient to load or store datum that spans quad word boundaries
    - Virtual memory trickier when datum spans 2 pages
- Compiler
  - Inserts gaps in structure to ensure correct alignment of fields

# Floating Point: Background

- History
  - x87 FP
    - Legacy, very ugly
  - SSE FP
    - Special case use of vector instructions
  - AVX FP
    - Newest version
    - Similar to SSE
    - Documented in book

# Programming with SSE3

## XMM Registers

- 16 total, each 16 bytes
- 16 single-byte integers



- 8 16-bit integers



- 4 32-bit integers



- 4 single-precision floats



- 2 double-precision floats



- 1 single-precision float



- 1 double-precision float



# Scalar & SIMD Operations

## ■ Scalar Operations: Single Precision      `addss %xmm0 , %xmm1`



## ■ SIMD Operations: Single Precision      `addps %xmm0 , %xmm1`



## ■ Scalar Operations: Double Precision      `addsd %xmm0 , %xmm1`



# x86-64 Linux Memory Layout

*not drawn to scale*

- Stack
  - Runtime stack (8MB limit)
  - E. g., local variables
- Heap
  - Dynamically allocated as needed
  - When call `malloc()`, `calloc()`, `new()`
- Data
  - Statically allocated data
  - E.g., global vars, static vars, string constants
- Text / Shared Libraries
  - Executable machine instructions
  - Read-only



# Recall: Memory Referencing Bug Example

```
typedef struct {
    int a[2];
    double d;
} struct_t;

double fun(int i) {
    volatile struct_t s;
    s.d = 3.14;
    s.a[i] = 1073741824; /* Possibly out of bounds */
    return s.d;
}
```

|        |                      |
|--------|----------------------|
| fun(0) | → 3.14               |
| fun(1) | → 3.14               |
| fun(2) | → 3.1399998664856    |
| fun(3) | → 2.00000061035156   |
| fun(4) | → 3.14               |
| fun(6) | → Segmentation fault |

- Result is system specific

# Memory Referencing Bug Example

```
typedef struct {  
    int a[2];  
    double d;  
} struct_t;
```

|        |                      |
|--------|----------------------|
| fun(0) | → 3.14               |
| fun(1) | → 3.14               |
| fun(2) | → 3.1399998664856    |
| fun(3) | → 2.00000061035156   |
| fun(4) | → 3.14               |
| fun(6) | → Segmentation fault |

Explanation:



# Such problems are a BIG deal

- Generally called a “buffer overflow”
  - when exceeding the memory size allocated for an array
- Why a big deal?
  - It’s the #1 technical cause of security vulnerabilities
    - #1 overall cause is social engineering / user ignorance
- Most common form
  - Unchecked lengths on string inputs
  - Particularly for bounded character arrays on the stack
    - sometimes referred to as stack smashing

# String Library Code

- Implementation of Unix function `gets()`

```
/* Get string from stdin */
char *gets(char *dest)
{
    int c = getchar();
    char *p = dest;
    while (c != EOF && c != '\n') {
        *p++ = c;
        c = getchar();
    }
    *p = '\0';
    return dest;
}
```

- No way to specify limit on number of characters to read
- Similar problems with other library functions
  - **strcpy, strcat**: Copy strings of arbitrary length
  - **scanf, fscanf, sscanf**, when given %s conversion specification

# Vulnerable Buffer Code

```
/* Echo Line */
void echo()
{
    char buf[4]; /* Way too small! */
    gets(buf);
    puts(buf);
}
```

← btw, how big  
is big enough?

```
void call_echo() {
    echo();
}
```

```
unix>./bufdemo-nsp
Type a string:012345678901234567890123
012345678901234567890123
```

```
unix>./bufdemo-nsp
Type a string:0123456789012345678901234
Segmentation Fault
```

# Buffer Overflow Disassembly

echo:

```
00000000004006cf <echo>:  
 4006cf: 48 83 ec 18          sub    $0x18,%rsp  
 4006d3: 48 89 e7          mov    %rsp,%rdi  
 4006d6: e8 a5 ff ff ff      callq  400680 <gets>  
 4006db: 48 89 e7          mov    %rsp,%rdi  
 4006de: e8 3d fe ff ff      callq  400520 <puts@plt>  
 4006e3: 48 83 c4 18          add    $0x18,%rsp  
 4006e7: c3                  retq
```

call\_echo:

```
4006e8: 48 83 ec 08          sub    $0x8,%rsp  
 4006ec: b8 00 00 00 00      mov    $0x0,%eax  
 4006f1: e8 d9 ff ff ff      callq  4006cf <echo>  
 4006f6: 48 83 c4 08          add    $0x8,%rsp  
 4006fa: c3                  retq
```

# Buffer Overflow Stack

*Before call to gets*



```
/* Echo Line */
void echo()
{
    char buf[4]; /* Way too small! */
    gets(buf);
    puts(buf);
}
```

```
echo:
    subq $24, %rsp
    movq %rsp, %rdi
    call gets
    . . .
```

# Buffer Overflow Stack Example

*Before call to gets*



```
void echo ()  
{  
    char buf[4];  
    gets(buf);  
    . . .  
}
```

```
echo:  
    subq $24, %rsp  
    movq %rsp, %rdi  
    call gets  
    . . .
```

call\_echo:

```
. . .  
4006f1: callq 4006cf <echo>  
4006f6: add    $0x8,%rsp  
. . .
```

# Buffer Overflow Stack Example #1

*After call to gets*

| Stack Frame<br>for call_echo |    |    |    |
|------------------------------|----|----|----|
| 00                           | 00 | 00 | 00 |
| 00                           | 40 | 06 | f6 |
| 00                           | 32 | 31 | 30 |
| 39                           | 38 | 37 | 36 |
| 35                           | 34 | 33 | 32 |
| 31                           | 30 | 39 | 38 |
| 37                           | 36 | 35 | 34 |
| 33                           | 32 | 31 | 30 |

```
void echo()
{
    char buf[4];
    gets(buf);
    ...
}
```

```
echo:
    subq $24, %rsp
    movq %rsp, %rdi
    call gets
    ...

```

call\_echo:

```
...
4006f1: callq 4006cf <echo>
4006f6: add    $0x8,%rsp
...
```

buf ← %rsp

```
unix>./bufdemo-nsp
Type a string:01234567890123456789012
01234567890123456789012
```

Overflowed buffer, but did not corrupt state

# Buffer Overflow Stack Example #2

*After call to gets*

| Stack Frame<br>for call_echo |    |    |    |
|------------------------------|----|----|----|
| 00                           | 00 | 00 | 00 |
| 00                           | 40 | 00 | 34 |
| 33                           | 32 | 31 | 30 |
| 39                           | 38 | 37 | 36 |
| 35                           | 34 | 33 | 32 |
| 31                           | 30 | 39 | 38 |
| 37                           | 36 | 35 | 34 |
| 33                           | 32 | 31 | 30 |

```
void echo()
{
    char buf[4];
    gets(buf);
    ...
}
```

```
echo:
    subq $24, %rsp
    movq %rsp, %rdi
    call gets
    ...

```

call\_echo:

```
...
4006f1: callq 4006cf <echo>
4006f6: add    $0x8,%rsp
...

```

buf ← %rsp

```
unix>./bufdemo-nsp
Type a string:0123456789012345678901234
Segmentation Fault
```

Overflowed buffer and corrupted return pointer

# Buffer Overflow Stack Example #3

*After call to gets*

| Stack Frame<br>for call_echo |    |    |    |
|------------------------------|----|----|----|
| 00                           | 00 | 00 | 00 |
| 00                           | 40 | 06 | 00 |
| 33                           | 32 | 31 | 30 |
| 39                           | 38 | 37 | 36 |
| 35                           | 34 | 33 | 32 |
| 31                           | 30 | 39 | 38 |
| 37                           | 36 | 35 | 34 |
| 33                           | 32 | 31 | 30 |

```
void echo()
{
    char buf[4];
    gets(buf);
    ...
}
```

```
echo:
    subq $24, %rsp
    movq %rsp, %rdi
    call gets
    ...

```

call\_echo:

```
...
4006f1: callq 4006cf <echo>
4006f6: add    $0x8,%rsp
...
```

buf ← %rsp

```
unix>./bufdemo-nsp
Type a string:012345678901234567890123
012345678901234567890123
```

Overflowed buffer, corrupted return pointer, but program seems to work!

# Buffer Overflow Stack Example #3 Explained

*After call to gets*

| Stack Frame<br>for call_echo |    |    |    |
|------------------------------|----|----|----|
| 00                           | 00 | 00 | 00 |
| 00                           | 40 | 06 | 00 |
| 33                           | 32 | 31 | 30 |
| 39                           | 38 | 37 | 36 |
| 35                           | 34 | 33 | 32 |
| 31                           | 30 | 39 | 38 |
| 37                           | 36 | 35 | 34 |
| 33                           | 32 | 31 | 30 |

register\_tm\_clones:

```
...  
400600:    mov      %rsp, %rbp  
400603:    mov      %rax, %rdx  
400606:    shr     $0x3f, %rdx  
40060a:    add      %rdx, %rax  
40060d:    sar      %rax  
400610:    jne     400614  
400612:    pop     %rbp  
400613:    retq
```

buf ← %rsp

“Returns” to unrelated code

Lots of things happen, without modifying critical state

Eventually executes `retq` back to `main`

# Code Injection Attacks

```
void P() {  
    Q();  
    ...  
}
```

A  
return address

```
int Q() {  
    char buf[64];  
    gets(buf);  
    ...  
    return ...;  
}
```



- Input string contains byte representation of executable code
- Overwrite return address A with address of buffer B
- When Q executes `ret`, will jump to exploit code

# Exploits Based on Buffer Overflows

- *Buffer overflow bugs can allow remote machines to execute arbitrary code on victim machines*
- Distressingly common in real programs
  - Programmers keep making the same mistakes ☹
  - Recent measures make these attacks much more difficult
- Examples across the decades
  - Original “Internet worm” (1988)
  - “IM wars” (1999)
  - Twilight hack on Wii (2000s)
  - ... and many, many more

# Example: the original Internet worm (1988)

- Exploited a few vulnerabilities to spread
  - Early versions of the finger server (fingerd) used **gets()** to read the argument sent by the client:
    - **finger unan@uab.edu**
  - Worm attacked fingerd server by sending phony argument:
    - **finger "exploit-code padding new-return-address"**
    - exploit code: executed a root shell on the victim machine with a direct TCP connection to the attacker.
- Once on a machine, scanned for other machines to attack
  - invaded ~6000 computers in hours (10% of the Internet ☺)
    - see June 1989 article in *Comm. of the ACM*
  - the young author of the worm was prosecuted...
  - and CERT was formed... still homed at CMU

# Aside: Worms and Viruses

- Worm: A program that
  - Can run by itself
  - Can propagate a fully working version of itself to other computers
- Virus: Code that
  - Adds itself to other programs
  - Does not run independently
- Both are (usually) designed to spread among computers and to wreak havoc

# **OK, what to do about buffer overflow attacks**

- Avoid overflow vulnerabilities
- Employ system-level protections
- Have compiler use “stack canaries”
- Lets talk about each...

# 1. Avoid Overflow Vulnerabilities in Code (!)

```
/* Echo Line */
void echo()
{
    char buf[4]; /* Way too small! */
    fgets(buf, 4, stdin);
    puts(buf);
}
```

- For example, use library routines that limit string lengths
  - **fgets** instead of **gets**
  - **strncpy** instead of **strcpy**
  - Don't use **scanf** with **%s** conversion specification
    - Use **fgets** to read the string
    - Or use **%ns** where **n** is a suitable integer

## 2. System-Level Protections can help

- Nonexecutable code segments
  - In traditional x86, can mark region of memory as either “read-only” or “writeable”
    - Can execute anything readable
  - X86-64 added explicit “execute” permission
  - Stack marked as non-executable



Any attempt to execute this code will fail

# 3. Stack Canaries can help

- Idea
  - Place special value (“canary”) on stack just beyond buffer
  - Check for corruption before exiting function
- GCC Implementation
  - **-fstack-protector**
  - Now the default (disabled earlier)

```
unix>./bufdemo-sp
Type a string:0123456
0123456
```

```
unix>./bufdemo-sp
Type a string:01234567
*** stack smashing detected ***
```

# Protected Buffer Disassembly

echo:

```
40072f: sub    $0x18,%rsp
400733: mov    %fs:0x28,%rax
40073c: mov    %rax,0x8(%rsp)
400741: xor    %eax,%eax
400743: mov    %rsp,%rdi
400746: callq  4006e0 <gets>
40074b: mov    %rsp,%rdi
40074e: callq  400570 <puts@plt>
400753: mov    0x8(%rsp),%rax
400758: xor    %fs:0x28,%rax
400761: je     400768 <echo+0x39>
400763: callq  400580 <__stack_chk_fail@plt>
400768: add    $0x18,%rsp
40076c: retq
```

# Setting Up Canary

*Before call to gets*



```
/* Echo Line */
void echo()
{
    char buf[4]; /* Way too small! */
    gets(buf);
    puts(buf);
}
```

```
echo:
. . .
movq    %fs:40, %rax    # Get canary
movq    %rax, 8(%rsp)  # Place on stack
xorl    %eax, %eax    # Erase canary
. . .
```

# Checking Canary

After call to gets

|                              |
|------------------------------|
| Stack Frame<br>for call_echo |
| Return Address<br>(8 bytes)  |
| Canary<br>(8 bytes)          |
| 00 36 35 34<br>33 32 31 30   |

```
/* Echo Line */
void echo()
{
    char buf[4]; /* Way too small! */
    gets(buf);
    puts(buf);
}
```

Input: 0123456

buf ← %rsp

```
echo:
    ...
    movq    8(%rsp), %rax      # Retrieve from stack
    xorq    %fs:40, %rax      # Compare to canary
    je     .L6                  # If same, OK
    call   __stack_chk_fail    # FAIL
.L6: ...
```

# Performance Realities

- ***There's more to performance than asymptotic complexity***
- **Constant factors matter too!**
  - You will learn in CS332, the performance range depending on how code is written
  - Must optimize at multiple levels:
    - algorithm, data representations, procedures, and loops
- Must understand system to optimize performance
  - How programs are compiled and executed
  - How modern processors + memory systems operate
  - How to measure program performance and identify bottlenecks
  - How to improve performance without destroying code modularity and generality

# Optimizing Compilers

- Provide efficient mapping of program to machine
  - register allocation
  - code selection and ordering (scheduling)
  - dead code elimination
  - eliminating minor inefficiencies
- Don't (usually) improve asymptotic efficiency
  - up to programmer to select best overall algorithm
  - big-O savings are (often) more important than constant factors
    - but constant factors also matter
- Have difficulty overcoming “optimization blockers”
  - potential memory aliasing
  - potential procedure side-effects

# Limitations of Optimizing Compilers

- Operate under fundamental constraint
  - Must not cause any change in program behavior
    - Except, possibly when program making use of nonstandard language features
  - Often prevents it from making optimizations that would only affect behavior under pathological conditions.
- Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
  - e.g., Data ranges may be more limited than variable types suggest
- Most analysis is performed only within procedures
  - Whole-program analysis is too expensive in most cases
  - Newer versions of GCC do interprocedural analysis within individual files
    - But, not between code in different files
- Most analysis is based only on *static* information
  - Compiler has difficulty anticipating run-time inputs
- **When in doubt, the compiler must be conservative**

# Generally Useful Optimizations

- Optimizations that you or the compiler should do regardless of processor / compiler
- Code Motion
  - Reduce frequency with which computation performed
    - If it will always produce same result
    - Especially moving code out of loop

```
void set_row(double *a, double *b,
    long i, long n)
{
    long j;
    for (j = 0; j < n; j++)
        a[n*i+j] = b[j];
}
```



```
long j;
int ni = n*i;
for (j = 0; j < n; j++)
    a[ni+j] = b[j];
```

# Compiler-Generated Code Motion (-O1)

```
void set_row(double *a, double *b,
    long i, long n)
{
    long j;
    for (j = 0; j < n; j++)
        a[n*i+j] = b[j];
}
```

```
long j;
long ni = n*i;
double *rowp = a+ni;
for (j = 0; j < n; j++)
    *rowp++ = b[j];
```

```
set_row:
    testq    %rcx, %rcx          # Test n
    jle .L1           # If 0, goto done
    imulq    %rcx, %rdx          # ni = n*i
    leaq (%rdi,%rdx,8), %rdx   # rowp = A + ni*8
    movl $0, %eax          # j = 0
.L3:                 # loop:
    movsd    (%rsi,%rax,8), %xmm0      # t = b[j]
    movsd    %xmm0, (%rdx,%rax,8)       # M[A+ni*8 + j*8] = t
    addq $1, %rax          # j++
    cmpq %rcx, %rax          # j:n
    jne .L3               # if !=, goto loop
.L1:                 # done:
    rep ; ret
```

# Reduction in Strength

- Replace costly operation with simpler one
- Shift, add instead of multiply or divide

$16 \times x \rightarrow x \ll 4$

- Utility machine dependent
- Depends on cost of multiply or divide instruction
  - On Intel Nehalem, integer multiply requires 3 CPU cycles

- Recognize sequence of products

```
for (i = 0; i < n; i++) {  
    int ni = n*i;  
    for (j = 0; j < n; j++)  
        a[ni + j] = b[j];  
}
```



```
int ni = 0;  
for (i = 0; i < n; i++) {  
    for (j = 0; j < n; j++)  
        a[ni + j] = b[j];  
    ni += n;  
}
```

# Share Common Subexpressions

- Reuse portions of expressions
- GCC will do this with -O1

```
/* Sum neighbors of i,j */  
up = val[(i-1)*n + j];  
down = val[(i+1)*n + j];  
left = val[i*n + j-1];  
right = val[i*n + j+1];  
sum = up + down + left + right;
```

```
long inj = i*n + j;  
up = val[inj - n];  
down = val[inj + n];  
left = val[inj - 1];  
right = val[inj + 1];  
sum = up + down + left + right;
```

3 multiplications:  $i*n$ ,  $(i-1)*n$ ,  $(i+1)*n$

```
leaq    1(%rsi), %rax # i+1  
leaq    -1(%rsi), %r8 # i-1  
imulq  %rcx, %rsi   # i*n  
imulq  %rcx, %rax   # (i+1)*n  
imulq  %rcx, %r8     # (i-1)*n  
addq   %rdx, %rsi   # i*n+j  
addq   %rdx, %rax   # (i+1)*n+j  
addq   %rdx, %r8     # (i-1)*n+j
```

1 multiplication:  $i*n$

```
imulq  %rcx, %rsi # i*n  
addq %rdx, %rsi   # i*n+j  
movq %rsi, %rax   # i*n+j  
subq %rcx, %rax   # i*n+j-n  
leaq (%rsi,%rcx), %rcx # i*n+j+n
```

# Optimization Blocker #1: Procedure Calls

- Procedure to Convert String to Lower Case

```
void lower(char *s)
{
    size_t i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
```

# Lower Case Conversion Performance

- Time quadruples when double string length
- Quadratic performance



# Convert Loop To Goto Form

```
void lower(char *s)
{
    size_t i = 0;
    if (i >= strlen(s))
        goto done;
loop:
    if (s[i] >= 'A' && s[i] <= 'Z')
        s[i] -= ('A' - 'a');
    i++;
    if (i < strlen(s))
        goto loop;
done:
}
```

- `strlen` executed every iteration

# Calling Strlen

```
/* My version of strlen */
size_t strlen(const char *s)
{
    size_t length = 0;
    while (*s != '\0') {
        s++;
        length++;
    }
    return length;
}
```

- Strlen performance
  - Only way to determine length of string is to scan its entire length, looking for null character.
- Overall performance, string of length N
  - N calls to strlen
  - Require times N, N-1, N-2, ..., 1
  - Overall O(N<sup>2</sup>) performance

# Improving Performance

```
void lower(char *s)
{
    size_t i;
    size_t len = strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}
```

- Move call to `strlen` outside of loop
- Since result does not change from one iteration to another
- Form of code motion

# Lower Case Conversion Performance

- Time doubles when double string length
- Linear performance of lower2



# Optimization Blocker: Procedure Calls

- ***Why couldn't compiler move strlen out of inner loop?***
  - Procedure may have side effects
    - Alters global state each time called
  - Function may not return same value for given arguments
    - Depends on other parts of global state
    - Procedure `lower` could interact with `strlen`
- **Warning:**
  - Compiler treats procedure call as a black box
  - Weak optimizations near them
- **Remedies:**
  - Use of inline functions
    - GCC does this with `-O1`
      - Within single file
  - Do your own code motion

```
size_t lencnt = 0;
size_t strlen(const char *s)
{
    size_t length = 0;
    while (*s != '\0') {
        s++; length++;
    }
    lencnt += length;
    return length;
}
```

# Memory Matters

```
/* Sum rows is of n X n matrix a
   and store in vector b */
void sum_rows1(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}
```

```
# sum_rows1 inner loop
.L4:
    movsd    (%rsi,%rax,8), %xmm0    # FP load
    addsd    (%rdi), %xmm0          # FP add
    movsd    %xmm0, (%rsi,%rax,8)    # FP store
    addq    $8, %rdi
    cmpq    %rcx, %rdi
    jne     .L4
```

- Code updates `b[i]` on every iteration
- Why couldn't compiler optimize this away?

# Memory Aliasing

```
/* Sum rows is of n X n matrix a  
   and store in vector b */  
  
void sum_rows1(double *a, double *b, long n) {  
    long i, j;  
    for (i = 0; i < n; i++) {  
        b[i] = 0;  
        for (j = 0; j < n; j++)  
            b[i] += a[i*n + j];  
    }  
}
```

Value of B:

```
double A[9] =  
{ 0, 1, 2,  
 4, 8, 16},  
32, 64, 128};  
  
double B[3] = A+3;  
  
sum_rows1(A, B, 3);
```

init: [4, 8, 16]

i = 0: [3, 8, 16]

i = 1: [3, 22, 16]

i = 2: [3, 22, 224]

- Code updates `b[i]` on every iteration
- Must consider possibility that these updates will affect program behavior

# Removing Aliasing

```
/* Sum rows is of n X n matrix a
   and store in vector b */
void sum_rows2(double *a, double *b, long n) {
    long i, j;
    for (i = 0; i < n; i++) {
        double val = 0;
        for (j = 0; j < n; j++)
            val += a[i*n + j];
        b[i] = val;
    }
}
```

```
# sum_rows2 inner loop
.L10:
    addsd    (%rdi), %xmm0 # FP load + add
    addq    $8, %rdi
    cmpq    %rax, %rdi
    jne     .L10
```

- No need to store intermediate results

# Optimization Blocker: Memory Aliasing

- Aliasing
  - Two different memory references specify single location
  - Easy to have happen in C
    - Since allowed to do address arithmetic
    - Direct access to storage structures
  - Get in habit of introducing local variables
    - Accumulating within loops
    - Your way of telling compiler not to check for aliasing

# Exploiting Instruction-Level Parallelism

- Need general understanding of modern processor design
  - Hardware can execute multiple instructions in parallel
- Performance limited by data dependencies
- Simple transformations can yield dramatic performance improvement
  - Compilers often cannot make these transformations
  - Lack of associativity and distributivity in floating-point arithmetic

# Benchmark Example: Data Type for Vectors

```
/* data structure for vectors */
typedef struct{
    size_t len;
    data_t *data;
} vec;
```



## •Data Types

- Use different declarations for data\_t
- int
- long
- float
- double

```
/* retrieve vector element
   and store at val */
int get_vec_element
    (*vec v, size_t idx, data_t *val)
{
    if (idx >= v->len)
        return 0;
    *val = v->data[idx];
    return 1;
}
```

# Benchmark Computation

```
void combinel(vec_ptr v, data_t *dest)
{
    long int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}
```

Compute sum or product of vector elements

- Data Types

- Use different declarations for `data_t`
- `int`
- `long`
- `float`
- `double`

- Operations

- Use different definitions of `OP` and `IDENT`
- `+` / `0`
- `*` / `1`

# Cycles Per Element (CPE)

- Convenient way to express performance of program that operates on vectors or lists
- Length = n
- In our case: **CPE = cycles per OP**
- $T = \text{CPE} \cdot n + \text{Overhead}$ 
  - CPE is slope of line



# Benchmark Performance

```
void combine1(vec_ptr v, data_t *dest)
{
    long int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OP val;
    }
}
```

Compute sum or product of vector elements

| Method                  | Integer |       | Double FP |       |
|-------------------------|---------|-------|-----------|-------|
| Operation               | Add     | Mult  | Add       | Mult  |
| Combine1<br>unoptimized | 22.68   | 20.02 | 19.98     | 20.18 |
| Combine1 -O1            | 10.12   | 10.12 | 10.17     | 11.14 |

# Basic Optimizations

```
void combine4(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    data_t *d = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP d[i];
    *dest = t;
}
```

- Move `vec_length` out of loop
- Avoid bounds check on each cycle
- Accumulate in temporary

# Effect of Basic Optimizations

```
void combine4(vec_ptr v, data_t *dest)
{
    long i;
    long length = vec_length(v);
    data_t *d = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP d[i];
    *dest = t;
}
```

| Method       | Integer |       | Double FP |       |
|--------------|---------|-------|-----------|-------|
| Operation    | Add     | Mult  | Add       | Mult  |
| Combine1 -O1 | 10.12   | 10.12 | 10.17     | 11.14 |
| Combine4     | 1.27    | 3.01  | 3.01      | 5.01  |

- Eliminates sources of overhead in loop

# Modern CPU Design



# Superscalar Processor

- **Definition:** A superscalar processor can issue and execute *multiple instructions in one cycle*. The instructions are retrieved from a sequential instruction stream and are usually scheduled dynamically.
- Benefit: without programming effort, superscalar processor can take advantage of the *instruction level parallelism* that most programs have
  - Most modern CPUs are superscalar.
  - Intel: since Pentium (1993)

# Pipelined Functional Units

```
long mult_eg(long a, long b, long c) {  
    long p1 = a*b;  
    long p2 = a*c;  
    long p3 = p1 * p2;  
    return p3;  
}
```



|         | Time |     |     |     |       |       |   |       |
|---------|------|-----|-----|-----|-------|-------|---|-------|
|         | 1    | 2   | 3   | 4   | 5     | 6     | 7 |       |
| Stage 1 | a*b  | a*c |     |     | p1*p2 |       |   |       |
| Stage 2 |      | a*b | a*c |     |       | p1*p2 |   |       |
| Stage 3 |      |     | a*b | a*c |       |       |   | p1*p2 |

- Divide computation into stages
- Pass partial computations from stage to stage
- Stage  $i$  can start on new computation once values passed to  $i+1$
- E.g., complete 3 multiplications in 7 cycles, even though each requires 3 cycles

# Haswell CPU

- 8 Total Functional Units
- Multiple instructions can execute in parallel
  - 2 load, with address computation
  - 1 store, with address computation
  - 4 integer
  - 2 FP multiply
  - 1 FP add
  - 1 FP divide
- Some instructions take > 1 cycle, but can be pipelined

| <i>Instruction</i>             | <i>Latency</i> | <i>Cycles/Issue</i> |
|--------------------------------|----------------|---------------------|
| Load / Store                   | 4              | 1                   |
| Integer Multiply               | 3              | 1                   |
| <b>Integer/Long Divide</b>     | <b>3-30</b>    | <b>3-30</b>         |
| Single/Double FP Multiply      | 5              | 1                   |
| Single/Double FP Add           | 3              | 1                   |
| <b>Single/Double FP Divide</b> | <b>3-15</b>    | <b>3-15</b>         |

# x86-64 Compilation of Combine4

- Inner Loop (Case: Integer Multiply)

```
.L519:                                # Loop:  
    imull  (%rax,%rdx,4), %ecx  # t = t * d[i]  
    addq   $1, %rdx            # i++  
    cmpq   %rdx, %rbp          # Compare length:i  
    jg     .L519                # If >, goto Loop
```

| Method        | Integer |      | Double FP |      |
|---------------|---------|------|-----------|------|
| Operation     | Add     | Mult | Add       | Mult |
| Combine4      | 1.27    | 3.01 | 3.01      | 5.01 |
| Latency Bound | 1.00    | 3.00 | 3.00      | 5.00 |

# Loop Unrolling (2x1)

```
void unroll2a_combine(vec_ptr v, data_t *dest)
{
    long length = vec_length(v);
    long limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x = IDENT;
    long i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x = (x OP d[i]) OP d[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x = x OP d[i];
    }
    *dest = x;
}
```

- Perform 2x more useful work per iteration

# Effect of Loop Unrolling

| Method        | Integer |      | Double FP |      |
|---------------|---------|------|-----------|------|
| Operation     | Add     | Mult | Add       | Mult |
| Combine4      | 1.27    | 3.01 | 3.01      | 5.01 |
| Unroll 2x1    | 1.01    | 3.01 | 3.01      | 5.01 |
| Latency Bound | 1.00    | 3.00 | 3.00      | 5.00 |

- Helps integer add
  - Achieves latency bound
- Others don't improve. *Why?*
  - Still sequential dependency

```
x = (x OP d[i]) OP d[i+1];
```

# Loop Unrolling with Reassociation (2x1a)

```
void unroll2aa_combine(vec_ptr v, data_t *dest)
{
    long length = vec_length(v);
    long limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x = IDENT;
    long i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x = x OP (d[i] OP d[i+1]);
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x = x OP d[i];
    }
    *dest = x;
}
```

Compare to before

$x = (x \text{ OP } d[i]) \text{ OP } d[i+1];$

- Can this change the result of the computation?
- Yes, for FP. *Why?*

# Effect of Reassociation

| Method           | Integer |      | Double FP |      |
|------------------|---------|------|-----------|------|
| Operation        | Add     | Mult | Add       | Mult |
| Combine4         | 1.27    | 3.01 | 3.01      | 5.01 |
| Unroll 2x1       | 1.01    | 3.01 | 3.01      | 5.01 |
| Unroll 2x1a      | 1.01    | 1.51 | 1.51      | 2.51 |
| Latency Bound    | 1.00    | 3.00 | 3.00      | 5.00 |
| Throughput Bound | 0.50    | 1.00 | 1.00      | 0.50 |

- Nearly 2x speedup for Int \*, FP +, FP \*
  - Reason: Breaks sequential dependency

```
x = x OP (d[i] OP d[i+1]);
```

- Why is that? (next slide)

2 func. units for FP \*  
2 func. units for load

4 func. units for int +  
2 func. units for load

# Loop Unrolling with Separate Accumulators (2x2)

```
void unroll2a_combine(vec_ptr v, data_t *dest)
{
    long length = vec_length(v);
    long limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x0 = IDENT;
    data_t x1 = IDENT;
    long i;
    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x0 = x0 OP d[i];
        x1 = x1 OP d[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x0 = x0 OP d[i];
    }
    *dest = x0 OP x1;
}
```

- Different form of reassociation

# Effect of Separate Accumulators

| Method           | Integer |      | Double FP |      |
|------------------|---------|------|-----------|------|
| Operation        | Add     | Mult | Add       | Mult |
| Combine4         | 1.27    | 3.01 | 3.01      | 5.01 |
| Unroll 2x1       | 1.01    | 3.01 | 3.01      | 5.01 |
| Unroll 2x1a      | 1.01    | 1.51 | 1.51      | 2.51 |
| Unroll 2x2       | 0.81    | 1.51 | 1.51      | 2.51 |
| Latency Bound    | 1.00    | 3.00 | 3.00      | 5.00 |
| Throughput Bound | 0.50    | 1.00 | 1.00      | 0.50 |

- Int + makes use of two load units

```
x0 = x0 OP d[i];  
x1 = x1 OP d[i+1];
```

- 2x speedup (over unroll2) for Int \*, FP +, FP \*

# Unrolling & Accumulating

- Idea
  - Can unroll to any degree  $L$
  - Can accumulate  $K$  results in parallel
  - $L$  must be multiple of  $K$
- Limitations
  - Diminishing returns
    - Cannot go beyond throughput limitations of execution units
  - Large overhead for short lengths
    - Finish off iterations sequentially

# Unrolling & Accumulating: Double \*

- Case
    - Intel Haswell
    - Double FP Multiplication
    - Latency bound: 5.00. Throughput bound: 0.50

# Unrolling & Accumulating: Int +

- Case
    - Intel Haswell
    - Integer addition
    - Latency bound: 1.00. Throughput bound: 1.00

# Achievable Performance

| Method           | Integer |      | Double FP |      |
|------------------|---------|------|-----------|------|
| Operation        | Add     | Mult | Add       | Mult |
| Best             | 0.54    | 1.01 | 1.01      | 0.52 |
| Latency Bound    | 1.00    | 3.00 | 3.00      | 5.00 |
| Throughput Bound | 0.50    | 1.00 | 1.00      | 0.50 |

- Limited only by throughput of functional units
- Up to 42X improvement over original, unoptimized code

# Programming with AVX2

## YMM Registers

- 16 total, each 32 bytes
- 32 single-byte integers



- 16 16-bit integers



- 8 32-bit integers



- 8 single-precision floats



- 4 double-precision floats



- 1 single-precision float



- 1 double-precision float



# SIMD Operations

## ■ SIMD Operations: Single Precision

`vaddsd %ymm0, %ymm1, %ymm1`



## ■ SIMD Operations: Double Precision

`vaddpd %ymm0, %ymm1, %ymm1`



# Using Vector Instructions

| Method               | Integer |      | Double FP |      |
|----------------------|---------|------|-----------|------|
| Operation            | Add     | Mult | Add       | Mult |
| Scalar Best          | 0.54    | 1.01 | 1.01      | 0.52 |
| Vector Best          | 0.06    | 0.24 | 0.25      | 0.16 |
| Latency Bound        | 0.50    | 3.00 | 3.00      | 5.00 |
| Throughput Bound     | 0.50    | 1.00 | 1.00      | 0.50 |
| Vec Throughput Bound | 0.06    | 0.12 | 0.25      | 0.12 |

- Make use of AVX Instructions
  - Parallel operations on multiple data elements

# What About Branches?

- Challenge
  - Instruction Control Unit must work well ahead of Execution Unit to generate enough operations to keep EU busy

```
404663:    mov      $0x0,%eax
404668:    cmp      (%rdi),%rsi
40466b:    jge      404685
40466d:    mov      0x8(%rdi),%rax
.
.
.
404685:    repz    retq
```

The diagram shows a block of assembly code. A brace on the right side, labeled "Executing", covers the first four instructions (404663 to 40466d). A red arrow points from the text "How to continue?" to the label "404685", which is the target of a conditional branch (jge).

- When encounters conditional branch, cannot reliably determine where to continue fetching

# Modern CPU Design



# Branch Outcomes

–When encounter conditional branch, cannot determine where to continue fetching

- Branch Taken: Transfer control to branch target
- Branch Not-Taken: Continue with next instruction in sequence

–Cannot resolve until outcome determined by branch/integer unit

```
404663:    mov      $0x0,%eax
404668:    cmp      (%rdi),%rsi
40466b:    jge      404685
40466d:    mov      0x8(%rdi),%rax
.
.
.
404685:    repz    retq
```



# Branch Prediction

- Idea
  - Guess which way branch will go
  - Begin executing instructions at predicted position



# Branch Prediction Through Loop

```
401029:  vmulsd (%rdx),%xmm0,%xmm0  
40102d:  add     $0x8,%rdx  
401031:  cmp     %rax,%rdx  
401034:  jne    401029
```

Assume  
*vector length = 100*

```
401029:  vmulsd (%rdx),%xmm0,%xmm0  
40102d:  add     $0x8,%rdx  
401031:  cmp     %rax,%rdx  
401034:  jne    401029
```

Predict Taken (OK)

```
401029:  vmulsd (%rdx),%xmm0,%xmm0  
40102d:  add     $0x8,%rdx  
401031:  cmp     %rax,%rdx  
401034:  jne    401029
```

Predict Taken  
(Oops)

```
401029:  vmulsd (%rdx),%xmm0,%xmm0  
40102d:  add     $0x8,%rdx  
401031:  cmp     %rax,%rdx  
401034:  jne    401029
```

Read  
invalid  
location



# Branch Misprediction Invalidation

```
401029:  vmulsd (%rdx),%xmm0,%xmm0  
40102d:  add     $0x8,%rdx  
401031:  cmp     %rax,%rdx  
401034:  jne     401029      i = 98
```

Assume  
vector length = 100

```
401029:  vmulsd (%rdx),%xmm0,%xmm0  
40102d:  add     $0x8,%rdx  
401031:  cmp     %rax,%rdx  
401034:  jne     401029      i = 99
```

Predict Taken (OK)

```
401029:  vmulsd (%rdx),%xmm0,%xmm0  
40102d:  add     $0x8,%rdx  
401031:  cmp     %rax,%rdx  
401034:  jne     401029      i = 100
```

Predict Taken  
(Oops)

```
401029:  vmulsd (%rdx),%xmm0,%xmm0  
40102d:  add     $0x8,%rdx  
401031:  cmp     %rax,%rdx  
401034:  jne     401029      i = 101
```

Invalidate

# Branch Misprediction Recovery

```
401029:  vmulsd (%rdx), %xmm0, %xmm0  
40102d:  add     $0x8, %rdx  
401031:  cmp     %rax, %rdx  
401034:  jne     401029  
401036:  jmp     401040  
...  
401040:  vmovsd %xmm0, (%r12)
```

*i = 99*

Definitely not taken

Reload  
Pipeline

- Performance Cost
  - Multiple clock cycles on modern processor
  - Can be a major performance limiter

# Getting High Performance

- Good compiler and flags
- Don't do anything stupid
  - Watch out for hidden algorithmic inefficiencies
  - Write compiler-friendly code
    - Watch out for optimization blockers:  
procedure calls & memory references
    - Look carefully at innermost loops (where most work is done)
- Tune code for machine
  - Exploit instruction-level parallelism
  - Avoid unpredictable branches
  - Make code cache friendly (Covered later in course)

# Instruction Set Architecture

- Assembly Language View
  - Processor state
    - Registers, memory, ...
  - Instructions
    - addq, pushq, ret, ...
    - How instructions are encoded as bytes
- Layer of Abstraction
  - Above: how to program machine
    - Processor executes instructions in a sequence
  - Below: what needs to be built
    - Use variety of tricks to make it run fast
    - E.g., execute multiple instructions simultaneously



# Y86-64 Processor State



- **Program Registers**
  - 15 registers (omit %r15). Each 64 bits
- **Condition Codes**
  - Single-bit flags set by arithmetic or logical instructions
    - ZF: Zero
    - SF:Negative
    - OF: Overflow
- **Program Counter**
  - Indicates address of next instruction
- **Program Status**
  - Indicates either normal operation or some error condition
- **Memory**
  - Byte-addressable storage array
  - Words stored in little-endian byte order

# Y86-64 Instruction Set #1

| Byte             | 0 | 1  | 2    | 3  | 4 | 5 | 6 | 7 | 8 | 9 |
|------------------|---|----|------|----|---|---|---|---|---|---|
| halt             | 0 | 0  |      |    |   |   |   |   |   |   |
| nop              | 1 | 0  |      |    |   |   |   |   |   |   |
| cmoveXX rA, rB   | 2 | fn | rA   | rB |   |   |   |   |   |   |
| irmovq V, rB     | 3 | 0  | F    | rB | V |   |   |   |   |   |
| rmmovq rA, D(rB) | 4 | 0  | rA   | rB | D |   |   |   |   |   |
| mrmovq D(rB), rA | 5 | 0  | rA   | rB | D |   |   |   |   |   |
| OPq rA, rB       | 6 | fn | rA   | rB |   |   |   |   |   |   |
| jXX Dest         | 7 | fn | Dest |    |   |   |   |   |   |   |
| call Dest        | 8 | 0  | Dest |    |   |   |   |   |   |   |
| ret              | 9 | 0  |      |    |   |   |   |   |   |   |
| pushq rA         | A | 0  | rA   | F  |   |   |   |   |   |   |
| popq rA          | B | 0  | rA   | F  |   |   |   |   |   |   |

# Y86-64 Instructions

- Format
  - **1–10 bytes** of information read from memory
    - Can determine instruction length from first byte
    - Not as many instruction types, and simpler encoding than with x86-64
  - Each accesses and modifies some part(s) of the program state
- Instruction encodings range between 1 and 10 bytes.
  - An instruction consists of a **1-byte instruction specifier, possibly a 1-byte register specifier, and possibly a 8-byte constant word.**

# Y86-64 Instruction Set #2

Byte

|                  | 0 | 1  | 2    | 3  | 4 | 5 | 6 | 7 | 8 | 9 |        |
|------------------|---|----|------|----|---|---|---|---|---|---|--------|
| halt             | 0 | 0  |      |    |   |   |   |   |   |   | rrmovq |
| nop              | 1 | 0  |      |    |   |   |   |   |   |   | cmove  |
| cmovXX rA, rB    | 2 | fn | rA   | rB |   |   |   |   |   |   | cmovle |
| irmovq V, rB     | 3 | 0  | F    | rB | V |   |   |   |   |   | cmovl  |
| rmmovq rA, D(rB) | 4 | 0  | rA   | rB | D |   |   |   |   |   | cmovne |
| mrmovq D(rB), rA | 5 | 0  | rA   | rB | D |   |   |   |   |   | cmovge |
| OPq rA, rB       | 6 | fn | rA   | rB |   |   |   |   |   |   | cmovg  |
| jXX Dest         | 7 | fn | Dest |    |   |   |   |   |   |   |        |
| call Dest        | 8 | 0  | Dest |    |   |   |   |   |   |   |        |
| ret              | 9 | 0  |      |    |   |   |   |   |   |   |        |
| pushq rA         | A | 0  | rA   | F  |   |   |   |   |   |   |        |
| popq rA          | B | 0  | rA   | F  |   |   |   |   |   |   |        |

# Y86-64 Instruction Set #3

| Byte             | 0 | 1  | 2    | 3  | 4 | 5 | 6 | 7 | 8 | 9 |
|------------------|---|----|------|----|---|---|---|---|---|---|
| halt             | 0 | 0  |      |    |   |   |   |   |   |   |
| nop              | 1 | 0  |      |    |   |   |   |   |   |   |
| cmoveXX rA, rB   | 2 | fn | rA   | rB |   |   |   |   |   |   |
| irmovq V, rB     | 3 | 0  | F    | rB | V |   |   |   |   |   |
| rmmovq rA, D(rB) | 4 | 0  | rA   | rB | D |   |   |   |   |   |
| mrmovq D(rB), rA | 5 | 0  | rA   | rB | D |   |   |   |   |   |
| OPQ rA, rB       | 6 | fn | rA   | rB |   |   |   |   |   |   |
| jXX Dest         | 7 | fn | Dest |    |   |   |   |   |   |   |
| call Dest        | 8 | 0  | Dest |    |   |   |   |   |   |   |
| ret              | 9 | 0  |      |    |   |   |   |   |   |   |
| pushq rA         | A | 0  | rA   | F  |   |   |   |   |   |   |
| popq rA          | B | 0  | rA   | F  |   |   |   |   |   |   |

# Y86-64 Instruction Set #4

| Byte             | 0 | 1  | 2    | 3  | 4 | 5 | 6 | 7 | jmp | 7   0 |
|------------------|---|----|------|----|---|---|---|---|-----|-------|
| halt             | 0 | 0  |      |    |   |   |   |   | jle | 7   1 |
| nop              | 1 | 0  |      |    |   |   |   |   | jl  | 7   2 |
| cmoveXX rA, rB   | 2 | fn | rA   | rB |   |   |   |   | je  | 7   3 |
| irmovq V, rB     | 3 | 0  | F    | rB | V |   |   |   | jne | 7   4 |
| rmmovq rA, D(rB) | 4 | 0  | rA   | rB | D |   |   |   | jge | 7   5 |
| mrmovq D(rB), rA | 5 | 0  | rA   | rB | D |   |   |   | jg  | 7   6 |
| OPq rA, rB       | 6 | fn | rA   | rB |   |   |   |   |     |       |
| jXX Dest         | 7 | fn | Dest |    |   |   |   |   |     |       |
| call Dest        | 8 | 0  | Dest |    |   |   |   |   |     |       |
| ret              | 9 | 0  |      |    |   |   |   |   |     |       |
| pushq rA         | A | 0  | rA   | F  |   |   |   |   |     |       |
| popq rA          | B | 0  | rA   | F  |   |   |   |   |     |       |

# Instruction Encoding

- Each instruction requires between 1 and 10 bytes, depending on which fields are required.
- Every instruction has an initial byte identifying the instruction type. This byte is split into two 4-bit parts: the high-order, or code, part, and the low-order, or function, part.

# Encoding Registers

- Each register has 4-bit ID

|             |   |
|-------------|---|
| %rax        | 0 |
| %rcx        | 1 |
| %rdx        | 2 |
| %rbx        | 3 |
| %rsp        | 4 |
| %rbp        | 5 |
| %rsi        | 6 |
| %rdi        | 7 |
| %r8         | 8 |
| %r9         | 9 |
| %r10        | A |
| %r11        | B |
| %r12        | C |
| %r13        | D |
| %r14        | E |
| No Register | F |

- Same encoding as in x86-64
- Register ID 15 ( $0xF$ ) indicates “no register”
  - Will use this in our hardware design in multiple places

# Instruction Example

- Addition Instruction



- Add value in register rA to that in register rB
  - Store result in register rB
  - Note that Y86-64 only allows addition to be applied to register data
- Set condition codes based on result
- e.g., `addq %rax, %rsi` Encoding: 60 06
- Two-byte encoding
  - First indicates instruction type
  - Second gives source and destination registers

# Arithmetic and Logical Operations

Instruction Code

Add

 addq rA, rB

Function Code

|   |   |    |    |
|---|---|----|----|
| 6 | 0 | rA | rB |
|---|---|----|----|

Subtract (rA from rB)

 subq rA, rB

|   |   |    |    |
|---|---|----|----|
| 6 | 1 | rA | rB |
|---|---|----|----|

And

 andq rA, rB

|   |   |    |    |
|---|---|----|----|
| 6 | 2 | rA | rB |
|---|---|----|----|

Exclusive-Or

 xorq rA, rB

|   |   |    |    |
|---|---|----|----|
| 6 | 3 | rA | rB |
|---|---|----|----|

- Refer to generically as “OPq”
- Encodings differ only by “function code”
  - Low-order 4 bytes in first instruction word
- Set condition codes as side effect

# Move Operations

Register → Register

`rrmovq rA, rB`

|   |   |
|---|---|
| 2 | 0 |
|---|---|

Immediate → Register

`irmovq V, rB`

|   |   |   |    |   |
|---|---|---|----|---|
| 3 | 0 | F | rB | V |
|---|---|---|----|---|

Register → Memory

`rmmovq rA, D(rB)`

|   |   |    |    |   |
|---|---|----|----|---|
| 4 | 0 | rA | rB | D |
|---|---|----|----|---|

Memory → Register

`mrmovq D(rB), rA`

|   |   |    |    |   |
|---|---|----|----|---|
| 5 | 0 | rA | rB | D |
|---|---|----|----|---|

- Like the x86-64 `movq` instruction
- Simpler format for memory addresses
- Give different names to keep them distinct

# Move Instruction Examples

X86-64

```
movq $0xabcd, %rdx
```

Encoding:

```
movq %rsp, %rbx
```

Encoding:

```
movq -12(%rbp),%rcx
```

Encoding:

```
movq %rsi,0x41c(%rsp)
```

Encoding:

Y86-64

```
irmovq $0xabcd, %rdx
```

```
30 82 cd ab 00 00 00 00 00 00
```

```
rrmovq %rsp, %rbx
```

```
20 43
```

```
mrmovq -12(%rbp),%rcx
```

```
50 15 f4 ff ff ff ff ff ff ff ff
```

```
rmmovq %rsi,0x41c(%rsp)
```

```
40 64 1c 04 00 00 00 00 00 00
```

# Status Conditions

| Mnemonic | Code |
|----------|------|
| AOK      | 1    |

- Normal operation

| Mnemonic | Code |
|----------|------|
| HLT      | 2    |

- Halt instruction encountered

| Mnemonic | Code |
|----------|------|
| ADR      | 3    |

- Bad address (either instruction or data) encountered

| Mnemonic | Code |
|----------|------|
| INS      | 4    |

- Invalid instruction encountered

- Desired Behavior
  - If AOK, keep going
  - Otherwise, stop program execution

# CISC Instruction Sets

- Complex Instruction Set Computer
- IA32 is example
- Stack-oriented instruction set
  - Use stack to pass arguments, save program counter
  - Explicit push and pop instructions
- Arithmetic instructions can access memory
  - `addq %rax, 12(%rbx,%rcx,8)`
    - requires memory read and write
    - Complex address calculation
- Condition codes
  - Set as side effect of arithmetic and logical instructions
- Philosophy
  - Add instructions to perform “typical” programming tasks

# RISC Instruction Sets

- Reduced Instruction Set Computer
- Internal project at IBM, later popularized by Hennessy (Stanford) and Patterson (Berkeley)
- Fewer, simpler instructions
  - Might take more to get given task done
  - Can execute them with small and fast hardware
- Register-oriented instruction set
  - Many more (typically 32) registers
  - Use for arguments, return pointer, temporaries
- Only load and store instructions can access memory
  - Similar to Y86-64 `mrmovq` and `rmmovq`
- No Condition codes
  - Test instructions return 0/1 in register

| CISC                                                                                                                                                                                             | Early RISC                                                                                                                                                                                         |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| A large number of instructions. The Intel document describing the complete set of instructions [28, 29] is over 1200 pages long.                                                                 | Many fewer instructions. Typically less than 100.                                                                                                                                                  |
| Some instructions with long execution times. These include instructions that copy an entire block from one part of memory to another and others that copy multiple registers to and from memory. | No instruction with a long execution time. Some early RISC machines did not even have an integer multiply instruction, requiring compilers to implement multiplication as a sequence of additions. |
| Variable-length encodings. IA32 instructions can range from 1 to 15 bytes.                                                                                                                       | Fixed-length encodings. Typically all instructions are encoded as 4 bytes.                                                                                                                         |
| Multiple formats for specifying operands. In IA32, a memory operand specifier can have many different combinations of displacement, base and index registers, and scale factors.                 | Simple addressing formats. Typically just base and displacement addressing.                                                                                                                        |

|                                                                                                                                               |                                                                                                                                                                                                                                                                                                           |
|-----------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Arithmetic and logical operations can be applied to both memory and register operands.                                                        | Arithmetic and logical operations only use register operands. Memory referencing is only allowed by <i>load</i> instructions, reading from memory into a register, and <i>store</i> instructions, writing from a register to memory. This convention is referred to as a <i>load/store architecture</i> . |
| Implementation artifacts hidden from machine-level programs. The ISA provides a clean abstraction between programs and how they get executed. | Implementation artifacts exposed to machine-level programs. Some RISC machines prohibit particular instruction sequences and have jumps that do not take effect until the following instruction is executed. The compiler is given the task of optimizing performance within these constraints.           |
| Condition codes. Special flags are set as a side effect of instructions and then used for conditional branch testing.                         | No condition codes. Instead, explicit test instructions store the test results in normal registers for use in conditional evaluation.                                                                                                                                                                     |
| Stack-intensive procedure linkage. The stack is used for procedure arguments and return addresses.                                            | Register-intensive procedure linkage. Registers are used for procedure arguments and return addresses. Some procedures can thereby avoid any memory references. Typically, the processor has many more (up to 32) registers.                                                                              |

# CISC vs. RISC

- Original Debate
  - Strong opinions!
  - CISC proponents---easy for compiler, fewer code bytes
  - RISC proponents---better for optimizing compilers, can make run fast with simple chip design
- Current Status
  - For desktop processors, choice of ISA not a technical issue
    - With enough hardware, can make anything run fast
    - Code compatibility more important
  - x86-64 adopted many RISC features
    - More registers; use them for argument passing
  - For embedded processors, RISC makes sense
    - Smaller, cheaper, less power
    - Most cell phones use ARM processor

# Overview of Logic Design

- Fundamental Hardware Requirements
  - Communication
    - How to get values from one place to another
  - Computation
  - Storage
- Bits are Our Friends
  - Everything expressed in terms of values 0 and 1
  - Communication
    - Low or high voltage on wire
  - Computation
    - Compute Boolean functions
  - Storage
    - Store bits of information

# Digital Signals



- Use voltage thresholds to extract discrete values from continuous signal
- Simplest version: 1-bit signal
  - Either high range (1) or low range (0)
  - With guard range between them
- Not strongly affected by noise or low quality circuit elements
  - Can make circuits simple, small, and fast

# Computing with Logic Gates



- Outputs are Boolean functions of inputs
- Respond continuously to changes in inputs
  - With some, small delay



# Combinational Circuits



- Acyclic Network of Logic Gates
  - Continuously responds to changes on primary inputs
  - Primary outputs become (after some delay) Boolean functions of primary inputs

# Bit Equality



- Generate 1 if *a* and *b* are equal
- Hardware Control Language (HCL)
  - Very simple hardware description language
    - Boolean operations have syntax similar to C logical operations
  - We'll use it to describe control logic for processors

# Word Equality



- 64-bit word size
- HCL representation
  - Equality operation
  - Generates Boolean value

# Bit-Level Multiplexor



HCL Expression

```
bool out = (s&&a) || (!s&&b)
```

- Control signal  $s$
- Data signals  $a$  and  $b$
- Output  $a$  when  $s=1$ ,  $b$  when  $s=0$

# Word Multiplexor

Word-Level Representation



HCL Representation

```
int Out = [  
    s : A;  
    1 : B;  
];
```

- Select input word A or B depending on control signal  $s$
- HCL representation
  - Case expression
  - Series of test : value pairs
  - Output value for first successful test

# HCL Word-Level Examples

Minimum of 3 Words



```
int Min3 = [
    A < B && A < C : A;
    B < A && B < C : B;
    1 : C;
];
```

- Find minimum of three input words
- HCL case expression
- Final case guarantees match

4-Way Multiplexor



```
int Out4 = [
    !s1&&!s0: D0;
    !s1 : D1;
    !s0 : D2;
    1 : D3;
];
```

- Select one of 4 inputs based on two control bits
- HCL case expression
- Simplify tests by assuming sequential matching

# Arithmetic Logic Unit



- Combinational logic
  - Continuously responding to inputs
- Control signal selects function computed
  - Corresponding to 4 arithmetic/logical operations in Y86-64
- Also computes values for condition codes

# Storing 1 Bit

Bistable Element



# Registers



- Stores word of data
  - Different from *program registers* seen in assembly code
- Collection of edge-triggered latches
- Loads input on rising edge of clock

# Register Operation



- Stores data bits
- For most of time acts as barrier between input and output
- As clock rises, loads input

# Building Blocks

- Combinational Logic
  - Compute Boolean functions of inputs
  - Continuously respond to input changes
  - Operate on data and implement control



- Storage Elements
  - Store bits
  - Addressable memories
  - Non-addressable registers
  - Loaded only as clock rises



# SEQ Hardware Structure

- State
  - Program counter register (PC)
  - Condition code register (CC)
  - Register File
  - Memories
    - Access same memory space
    - Data: for reading/writing program data
    - Instruction: for reading instructions
- Instruction Flow
  - Read instruction at address specified by PC
  - Process through stages
  - Update program counter



# SEQ Stages

- Fetch
  - Read instruction from instruction memory
- Decode
  - Read program registers
- Execute
  - Compute value or address
- Memory
  - Read or write data
- Write Back
  - Write program registers
- PC
  - Update program counter



# Instruction Decoding



- Instruction Format
  - Instruction byte icode:ifun
  - Optional register byte rA:rB
  - Optional constant word valC

# SEQ Hardware

- Key
  - **Blue boxes:** predesigned hardware blocks
    - E.g., memories, ALU
  - **Gray boxes:** control logic
    - Describe in HCL
  - **White ovals:** labels for signals
  - **Thick lines:** 64-bit word values
  - **Thin lines:** 4-8 bit values
  - **Dotted lines:** 1-bit values



# SEQ Summary

- Implementation
  - Express every instruction as series of simple steps
  - Follow same general flow for each instruction type
  - Assemble registers, memories, predesigned combinational blocks
  - Connect with control logic
- Limitations
  - Too slow to be practical
  - In one cycle, must propagate through instruction memory, register file, ALU, and data memory
  - Would need to run clock very slowly
  - Hardware units only active for fraction of clock cycle

# Real-World Pipelines: Car Washes

Sequential



Parallel



Pipelined



- Idea
  - Divide process into independent stages
  - Move objects through stages in sequence
  - At any given times, multiple objects being processed

# Pipelining

**Pipelining** is running multiple *stages* of the same *process* in parallel in a way that efficiently uses all the available hardware while respecting the dependencies of each stage upon the previous stages.

**Latency** is the time to complete a single task.

**Throughput** is the rate at which tasks complete.

# Pipeline Diagrams

- Unpipelined



- Cannot start new operation until previous one completes

- 3-Way Pipelined



- Up to 3 operations in process simultaneously

# Hazards

- There are three classes of hazards:
- **Structural Hazards.** They arise from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution.
- **Data Hazards.** They arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
- **Control Hazards.** They arise from the pipelining of branches and other instructions that change the PC.

# Data Dependencies



- System
  - Each operation depends on result from preceding one

# Data Hazards



- Result does not feed back around in time for next operation
- Pipelining has changed behavior of system

# Data Dependencies in Processors



- Result from one instruction used as operand for another
  - Read-after-write (RAW) dependency
- Very common in actual programs
- Must make sure our pipeline handles these properly
  - Get correct results
  - Minimize performance impact

# SEQ+ Hardware

- Still sequential implementation
- Reorder PC stage to put at beginning
- PC Stage
  - Task is to select PC for current instruction
  - Based on results computed by previous instruction
- Processor State
  - PC is no longer stored in register
  - But, can determine PC based on other stored information



# Adding Pipeline Registers



# Pipeline Stages

- Fetch
  - Select current PC
  - Read instruction
  - Compute incremented PC
- Decode
  - Read program registers
- Execute
  - Operate ALU
- Memory
  - Read or write data memory
- Write Back
  - Update register file



# Predicting the PC



- Start fetch of new instruction after current one has completed fetch stage
  - Not enough time to reliably determine next instruction
- Guess which instruction will follow
  - Recover if prediction was incorrect

# Our Prediction Strategy

- Instructions that Don't Transfer Control
  - Predict next PC to be valP
  - Always reliable
- Call and Unconditional Jumps
  - Predict next PC to be valC (destination)
  - Always reliable
- Conditional Jumps
  - Predict next PC to be valC (destination)
  - Only correct if branch is taken
    - Typically right 60% of time
- Return Instruction
  - Don't try to predict

# Recovering from PC Misprediction



- Mispredicted Jump
  - Will see branch condition flag once instruction reaches memory stage
  - Can get fall-through PC from valA (value M\_valA)
- Return Instruction
  - Will get return PC when ret reaches write-back stage (W\_valM)

# Pipeline Demonstration

```
irmovq    $1,%rax  #I1  
irmovq    $2,%rcx  #I2  
irmovq    $3,%rdx  #I3  
irmovq    $4,%rbx  #I4  
halt          #I5
```



- File: demo-basic.ys

# Data Dependencies: 3 Nop's

```
# demo-h3.ys
```

```
0x000:  irmovq$10,%rdx
0x00a:  irmovq $3,%rax
0x014:  nop
0x015:  nop
0x016:  nop
0x017:  addq %rdx,%rax
0x019:  halt
```



Cycle 6



Cycle 7



# Random-Access Memory (RAM)

- Key features
  - RAM is traditionally packaged as a chip.
  - Basic storage unit is normally a cell (one bit per cell).
  - Multiple RAM chips form a memory.
- RAM comes in two varieties:
  - SRAM (Static RAM)
  - DRAM (Dynamic RAM)

# SRAM vs DRAM Summary

|      | Trans.<br>per bit | Access<br>time | Needs<br>refresh? | Needs<br>EDC? | Cost | Applications                    |
|------|-------------------|----------------|-------------------|---------------|------|---------------------------------|
| SRAM | 4 or 6            | 1X             | No                | Maybe         | 100x | Cache memories                  |
| DRAM | 1                 | 10X            | Yes               | Yes           | 1X   | Main memories,<br>frame buffers |

- **SRAM**

- More complex than Dram, require 4 to 6 transistor per bit
- More expensive (each cell is more complex)
- Faster than DRAM
- Used in small – fast memories
- Stores each bit in bistable memory cell(each cell is implemented with a six-transistor circuit)

- **DRAM**

- Simple, Slower, Cheaper
- Needs to be refreshed constantly
- Used in main memories and frame buffers associated with the graphic cards
- Stores each bit as charge on a capacitor

# Nonvolatile Memories

- DRAM and SRAM are **volatile** memories
  - Lose information if powered off.
- **Nonvolatile** memories retain value even if powered off
  - Read-only memory (**ROM**): programmed during production
  - Programmable ROM (**PROM**): can be programmed once
  - Erasable PROM (**EPROM**): can be bulk erased (UV, X-Ray)
  - Electrically erasable PROM (**EEPROM**): electronic erase capability
  - Flash memory: EEPROMs. with partial (block-level) erase capability
    - Wears out after about 100,000 erasing

# Nonvolatile Memories

- Uses for Nonvolatile Memories
  - Firmware programs stored in a ROM (BIOS, controllers for disks, network cards, graphics accelerators, security subsystems,...)
  - Solid state disks (replace rotating disks in thumb drives, smart phones, mp3 players, tablets, laptops,...)
  - Disk caches

# Traditional Bus Structure Connecting CPU and Memory

- A **bus** is a collection of parallel wires that carry address, data, and control signals.
- Buses are typically shared by multiple devices.



# Memory Read Transaction (1)

- CPU places address A on the memory bus.



# Memory Read Transaction (2)

- Main memory reads A from the memory bus, retrieves word x, and places it on the bus.



# Memory Read Transaction (3)

- CPU read word  $x$  from the bus and copies it into register  $\%rax$ .



# Memory Write Transaction (1)

- CPU places address A on bus. Main memory reads it and waits for the corresponding data word to arrive.



# Memory Write Transaction (2)

- CPU places data word  $y$  on the bus.



# Memory Write Transaction (3)

- Main memory reads data word  $y$  from the bus and stores it at address  $A$ .



# What's Inside A Disk Drive?



*Image courtesy of Seagate Technology*

# Disk Geometry

- Disks consist of **platters**, each with two **surfaces**.
- Each surface consists of concentric rings called **tracks**.
- Each track consists of **sectors** separated by **gaps**.



# Disk Geometry (Multiple-Platter View)

- Aligned tracks form a cylinder.



# Disk Capacity

- **Capacity**: maximum number of bits that can be stored.
  - Vendors express capacity in units of gigabytes (GB), where  $1 \text{ GB} = 10^9 \text{ Bytes}$ . (actually Terabytes now)
- Capacity is determined by these technology factors:
  - **Recording density** (bits/in): number of bits that can be squeezed into a 1 inch segment of a track.
  - **Track density** (tracks/in): number of tracks that can be squeezed into a 1 inch radial segment.
  - **Areal density** (bits/in<sup>2</sup>): product of recording and track density.

# Recording zones

- Modern disks partition tracks into disjoint subsets called **recording zones**
  - Each track in a zone has the same number of sectors, determined by the circumference of innermost track.
  - Each zone has a different number of sectors/track, outer zones have more sectors/track than inner zones.
  - So we use **average** number of sectors/track when computing capacity.



# Computing Disk Capacity

**Capacity = (# bytes/sector) x (avg. # sectors/track) x  
(# tracks/surface) x (# surfaces/platter) x (#  
platters/disk)**

Example:

- 512 bytes/sector
- 300 sectors/track (on average)
- 20,000 tracks/surface
- 2 surfaces/platter
- 5 platters/disk

$$\begin{aligned}\text{Capacity} &= 512 \times 300 \times 20000 \times 2 \times 5 \\ &= 30,720,000,000 \\ &= 30.72 \text{ GB}\end{aligned}$$

# Disk Operation (Single-Platter View)

The disk surface spins at a fixed rotational rate



The read/write *head* is attached to the end of the *arm* and flies over the disk surface on a thin cushion of air.

By moving radially, the arm can position the read/write head over any track.

# Disk Operation (Multi-Platter View)



# Disk Structure - top view of single platter



Surface organized into tracks

Tracks divided into sectors

# Disk Access



Head in position above a track

# Disk Access



Rotation is counter-clockwise

# Disk Access – Read



About to read blue sector

# Disk Access – Read



After **BLUE** read

After reading blue sector

# Disk Access – Read



After **BLUE** read

Red request scheduled next

# Disk Access – Seek



After **BLUE** read

Seek for **RED**

Seek to red's track

# Disk Access – Rotational Latency



After **BLUE** read



Seek for **RED**



Rotational latency

Wait for red sector to rotate around

# Disk Access – Read



Complete read of red

# Disk Access – Service Time Components



After **BLUE** read



Seek for **RED**



Rotational latency



After **RED** read



Data transfer



Seek



Rotational latency



Data transfer

# Disk Access Time

- Average time to access some target sector approximated by :
  - $T_{access} = T_{avg\ seek} + T_{avg\ rotation} + T_{avg\ transfer}$
- Seek time (Tavg seek)
  - Time to position heads over cylinder containing target sector.
  - Typical Tavg seek is 3—9 ms
- Rotational latency (Tavg rotation)
  - Time waiting for first bit of target sector to pass under r/w head.
  - $T_{avg\ rotation} = 1/2 \times 1/\text{RPMs} \times 60\ \text{sec}/1\ \text{min}$
  - Typical Tavg rotation = 7200 RPMs
- Transfer time (Tavg transfer)
  - Time to read the bits in the target sector.
  - $T_{avg\ transfer} = 1/\text{RPM} \times 1/(\text{avg\ # sectors/track}) \times 60\ \text{secs}/1\ \text{min.}$

# Disk Access Time Example

- Given:
  - Rotational rate = 7,200 RPM
  - Average seek time = 9 ms.
  - Avg # sectors/track = 400.
- Derived:
  - Tavg rotation =  $1/2 \times (60 \text{ secs}/7200 \text{ RPM}) \times 1000 \text{ ms/sec} = 4 \text{ ms}$ .
  - Tavg transfer =  $60/7200 \text{ RPM} \times 1/400 \text{ secs/track} \times 1000 \text{ ms/sec} = 0.02 \text{ ms}$
  - Taccess = 9 ms + 4 ms + 0.02 ms
- **Important points:**
  - Access time dominated by seek time and rotational latency.
  - First bit in a sector is the most expensive, the rest are free.
  - SRAM access time is about 4 ns/doubleword, DRAM about 60 ns
    - **Disk is about 40,000 times slower than SRAM,**
    - **2,500 times slower than DRAM.**

# Logical Disk Blocks

- Modern disks present a simpler abstract view of the complex sector geometry:
  - The set of available sectors is modeled as a sequence of b-sized **logical blocks** (0, 1, 2, ...)
- Mapping between logical blocks and actual (physical) sectors
  - Maintained by hardware/firmware device called disk controller.
  - Converts requests for logical blocks into (surface,track,sector) triples.
- Allows controller to set aside spare cylinders for each zone.
  - Accounts for the difference in “formatted capacity” and “maximum capacity”.

# I/O Bus

CPU chip



# Reading a Disk Sector (1)

CPU chip



CPU initiates a disk read by writing a command, logical block number, and destination memory address to a **port** (address) associated with disk controller.

# Reading a Disk Sector (2)

CPU chip



Disk controller reads the sector and performs a direct memory access (**DMA**) transfer into main memory.



# Reading a Disk Sector (3)

CPU chip



# Solid State Disks (SSDs)



- Pages: 512KB to 4KB, Blocks: 32 to 128 pages
- Data read/written in units of pages.
- Page can be written only after its block has been erased
- A block wears out after about 100,000 repeated writes.

# SSD Performance Characteristics

|                      |          |                       |          |
|----------------------|----------|-----------------------|----------|
| Sequential read tput | 550 MB/s | Sequential write tput | 470 MB/s |
| Random read tput     | 365 MB/s | Random write tput     | 303 MB/s |
| Avg seq read time    | 50 us    | Avg seq write time    | 60 us    |

- Sequential access faster than random access
  - Common theme in the memory hierarchy
- Random writes are somewhat slower
  - Erasing a block takes a long time (~1 ms)
  - Modifying a block page requires all other pages to be copied to new block
  - In earlier SSDs, the read/write gap was much larger.

Source: Intel SSD 730 product specification.

# SSD Tradeoffs vs Rotating Disks

- Advantages
  - No moving parts → faster, less power, more rugged
- Disadvantages
  - Have the potential to wear out
    - Mitigated by “wear leveling logic” in flash translation layer
    - E.g. Intel SSD 730 guarantees 128 petabyte ( $128 \times 10^{15}$  bytes) of writes before they wear out
  - In 2015, about 30 times more expensive per byte
- Applications
  - MP3 players, smart phones, laptops
  - Beginning to appear in desktops and servers