

03 | SUN

JUNE - 2018

## Microcontrollers & Embedded Systems. Module-2.

### Introduction to the ARM Instruction set:

- \*→ Data processing Instructions.
- Programme Instruction
- Software Interrupt Instructions.
- Programs Status Register Instructions
- \*→ Coprocessor Instructions
- Loading Constants

04 | MON

### ARM programming using Assembly language

- Writing Assembly code.
- \*→ Profiling and cycle counting.
- \*→ Instruction scheduling.
- \*→ Register Allocation.
- Conditional Execution.
- Looping.
- Constructs.

05 | TUE

### Introduction:-

The hexadecimal numbers are represented with the prefix 0x and binary numbers with prefix 0b

PRE <pre-conditions>

<Instructions>

POST <post-conditions>

In the pre & post-conditions, memory is denoted as mem<data-size>[address]

Example: mem 32[1024]

06 | WED

ARM instruction process data held in registers and memory is accessed only with load and store instructions

ARM instructions commonly take two or three operands

Instruction Syntax: ADD R3, R1, R2

Destination register Rd: R3.

Source register 1 Rn: R1.

Source register 2 Rm: R2.

JUNE - 2018

07 | THU

JUNE - 2018

## Data Processing Instructions:

The data processing instructions manipulate data within registers.

They are \*

\* MOVE

\* ARITHMETIC

\* LOGICAL

\* COMPARISON and

\* MULTIPLY Instruction.

Most data processing instructions can process one of their operands using the barrel shifter.

08 | FRI then it updates the flags in CPSR. Move and logical operations update carry flag C, negative flag N, zero flag Z.

The C flag is set from the result of the barrel shifter as the last bit shifted out.

- N flag is set to bit 31 of the result
- Z flag is set if the result is zero.

### → MOVE Instruction:

This instruction copies N to Rd, where N is a register or immediate value. It is useful for setting initial values and transferring

| Sun | Mon | Tue | Wed | Thu | Fri | Sat | Sun | Mon | Tue | Wed | Thu | Fri | Sat |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|     | 1   | 2   | 3   | 4   | 5   |     | 6   | 7   | 8   | 9   | 10  | 11  | 12  |
| 13  | 14  | 15  | 16  | 17  | 18  | 19  | 20  | 21  | 22  | 23  | 24  | 25  | 26  |
| 27  | 28  | 29  | 30  | 31  |     |     |     |     |     |     |     |     |     |

MAY 2018

09 | SAT

JUNE - 2018

data blw & registers

syntax: <instruction>{<cond>} {S} Rd, N

MOV - Move a 32-bit value into a register  
→ Rd = N.

MVN - move the NOT of the 32-bit value into a register → Rd = ~N.

Example:

PRE  $x5 = 5$

$x7 = 8$

MOV  $x7, x5$ ; Let  $x7 = x5$

10 | SUN POST  $x5 = 5$

$x7 = 5$

\* Barrel shifter :-

- Data processing instructions are processed within the arithmetic logic unit (ALU)
- A unique and powerful feature of the ARM processor is the ability to shift the 32-bit binary pattern in one of the source register left or right by a specific number of positions before it enters the ALU.

| Sun | Mon | Tue | Wed | Thu | Fri | Sat | Sun | Mon | Tue | Wed | Thu | Fri | Sat |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|     | 1   | 2   | 3   | 4   | 5   |     | 6   | 7   | 8   | 9   | 10  | 11  | 12  |
| 13  | 14  | 15  | 16  | 17  | 18  | 19  | 20  | 21  | 22  | 23  | 24  | 25  | 26  |
| 27  | 28  | 29  | 30  | 31  |     |     |     |     |     |     |     |     |     |

MAY 2018

11 | MON

JUNE - 2018

- Pre-processing or shift occurs within the cycle time of the instructions.

→ Shift increases the power and flexibility of data processing operations.

→ This is particularly useful for loading constants into a register and achieving fast multiply or division by a power of 2.

- There are data processing instructions that do not use the barrel shift, for example:

12 | TUE the MUL (multiply), CLZ (count leading zeros) & QADD (signal saturated 32-bit add), instructions.

- The data flow b/w ALU & barrel shifter (Below figure).

~~else applying to MUL~~

- Registers Rn enters the ALU without any pre-processing of registers
- We apply LSL (logical shift left) to Rn before moving it to Rd. It is same like applying C language shift operators to the

| Sun | Mon | Tue | Wed | Thu | Fri | Sat | Sun | Mon | Tue | Wed | Thu | Fri | Sat |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|     | 1   | 2   | 3   | 4   | 5   |     | 6   | 7   | 8   | 9   | 10  | 11  | 12  |
| 13  | 14  | 15  | 16  | 17  | 18  | 19  | 20  | 21  | 22  | 23  | 24  | 25  | 26  |
| 27  | 28  | 29  | 30  | 31  |     |     |     |     |     |     |     |     |     |

MAY 2018

13 | WED

JUNE - 2018

- The MOV instruction copies the shift operator result N into register Rd.
- N → represents the result of the LSL operation



14 | THU

Barrel shifter and ALU.

### Barrel shifter Operations

LSL - logical shift left -  $x \ll y$   
 LSR - logical shift right -  $x \ll y$  - unsigned( $x \gg y$ )  
 ASR - arithmetic shift right -  $x \ll y$  - (signed)  $x \gg y$   
 ROR - rotate right -  $x \gg y$  -  $x \gg y | x \ll (32-y)$   
 RRX - rotate right extended -  $x \gg y$   
 ( $c \text{ flag} \ll 31 \text{ || } (\text{unsigned}) x \gg y$ )

| Sun | Mon | Tue | Wed | Thu | Fri | Sat | Sun | Mon | Tue | Wed | Thu | Fri | Sat |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 13  | 14  | 15  | 16  | 17  | 18  | 19  | 20  | 21  | 22  | 23  | 24  | 25  | 26  |
| 27  | 28  | 29  | 30  | 31  |     |     |     |     |     |     |     |     |     |
|     |     |     |     |     |     |     |     |     |     |     |     |     |     |

MAY 2018

15 | FRI

Example:-

PRE  $\gamma_5 = 5$

$\gamma_7 = 8$

MOV  $\gamma_7, \gamma_5, LSL\#2$

POST

$\gamma_5 = 5$

$\gamma_7 = 20.$

Barrel shifter operation syntax for data processing instructions

Immediate  $\rightarrow$  # immediate.

16 | SAT Register  $\rightarrow R_m$ .

Logical shift left by immediate

$\rightarrow R_m, LSL \# \text{shift\_imm.}$

Logical shift left by register  $\rightarrow R_m, LSL R_s$ .

Arithmetic shift right by immediate

$\rightarrow R_m, ASR \# \text{shift\_imm}$

Arithmetic shift right by register

$\rightarrow R_m, ASR R_s$ .

Rotate right by register  $\rightarrow R_m, ROR R_s$ .

Rotate right with extend  $\rightarrow R_m, RRX$ .

JUNE - 2018

MAY 2018

17 | SUN

Arithmetic Instructions.

Syntax: <instruction> {<conds>} {S} Rd, Rn, N.

We use three operands :- Rd - destination register other two or source operands. In two source operand one must be register others can be register or immediate data.

ADC - add two 32-bit values and carry  
-  $R_d = R_n + N + \text{Carry}$

ADD - add two 32-bit value -  $R_d = R_n + N$   
RSB - reverse subtract of two 32-bit values -  $R_d = N - R_n$

RSC - reverse subtract with carry of two 32-bit values -  $R_d = N - R_n - !CF$ .

SBC - subtract with carry of two 32-bit values -  $R_d = R_n - N - !CF$ .

SUB - subtract two 32-bit  $\Rightarrow R_d = R_n - N$ .

The arithmetic instruction implement addition and subtraction of 32-bit signed & unsigned.

| Sun | Mon | Tue | Wed | Thu | Fri | Sat | Sun |
|-----|-----|-----|-----|-----|-----|-----|-----|
|     | 1   | 2   | 3   | 4   | 5   | 6   |     |
| 13  | 14  | 15  | 16  | 17  | 18  | 19  | 20  |
| 27  | 28  | 29  | 30  | 31  |     |     |     |

| Sun | Mon | Tue | Wed | Thu | Fri | Sat | Sun |
|-----|-----|-----|-----|-----|-----|-----|-----|
| 13  | 14  | 15  | 16  | 17  | 18  | 19  | 6   |
| 27  | 28  | 29  | 30  | 31  |     |     | 20  |
|     |     |     |     |     |     |     |     |

MAY 2018

19 | TUE

Examples:-

1) PRE  $\text{R0} = \text{0X00000000}$   
 $\text{R1} = \text{0X00000002}$   
 $\text{R2} = \text{0X00000001}$   
 $\text{SUB R0, R1, R2}$   
 POST  $\text{R0} = \text{0X00000001}$

2) PRE  $\text{R0} = \text{0X00000000}$

$\text{R1} = \text{0X00000077}$

$\text{RSB R0, R1, #0}$

POST  $\text{R0} = -\text{R1} = \text{0X4fffff89}$ .

20 | WED

3) PRE  $\text{RPSR} = \text{NZCVqiFT-USER}$

$\text{R1} = \text{0X00000001}$

$\text{SUBS R1, R1, #1}; \text{R1} = \text{R1} - 1$

POST  $\text{RPSR} = \text{NZCVqiFT-USER}$

$\text{R1} = \text{0X00000000}$

### Logical Instructions :-

Syntax: <instruction> {<cond>} {S} Rd, Rn, N

Logical instructions perform bitwise logical operations on the two source registers.

JUNE - 2018

21 | THU

JUNE - 2018  
AND - logical bitwise AND of two 32-bit values  $\Rightarrow \text{Rd} = \text{Rn} \& \text{N}$ .

ORR - logical bitwise OR of two 32-bit values  $\Rightarrow \text{Rd} = \text{Rn} \vee \text{N}$ .

EOR - logical exclusive OR of two 32-bit values  $\Rightarrow \text{Rd} = \text{Rn} \oplus \text{N}$ .

BIC - logical bit clear (AND NOT)  
 $\Rightarrow \text{Rd} = \text{Rn} \& \neg \text{N}$ .

Example:- PRE  $\text{R0} = \text{0X00000000}$

$\text{R1} = \text{0X02040608}$

$\text{R2} = \text{0X10305070}$

ORR  $\text{R0, R1, R2}$

POST  $\text{R0} = \text{0X12345678}$

22 | FRI

BIC Instruction :- (Bit Clear instruction)  
 $\Rightarrow$  It is combination of AND and NOT  
 Syntax : BIC {<cond>} {S} Rd, Rn, N.

Example: PRE  $\text{R1} = \text{0b1111}$

$\text{R2} = \text{0b0100}$

BIC  $\text{R0, R1, R2}$

POST  $\text{R0} = \text{0b1010}$

This is equivalent to  
 $\text{Rd} = \text{Rn} \text{ AND } \text{NOT}(N)$ .

| Sun | Mon | Tue | Wed | Thu | Fri | Sat | Sun | Mon | Tue | Wed | Thu | Fri | Sat |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|     | 1   | 2   | 3   | 4   | 5   | 6   |     | 7   | 8   | 9   | 10  | 11  | 12  |
| 13  | 14  | 15  | 16  | 17  | 18  | 19  | 20  | 21  | 22  | 23  | 24  | 25  | 26  |
| 27  | 28  | 29  | 30  | 31  |     |     |     |     |     |     |     |     |     |

MAY 2018

| Sun | Mon | Tue | Wed | Thu | Fri | Sat | Sun | Mon | Tue | Wed | Thu | Fri | Sat |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|     |     | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  | 12  |
| 13  | 14  | 15  | 16  | 17  | 18  | 19  | 20  | 21  | 22  | 23  | 24  | 25  | 26  |
| 27  | 28  | 29  | 30  | 31  |     |     |     |     |     |     |     |     |     |

MAY 2018

23 | SAT

### Purpose of BIC instruction:-

Example:-

→ binary      → hexadecimal form  
 $x_1 = 0b\ 1100$      $x_1 = 0xc$

$x_2 = 0b\ 1000$

BIC  $x_0, x_1, x_2$

$x_0 = 0b\ 0100$      $x_0 = 0x4 = 4$

So BIC instruction is used to clear (i.e. to ~~not~~ make zero) particular bit in a source data.

24 | SUN

### Comparison Instructions

Syntax: <instruction>{<conds>} Rn, N

CBN - compare negated - flags set as a result of  $R_n + N$

CMP - Compare - flags set as a result of  $R_n - N$

TED - Test for equality of two 32-bit values - flags set as a result of  $R_n \wedge N$

TST - test bits of a 32-bit value -

flags set as a result of  $R_n \wedge N$ .

| Sun | Mon | Tue | Wed | Thu | Fri | Sat |
|-----|-----|-----|-----|-----|-----|-----|
|     | 1   | 2   | 3   | 4   | 5   | 6   |
| 13  | 14  | 15  | 16  | 17  | 18  | 19  |
| 27  | 28  | 29  | 30  | 31  |     | 20  |

MAY 2018

JUNE - 2018

25 | MON

JUNE - 2018

The compare instruction are used to compare or test a register with a 32-bit value. They update the CPSR flag bits according to the result, but do not affect other registers. It is not required to apply the S suffix for comparison instruction to update the flag.

Example :-

CMP{<conds>} Rn, N

PRE

CPSR = NZCVQIFT-USER

26 | TUE

$x_1 = 20$

$x_2 = 30$

CMP  $x_1, x_2$

POST

CPSR = NZCVQIFT-USER

Working

CMP means

Subtraction take place

Observe the above example

$$x_2 - x_1 \Rightarrow 30 - 20 = 10 \geq 0$$

conditional code is NZCVQIFT-USER.

| Sun | Mon | Tue | Wed | Thu | Fri | Sat |
|-----|-----|-----|-----|-----|-----|-----|
|     | 1   | 2   | 3   | 4   | 5   | 6   |
| 13  | 14  | 15  | 16  | 17  | 18  | 19  |
| 27  | 28  | 29  | 30  | 31  |     | 20  |

MAY 2018

27 | WED

JUNE - 2018

Example 2:

| N | Z | C | V |
|---|---|---|---|
| 0 | 1 | 0 | 0 |

PRE

$CPSR = \text{NZC}VQ\text{IFT\_USER}$

$\gamma_1 = 30$

$\gamma_2 = 30$

CHP  $\gamma_1, \gamma_2$

POST

$CPSR = \text{NZC}VQ\text{IFT\_USER}$ .

28 | THU

Multiply Instruction (32 bit).

Syntax : MLA {<cond>} {S} Rd, Rm, Rs, Rn

MUL {<cond>} {S} Rd, Rm, Rs

MLA - multiply and accumulate  $\rightarrow R_d = (R_m * R_s) + R_n$

MUL - multiply  $\rightarrow R_d = R_m * R_s$

Multiply instructions multiply the contents of a pair of registers and depending upon the instruction, accumulate in register.

| Sun | Mon | Tue | Wed | Thu | Fri | Sat | Sun |
|-----|-----|-----|-----|-----|-----|-----|-----|
| 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   |
| 14  | 15  | 16  | 17  | 18  | 19  | 20  | 21  |
| 27  | 28  | 29  | 30  | 31  |     |     |     |

MAY 2018

29 | FRI

JUNE - 2018

Example :

PRE  $\gamma_0 = 0x0000000000$

$\gamma_1 = 0x0000000002$

$\gamma_2 = 0x0000000002$

MUL  $\gamma_0, \gamma_1, \gamma_2 ; \gamma_0 = \gamma_1 * \gamma_2$   
POST

$\gamma_0 = 0x0000000004$

$\gamma_1 = 0x0000000002$

$\gamma_2 = 0x0000000002$

30 | SAT

Multiplication of 64 bit.

Syntax : <instruction> {<cond>} RdLo, RdHi, Rm, Rs.

SMAL - Signed multiply accumulate long  
 $\rightarrow [RdHi, RdLo] = [RdHi, RdLo] + (Rm * Rs)$ .

SMULL - Signed multiply long  
 $\rightarrow [RdHi, RdLo] = Rm * Rs$ .

UMAL - Unsigned multiply accumulate long  
 $\rightarrow [RdHi, RdLo] = [RdHi, RdLo] + (Rm * Rs)$ .

UMULL - unsigned multiply long  
 $\rightarrow [RdHi, RdLo] = Rm * Rs$ .

| Sun | Mon | Tue | Wed | Thu | Fri | Sat | Sun |
|-----|-----|-----|-----|-----|-----|-----|-----|
| 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   |
| 14  | 15  | 16  | 17  | 18  | 19  | 20  | 21  |
| 27  | 28  | 29  | 30  | 31  |     |     |     |

MAY 2018

Note

The number of cycles takes to execute a multiply instruction depends on the processor implementation.

For some implementation the cycle timing also depends on the value of Rs.

Example:

The instruction multiplies registers  $x_2$   $\& x_3$   $\& x_4$  places the result into register  $x_0 \& x_1$ .

Register  $x_0$  contains the lower 32-bits,  
 $x_1$  register  $x_1$  contains the higher 32 bits  
of the 64-bit result.

$$\text{PRE } x_0 = 0x00000000$$

$$x_1 = 0x00000000$$

$$x_2 = 0x10000002$$

$$x_3 = 0x00000002$$

$$\text{UHULL } x_0, x_1, x_2, x_3 ; [x_1, x_0] = x_2 * x_3$$

POST

$$x_0 = 0xe0000004 ; = RdLO$$

$$x_1 = 0x00000001 ; = RdHI$$

| Sun | Mon | Tue | Wed | Thu | Fri | Sat | Sun | Mon | Tue | Wed | Thu | Fri | Sat |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|     | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  | 12  |     |
| 13  | 14  | 15  | 16  | 17  | 18  | 19  | 20  | 21  | 22  | 23  | 24  | 25  | 26  |
| 27  | 28  | 29  | 30  | 31  |     |     |     |     |     |     |     |     |     |

MAY 2018

01 | SUN

JULY - 2018

The long multiply instruction produce a 64-bit result. The result is too large to fit a single 32-bit register so result is placed in two registers namely

RdHi - holds higher 32-bits

RdLo - holds lower 32-bits

$$\therefore 32 + 32 \text{ bits} \Rightarrow 64 \text{ bits.}$$

Refer Textbook for more Examples:-

02 | MON

JUN 2018

## \* Branch Instructions :-

There are 4 branch Instruction

- B → Branch
- BL → Branch with link
- BX → Branch exchange
- BLX → Branch exchange with link

- A branch instruction changes the flow of execution
- OR is used to call a routine
- Branch instruction allows programmers
  - to skip set of instructions from execution
  - to repeat execution of set of instruction
- Branching changes content of PC register.
- A branch can be conditional or unconditional
- Conditional branches get executed only when condition is true.

Syntax:

B{<cond>} label

BL{<cond>} label

BX{<cond>} Rm

BLX{<cond>} label | Rm

Conditional is optional.

B → Branch → PC = label

BL → Branch with link → PC = label

$l_x$  = address of the next instruction  
after the BL

( $l_x$  = link register).

BX → Branch exchange

→ PC = Rm & 0x000000fe, T = Rm & 1

BLX → Branch exchange with link

→ PC = label, T = 1

→ PC = Rm & 0x000000fe, T = Rm & 1

→  $l_x$  = address of the next instruction after BL

where.

T refers to Thumb bit in CPSR.

When instructions set T, the ARM switches to Thumb state.

Example for forward and Backward Branches:

1) B forward

ADD  $x_1, x_2, \#4$   
ADD  $x_0, x_6, \#2$   
ADD  $x_3, x_7, \#4$

2) backward

ADD  $x_1, x_2, \#4$   
SUB  $x_1, x_2, \#4$   
ADD  $x_4, x_6, x_7$

forward

SUB  $x_1, x_2, \#4$

B backward.

forward & backward are labels.

↳ it skips three instructions.

2015 backward branch forms infinite loop.

# 26

Thursday  
March

085-280 • Wk 13

FEBRUARY 2015

| M  | T  | W  | T  | F  | S  | S  |
|----|----|----|----|----|----|----|
| 2  | 3  | 4  | 5  | 6  | 7  | 1  |
| 9  | 10 | 11 | 12 | 13 | 14 | 18 |
| 16 | 17 | 18 | 19 | 20 | 21 | 22 |
| 23 | 24 | 25 | 26 | 27 | 28 |    |

BL instruction is similar to B instruction but overwrites the link register (lr) with return address. It performs subroutine call.

Example:

10 BL subroutine ; branch to subroutine  
11 CMP x1, #5 ; compare x1 with 5  
12 MOVEQ x1, #0 ; if (x1 == 5) then x1 = 0.  
13  
14 subroutine  
15 {  
16     subroutine codes  
17     MOV pc, lr ; return by moving pc=lr

BX, BLX are third type of Branch instruction

- 3 BX uses an absolute address stored in register Rm.
- 4 It primarily used to branch to and from Thumb code. The T bit in the CPSR is updated by the least significant bit of branch register.
- 5 BLX is same as BX additionally set the link register with the return address.

| S  | M  | T  | W  | T  | F  | S  |
|----|----|----|----|----|----|----|
|    | 1  | 2  | 3  | 4  |    |    |
| 5  | 6  | 7  | 8  | 9  | 10 | 11 |
| 12 | 13 | 14 | 15 | 16 | 17 | 18 |
| 19 | 20 | 21 | 22 | 23 | 24 | 25 |
| 26 | 27 | 28 | 29 | 30 |    |    |

27

Friday  
March

WK 13 • 086 279

## \* Load - Store Instructions.

### Types

- Single - register transfer , → multiple - register transfer,
- swap .

#### • Single - Register Transfer:

- These instructions are used for moving a single data item in and out of a register.

- The data types supported are signed & unsigned, half-words (16-bit) and bytes.

Syntax :  $\{LDR|STR\} \{cond\} \{B\} Rd, \text{ addressing}$   
 $LDR \{cond\} SB|H|SH Rd, \text{ addressing}^2$   
 $STR \{cond\} H Rd, \text{ addressing}^2$

LDR - load word into register  $\rightarrow Rd \leftarrow \text{mem } 32[\text{address}]$

STR - save byte or word from a register  
 $\rightarrow Rd \rightarrow \text{mem } 32[\text{address}]$

LDRB - load byte into a register  $\rightarrow Rd \leftarrow \text{mem } 8[\text{address}]$

STRB - save byte from a register  
 $\rightarrow Rd \rightarrow \text{mem } 8[\text{address}]$

LDRH - load halfword into a register  
 $\rightarrow Rd \leftarrow \text{mem } 16[\text{address}]$

STRH - save halfword into a register  
 $\rightarrow Rd \rightarrow \text{mem } 16[\text{address}]$

LDRSB - load signal byte into register

Rd  $\leftarrow$  Sign Extend (mem8[address]).

8 LDRSH - load signed halfword into a register  
 $\rightarrow$  Rd  $\leftarrow$  Sign Extend (mem16[address]).

9

10 LDR & STR instructions can load and store data on a boundary alignment that is the same as the data type size being loaded or stored.

12

Example:-

1 LDR x0,[x1]; = LDR x0,[x1,#0]

2 Load register x0 with the contents of the memory address pointed to by register x1.

3 STR x0,[x1]; = STR x0,[x1,#0]

Store the contents of register x0 to the memory address pointed by x1.

29

Sunday Register x1 is called base address register.

APRIL 2015

| S  | M  | T  | W  | T  | F  | S  |
|----|----|----|----|----|----|----|
|    | 1  | 2  | 3  | 4  |    |    |
| 5  | 6  | 7  | 8  | 9  | 10 | 11 |
| 12 | 13 | 14 | 15 | 16 | 17 | 18 |
| 19 | 20 | 21 | 22 | 23 | 24 | 25 |
| 26 | 27 | 28 | 29 | 30 |    |    |

WK 14 • 089-278

30  
Monday  
March

## Single - Register Load - Store addressing Modes:

There are three types of load - store instruction

- { ° pre - index with write back
- ° pre - index
- ° post - index

→ Indexing addressing methods.

### Index methods :

| <u>Index method</u>     | <u>Data</u>                               | <u>Base address register</u>  | <u>Example</u>                             |
|-------------------------|-------------------------------------------|-------------------------------|--------------------------------------------|
| preindex with writeback | $\text{mem}[\text{base} + \text{offset}]$ | $\text{base} + \text{offset}$ | $\text{LDR } \text{R0}, [\text{R1}, \#4]!$ |
| pre index               | $\text{mem}[\text{base} + \text{offset}]$ | not updated                   | $\text{LDR } \text{R0}, [\text{R1}, \#4]$  |
| post index              | $\text{mem}[\text{base}]$                 | $\text{base} + \text{offset}$ | $\text{LDR } \text{R0}, [\text{R1}], \#4$  |

31

Tuesday  
March

090-275 • WK 14

| FEBRUARY 2015 |    |    |    |    |    |    |
|---------------|----|----|----|----|----|----|
| M             | T  | W  | T  | F  | S  | S  |
| 2             | 3  | 4  | 5  | 6  | 7  | 8  |
| 9             | 10 | 11 | 12 | 13 | 14 | 15 |
| 16            | 17 | 18 | 19 | 20 | 21 | 22 |
| 23            | 24 | 25 | 26 | 27 | 28 | 29 |

8 Pre-index with writeback calculates an address from a base register plus address offset & then updates that address base register with the new address.

9

Pre-index is same as pre-index writeback but does not update the address base register.

10 Post-index only updates the address base register after the address is used.

11

Example:-

1

$$\text{PRE } \text{R0} = 0x00000000$$

2

$$\text{R1} = 0x00009000$$

3

$$\text{mem 32}[0x00009000] = 0x01010101$$

4

$$\text{mem 32}[0x00009004] = 0x02020202$$

5

$$\text{LDR R0, [R1, #4]}$$

Pre-indexing with writeback

$$\text{POST (1)} \quad \text{R0} = 0x02020202$$

6

$$\text{R1} = 0x00009004$$

$$\text{LDR R0, [R1, #4]}$$

Pre-indexing

$$\text{POST (2)} \quad \text{R0} = 0x02020202$$

$$\text{R1} = 0x00009000$$

$$\text{LDR R0, [R1, #4]}$$

Post-indexing

$$\text{POST (3)} \quad \text{R0} = 0x01010101$$

$$\text{R1} = 0x00009004$$

2015

8

Wednesday  
July

189-176 • WK 28

JUNE 2015

| M  | T  | W  | T  | F  | S  | S  |
|----|----|----|----|----|----|----|
| 1  | 2  | 3  | 4  | 5  | 6  | 7  |
| 8  | 9  | 10 | 11 | 12 | 13 | 14 |
| 15 | 16 | 17 | 18 | 19 | 20 | 21 |
| 22 | 23 | 24 | 25 | 26 | 27 | 28 |
| 29 | 30 |    |    |    |    |    |

Addressing mode for single - Register:

8

Addressing mode and index  
method

Addressing' syntax

10 Pre-index with immediate offset  $\rightarrow [R_n, \# +/- \text{offset}_{-12}]$ 11 Preindex with register offset  $\rightarrow [R_n, +/- R_m]$ 12 Preindex with scaled register offset  $\rightarrow [R_n, +/- R_m, \text{shift} \# \text{shift} - \text{imm}]$ 13 Pre-index writeback with immediate offset  $\rightarrow [R_n, \# +/- \text{offset}_{-12}]$ 14 Pre index writeback with register offset  $\rightarrow [R_n, +/- R_m]$ 15 Immediate postindexed  $\rightarrow [R_n], \# +/- \text{Offset}_{-12}$ 16 Register post index  $\rightarrow [R_n], +/- \text{offset}_{-12}, R_m$ 17 Scaled register postindex  $\rightarrow [R_n] +/- R_m, \text{shift} \# \text{shift} - \text{imm}$ 

2015

- A signed offset or register is denoted by +1 - identifying that it is either a positive or negative offset from the base address Register R<sub>b</sub>.
- Immediate means the address is calculated using the base address register & a 12-bit offset encoded in the instruction.
- Scaled means the address is calculated using the base address register & a barrel shift operation.

## • Multiple-Register Transfer

- It can transfer data b/w multiple register & memory in a single instruction.
- Efficient from single-register transfers for moving blocks of data, saving and restoring context & stacks.
- It can increase interrupt latency
- Transfer occurs from a base address register R<sub>b</sub> pointing into memory

10

Friday  
July

191-174 • WK 28

| JUNE 2015 |    |    |    |    |    |   |
|-----------|----|----|----|----|----|---|
| M         | T  | W  | T  | F  | S  | S |
| 1         | 2  | 3  | 4  | 5  | 6  |   |
| 8         | 9  | 10 | 11 | 12 | 13 |   |
| 15        | 16 | 17 | 18 | 19 | 20 |   |
| 22        | 23 | 24 | 25 | 26 | 27 |   |
| 29        | 30 |    |    |    |    |   |

Syntax: <LDI|STM> {<cond>} <addressing mode>  
 Rn{.3}, <registers>{^}

9 LDM - load multiple register  
 -  $\{Rd\}^{*N} \leftarrow \text{mem } 32[\text{start address} + 4^*N]$

10 STM - Save multiple register  
 -  $\{Rd\}^{*N} \rightarrow \text{mem } 32[\text{start address} + 4^*N]$

11 12 ~~where~~, N - Number of registers in the list of register.

13 Addressing mode for Load-store Multiple Instruction:

Addressing  $\rightarrow$  Description  $\rightarrow$  start  $\rightarrow$  end  $\rightarrow$  Rn!  
mode address address.

1A  $\rightarrow$  Increment after  $\rightarrow R_n \rightarrow R_n + 4^*N - 4 \rightarrow R_n + 4^*N$

1B  $\rightarrow$  Increment before  $\rightarrow R_n + 4 \rightarrow R_n + 4^*N \rightarrow R_n + 4^*N$

DA  $\rightarrow$  decrement after  $\rightarrow R_n - 4^*N + 4 \rightarrow R_n \rightarrow R_n - 4^*N$

DB  $\rightarrow$  decrement before  $\rightarrow R_n - 4^*N \rightarrow R_n - 4 \rightarrow R_n - 4^*N$

2015

| AUGUST 2015 |    |    |    |    |    |    |
|-------------|----|----|----|----|----|----|
| S           | M  | T  | W  | T  | F  | S  |
| 1           | 2  | 3  | 4  | 5  | 6  | 7  |
| 8           | 9  | 10 | 11 | 12 | 13 | 14 |
| 15          | 16 | 17 | 18 | 19 | 20 | 21 |
| 22          | 23 | 24 | 25 | 26 | 27 | 28 |
| 29          | 30 | 31 |    |    |    |    |

WK 28 • 192-173

11  
Saturday  
July

- Any subset of the current bank of registers can be transferred to memory, or fetched from memory.
- Rn - determine the source or destination address for a load-store multiple instruction.

Example:

$$\text{PRE : } \text{mem32}[0x80018] = 0x03$$

$$\text{mem32}[0x80014] = 0x02$$

$$\text{mem32}[0x80010] = 0x01$$

$$x0 = 0x00080010$$

$$x1 = 0x00000000$$

$$x2 = 0x00000000$$

$$x3 = 0x00600000$$

LDMIA x0!, [x1 - x3]

$$\text{POST : } x0 = 0x00080010$$

$$x1 = 0x00000001$$

$$x2 = 0x00000002$$

$$x3 = 0x00000003$$

12  
Sunday

13

Monday  
July

194 171 • WK 29

| M  | T  | W  | T  | F  | S  |
|----|----|----|----|----|----|
| 1  | 2  | 3  | 4  | 5  | 6  |
| 8  | 9  | 10 | 11 | 12 | 13 |
| 15 | 16 | 17 | 18 | 19 | 20 |
| 22 | 23 | 24 | 25 | 26 | 27 |
| 29 | 30 |    |    |    |    |

## Stack Operation :-

- The pop operation uses a load-store multiple instructions.
- The push operation uses a store multiple instruction.

When using a stack you have to decide whether the stack will grow up or down in memory.

A stack is either -

12 ascending (A) - grows towards higher memory address

descending (D) - grows towards lower memory address.

When you use a full stack (F), the stack pointer

2 SP points to an address that is the last used for full location.

If we use an empty stack (E) the sp points to an address that is the first unused or empty location.

## Addressing methods for stack operation.

### Addressing modes

FA

FD

EA

ED

Description

full ascending

full descending

empty ascending

empty descending

| POP   | LDN   | Push  | STM |
|-------|-------|-------|-----|
| LDMFA | LDMDA | SIMFA | STM |
| LDMFO | LDMA  | STMFO | STM |
| LDMEA | LDMDB | STMFA | STM |
| LDMED | LOMIB | STMED |     |

 STM  
2015

Example:- Using STNED.

PRE       $r1 = 0x00000002$

$r4 = 0x00000003$

$sp = 0x00080010$

STNED  $sp!, \{r1, r4\}$

POST       $r1 = 0x00000002$

$r4 = 0x00000003$

$sp = 0x00080008$

When handling a checked stack there are three attributes that need to be preserved:  
the stack base, stack pointer, stack limit.

### \* Swap Instruction.

Syntax: SWP {B}{<cond>} Rd, Rm, [Rn]

SWP - swap a word blw memory and a register

$tmp = \text{mem } 32[Rn]$

$\text{mem } 32[Rn] = Rm$

$Rd = tmp$

SWPB - swap a byte blw memory and a register

$tmp = \text{mem } 8[Rn]$

$\text{mem } 8[Rn] = Rm$

$Rd = tmp$

| AUGUST 2015 |    |    |    |    |    |    |
|-------------|----|----|----|----|----|----|
| S           | M  | T  | W  | T  | F  | S  |
| 1           |    |    |    |    |    |    |
| 2           | 3  | 4  | 5  | 6  | 7  | 8  |
| 9           | 10 | 11 | 12 | 13 | 14 | 15 |
| 16          | 17 | 18 | 19 | 20 | 21 | 22 |
| 23          | 24 | 25 | 26 | 27 | 28 | 29 |

WK 20 • 197-168

16  
Thursday  
July

- It is a special-case in load-store instruction
- It swaps the contents of memory with the contents of a register.
- It is an atomic operation - it reads & writes a location in the same bus operation, preventing any other instruction from reading or writing to that location until still complete.

### Example:-

PRE    mem32 [0x9000] = 0x12345678  
 $x_0 = 0x00000000$   
 $x_1 = 0x11112222$   
 $x_2 = 0x000009000$   
 SWP  $x_0, x_1, [x_2]$

POST    mem32 [0x9000] = 0x11112222  
 $x_0 = 0x12345678$   
 $x_1 = 0x11112222$   
 $x_2 = 0x00009000$

# 17

Friday  
July

198 167 • WK 29

| M  | T  | W  | T  | F  |
|----|----|----|----|----|
| 1  | 2  | 3  | 4  | 5  |
| 8  | 9  | 10 | 11 | 12 |
| 15 | 16 | 17 | 18 | 19 |
| 22 | 23 | 24 | 25 | 26 |
| 29 | 30 |    |    |    |

## Software Interrupt Instruction.

A software interrupt instruction (SWI) causes a software interrupt exception, which provides a mechanism for application to call OS routines.

Syntax : SWI {<cond>} SWI\_number.

SWI - Software interrupt -  $lr_{-SVC}$  = address of instruction following SWI

$$SPSR_{-SVC} = CPSR$$

$$PC = \text{vector} + 0x8$$

$$CPSR \text{ mode} = SVC$$

$$CPSR I = 1 \text{ (mark IRQ interrupt).}$$

PRE

example :- CPSR = nzcvqift - USER

PC = 0x0000 8000

lr = 0x003ffff ; lr = x14

r0 = 0x12

0x00008000 SWI 0x123456

POST

CPSR = nzcvqift - SVC

SPSR = nzcvqift - USER

PC = 0x0000 0008

lr = 0x00008004

r0 = 0x12.

AUGUST 2015

| S  | M  | T  | W  | T  | F  | S  |
|----|----|----|----|----|----|----|
| 30 | 31 | 1  | 2  | 3  | 4  | 5  |
| 7  | 8  | 9  | 10 | 11 | 12 | 13 |
| 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| 21 | 22 | 23 | 24 | 25 | 26 | 27 |
| 28 | 29 |    |    |    |    |    |

WK 29 • 199-166

18  
Saturday  
July

When the processor executes an SWI instruction, it sets the program counter PC to the offset 0x8 in the vector table. The instruction also forces the processor mode to SVC, which allows an OS routine to be called in a privileged mode.

Since SWI instructions are used to call OS routines, you need some form of parameter passing. This is achieved using registers.

### Program Status Register Instruction.

The ARM instruction set provides two instructions to directly control a program status register.

→ The MSR instruction transfers the contents of either the CPSR or SPSR into a register.

→ The MSR instruction transfers the contents of a register into the CPSR or SPSR.

→ Together these instructions are used to read & write the CPSR and SPSR.

# 20

Monday

July

201 164 • WK 30

| JUNE 2016 |    |    |    |    |    |
|-----------|----|----|----|----|----|
| M         | T  | W  | T  | F  | S  |
| 1         | 2  | 3  | 4  | 5  | 6  |
| 8         | 9  | 10 | 11 | 12 | 13 |
| 15        | 16 | 17 | 18 | 19 | 20 |
| 22        | 23 | 24 | 25 | 26 | 27 |
| 29        | 30 |    |    |    |    |

Syntax:  $MRS\{\langle cond \rangle\} Rd, \langle cpsr | spsr \rangle$   
 $MSR\{\langle cond \rangle\} \langle cpsr | spsr \rangle - \langle fields \rangle, Rm$   
 $MSR\{\langle cond \rangle\} \langle cpsr | spsr \rangle - \langle fields \rangle,$   
# immediate.

## \* Coprocessor Instructions:

Coprocessor instructions are used to extend the instruction set.

- A coprocessor can either provide additional computation capability, or be used to control the memory subsystem, including cache and memory management.
- A coprocessor instructions include data processing, register transfer, and memory transfer instructions.

Syntax:  $COP\{\langle cond \rangle\} cp, Opcode1, Cd, Cn\{Opcode2\}$   
 $\langle MRC | MCR \rangle\{\langle cond \rangle\} cp, Opcode1, Rd, Cn, Cm$   
 $\{, Opcode2\}$   
 $\langle LDC | STC \rangle\{\langle cond \rangle\} cp, Cd, \text{addressing}.$

~~cop~~ —

COP - coprocessor data processing - perform an operation in a coprocessor.

MRC MCR - coprocessor register transfer - move data to/ from coprocessor register.

16 17 18 19 20 21 22  
23 24 25 26 27 28 29

LDC STC → coprocessor memory transfer - load and store blocks of memory to/ from a coprocessor.

- In the syntax of the coprocessor instructions.

→ The Cp field represents the coprocessor number blw p0 and p15.

→ The opcode fields describes the operation to take place on the coprocessor.

→ The Co, Cm and Cd fields describe registers within the coprocessor.

- The coprocessor operations and registers depend on the specific coprocessor you are using.

• Coprocessor 15 (CP15) is reserved for system control purposes, such as memory management, write buffer control, cache control and identification registers.

Example:-

CP15 register being copied into a general purpose register.

MRC p15, 0, &10, C0, C0, 0.

27

Monday  
July

208-157 • WK 31

JUNE 2015

| M  | T  | W  | T  | F  | S  |
|----|----|----|----|----|----|
| 1  | 2  | 3  | 4  | 5  | 6  |
| 8  | 9  | 10 | 11 | 12 | 13 |
| 15 | 16 | 17 | 18 | 19 | 20 |
| 22 | 23 | 24 | 25 | 26 | 27 |
| 29 | 30 |    |    |    |    |

## LOADING CONSTANTS:

8 Syntax: LDR Rd, =constant  
ADR Rd, label.

10 LDR - load constant pseudo instruction  
Rd = 32-bit constant.

11 ADR - load address pseudo instruction  
Rd = 32-bit relative address.

12 There is no ARM instruction to move a 32-bit constant into a register. Since ARM instructions are 32 bits in size, they obviously cannot specify a general 32-bit constant. To aid programming there are two pseudo-instructions to move a 32-bit value into a register.

4 → The first pseudo-instruction writes a 32-bit constant to a register using whatever instruction are available. It defaults to a memory read if the constant cannot be encoded using other instructions.

→ The second pseudo-instruction writes a relative address into a register, which will be encoded using ~~other~~ a PC-relative expression.

| S  | N  | E  | W  | T  | F  | S  |
|----|----|----|----|----|----|----|
| 30 | 31 | 1  | 2  | 3  | 4  | 5  |
| 29 | 30 | 6  | 7  | 8  | 9  | 10 |
| 28 | 29 | 11 | 12 | 13 | 14 | 15 |
| 27 | 28 | 16 | 17 | 18 | 19 | 20 |
| 26 | 27 | 21 | 22 | 23 | 24 | 25 |
| 25 | 26 | 20 | 21 | 22 | 23 | 24 |

28

Tuesday  
July

WK 31 • 203-156

Example:-

LDR instruction loading 32-bit constant  
0xffff00ffff into register x0.

LDR x0, [pc, #constant\_number - 8 - {pc}]

constant\_number

0CD 0xffff00ffff.

## ARM programming Using Assembly Language:

### Introduction:-

Writing assembly by hand gives you direct control of three optimization tools that you cannot explicitly use if writing C source.

Instruction Scheduling:- Reordering the instruction in a code sequence to avoid processor stalls. Since ARM implementations are pipelined, the timing of an instruction can be affected by neighbouring instruction.

29

Wednesday  
July

210-155 • WK 31

| JUNE |    |    |    |    |    |
|------|----|----|----|----|----|
| M    | T  | W  | T  | F  | S  |
| 1    | 2  | 3  | 4  | 5  | 6  |
| 8    | 9  | 10 | 11 | 12 | 13 |
| 15   | 16 | 17 | 18 | 19 | 20 |
| 22   | 23 | 24 | 25 | 26 | 27 |
| 29   | 30 |    |    |    |    |

Reading

8 Register allocation :- Deciding how variables  
should be allocated to ARM registers or  
stack locations for maximum performance.  
9

10 Conditional execution :- Accessing the full range  
of ARM condition code & conditional instruction.  
11

## 12 Writing Assembly Code.

Converting C program to assembly code.

```
#include <stdio.h>
int square (int i)
int main (void)
{
    int i;
    for (i=0; i<10; i++)
    {
        printf (" square of %d\n", i, square (i));
    }
    int square (int i)
    {
        return i*i;
    }
}
```

AUGUST 2015

| S  | M  | T  | W  | T  | F  | S  |
|----|----|----|----|----|----|----|
| 8  | 9  | 10 | 11 | 12 | 13 | 14 |
| 15 | 16 | 17 | 18 | 19 | 20 | 21 |
| 22 | 23 | 24 | 25 | 26 | 27 | 28 |
| 29 | 30 | 31 |    |    |    |    |

30  
Thursday  
July

WK 31 • 211-154

## Assembly program.

AREA .text!, CODE, READONLY

EXPORT square

Square

```
MUL    R1, R0, R0  
MOV    R0, R1  
MOV    PC, R8  
END
```

Area - it is directive names the area or code section that the code lives in. In previous code we define a Read-only code area called .text.

EXPORT - directive makes the symbol square available for external linking.

END - directive marks the end of the assembly file.

Example:-

When calling ARM code from C compiled as thumb, the only change required is to change the return instruction to a BX.

AUGUST

SEPTEMBER

OCTOBER

NOVEMBER

DECEMBER

Area 1. text1, code, READONLY

EXPORT square

square

```
HUL    x1,x0, x0  
MOV    x0, x1  
BX     lr  
END.
```

\* Profiling and Cycle Counting.

The first stage of any optimization process is to identify the critical routines and measure their current performance. A profiler is a tool that measures the proportion of time or processing cycle spent in each subroutine. You use a profiler to identify the most critical routines. A cycle counter measures the number of cycles taken by a specific routine. You can measure your success by using a cycle counter to benchmark a given subroutine before and after an optimization.

The ARM simulator used by the ADS.

1. debugger is called the ARMulator and provides profiling and cycle counting features.

1

Saturday  
August

213-152 • WK 31

| M  | T  | W  | T  | F  | S  | S  |
|----|----|----|----|----|----|----|
|    |    |    | 1  | 2  | 3  | 4  |
| 6  | 7  | B  | 9  | 10 | 11 | 12 |
| 13 | 14 | 15 | 16 | 17 | 18 | 19 |
| 20 | 21 | 22 | 23 | 24 | 25 | 26 |
| 27 | 28 | 29 | 30 | 31 |    |    |

The ARMulator profiler works by sampling the program counter ~~at~~ pc at regular intervals.

The profiler identifies the function the pc points to and updates a hit counter for each function it encounters.

ARM implementation do not normally contain cycle-counting hardware, so to easily measure cycle counts you should use an ARM debugger with ARM simulator.

## INSTRUCTION SCHEDULING.

The time taken to execute instructions depends on the implementation pipeline.

Instructions that are conditional on the value of the ARM condition codes in the CPSR take one cycle if the condition is not met. If the condition is met, then following rules apply:-

- ALU operations such as addition, subtraction & logical operation take place in one cycle. These include a shift by an immediate value. If you use a register-shift, then add one cycle. If the instruction writes to pc, then 2 cycle.

SEPTEMBER 2015

| S  | M  | T  | W  | F  | S  |
|----|----|----|----|----|----|
| 1  | 2  | 3  | 4  | 5  |    |
| 6  | 7  | 8  | 9  | 10 | 11 |
| 13 | 14 | 15 | 16 | 17 | 18 |
| 20 | 21 | 22 | 23 | 24 | 25 |
| 27 | 28 | 29 | 30 |    |    |

3

Monday  
August

WK 82 • 215-150

- Load instructions that load N 32-bit words of memory such as LDR & LDM take N cycles to issue, but the result of the last word loaded is not available on the following cycle. The updated load address is available on the next cycle. This assumes zero-wait-state memory for an uncached system or a cache hit for a cached system. An LDM of a single value is exceptional, taking two cycles. If the instruction loads pc, then add two cycles.
- Load instructions that load 16-bit or 8-bit data such as LDRB, LDRSB, LDRH & LDRSH take one cycle to issue. The updated load address is available on the next cycle.
- Branch instructions takes 3 cycles.
- Store instructions that store N values take N cycle. This assumes zero-wait-state memory for an uncached system, or a cache hit or write buffer with N free entries for a cached system.
- Multiply instructions take a varying number of cycles depending on the value of the second operand.

4

Tuesday  
August

216-149 • WK 32

| JULY 2015 |    |    |    |    |    |    |
|-----------|----|----|----|----|----|----|
| M         | T  | W  | T  | F  | S  | S  |
|           |    |    | 1  | 2  | 3  | 4  |
| 6         | 7  | 8  | 9  | 10 | 11 | 12 |
| 13        | 14 | 15 | 16 | 17 | 18 | 19 |
| 20        | 21 | 22 | 23 | 24 | 25 | 26 |
| 27        | 28 | 29 | 30 | 31 |    |    |

## ARM9TDMI pipeline executing in ARM state.

- The ARM9TDMI processor five operations in parallel.

→ Fetch → Fetch from memory the instruction at address pc. The instruction is loaded into the core and then processes down the core pipeline.

→ Decode → Decode the instruction that was fetched in the previous cycle. The processor also reads the input operands from the register bank if are not available via one of the forwarding paths.

→ ALU → Executes the instruction that was decoded in the previous cycle. Note this instruction was originally fetched from address pc-8 (ARM state) or pc-4 (Thumb state). Normally this involves calculating the answer for a data processing operation, or the address for a load, store or branch operations. Some instructions may spend several cycles in this stage. For example, multiply and register-controlled shift operations take several ALU cycles.

5

Wednesday  
August

MK32 • 217-148

| REGISTERS | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 | R9 | R10 | R11 | R12 | R13 | R14 | R15 | R16 |
|-----------|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|-----|-----|
| A         | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10  | 11  | 12  | 13  | 14  | 15  | 16  |
| D         | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26  | 27  | 28  | 29  | 30  | 31  | 32  |
| S         | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42  | 43  | 44  | 45  | 46  | 47  | 48  |
| PC        | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58  | 59  | 60  | 61  | 62  | 63  | 64  |

→ LS1: Load or store the data specified by a load or store instruction. If the instruction is not a load or store, then this stage has no effect.

→ LS2: Extract and zero - or sign- extend the data loaded by a byte or halfword load instruction. If the instruction is not a load of an 8-bit byte or 16-bit halfword item, then this stage has no effect.

Example:-

This example shows the case where there is no interlock

ADD R0, R0, R1

ADD R0, R0, R2

This instruction pair take two cycles. The ALU calculate  $R0 + R1$  in one cycle. Therefore this result is available for the ALU to calculate  $R0 + R2$  in the second cycle.



OCTOBER

NOVEMBER

DECEMBER

6

Thursday  
August

218-147 • WK 32

| M  | T  | W  | T  | F  | S  | S  |
|----|----|----|----|----|----|----|
|    |    |    | 1  | 2  | 3  | 4  |
| 6  | 7  | 8  | 9  | 10 | 11 | 12 |
| 13 | 14 | 15 | 16 | 17 | 18 | 19 |
| 20 | 21 | 22 | 23 | 24 | 25 | 26 |
| 27 | 28 | 29 | 30 | 31 |    |    |

## Scheduling of load instructions:

Load instructions occur frequently in compiled code, accounting for approximately one third of all instructions. Careful scheduling of load instructions so that pipeline stalls don't occur can improve performance. The compiler attempts to schedule the code as best it can, but the aliasing problems of C that we looked at in Section 5.b limits the available optimizations. The compiler cannot move a load instruction before a store instruction unless it is certain that the two pointers used do not point to the same address.

Example of a memory-intensive task. The following function, str\_toLower, copies a zero-terminated string of characters from in to out. It converts the string to lowercase in the process.

```

void str_toLower(char *out, char *in)
{
    unsigned int c;
    do
    {
        c = *(in++);
        if(c >= 'A' && c <= 'Z')
            c = c + ('a' - 'A');
        *out++ = c;
    } while(*in);
}
```

|    |    |    |    |    |    |    |    |    |    |    |    |
|----|----|----|----|----|----|----|----|----|----|----|----|
| S  | E  | P  | R  | I  | N  | R  | E  | S  | E  | R  | H  |
| 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 |
| 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
| 25 | 26 | 27 | 28 | 29 | 30 | 31 |    |    |    |    |    |

7

Friday  
August

WK 32 • 219-146

```
* (out++) = (char) c
} while(c);
```

The ADS I.I compiler generates the following compiled output. Notice that the compiler optimized the condition ( $c \geq 'A' \&\& c \leq 'z'$ ) to the check that  $0 \leq c - 'A' \leq 'z' - 'A'$ . The compiler can perform this check using a single unsigned comparison.

### str\_tolower

```
LDRB  r2,[r1],#1
SUB  r3,r2,#0x41
CMP  r3,#0x19
ADDLS r2,r2,#0x20
STRB  r2,[r0],#1
CMP  r2,#0
BNE  str_tolower
MOV  pc,r14.
```

Unfortunately, the SUB instruction uses the value of c directly after the LDRB instruction that loads c. Consequently, the ARM9 TDMI pipeline will stall for two cycles. The compiler can't do any better since everything following the load of c depends on its value. However, there are two ways you can alter the structure of the algorithm to avoid the cycles by using assembly. We call these methods load scheduling by preloading and unrolling.

8

Saturday  
August:

220-145 • WK 32

| JULY 2015 |    |    |    |    |
|-----------|----|----|----|----|
| M         | T  | W  | T  | F  |
| 6         | 7  | 8  | 9  | 10 |
| 13        | 14 | 15 | 16 | 17 |
| 20        | 21 | 22 | 23 | 24 |
| 27        | 28 | 29 | 30 | 31 |

## Load Scheduling by preloading:

In this method of load scheduling, we load the data required for the loop at the end of the previous loop, rather than at the beginning of the current loop. To get performance improvement with little increase in code size, we don't unroll the loop.

Example:-

This assembly applies the preload method to the str\_tobus function.

|     |      |
|-----|------|
| out | RN 0 |
| is  | RN 1 |
| c   | RN 2 |
| t   | RN 3 |
| ;   |      |

loop

9

Sunday

```

SUB t, c, #'A'
CMP t, #'z' - 'A'
ADDLS c, c, #'a' - 'A'
STRB c, [out], #1
TEQ c, [in], #1
BNE loop
MOV pc, ls

```

SEPTEMBER 2015

| S  | M  | T  | W  | T  | F  | S  |
|----|----|----|----|----|----|----|
|    | 1  | 2  | 3  | 4  | 5  |    |
| 6  | 7  | 8  | 9  | 10 | 11 | 12 |
| 13 | 14 | 15 | 16 | 17 | 18 | 19 |
| 20 | 21 | 22 | 23 | 24 | 25 | 26 |
| 27 | 28 | 29 | 30 |    |    |    |

10

Monday  
August

WK 33 • 222-143

The scheduled version is one instruction longer than the C version, but we save two cycles for each inner loop iteration. This reduces the loop from 11 cycles per character to 9 cycles per character on an ARM9TDMI, giving a 1.22 times speed improvement.

The ARM architecture is particularly well suited to this type of preloading because instructions can be executed conditionally. Since loop i is loading the data for loop i+1 there is always a problem with the first & last loop. No byte load occurs on the last loop.

### Load Scheduling by Unrolling.

This method of load scheduling works by unrolling and then interleaving the body of the loop.

for example:- We can perform loop iterations i, i+1, i+2 interleaved. When the result of an operation from loop i is not ready, we can perform an operation from loop i+1 that avoids waiting for the loop i result.

This loop is the most efficient implementation we've looked at so far. The implementation requires seven cycles per character on ARM9TDMI. This gives a 1.57 times speed increase over the original strainer.

SEPTEMBER

OCTOBER

NOVEMBER

DECEMBER

11

Tuesday  
August

223-142 • WK 33

JULY 2015

| M  | T  | W  | T  | F  | S   |
|----|----|----|----|----|-----|
|    |    |    | 1  | 2  | 3 4 |
| 6  | 7  | 8  | 9  | 10 | 11  |
| 13 | 14 | 15 | 16 | 17 | 18  |
| 20 | 21 | 22 | 23 | 24 | 25  |
| 27 | 28 | 29 | 30 | 31 |     |

Again it is the conditional nature of the ARM instruction that makes this possible. We use conditional instructions to avoid storing characters that are past the end of the string.

However, the improvement in Example 1.10 does have some costs. The routine is more than double the code size of the original implementation. We have assumed that you can read up to two characters beyond the end of the input string, which may not be true if the string is right at the end of available RAM. Where reading off the end will cause a data abort. Also performance can be slower for very short strings because (1) stacking /x causes additional function call overhead & (2) the routine may process up two characters pointlessly, before discovering that they lie beyond the end of the string.

You should use this form of scheduling by insulating for time-critical parts of an application where you know the data size is large. If you also know the size of the size of the data at compile time, you can remove the problem of reading beyond the end of the arrays.

SEPTEMBER 2015

| S  | M  | T  | W  | T  | F  | S  |
|----|----|----|----|----|----|----|
| 1  | 2  | 3  | 4  | 5  |    |    |
| 6  | 7  | 8  | 9  | 10 | 11 | 12 |
| 13 | 14 | 15 | 16 | 17 | 18 | 19 |
| 20 | 21 | 22 | 23 | 24 | 25 | 26 |
| 27 | 28 | 29 | 30 |    |    |    |

12

Wednesday  
August

WK 33 • 224-141

out RN 0 ;  
 in RN 1 ;  
 ca0 RN 2 ;  
 t RN 3 ;  
 cal RN 12 ;  
 ca2 RN 14 ;  
 ; void str\_tolower\_unrolled(char \*out, char \*in)  
 str\_tolower\_unrolled  
 STMFD sp!, {lr} ;  
 loop\_next3  
 LDRB ca0, [in], #1  
 LDRB ca1, [in], #1  
 LDRB ca2, [in], #1  
 SUB t, ca0, #'A'  
 CMP t, #'Z' - #'A'  
 ADDLS cal, cal, #'a' - #'A'  
 SUB t, ca2, #'A'  
 CMP t, #'Z' - #'A'  
 ADDLS ca2, ca2, #'a' - #'A'  
 STRB ca0, [out], #1  
 TEQ ca0, #0  
 STRNEB ca1, [out], #1  
 TEQNE ca1, #0  
 STRNEB ca2, [out], #1  
 TEQNE ca2, #0  
 BNE loop\_next3  
 [LDMFD sp!, {pc}]

SEPTEMBER

OCTOBER

NOVEMBER

DECEMBER

# 13

Thursday  
August

225-140 • WK 33

| M  | T  | W  | T  | F  | S  | S  |
|----|----|----|----|----|----|----|
|    |    |    | 1  | 2  | 3  | 4  |
| 6  | 7  | 8  | 9  | 10 | 11 | 12 |
| 13 | 14 | 15 | 16 | 17 | 18 | 19 |
| 20 | 21 | 22 | 23 | 24 | 25 | 26 |
| 27 | 28 | 29 | 30 | 31 |    |    |

## REGISTER ALLOCATION:-

We can use 14 of the 16 visible ARM registers to hold general-purpose data. The other two registers are the stack pointer  $\$13$  and the program counter  $\$15$ . For a function to be ATPCS compliant it must preserve the callee values of registers  $\$4$  to  $\$11$ . ATPCS also specifies that the stack should be eight-byte aligned. Therefore you must preserve this alignment if calling subroutines. Use the following template for optimized assembly routine requiring many registers.

routine\_name

STMFD sp!, { $\$4-\$12, lr$ }

LDMFD sp!, { $\$4-\$12, pc$ }

Our only purpose in stacking  $\$12$  is to keep the stack eight-byte aligned. You need not stack  $\$12$  if your routine doesn't call other ATPCS routines. For ARM v5 and above you can use the preceding template even when being called from Thumb code. If your routine may be called from Thumb code on an ARMv4T processor, then modify the template as follows:

SEPTEMBER 2015

| S  | M  | T  | W  | T  | F  | S  |
|----|----|----|----|----|----|----|
|    | 1  | 2  | 3  | 4  | 5  |    |
| 6  | 7  | 8  | 9  | 10 | 11 | 12 |
| 13 | 14 | 15 | 16 | 17 | 18 | 19 |
| 20 | 21 | 22 | 23 | 24 | 25 | 26 |
| 27 | 28 | 29 | 30 |    |    |    |

WK 33 • 226-139

14

Friday  
August

routine\_name

STMFD sp!, {x4-x12, lr}

; body of routine

LDMFD sp!, {x4-x12, lr}

BX lr

In this section we look at how best to allocate variables to register numbers for register intensive tasks, how to use more than 14 local variables, & how to make the best use of the 14 available registers.

### Allocating Variables to Register Numbers:

When we write an assembly routine, it is best to start by using names for variables, rather than explicit register numbers. This allows you to change the allocation of variables to register numbers easily. You can even use different register names for the same physical register number when their use doesn't overlap. Register names increase the clarity and readability of optimized code.

For the most part ARM operations are orthogonal with respect to register numbers. In other words, specific register numbers do not have specific roles.

SEPTEMBER

OCTOBER

NOVEMBER

DECEMBER

# 15

Saturday  
August

227-138 • WK 33

| M  | T  | W  | T  | F  | S  |
|----|----|----|----|----|----|
|    |    |    | 1  | 2  | 3  |
| 6  | 7  | 8  | 9  | 10 | 11 |
| 13 | 14 | 15 | 16 | 17 | 18 |
| 20 | 21 | 22 | 23 | 24 | 25 |
| 27 | 28 | 29 | 30 | 31 |    |

• Argument registers. The ATPGS convention defines that the first four arguments to a function are placed in registers  $x0$  to  $x3$ . Further arguments are placed on the stack. The return value must be placed in  $x0$ .

• Registers used in a load or store multiple. Load & store multiple instructions LDM and STM operate on a list of registers in order of ascending register number. If  $x0$  and  $x1$  appear in the register list, then the processor will always load or store  $x0$  using a lower address than  $x1$  & so on...

• Load and store double word. The LD RD & STRD instructions introduced in ARMV5E operate on a pair of registers with sequential register numbers,  $Rd$  &  $Rd+1$ . Furthermore,  $Rd$  must be an even register number.

# 16

Sunday

Example:-

The assembly shows our final shift-bit routine. It uses all 14 available ARM registers.

K<sub>x</sub> RN 1<sub>x</sub>

shift\_bits

STMFD SP!, {Y4-Y11, 1<sub>x</sub>}RSB K<sub>x</sub>, K<sub>y</sub>, #32

MOV Y-0, #0

loop

LDMIA in!, {X\_0-X\_7}

ORR Y-0, Y-0, X-0, LSL K

MOV Y-1, X-0, LSR K<sub>x</sub>

ORR Y-1, Y-1, X-1, LSL K

MOV Y-2, X-1, LSR K<sub>x</sub>

ORR Y-2, Y-2, X-2, LSL K

MOV Y-2, X-1, LSR K<sub>x</sub>

ORR Y-2, Y-2, X-2, LSL K

MOV Y-3, X-2, LSR K<sub>x</sub>

ORR Y-3, Y-3, X-3, LSL K

MOV Y-4, X-3, LSR K<sub>x</sub>

ORR Y-4, Y-4, X-4, LSL K

MOV Y-5, X-4, LSR K<sub>x</sub>

ORR Y-5, Y-5, X-5, LSL K

MOV Y-6, X-5, LSR K<sub>x</sub>

ORR Y-6, Y-6, X-6, LSL K

MOV Y-7, X-6, LSR K<sub>x</sub>

ORR Y-7, Y-7, X-7, LSL K

STMIA out!, {Y-0-Y-7}

MOV Y-0, X-7, LSR K<sub>x</sub>

SUBS N, N, #256

BNE loop

MOV Y0, Y-0

LDMFD SP!, {Y4-Y11, PC}

# 18

Tuesday  
August

230-135 • WK 34

| M  | T  | W  | T  | F  | S  |
|----|----|----|----|----|----|
|    |    |    | 1  | 2  | 3  |
| 6  | 7  | 8  | 9  | 10 | 11 |
| 13 | 14 | 15 | 16 | 17 | 18 |
| 20 | 21 | 22 | 23 | 24 | 25 |
| 27 | 28 | 29 | 30 | 31 |    |

## Using More than 14 Local Variables.

If you need more than 14 local 32-bit variables, then you must store some variable on the stack. The standard procedure is to work outward from the inner-most loop of the algorithm, since the innermost loop has the greatest performance impact.

Example:

nested-loops

STMFD sp!, {x4-x11, l8}

loop1

STMFD sp!, {loop1 registers}

loop2

STMFD sp!, {loop2 registers}

loop3

B{cond} loop3

LDMFD sp!, {loop2 registers}

B{cond} loop2

LDMFD sp!, {loop1 registers}

B{cond} loop1

LDMFD sp!, {x4-x11, pc}.

SEPTEMBER 2015

| S  | M  | T  | W  | T  | F  | S  |
|----|----|----|----|----|----|----|
| 1  | 2  | 3  | 4  | 5  |    |    |
| 6  | 7  | 8  | 9  | 10 | 11 | 12 |
| 13 | 14 | 15 | 16 | 17 | 18 | 19 |
| 20 | 21 | 22 | 23 | 24 | 25 | 26 |
| 27 | 28 | 29 | 30 |    |    |    |

WK 34 • 233-132

21  
Friday  
August

## CONDITIONAL EXECUTION:

The processor core can conditionally execute most ARM instructions. This conditional execution is based on one 15 conditional codes. If you don't specify a condition the assembler defaults to execute always condition (AL). The other 14 conditions split into seven pairs of complements. The conditions depend on the four conditional flags N, Z, C, V stored in the cpsr register.

By default, ARM instructions do not update the N, Z, C, V flags in the ARM cpsr. For most instructions, to update these flags you append an S suffix to the instruction mnemonic.

Exceptions to this are comparison instructions that do not write to a determine destination register. Their sole purpose is to update the flags and so they don't require the S suffix.

By combining conditional execution and conditional setting of flags, you can implement simple if statement without need of Branches.

Example:- Assembly using conditional execution

CMP r1, #10

ADDLO C, r1, #10

ADDHS C, r1, #'A' - 10.

Conditional execution is even more powerful for cascading conditions.

SEPTEMBER

OCTOBER

NOVEMBER

DECEMBER

# 22

Saturday  
August

234-131 • WK 34

| M  | T  | W  | T  | F  | S  | U  |
|----|----|----|----|----|----|----|
| 6  | 7  | 8  | 9  | 10 | 11 | 12 |
| 13 | 14 | 15 | 16 | 17 | 18 | 19 |
| 20 | 21 | 22 | 23 | 24 | 25 | 26 |
| 27 | 28 | 29 | 30 | 31 |    |    |

Example: 2. Identify if c is vowel

8 TEQ C, # 'a'

9 TEQNE C, # 'e'

10 TEQNE C, # 'i'

11 TEQNE C, # 'o'

12 TEQNE C, # 'u'

ADDEA vowel, vowel, #1

As soon as one of the TEQ comparisons detects a match, the Z flag is set in the CPSR. The following TEQNE instructions have no effect as they are conditional on Z=0.

The next instruction to have effect is the ADDEA that increments vowel.

3 LOOPINg CONSTRUCTS.

Most routines critical to performance will contain a loop. We saw in instruction scheduling that on the ARM loops are fastest when they count down towards zero. This section describes how to implement these loops efficiently in assembly.

# 23

Sunday

Decrementing Counted Loops:

for a decrementing loop of  $N$  iterations, the loop counter  $i$  counts down from  $N$  to 1 inclusive. The loop terminates with  $i=0$ . An efficient implementation is

MOV i, N

loop

{ loops body goes here }  $\& i \geq N, N-1, \dots$

SUBS i, i, #1

BGT loop.

The loop overhead consists of a subtraction setting the condition code followed by a conditional branch. On ARM7 & ARM9 this overhead costs four cycles per loop. If  $i$  is an array index, then you may want to count down from  $N-1$  to 0 inclusive instead so that you can access array element zero.

You can implement this in the same way by using a different conditional branch.

SUBS i, N, #1

loop

SUBS i, i, #1

BGE loop

# 25

Tuesday  
August

237-128 • WK 35

| M  | T  | W  | T  | F  | S  |
|----|----|----|----|----|----|
|    |    |    | 1  | 2  | 3  |
| 6  | 7  | 8  | 9  | 10 | 11 |
| 13 | 14 | 15 | 16 | 17 | 18 |
| 20 | 21 | 22 | 23 | 24 | 25 |
| 27 | 28 | 29 | 30 | 31 |    |

## Unrolled Counted Loops:

Loop unrolling reduces the loop overhead by executing the loop body multiple times. However, there are problems to overcome.

Function sets  $N$  bytes of memory at address to the byte value  $c$ . The function needs to be efficient, so we will look at how to unroll the loop without placing extra restrictions on the input operands.

Void my-memset(char\* s, int c, unsigned int N)

To be efficient for large  $N$ , we need to write multiple bytes at a time using STR or STM instructions. Therefore our first task is to align the array pointers. However, it is only worth us doing this if  $N$  is sufficiently large. Let's assume we can choose a threshold value  $T_1$  & only bother to align the array when  $N \geq T_1$ . Clearly  $T_1 \geq 3$  as there is no point in aligning if we don't have four bytes to write!

Now suppose we have aligned the array  $s$ . We can use ~~store~~ multiples of eight ~~words~~ ~~bytes~~ to set memory efficiently. Example: we can use a loop of few store multiples of eight words each to set 128 bytes on each loop. However, it will only be worth doing this if  $N \geq T_2 \geq 128$ , where

26

Wednesday  
August

WK 36 • 23B-127

|    |    |    |    |    |
|----|----|----|----|----|
| S  | M  | W  | F  | S  |
| 1  | 2  | 3  | 4  | 5  |
| 6  | 7  | 8  | 9  | 10 |
| 11 | 12 | 13 | 14 | 15 |
| 16 | 17 | 18 | 19 | 20 |
| 21 | 22 | 23 | 24 | 25 |
| 26 | 27 | 28 | 29 | 30 |

$T_2$  is another threshold to be determined, later on. Finally, we are left with  $N < T_2$  bytes to set. We can write bytes in blocks of four using STR until  $N < 4$ . Then we can finish by writing bytes singly with STRB to the end of the array.

### Multiple Nested Loops:

Example :- The matrix using this single counter in register count:

R EQU 40

S EQU 40

T EQU 40

a RN 0 ; Points to an rows x column matrix

b RN 1 ;

c RN 2 ;

sum RN 3

bval RN 4

cval RN 12

count / RN 14

### matrix-mul

STMFD sp!, {8H, 18H}

MOV Count, # (R - 1)

loop:

ADD count, count, # (T - 1) &lt;&lt; 8

loop j

ADD count, count, #(s-1) << 1b  
MOV sum, #0

loop k

LDR bval, [b], #4  
LDR eval, [c], #4 \*T  
~~POP~~ ~~count, bval, eval~~

SUBS count, count, #1 << 1b

MLA sum, bval, eval, sum

BPL loop - K

STR sum, [a], #4

SUB c, c, #4 \*S \*T

ADD c, c, #4

ADDS count, count, #(1 << 16) - (1 << 8)

SUBPL b, b, #4 \*S

BPL loop - i

SUB c, c, #4 \*T

ADDS count, count, #(1 >> 8) - 1

BPL loop - 1

LDHFD sp!, {r4, pc}

Here as you observe the assembly program you can see 3 loops are used loop i, loop j, loop k.  
So we can say it is a nested loop.

SEPTEMBER 2015

| S  | M  | T  | W  | T  | F  | S  |
|----|----|----|----|----|----|----|
|    | 1  | 2  | 3  | 4  | 5  |    |
| 6  | 7  | 8  | 9  | 10 | 11 | 12 |
| 13 | 14 | 15 | 16 | 17 | 18 | 19 |
| 20 | 21 | 22 | 23 | 24 | 25 | 26 |
| 27 | 28 | 29 | 30 |    |    |    |

WK 35 • 210-125

28  
Friday  
August

## Negative indexing.

This loop structure counts from  $-N$  to 0 (inclusive or exclusive) in steps of size STEP

RSB i, N, #0

loop

ADDS i, i, #STEP

BLT loop.

## Logarithmic Indexing

This loop structure count down from  $2^N$  to 1 powers of two.

For example. if  $N=4$  then it counts

16, 8, 4, 2, 1.

MOV i, #1

MOV i, i, LSL N

loop

MOVS i, i, LSR #1

BNE loop.

29

Saturday  
August

241-124 • WK 35

| M  | T  | W  | T  | F  | S  | S  |
|----|----|----|----|----|----|----|
| 6  | 7  | 8  | 9  | 10 | 11 | 12 |
| 13 | 14 | 15 | 16 | 17 | 18 | 19 |
| 20 | 21 | 22 | 23 | 24 | 25 | 26 |
| 27 | 28 | 29 | 30 | 31 |    |    |

## Programs :

### 1. Demonstrating logical operation

AREA LOGIC, CODE, READONLY

ENTRY

LDR R0, = 5

LDR R1, = 3

AND R4, R0, R1

ORR R5, R0, R1

EOR R6, R0, R1

BIC R7, R0, R1

END.

### 2. find factorial of a given number (subroutine).

AREA FACTORIAL, CODE, READONLY

ENTRY

START LDR R0, #5

BL FACT

LDR R4, = DST

STR R5, [R4]

STOP B STOP

FACT

MOV S R1, R0

MOVEA R5, #1

LOOP

8 SUBNES R1,R1,#1

MULNE R0,R1,R0

BNE LOOP

MOV R5,R0

MOV PC,R14

9 AREA FACT, DATA, READWRITE

10 DST DCD 0

11 END.

12 3. To find Fibonacci series

1 AREA FIB, CODE, READONLY

2 ENTRY

3 MOV R0,#00

4 SUB R0,R0,#01

5 MOV R1,#01

6 MOV R4,#05

LDR R2, =FIB0

BACK ADD R0,R1

STR R0,[R2]

ADD R2,#04

MOV R3,R0

MOV R0,R1

MOV R1,R3

CMP R4,#00

BNE BACK

STOP B STOP

AREA FIBONACCI , DATA, READWRITE

2015 END.