

Campus | B5

7mm×30行 40页

Embedded Processor Synthesis - ELEC6234

Campus®

B5 点线本 7mm×30行 40页

KOKUYO

B5

点线本

7mm×30行

40页

## RISC architecture in embedded applications

### # RISC (Reduced Instruction Set Computer)

RISC is a microprocessor that is designed to perform a smaller number computer instruction so that it can operate at a higher speed.

John Cocke of IBM Research in Yorktown, New York, noticed that about 20% of the instruction in a computer did 80% of the work.

SPARC

MIPS

ARM

Scalable Processor Architecture

Microprocessor without Interlocked Pipeline stage

Advanced RISC Machine

### # RISC Characteristic

① Small instruction set  
the instruction set contains simple, basic instructions

more complex operations are performed using sequences of simple instructions

② Fixed length instruction  
each instruction is the same length, so that it may be fetched in a single operation

③ 1 clock-cycle instruction  
most instructions complete in one clock cycle. This is achieved through pipelining which allows the processor to handle several instructions at the same time. Pipeline is the key technique used in RISC machines to improve performance

### # Pipelining: a key RISC technique

Pipelining is a design technique where the computer's hardware processes more than one instruction at a time, and doesn't wait for one instruction to complete before starting the next.

RISC machine has the same four stages of instruction execution as in our typical CISC machine: <sup>平均</sup>fetch, decode, execute & write. But these stages are executed in parallel. As soon as one stage completes, it passes on the result to the next stage and then begins working on another instruction.

In a typical pipelined RISC design, each instruction takes 1 clock cycle for each stage, so the processor can accept 1 new instruction per clock.

### # Pipelining issues

Pipelining enhances performance

[Scenario 1] a two-stage pipeline: { Fetch : retrieve instruction from memory  
Execute : perform an ALU operation with register data.



[Scenario 2] a three-stage pipeline for load/store

Fetch : retrieve instruction from memory

Execute : calculate memory address

Read/write : transfer data from/to memory to/from CPU

[Branching and pipelines] what to do with pre-fetched instructions if there is a branch?



[Delayed branch] pipeline optimisation: execute the instruction after the branch to increase pipeline efficiency.  
↳ ('sub' executes before 'branch')!



## # Main RISC advantages for embedded applications

[Simpler hardware] - As the instruction set of a RISC processor is so simple, it uses up much less chip space; extra functions, such as memory management units or floating point arithmetic units, can also be placed on the same chip. Cost dramatically decreases.

- Smaller chips allow a semiconductor manufacturer to place more parts on a silicon wafer, which can lower the per-chip cost.

[Shorter design cycle] Since RISC processors are simpler than equivalent CISC processors, they can be designed more quickly, and can take advantage of other technological developments sooner than corresponding CISC designs, leading to greater leaps in performance between generations.

[Low energy consumption] Energy has become the primary performance characteristic in embedded designs, especially those aimed at mobile consumer markets. Here RISC processors have a clear advantage over CISC designs: simpler hardware, smaller control logic are features that naturally lead to low power consumption.

## # RISC's disadvantages

### [Code quality]

- The performance of a RISC processor depends greatly on the quality of the code that is executing.
- RISC processors are harder to program efficiently than their CISC equivalents. If the instruction scheduling in a program is poor, the processors can spend quite a bit of time stalling: wait for the result of one instruction before it can proceed with a subsequent instruction.
- Instruction scheduling rules can be complicated in RISC processors, hence the performance of a RISC application depends critically on the quality of the code generated by the compiler.

### [System Design]

- They require more instructions, and hence memory, than CISCs to implement applications.
- RISC processors require very fast memory systems to feed them instructions. RISC-based systems typically contain large memory caches, usually on the chip itself. This is known as a first-level cache.

## # Why is CISC still around?

Why are these still CISC CPUs being developed?

Why is intel spending time and money to manufacture new version of the architecture which started in 1970s?

Answer: backward compatibility (向后兼容)

The IBM compatible PC is the most common computer in the world. Intel wanted a CPU that would be run all applications that are in the hands of more than 100 million users.

## # CISC-RISC convergence

Much of the RISC philosophy has been adopted in recent CISC architecture: e.g. pipelines, branch prediction, hardware stacks.

Recent RISC developments, such as high degree of pipelining and parallelism, do not necessarily lead to simple control structures.

Modern RISCs and CISCs do not tend to have vastly different clock speeds.

The real performance issue seems to lie not in architecture but in optimization of instruction sets and details of machine organization

Today, most CISC processors are based on hybrid CISC-RISC architecture

These designs use a decoder to convert CISC instructions into RISC instructions before execution. They are then processed by a RISC core, which performs a few basic instructions very quickly.

Having a RISC core is advantageous because it allows performance enhancing features, such as pipelining and branch prediction.

ARM

MIPS

SPARC

PowerPC

## Ultra-low energy bit-serial architectures

is an instruction set architecture for a processor that has datapath widths and data register

### # 1-bit (bit-serial) processors

widths of 1 bit wide

Were used between 1940s and 1960s when hardware was very expensive. More recently they have not been considered seriously because they are slow.

However, they have several advantages that may be useful in many-core context

① Small physical size

② Less power, also overall energy for computation might be less, but needs to be investigated further.

### # 1-bit processors have been used in CMOS era

△ Motorola ICU (Industrial Control Unit)

⇒ △ Typical MC14500 system



### # Recent bit-serial machines

#### Processing element (PE)



### # Advantages of single-bit processing

Extreme simplicity

Very low power consumption

Many-core 1-bit systems can be reconfigured easily

- Vector processing (Single Instruction - Multiple Data path) is straightforward

- Pipelining for higher performance is straightforward

- Networking using single-bit data transfer (e.g. SPI) is straightforward

- No huge number of 'wires' between cores (No huge energy losses in wiring)

- No complex switching for reconfiguration (Further space and energy savings)

## # 1-bit adder gate-level diagram

$$\text{Sum} = A \oplus B \oplus C_{in}$$

$$\text{Cont} = AB + AC_{in} + BC_{in}$$



Single-bit serial adder for multi-bit addition

## # Serial multiplication

$$3 \times 5 = 15$$

|                                                                                                             |                                                                                                                                                                                                                                                                      |
|-------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| $  \begin{array}{r}  011 \\  \times 101 \\  \hline  011 \\  000 \\  + 011 \\  \hline  01111  \end{array}  $ | $  \begin{array}{ccccccccc}  & a_2 & a_1 & a_0 & & & & & \\  & x b_2 & b_1 & b_0 & & & & & \\  & a_2 b_2 & a_1 b_1 & a_0 b_0 & & & & & \\  & a_2 b_1 & a_1 b_2 & a_0 b_1 & & & & & \\  & a_2 b_0 & a_1 b_0 & a_0 b_2 & & & & & \\  \hline  m_1 & m_0  \end{array}  $ |
|-------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|

Algorithm for  $m = AxB$ :① Set  $m^{(0)} = 0$ ② for  $j=0$  to  $n-1$ : perform  $m^{(j+1)} = m^{(j)} + A_{i,j} \cdot 2^j$ 

$$\begin{aligned}
 m_0 &= 00000 \\
 &\quad 00000 \\
 m_1 &= DDD11 = 00011 \\
 &\quad 00011 \\
 m_2 &= DDD00 = DDD11 \\
 &\quad 00011 \\
 m_3 &= 01100 = 01111
 \end{aligned}$$

## # 1-bit multiplication

Uses  $n^2$  clock cyclesSteps for  $M \times Q$ :

1. Clear product P (nreset = 0, Qj = 0, for n cycles)

2. for  $j=0$  to  $n-1$ ; //  $A_i = M_i$ , if  $Q_j = 1$  otherwise no add, shift P onlyfor  $i=0$  to  $n-1$ :

clock carry register and Product register

## # Enhanced bit-serial multiplier

In a parallel bit-serial processor, several PEs can be used to accelerate multiplication

[Bit-serial processing] always shift after finishing one bit of  $Q_j$ 

| i | j | $M_i$ | $Q_j$ | C | PH  | PL  |
|---|---|-------|-------|---|-----|-----|
| 0 | 0 | 1     | 1     | 0 | 000 | 000 |
| 0 | 0 | 1     | 1     | 0 | 001 | 000 |
| 1 | 0 | 1     | 1     | 0 | 011 | 000 |
| 2 | 0 | 1     | 1     | 0 | 111 | 000 |
|   |   |       |       |   | 000 | 100 |
| 0 | 1 | 1     | 0     | 0 | 011 | 100 |
| 1 | 1 | 1     | 0     | 0 | 011 | 100 |
| 2 | 1 | 1     | 0     | 0 | 011 | 100 |
|   |   |       |       |   | 001 | 110 |
| 0 | 2 | 1     | 1     | 0 | 000 | 110 |
| 1 | 2 | 1     | 1     | 1 | 000 | 110 |
| 2 | 2 | 1     | 1     | 1 | 000 | 110 |
|   |   |       |       |   | 100 | 011 |

加完右移 (个位加完 "+" 十位)

 $M_i \& Q_j$

## # Bit-serial filter example

Bit-serial architectures are used nowadays for addition and multiplication in DSP e.g. digital filters

Example: pipelined bit-serial FIR filter :  $y(n) = -\frac{1}{8}x(n) + \frac{1}{2}x(n-1)$   
 constant coefficient multiplications are implemented as shifts and adds  
 $y(n) = -x(n) + x(n) \cdot 2^{-3} + x(n-1) \cdot 2^{-1}$



## # Shift and add parallel multiplication

111 M  
X101 Q

111000 ADD  
011100 Shift  
001110 Shift  
1000110 ADD  
100011 Shift

按Q取位 } 0: R shift  
} 1: 加上从然后shift

因为Q是按位判断，用一位扔一位，而shift也是逐次判断结束后移位，所以可以直接利用Q的无用位做移位

\* Double length accumulator AR shares storage with multiplier Q

Assignment. Module a neurons

Weights will be given

Carry-out vs. Overflow  $\leftarrow$  Unsigned number      Carryout  $\neq$  Overflow

## ARM Embedded Processor Cores

### # Data Sizes and Instruction Sets

ARM is a 32-bit load/store RISC architecture

- The only memory accesses allowed are loads and stores
- Most internal registers are 32 bits wide.
- Most instructions execute in a single cycle.

When used in relation to ARM cores

- Halfword means 16 bits (two bytes)
- Word means 32 bits (four bytes)
- Doubleword means 64 bits (eight bytes)

ARM cores implement two basic instruction sets

- ARM instruction set - instructions are all 32 bits long
- Thumb instruction set - instructions are a mix of 16 and 32 bits

Thumb-2 technology added many extra 32- and 16-bit instructions to the original 16-bit Thumb instruction set.

Depending on the core, many also implement other instruction sets

- VFP instruction set - 32 bit (vector) floating point instructions
- NEON instruction set - 32 bit SIMD instructions
- Jazelle DBX - provides acceleration for Java VMs (with additional software support)
- Jazelle-RCT - provides support for interpreted languages.

### # Processor Modes

ARM has seven basic operating modes

- Each mode has access to its own stack space and a different subset of registers
- Some operations can only be carried out in a privileged mode

| Mode             | Description                                                               |                   |
|------------------|---------------------------------------------------------------------------|-------------------|
| Supervisor (SVC) | Entered on reset and when a Supervisor call instruction (SVC) is executed | Privileged Modes  |
| FIQ              | Entered when a high priority (fast) interrupt is raised                   |                   |
| IRQ              | Entered when a normal priority interrupt is raised                        |                   |
| Abort            | Used to handle memory access violations                                   |                   |
| Undef            | Used to handle undefined instructions                                     |                   |
| System           | Privileged mode using the same registers as User mode.                    | Unprivileged Mode |
| User             | Mode under which most Applications/OS tasks run                           |                   |

### # The ARM Register Set

User mode      IRQ      FIQ      Undef      Abort      SVC

|          |
|----------|
| v0       |
| v1       |
| v2       |
| v3       |
| ...      |
| v11      |
| v12      |
| v13 (sp) |
| v14 (lr) |
| v15 (pc) |

Spsr : Saved-program register  
Sp : Stack pointer  
Lr : Link register  
Pc : program counter

|     |
|-----|
| v8  |
| v9  |
| v10 |
| v11 |
| v12 |

|          |
|----------|
| v13 (sp) |
| v14 (lr) |

|          |
|----------|
| v13 (sp) |
| v14 (lr) |

|          |
|----------|
| v13 (sp) |
| v14 (lr) |

|          |
|----------|
| v13 (sp) |
| v14 (lr) |

ARM has 37 registers, all 32-bits long  
- A subset of these registers is accessible in each mode  
- System mode uses the User mode register mode

## 字节顺序，端序尾序

Endianness: refers to the sequential order used to numerically interpret a range of bytes in computer memory as a larger, composed word value.

## [APSR : Application Program Status Register]

APSR holds copies of the Arithmetic Logic Unit (ALU) status flag. They are also known as the condition code flags. They are used to determine whether conditional instructions are executed or not.

## [Current Program Status Register]

CPSR holds ① the APSR flags ② the current processor mode ③ interrupt disable flags  
 ④ current processor state ⑤ endianness state ⑥ execution state bits for the IT block

Only the APSR flags are accessible in all modes

The endianness bit (E) of the CPSR is accessible only in privileged software execution. It can be read by `MSR` and written by `SETEND`, but `SETEND` is the preferred instruction to write to the E bit.

## [SPSRs : Saved Program Status Registers]

The SPSR is used to store the current value of the CPSR when an exception is taken so that it can be restored after handling the exception. Each exception handling mode can access its own SPSR. User mode and System mode do not have an SPSR because they are not exception handling modes.

## [SP, the stack pointer]

Register R13 is used as pointer to the active stack.

In Thumb code, most instructions cannot access SP. The only instructions that can access SP are those designed to use SP as a stack pointer. The use of SP for any purpose other than as a stack pointer is likely to break the requirements of operating systems, debuggers, and other software systems, causing them to malfunction.

## [LR, the Link Register]

Register R14 is used to store the return address from a subroutine. At other times, LR can be used for other purposes.

When a BL or BLX instruction performs a subroutine call, LR is set to the subroutine return address. To perform a subroutine return, copy LR back to the program counter. This is typically done in one of two ways, after entering the subroutine with BL or BLX instruction:

Return with a BX LR instruction.

On subroutine entry, store LR to the stack with an instruction of the form: `PUSH t,LR` and use a matching instruction to return: `POP t,PC`...

Memory saving is particularly important on a part with on-chip code memory, and no external memory interface. → Thumb code is able to provide up to 65% of the code size of ARM and 160% of the performance of an equivalent ARM processor connected to a 16-bit memory system.

## [The Thumb instruction set]

The thumb instruction set is a subset of the most commonly used 32-bit ARM instructions. Thumb instructions are each 16 bits long, and have a corresponding 32-bit ARM instruction that has the same effect on the processor model. Thumb instructions operate with the standard ARM register configuration, allowing excellent interoperability between ARM and Thumb states.

On execution, 16-bit Thumb instructions are transparently decompressed to full 32-bit ARM instructions in real time, without performance loss. Thumb has all the advantages of a 32-bit core. Therefore it offers a long branch range, powerful arithmetic operations, and a large address space.

Instruction fetching is slower than code execution. Effectively more thumb mode instructions are required compared to ARM mode, but in thumb mode code size is less and therefore more instructions can be executed in the same time.

Thumb code: the focus is code size since on-chip flash ROM is limited.

## # Program Status Registers



Single Instruction Multiple Data

## [Condition code flags]

- Unsigned number  
Carry-out overflow {  
- N : Negative result from ALU  
- Z : Zero result from ALU  
- C : ALU operation carried out  
- V : ALU operation overflowed

## [Sticky Overflow flag - Q flag]

- Indicate if saturation has occurred

T bit : { T=0 Processor in ARM state  
T=1 Processor in Thumb state

J bit : J=1 Processor in Jazelle state

## [SIMD Condition code bits - GE[3:0]]

- Used by some SIMD instructions

E bit { E=0 : Data load/store is little endian  
E=1 : Data load/store is big endian

A bit : A=1 Disable imprecise data aborts

## [IF THEN status bits - IT[abcde]]

- Controls conditional execution of Thumb instructions

## Mode bits : specify the processor mode

Interrupt disable bits { I=1 Disables IRQ  
F=1 Disables FIQ

## # Instruction Set basics

The ARM Architecture is a Load/Store architecture

- No direct manipulation of memory contents
- Memory must be loaded into the CPU to be modified, then written back out.

Cores are either in ARM state or Thumb state.

- This determines which instruction set is being executed.
- An instruction must be executed to switch between states.

The architecture allows programmers and compilation tools to reduce branching through the use of conditional execution.

- Method differs between ARM and Thumb, but the principle is that most (ARM) or all (Thumb) instructions can be executed conditionally.

## # Data Processing Instructions

These instructions operate on the contents of registers - They DO NOT affect memory.

|                                         | Arithmetic    |                          | Logical       |                   | Move      |
|-----------------------------------------|---------------|--------------------------|---------------|-------------------|-----------|
| manipulation (has destination register) | ADC<br>ADD    | SBC<br>SUB<br>RSB<br>RSC | EIC<br>AND    | ORR<br>EOR<br>ORN | MW<br>MOV |
| comparison (set flags only)             | CMV<br>(ADDS) | CMP<br>(SUBS)            | TST<br>(ANDS) | TEQ<br>(EORS)     |           |

Syntax:

&lt;operation&gt; {&lt;cond&gt;} {st} {Rd,} Rn, Operand2

Examples:

- ADD Rd, R1, R2 ; Rd = R1 + R2
- TEQ Rd, R1 ; If Rd=R1, Z flag will be set
- MOV Rd, R1 ; Copy R1 to Rd

## # Single Access Data Transfer

Use to move data between one or two registers and memory.

Load Reg Double Store Reg Double

|       |      |                      |
|-------|------|----------------------|
| LDRD  | STRD | Doubleword           |
| LDR   | STR  | Word                 |
| LDRB  | STRB | Byte                 |
| LDRH  | STRH | Halfword             |
| LDRSB |      | Signed byte load     |
| LDRSH |      | Signed halfword load |

}



Syntax: { LDR {<size>} {<cond>} Rd , <address> } { STR {<size>} {<cond>} Rd , <address> }

Example: LDRB r0 , [r1] ; load bottom byte of r0 from the byte of memory at address in r1

## # Multiple Register Data Transfer

These instructions move data between multiple registers and memory

Syntax: <LDM | STM> {<addressing-mode>} {<cond>} Rb+! , <register list>

4 addressing modes → IA: Increment address After each transfer. This is the default, and can be omitted  
 IB: Increment address Before each transfer (ARM only)  
 DA: Decrement address After each transfer (ARM only)  
 DB: Decrement address After each transfer

Eg: { LDM r10 , {r0 , r1 , r4} } ; load registers, using no base  
 { PUSH {r4-r6} , PC } ; store registers, using SP base,

Also: Push/Pop, equivalent to STMDA / LDMDA with SP! as base register.



↑ Writeback is specified with the '?' suffix

## # Subroutines

Implementing a conventional subroutine call requires two steps { Branch to the address of the required subroutine.

These steps are carried out in one instruction, BL { The return address is stored in the link register (lr/r14)

Return is by branching to the address in lr

|     |     |                                                |                                                                                                       |
|-----|-----|------------------------------------------------|-------------------------------------------------------------------------------------------------------|
| op1 | B   | Branch                                         | { copy the address of next instruction into lr (r14) }                                                |
|     | BL  | Branch with link                               |                                                                                                       |
| op2 | BLX | Branch with link, and exchange instruction set | { There instructions can be used as branches in Thumb-2EE code, but cannot be used to change state. } |
|     | BX  | Branch and exchange instruction set            |                                                                                                       |
|     | BLX | Branch with link, and exchange instruction set |                                                                                                       |
|     | BXJ | Branch, and change to Jazelle execution        |                                                                                                       |

## # Supervisor Call (SVC)

`SVC + <cond> + <src number>`

Causes an SVC exception.

The SVC handler can examine the SVC number to decide what operation has been requested.

But the core ignores the SVC number

By using the SVC mechanism, an operating system can implement a set of privileged operations (system calls) which applications running in user mode can request.

Thumb version is unconditional.

## # Exception Handling

When an exception occurs, the core

- { Copies CPSR into SPSR - <mode>
- Sets appropriate CPSR bits
- Change to ARM state (if appropriate)
- Change to exception mode
- Disable interrupts (if appropriate)
- Stores the return address in LR - <mode>
- Sets PC to vector address
- Restore CPSR from SPSR - <mode>
- To return, exception handler needs to { Restore PC from LR - <mode>

|      |                       |
|------|-----------------------|
| 0x1C | FIQ                   |
| 0x18 | IRQ                   |
| 0x14 | (Reserved)            |
| 0x10 | Data Abort            |
| 0x0C | Prefetch Abort        |
| 0x08 | Supervisor Call       |
| 0x04 | Undefined Instruction |
| 0x00 | Reset                 |
|      | Vector Table          |

Cores can enter ARM state or Thumb state when taking an exception. - Controlled through settings in CPSR

## # Exception handling process

By Core { Save processor status: { Copies CPSR into SPSR - <mode>

  Stores the return address in LR - <mode>

  Adjust LR based on exception type

Change processor status for exception: { Mode field bits  
  ARM or Thumb State  
  Interrupt disable bits (if appropriate)  
  Sets PC to vector bits



By Software { Execute exception handler: <users code>

Return to main application: { Restore CPSR from SPSR - <mode>

  Restore PC from LR - <mode>

## # What is NEON

Single Input Multiple Data

NEON is a wide SIMD data processing architecture

- Extension of the ARM instruction set (V7-A)

- 32 x 64-bit wide registers (can also be used as 16 x 128-bit wide registers)

NEON instructions perform "Packed SIMD" processing

- Registers are considered as vector of elements of the same data type.

- Data types available: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single prec. float.

- Instructions usually perform the same operation in all lanes.



## # NEON Coprocessor registers

NEON has a 256-byte register file { Separated from the core registers (r0-r15)  
Extension to the VFPv2 register file (VFPv3)

Two different views of the NEON registers

- 32x 64-bit registers (D0-D31)
- 16x 128-bit registers (Q0-Q15)

Enables register trades-offs

- Vector length can be variable
- Different registers available



## # NEON vectorizing example

How does the compiler perform vectorization?

```
void add_int (int *pa,
              int *restrict pb,
              unsigned int n, int x)
{
    unsigned int i;
    for (i=0; i<(n&~3); i++)
        pa[i] = pb[i] + x;
}
```

$\cancel{n \& ~3}$

Make Sure  
 $n \geq 4$   
If NOT

```
void add_int (int *pa, int *pb, unsigned n, int x)
{
    unsigned int i;
    for (i=(n&~3)>>2); i; i--)
    {
        *(pa+i) = *(pb+i) + x;
        *(pa+i+1) = *(pb+i+1) + x;
        *(pa+i+2) = *(pb+i+2) + x;
        *(pa+i+3) = *(pb+i+3) + x;
    }
    pa+=4;
    pb+=4;
}
```

## ① Analyze each loop

- Are pointer accesses safe for vectorization?
- What data types are being used?
- How do they map onto NEON vector registers?
- Number of loop iterations

## # Memory Types

Each defined memory region will specify a memory type

The memory type controls the following:

- Memory access ordering rules
- Catching and buffering behavior

There are 3 mutually exclusive memory types:

- Normal
- Device
- Strongly Ordered

Normal and Device memory allow additional attributes for specifying

- The cache policy
- Whether the region is shared
- Normal memory allows you to separately configure Inner and Outer cache policies (discussed in the Caches and TCMs module)



Typical memory system can have multiple levels of cache

- level 1 memory system typically consists of L1-cache, MMU/MPU and TCMS
  - level 2 memory system (and beyond) depends on the system design.

Memory attributes determine cache behavior at different levels

- Controlled by the MMU/MPU (discussed later)
  - Inner Cacheable attributes define memory access behavior in the L1 memory system
  - Outer Cacheable attributes define memory access behavior in the L2 memory system (if external) and beyond (as signals on the bus)

Before caches can be used, software setup must be performed.

## # ARM cache features

## Harvard Implementation for L1 caches – Separate Instructions and Data caches

Cache Lockdown - Prevents line Eviction from a specific Cache Way

### Pseudo-random and Round-robin replacement strategies

- Unused lines can be allocated before considering replacement

## Non-blocking data cache

- Cache lookup can hit before a Linefill is complete (also checks Linefill buffer)

## Streaming , Critical - Word - First

- Cache data is forwarded to the core as soon as the requested word is received in the Linefill buffer
  - Any word in the cache line can be requested first using a 'WRAP' burst on the bus

ECC or parity checking

$$\text{Byte} = 8 \text{ bits}$$

Word = 16 bits

# Example 32KB ARM cache

Address



v: valid bit

`d : dirty bit`

Cache has 8 words of data in each line

Each cache line contains Dirty bit(s)

↳ Indicates whether a particular cache line was modified by the ARM core

Each cache line can be Valid or Invalid

→ An invalid line is not considered when performing a Cache Lookup.

## # Cortex MPCore Processors

Stand Cortex cores, with additional logic to support MPCore - Available as 1-4 CPU variants

Includes integrated : ① Interrupt controller ② Snoop Control Unit (SCU) ③ Timers and Watchdogs

SCU connects one to four Cortex-A9 processors to the memory system through the AXI interfaces.

- Maintain data cache coherency (一致性) between the Cortex-A9 processors
- Initiate L2 AXI memory access.
- Arbitrate between Cortex-A9 processors requesting L2 accesses.
- Manage ACP accesses.

## # Interrupt Controller

MPCore processors include an integrated Interrupt Controller (IC)

- Implementation of the Generic Interrupt Controller (GIC) architecture

The IC provides :

- Configurable number of external interrupts (max 224)
- Interrupt prioritization and preemption
- Interrupt routing to different cores

Enable per CPU

- When not enabled, that CPU will use legacy nIRQ[n] and nFIQ[n] signals



## # System Timer - SysTick

Flexible System timer

- 24-bit self-reloading down counter
- Reload on count == 0
- Optionally cause SysTick interrupt on count == 0

- Reload Register ↗ provides value required for 1ms interval

- Calibration value ↗ - STCALIB inputs tied to appropriate value (System Tick Calibration Value Register)

Clock source is CPU clock or optional external timing reference

- Software selectable if provided

- Reference pulse widths High/Low must exceed processor clock period  
• Counted by sampling on processor clock

↳ The STCALIB and STCLKEN/STCLK inputs relate to the provision of a SYSTICK (System Tick) capability.  
The system counter in Cortex-M0 core is clocked by the free-running clock (e.g. SCK, FCLK), and it can count either the free-running clock itself, or it can count the cycles of an independent timing reference signal STCLKEN or STCLK (typically at a much lower frequency) if one has been provided on the SoC.

To indicate which timing reference has been provided on the SoC, and the exact properties of that reference, the designer must supply reference information (usually statically tied off) on the STCALIB inputs.

## # Modes Overview

ARM Processor



## # Instruction Set Examples

### [Data Processing]

|     |                  |                                    |
|-----|------------------|------------------------------------|
| MOV | $r2, r5$         | $; r2=r5$                          |
| ADD | $r5, #0x24$      | $; r5=r5+36$                       |
| ADD | $r2, r3, r4$     | $; r2=r3+(r4*4)$                   |
| LSL | $r2, #3$         | $; r2=r2*8$                        |
| MVT | $r9, #0x1234$    | $; Upper halfword of r9 = #0x1234$ |
|     | $r0, r1, r2, r3$ | $; r0=(r1*r2)+r3$                  |

Logic shift lefts  
Move Top writes a 16-bit immediate value to the MLA top halfword of a register  
Multiply-Accumulate

### [Memory Access]

Store Reg Byte → STRB  $r2, [r0, r1]_4$  ; Store lower byte in r2 at address {r0+r1} 4  
load → LDR  $r0, [r1, r2, LSL\#1]$  ; load r0 with data at address {r1+r2+4}

Handler mode can also be re-entered on exception return

### [Program Flow]

BL <label>  
; PC relative branch to <label> location, and return address stored in LR (r14)

## # Exception Handling

Exception types : ① Reset ② Non-maskable Interrupt (NMI) ③ Faults ④ PendSV ⑤ SVCall  
 ⑥ External Interrupt ⑦ SysTick Interrupt

Exception processed in Handler mode (except Reset)

- Exceptions always run privileged.

### Interrupt handling

- Interrupts are a sub-class of exception

- Automatic save and restore of processor registers (xPSR, PC, LR, R12, R3-R0)

- Allows handler to be written entirely in 'C'



## # External Interrupts

Cortex-Mx Integration Layer

External interrupts handled by Nested Vectored Interrupt Controller (NVIC) - Tightly coupled with processor core

One Non-maskable Interrupt (NMI) supported.

Number of external interrupts is implementation-defined.

## # Reset Behaviour



- ① A reset occurs (Reset input was asserted)
- ② Load MSP (Main Stack Pointer) register initial value from address 0x00
- ③ Load reset handler vector address from address 0x04
- ④ Reset handler executes in Thread Mode
- ⑤ Optional : Reset handler branches to the main program

## # Exception Behaviour



- ① Exception occurs
  - Current instruction stream stops
  - Processor accesses Vector table
- ② Vector address for the exception loaded from the vector table
- ③ Exception handler executes in Handler Mode
- ④ Exception handler returns to main.

## # Interrupt Service Routine Entry

Interrupt latency is the time that elapses from when an interrupt is generated to when the source of the interrupt is serviced.

When receiving an interrupt the processor will finish the current instruction for most instructions

- To minimize interrupt latency, the interrupt can take an interrupt during the execution of a multi-cycle instruction  
 Processor State automatically saved to the current stack.

- 8 registers are pushed = PC, R0-R3, R12, LR, xPSR

- Follows ARM Architecture Procedure Calling Standard (AAPCS) 调用规范

During (or after) state saving the address of the ISR is read from the Vector Table  
 Link Register is modified for interrupt return.

First instruction of ISR executed

- For Cortex-M3 or Cortex-M4 the total latency is normally 12 cycles, however, interrupt late-arrival and interrupt tail-chaining can improve IRQ latency.

ISR executes from Handler mode with Main stack.

## # Returning From Interrupt

Can return from interrupt with the following instructions when the PC is loaded with 'magic' value of 0xFFFF\_FFX (Some format as EXC\_RETURN)

- LDR PC, ...
- LDM/POP which includes loading the PC
- BX LR (most common)

If no interrupts are pending, foreground state is restored

- Stack and state specified by EXC\_RETURN is used

- Context restore on Cortex-M3 and Cortex-M4 requires 10 cycles

If other interrupts are pending, the highest priority may be serviced

- Serviced if interrupt priority is higher than the foreground's base priority

- Process is called **Tail-chaining** as foreground state is not yet restored

- Latency for servicing new interrupt is only 6 cycles on M3/M4 (state already saved)

If state restore is interrupted, it is abandoned

- New ISR executed without state saving (original state still intact and valid)

- Must still fetch new vector and refill pipeline (6-cycle latency on M3/M4).

### [尾链]

Tail-chaining is back-to-back processing of exceptions without the overhead of state saving and restoration between interrupts. The processor skips the pop of eight registers and push of eight registers when exiting one ISR and entering another because this has no effect on the stack contents.

The processor tail-chains if a pending interrupt has higher priority than all stacked exceptions

## # Processor Memory Map



## # Memory Types and Properties

There are 3 different memory types: ① Normal ② Device ③ Strongly Ordered

Normal memory is the most flexible memory type:

- Suitable for different types of memory, for example, ROM, RAM, Flash and SDRAM

- Access may be restarted

- Caches and Write Buffers are permitted to work alongside Normal memory

Device memory is suitable for peripherals and I/O devices

- Caches are not permitted, but write buffers are still supported.

- Unaligned accesses are unpredictable (非対齊アクセス)

- Accesses must not be restarted.

Load/store multiple instructions should not be used to access Device memory

Strongly ordered memory is similar to device memory.

- Buffers are not supported and the PPB is marked Strongly Ordered.

|     |    |
|-----|----|
| S0  | D0 |
| S1  |    |
| S2  |    |
| S3  | D1 |
| S4  |    |
| S5  |    |
| S6  |    |
| S7  |    |
| S8  | D2 |
| S9  |    |
| S10 |    |
| S11 | D3 |

## # Cortex-M4 Floating Point Registers

FPU provides a further 32 single-precision registers

Can be reviewed as either

|                                                                                            |
|--------------------------------------------------------------------------------------------|
| { 32x32-bit registers<br>16x64-bit double word registers<br>Any combination of the above } |
|--------------------------------------------------------------------------------------------|

## # Example ARM-based System



ARM core deeply embedded within an SoC  
 - External debug and trace via JTAG or CoreSight interface  
 Design can be have both external and internal memories  
 - Varying width, speed and size - depending on system requirement  
 Can include ARM licensed Corelink peripherals  
 - Interrupt controller, since core only has two interrupt sources  
 - other peripherals and interfaces.  
 Can include on-chip memory from ARM Artisan Physical IP libraries  
 (Advanced Microcontroller Bus Architecture)

## # An Example AMBA System



## # Advanced High Performance Bus



## # AXI Multi-Master System Design



## # Advanced extensible Interface

# MIPS Architecture

## # MIPS Architecture Summary

RISC - Reduced instruction set computer, developed by MIPS Technologies in mid 1980s.

One of the simplest RISC architectures, small hardware, low power consumption

1 instruction per clock cycle in basic version.

Dual-bus architecture with 32 general purpose registers connected to ALU

Separate SRAM

A range of versions - 32-bit vs 64-bit data

- Single cycle CPU vs pipelined CPU

Recently synthesizable MIPS cores for embedded FPGA applications are being offered by a number of third party vendors

## # Basic MIPS architecture



## # MIPS Instruction formats

Instructions are divided into three types: R, I and J

Every instruction starts with a 6-bit opcode

R-type instructions specify three registers, a shift amount field, and a function field;

I-type instructions specify two registers and a 16-bit immediate value

J-type instructions contain a 26-bit jump target address

| Type | 31-bit format            | (31:0) | 14:11 | 10:6    | 5:0      |
|------|--------------------------|--------|-------|---------|----------|
| R    | opcode(6)<br>rs(5)       | rs(5)  | rt(5) | rd(5)   | shamt(5) |
| I    | opcode(6)<br>rs(5)       | rs(5)  | rt(5) | imm(16) | func(6)  |
| J    | opcode(6)<br>address(26) |        |       |         |          |

## # Sample MIPS instructions

No reg. for D, C, Z, O, the value directly comes out from the ALU, branch without saving state

ADD \$2, \$3, \$4 ; \$2=\$3+\$4

- R-type ALU instruction

- Opcode is 01's, rd=2, rs=3, rt=4, func = 000010  
- 000000 00011 00100 00010 00000 00010

BEQ \$3, \$4, 4 ; branch if equal, i.e. \$3 == \$4

- I-type conditional branch instruction

- Opcode is 101011, rs=3, rt=4, imm=4 (skip next 4 instructions)

JALR \$2 ; save PC+1 in \$31, jump to [\$3]

- R-type jump instruction

- Opcode is 0's, rs=3, rt=0, rd=31 (by default), func = 001001  
- 000000 00011 00000 11111 00000 001001

SW \$2, 128(\$3) ; store word

- I-type memory address instruction

- Opcode is 101011, rs=3, rt=2, imm=128  
- 101011 00011 00010 00000000000000000000

ADDI \$2, \$3, 12 ; \$2=\$3+12

- I-type ALU instruction

- Opcode is 001000, rs=3, rt=2, imm=12  
- 001000 00011 00010 0000000000001100

J 128 ; jump to address PC ← PC<sub>31-28</sub> :: 128 :: 1600

- J-type pseudodirect jump instruction PC ← PC<sub>31-28</sub> :: IR<sub>25-0</sub> :: 1600

- Opcode is 000010, 26-bit pseudodirect address is 128/4 = 32  
- 000010 00000000000000000000000000000000

$$\begin{array}{ccccccc}
 R & \text{opcode} & + & \text{rs} & + & \text{rt} & + \text{rd} \\
 I & \text{opcode} & + & \text{rs} & + \text{rt} & + \text{rd} & \text{immediate} \\
 J & \text{opcode} & + & \text{rs} & + \text{rt} & + \text{rd} & \text{address} \cdot 2^{25-0}
 \end{array}$$

done: branch not equal /

## # Synthesised instructions

Some MIPS assembler instructions don't have direct hardware implementation.

- Eg:  $\text{abs } \$2, \$3 ; \$2 = \text{abs } (\$3)$

Resolved to:

- $\text{bge } \$2, \$3, \text{pos} ; \text{branch on greater or equal to 0}$
- $\text{sub } \$2, \$0, \$3 ; \$2 = -\$3$
- $\text{j out}$
- $\text{pos} : \text{add } \$2, \$0, \$3 ; \$2 = \$3$
- $\text{out} : \dots$

- Eg:  $\text{rol } \$2, \$3, \$4 ; \$3 = \$3 \text{ rotated left by } \$4 \text{ bits}$

Resolved to:

- $\text{addi } \$1, \$0, 32 ; \text{load } 1b10000 \text{ to } \$1$
- $\text{sub } \$1, \$1, \$4 ; \$1 = 1b10000 - \$4$
- $\text{srlv } \$1, \$3, \$1 ; \text{shift right logical variable} : \$1 = \$3 \gg \$1$
- $\text{sllv } \$2, \$3, \$4 ; \text{shift left logical variable} : \$2 = \$3 \ll \$4$
- $\text{or } \$2, \$2, \$1 ; \$2 = \$2 \text{ or } \$1$

REC  
30 min

The only reason required a pipeline is to speed up the instruction execution.  
Not something we want in Embedded systems

The speed depends on the propagation delay (maximum steps: 5)

## # Translation from C to MIPS machine code.

⇒ An if statement { if ( $i == j$ )  $f = g + h$ ;  
else  $f = g - h$ ;

[MIPS code] ;  $i \rightarrow \$1, j \rightarrow \$2, f \rightarrow \$3, g \rightarrow \$4, h \rightarrow \$5$   
 $bne \$1, \$2, \text{Lelse} ; \text{if } i \neq j, \text{ go to Lelse}$   
 $\text{add } \$3, \$4, \$5 ; f = g + h$   
 $j \text{ Lexit} ; \text{go to Lexit}$   
 $\text{Lelse} : \text{sub } \$3, \$4, \$5 ; f = g - h$   
 $\text{Lexit} : \dots$

⇒ A while statement and an array: while ( $A[i] == c$ )  
 $i += 1;$

[ $sll \$d, \$t, h$ ]  $\$d = \$t \ll h$  ; shift left logical

[MIPS code] ;  $i \rightarrow \$1, c \rightarrow \$2, A[i] \rightarrow \$3$   
; Address of  $A[i]$  in RAM →  $\$4$   
; Value of  $A[i]$  →  $\$5$

loop: ~~sll~~  $\text{sll } \$4, \$1, 2$  ;  $\$4 = i \times 4$ , prepare for byte count  
 $\text{add } \$4, \$4, \$3$  ;  $\$4 = \text{address of } A[i]$   
 $bne \$1, \$2, \text{Loop}$  ; test while condition  
 $\text{add } \$1, \$1, 1$  ;  $i += 1$   
 $j \text{ Loop}$

exit: ...

⇒ A for loop in C: for ( $i=0$ ;  $i < n$ ;  $i++$ )  $A[i] = B[i] + 10$ ;

[MIPS code]

```

xor $2, $2, $2 # $2=0, index register (i)
li $3, n # $3=n, iteration limit
sll $3, $3, 2 # multiply n by 4 (convert word count to byte count)
li $4, a # $4 = address of A[i] (assume A[i] < 2^16)
li $5, b # $5 = address of B[i] (assume B[i] < 2^16)

loop: add $6, $5, $2 # $6 = address of B[i]
    lwr $7, 0($6) # load B[i] from memory
    add $7, $7, 10 # $7 = B[i] + 10
    add $6, $4, $2 # $6 = address of A[i]
    sllr $7, 0($7) # store $7 into A[i]
    addi $2, $2, 4 # increment i (add 4 to maintain word count)
    bne $2, $3, loop # branch if index < n (bit-branch if less than)

```

## # Pipelined MIPS



## [MIPS pipeline stages]

## - Fetch (F)

read next instruction from memory, increment address counter  
assume 1 cycle to access memory

## - Decode (D)

read register operands, resolve instruction in control signals, compute branch target

## - Execute (E)

execute arithmetic / resolve branches

## - Memory (M) - Load / Store instructions only

Perform load/store accesses to memory  
Assume 1 cycle to access memory

## - Write back (W)

write arithmetic results to register file.



## # MIPS pipeline data hazards

## • Data hazards

- Register values "read" in decode, writing during write-back
- Hazard occurs when dependent instruction separated by less than 2 slots

• Examples :

|                         |                       |                       |
|-------------------------|-----------------------|-----------------------|
| - ADD \$2, \$X, \$X (E) | ADD \$2, \$X, \$X (M) | ADD \$2, \$3, \$4 (W) |
| - ADD \$X, \$2, \$X (D) | ...                   | ...                   |
| - ...                   | ADD \$X, \$2, \$X (D) | ...                   |
| - ...                   | ...                   | ADD \$X, \$2, \$3 (D) |

Conflict

- In most cases, data generated in same stage as data is required (EX)

## • Data forwarding

|                         |                       |                                   |
|-------------------------|-----------------------|-----------------------------------|
| - ADD \$2, \$X, \$X (M) | ADD \$2, \$X, \$X (W) | ADD \$2, \$3, \$4 (cont-of-pipes) |
| - ADD \$X, \$2, \$X (E) | ...                   | ...                               |
| - ...                   | ADD \$X, \$2, \$X (E) | ...                               |
| - ...                   | ...                   | ADD \$X, \$2, \$3 (E)             |

## • Load hazards

- Stalls required when data is not produced in same stage as it is needed for a subsequent instruction
- Examples :

LW \$2, 0 (\$X) (M)

ADD \$X, \$2 (E)

- When this occurs, insert a "noop" into EX state, stall F and D

LW \$2, 0 (\$X) (W)

NOOP (M)

ADD \$X, \$2 (E)

A W

E

&

E

\$2未被写回，就执行加法，加法加的是上期的数，加入一个noop，在同一周期内，前数被写回

- Forward from W to E

## # Example of MIPS pipeline operation



|                    | 1 | 2 | 3    | 4   | 5   | 6   | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|--------------------|---|---|------|-----|-----|-----|---|---|---|----|----|----|----|----|----|
| add \$6, \$5, \$2  | F | D | E    | (M) | W   |     |   |   |   |    |    |    |    |    |    |
| lw \$7, 0(\$6)     | F | D | (E)  | M   | (W) |     |   |   |   |    |    |    |    |    |    |
| addi \$7, \$7, 10  | F | D | noop | (E) | M   | (W) |   |   |   |    |    |    |    |    |    |
| add \$6, \$4, \$2  |   | F | D    | E   | (M) | W   |   |   |   |    |    |    |    |    |    |
| sw \$7, 0(\$6)     |   | F | D    | (E) | M   | W   |   |   |   |    |    |    |    |    |    |
| addi \$2, \$2, 4   |   | F | D    | E   | (M) | W   |   |   |   |    |    |    |    |    |    |
| blt \$2, \$3, loop |   | F | D    | (E) | M   | W   |   |   |   |    |    |    |    |    |    |
| add \$6, \$5, \$2  |   | F | D    | E   | M   | W   |   |   |   |    |    |    |    |    |    |



## # Steps in Embedded processors synthesis (picoMIPS case study)

Develop and test ALU with typical operations: ADD, SUB, etc.

Develop General Purpose Registers (GPR) and test

Develop a simple Instruction Decoder and test with ALU and GPR

- In the decoder implementation NOP instruction, ALU instructions, LDI

Develop program Memory

Develop program Counter

Enhance instruction Decoder with J and typical conditional branches, e.g. BEQ, ~~BNE~~<sup>BLO</sup> etc

⇒ Basic processor core is functional at this point

Possible enhancements:

- Add data RAM
- RAM related instructions, eg. LD, ST
- Develop an I/O port, and implement IN and OUT instructions
- Add ALU hardware to support MUL instruction and shift instructions.
- Implement calls and returns

## Your coursework

- Modify the picoMIPS to implement a prescribed program
- Write a formal report
- Challenge: the smallest possible implementation on Altera Cyclone IV

## # FPGA Synthesis of a Microprocessor Core Notes on ALU Synthesis

### # Approaches to ALU Synthesis

Methods to synthesise ADD, SUB

- 1) Use atb in 9-bits to extract carry on 9th bit
- 2) Use atb in 8-bits, but add extra logic to extract carry from ALU MSB slice.
- 3) Synthesise ALU on structural level using 1-bit slices

For small hardware size use method 2) or 3)

Subtraction can be implemented using 2's complement subtraction rule:

- $A - B = A + \neg B + 1$
- This way the same hardware can be used for ADD and SUB

2's complement overflow (if needed!) can be obtained from simple logic as shown in ALU code below.

### # Writing synthesisable SystemVerilog code for combinational logic synthesis.

When writing code for decoders or multiplexers, use conditional statements if, then, else 1 case

Common trap:

Conditional statements with incompletely specified signals cannot be mapped into combinational logic:

Use the default clause:

Assign default values to all signals driven by combinational logic blocks (examples will follow)

In arithmetic circuit synthesis

Arithmetic operations + - are mapped into standard adders/subtractors

Synthesis tools, e.g. Simplify Pro/Quartus map the multiplication operator \* into dedicated multipliers if available, or combinational (cellular) multipliers - fast but very costly.

if(a)  
b=1'b0;  
else  
b=1'b1;

// b is incompletely specified

### # Decoder Example

```
module decoder2to4 (
    input logic [1:0] i, output logic [3:0] outp);
    always_comb
    begin
        outp = 4'b0000;
        unique case (i)
            0: outp[0] = 1'b1;
            1: outp[1] = 1'b1;
            2: outp[2] = 1'b1;
            3: outp[3] = 1'b1;
        endcase
    end
endmodule
```



### # Multiplexer example

```
module mux4to1 (
    input logic [3:0] i, input logic [1:0] sel, output logic outp);
    always_comb
    unique case (sel)
        0: outp = i[0];
        1: outp = i[1];
        2: outp = i[2];
        3: outp = i[3];
    endcase
endmodule
```

上图 outp0 和 outp3 分别于 i0 和 i3 And .



## # Arithmetic circuits - 8-bit adder

module adder8

input logic [7:0] A, B, input logic Cn,  
output logic [7:0] Sum, output logic Cont;

assign {Cont, Sum} = {A[7], A} + {B[7], B} + Cn; // assign-extend by 1 bit  
endmodule

## # Sample ALU module

module alu #(parameter n=8)

input logic [n-1:0] a, b, //ALU operands  
input logic [2:0] func, //ALU function  
output logic [3:0] flags, //ALU flags V, N, Z, C  
output logic [n-1:0] result; //ALU result

always\_comb

```
begin
  if (func == 'RSUB)
    b1 = ~b + 1'b1;
  else
    b1 = b;
```

ar1 = a + b1; //Adder result  
ar = ar1; // n-bit adder

end

always\_comb

begin

// default output values; prevent latches

flags = 3'b0;  
result = a; // default

case(func)

'RA = result = a;

'RB = result = b;

'RADD = begin:

result = ar;

flags[3] = a[7] &amp; b[7] &amp; ~result[7] | a[7] &amp; ~b[7] &amp; result[7]; // V ← Overflow

flags[2] = a[7] &amp; b[7] | a[7] &amp; ~result[7] | b[7] &amp; ~result[7]; // C ← carry on MSB

end

'RSUB = begin

result = ar;

flags[2] = a[7] &amp; ~b[7] &amp; result[7] | a[7] &amp; b[7] &amp; result[7]; // V

//C - note: pin 19 PPS inverts carry when subtracting

flags[0] = a[7] &amp; ~b[7] | a[7] &amp; result[7] | ~b[7] &amp; result[7];

end

'ROR = result = a/b;

'RAND = result = a &amp; b;

'RXOR = ...

endcase

always\_comb

flags[0] = result == {n{1'b0}}; // Z ← zero result

flags[2] = result[n-1]; // N ← Negative result

end

endmodule

```
'define RA 3'b000
'define RB 3'b001
'define RADD 3'b010
'define RSUB 3'b011
'define RAND 3'b100
'define ROR 3'b101
'define RXOR 3'b110
'define RNOR 3'b111
```



## # Carry Logic

| A | B | Cin | Sum | Cont |
|---|---|-----|-----|------|
| 0 | 0 | 0   | 0   | 0    |
| 0 | 0 | 1   | 1   | 0    |
| 0 | 1 | 0   | 1   | 0    |
| 0 | 1 | 1   | 0   | 1    |
| 1 | 0 | 0   | 1   | 0    |
| 1 | 0 | 1   | 0   | 1    |
| 1 | 1 | 0   | 0   | 1    |
| 1 | 1 | 1   | 1   | 1    |

When coding addition as A+B

We don't know Cin at MSB, but we know Sum[7] (Cin在Synthesis中,没有单独声明)

Carry is set at MSB slice if:

 $A[7] \& B[7] | B[7] \& \neg \text{Sum}[7] | A[7] \& \neg \text{Sum}[7]$  ← A或B的最高位为1, 但和的最高位为0

那么

In subtraction A-B, complement B[7]

## # Detection of 2's complement overflow

$A[7] \& B[7] \& \neg \text{Sum}[7] | \neg A[7] \& \neg B[7] \& \text{Sum}[7]$

Overflow occurs in addition if the signs of both operands are the same, but the result sign is different.

## Synthesis of registers and memory

### # Sequential logic synthesis

Registers and memory in processors

- Program Counter
- General Purpose Register File
- Instruction Register (not in picoMIPS)
- RAM - sequential RAM is modern FPGAs

Flip-flops :  
Clear  
Preset  
ClockEnable

### # RAM Synthesis

RAMs used in FPGA designs are typically arrays of latches with separate input and a separate output data bus, an address bus and a Write Enable signal; read is asynchronous.

Modern FPGA (e.g. Cyclone V) support only synchronous memory, where both read and write is synchronous.

Note that there many types of RAMs e.g.

- RAM with synchronous read
- RAM with one Enable controlling both ports
- RAM with Separate Enables controlling each port
- Multiple-Port RAMS

1853 Combinational logic

### # Standard RAM with asynchronous read

old template

```
module ram128x8 ( input logic we,
                    input logic [6:0] address,
                    input logic [7:0] din,
                    output logic [7:0] dout );
    // This 2-dimensional array defines the ram memory
    logic [7:0] ram [127:0];

```

```
// write block
always_latch
if (we)
    ram[address] <= din;
// Asynchronous read block
assign dout = ram[address];
```

endmodule.

```
module ram128x8sync ( output logic [7:0] dout,
                        input logic [6:0] address,
                        input logic [7:0] din,
                        input logic we, clk);
    logic [7:0] mem [127:0];

```

```
always_ff @ (posedge clk)
begin
    if (we)
        mem[address] <= din;
    dout <= mem[address];
end
endmodule
```

Synchronous RAM cannot provide the operand in one clock cycle  
 ① Re-size the instruction  
 ② Load the memory to its output

Altera Cyclone V devices contain two types of memory blocks. Both are Synchronous.

- ① M10K blocks - 10-kilobit (Kb) blocks for larger memory
- ② Memory logic array (MLABs) - 64-bit memory blocks for small memories. Each MLAB can be configured as ten 32x2 blocks, giving one 32x20 simple dual-port SRAM (Static random access memory).

## # Program Counter

```

module pc #(parameter Psize=6) // up to 64 instructions
    (input logic clk, reset, PCincr, PCabsbranch, PCrelbranch,
     input logic [Psize-1:0] Branchaddr,
     output logic [Psize-1:0] PCout);

    logic [Psize-1:0] Rbranch; // temp variable for addition operand

    always_comb // multiplexer to select next instruction offset
    if (PCincr)
        Rbranch = {{Psize-1}, {1'b0}}; // add1
    else
        Rbranch = Branchaddr; // add branch addr

    always_ff @(posedge clk or posedge reset) // async reset
    if (reset) // reset
        PCout <= {Psize{1'b0}};
    else if (PCincr | PCrelbranch) // increment or branch relative
        PCout <= PCout + Rbranch;
    else if (PCabsbranch) // absolute branch, load branch addr.
        PCout <= Branchaddr;
endmodule.

```



## # Program Memory

address width      instruction width

```

module prog #(parameter psize=6, isize=24)
    (input logic [Psize-1:0] address,
     output logic [isize:0] instr);

    logic [isize:0] progMem [(1 << psize)-1:0];

    // Get memory contents from file
    initial
        $readmemh("prog.hex", progMem);

    // Program memory read
    assign instr = progMem [address];
endmodule // end of module prog

```

## Coursework

### # Introduction

This exercise is done individually and the assessment is :

- By formal report describing the final design, its development, implementation and testing.
- By a laboratory demonstration of the final design on an Altera FPGA development system.

### # Specification

The objective of this exercise is to design an n-bit implementation of picoMIPS

The design should be as small as possible in terms of FPGA resources but sufficient to implement the affine transformation algorithm described below.

The size const function of the design is defined as follows :

$$\text{Cost} = \text{number of logic Elements used} + \max(0, \text{number of embedded multipliers used} - 2) + 30 \times \text{Kbits of RAM used}$$

Each logic Element has a flip-flop hence flip-flops are included in the above cost figure. The cost figure should be calculated for Altera Cyclone IV EP4CE115 and should be as low as possible. Altera Cyclone IV has 266 18x18 bit embedded hardware multipliers. If embedded multipliers are used in your implementation, up to two of them are 'free', i.e. they do not contribute to the size cost.

To demonstrate the cost figure of your design show in your report Altera Quartus synthesis statistics for Cyclone IV EP4CE115

### # Affine Transformation Algorithm 仿射变换

An affine transformation is a geometrical transformation that preserves co-linearity, i.e. all points lying on a line will also lie after the transformation and distance ratios are preserved. For two-dimensional images, the general affine transformation can be expressed as:

$$\begin{vmatrix} x_2 \\ y_2 \end{vmatrix} = A \begin{vmatrix} x_1 \\ y_1 \end{vmatrix} + B$$

Where  $[x_1, y_1]$  are the coordinates of a pixel before the transformation and  $[x_2, y_2]$  - after the transformation.

The  $2 \times 2$  matrix A and the two-element vector B provide the transformation coefficients. For example, a pure translation of pixels occurs if:

$$A = \begin{vmatrix} 1 & 0 \\ 0 & 1 \end{vmatrix}, B = \begin{vmatrix} b_1 \\ b_2 \end{vmatrix} \quad \text{Similar, the following coefficients implement pure scaling: } A = \begin{vmatrix} a_{11} & 0 \\ 0 & a_{22} \end{vmatrix}, B = \begin{vmatrix} 0 \\ 0 \end{vmatrix}$$

In practice different affine transformations are combined to produce a complex transformation.

### # Implementation of affine transformation in picoMIPS

You are required to develop both a smallest possible picoMIPS architecture and a machine-level program for the general affine transformation. The 6 constants that define the transformation must be included as immediate literals in your program. You can choose one out of two sets of transformation constants as outlined below in Section "Data Sets". Pixel coordinates must be read from the switches SW0-SW7 on the FPGA development system and the resulting pixel coordinates after the transformation displayed on the LEDs LED0-LED7. Switch SW8 provides handshaking functionality as described in the pseudocode below. Switch SW9 should act as an active-low PSEN

### # Pseudocode

1. Wait for coordinate  $x_1$  (red circle) by polling switch SW8. Wait while SW8=0. When SW8 becomes 1 (SW8=1) read the coordinate  $x_1$
2. Wait for switch SW8 to become 0
3. Wait for coordinate  $y_1$  (red circle) as specified in step 1
4. Wait for SW8 to become 0
5. Execute the affine transformation algorithm and display coordinate  $x_2$  (red circle) on LED0-LED7
6. Wait until SW8 becomes 1
7. Display coordinate  $y_2$  (red circle) on LED0-LED7
8. Wait until SW8 becomes 0
9. Go to step 1

from SW0-SW7

## # Input / Output

The input/output functionality can be implemented in several different ways. For example, you can design your own IN and OUT instructions for reading/writing data using external ports. To use fewer hardware resources, you can consider connecting the ALU output, or a register output directly to the LEDs. You could consider dedicating a specific register number, e.g. register 1 to the input port. In this way, an ADD instruction can be used to read data, e.g. ADD %5, %0, %1 would store the input data in register %5.

## # Data sets

In your implementation, choose one of the two following data sets:

$$A = \begin{bmatrix} 0.75 & 0.5 \\ -0.5 & 0.75 \end{bmatrix} \quad B = \begin{bmatrix} 20 \\ -20 \end{bmatrix} \quad \text{or} \quad A = \begin{bmatrix} 0.5 & -0.875 \\ -0.875 & 0.75 \end{bmatrix} \quad B = \begin{bmatrix} 5 \\ 12 \end{bmatrix}$$

# Suggested data formats      *Affine transformation and fixed-point representation*

For the two-dimensional affine transformation implemented in your picoMIPS:  $\begin{pmatrix} x_2 \\ y_2 \end{pmatrix} = A \times \begin{pmatrix} x_1 \\ y_1 \end{pmatrix} + B$

Use the following data formats. Pixel coordinates  $[x_1, y_1]$  and  $[x_2, y_2]$  and coefficients of vector B are 8-bit 2's complement signed integers, i.e. their values are in the range  $-128 \dots +127$ .

$2^{(-7)}$

The coefficients of matrix A are 2's complement signed fixed-point fractions in the range  $-1 \dots +1 - 2^{(-8)}$ , i.e. they are 2's complement fractional numbers with the radix point positioned after the most significant bits. Therefore the weights of the individual bit are:

Therefore the weights of the individual bits are:

$$\begin{array}{ccccccccc} -2^0 & 2^1(-1) & 2^2(-2) & 2^3(-3) & 2^4(-4) & 2^5(-5) & 2^6(-6) & 2^7(-7) \\ \text{Bit position} & 7 & 6 & 5 & 4 & 3 & 2 & 1 & 0 \end{array}$$

When coefficients of the matrix A are multiplied by pixel coordinates, a double-length 16-bit product is obtained which is a 2's complement number with the radix point positioned after the 9-th bit.

Note however that the result  $[x_2, y_2]$  must be an 8-bit 2's complement whole number

## # Binary multiplication examples

Binary multiplication of 2's complement 8-bit numbers yields a 16-bit result. As one of the numbers is represented in the range  $-1 \dots +1 - 2^{-7}$  and the other in the range  $-128 \dots +127$ , it is important to determine correctly which 8-bits of the 16-bit result represent the integer part which should be used for further calculations. The following examples illustrate which result bits represent the integer part.

[Example 1 :  $0.75 \times 6$ ]

$$\begin{array}{cccccccccccccccc} -2^8 & | & 2^7 & 2^6 & 2^5 & 2^4 & 2^3 & 2^2 & 2^1 & 2^0 & | & 2^{-1} & 2^{-2} & 2^{-3} & 2^{-4} & 2^{-5} & 2^{-6} & 2^{-7} \\ | & & & & & & & & & & & & & & & & & & & \downarrow \\ \text{Result} & & (\text{Integer}) & & & & & & & & & & & & & & & & & \end{array}$$

The new leading bit must now change from  $2^7$  to  $-2^7$

$$2^7 \ 2^6 \ 2^5 \ 2^4 \ 2^3 \ 2^2 \ 2^1 \ 2^0$$

while (sw8==0);

Load sw0-sw7 for x1

while (sw8==1);

Load sw0-sw8 for y1

while (sw8==0);

## ALU Synthesis

# RAM memory load and store instructions (the only instructions using the RAM (data memory) are load (LD) and store (ST))

ALU is used for address calculation

- typically ALU is programmed to copy operand B to Result : address is in register RS or immediate

Data move from RAM to Rd (load) or from RA to RAM (store)

{ Load operation :  $Rd \leftarrow RAM[Address]$

Store operation :  $RAM[Address] \leftarrow Rd$

# Immediate Jump, absolute or relative If there is subroutine (know where to return)

Absolute jump :  $PC \leftarrow \text{branch address}$ , where branch address is : Instruction [p-1:0]

Relative jump :  $PC \leftarrow PC + \text{Instruction}[p-1:0]$ , relative offset needs an adder within the PC

# Indirect Jump, absolute or relative

Branch address is in a register 结构同上

OKS

$$a+b \quad b$$

## picoMIPS - Variable - Architecture Motif-specific Processor

Problems with standard many-core approaches: 1-processor size

- 100 ARM or Intel cores would take up far too much space to physically fit on a SoC that powers a mobile or tablet device.
- New architecture designs are required using smaller and simpler cores.
- Energy, Area and Performance Overhead of the Network-on-chip
  - Networking and shared resources in many-core
  - Huge number of 'wires' between cores and themselves
  - Complex switching - large space and large proportion of energy

2 - Software

Current parallel programming techniques are not parallel enough

- Current iPhone contains a dual-core processor and yet outperforms smartphones with much higher core counts.

Most programs are serial by nature

- Standard architectures with lots of cores cannot be used effectively

Backward compatibility of software

- Smooth transitions from existing software to many-core implementations are difficult

Key design challenge in embedded mobile systems

Reduce energy consumption while delivering high performance

Promising approach: application-driven variable-architecture processors

A methodology is presented to create an energy-efficient variable architecture processor that is tailored to application requirements

Case studies: Affine transformation of pixels

JPEG decompressor hardware size is reduced four times compared with equivalent dedicated hardware.

Variable - architecture picoMIPS

Embedded processors usually run specific, fixed application types (motifs)

Proposed architecture for such applications: picoMIPS

picoMIPS:

- Inspired by MIPS, but not a MIPS clone or subset
- Variable size data width
- Variable structure and variable size hardware blocks:
  - ALU, register file, program memory, instruction decoder, I/O
  - Application specific instruction format and instruction set
  - Only essential data storage, instructions and ALU functions are implemented during synthesis.
  - More versatile than typical FPGA soft cores, e.g. Altera Nios

picoMIPS - can be customised as a motif-oriented processor

User specifies the width of the data bus  $n$ , typically 8, 16, 32, 64 bits and the program counter size  $p$ .

User specifies the instruction size  $i$ , instruction set format and instruction set to be implemented in the decoder.

## # Processor synthesis flow



## # PicoMIPS - Summary

A variable-architecture application-specific processor design methodology is proposed

- Oriented towards extremely low energy consumption
- Can be implemented in a many-core small worker environment

Suitable for embedded applications

Potential for extreme energy savings in ASIC implementations

- Standard energy saving techniques can be used, such as near-threshold design, dynamic voltage and frequency scaling etc.

Further work

- Software toolkit to automate the instruction set and architecture generation as well as translation of high-level code to machine-level instructions.

## Embedded Hardware Multipliers

### # Unsigned binary multiplication

$$\begin{array}{r}
 \begin{array}{cccc}
 A_3 & A_2 & A_1 & A_0 \\
 \times B_3 & B_2 & B_1 & B_0
 \end{array} \\
 \begin{array}{cccc}
 A_3B_0 & A_2B_0 & A_1B_0 & A_0B_0 \\
 A_3B_1 & A_2B_1 & A_1B_1 & A_0B_1 \\
 A_3B_2 & A_2B_2 & A_1B_2 & A_0B_2 \\
 \hline
 A_3B_3 & A_2B_3 & A_1B_3 & A_0B_3
 \end{array}
 \end{array}$$

2N-bit product

[Example]  $7 \times 5$  unsigned

$$\begin{array}{r}
 \begin{array}{c}
 \begin{array}{c} 111 \\ M \end{array} \text{ multiplicand} \\
 \times 101 \quad Q \text{ multiplier} \\
 \hline
 \begin{array}{c} 000 \\ \text{(ADD } 7, Q_0=1) \\ \text{(Don't Add, } Q_1=0) \\ \text{ADD } 7 \text{ shifted by 2 bits, } Q_2=1) \\ \text{(Result=35)} \end{array}
 \end{array}
 \end{array}$$

### Hardware implementations

① Sequential - adder, shift register, state machine

② Combinational - adders and AND gates for partial product  $A_i B_i$

Same example cycle by cycle for sequential hardware implementation

| M = 1 1 1 ( $M=7$ ) |       |                       |
|---------------------|-------|-----------------------|
| C                   | A     | Q=5 (current operand) |
| 0                   | 0 0 0 | 1 0 1                 |
| 0                   | 1 1 1 | 1 0 1                 |
| 0                   | 0 1 1 | 1 1 0                 |
| 0                   | 0 0 1 | 1 1 1                 |
| 1                   | 0 0 0 | 1 1 1                 |
| 0                   | 1 0 0 | 0 1 1                 |
| 0                   | 1 0 0 | 0 1 1                 |

$\frac{001}{+111} \rightarrow 1000$   
overflow  
 $A, Q = 0 1 0 0 0 1 1 = 35$

Note:  
Here the double length accumulator AQ shares storage with multiplier Q.

### # Sequential unsigned multiplier - hardware implementation



### # Combinational unsigned multiplier

$$\begin{array}{r}
 \begin{array}{cccc}
 X_3 & X_2 & X_1 & X_0 \\
 \times Y_3 & Y_2 & Y_1 & Y_0
 \end{array} \\
 \begin{array}{cccc}
 X_3Y_0 & X_2Y_0 & X_1Y_0 & X_0Y_0 \\
 X_3Y_1 & X_2Y_1 & X_1Y_1 & X_0Y_1 \\
 X_3Y_2 & X_2Y_2 & X_1Y_2 & X_0Y_2 \\
 \hline
 X_3Y_3 & X_2Y_3 & X_1Y_3 & X_0Y_3
 \end{array}
 \end{array}$$

$Z_7 Z_6 Z_5 Z_4 Z_3 Z_2 Z_1 Z_0$

Propagation Delay

$\sim 2N$  adders



## # Signed Multiplication

Can extend addition to  $2n$  bits

负×正 →  $n$  cycles

Negative multiplicand, positive multiplier -  $n$  cycles

$$\begin{array}{r}
 -5 \quad 1 \quad 1 \quad 1 \quad 1 \quad 1 \quad 0 \quad 1 \quad 1 \quad x \\
 5 \quad 0 \quad 0 \quad 0 \quad 0 \quad 0 \quad 1 \quad 0 \quad 1 = \\
 \hline
 1 \quad 1 \quad 1 \quad 0 \quad 1 \quad 1 \quad 0 \quad 1 \quad 1 \uparrow \\
 0 \quad 0 \uparrow \\
 1 \quad 1 \quad 1 \quad 0 \quad 1 \quad 1 \quad 0 \quad 0 \quad 0 \uparrow \\
 0 \quad 0 \uparrow \\
 \hline
 -25 \quad 1 \quad 1 \quad 0 \quad 0 \quad 0 \quad 1 \quad 1 \quad 1 = \\
 -128 \quad b_4 \quad 32 \quad t_4 \quad +2 \quad t_1
 \end{array}$$

Positive or negative multiplicand, negative multiplier

-  $2n$  cycles, but can be reduced to  $n$  cycles using a special treatment of the upper half

$$\begin{array}{r}
 7 \quad 0 \quad 0 \quad 0 \quad 0 \quad 0 \quad 1 \quad 1 \quad 1 \quad x \\
 -3 \quad 1 \quad 1 \quad 1 \quad 1 \quad 1 \quad 0 \quad 1 \quad 1 = \\
 \hline
 0 \quad 0 \quad 0 \quad 0 \quad 0 \quad 1 \quad 1 \quad 1 \quad 1 \uparrow \\
 0 \quad 0 \uparrow \\
 0 \quad 0 \quad 0 \quad 1 \quad 1 \quad 1 \quad 0 \quad 0 \quad 0 \uparrow \\
 0 \quad 0 \quad 1 \quad 1 \quad 1 \quad 0 \quad 0 \quad 0 \quad 0 \uparrow \\
 0 \quad 1 \quad 1 \quad 1 \quad 0 \quad 0 \quad 0 \quad 0 \quad 0 \uparrow \\
 1 \quad 1 \quad 1 \quad 0 \quad 0 \quad 0 \quad 0 \quad 0 \quad 0 \uparrow \\
 1 \quad 1 \quad 0 \uparrow \\
 1 \quad 0 \uparrow \quad \leftarrow \text{忽略}
 \end{array}$$

$$\begin{array}{r}
 -21 \quad 1 \quad 1 \quad 1 \quad 0 \quad 1 \quad 0 \quad 1 \quad 1 = \\
 -2^7 \quad 2^6 \quad 2^5 \quad 2^4 \quad 2^3 \quad 2^2 \quad 2^1 \quad 2^0
 \end{array}$$

# Combinational signed multiplication extending addition to  $2n$  bits - much exact circuitry

$$\begin{array}{l}
 \begin{array}{ccccccc}
 x_3 & x_2 & x_1 & x_0 \\
 y_3 & y_2 & y_1 & y_0
 \end{array} \\
 \begin{array}{c}
 x_3 y_0 \quad x_3 y_0 \quad x_3 y_0 \quad x_3 y_0 \quad x_2 y_0 \quad x_1 y_0 \quad x_0 y_0 \\
 + x_3 y_1 \quad x_3 y_1 \quad x_3 y_1 \quad x_3 y_1 \quad x_2 y_1 \quad x_1 y_1 \quad x_0 y_1 \\
 + x_3 y_2 \quad x_3 y_2 \quad x_3 y_2 \quad x_3 y_2 \quad x_2 y_2 \quad x_1 y_2 \quad x_0 y_2 \\
 - x_3 y_3 \quad x_3 y_3 \quad x_3 y_3 \quad x_3 y_3 \quad x_2 y_3 \quad x_1 y_3 \quad x_0 y_3
 \end{array} \\
 \begin{array}{c}
 0111 \quad 7 \\
 1101 \quad -3 \\
 0000 \\
 1100 \\
 \hline
 0111 \\
 0000 \\
 1101 \\
 1110 \quad 1011
 \end{array}
 \end{array}$$

$$-2^6 + 2^5 + 2^4 + 2^0 =$$



有借位进位最后一层为减法

## # Hardware efficient combinational signed multiplier Baugh-Wooley multiplier



n-1 bit addition

Special treatment of sign bits

Special treatment of last stage

Note the use of FA Co outputs

Red line shows the longest propagation delay

## # Multiplier block architecture



## # Inferring multipliers from SystemVerilog code

Ex:

$$\begin{array}{r}
 \begin{array}{r} 0 \ 1 \ 1 \ 1 \\ \times \ 1 \ 1 \ 0 \ 1 \\ \hline \end{array}
 \\ \begin{array}{r} 0 \ 0 \ 0 \ 0 \ 0 \ 1 \ 1 \\ 0 \ 0 \ 0 \ 0 \ 0 \ 0 \ 0 \\ 0 \ 0 \ 0 \ 1 \ 1 \ 1 \ 1 \\ 0 \ 0 \ 1 \ 1 \ 1 \ 1 \ 1 \\ \hline 1 \ 1 \ 0 \ 0 \ 1 \ 1 \ 1 \ 1 \end{array}
 \end{array}$$

## Pipelined CPUs

### # Key features of RISC and pipelines

- Small and simple instruction set.
- Most instructions execute in single clock cycle (no microcode)
- Memory access instructions only transfer data: memory  $\leftrightarrow$  registers
- ALU operations are register to register
- Large number of general purpose registers

- RISC lends itself easily to efficient instruction execution pipeline

- Pipelines accelerate performance
- Pipelines are not always used in embedded architectures
  - Hardware overheads are necessary to implement and control a pipeline.
  - But many embedded hardware processors are pipelined, e.g. ARM Cortex cores.
  - ARM Cortex M3 - 3 stage pipeline
  - ARM Cortex A9 - dynamic length, 9-11 stage pipeline.

### # Even simplest microcontrollers use pipelines

- AVR has a two-stage pipeline: parallel instruction fetches and instruction executions
- This is enabled by the Harvard architecture of the AVR and the use of instruction register.
- This basic pipelining results in fast execution, single-clock cycle per instruction.

### # AVR parallel fetch and execution



### # Pipeline hazards

Instruction cycle broken into n phases (one execution stage per phase)

- e.g., Fetch, Decoder, ReadOPs, Execute1, Execute2, WriteBack.

A new instruction is fetched each phase

Maximum speed gain is n times

Pipeline hazards reduce the ability to achieve a gain of n times.

### [Type of hazards]

#### - Resource

- Hazard occurs when instruction needs a resource being used by another instruction.

#### - Data

- RAW (read-after-write hazard: read is requested before write has finished)
- WAR (write-after-read hazard: write is requested before read is finished)
- WAW (write-after-write hazard: writes occur in an unintended order)

#### - Control

- Hazard occurs when a wrong fetch decision at a branch results in an extra instruction fetch and a pipeline flush.

RAW: 一条指令的操作数来自之前指令的结果

WAR: 一条指令要将结果写入某个寄存器中，但这个寄存器还在被其它指令读取

WAW: 如果两条指令都要将结果写到同一个寄存器中，那么后面的指令必须等到前面的指令写完。

## # Data hazards

Dependences between instructions may cause data hazards when Instr<sub>1</sub> and Instr<sub>2</sub> are so close that their overlapping within the pipeline would change their access order to Reg.

Three types of data hazards:

Read After Write (RAW): Instr<sub>2</sub> tries to read operand before Instr<sub>1</sub> writes it.

Write After Read (WAR): Instr<sub>2</sub> tries to write operand before Instr<sub>1</sub> reads it.

Write After Write (WAW): Instr<sub>2</sub> tries to write operand before Instr<sub>1</sub> writes it.

$$R1 = R2 + R3$$

$$R5 = R1 + R4$$

$$R1 = R2 + R3$$

$$R2 = R3 + R4$$

$$R1 = R2 + R3$$

$$R1 = R2 + R4$$

true dependence

anti-dependence

output dependence

## # Data access hazard - example

IF : Instruction Fetch  
 ID : Instruction Decode  
 EX : Execute  
 MEM : Memory Access  
 WB : Write Back

# Pipeline conflict due to data dependency hazard



## # Solutions to data access and data dependency hazards

Software solutions (Compiler Scheduling):

- Putting no-op instructions after each instruction that may cause a hazard
- Instruction scheduling: rearrange code to reduce no-ops

Hardware solutions

- Hazard detection hardware is necessary.
- Interlocking: stall pipeline for one or more cycles
- Forwarding: two types of forwarding:
  - The ALU result of Instr<sub>1</sub> in EX stage can immediately be forwarded back to ALU input of EX stage as an operand for Instr<sub>2</sub>.
  - The memory load data from MEM stage can be forwarded to ALU input of EX stage.
- Forwarding with interlocking
  - Assume that Instr<sub>2</sub> is data dependent on the load instruction Instr<sub>1</sub> then Instr<sub>2</sub> has to be stalled until the data loaded by Instr<sub>1</sub> becomes available.

## # Hardware solution to data dependency hazard - interlocking

# Solution to data dependency hazard - forwarding



Example of data dependency hazard not resolvable by forwarding

直接将结果输入 EX 阶段

用 bubble 避让指令

## # Example of data dependency hazard - resolved by forwarding and stalling

load Reg2, B



## # Control hazards

Control hazards occur when a wrong ~~function~~ fetch decision results in a new instruction fetch and the pipeline being flushed.

Dealing with control hazards may require significant hardware overheads.

Solution include:

- Multiple Pipeline streams
- Prefetching the branch target
- Using a loop buffer
- Branch Prediction
- Delayed Branch
- Reordering of Instructions
- Multiple Copies of Registers (backups)

flushed

## # 3-stage ARM Cortex M3 pipeline

Each instruction is executed in three stages

① Fetch: instruction is fetched from memory and replaced in pipeline

② Decode: instruction is decoded and data-paths signals prepared for next cycle.

③ Execute: instruction reads ALU operands from registers and writes result to destination register.

ARM pipeline is linear: processor throughput is one instruction per clock cycle while individual instruction takes three clock cycles.

- When a branch instruction is fetched, a pipeline faces difficulties, because wrong instructions may have been fetched into the pipeline. If this occurs, pipeline flushes and has to be refilled. ARM pipeline maintains high performance and predicts branch behaviour because PC address is calculated 2 instructions ahead of current instruction.



## # MIPS Pipeline

Pipelined version of MIPS is a good example of a pipelined RISC architecture (relative simple and real)

## [RISC] MIPS Instructions

## Load/Store Instructions

|     |                        |          |
|-----|------------------------|----------|
| LB  | Load Byte              | (I-type) |
| LBU | Load Byte Unsigned     |          |
| LH  | Load Halfword          |          |
| LHU | Load Halfword Unsigned |          |
| LW  | Load Word              |          |
| LWL | Load Word Left         |          |
| LWR | Load Word Right        |          |
| SB  | Store Byte             |          |
| SH  | Store Halfword         |          |
| SW  | Store Word             |          |
| SWL | Store Word Left        |          |
| SWR | Store Word Right       |          |

## Shift Instructions

|      |                                 |
|------|---------------------------------|
| SLL  | Shift Left Logic                |
| SRL  | Shift Right Logic               |
| SRA  | Shift Right Arithmetic          |
| SLLV | Shift Left Logic Variable       |
| SRLV | Shift Right Logic Variable      |
| SRAV | Shift Right Arithmetic Variable |

## Multiply / Divide Instructions

|       |                    |
|-------|--------------------|
| MULT  | Multiply (FP type) |
| MULTU | Multiply Unsigned  |
| DIV   | Divide             |
| DIVU  | Divide Unsigned    |
| MFHI  | Move From HI       |
| MTHI  | Move to HI         |
| MFLO  | Move From LO       |
| MTLO  | Move To LO         |

## Coprocessor Instructions

|       |                               |
|-------|-------------------------------|
| LWCZ  | Load Word to Coprocessor      |
| SWCZ  | Store Word to Coprocessor     |
| INTCZ | Move To Coprocessor           |
| MFCZ  | Move From Coprocessor         |
| CTCZ  | Move Control to Coprocessor   |
| CFCZ  | Move Control from Coprocessor |
| COPZ  | Coprocessor Operation         |
| BCZT  | Branch on Coprocessor ≠ True  |
| BCZF  | Branch on Coprocessor = False |

## Arithmetic Instructions (ALU immediate)

|       |                                     |
|-------|-------------------------------------|
| ADDI  | Add Immediate                       |
| ADDIU | Add Immediate Unsigned              |
| SLTI  | Set on Less Than Immediate          |
| SLTIU | Set on Less Than Immediate Unsigned |
| ANDI  | AND Immediate                       |
| ORI   | OR Immediate                        |
| XORI  | Exclusive -OR Immediate             |
| LUI   | Load Upper Immediate                |

## Arithmetic Instructions (3-operand , R-type)

|      |                           |
|------|---------------------------|
| ADD  | Add                       |
| ADDU | Add Unsigned              |
| SUB  | Subtract                  |
| SUBU | Subtract Unsigned         |
| SLT  | Set on Less Than          |
| SLTU | Set on Less Than Unsigned |
| AND  | AND                       |
| OR   | OR                        |
| XOR  | Exclusive -OR             |
| MOR  | MOR                       |

## Jump and Branch Instructions

|        |                                                   |
|--------|---------------------------------------------------|
| J      | Jump                                              |
| JAL    | Jump and Link                                     |
| JR     | Jump to Register                                  |
| JALR   | Jump and Link Register                            |
| BEQ    | Branch on Equal                                   |
| BNE    | Branch on not Equal                               |
| BLTZ   | Branch on Less than or equal to zero              |
| BGTZ   | Branch on Greater than zero                       |
| BLTZ   | Branch on Less than zero                          |
| BGEZ   | Branch on Greater than or equal to zero           |
| BLTZAL | Branch on Less than zero and Link                 |
| BGEZAL | Branch on Greater Than or Equal to zero and Link. |

## Special Instructions

|         |             |
|---------|-------------|
| SYSCALL | SYSTEM CALL |
| BREAK   | Break       |

## # MIPS Instruction Formats

|                       |           |    |        |           |       |
|-----------------------|-----------|----|--------|-----------|-------|
| I-type<br>(Immediate) | 6         | 5  | 5      | 16        |       |
|                       | Operation | rs | rt     | Immediate |       |
| J-type<br>(Jump)      | 6         | 5  | 5      | 5         | 6     |
|                       | Operation |    | Target |           |       |
| R-type<br>(Register)  | 6         | 5  | 5      | rd        | Shift |
|                       | Operation | rs | rt     | rd        | Shift |
|                       |           |    |        | Function  |       |

## Operation

rs

rt

## Immediate

Target

rd

Shift

## Function

## Operation code

Source register specifier

Source / destination register specifier

Immediate, branch, or address displacement

Jump target address

Destination register specifier

Shift amount

ALU / shift function specifier

## # Single-cycle MIPS Architecture



最先进的

RISC-V A state-of-the-art open-source RISC with potential for embedded applications

## # RISC-V features

### Small CPU

- many have 16 registers
- basic version includes only 33 instructions

### Ideas borrowed from SPARC

- register 0 is a constant 0
- presence of instruction register, gives rise to a simple, two-stage fetch-execute pipeline

### Ideas borrowed from MIPS

- No status register
- Conditional branches evaluate branch condition in the same cycle
- Memory-mapped I/O

### Novel ideals

- ALU intentionally lacks condition codes, branch conditions are evaluated by separate logic
  - But the lack of carry bit complicates multiple precision arithmetic
- RISC V does not detect or flag most arithmetic errors
- Extra space is reserved for new instructions, future extensions but also user-defined instructions

## # RISC-V two-stage pipeline with branch predictor

The controller will determine what to do if the predictor is not correct



Two Reg File Modules above are the same modular.

read more

## # RISC-V branch prediction logic

The predictor uses a four entry table  $B_{pred}$ . Each entry is a  $\langle pc+4, \text{target} \rangle$  pair.

In the fetch stage, the predictor uses  $pc+4$  to index the table, and if the PC's match then this is a ~~miss~~ hit.

If the PC's do not match, then this is a miss and  $pc+4$  is used as the next PC instead of target.

The hit/miss signal is pipelined to the execute stage. This signal tells the control logic if the branch was predicted taken or not taken. Actions of the control logic are:

| Predicted | Actual    | Mispredict? | Action to take                                                          |
|-----------|-----------|-------------|-------------------------------------------------------------------------|
| taken     | taken     | no          | no operation                                                            |
| not-taken | taken     | yes         | Kill instr in fetch, update table, $pc := \text{branch or jump target}$ |
| taken     | not-taken | yes         | Kill instr in fetch, do not update table, $pc := \text{correct } pc+4$  |
| not-taken | not-taken | no          | no action required.                                                     |

If the branch was predicted not-taken, and it was actually taken (i.e. a loop was entered), then table  $B_{pred}$  is updated by adding the appropriate  $pc+4$  and branch or jump target.

If the branch was predicted taken or not taken and it was actually not-taken (i.e. a loop exists), then there is no need to update the table.

## # Instruction format (R V32I - RISC-V 32-bit Integer Instruction Set)

|            |              |                                     |                              |          |                                                                            |                                                                                |
|------------|--------------|-------------------------------------|------------------------------|----------|----------------------------------------------------------------------------|--------------------------------------------------------------------------------|
| 31         | 25:24        | 20:17                               | 15:14                        | 12:11    | 7:6                                                                        | 0                                                                              |
|            | funct7       | rs2                                 | rs1                          | funct3   | rd                                                                         | opcode                                                                         |
| R-type     | $\leftarrow$ | $rd \leftarrow rs1 \text{ op } rs2$ | (register only instructions) |          |                                                                            |                                                                                |
| imm[11:0]  | rs1          | funct3                              | rd                           | opcode   | I-type                                                                     | $\leftarrow$ $rd \leftarrow rs1 \text{ op imm}$                                |
| imm[11:5]  | rs2          | rs1                                 | funct3                       | imm[4:0] | opcode                                                                     | S-type $\leftarrow$ memory store instructions; Loads are implemented as I-type |
| imm[31:12] |              | rd                                  | opcode                       | V-type   | $\leftarrow$ instructions involving PC, e.g. JAL (jump and link) and other |                                                                                |
|            |              |                                     |                              |          |                                                                            | instructions involving large immediates, e.g. LUI (load upper immediate)       |

## # Comments regarding the instruction set.

Extend instruction sets are also defined, e.g.:

- RV64I, RV128I (64-bit and 128-bit integer instructions)
- Extensions for integer multiplication and division.
- Extensions for floating point (single, double and quadruple precision)
- Extensions for bit manipulation.
- Extensions for vector architectures (SIMD)

$x_i$ , where  $i = 0..32$ , is used in RISC-V assembler to represent register  $i$ .

Register  $x0$  is always 0

NOP is implemented as: ADDI  $x0, x0, 0$

J is implemented as: JAL  $x0, \text{address}$

- JAL rd, address normally calls a subroutine, where  $pc+4$  is saved in rd.

## Soft processor cores in embedded systems

### # Microprocessors in embedded systems

Microprocessors are everywhere

A modern car has around 40 to 100 microprocessors

- Engine control, break systems, active suspension (Need powerful, 32-bit processors)
- Even motorised window control has a microprocessor

Designer face many challenges

- Which parts of a system on chip should be implemented as programmable, combinational and sequential logic
- Which parts should be implemented as programs on a microprocessor core?
- What processor architecture should be used?

• Little point in using a 32-bit MIPS with double precision floating point if one is designing a burglar alarm 防盗

### # Processor cores

Can be classified as 'hard' or 'soft'

- 'hard' core is an embedded processor, e.g. ARM, surrounded by FPGAs memory and programmable logic
- 'soft' core is synthesized from HDL code using standard FPGA's memory and logic

The microMIPS we are designing is a soft core

Hard cores are less configurable but have higher performance than soft cores

### # Hard processor cores

Example of hard cores on FPGAs:

- Altera offers ARM core in some of their FPGAs, e.g. Cyclone V
- Xilinx also offers ARM or PowerPC



Two PowerPC hard cores  
on a Xilinx Virtex II die  
两个硬核，不是从代码综合而来  
而是直接就在板子上

### # Soft cores

Use existing FPGA logic elements

Flexible, feature rich, reconfigurable

Customized memory size, data path width, ALU functionality, number of types of peripherals and

Typically have slower clock rates

Are said to consume more power than their hard-core counterparts (Not always true)

[Examples]

Altera: NIOS II - one of the most popular

Xilinx: MicroBlaze, PicoBlaze

Lattice: LatticeMico8

ARM Cortex MI - a third party core offered to FPGA vendors 供应商

There are many open-source soft cores, e.g.

- SPARC - GPL licence ([www.simplysparc.com](http://www.simplysparc.com))
- OpenRISC - also GPL (e.g. [opencores.org](http://opencores.org))
- DSP48a16

### # Development Tools

A dedicated CAD tool, specific to the particular core, is used to specify processor parameters.

- Register file size, hardware arithmetic, floating point, interrupts, I/O hardware, custom functions

- User programs, typically written in C or C++, are compiled by a custom compiler provided by the CAD tool and included in the synthesis

The tool outputs a synthesizable HDL model

- Normally the HDL code is protected and not viewable.

After any user-specified hardware is added, the design is synthesized using a standard synthesis tool.

- Altera Quartus integrates specification for the NIOS processor into a single tool used for general purpose synthesis

- Altera also provides a SOPC (System on Programmable Chip) Builder for advanced embedded system design.

## # Altera Nios II Soft Core Processor

A general purpose RISC with a Harvard memory architecture

Features a 32-bit Instruction Set Architecture (ISA), 32-general purpose registers, single-instruction 32x32 multiply and divide operations, and dedicated instructions for 64-bit and 128-bit products of multiplication.

Nios II has a performance of more than 150 MIPs (DMIPS) on Stratix family of FPGAs  
comes in three versions - economy, standard and fast core.

Each core version has a different number of pipeline stages, instruction and data cache memory and hardware components for X and Y.

Each core implementation varies in size and performance depending on the features that are selected.

Adding peripherals with the Nios II Processors is done through the Aralon Interface Bus which contains the necessary logic to interface processor with off-the-shelf (TILE) IP cores or custom designed peripherals.



## MIPS Instruction Reference

This is a description of the MIPS instruction set, their meanings, syntax, semantics, and bit encodings. The syntax given for each instruction refers to the assembly language syntax supported by the MIPS assembler. Hyphens ("") in the encoding indicate "don't care" bits which are considered when an instruction is being encoded.

General purpose registers (GPRs) are indicated with a dollar sign (\$). The word SWORD and UWORD refer to 32-bit signed and 32-bit unsigned data types.

The manner in which the processor executes an instruction and advances its program counter is as follows:

1. execute the instruction at PC

2. copy nPC to PC

3. add 4 or the branch offset to nPC

This behavior is indicated in the instruction specifications below. For brevity, the function advance\_pc(4) is used in many of instruction descriptions.

```
void advance_pc (LSMRP offset)
```

```
{
```

PC = nPC;      *Each address in memory stores one byte and each MIPS instruction requires four bytes of memory*

```
}
```

Note: All arithmetic immediate values are sign-extended. After that, they are handled as signed or unsigned 32-bit numbers, depending upon the instruction. The only difference between signed and unsigned instructions is that signed instructions can generate an overflow exception and unsigned instructions can not.

### [Instruction Descriptions]

#### ADD - Add (with overflow)

Description: Adds two registers and stores the result in a register

Operation:  $\$d = \$s + \$t$ ; advance\_pc(4);

Syntax: add \$d, \$s, \$t

Encoding: 0000 00ss ssst tttt dddd d 000 0010 0000

#### ADDI - Add Immediate (with overflow)

Description: Adds a register and a sign-extended immediate value and stores the result in a register

Operation:  $\$t = \$s + \text{imm}$ ; advance\_pc(4);

Syntax: addi \$t, \$s, imm

Encoding: 0010 00ss ssst tttt iiiii iiiii iiiii

#### ADDIU - Add immediate unsigned (no overflow)

Description: Adds a register and a sign-extended immediate value and stores the result in a register

Operation:  $\$t = \$s + \text{imm}$ ; advance\_pc(4);

Syntax: addiu \$t, \$s, imm

Encoding: 0010 01ss ssst tttt iiiii iiiii iiiii

#### ADDU - Add unsigned (no overflow)

Description: Adds two registers and stores the result in a register

Operation:  $\$d = \$s + \$t$ ; advance\_pc(4);

Syntax: addu \$d, \$s, \$t

Encoding: 0000 00ss ssst tttt dddd d000 0010 0001

#### AND - Bitwise and

Description: Bitwise ands two registers and stores the result in a register

Operation:  $\$d = \$s \& \$t$ ; advance\_pc(4);

Syntax: and \$d, \$s, \$t

Encoding: 0000 00ss ssst tttt dddd d000 0010 0100

**ANDI - Bitwise and immediate**

Description: Bitwise ands a register and an immediate value and stores the result in a register  
 Operation:  $t = \$s \& imm$ ; adv

Syntax: andi \$t, \$s, imm

Encoding: 0011 00ss sss\$ tttt iiiiiiiiiii

**BEQ - Branch on equal**

Description: Branch if the two registers are equal

Operation: if  $\$s == \$t$  advance\_pc(offset << 2); else advance\_pc(4);

Syntax: beq \$s, \$t, offset (pc jumps how many instructions)

Encoding: 0001 00ss sss\$ tttt iiiiiiiiiii

**BGEZ - Branch on greater than or equal to zero**

Description: Branches if the register is greater than or equal to zero

Operation: if  $\$s \geq 0$  advance\_pc(offset << 2); else advance\_pc(4);

Syntax: bgez \$s, offset

Encoding: 0000 01ss sss0 0001 iiiiiiiiiiiiiiiii

**BGEZAL - Branch on greater than or equal to zero link**

Description: Branches if the register is greater than or equal to zero and saves the return address in \$31

Operation: if  $\$s \geq 0$   $\$31 = PC + 8$  (or  $NP + 4$ ); advance\_pc(offset << 2); else advance\_pc(4);

Syntax: bgezal \$s, offset

Encoding: 0000 01ss sss1 0001 iiiiiiiiiiiiiiiii

**BGTZ - Branch on greater than zero**

Description: Branches if the register is greater than zero

Operation: if  $\$s > 0$  advance\_pc(offset << 2); else advance\_pc(4);

Syntax: bgtz \$s, offset

Encoding: 0001 11ss sss0 0000 iiiiiiiiiiiiiiiii

**BLEZ - Branch on less than or equal to zero**

Description: Branch if the register is less than or equal to zero

Operation: if  $\$s \leq 0$  advance\_pc(offset << 2); else advance\_pc(4);

Syntax: blez \$s, offset

Encoding: 0001 10ss sss0 0000 iiiiiiiiiiiiiiiii

**BLTZ - Branch on less than zero**

Description: Branches if the register is less than zero

Operation: if  $\$s < 0$   $\$31 = PC + 8$  (or  $NP + 4$ ); advance\_pc(offset << 2); else advance\_pc(4);

Syntax: bltzal \$s, offset

Encoding: 0000 01ss sss1 0000 iiiiiiiiiiiiiiiii

有无 link: 有类似于 call function  
没有就是 go to

$nPC = (PC \& 0xf0000000) | (target \ll 2);$

丁有度, offset is limited!

$PC+8$  or  $(nPC+4)$

↑  
 $nPC$ 在上次调用时已经+4, 再加4, 等于在  $PC=nPC$  时,  $PC+8 \approx$

No.

### BLTZAL - Branch on less than zero and link

Description: Branches if the register is less than zero and saves the return address in \$31

Operation: If  $\$s < 0$ ,  $PC+8$  (or  $nPC+4$ ) : advance- $PC$  ( $offset \ll 2$ ); else advance- $PC$  ( $4$ );

Syntax: bltzal \$s, offset

Encoding: 0000 01ss ssss 0000 iiiii iiiii iiiii

### BNE - Branch on not equal

Description: Branches if the two registers are not equal

Operation: If  $\$s \neq \$t$ , advance- $PC$  ( $offset \ll 2$ ); else advance- $PC$  ( $4$ );

Syntax: bne \$s, \$t, offset

Encoding: 0001 01ss ssss tttt iiiii iiiii iiiii iiiii

### DIV - Divide

Description: Divides  $\$s$  by  $\$t$  and stores the quotient in  $\$LO$  and the remainder in  $\$HI$

Operation:  $\$LO = \$s / \$t$ ;  $\$HI = \$s \% \$t$ ; advance- $PC(4)$ ;

Syntax: div \$s, \$t

Encoding: 0000 00ss ssss tttt 0000 0000 0001 1010

### DIVU - Divide unsigned

Description: Divides  $\$s$  by  $\$t$  and stores the quotient in  $\$LO$  and the remainder in  $\$HI$

Operation:  $\$LO = \$s / \$t$ ;  $\$HI = \$s \% \$t$ ; advance- $PC(4)$ ;

Syntax: divu \$s, \$t

Encoding: 0000 00ss ssss tttt 0000 0000 0001 1011

### J - Jump

Description: Jumps to the calculated address

Operation:  $PC = nPC$ ;  $nPC = (PC \& 0xf0000000) | (target \ll 2)$ ;

Syntax: j target

Encoding: 0000 10ii iiiii iiiii iiiii iiiii iiiii iiiii

### JAL - Jump and Link

Description: Jumps to the calculated address and stores the return address in \$31

Operation:  $\$31 = PC + 8$  (or  $nPC + 4$ );  $PC = nPC$ ;  $nPC = (PC \& 0xf0000000) | (target \ll 2)$ ;

Syntax: jal target

Encoding: 0000 11ii iiiii iiiii iiiii iiiii iiiii iiiii

### JR - Jump register

Description: Jumps to the address contained in register  $\$s$

Operation:  $PC = nPC$ ;  $nPC = \$s$ ;

Syntax: jr \$s

Encoding: 0000 00ss ssss 0000 0000 0000 0000 10000

**LB - Load byte**

Description: A byte is loaded into a register from the specified address

Operation:  $\$t = \text{MEM}[\$s + \text{offset}]$ ; advance\_pc(4);

Syntax: `lb $t, offset($s)`

Encoding: 1000 00ss ssst tttt iiiiiiiiiiiiiiiii

**LUI - Load upper immediate**

Description: The immediate values is shifted left 16 bits and stored in the register. The lower 16 bits are zeros.

Operation:  $\$t = (\text{imm} \ll 16)$ ; advance\_pc(4);

Syntax: `lui $t, imm`

Encoding: 0011 11-- ---t tttt iiiiiiiiiiiiiiiii

**LW - Load word**

Description: A word is loaded into a register from the specified address

Operation:  $\$t = \text{MEM}[\$s + \text{offset}]$ ; advance\_pc(4);

Syntax: `lw $t, offset($s)`

Encoding: 1000 11ss ssst tttt iiiiiiiiiiiiiiiii

**MFHI - Move from HI**

Description: The contents of register HI are moved to the specified register

Operation:  $\$d = \text{HI}$ ; advance\_pc(4);

Syntax: `mfhi $d`

Encoding: 0000 0000 0000 0000 dddd dddd 0001 0000

**MFLO - Move from LO**

Description: The contents of register LO are moved to the specified register

Operation:  $\$d = \text{LO}$ ; advance\_pc(4);

Syntax: `mflo $d`

Encoding: 0000 0000 0000 0000 dddd dddd 0010 0010

**MULT - Multiply**

Description: Multiplies  $\$s$  by  $\$t$  and stores the result in  $\$LO$

Operation:  $\$LO = \$s * \$t$ ; advance\_pc(4);

Syntax: `mult $s, $t`

Encoding: 0000 00ss ssst tttt 0000 0000 0001 1000

**MULTU - Multiply unsigned**

Description: Multiplies  $\$s$  by  $\$t$  and stores the result in  $\$LO$

Operation:  $\$LO = \$s * \$t$ ; advance\_pc(4);

Syntax: `multu $s, $t`

Encoding: 0000 00ss ssst tttt 0000 0000 0001 1001

### NOOP - no operation

Description: Perform no operation

Operation: advance\_pc(4)

Syntax: noop

Encoding: 0000 ... 0000

The encoding for a NOOP represents the instruction SLL \$0,\$0,0 which has no side effects

In fact, nearly every instruction that has \$0 as its destination register will have no side effect and can thus be considered a NOOP instruction.

### OR - Bitwise or

Description: Bitwise logical ors two registers and stores the result in a register

Operation: \$d = \$s | \$t ; advance\_pc(4);

Syntax: or \$d,\$s,\$t

Encoding: 0000 00ss ssst tttt dddd 0000 0010 0101

### ORI - Bitwise or immediate

Description: Bitwise ors a register and an immediate value and stores the result in a register

Operation: \$d = \$s | imm ; advance\_pc(4);

Syntax: ori \$t,\$s,imm

Encoding: 0011 00ss ssst tttt iiiii iiiii iiiii

### SB - Store Byte

Description: The least significant byte of \$t is stored at the specific address

Operation: MEM[\$s+offset] = (0x0f & \$t) ; advance\_pc(4);

Syntax: sb \$t, offset(\$s)

Encoding: 1010 00ss ssst tttt iiiii iiiii iiiii

### SLL - Shift left logical

Description: Shift a register value left by the shift amount listed in the instruction and places the result in a third register

Operation: \$d = \$t << h ; advance\_pc(4);

Syntax: sll \$d,\$t,h

Encoding: 0000 00ss ssst tttt dddd hhhh 0000

zero are shift in.

### SLLV - Shift left logical variable

Description: Shift a register value left by the value in a second register and places the result in a third register. Zeros are shifted in.

Operation: \$d = \$t << \$s ; advance\_pc(4);

Syntax: sllv \$d,\$t,\$s

Encoding: 0000 00ss ssst tttt dddd d--- --00 0100

### STL - Set on less than (signed)

Description: If \$s is less than \$t, \$d is set to one. It gets zero otherwise

Operation: if \$s < \$t \$d=1 ; advance\_pc(4); else \$d=0 ; advance\_pc(4);

Syntax: slt \$d,\$s,\$t

Encoding: 0000 00ss ssst tttt dddd d000 0010 1010

**SLTI - Set on less than immediate (signed)**

Description: If  $\$s$  is less than immediate,  $\$t$  is set to one. It gets zero otherwise.  
 Operation: If  $\$s < \text{imm}$   $\$t=1$ ; advance\_pc(4); else  $\$t=0$ ; advance\_pc(4);  
 Syntax: stti \$t,\$s,imm  
 Encoding: 0010 10ss ssst tttt iiiiiiiiiii

**SLTIU - Set on less than immediate unsigned**

Description: If  $\$s$  is less than the unsigned immediate,  $\$t$  is set to one. It gets zero otherwise.  
 Operation: If  $\$s < \text{imm}$   $\$t=1$ ; advance\_pc(4); else  $\$t=0$ ; advance\_pc(4);  
 Syntax: stiu \$t,\$s,imm  
 Encoding: 0010 11ss ssst tttt iiui iiii iiii

**SLTU - Set on less than unsigned**

Description: If  $\$s$  is less than  $\$t$ ,  $\$d$  is set to one. It gets zero otherwise.  
 Operation: If  $\$s < \$t$   $\$d=1$ ; advance\_pc(4); else  $\$d=0$ ; advance\_pc(4);  
 Syntax: stiu \$d,\$s,\$t  
 Encoding: 0000 00ss ssst tttt dddd d000 0010 1011

**SRA - Shift right algorithm**

Description: Shift a register value right by the shift amount (shamt) and places the value in the destination register. The sign bit is shifted in.  
 Operation:  $\$d = \$t \gg h$ ; advance\_pc(4);  
 Syntax: sra \$d,\$t,h  
 Encoding: 0000 00-- --t tttt dddd dhhh hh00 0011

**SRL - Shift right logical**

Description: Shift a register value right by the shift amount (shamt) and places the value in the destination register. Zeros are shifted in.  
 Operation:  $\$d = \$t \gg h$ ; advance\_pc(4);  
 Syntax: srl \$d,\$t,h  
 Encoding: 0000 00-- --t tttt dddd dhhh hh00 0010

**SRLV - Shift right logical variable**

Description: Shifts a register value right by the amount specified in  $\$s$  and places the value in the destination register. Zeros are shifted in.  
 Operation:  $\$d = \$t \gg \$s$ ; advance\_pc(4);  
 Syntax: srlv \$d,\$t,\$s  
 Encoding: 0000 00ss ssst tttt dddd d000 0000 0110

### SUB - Subtract

Description: Subtract two registers and stores the result in a register

Operation:  $\$d = \$s - \$t$ ; advance\_pc(4);

Syntax: sub \$d, \$s, \$t

Encoding: 0000 00ss ssst tttt dddd d000 0010 0010

### SUBU - Subtract unsigned

Description: Subtract two registers and stores the result in a register

Operation:  $\$d = \$s - \$t$ ; advance\_pc(4);

Syntax: subu \$d, \$s, \$t

Encoding: 0000 00ss ssst tttt dddd d000 0010 0011

### SW - Store word

Description: The contents of \$t is stored at the specific address.

Operation: MEM[(\$s + offset)] = \$t; advance\_pc(4);

Syntax: sw \$t, offset(\$s)

Encoding: 1010 11ss ssst tttt iiiii iiiii iiiii

### XOR - Bitwise exclusive OR

Description: Exclusive ors two registers and stores the result in a register

Operation:  $\$d = \$s \wedge \$t$ ; advance\_pc(4);

Syntax: xor \$d, \$s, \$t

Encoding: 0000 00ss ssst tttt dddd d---- 0110 0110

### XORI - Bitwise exclusive OR immediate

Description: Bitwise exclusive ors a register and an immediate value and stores the result in a register

Operation:  $\$t = \$s \wedge imm$ ; advance\_pc(4);

Syntax: xori \$t, \$s, imm

Encoding: 0011 10ss ssst tttt iiiii iiiii iiiii

Campus 无线装订本  
B5 40页  
WCN-CNB1430



6"937748"309932  
MADE IN CHINA

国誉商业(上海)有限公司  
<http://www.kokuyo.cn/st/>  
上海市奉贤区人杰路128号

TEL : 400-820-0798 FAX : 021-3255-8508

产地:上海市 QB/T1438-2007合格

B5 252×179mm

● 采用日本进口纸张, 纸质细滑, 牢固不掉页。