an assembler, emulator, and debugger for a subset of the ARM assembly language
- ✅ The assembler is (mostly!) complete for the chosen subset of the specification.
- 🚧 The emulator is in the middle of a rewrite to align its functionality with the exact behaviour specified in the ARM manual, so it isn't currently fully functional.
- 🚧 There is an accompanying browser-based debugger that is currently in a different repository. I'm working on polishing it up, and then I'm planning to integrate the entire project into a single monorepo.
Feel free to get in touch or raise a GitHub issue, and check back soon for updates!
eremius' predecessors Komodo, Perentie, and Bennett, developed at the University of Manchester, are all named after monitor lizards. Varanus eremius is the Latin name for the Rusty Desert Monitor Lizard - a nod to the project's predecessors and its implementation language.
Category | Mnemonic | Status |
---|---|---|
Branch | B | ✅ |
Data Processing | ADD | ✅ |
Data Processing | SUB | ✅ |
Data Processing | CMP | ✅ |
Data Processing | MOV | ✅ |
Data Transfer | LDR | ✅ |
Data Transfer | STR | ✅ |
Data Transfer | LDRB | ✅ |
Data Transfer | STRB | ✅ |
Data Transfer | LDM | ✅ |
Data Transfer | STM | ✅ |
System Call | SVC | ✅ |
Pseudo-Instruction | ADR | ✅ |
Assembler Directive | DEFW | ✅ |
Assembler Directive | DEFB | ✅ |
Assembler Directive | DEFS | ✅ |
Assembler Directive | ORIGIN | ✅ |
Assembler Directive | ALIGN | ✅ |
Assembler Directive | ENTRY | ✅ |
Assembler Directive | EQU | ✅ |
Mnemonic Extension | Meaning |
---|---|
EQ | Equal |
NE | Not Equal |
CS/HS | Carry Set/Unsigned Higher or Same |
CC/LO | Carry Clear/Unsigned Lower |
MI | Minus (Negative) |
PL | Plus (Positive or Zero) |
VS | Overflow |
VC | No Overflow |
HI | Unsigned Higher |
LS | Unsigned Lower or Same |
GE | Signed Greater Than or Equal |
LT | Signed Less Than |
GT | Signed Greater Than |
LE | Signed Less Than or Equal |
AL | Always (Unconditional) |
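
As a rough illustration of how these extensions map onto the N, Z, C and V flags, here is a minimal sketch. The `Flags` and `Condition` types and the `passes` method are hypothetical names chosen for this example and are not taken from the eremius source; the flag combinations follow the ARM manual.

```rust
/// The four condition flags held in the CPSR (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug, Default)]
pub struct Flags {
    pub n: bool, // Negative
    pub z: bool, // Zero
    pub c: bool, // Carry
    pub v: bool, // Overflow
}

/// One variant per mnemonic extension (CS doubles as HS, CC as LO).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Condition {
    EQ, NE, CS, CC, MI, PL, VS, VC, HI, LS, GE, LT, GT, LE, AL,
}

impl Condition {
    /// Returns true if an instruction with this condition should execute.
    pub fn passes(self, f: Flags) -> bool {
        match self {
            Condition::EQ => f.z,
            Condition::NE => !f.z,
            Condition::CS => f.c,
            Condition::CC => !f.c,
            Condition::MI => f.n,
            Condition::PL => !f.n,
            Condition::VS => f.v,
            Condition::VC => !f.v,
            Condition::HI => f.c && !f.z,
            Condition::LS => !f.c || f.z,
            Condition::GE => f.n == f.v,
            Condition::LT => f.n != f.v,
            Condition::GT => !f.z && f.n == f.v,
            Condition::LE => f.z || f.n != f.v,
            Condition::AL => true,
        }
    }
}
```

For example, after comparing two equal values the flags are Z set, C set, N clear and V clear, so `EQ`, `LS` and `GE` pass while `HI` and `GT` do not.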
There are 4 types of shifter operands:
Format | Name |
---|---|
#<immediate> | Immediate |
<Rm> | Register |
<Rm>, <shift> #<shift_imm> | Register Shift By Immediate |
<Rm>, <shift> <Rs> | Register Shift By Register |
All addressing modes involve a base register and an offset.
There are 3 types of offset value:
Format | Name |
---|---|
#+/-<offset_12> | Immediate |
+/-<Rm> | Register |
+/-<Rm>, <shift> #<shift_imm> | Scaled Register |
There are also 3 ways of applying the offset:
Format | Name |
---|---|
[<Rn>, #<offset>] | Offset |
[<Rn>, #<offset>]! | Pre-Indexed |
[<Rn>], #<offset> | Post-Indexed |
The 9 combinations of these formats form the 9 possible addressing modes.
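
The difference between the three ways of applying the offset can be sketched as follows. This is an illustrative sketch only: `Indexing` and `resolve_address` are hypothetical names, and the offset is assumed to have already been computed as a signed value from whichever offset format was used.

```rust
/// How the offset is applied to the base register (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Indexing {
    Offset,      // [<Rn>, #<offset>]
    PreIndexed,  // [<Rn>, #<offset>]!
    PostIndexed, // [<Rn>], #<offset>
}

/// Returns (address used for the memory access, value of the base register afterwards).
pub fn resolve_address(base: u32, offset: i32, indexing: Indexing) -> (u32, u32) {
    // Two's complement addition, so negative offsets wrap correctly.
    let adjusted = base.wrapping_add(offset as u32);
    match indexing {
        // Access base + offset, leave the base register untouched.
        Indexing::Offset => (adjusted, base),
        // Access base + offset, then write that address back to the base register.
        Indexing::PreIndexed => (adjusted, adjusted),
        // Access the unmodified base, then write base + offset back.
        Indexing::PostIndexed => (base, adjusted),
    }
}
```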
A Label is a program-relative address that can be assigned to any line in the program.
Causes a branch to a target address.
B{L}{<cond>} <target_address>
Suffix | Behaviour |
---|---|
L | Specifies that the instruction should store a return address in the link register (R14) |
<cond> | Specifies under what circumstances the instruction should be executed (see Condition Flags) |

Operand | Behaviour |
---|---|
<target_address> | Specifies the address to branch to |
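
Because the target is program-relative, the encoded branch stores a signed word offset rather than an absolute address. The sketch below shows one way this could be computed, following the encoding in the ARM manual (the offset is relative to the instruction's address plus 8). `encode_branch` is a hypothetical helper written for this example, not a function from this repository.

```rust
/// Encodes B/BL as a 32-bit word, or returns None if the target is
/// misaligned or out of range. `cond` is the 4-bit condition field.
pub fn encode_branch(cond: u32, link: bool, instruction_address: u32, target: u32) -> Option<u32> {
    // When the branch executes, the PC reads as the instruction address + 8.
    let offset = i64::from(target) - (i64::from(instruction_address) + 8);
    if offset % 4 != 0 {
        return None; // targets must be word-aligned
    }
    let words = offset / 4;
    // The offset field is a signed 24-bit number of words.
    if !(-(1i64 << 23)..(1i64 << 23)).contains(&words) {
        return None;
    }
    Some((cond & 0xF) << 28 | 0b101 << 25 | u32::from(link) << 24 | (words as u32 & 0x00FF_FFFF))
}
```

With the `AL` condition (`0b1110`), a branch to its own address encodes as `0xEAFFFFFE`.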
Adds two values. Can optionally update the condition flags based on the result.
ADD{<cond>}{S} <Rd>, <Rn>, <shifter_operand>
Suffix | Behaviour |
---|---|
<cond> | Specifies under what circumstances the instruction should be executed (see Condition Flags) |
S | Specifies that the instruction should update the Current Program Status Register (CPSR) Flags |

Operand | Behaviour |
---|---|
<Rd> | Specifies the destination register |
<Rn> | Specifies the register that contains the first operand |
<shifter_operand> | Specifies the second operand (see Shifter Operands) |
Subtracts one value from another. Can optionally update the condition flags based on the result.
SUB{<cond>}{S} <Rd>, <Rn>, <shifter_operand>
Suffix | Behaviour |
---|---|
<cond> | Specifies under what circumstances the instruction should be executed (see Condition Flags) |
S | Specifies that the instruction should update the Current Program Status Register (CPSR) Flags |

Operand | Behaviour |
---|---|
<Rd> | Specifies the destination register |
<Rn> | Specifies the register that contains the first operand |
<shifter_operand> | Specifies the second operand (see Shifter Operands) |
Compares two values, always updating the condition flags.
CMP{<cond>} <Rn>, <shifter_operand>
Suffix | Behaviour |
---|---|
<cond> | Specifies under what circumstances the instruction should be executed (see Condition Flags) |

Operand | Behaviour |
---|---|
<Rn> | Specifies the register that contains the first operand |
<shifter_operand> | Specifies the second operand (see Shifter Operands) |
Writes a value to a register.
MOV{<cond>}{S} <Rd>, <shifter_operand>
Suffix | Behaviour |
---|---|
<cond> | Specifies under what circumstances the instruction should be executed (see Condition Flags) |
S | Specifies that the instruction should update the Current Program Status Register (CPSR) Flags |

Operand | Behaviour |
---|---|
<Rd> | Specifies the destination register |
<shifter_operand> | Specifies the operand (see Shifter Operands) |
Loads a word into a register.
When used with a constant, this is a pseudo-instruction that the assembler will replace with either a data processing instruction or an LDR instruction pointing to a literal in memory.
LDR{<cond>} <Rd>, <source>
Suffix | Behaviour |
---|---|
<cond> | Specifies under what circumstances the instruction should be executed (see Condition Flags) |

Operand | Behaviour |
---|---|
<Rd> | Specifies the destination register |
<source> | Specifies the address (see Load/Store Address Operands) or a constant expression prefixed with = |
LDR r1, =0xfff; pseudo-instruction to load the constant 0xfff into r1
Stores a word to memory.
STR{<cond>} <Rd>, <address>
Suffix | Behaviour |
---|---|
<cond> | Specifies under what circumstances the instruction should be executed (see Condition Flags) |

Operand | Behaviour |
---|---|
<Rd> | Specifies the source register |
<address> | Specifies the address (see Load/Store Address Operands) |
Loads a byte from memory into a register and zero-extends it to a word.
LDRB{<cond>} <Rd>, <address>
Suffix | Behaviour |
---|---|
<cond> | Specifies under what circumstances the instruction should be executed (see Condition Flags) |

Operand | Behaviour |
---|---|
<Rd> | Specifies the destination register |
<address> | Specifies the address (see Load/Store Address Operands) |
Stores the least significant byte of a register to memory.
STRB{<cond>} <Rd>, <address>
Suffix | Behaviour |
---|---|
<cond> | Specifies under what circumstances the instruction should be executed (see Condition Flags) |

Operand | Behaviour |
---|---|
<Rd> | Specifies the source register |
<address> | Specifies the address (see Load/Store Address Operands) |
Loads values into multiple registers from sequential memory locations.
LDM{<cond>}<addressing_mode> <Rn>{!}, <registers>
Suffix | Behaviour |
---|---|
<cond> | Specifies under what circumstances the instruction should be executed (see Condition Flags) |
<addressing_mode> | Specifies how to produce a sequential range of addresses (see Load Multiple Addressing Modes) |

Operand | Behaviour |
---|---|
<Rn> | Specifies the base register used by <addressing_mode>, which can be optionally written back to if followed by ! |
<registers> | Specifies the list of registers to be loaded, separated by commas and surrounded by { and } |

Mode | Name |
---|---|
IB/ED | Increment Before/Empty Descending Stack |
IA/FD | Increment After/Full Descending Stack |
DB/EA | Decrement Before/Empty Ascending Stack |
DA/FA | Decrement After/Full Ascending Stack |
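
To make the four modes concrete, the sketch below computes the lowest address accessed and the value written back to the base register for a transfer of `count` registers, following the definitions in the ARM manual. `MultipleMode` and `block_addresses` are hypothetical names for this example. Registers are transferred in ascending order, with the lowest-numbered register at the lowest address.

```rust
/// The four load/store multiple addressing modes (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MultipleMode {
    IA, // Increment After
    IB, // Increment Before
    DA, // Decrement After
    DB, // Decrement Before
}

/// Returns (lowest address accessed, base register value if write-back is requested).
pub fn block_addresses(base: u32, count: u32, mode: MultipleMode) -> (u32, u32) {
    let size = count.wrapping_mul(4); // each register occupies one word
    match mode {
        MultipleMode::IA => (base, base.wrapping_add(size)),
        MultipleMode::IB => (base.wrapping_add(4), base.wrapping_add(size)),
        MultipleMode::DA => (base.wrapping_sub(size).wrapping_add(4), base.wrapping_sub(size)),
        MultipleMode::DB => (base.wrapping_sub(size), base.wrapping_sub(size)),
    }
}
```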
Stores values from multiple registers into sequential memory locations.
STM{<cond>}<addressing_mode> <Rn>{!}, <registers>
Suffix | Behaviour |
---|---|
<cond> | Specifies under what circumstances the instruction should be executed (see Condition Flags) |
<addressing_mode> | Specifies how to produce a sequential range of addresses (see Load Multiple Addressing Modes) |

Operand | Behaviour |
---|---|
<Rn> | Specifies the base register used by <addressing_mode>, which can be optionally written back to if followed by ! |
<registers> | Specifies the list of registers to be stored, separated by commas and surrounded by { and } |

Mode | Name |
---|---|
IB/FA | Increment Before/Full Ascending Stack |
IA/EA | Increment After/Empty Ascending Stack |
DB/FD | Decrement Before/Full Descending Stack |
DA/ED | Decrement After/Empty Descending Stack |
Calls a system function.
SVC{<cond>} <immed_24>
Suffix | Behaviour |
---|---|
<cond> | Specifies under what circumstances the instruction should be executed (see Condition Flags) |

Operand | Behaviour |
---|---|
<immed_24> | Specifies what system function is being requested (see System Functions) |

Function | Behaviour |
---|---|
0 | Outputs the character in R0 |
1 | Inputs a character into R0 |
2 | Halts the program |
3 | Outputs the C string starting at the address in R0 |
4 | Outputs the number in R0 as a decimal |
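
A minimal sketch of how an emulator might dispatch these system functions is shown below. It is not the emulator's actual implementation: `handle_svc` and `SvcOutcome` are hypothetical names, output is collected into a `String` to keep the example self-contained, and the choices to print R0 as an unsigned number and to return 0 when no input is available are assumptions.

```rust
/// What the emulator should do after the call (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
pub enum SvcOutcome {
    Continue,
    Halt,
}

/// Handles one SVC call. Returns None for an unknown function number.
pub fn handle_svc(
    code: u32,
    r0: &mut u32,
    memory: &[u8],
    input: &mut dyn Iterator<Item = u8>,
    output: &mut String,
) -> Option<SvcOutcome> {
    match code {
        // 0: output the character in R0 (low byte only).
        0 => output.push(char::from(*r0 as u8)),
        // 1: read a character into R0 (assumption: 0 when input is exhausted).
        1 => *r0 = input.next().map(u32::from).unwrap_or(0),
        // 2: halt the program.
        2 => return Some(SvcOutcome::Halt),
        // 3: output the NUL-terminated string starting at the address in R0.
        3 => {
            let start = (*r0 as usize).min(memory.len());
            for &byte in memory[start..].iter().take_while(|&&b| b != 0) {
                output.push(char::from(byte));
            }
        }
        // 4: output R0 as a decimal number (assumption: treated as unsigned).
        4 => output.push_str(&(*r0).to_string()),
        _ => return None,
    }
    Some(SvcOutcome::Continue)
}
```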
Loads an address into a register.
This is a pseudo-instruction that will be replaced with either one or two data processing instructions. Not all addresses can be generated with a single instruction, so the L flag exists to allow two instructions to be generated.
ADR{<cond>}{L} <Rd>, <target_address>
Suffix | Behaviour |
---|---|
<cond> | Specifies under what circumstances the instruction should be executed (see Condition Flags) |
L | Specifies whether to allow assembling this pseudo-instruction into two data processing instructions rather than one, allowing for a wider range of addresses |

Operand | Behaviour |
---|---|
<Rd> | Specifies the destination register |
<target_address> | Specifies the address to load |
Reserves one or multiple bytes of space in memory and puts initial values in them.
This is an assembler directive, and will not generate any actual instructions.
DEFB <expression>{, ...}
Operand | Behaviour |
---|---|
<expression> | Specifies the value to put in the byte |
string DEFB "Hello", 0
Reserves one or multiple words of space in memory and puts initial values in them.
This is an assembler directive, and will not generate any actual instructions.
DEFW <expression>{, ...}
Operand | Behaviour |
---|---|
<expression> | Specifies the value to put in the word |
square table DEFW 0, 1, 4, 9, 16, 25
Reserves a byte of space in memory and puts an initial value in it.
This is an assembler directive, and will not generate any actual instructions.
DEFB <expression>
Operand | Behaviour |
---|---|
<expression> | Specifies the value to put in the byte |
Reserves a block of space in memory.
This is an assembler directive, and will not generate any actual instructions.
DEFS <size> {, <fill>}
Operand | Behaviour |
---|---|
<size> | Specifies the size of the block to reserve |
<fill> | Specifies an optional value to fill each byte in the space with |
Sets the address of the following code.
This is an assembler directive, and will not generate any actual instructions.
ORIGIN <target_address>
Operand | Behaviour |
---|---|
<target_address> | Specifies the address to place the following code |
Aligns the following code to the next word boundary.
This is an assembler directive, and will not generate any actual instructions.
ALIGN
Places the following code at the start of the program, serving as the entry point.
This is an assembler directive, and will not generate any actual instructions.
ENTRY
Defines a name for a literal value.
This is an assembler directive, and will not generate any actual instructions.
discount EQU 100
...
SUB R5, R2, #discount
The assembler is broken down into multiple stages and uses multiple intermediate representations. I've found this makes the code more modular and easier to reason about. These are mostly zero-cost abstractions as they make heavy use of Rust Iterators. The only point where the entire program has to be taken into account is the symbol resolution step. Because this is the only intermediate step that makes a complete pass over the program, the assembler can still be considered a two-pass process, like most assemblers.
Converts a string to tokens.
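
As an illustration of what this stage does, here is a deliberately simplified sketch of a tokenizer for a single line. The `Token` type and `lex` function are hypothetical and much coarser than a real lexer would be (numbers and identifiers are both treated as words). The `;` comment delimiter is taken from the LDR example earlier in this document.

```rust
/// A very coarse token type (hypothetical, for illustration only).
#[derive(Debug, PartialEq, Eq)]
pub enum Token {
    Word(String),    // labels, mnemonics, registers, numbers
    Symbol(char),    // punctuation such as , # [ ] ! = { }
    Comment(String), // everything after ';'
}

/// Splits one line of assembly into tokens.
pub fn lex(line: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = line.chars().peekable();
    while let Some(&c) = chars.peek() {
        match c {
            c if c.is_whitespace() => {
                chars.next();
            }
            ';' => {
                chars.next();
                // The rest of the line is the comment text.
                let text: String = chars.by_ref().collect();
                tokens.push(Token::Comment(text.trim().to_string()));
            }
            c if c.is_alphanumeric() || c == '_' => {
                let mut word = String::new();
                while let Some(ch) = chars.next_if(|ch| ch.is_alphanumeric() || *ch == '_') {
                    word.push(ch);
                }
                tokens.push(Token::Word(word));
            }
            other => {
                chars.next();
                tokens.push(Token::Symbol(other));
            }
        }
    }
    tokens
}
```

Because `lex` only borrows the input and works character by character, it could equally be written as an iterator adapter, which is closer in spirit to the iterator-based design described above.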
Converts tokens to an AST (Abstract Syntax Tree), consisting of Statements. A Statement can contain a label and a comment, both optional.
The AST is a one-to-one structured representation of what the user wrote, making it easier to work with. No other alterations are made at this stage.
Example:
start CMP R6, R4
gets converted to
Line {
    label: Some(
        "start",
    ),
    statement: Some(
        Instruction {
            kind: DataProcessing {
                condition: AL,
                kind: Comparison {
                    kind: CMP,
                    source: Register(
                        6,
                    ),
                    shifter: Register(
                        Register(
                            4,
                        ),
                    ),
                },
            },
        },
    ),
}
Builds the symbol table and literal pools.
Converts the AST to a High-Level Intermediate Representation (HIR) and constructs a symbol table. The Statements are converted to Instructions (with the Operands left unresolved).
In this stage, comments and empty lines are discarded, directives are applied, and pseudo-instructions are expanded. This lets us decide the final memory addresses of each instruction and piece of data.
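
The address assignment described here can be sketched roughly as follows. This is a simplified illustration, not the repository's HIR: the `Item` enum and `build_symbol_table` are hypothetical, every instruction is assumed to occupy 4 bytes, and literal pools are left out.

```rust
use std::collections::HashMap;

/// A simplified stand-in for an HIR element (hypothetical, for illustration only).
pub enum Item {
    Label(String), // a label attached to the next address
    Instruction,   // any instruction, 4 bytes
    Bytes(usize),  // data of the given size, e.g. from DEFB or DEFS
    Align,         // ALIGN directive
    Origin(u32),   // ORIGIN directive
}

/// Walks the program once, tracking the current address and recording labels.
pub fn build_symbol_table(items: &[Item]) -> HashMap<String, u32> {
    let mut symbols = HashMap::new();
    let mut address = 0u32;
    for item in items {
        match item {
            Item::Label(name) => {
                symbols.insert(name.clone(), address);
            }
            Item::Instruction => address += 4,
            Item::Bytes(len) => address += *len as u32,
            // Round up to the next word boundary.
            Item::Align => address = (address + 3) & !3,
            Item::Origin(target) => address = *target,
        }
    }
    symbols
}
```

Once every label has an address, a later stage can resolve operands that refer to labels, including ones defined further down the program.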
Converts the High-Level Intermediate Representation (HIR) to a Low-Level Intermediate Representation (LIR) by resolving symbols and encoding immediates.
The LIR is a one-to-one structured representation of the machine code. This is also the format used by the emulator.
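
Encoding immediates is worth a short illustration. In the ARM data processing format, an immediate operand is an 8-bit value rotated right by an even number of bits, so not every 32-bit constant can be represented. The sketch below searches for such an encoding; `encode_immediate` is a hypothetical helper, not the function used in this repository.

```rust
/// Tries to encode `value` as the 12-bit immediate field of a data
/// processing instruction: a 4-bit rotation in bits 11..8 and an
/// 8-bit value in bits 7..0. Returns None if no encoding exists.
pub fn encode_immediate(value: u32) -> Option<u32> {
    (0..16u32).find_map(|rot| {
        // value == imm8 ROR (2 * rot), so imm8 == value ROL (2 * rot).
        let imm8 = value.rotate_left(rot * 2);
        (imm8 <= 0xFF).then(|| (rot << 8) | imm8)
    })
}
```

A constant for which this returns `None` is the kind of value that the `LDR <Rd>, =<constant>` pseudo-instruction would need to load from a literal in memory instead.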
Converts Instructions and Data into 32-bit words.
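
As an example of this final step, the sketch below packs the fields of a data processing instruction into a 32-bit word using the bit layout from the ARM manual. The function and constant names are hypothetical and chosen for this example; the opcode values are the standard ARM ones for the instructions supported here.

```rust
pub const COND_AL: u32 = 0b1110;
pub const OP_SUB: u32 = 0b0010;
pub const OP_ADD: u32 = 0b0100;
pub const OP_CMP: u32 = 0b1010;
pub const OP_MOV: u32 = 0b1101;

/// Packs a data processing instruction into a 32-bit word.
/// `shifter` is the already-encoded 12-bit shifter operand field.
pub fn encode_data_processing(
    cond: u32,
    immediate: bool,
    opcode: u32,
    set_flags: bool,
    rn: u32,
    rd: u32,
    shifter: u32,
) -> u32 {
    (cond & 0xF) << 28              // bits 31..28: condition
        | u32::from(immediate) << 25 // bit 25: shifter operand is an immediate
        | (opcode & 0xF) << 21       // bits 24..21: opcode
        | u32::from(set_flags) << 20 // bit 20: S flag
        | (rn & 0xF) << 16           // bits 19..16: first operand register
        | (rd & 0xF) << 12           // bits 15..12: destination register
        | (shifter & 0xFFF)          // bits 11..0: shifter operand
}
```

Under these assumptions, `CMP R6, R4` (which always sets the flags and has no destination register) encodes as `0xE1560004`.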
There are some snapshot tests to check for regressions. These can be run using the cargo test command.