



Escola de Engenharia  
Universidade do Minho

**ESRG**  
EMBEDDED SYSTEMS  
RESEARCH  
GROUP

---

## **VeSPA**

---

*Industrial Electronics and Computers Engineering  
Embedded Systems and Computers*

Professor:  
Adriano Tavares

Afonso Maior, PG53959  
Mário Lima, PG54079

# Contents

|                                                 |            |
|-------------------------------------------------|------------|
| <b>Acronyms</b>                                 | <b>iii</b> |
| <b>List of Figures</b>                          | <b>iv</b>  |
| <b>1 Analysis</b>                               | <b>1</b>   |
| 1.1 Problem statement . . . . .                 | 1          |
| 1.2 Requirements . . . . .                      | 1          |
| 1.2.1 Functional requirements: . . . . .        | 1          |
| 1.2.2 Non-functional requirements: . . . . .    | 1          |
| 1.3 Constraints . . . . .                       | 1          |
| 1.3.1 Technical constraints: . . . . .          | 1          |
| 1.3.2 Non-technical constraints: . . . . .      | 1          |
| 1.4 CPU . . . . .                               | 2          |
| 1.5 Instruction formats and behaviour . . . . . | 2          |
| 1.6 Interrupt controller . . . . .              | 7          |
| 1.7 UART . . . . .                              | 7          |
| 1.8 Tools . . . . .                             | 8          |
| <b>2 Design</b>                                 | <b>9</b>   |
| 2.1 Datapath . . . . .                          | 9          |
| 2.1.1 ALU . . . . .                             | 10         |
| 2.1.2 Check Branch Condition . . . . .          | 11         |
| 2.1.3 Register File . . . . .                   | 11         |
| 2.2 Control Unit . . . . .                      | 12         |
| 2.2.1 Fetch . . . . .                           | 13         |
| 2.2.2 Decode . . . . .                          | 13         |
| 2.2.3 Execute . . . . .                         | 13         |
| 2.3 Memory . . . . .                            | 14         |
| 2.3.1 Memory organization . . . . .             | 15         |
| 2.4 Interrupt Controller . . . . .              | 15         |
| 2.4.1 Interrupt request handling . . . . .      | 16         |
| 2.4.2 Stack . . . . .                           | 17         |
| 2.5 UART . . . . .                              | 18         |
| 2.5.1 Baudrate generator . . . . .              | 18         |
| 2.5.2 UART TX . . . . .                         | 19         |
| 2.5.3 UART RX . . . . .                         | 20         |
| 2.5.4 System integration . . . . .              | 21         |
| <b>3 Implementation and Verification</b>        | <b>22</b>  |
| 3.1 Datapath . . . . .                          | 22         |
| 3.1.1 ALU . . . . .                             | 23         |
| 3.1.2 Check Branch Condition . . . . .          | 24         |
| 3.1.3 Register file . . . . .                   | 25         |
| 3.2 Control Unit . . . . .                      | 27         |
| 3.3 Memory . . . . .                            | 29         |
| 3.4 Interrupt Controller . . . . .              | 33         |

---

## Contents

|          |                              |           |
|----------|------------------------------|-----------|
| 3.5      | UART . . . . .               | 35        |
| 3.5.1    | Baudrate generator . . . . . | 35        |
| 3.5.2    | UART TX . . . . .            | 36        |
| 3.5.3    | UART RX . . . . .            | 37        |
| 3.6      | Top level . . . . .          | 38        |
| <b>4</b> | <b>Conclusion</b>            | <b>41</b> |
| <b>5</b> | <b>References</b>            | <b>42</b> |

# Acronyms

|             |                                             |
|-------------|---------------------------------------------|
| <b>FPGA</b> | Field-programmable Logic Array              |
| <b>IC</b>   | Interrupt Controller                        |
| <b>HDL</b>  | Hardware Description Language               |
| <b>ISA</b>  | Instruction Set Architecture                |
| <b>CPU</b>  | Central Processing Unit                     |
| <b>ALU</b>  | Arithmetic and Logic Unit                   |
| <b>UART</b> | Universal Asynchronous Receiver-Transmitter |
| <b>IRQ</b>  | Interrupt Request Lines                     |
| <b>ISR</b>  | Interrupt Service Routine                   |
| <b>RISC</b> | Reduced Instruction Set Computer            |

# List of Figures

|      |                                                                          |    |
|------|--------------------------------------------------------------------------|----|
| 1.1  | VeSPA basic instruction format . . . . .                                 | 2  |
| 1.2  | ALU instructions format 1 . . . . .                                      | 3  |
| 1.3  | ALU instructions format 2 . . . . .                                      | 3  |
| 1.4  | ALU instructions behaviour . . . . .                                     | 3  |
| 1.5  | BXX instruction format . . . . .                                         | 3  |
| 1.6  | BXX instruction behaviour . . . . .                                      | 3  |
| 1.7  | JMP instruction format . . . . .                                         | 4  |
| 1.8  | JMP instruction behaviour . . . . .                                      | 4  |
| 1.9  | JMPL instruction format . . . . .                                        | 4  |
| 1.10 | JMPL instruction behaviour . . . . .                                     | 4  |
| 1.11 | LD/LDI instructions format . . . . .                                     | 5  |
| 1.12 | LD/LDI instruction behaviour . . . . .                                   | 5  |
| 1.13 | LDX instruction format . . . . .                                         | 5  |
| 1.14 | LDX instruction behaviour . . . . .                                      | 5  |
| 1.15 | ST instruction format . . . . .                                          | 5  |
| 1.16 | ST instruction behaviour . . . . .                                       | 5  |
| 1.17 | STX instruction format . . . . .                                         | 6  |
| 1.18 | STX instruction behaviour . . . . .                                      | 6  |
| 1.19 | PUSH instruction format . . . . .                                        | 6  |
| 1.20 | PUSH instruction behaviour . . . . .                                     | 6  |
| 1.21 | POP instruction format . . . . .                                         | 7  |
| 1.22 | POP instruction behaviour . . . . .                                      | 7  |
| 1.23 | UART protocol representation . . . . .                                   | 7  |
| 2.1  | Datapath design . . . . .                                                | 9  |
| 2.2  | ALU functional unit design . . . . .                                     | 10 |
| 2.3  | Split register file diagram . . . . .                                    | 11 |
| 2.4  | Copy register file diagram . . . . .                                     | 12 |
| 2.5  | Control unit state machine . . . . .                                     | 13 |
| 2.6  | Memory design . . . . .                                                  | 14 |
| 2.7  | Memory organization design . . . . .                                     | 15 |
| 2.8  | Interrupt controller connections diagram . . . . .                       | 16 |
| 2.9  | Additional input to PC data . . . . .                                    | 16 |
| 2.10 | Interrupt controller behaviour . . . . .                                 | 17 |
| 2.11 | Connections between UART modules . . . . .                               | 18 |
| 2.12 | Baudrate generator design . . . . .                                      | 19 |
| 2.13 | UART transmission module state machine . . . . .                         | 19 |
| 2.14 | UART reception module state machine . . . . .                            | 20 |
| 2.15 | Integrated system modules . . . . .                                      | 21 |
| 3.1  | ALU simulation . . . . .                                                 | 24 |
| 3.2  | Check condition simulation . . . . .                                     | 25 |
| 3.3  | Register file IP configurations . . . . .                                | 25 |
| 3.4  | Register file without memory IP simulation . . . . .                     | 26 |
| 3.5  | Register file with memory IP simulation . . . . .                        | 27 |
| 3.6  | Register file with increased frequency and reading from port A . . . . . | 27 |

|      |                                                                          |    |
|------|--------------------------------------------------------------------------|----|
| 3.7  | Register file with increased frequency and reading from port B . . . . . | 27 |
| 3.8  | Memory IPs configurations . . . . .                                      | 29 |
| 3.9  | Configured memory IPs characteristics . . . . .                          | 29 |
| 3.10 | Initial memory simulation results . . . . .                              | 31 |
| 3.11 | Memory simulation results using IP . . . . .                             | 31 |
| 3.12 | Memory simulation results using IP and extra cycle . . . . .             | 31 |
| 3.13 | Tables implemented in database . . . . .                                 | 35 |
| 3.14 | UART TX simulation . . . . .                                             | 37 |
| 3.15 | UART RX simulation . . . . .                                             | 38 |
| 3.16 | Top level synthesis . . . . .                                            | 40 |

# 1 | Analysis

## 1.1 Problem statement

The project at hand involves the design and implementation of the VeSPA processor, a RISC architecture with distinctive features such as Load-Store, 32-bit data processing, Big-Endian byte order, and a 3-address machine configuration. Additionally, the project encompasses the integration of an assigned peripheral and an interrupt controller for enhanced functionality and system interaction.

Furthermore, the whole implementation was to be performed using Verilog HDL and tested using an FPGA.

## 1.2 Requirements

In order to ensure proper functioning of the system, a set of functional and non-functional requirements has to be met.

### 1.2.1 Functional requirements:

- Implement an instruction set adhering to RISC architecture;
- Support a Load-Store architecture;
- Support a byte addressable memory;
- Handle interrupts through the use of an interrupt controller;
- Support Big-Endian byte order;

### 1.2.2 Non-functional requirements:

- Ensure an efficient performance of the system;
- Be reliable to ensure proper functioning over time.

## 1.3 Constraints

There is also a set of boundaries on how the project can be executed that is defined by its constraints, whether technical or non-technical.

### 1.3.1 Technical constraints:

- The implementation is performed on a Zybo Z7-10 board;
- Implementation performed using Verilog HDL.
- CPU implementation split into datapath and control unit;
- Utilize memory IPs to provide better system performance.

### 1.3.2 Non-technical constraints:

- Project developed by a group of two members;
- Project deadline at the end of the semester.

## 1.4 CPU

A processor is a special circuit that is typically composed by two main components: the control unit and the datapath. This approach promotes a modular design that may result in a greater ease of development and a higher level of flexibility, this way being more adaptable to different possible requirements. Not only that, but this approach also possibly promotes an increase in performance as it makes it simpler to implement a possible pipeline mechanism that allows for different instructions to be in different stages of processing simultaneously. This implementation of the VeSPA microprocessor's ISA was also performed according to this methodology.

VeSPA has a RISC architecture, embodying the principles of simplicity and efficiency in its design. As a RISC processor, it employs a streamlined set of instructions, each executing in a single clock cycle, to optimize performance and enhance instruction throughput.

This processor follows a Load-Store architecture, adhering to the philosophy that arithmetic and logic operations are exclusively performed on data residing in registers. Memory interactions are facilitated through dedicated load and store instructions, ensuring a clear delineation between computational tasks and data movement.

As referred in the requirements, VeSPA's memory model is byte-addressable. This addressing capability provides flexibility and precision in memory access, accommodating a diverse range of data structures and addressing requirements.

Another feature of the VeSPA processor is its adherence to Big-Endian byte order. In this byte order, the most significant byte of a data word is stored at the lowest memory address, followed by subsequent bytes in decreasing significance.

This processor is also a 32-bit machine, this way being capable of addressing a significant amount of memory and still maintaining a streamlined architecture for efficient instruction execution.

In addition to its RISC architecture and memory characteristics, VeSPA operates as a 3-address machine, enhancing its program versatility and execution efficiency.

## 1.5 Instruction formats and behaviour

Since VeSPA has a RISC architecture, its instruction set is made up of a reduced set of instructions, with a fixed size (32 bits) and simple decoding. Therefore, instructions are composed of different aligned fields, and for all instructions, the first 5 bits define the opcode and the rest are dedicated to the operands.



Figure 1.1: VeSPA basic instruction format

- **ALU operations**

The instruction formats and behaviour for all the operations performed by the ALU are displayed in figures 1.2, 1.3 and 1.4 below.



Figure 1.2: ALU instructions format 1



Figure 1.3: ALU instructions format 2



Figure 1.4: ALU instructions behaviour

- **BXX**

The instruction format and algorithm for all the BXX operations are displayed in figures 1.5 and 1.6 below.



Figure 1.5: BXX instruction format



Figure 1.6: BXX instruction behaviour

- **JMP**

The instruction format and algorithm for the JMP operation are displayed in figures 1.7 and 1.8 below.



Figure 1.7: JMP instruction format



Figure 1.8: JMP instruction behaviour

- **JMPL**

The instruction format and algorithm for the JMPL operation are displayed in figures 1.9 and 1.10 below.



Figure 1.9: JMPL instruction format



Figure 1.10: JMPL instruction behaviour

- **LD/LDI**

The instruction format and algorithm for the LD and LDI operations are displayed in figures 1.11 and 1.12 below.



Figure 1.11: LD/LDI instructions format



Figure 1.12: LD/LDI instruction behaviour

- **LDX**

The instruction format and algorithm for the LDX operation are displayed in figures 1.13 and 1.14 below.



Figure 1.13: LDX instruction format



Figure 1.14: LDX instruction behaviour

- **ST**

The instruction format and algorithm for all the ST operation are displayed in figures 1.15 and 1.16 below.



Figure 1.15: ST instruction format



Figure 1.16: ST instruction behaviour

- **STX**

The instruction format and algorithm for the STX operation are displayed in figures 1.17 and 1.18 below.



Figure 1.17: STX instruction format



Figure 1.18: STX instruction behaviour

- **PUSH**

Given that the processor developed contains a stack, a push operation was added with the intent to provide the user a simple way of using the stack, in this case, by "pushing" data into it. The instruction format and algorithm for the PUSH operation are displayed in figures 1.19 and 1.20 below.



Figure 1.19: PUSH instruction format



Figure 1.20: PUSH instruction behaviour

- **POP**

Like the addition of the PUSH operation, a POP operation was also added to provide a simple way of retrieving data from the stack. The instruction format and algorithm for this operation are displayed in figures 1.21 and 1.22 below.



Figure 1.21: POP instruction format



Figure 1.22: POP instruction behaviour

## 1.6 Interrupt controller

An interrupt controller is a hardware component in a computer system that manages and prioritizes interrupt signals from various devices and peripherals. The referred interrupt-generating devices are connected to the interrupt controller through specific interrupt request lines (IRQs). They are then interpreted by the component in question which then sends signals to the CPU that determine the flow of execution. This way, the interrupt controller ensures that the CPU attends said interrupts in a controlled and organized manner.

Typically, an interrupt controller also allows for the disabling of the interrupts as a whole and individually (interrupt masking).

An interrupt vector is an address pointing to the location of a particular interrupt's corresponding ISR. As referred before, after receiving and interpreting an IRQ, the interrupt controller then sends signals to the CPU. These signals allow for the fetching of the correct interrupt vector and, consequentially, for the execution of the corresponding ISR.

## 1.7 UART

The previously referred peripheral assigned to be implemented was the UART. The UART (Universal Asynchronous Receiver-Transmitter) is a technology that allows data transfer between devices. As the name implies, this protocol is asynchronous, which means there is no shared clock signal between devices. Communication between devices is done through two main pins: TX for data transmission and RX for data reception. This peripheral allows data transfer in a bidirectional manner, which means devices can transmit and receive information simultaneously. Data is transferred bit by bit, with a transmission rate called baudrate. Regarding the protocol, the line is normally set to a high level, and, when transmission begins, the line is set to a low level (start bit). The message is then sent, bit by bit. Finally, the stop bit is sent, which sets the line high again. This process is represented in the following figure 1.23.



Figure 1.23: UART protocol representation

## 1.8 Tools

The project at hand makes use of Vivado as the main tool for the development of the project. Vivado is a design environment developed by Xilinx for the development of FPGA (Field-Programmable Gate Array) and SoC (System-on-Chip) designs. It provides a range of tools for designing, simulating, and implementing digital circuits on Xilinx FPGAs. Vivado supports a variety of design flows, including RTL (register-transfer level) design, high-level synthesis, and IP (intellectual property) integration. This tool also provides a user-friendly interface that makes its utilization simple and intuitive.

# 2 | Design

## 2.1 Datapath

The CPU's datapath is responsible for the execution of all operations upon the received data. This component determines how the data is transferred between different functional units such as the ALU and the register file, for example. In figure 2.1 presented below is displayed the system's datapath design. It is worth mentioning that the variables represented in blue simply act as connections between the different functional units, the ones in green and red are inputs and outputs to the datapath, respectively.



Figure 2.1: Datapath design

As depicted in the previous figure, the datapath receives and sends a number of inputs and outputs, respectively:

- **Inputs:**

- reset - Reset signal;
- clock - Clock signal;
- PCload - Program counter load;
- memRE - Memory read enable;
- memWE - Memory write enable;
- IRload - Instruction register load;
- rfWE - Register file write enable;
- SPload - Stack pointer load;
- ALU\_operation - Signals an ALU operation;
- pc\_sel - Multiplexer that mediates the source of the pc input;

- mem\_addr\_sel - Multiplexer that mediates the memory address;
- rf\_read2\_addr\_sel - Multiplexer that mediates the register file's read2 address;
- rf\_write\_sel - Multiplexer that mediates the source of the register file's data input;
- mem\_in\_sel - Multiplexer that mediates the data input to the memory block;
- sp\_in\_sel - Multiplexer that mediates the data input to the stack pointer block;
- slu\_op2\_sel - Multiplexer that mediates the source of the second operand used by the ALU block (imm16 or from the register file);

- **Outputs:**

- imm\_op - immediate operand extracted from the instruction register;
- opcode - Opcode extracted from the instruction register;
- Branch\_condition -Branch condition extracted from the instruction register;
- C - Carry flag controlled by ALU;
- Z - Zero flag controlled by ALU;
- N - Negative flag controlled by ALU;
- V - Overflow flag controlled by ALU;
- irq\_addr - Address of the IRQ's corresponding ISR;
- int\_ack\_complete - Acknowledgement of the attendance of a ISR;
- int\_ack\_complete - Acknowledgement of the attendance of a IRQ;

As referred above, the datapath contains different functional units. These will be described in the following subsections as well as their design for this specific system:

### 2.1.1 ALU

The ALU, as the name implies, is responsible for the handling of all of the arithmetic and logic operations (ADD, SUB, OR, AND, NOT, XOR, CMP). This functional unit has two inputs: one that comes from the register file "r1\_out" and one that can also come from the register file, although in this case, from "r2\_out", but also from the instruction register's extracted "imm\_16", depending on the operation. It's output acts as an input to the register file's data input and it is also responsible for the control of the condition codes (C, N, Z, V). The figure 2.2 displayed below illustrates the referred functional unit.



Figure 2.2: ALU functional unit design

### 2.1.2 Check Branch Condition

This functional unit handles the checking of the condition associated with operations of type branch (BRA, BNV, BCC, BCS, BVC, BVS, BEQ, BNE, BGE, BLT, BGT, BLE, BPL, BMI). To achieve this purpose, its inputs are the condition codes and the branch condition extracted from the instruction register, and it signals if the condition is checked or not.

### 2.1.3 Register File

As mentioned in the analysis phase, the Vespa is a 3-address machine. As such, to allow instructions like those in alu, in which the two operands come from the register file and the result goes to it, to execute in 1 clock cycle, it is necessary that the register file has 3 ports.

As the register file is a component that is located on a critical path of the datapath, it is necessary to implement memory using an IP, in order to make readings and writings more efficient.

However, after checking the Vivado documentation, it is not possible to configure a block RAM with 3 ports.

Therefore, two alternatives and respective trade-offs for implementing the register file will be presented below, in order to have 2 outputs and 1 data input.

#### Option 1

Implement the register file by split. To do this, it is necessary to instantiate two dual-port block RAMs with 16 addresses each, as you can see in figure x. To decide which of the two memories to access, the most significant bit of the address is used.

The advantage of this method is the reduction of resources spent, compared to the method presented below. The disadvantage is that a programmer can never read two registers from the same bank. However, this can be managed by the compiler. A diagram of this approach is displayed in figure 2.3 below.



Figure 2.3: Split register file diagram

**Option 2**

Implement the register file by copying. To do this, instead of instantiating a single dual-port block RAM with 32 addresses, two are instantiated, as can be seen in figure x. By combining the data inputs and the write enable signal, the two memories always have the same content, allowing one of each output to be used.

An advantage of this method is that, unlike the previous one, the user can read 2 registers from the same bank. However, this method takes up twice as much resources.

After analyzing the advantages and disadvantages, it was decided to use the second alternative, since the size of the register file is not very large and the board used has sufficient resources. A diagram of this approach is displayed in figure 2.4 below.



Figure 2.4: Copy register file diagram

## 2.2 Control Unit

A processor's control unit is responsible for managing the various operations of the CPU. The control unit works by activating and deactivating signals, attending to the desired operation to be performed, that will serve as inputs to the datapath. These two units work in unison to provide the desired functionality of the CPU. This component is implemented resorting to a state machine that is depicted in figure 2.5 displayed below.



Figure 2.5: Control unit state machine

Each state of the system activates a given number of signals that will serve as inputs to the datapath circuit. The sequence of steps the CPU follows to execute an instruction is typically divided into three stages that form a continuous cycle: fetch, decode and execute.

### 2.2.1 Fetch

In the fetch state, signals will be sent to the datapath so that the CPU retrieves an instruction from the memory.

### 2.2.2 Decode

The decode state comes after the fetch state and precedes the execute state, as is verifiable in the previously presented state machine. In this state, the fetched instruction's opcode will be analysed (or, as the name implies, "decoded") in order to understand what operation needs to be performed.

### 2.2.3 Execute

The execute state is responsible for executing the correct operation based on the decoded instruction provided by the decode state.

There are some other considerations regarding the design of the control unit state machine that are worth noting. As referred in the register file design, due to its latency when reading, there could be a need to add an extra cycle to cope with this condition. However, to avoid this delay in the execution, the decode estate was also used to read from the register file. With this being said, the only instructions that need an extra cycle are the ones that access the memory, so, an extra state follows those. This way, there is no wasted performance.

## 2.3 Memory

As mentioned in the analysis phase, following the Von Newman architecture, the code and data will be in the same memory. Therefore, it is important to define a good organization, as will be shown later.

According to Vespa specifications, the memory is byte addressable and the address and data bus is 32 bits.

The memory is single-port, which means that it has only one address input, one data input and one data output. Furthermore, it contains a write enable pin that allows writing and it also contains an output that indicates whether the memory has finished its initialization process after the reset.

In this project, an aligned memory will be implemented, which means that, as the memory is byte addressable, whenever there is an attempt to write or read at an address that is not divisible by 4, the 'error' signal is set to 1, alerting what happened. Later, this pin will be connected to the interrupt controller, thus allowing an exception to be raised.

Despite the address bus being 32 bits, it was decided to allocate 8192 bytes ( $2^{13}$ ) of memory, half for code and the other half for data, as will be shown later.

Since the memory is located in a critical area of the datapath, it is necessary to implement the memory using a memory IP. According to the Vivado documentation, all available IPs have at least one cycle of latency when reading. Given this, and to maintain synchronization with the rest of the processor, it is necessary to overcome the effect of latency. Two alternatives and their respective advantages and disadvantages will be presented below.

Since it is byte addressable and the data is 32 bits, it would be necessary to write/read 4 bytes simultaneously. To do this, an IP RAM block with 4 ports was needed, as can be seen in figure 2.6. However, according to the software documentation, IPs have a maximum of 2 ports.



Figure 2.6: Memory design

This way, so that it is possible to write/read 4 bytes simultaneously, the memory will be composed of 4 instances of a single-port block RAM.

Thus, the last 2 bits of the address serve to identify which of the 4 instances should be accessed and the remaining 11 bits will be used to provide the address to all 4 block RAMs.

A disadvantage of this method is that each instruction has to be divided into 4 to be placed in each memory block. Then it will be necessary for the compiler to generate 4 different files with the divided instructions.

**Option 1**

Use a PLL (Phase-locked loop) or a MMCM (Mixed-Mode Clock Manager) to double the clock frequency for memory. An advantage of this method is that it will not be necessary to add extra states in the control unit for instructions that read from memory. However, this has a disadvantage. Normally, the instructions that take the longest to execute are those that access memory, so these will force instructions that take less time to operate at half the frequency established for the memory.

**Option 2**

Use an extra cycle only for instructions that read from memory. This alternative proves to be the most suitable, as it allows all instructions to operate at a higher frequency.

**2.3.1 Memory organization**

As referred before, the VeSPA follows a Von Newmann architecture, which makes a proper memory organization even more important. The design of the system's memory organization is displayed in figure 2.7 below. The upper half of the memory is a RAM and is dedicated to the data, and the lower half is a ROM and is dedicated to the code memory. This division is achieved by the interpretation of the address most significant bit. If the most significant bit is 1, the system can write to it. As displayed, the stack is mapped in the top of the data memory and it grows downwards according to the current needs of the system. There is also an area where the developed peripheral is mapped, as shown in the figure as well.



Figure 2.7: Memory organization design

**2.4 Interrupt Controller**

The interrupt controller to be developed has some particular features that should be specified to provide a better insight of its inner-working:

- Capable of handling interruptions from up to 4 different sources;

- The interrupts have a fixed priority, where the interruption with the lowest index has the highest priority;
- Capable of handling the nesting of different interrupt sources.
- Capable of enabling and disabling the interrupts as a whole and also individually.

In order to implement the component in a way that it can support all of the referred features, the connection signals were established as displayed in figure 2.8 below.



Figure 2.8: Interrupt controller connections diagram

#### 2.4.1 Interrupt request handling

Considering the component's features and its design, an important trade-off regarding how the it's outputs are handled took place. When an interrupt request is received, there are three hypothesis on how to handle it:

##### Option 1:

An input sent directly from the control unit could be added to the multiplexer that mediates the data input of the PC, as displayed in figure 2.9, acting as the address of the correct IRQ on the interrupt vector. This option proves to be useful as it promotes a simpler implementation, however, there is a need to use an additional control bit for the multiplexer because of only one extra input.



Figure 2.9: Additional input to PC data

**Option 2:**

Perform no changes to the current datapath and control the signals sent from the control unit in order to use the JMP (PC = Rx + imm16) instruction. This option has the downside of the need to use an extra clock cycle to load the Rx value and store the Rx on a stack.

**Option 3:**

Use the instruction JMP with link. This option proves to be useful as there is no need for a stack implementation and for a change in the datapath, however, there is the need of a dedicated register on the register file for interrupt handling, which would imply additional resource expenditure and a less flexible system.

Out of the three possible solutions, the chosen one was the first, adding an input sent directly from the control unit to the multiplexer that mediates the data input of the PC, given that it is a solution that implies a minimal change to the datapath and very little extra resource expenditure.

## 2.4.2 Stack

As referred throughout the presentation of this subject, when an interrupt is received, it typically, as the name implies, "interrupts" the current execution flow of the processor and executes the contents of its ISR, as long as it is enabled. The nesting of the interrupts consists in providing the capability of interrupting the execution of an interrupt's corresponding ISR, when an interrupt of a higher priority is received. A stack memory was implemented given that this feature implied that more than one program counter can possibly need to be stored, so that after the ISR's execution, it can go to where it had left off.

The behaviour of the designed interrupt controller is displayed in the following figure 2.10.



Figure 2.10: Interrupt controller behaviour

## 2.5 UART

The UART peripheral, which protocol was explained in the analysis, can be divided into 3 sub-modules:

- Baudrate generator;
- UART TX;
- UART RX.

The figure 2.11 presented below represents the connections between these referred modules.



Figure 2.11: Connections between UART modules

In this project, a UART with configurable baudrate was implemented through a memory-mapped register (address 4100), however the data size is fixed (8 bits) and does not contain a parity bit.

### 2.5.1 Baudrate generator

As can be seen in the figure 2.12 below, the baudrate generator module receives as input a value that defines the baudrate (baudrate\_counter). This value can be calculated with the following expression:

$$\text{baudrate\_counter} = (\text{clk} \rightarrow 125\text{Mhz}) / (\text{desired baudrate})$$

Using this value and the clock, the baudrate tick is generated.



Figure 2.12: Baudrate generator design

### 2.5.2 UART TX

The UART transmission module can be represented by the state machine in figure 2.13.



Figure 2.13: UART transmission module state machine

Initially, in the "IDLE" state, the system waits for the start bit to reach 1. After that, it switches to the

"START" state, where the output "tx\_bit" is set to 0 and where the data to be sent is placed in the buffer internal. It then enters the "DATA" state where data is sent bit by bit. To do this, the internal buffer shifts right at each upward transition of the clock tick.

Finally, when the 8 bits are sent, it advances to the "STOP" state where the "tx\_bit" is set to 1, advancing again to the "IDLE" state.

### 2.5.3 UART RX

The UART reception module is represented by the state machine displayed in figure 2.14 below.



Figure 2.14: UART reception module state machine

In the beginning, the state "IDLE" waits for the "rx\_bit" input to be set, which is the start bit. After that, if the reception is enabled, the state is then "DATA" where the value of the "rx\_bit" input is stored at every positive edge of the clock in the most significant position of an internal buffer, shifting its content to the right.

After receiving 8 bits, the state turns to "STOP", where the buffer value is inserted in the output, before returning to the state "IDLE"

### 2.5.4 System integration

Having specified the design for each of the modules that compose the whole system, it becomes relevant to specify how they are connected between each other and how they communicate. Figure 2.15 displays how all of the referred modules are connected as well as the connections of their previously displayed respective wires.



Figure 2.15: Integrated system modules

# 3 | Implementation and Verification

## 3.1 Datapath

The datapath is a very important part of the processor. It is responsible for executing all instructions through signals coming from the control unit. The fetch process is executed there, as illustrated in the code below.

```
1 //update instruction register when ir_load = 1
2 always @(posedge clk) begin
3     if(reset == 1'b1) begin
4         IR = 0;
5     end
6     else if(ir_load == 1'b1) begin
7         IR = mem_out;
8     end
9 end
10
11
12 //update PC value
13 always @(posedge clk) begin
14     if(reset == 1'b1) begin
15         PC = `PC_INIT;
16     end
17     else if(pc_load == 1'b1) begin
18         PC = pc_in;
19     end
20 end
```

Algoritmo 3.1: Program counter and instruction register

As shown in the design phase, it is also composed of a series of multiplexers important for selecting some inputs, for example. Some of these multiplexers are represented in the code below and are also available in the attachment.

```
1 //alu operator 2 mux
2 assign alu_op2 = alu_op2_sel ? imm16_extend : rf_out2;
3
4 //alu data input mux
5 assign rf_write_data = (rf_write_sel == 2'b00) ? imm22_extend :
6     (rf_write_sel == 2'b01) ? PC :
7     (rf_write_sel == 2'b10) ? mem_out : alu_res;
8
9
10 //jump address (rs1 + imm16)
11 wire [`REG_MSB:0]jmp_addr;
12 assign jmp_addr = rf_out1 + imm16_extend;
13
14
15 //program counter input mux
16 wire [`REG_MSB:0]pc_in;
17 assign pc_in = (pc_sel == 3'b000) ? (jmp_addr) :
18     (pc_sel == 3'b001) ? (PC + imm23_extend) :
19     (pc_sel == 3'b010) ? (PC + 4) :
20     (pc_sel == 3'b011) ? (int_number << 2) + 4 : mem_out;
21
22
23 //memory address mux
24 wire [`REG_MSB:0]mem_addr;
25 assign mem_addr = (mem_addr_sel == 3'b000) ? SP :
26     (mem_addr_sel == 3'b001) ? imm22_extend :
27     (mem_addr_sel == 3'b010) ? (imm17_extend + rf_out1) :
28     (mem_addr_sel == 3'b011) ? 32'd4104 : PC;
29
30
31 //memory data input mux
32 wire [`REG_MSB:0]mem_in;
33 assign mem_in = (mem_in_sel == 2'b00) ? PC :
34     (mem_in_sel == 2'b01) ? uartRX_data: rf_out2;
```

Algoritmo 3.2: Datapath multiplexers

Furthermore, the ALU, register file and memory module are instantiated in it.

### 3.1.1 ALU

As specified in the design the ALU has 3 inputs: OP1 (operand 1), OP2 (operand 2) and alu\_operation. These are interpreted by the ALU, the result is sent through the output and the condition codes are updated accordingly. The implementation of this module is presented in the listing below.

```

1 //alu module
2 module alu(
3   input reset,
4   input [`REG_MSB:0] op1,
5   input [`REG_MSB:0] op2,
6   input [2:0] operation,
7   input enable,
8   output reg C,
9   output reg Z,
10  output reg N,
11  output reg V,
12  output reg [`REG_MSB:0] res
13 );
14
15
16 //ALU INSTRUCTION CODE DEFINE
17 `define ADD_ALU 3'd0
18 `define SUB_ALU 3'd1
19 `define OR_ALU 3'd2
20 `define AND_ALU 3'd3
21 `define NOT_ALU 3'd4
22 `define XOR_ALU 3'd5
23
24
25 //check if is a sub operation
26 //used to update overflow flag
27 wire sub;
28 assign sub = (operation == `SUB_ALU) ? 1'b1 : 1'b0;
29
30 //clear all condition flags
31 initial begin
32   C <= 0;
33   Z <= 0;
34   N <= 0;
35   V <= 0;
36   res <= 0;
37 end
38
39
40 always@(*) begin
41   if (reset == 1'b1) begin
42     C <= 0;
43     Z <= 0;
44     N <= 0;
45     V <= 0;
46     res <= 0;
47   end
48   else if(enable == 1'b1) begin
49     case(operation)
50       //sum operation
51       `ADD_ALU:
52         {C, res} = op1 + op2;
53
54       //subtraction operation
55       `SUB_ALU:
56         {C, res} = op1 - op2;
57
58       //logic OR operation
59       `OR_ALU:
60         res = op1 | op2;
61
62       //logic AND operation
63       `AND_ALU:
64         res = op1 & op2;
65
66       //logic NOT operation
67       `NOT_ALU:
68         res = ~op1;
69
70       //logic XOR operation
71       `XOR_ALU:
72         res = op1 ^ op2;
73
74     endcase
75
76     //update zero flag
77     Z <= ~(|res);
78
79     //update negative flag
80     N <= res[`REG_MSB];
81
82     //update overflow flag if is ADD OR SUB operation
83     if(operation == `ADD_ALU || operation == `SUB_ALU) begin

```

```

84         V <= (res[`REG_MSB] & ~op1[`REG_MSB] & ~(sub ^ op2[`REG_MSB]))
85         |(~res[`REG_MSB] & ~op1[`REG_MSB] & (sub ^ op2[`REG_MSB]));
86     end
87   end
88 endmodule

```

Algoritmo 3.3: ALU implementation

To test this implementation, the simulation displayed in figure 3.2 was performed. From the simulation it is possible to verify that the module is working correctly as the condition codes are being set and reset as they should according to the operations performed.



Figure 3.1: ALU simulation

### 3.1.2 Check Branch Condition

As referred in the design phase, this functional unit has the branch condition that comes from the instruction register and the condition codes as inputs and a single output that signals if the condition has been checked or not, which ultimately will result in the execution of the branch or not. The implementation of this component is displayed in the listing below.

```

1 //check branch condition module
2 module check_cond(
3   input [3:0]cond,
4   input c,
5   input z,
6   input n,
7   input v,
8   output reg out
9 );
10
11   always @(*) begin
12     case (cond)
13       `BRA: //branch always
14         out <= 1;
15
16       `BNV: //branch never
17         out <= 0;
18
19       `BCC: //branch if not carry
20         out <= ~c;
21
22       `BCS: //branch if carry
23         out <= c;
24
25       `BVC: //branch if not overflow
26         out <= ~v;
27
28       `BVS: //branch if overflow
29         out <= v;
30
31       `BEQ: //branch if equal
32         out <= z;
33
34       `BNE: //branch if not equal
35         out <= ~z;
36
37       `BGE: //branch if greater than or equal
38         out <= (~n & ~v) | (n & v);
39
40       `BLT: //branch if less than

```

```

41         out <= (n & ~v) | (~n & v);
42
43     `BGT: //branch if greater than
44         out <= ~z & ((~n & ~v) | (n & v));
45
46     `BLE: //branch if less than or equal
47         out <= z | ((n & ~v) | (~n & v));
48
49     `BPL: //branch if plus
50         out <= ~n;
51
52     `BMI: //branch if minus
53         out <= n;
54
55     endcase
56 end
endmodule

```

Algoritmo 3.4: Check branch condition implementation

To test this implementation, the simulation displayed in figure 3.2 was performed. From the simulation it is possible to verify that the module is working correctly as the "out" signal is set when the branch condition and the condition codes match.



Figure 3.2: Check condition simulation

### 3.1.3 Register file

To implement the register file, the first thing was using Vivado's memory configurator, configuring the settings as displayed in figure 3.3. We started by defining the memory as 'True Dual Port Ram' and changed the size and number of registers to 32. The 'RSTA Pin' option was activated so that it was possible to reset the memory and have a output that indicates when the memory has finished its initialization process.



Figure 3.3: Register file IP configurations

The following module was then developed where 2 blocks of memory were instantiated and connected in order to create the register file by copy as explained in the design.

```

1  `define REG_MSB      31      //register most significant bit
2  `define RF_SIZE      32      //register file size
3
4
5
6 //register file module
7 module registerFile(
8   input clk,
9   input rst,
10  input we,
11  input [4:0]w_addr,
12  input [4:0]r1_addr,
13  input [4:0]r2_addr,
14  input [`REG_MSB:0]data_in,
15  output [`REG_MSB:0]read1_out,
16  output [`REG_MSB:0]read2_out,
17  output initialized
18 );
19
20
21 wire rsta_busy_rf1;
22 wire rsta_busy_rf2;
23 wire rstb_busy_rf1;
24 wire rstb_busy_rf2;
25
26 assign initialized = (rsta_busy_rf1 & rstb_busy_rf1 & rsta_busy_rf2 & rstb_busy_rf2);
27
28 RAM32X32B rf1(.clkb(clk),
29   .rsta(rst),
30   .rstb(rst),
31   .web(web),
32   .addrb(w_addr),
33   .dinb(data_in),
34   .clka(clk),
35   .wea(1'b0),
36   .addra(r1_addr),
37   .douta(read1_out),
38   .dina(0),
39   .rsta_busy(rsta_busy_rf1),
40   .rstb_busy(rstb_busy_rf1)
41 );
42
43 RAM32X32B rf2(.clkb(clk),
44   .rsta(rst),
45   .rstb(rst),
46   .web(web),
47   .addrb(w_addr),
48   .dinb(data_in),
49   .clka(clk),
50   .wea(1'b0),
51   .addra(r2_addr),
52   .douta(read2_out),
53   .dina(0),
54   .rsta_busy(rsta_busy_rf2),
55   .rstb_busy(rstb_busy_rf2)
56 );
57
58 endmodule

```

Algoritmo 3.5: UART TX implementation

As can be seen in the summary of figure 3.3, the reading latency on both ports is 1 clock cycle, however on port B it indicates that it is from the rising edge of the reading clock. So, for testing purposes, the register file without IP was developed, so that latency could be checked.

Then, a test bench was carried out and simulated with the register file without IP, obtaining the result shown in figure 3.4.



Figure 3.4: Register file without memory IP simulation

Then, using the same test bench, but changing the register file with IP, the result in figure 3.5 was obtained. It can be seen that in fact the reading is not done correctly and that there is in fact a delay cycle.



Figure 3.5: Register file with memory IP simulation

Then, the clock frequency for the memory was doubled and the reading was done from port A and the result in figure 3.6 was obtained. In this way, it can be seen that the result is now the same as in memory without IP.



Figure 3.6: Register file with increased frequency and reading from port A

However, to verify the effect of latency on port B from the rising transition of the reading clock cycle, the reading was moved to port B, verifying that there is a greater delay, as displayed in figure 3.7.



Figure 3.7: Register file with increased frequency and reading from port B

Taking these simulations in account, it was decided to place reading from port A and writing from port B.

## 3.2 Control Unit

As presented in the design phase, the control unit is implemented through the use of a state machine. Said state machine's implementation is presented in the listing displayed below.

```

1  always @(posedge clk) begin
2      if(rst == 1'b1) begin
3          state <= s_START;
4      end
5
6      else begin
7          case(state)
8              s_START: begin
9                  if(blockRams_init == 1'b0)
10                     state <= s_FETCH;
11                 else
12                     state <= s_START;
13             end
14
15             s_FETCH: begin
16                 state <= s_DECODE;
17             end
18

```

```

19          s_DECODE: begin
20              if(opcode == s_BXX && branch_checked == 1'b0)
21                  state <= s_FETCH;
22              else
23                  state <= opcode;
24          end
25
26          s_HLT:
27              state <= s_HLT;
28
29          s_IRQ1: begin
30              state <= s_IRQ2;
31          end
32
33          default: begin
34              if(state == s_LD || state == s_LDX || state == s_ST || state == s_STX || state == s_PUSH ||
35                  state == s_POP || state == s_JMP || state == s_RETI || state == s_IRQ2 || state == s_BXX) begin
36
37                  state <= s_EXTRA;
38              end
39
40              else if(int_req) begin
41                  state <= s_IRQ1;
42              end
43
44              else
45                  state <= s_FETCH;
46          end
47      endcase
48  end
49

```

Algoritmo 3.6: UART TX implementation

As referred in a previous chapter of this document, the control unit sends specific signals to the datapath according to the current state of the system. The implementation of this mechanism is displayed in the listing presented below.

```

1 assign ir_load = (state == s_FETCH) ? 1'b1 : 1'b0;
2
3 assign pc_load = (state == s_JMP || state == s_FETCH || state == s_BXX || state == s_IRQ2 || state == s_RETI) ? 1'b1 : 1'b0;
4
5 assign pc_sel = (state == s_JMP) ? 3'b000 :
6     (state == s_BXX) ? 3'b001 :
7     (state == s_FETCH) ? 3'b010 :
8     (state == s_IRQ2) ? 3'b011 : 3'b100;
9
10 assign rf_read2_addr_sel = (-(state == s_FETCH) && ~(state == s_EXTRA) && (opcode == s_ST || opcode == s_STX)) ? 1'b1 : 1'b0;
11
12 assign rf_write_sel = (state == s_JMP && imm_op == 1'b1) ? 2'b01 :
13     (state == s_LD || state == s_LDX) ? 2'b10 :
14     (state == s_LDI) ? 2'b00 : 2'b11;
15
16 assign we_mem = (state == s_ST || state == s_STX || state == s_IRQ2 || state == s_MAP) ? 1'b1 : 1'b0;
17
18 assign we_rf = (state == s_ADD || state == s_SUB || state == s_OR ||
19     state == s_AND || state == s_NOT || state == s_XOR ||
20     state == s_LD || state == s_LDI || state == s_LDX ||
21     state == s_JMP && imm_op == 1'b1) ? 1'b1 : 1'b0;
22
23 assign alu_ctrl = (state == s_SUB || state == s_CMP) ? 3'b001 :
24     (state == s_OR) ? 3'b010 :
25     (state == s_AND) ? 3'b011 :
26     (state == s_NOT) ? 3'b100 :
27     (state == s_XOR) ? 3'b101 : 3'b000;
28
29 assign alu_op2_sel = (imm_op == 1'b1) ? 1'b1 : 1'b0;
30
31 assign alu_en = (state == s_ADD || state == s_SUB || state == s_OR ||
32     state == s_AND || state == s_NOT || state == s_XOR ||
33     state == s_CMP) ? 1'b1 : 1'b0;
34
35 assign sp_load = (state == s_IRQ1 || state == s_POP ||
36     state == s_PUSH || state == s_RETI) ? 1'b1 : 1'b0;
37
38 assign sp_in_sel = (state == s_IRQ1 || state == s_PUSH) ? 1'b0 : 1'b1;
39
40 assign mem_addr_sel = ((state == s_DECODE && (opcode == s_RETI || opcode == s_POP)) || state == s_RETI ||
41     state == s_POP || state == s_IRQ2) ? 3'b000 :
42     ((state == s_DECODE && opcode == s_LD) || state == s_LD || state == s_ST) ? 3'b001 :
43     ((state == s_DECODE && opcode == s_LDX) || state == s_LDX || state == s_STX) ? 3'b010 :
44     (state == s_MAP) ? 3'b011 : 3'b100;
45
46 assign mem_in_sel = (state == s_IRQ1) ? 2'b00 :
47     (state == s_MAP) ? 2'b01 : 2'b10;
48
49 assign int_ack_complete = (state == s_RETI) ? 1'b1 : 1'b0;
50
51 assign int_ack_attended = (state == s_IRQ2) ? 1'b1 : 1'b0;

```

Algoritmo 3.7: UART TX implementation

### 3.3 Memory

As mentioned in the design, to implement byte-addressable memory, it was necessary to instantiate 4 block RAMs with 2048 addresses each. In figure 3.8 below you can see the settings made in the Vivado memory configurator.



Figure 3.8: Memory IPs configurations

After these configurations, it is possible to check all the characteristics of the generated IP in the memory configurator summary, as displayed in figure 3.9 below.



Figure 3.9: Configured memory IPs characteristics

Below is the code created, where you can check the four block RAM instances. A check has also been added to see if an attempt is made to write code memory and if an unaligned access attempt occurs.

```

1 module memorySynthesis(
2     input clk,
3     input rst,
4     input we,
5     input [`REG_MSB:0]addr,
6     input [`REG_MSB:0]data_in,
7     output [`REG_MSB:0]data_out,
8     output initialized,
9     output error
10 );
11
12     assign error = (addr[1:0] != 2'b00 || (we && addr[`USED_MEM_MSBT] == 1'b0)) ? 1'b1 : 1'b0;
13
14     wire [10:0]addr_aux;
15     assign addr_aux = addr[`USED_MEM_MSBT:2];
16
17
18     wire we_aux;
19     assign we_aux = we;
20
21
22     wire rsta_busy00;
23     wire rsta_busy01;
24     wire rsta_busy10;
25     wire rsta_busy11;
26
27     assign initialized = (rsta_busy00 & rsta_busy01 & rsta_busy10 & rsta_busy11); //memory init
28
29 //.....BIG-ENDIAN.....
30
31 //first 8 most significant bits of data
32 MEM00 mem0(.clka(clk),
33             .rsta(rst),
34             .wea(we_aux),
35             .addr(addr_aux),
36             .dina(data_in[31:24]),
37             .douta(data_out[31:24]),
38             .rsta_busy(rsta_busy00)
39         );
40
41 //second 8 most significant bits of data
42 MEM01 mem1(.clka(clk),
43             .rsta(rst),
44             .wea(we_aux),
45             .addr(addr_aux),
46             .dina(data_in[23:16]),
47             .douta(data_out[23:16]),
48             .rsta_busy(rsta_busy01)
49         );
50
51 //third 8 most significant bits of data
52 MEM10 mem2(.clka(clk),
53             .rsta(rst),
54             .wea(we_aux),
55             .addr(addr_aux),
56             .dina(data_in[15:8]),
57             .douta(data_out[15:8]),
58             .rsta_busy(rsta_busy10)
59         );
60
61 //8 less significant bits of data
62 MEM11 mem3(.clka(clk),
63             .rsta(rst),
64             .wea(we_aux),
65             .addr(addr_aux),
66             .dina(data_in[7:0]),
67             .douta(data_out[7:0]),
68             .rsta_busy(rsta_busy11)
69         );
70
71 endmodule

```

Algoritmo 3.8: Memory implementation using IP

For the purpose of testing the need for an extra cycle when reading memory by IP, the memory was also implemented without using an IP, as shown in the code below.

```

1 `define MEM_REG_MSB      7      //memory register most significant bit
2 `define REG_MSB        31      //register most significant bit
3 `define MEM_SIZE       (1<<8) //memory size
4 `define USED_MEM_MSB    7      // address most significant bit
5
6
7 module memory(
8     input clk,
9     input we,
10    input rst,
11    input [`REG_MSB:0]addr,
12    input [`REG_MSB:0]data_in,

```

```

13     output [`REG_MSB:0] data_out
14 );
15
16 reg [`MEM_REG_MSB:0] MEM [`MEM_SIZE - 1:0];
17
18 always @ (posedge clk) begin
19     if (we==1'b1 && addr<`MEM_SIZE) begin
20         $display(addr);
21         {MEM[addr], MEM[addr+1], MEM[addr+2], MEM[addr+3]} = data_in; //big-endian
22     end
23 end
24
25 always @ (posedge clk) begin
26     if (rst == 1'b1) begin
27         {MEM[0], MEM[1], MEM[2], MEM[3]} = {5'd11, 5'd1, 22'd4}; //LDI: R1 = 4
28         {MEM[4], MEM[5], MEM[6], MEM[7]} = {5'd9, 5'd7, 5'd1, 1'd1, 16'd12}; //JMP: L:R1
29         {MEM[16], MEM[17], MEM[18], MEM[19]} = {5'd7, 5'd0, 5'd1, 1'd1, 16'd4}; //CMP:
30         {MEM[20], MEM[21], MEM[22], MEM[23]} = {5'd8, 4'd3, 23'd16}; //BEQ
31         {MEM[40], MEM[41], MEM[42], MEM[43]} = {5'd0, 27'd0}; //NOP
32         {MEM[44], MEM[45], MEM[46], MEM[47]} = {5'd11, 5'd2, 22'd3}; //LDI: R2 = 3
33         {MEM[48], MEM[49], MEM[50], MEM[51]} = {5'd1, 5'd3, 5'd1, 1'd0, 5'd2, 11'd0}; //ADD: R3 = R1 + R2
34         {MEM[52], MEM[53], MEM[54], MEM[55]} = {5'd31, 27'd0}; //HALT
35     end
36 end
37
38 assign data_out = {MEM[addr], MEM[addr+1], MEM[addr+2], MEM[addr+3]};
39
40
41 initial begin
42     $readmemb("/home/mariolima/Desktop/embedded_systems/VESPA_BHV/code.txt", MEM);
43 end
44 endmodule

```

Algoritmo 3.9: Memory implementation for testing

After simulating, the following results were obtained (figure 3.10).



Figure 3.10: Initial memory simulation results

Then, using the same test bench, it was changed to memory with IP and simulated. As seen in figure 3.11, the instructions are not being read correctly.



Figure 3.11: Memory simulation results using IP

Then an extra cycle was added and, in this way, the same result was obtained as when simulating memory without IP (figure 3.12).



Figure 3.12: Memory simulation results using IP and extra cycle

As mentioned previously, one of the disadvantages of the memory being made up of 4 different blocks is that writing code in code memory can be a complicated task, as it is necessary to write the instruction in binary, and then divide it into 4 to place in each .coe file. To overcome this problem, a Python script was created so that it was possible to write the desired code in assembly and then automatically convert it to binary and divide it into different .coe files. The following code shows this script that can later be used by the compiler.

```

1 #function that makes the following steps:
2 # - read code.txt file
3 # - parsing the instruction
4 # - convert to binary and and write in binarycode.txt
5 # - divide the binary code and write in each .coe file
6 # - divide the binary code and write in each .coe file
7 def parse_and_generate_coe(input_filename, output_filenames):
8     files = []
9     first = True
10
11    with open(input_filename, 'r') as input_file:
12
13        for file_name in output_filenames:
14            id = open(file_name, 'w')
15            id.write("memory_initialization_radix=2;\n")           #coe header -> data format
16            id.write("memory_initialization_vector= \n")
17            files.append(id)
18
19
20        for line in input_file:
21            words = line.split()
22
23            instruction = words[0]
24            operands = words[1:]
25
26            print(f"parse: {instruction}, {operands}")
27
28            binary_representation = convert_instruction_to_binary(instruction, operands)
29
30            files[4].write(binary_representation + "\n")
31
32            binary = int(binary_representation, 2)
33
34            print(f"Binary Representation: {binary_representation}")
35            print(f"Binary : {binary}")
36
37            divided_bytes = divide_instruction(binary)
38
39            print(f"Divided Bytes: {divided_bytes}")
40
41            if first:
42                first = False
43
44            else:
45                files[0].write(",\n")
46                files[1].write(",\n")
47                files[2].write(",\n")
48                files[3].write(",\n")
49
50            files[3].write(format(int(divided_bytes[0]), '08b'))
51            files[2].write(format(int(divided_bytes[1]), '08b'))
52            files[1].write(format(int(divided_bytes[2]), '08b'))
53            files[0].write(format(int(divided_bytes[3]), '08b'))
54
55
56            files[0].write(";\n")
57            files[1].write(";\n")
58            files[2].write(";\n")
59            files[3].write(";\n")
60
61            files[0].close()
62            files[1].close()
63            files[2].close()
64            files[3].close()
65
66 if __name__ == "__main__":
67     print("START")
68
69     input_filename = "code.txt"
70     output_filenames = ["code00.coe", "code01.coe", "code10.coe", "code11.coe", "binaryCode.txt"]
71     parse_and_generate_coe(input_filename, output_filenames)

```

Algoritmo 3.10: Developed python script

This way, it is possible to write the code as follows and then generate the .coe files.

```

1 JMP 0 20          //reset
2 JMP 0 208         //irq source 1
3 JMP 0 160         //irq source 2

```

```

4      JMP 0 88          //irq source 3
5      JMP 0 208
6      LDI 0 0
7      LDI 30 3158749
8      ST 30 4100
9      LDI 30 95
10     ST 30 4096
11     LDI 4 0
12     LDI 2 0
13     LDI 3 0
14     LDI 1 0
15     XORI 1 1 1
16     SUBI 31 4 1
17     BEQ 12
18     NOP
19     NOP
20     JMP 0 48
21     HALT
22     NOP
23     NOP
24     NOP
25     ADDI 2 2 1
26     SUBI 31 2 10
27     BEQ 16
28     ADDI 3 2 48
29     ORI 3 3 256
30     ST 3 4108
31     RETI
32     NOP

```

Algoritmo 3.11: Python script utilization

### 3.4 Interrupt Controller

For the nested interrupt controller, the following module was developed. Briefly, whenever there is an upward transition from one of the sources, the pending interruptions variable is updated. Furthermore, according to the pending interruptions and their priority, the variable 'nextIrq' is updated with the number of the interruption to be serviced next. If there are no interruptions to be attended to, the variable has the value of 4 so that it is a value greater than the number of interruptions. It is then checked whether the interrupt being serviced has a lower priority than the next one. If this happens, a request is sent to the control unit.

Then, when the control unit responds to the interrupt, the controller receives an ack and sets the completed bit to 0 to indicate that the interrupt has started to be serviced.

If, however, a higher priority interrupt is received, the process repeats itself, sending a new request.

```

1  module interrupt_controller(
2      input rst,
3      input clk,
4      input [3:0] int_sources,
5      input int_ack_complete,
6      input int_ack_attended,
7      output reg int_req,
8      output [1:0]int_number,
9      input ea,
10     input en1,
11     input en2,
12     input en3
13 );
14
15
16 wire [3:0] int_sources_2;
17 wire [2:0]nextIrq;
18 reg [2:0]currentIrq;
19 reg [3:0]pending;
20 reg [3:0]completed;
21 reg [3:0]int_sources_prev;
22
23
24 //sources activation
25 assign int_sources_2[0] = int_sources[0] && ea;
26 assign int_sources_2[1] = int_sources[1] && en1 && ea;
27 assign int_sources_2[2] = int_sources[2] && en2 && ea;
28 assign int_sources_2[3] = int_sources[3] && en3 && ea;
29
30
31 //checks the next interrupt to be attended according to priorities
32 assign nextIrq = (pending[3] && ~pending[2] && ~pending[1] && ~pending[0]) ? 3'b011 :
33   (pending[2] && ~pending[1] && ~pending[0]) ? 3'b010 :

```

```

34             (pending[1] && ~pending[0]) ? 3'b001 :
35             (pending[0]) ? 3'b000 : 3'b100;
36
37
38 //asynchronous source checking
39 always@(*) begin
40     if(rst == 1'b1) begin
41         pending <= 4'b0000;
42         int_sources_prev <= 4'b0000;
43     end
44     else if(int_sources_2 != int_sources_prev) begin      //if any of the sources changed state
45         pending <= pending | int_sources_2;                  //when the source reaches zero it does not change
46         int_sources_prev <= int_sources_2;                  //so changes only on rising edge
47     end
48     else if(int_ack_complete == 1'b1) begin           //if receive ack, clear pending
49         pending <= pending & ~(4'b0001 << currentIrq);
50     end
51 end
52
53
54
55 always @ (posedge clk) begin
56     if(rst == 1'b1) begin
57         int_req <= 1'b0;
58         currentIrq <= 3'b100;
59         completed <= 4'b1111;
60     end
61     else begin
62         if(int_ack_complete == 1'b1) begin           //when receive an ack, update completed variable
63             completed <= completed | (4'b0001 << currentIrq);
64             currentIrq <= nextIrq;
65
66             //if one ends and the one that is about to enter hasn't started yet
67             if((completed >> nextIrq) & 4'b0001 == 4'b0001) && int_sources_2 != 4'b0000) begin
68                 int_req = 1'b1;
69             end
70         end
71         else if(currentIrq > nextIrq) begin        //if the next irq has higher priority
72             currentIrq <= nextIrq;
73             int_req <= 1'b1;                          //send new request
74         end
75         else if(int_ack_attended == 1'b1) begin    //if it was attended
76             completed <= completed & ~(4'b0001 << currentIrq);          //clear completed bit
77             int_req <= 1'b0;                          //clear request
78         end
79     end
80 end
81
82 assign int_number = currentIrq[1:0];
83
84 endmodule

```

Algoritmo 3.12: Interrupt controller implementation

As can be seen in the simulation, the upward transition from source 2 begins and from that moment on, an interrupt request is sent from the interrupt controller to the control unit. When the interrupt starts to be answered, the control unit sends an ack to the interrupt controller. After this, the upward transition of source 2 takes place. Since this has higher priority, the service routine for interrupt 3 is interrupted and the service routine for interrupt 2 starts to be executed. When the RETI of ISR2 is executed, the control unit sends another ack to inform that it was complete, after which the ISR3 continues its execution until it reaches the RETI instruction. During this entire process, you can check how the stack works correctly.



Figure 3.13: Tables implemented in database

## 3.5 UART

As mentioned in the design phase, the UART peripheral can be divided into 3 modules:

- Baudrate generator;
- UART TX;
- UART RX;

### 3.5.1 Baudrate generator

To implement the baudrate generator, a cycle counter from the main clock was used (125 Mhz). This module receives as input the value of clock ticks to be counted to generate the correct baudrate (counter = 125M / baudrate).

However, in this module, this value is divided by 2 in order to reduce this counter by half and thus, when the counter overflows, the baudrate tick changes state, thus generating a signal with 50% duty-cycle, as represented the following code.

```

1  module uart_baudrate_generator(
2  	input rst,
3  	input clk,
4  	input enable,
5  	input [19:0] baudrate_counter,
6  	output reg tick
7  );
8
9 	wire clk2;
10 	assign clk2 = clk & enable;           //if disabled, stop the internal clock
11
12 	reg [19:0] internal_counter;
13

```

```

14
15    always @(posedge clk) begin
16        if(rst) begin
17            tick <= 0;
18            internal_counter <= 20'd1;
19        end
20        else if (internal_counter == 1) begin
21            tick <= ~tick;                                //toggle tick bit
22            internal_counter <= {1'b0, baudrate_counter[19:1]}; //divide the counter by 2 and update internal counter
23        end
24        else begin
25            internal_counter <= internal_counter - 20'd1;      //decrement counter
26        end
27    end
28 endmodule

```

Algoritmo 3.13: Baudrate generator implementation

### 3.5.2 UART TX

The UART TX module is implemented according to the state machine shown in the design phase. The implementation of this module is displayed in the listing presented below.

```

1  `define DATA_SIZE 8
2
3 module uart_tx(
4     input tick,
5     input tx_start,
6     input [`DATA_SIZE-1:0] data_in,
7     input rst,
8     output reg tx_bit,
9     output reg tx_done
10 );
11
12 `define S_IDLE1 2'b00
13 `define S_START1 2'b01
14 `define S_DATA1 2'b10
15 `define S_STOP1 2'b11
16
17 reg [1:0]state;
18 reg [`DATA_SIZE-1:0] buffer;
19 reg [2:0]counter;
20
21
22 always@(posedge tick) begin
23     if(rst) begin
24         state = `S_IDLE1;
25         tx_bit <= 1'b1;
26     end
27     else begin
28         case(state)
29             `S_IDLE1: begin
30                 if(tx_start == 1'b1)           //wait for start command
31                     state = `S_START1;
32             end
33             `S_START1: begin
34                 tx_done <= 0;
35                 tx_bit <= 0;
36                 buffer <= data_in;          //loads the data to be sent to the buffer
37                 counter <= 7;              //reset counter
38                 state = `S_DATA1;
39             end
40         end
41         `S_DATA1: begin
42             tx_bit <= buffer[0];        //invit the less significant bit
43             counter <= counter - 1;     //decrement counter
44             buffer = {1'b0, buffer[7:1]}; //shift right buffer
45
46             if(counter == 0)
47                 state = `S_STOP1;
48         end
49     endcase
50
51     `S_STOP1: begin
52         tx_done <= 1;
53         tx_bit <= 1;                //stop bit
54         state = `S_IDLE1;
55     end
56
57     default:
58         state = `S_IDLE1;
59     endcase
60   end
61 endmodule

```

Algoritmo 3.14: UART TX implementation

In figure 3.14 presented below, it is possible to check the correct functioning of this module, where the value 0x4C was transmitted. In the buffer, it is also visible that the value is shifted to the right (divided by two). Furthermore, it is verified that the protocol is complied with, starting by setting the bit to 0 and ending with the bit to 1. At the end of the transmission it is also possible to verify that the "tx\_done" signal is at a high level.



Figure 3.14: UART TX simulation

### 3.5.3 UART RX

Like the UART TX, the UART RX was implemented using a state machine, as demonstrated in the design phase. The listing presented below shows the code developed to implement this module.

```

1 `define DATA_SIZE 8
2
3 module uart_rx(
4   input tick,
5   input rst,
6   input rx_en,
7   input rx_bit,
8   output reg [DATA_SIZE-1:0] data_out,
9   output reg rx_done
10 );
11
12   `define S_IDLE      2'b00
13   `define S_DATA      2'b01
14   `define S_STOP      2'b10
15
16   reg [3:0] counter;
17   reg [DATA_SIZE-1:0] buffer;
18   reg [1:0] state;
19
20   initial begin
21     state <= `S_IDLE;
22     rx_done <= 0;
23   end
24
25   always@(posedge tick) begin
26     if(rst) begin
27       state <= `S_IDLE;
28       rx_done <= 0;
29     end
30     else if(rx_en == 1'b1) begin
31       case(state)
32         `S_IDLE: begin
33           rx_done <= 0;
34
35           if(rx_bit == 1'b0) begin //wait for start bit
36             counter <= 0;          //reset counter
37             buffer <= 0;          //clear internal buffer
38             state <= `S_DATA;
39           end
40         end
41         `S_DATA: begin
42           buffer <= {rx_bit, buffer[7:1]}; //put received bit in the most significant buffer bit and shift right
43           counter = counter + 1;          //increment counter
44
45           if(counter[3])
46             state <= `S_STOP;          //if the counter overflows
47         end
48         `S_STOP: begin
49           data_out <= buffer;        //put received data in output
50           rx_done <= 1;
51           state <= `S_IDLE;
52         end
53         default:
54           state <= `S_IDLE;
55       endcase
56     end
57   end
58 endmodule

```

Algoritmo 3.15: UART RX implementation

Through the simulation presented below in figure 3.15, it is possible to verify the correct functioning of this module. Initially, it waits for 'rx\_bit' (din) to reach 0. From that moment on, with each upward transition of the baudrate tick, the value of 'rx\_bit' is stored in the buffer, shifting it to the left. After receiving 8 bits and 'rx\_bit' reaches 1, the value stored in the buffer is placed in the output.



Figure 3.15: UART RX simulation

## 3.6 Top level

With all off the modules correctly implemented and verified the top level was formed through the instantiation and connection of all the other developed blocks. The implementation of said top level is displayed in the listing below.

```

1  module vespa_toplevel(input clk,
2                         input rst,
3                         input rx_bit,
4                         output tx_bit,
5
6                         input [2:0]buttons, //only for test in leds
7                         output [4:0]leds, //only for test in leds
8                         output [7:0]result //only for test in leds
9 );
10
11 //wires to make connection are not represented here for space reasons
12
13
14 uart UART(.clk(clk),
15           .rst(rst),
16           .uart_en(uart_config[20]),
17           .rx_bit(rx_bit),
18           .rx_en(uart_config[21]),
19           .tx_start(tx_data[8]),
20           .tx_data(tx_data[7:0]), //8 bits
21           .baudrate(uart_config[19:0]),
22           .tx_done(tx_done),
23           .tx_bit(tx_bit),
24           .rx_done(rx_done),
25           .rx_data(rx_data)
26 );
27
28
29 controlUnit CONTROLUNIT(.clk(clk),
30                         .rst(rst),
31                         .alu_en(alu_en),
32                         .alu_ctrl(alu_ctrl),
33                         .CC(C),
34                         .ZZ(Z),
35                         .NN(N),
36                         .VV(V),
37                         .opcode(opcode),
38                         .branch_cond(branch_cond),
39                         .imm_op(imm_op),
40                         .blockRams_init(blockRams_init),
41                         .ir_load(IR_load),
42                         .pc_load(PC_load),
43                         .sp_load(sp_load),
44                         .we_mem(we_mem),
45                         .we_rf(we_rf),
46                         .pc_sel(pc_sel),
47                         .mem_addr_sel(mem_addr_sel),
48                         .rf_read2_addr_sel(rf_read2_addr_sel),
49                         .rf_write_sel(rf_write_sel),
50                         .mem_in_sel(mem_in_sel),
51                         .sp_in_sel(sp_in_sel),

```

```

52          .alu_op2_sel(alu_op2_sel),
53          .int_req(int_req),
54          .irq_addr(irq_addr),
55          .int_ack_complete(int_ack_complete),
56          .int_ack_attended(int_ack_attended),
57          .map_req(0),
58
59          .leds(leds) //only for test in leds
60      );
61
62
63 datapath DATAPATH(.clk(clk),
64     .reset(rst),
65     .we_rf(we_rf),
66     .we_mem(we_mem),
67     .pc_load(PC_load),
68     .ir_load(IR_load),
69     .sp_load(sp_load),
70     .irq_addr(irq_addr),
71     .alu_ctrl(alu_ctrl),
72     .alu_en(alu_en),
73     .C(C),
74     .Z(Z),
75     .N(N),
76     .V(V),
77     .opcode(opcode),
78     .imm_op(imm_op),
79     .branch_cond(branch_cond),
80     .rf_write_sel(rf_write_sel),
81     .pc_sel(pc_sel),
82     .mem_addr_sel(mem_addr_sel),
83     .rf_read2_addr_sel(rf_read2_addr_sel),
84     .alu_op2_sel(alu_op2_sel),
85     .mem_in_sel(mem_in_sel),
86     .sp_in_sel(sp_in_sel),
87     .blockRams_init(blockRams_init),
88     .error(error),
89     .uartRX_data({24'b0,rx_data}),
90     .uartTX_data(tx_data),
91     .uart_config(uart_config),
92     .irq_config(irq_config),
93     .int_number(int_number),
94
95     .result2(result2) //only for test in leds
96 );
97
98 assign irq_src1 = buttons[0];
99 assign irq_src2 = buttons[1];
100 assign irq_src3 = buttons[2];
101
102 interrupt_controller IRQ(.clk(clk),
103     .rst(rst),
104     .int_sources({irq_src3, irq_src2, irq_src1, error}),
105     .int_ack_complete(int_ack_complete),
106     .int_ack_attended(int_ack_attended),
107     .int_req(int_req),
108     .int_number(int_number),
109     .ea(uart_config[0]),
110     .en1(uart_config[1]),
111     .en2(uart_config[2]),
112     .en3(uart_config[3])
113 );
114
115
116 assign result = {int_req, int_number, result2}; //only for test in leds
117 endmodule

```

Algoritmo 3.16: Top level implementation

The following image shows a diagram of blocks generated by Vivado's Linter and comparing this with what was presented in the design phase it can be seen that the blocks and connections correspond.



Figure 3.16: Top level synthesis

## **4 | Conclusion**

The development of the VeSPA processor, the interrupt controller and the assigned peripheral were an opportunity to the group to solidify the concepts acquired over the course of the semester and to learn in a way that wouldn't be possible without "getting our hands dirty". This allowed us to gain experience in a way that wouldn't be possible without this project. This, without a doubt, made this curricular unit taught by professor Adriano Tavares much more rich and its completion leaves us better prepared for the next semester and, of course, for all future work. In the future, our intellectual curiosity will continue to be fuelled and this project will be continued over the course of the second semester, where, among other features, a pipeline will be implemented which will provide the processor a great increase in performance and leaves the group excited for the future and new challenges.

## 5 | References

[https://support.xilinx.com/s/knowledge-base?language=en\\_US&tabset-0c7e1=2](https://support.xilinx.com/s/knowledge-base?language=en_US&tabset-0c7e1=2)

[https://support.xilinx.com/s/topiccatalog?language=en\\_US](https://support.xilinx.com/s/topiccatalog?language=en_US)

<https://www.xilinx.com/support.htmldocumentation>