

**Department of Electronic and Telecommunication Engineering  
University of Moratuwa**

**EN 3030 - Circuits and Systems Design**



**PROJECT REPORT**

**DESIGN AND IMPLEMENTATION OF A  
CUSTOM PROCESSOR FOR MATRIX MULTIPLICATION**

| Name                   | Index Number |
|------------------------|--------------|
| ABEWICKRAMA D.S.N      | 170009C      |
| HETTIARACHCHI M.N      | 170222X      |
| THANUKA M.B.S.D        | 170614C      |
| WICKRAMASINGHE J.A.D.L | 170692K      |

This report is submitted in partial fulfillment of the requirements  
for the module EN 3030 - Circuits and Systems Design.

July 7, 2021

## Contents

|                                                  |            |
|--------------------------------------------------|------------|
| <b>List of Figures</b>                           | <b>iii</b> |
| <b>List of Tables</b>                            | <b>iv</b>  |
| <b>1 Abstract</b>                                | <b>1</b>   |
| <b>2 Introduction</b>                            | <b>2</b>   |
| 2.1 Processor Design . . . . .                   | 2          |
| 2.2 Central Processing Unit (CPU) . . . . .      | 2          |
| 2.3 Problem Statement . . . . .                  | 3          |
| 2.4 Proposed Solution based on FPGA . . . . .    | 3          |
| <b>3 Designing a FPGA based custom Processor</b> | <b>5</b>   |
| 3.1 What is FPGA? . . . . .                      | 5          |
| 3.2 Instruction Set Architecture (ISA) . . . . . | 5          |
| 3.2.1 Instructions Set . . . . .                 | 6          |
| 3.2.2 State Diagram . . . . .                    | 7          |
| 3.2.3 Processor Architecture . . . . .           | 7          |
| 3.3 Modules and Components . . . . .             | 10         |
| 3.3.1 Top level module . . . . .                 | 10         |
| 3.3.2 Processor . . . . .                        | 10         |
| 3.3.3 Data Memory (DRAM) . . . . .               | 11         |
| 3.3.4 Instruction Memory (IRAM) . . . . .        | 12         |
| 3.3.5 Control Unit . . . . .                     | 12         |
| 3.3.6 Arithmetic And Logic Unit (ALU) . . . . .  | 14         |
| 3.3.7 Registers . . . . .                        | 15         |
| 3.3.8 Accumulator . . . . .                      | 15         |
| 3.3.9 Bus . . . . .                              | 17         |
| 3.3.10 Memory Controller . . . . .               | 18         |

|                                                            |           |
|------------------------------------------------------------|-----------|
| 3.3.11 Clock Divider . . . . .                             | 19        |
| 3.3.12 CoreID . . . . .                                    | 20        |
| 3.3.13 End Core . . . . .                                  | 20        |
| <b>4 Algorithms and Design Considerations</b>              | <b>21</b> |
| 4.1 Matrix Multiplication Algorithm . . . . .              | 21        |
| 4.2 Assembly Code . . . . .                                | 25        |
| <b>5 Design Optimization</b>                               | <b>27</b> |
| 5.1 Minimizing the number of instructions in ISA . . . . . | 27        |
| <b>6 Problems Faced &amp; Solutions</b>                    | <b>28</b> |
| <b>7 Assembler and Simulator</b>                           | <b>29</b> |
| 7.1 Assembler . . . . .                                    | 29        |
| 7.2 Simulation . . . . .                                   | 30        |
| 7.2.1 Python Simulation . . . . .                          | 30        |
| 7.2.2 Model Sim Simulation . . . . .                       | 30        |
| <b>8 Performance Evaluation</b>                            | <b>31</b> |
| <b>References</b>                                          | <b>32</b> |
| <b>A Appendix</b>                                          | <b>33</b> |
| A.1 Assembly Code . . . . .                                | 33        |
| A.2 Top Level Module . . . . .                             | 34        |
| A.3 The Proessor . . . . .                                 | 36        |
| A.4 Control Unit . . . . .                                 | 38        |
| A.5 ALU . . . . .                                          | 53        |
| A.6 Register . . . . .                                     | 53        |
| A.7 Data Register . . . . .                                | 54        |
| A.8 Address Register . . . . .                             | 54        |
| A.9 Accumulator . . . . .                                  | 54        |

|                                            |    |
|--------------------------------------------|----|
| A.10 Bus . . . . .                         | 55 |
| A.11 Memory Controller . . . . .           | 56 |
| A.12 Clock Divider . . . . .               | 57 |
| A.13 CoreID . . . . .                      | 58 |
| A.14 End Core . . . . .                    | 58 |
| A.15 ALU Test Bench . . . . .              | 59 |
| A.16 Address Register Test Bench . . . . . | 60 |
| A.17 Bus Test Bench . . . . .              | 61 |
| A.18 Register Test Bench . . . . .         | 65 |
| A.19 Top Module Test Bench . . . . .       | 66 |
| A.20 Assembler code . . . . .              | 66 |
| A.21 Multicore Python Simulation . . . . . | 68 |

## List of Figures

|                                         |    |
|-----------------------------------------|----|
| 1 Basic Diagram of the CPU[1] . . . . . | 2  |
| 2 Task flow . . . . .                   | 3  |
| 3 State Diagram . . . . .               | 7  |
| 4 RTL Netlist View . . . . .            | 7  |
| 5 Single Core . . . . .                 | 8  |
| 6 Multi Core . . . . .                  | 9  |
| 7 Top Level Module . . . . .            | 10 |
| 8 Processor . . . . .                   | 10 |
| 9 DRAM . . . . .                        | 11 |
| 10 IRAM . . . . .                       | 12 |
| 11 Control Unit . . . . .               | 12 |
| 12 ALU . . . . .                        | 14 |
| 13 Register . . . . .                   | 15 |
| 14 Accumulator . . . . .                | 15 |

|    |                                               |    |
|----|-----------------------------------------------|----|
| 15 | Data Register . . . . .                       | 16 |
| 16 | Address Registers . . . . .                   | 16 |
| 17 | Bus . . . . .                                 | 17 |
| 18 | Memory Controller . . . . .                   | 18 |
| 19 | Register:Clock Divider . . . . .              | 19 |
| 20 | Register:CoreID . . . . .                     | 20 |
| 21 | Register:End Core . . . . .                   | 20 |
| 22 | Matrix Multiplication . . . . .               | 21 |
| 23 | The Single Core Algorithm . . . . .           | 22 |
| 24 | The Multi Core Algorithm . . . . .            | 24 |
| 25 | FPGA Development Board . . . . .              | 28 |
| 26 | Generated MIF file by the Assembler . . . . . | 29 |
| 27 | ModelSim Simulation . . . . .                 | 30 |
| 28 | Performance Analysis . . . . .                | 31 |

## List of Tables

|    |                                                          |    |
|----|----------------------------------------------------------|----|
| 1  | Instruction Set (ISA) . . . . .                          | 6  |
| 2  | Inputs and outputs of top level module . . . . .         | 10 |
| 3  | Inputs and outputs of the Processor . . . . .            | 11 |
| 4  | Port description for the DRAM . . . . .                  | 12 |
| 5  | Port description for the IRAM . . . . .                  | 12 |
| 6  | Inputs and outputs of the Control Unit . . . . .         | 13 |
| 7  | Bit assignment for the write enable signal . . . . .     | 13 |
| 8  | Bit assignment for the Reset enable signal . . . . .     | 13 |
| 9  | Bit assignment for the Increment enable signal . . . . . | 14 |
| 10 | Port description for the ALU . . . . .                   | 14 |
| 11 | ALU operation signal . . . . .                           | 14 |
| 12 | Port description for the Register . . . . .              | 15 |

|    |                                                      |    |
|----|------------------------------------------------------|----|
| 13 | Port description for the ALU . . . . .               | 16 |
| 14 | Port description for the Data Register . . . . .     | 16 |
| 15 | Port description for the AR . . . . .                | 17 |
| 16 | Bit assignment for the Read enable signal . . . . .  | 18 |
| 17 | Port description for the Memory Controller . . . . . | 19 |
| 18 | Control Signal for memory access . . . . .           | 19 |
| 19 | Port description for the Clock Divider . . . . .     | 20 |
| 20 | Port description for the CoreID . . . . .            | 20 |
| 21 | DRAM (Single Core) . . . . .                         | 21 |
| 22 | DRAM (Multi Core) . . . . .                          | 23 |
| 23 | Performance Analysis . . . . .                       | 31 |

## 1 Abstract

With the breakthrough of embedded systems in todays world there are numerous number of custom single purpose processors in use. Benefits of such processors are high performance, small in size and low power consumption. In this report we demonstrate the design architecture followed and a performance evaluation of the custom processor we built using a FPGA to do matrix multiplication.

The processor we designed in general consists of 4 cores, a common instruction memory and data memory and a memory controller which controls the memory access from each core. Our ISA consists of 28 instructions and the defined memory consists of 16 bit words. Each core consists of 12 Registers,a Control Unit and an Arithmetic and Logic Unit.

And finally we provide a comprehensive performance evaluation of the speed of the processor when the number of cores increases from 1 to 4 for different matrices.

## 2 Introduction

This report describes the step-by-step procedure used to design a processor to perform matrix multiplication. And also, in this report we include the test codes that are used, and the simulation results and the physical hardware implementations.

### 2.1 Processor Design

The main requirement of this assignment is to design a processor and a CPU (Central Processing Unit) with multiple cores which can multiply two given matrices. We did this task using Verilog Hardware Description Language (HDL) and the implementations are done using the Quartus Prime Lite edition 19.1 and the Development board with Cyclone IV FPGA and the simulations are done using ModelSim-Altera. We implemented Quad Core Processor to do this task.

### 2.2 Central Processing Unit (CPU)

The main part of any digital computer system is the Central Processing Unit (CPU) which is the physical heart of the entire computer system. The electronic circuitry in the computer system which performs basic arithmetic functions such as addition, subtraction, multiplication, and division, and logical operations such as the selection of desired problem-solving procedure based on predefined decision criteria and the comparing data, and the input/output(I/O) operations according to the given instructions, is called the Central Processing Unit (CPU)[2].



Figure 1: Basic Diagram of the CPU[1]

The Central processing unit (CPU) is generally composed of the main memory, control unit, and arithmetic-logic unit. The control unit of the central processing unit regulates and integrates the operations of the computer. It selects and retrieves instructions from the main memory in proper sequence.

### 2.3 Problem Statement

We have given on this project to multiply two matrices with using multiple cores. Therefore, we specially focused to implement this using multiple cores and increase the speed of the task. There are no limits for the size of the input matrices. But we defined some constraints for the size of matrices as following to avoid overflowing the DRAM which is designed with a memory size of 512, 16 bits words.

$$m \times n + n \times p + m \times p < 506$$

m: no of rows in input matrix A

n: no of columns in input matrix A

p: no of columns in input matrix B

We multiplied the two input matrices using a python code for the verification of the designed processor.

### 2.4 Proposed Solution based on FPGA

The main requirement was to design a processor to multiply two matrices. Therefore, two input matrices should be stored in the memory unit of the processor at the starting. And also, after the multiplication of two matrices, result matrix needs to be sent back to the computer and display it. The task flow of the project is mainly as follows.



Figure 2: Task flow

First, we generated the input file in the memory initialization file (MIF) format with the addresses and 16 bits words. Then we imported the input file using the In-System memory content editor and stored the data in the memory unit, so that data can be accessed when required.

Then the FPGA programmed using the Verilog HDL code with the IRAM and the DRAM. After initializing the start, designed processor multiplied the two input matrices and store results matrix in the DRAM.

Finally, the output file exported to a directory in the local machine using the In-System memory content editor and displayed it. After deciding the task flow of the project, the following key points are the main requirements of the designing process.

1. Generate MIF files to import data from the computer and export data to the computer.
2. To store the input matrices and the result matrix in the CPU, the designing memory unit needs to be large.
3. Design ISA for the processor which is capable of multiplying two matrices.

### 3 Designing a FPGA based custom Processor

#### 3.1 What is FPGA?

The Field Programmable Gate Array, which is commonly known as FPGA is an integrated circuit that consists of user programmable interconnections of internal and configurable hardware logic blocks(CLBs) to customize operations for a specific application. FPGAs can be reprogrammed to perform custom applications or functionality requirements after manufacturing. This feature differentiates the FPGAs from Application Specific Integrated Circuits (ASICs), which are custom manufactured to perform pre defined specific tasks. [3] For this project we have used Cyclone DE2-115 development and educational board[4] which contains a chip manufactured by Altera with other peripheral devices. We used Intel Altera Quartus Prime[5], Model Sim[6] software and Verilog as the Hardware Descriptive Language (HDL).

#### 3.2 Instruction Set Architecture (ISA)

The Instruction Set Architecture (ISA) is the part of the processor that is visible to the programmer or compiler writer which is more or less machine language. The ISA serves as the boundary between software and hardware. The instruction set provides commands to the processor, on what needs to be done. The instruction set consists of addressing modes, instructions, native data types, registers, memory architecture, interrupt, and exception handling, and external I/O.

### 3.2.1 Instructions Set

| Instruction | Opcode | State    | Address | Operation                            |
|-------------|--------|----------|---------|--------------------------------------|
| FETCH       |        | FETCH1   | 0       | READ                                 |
|             |        | FETCH2   | 1       | DR <= IRAM, PC <= PC + 1             |
|             |        | FETCH3   | 2       | IR <= DR, READ                       |
|             |        | FETCH4   | 3       | IDLE                                 |
| LDAC x      | 1      | LDAC1    | 5       | DR <= IRAM, PC <= PC + 1             |
|             |        | LDAC2    | 6       | AR <= DR, READ                       |
|             |        | LDAC3    | 7       | DR <= DRAM                           |
|             |        | LDAC4    | 8       | AC <= DR                             |
| LOADACR     | 2      | LOADAC1  | 9       | AR <= R, READ                        |
|             |        | LOADAC2  | 10      | DR <= DRAM                           |
|             |        | LOADAC3  | 11      | AC <= DR                             |
| STAC        | 3      | STAC1    | 12      | AR <= R, DR <= AC, WRITE             |
|             |        | STAC2    | 13      | DRAM[R] <= DR                        |
| MVACR       | 4      | MVACR1   | 14      | R <= AC                              |
| MVACA       | 5      | MVACA1   | 15      | A <= AC                              |
| MVACSUM     | 6      | MVACSUM1 | 16      | SUM <= AC                            |
| MVACI       | 7      | MVACI1   | 17      | I <= AC                              |
| MVSUMAC     | 8      | MVSUMAC1 | 18      | AC <= SUM                            |
| MVIR        | 9      | MVIR1    | 19      | R <= I                               |
| MVKR        | 10     | MVKR1    | 20      | R <= K                               |
| MVJR        | 11     | MVJR1    | 21      | R <= J                               |
| MVIDR       | 12     | MVIDR1   | 22      | R <= ID                              |
| MULR        | 13     | MULR1    | 23      | AC <= AC*R                           |
| MULA        | 14     | MULA1    | 24      | AC <= AC*A                           |
| ADDR        | 15     | ADDR1    | 25      | AC <= AC+R                           |
| ADDSUM      | 16     | ADDSUM1  | 26      | AC <= AC + SUM                       |
| COMP        | 17     | COMP1    | 27      | AC <= AC - R, if AC < 0 z=1 else z=0 |
| INCK        | 18     | INCK1    | 28      | K <= K+1                             |
| INCJ        | 19     | INCJ1    | 29      | J <= J+1                             |
| INCR        | 20     | INCR1    | 30      | R <= R+1                             |
| JMPZ a      | 21     | JMPZY1   | 31      | DR <= IRAM                           |
|             |        | JMPZY2   | 32      | PC <= DR                             |
|             |        | JMPZN1   | 33      | PC <= PC + 1                         |
| JUMP a      | 22     | JUMP1    | 34      | DR <= IRAM                           |
|             |        | JUMP2    | 35      | PC <= DR                             |
|             |        | RSTALL   | 36      | ID <= Core ID                        |
| RSTI        | 24     | RSTI1    | 37      | I <= ID                              |
| RSTJ        | 25     | RSTJ1    | 38      | J <= 0                               |
| RSTK        | 26     | RSTK1    | 39      | K <= 0                               |
| RSTSUM      | 27     | RSTSUM1  | 40      | SUM <= 0                             |
| NOP         | 28     | NOP1     | 41      | NO OPERATION                         |

Table 1: Instruction Set (ISA)

### 3.2.2 State Diagram

The operation of this processor is based on a behavioral model called finite-state machine, which consist of a finite number of states that executes one task at any given time. In order to achieve the matrix multiplication, the machine repeatedly follows a pre-defined, finite set of sequence flows. The following diagram represents the state machine of our custom made processor.



Figure 3: State Diagram

### 3.2.3 Processor Architecture

#### 3.2.3.1 Processor RTL Netlist View



Figure 4: RTL Netlist View

### 3.2.3.2 Single Core Data Path



Figure 5: Single Core

### 3.2.3.3 Multi Core Data Path



Figure 6: Multi Core

### 3.3 Modules and Components

#### 3.3.1 Top level module



Figure 7: Top Level Module

This is the module that contains instances of all the lower-level modules that are required to implement the matrix multiplication. This module contains instances of clock\_divider, core, end\_core, memory\_controller, IRAM and DRAM modules.

Inputs and outputs of this module are listed below.

| Port          | Description                                                  |
|---------------|--------------------------------------------------------------|
| Clk           | Input signal that contains 50 MHz clock signal from the FPGA |
| Ena           | Input signal to indicate the start of the process            |
| End_process_1 | Output signal to indicate the end of the process of core 1   |
| End_process_2 | Output signal to indicate the end of the process of core 2   |
| End_process_3 | Output signal to indicate the end of the process of core 3   |
| End_process_4 | Output signal to indicate the end of the process of core 4   |

Table 2: Inputs and outputs of top level module

#### 3.3.2 Processor



Figure 8: Processor

This module is used to instantiate multiple cores in the top module. This can be considered as the base processing unit of the project. This module contains instances for units listed below.

- Control Unit – State machine for the processor, generates control signals.
- Bus – Transport data between units
- ALU – Arithmetic and Logic Unit
- PC – Program counter
- AC – Accumulator
- AR – Address Register
- DR, IR – Data Register, Instruction Register
- ID, I, J, K, A, SUM – Special Purpose Registers
- R – General Purpose Register

Inputs and outputs of this module are listed below.

| Port        | Description                                                           |
|-------------|-----------------------------------------------------------------------|
| clk         | Input signal that contains scaled clock signal from the clock divider |
| status      | Input signal to indicate the start of the process                     |
| iram_out    | Input signal with the output data of IRAM                             |
| dram_out    | Input signal with the output data of DRAM                             |
| core_id     | Input signal which contains the core id of the respective core        |
| end_core    | Input signal to terminate the processor                               |
| dram_wren   | Output signal to DRAM write enable signal                             |
| iram_wren   | Output signal to IRAM write enable signal                             |
| memory_mode | Output signal with the control signal for memory controller module    |
| pc_out      | Output signal with the address for the IRAM                           |
| ar_out      | Output signal with the address for the DRAM                           |
| bus_out     | Output signal with the data for the DRAM                              |
| end_process | Output signal which shows the end of the process                      |

Table 3: Inputs and outputs of the Processor

### 3.3.3 Data Memory (DRAM)



Figure 9: DRAM

This module is used to store data for the processor. Two matrices and other required data is stored in this. After the matrix multiplication final results are written in to this memory. A pre-built

module, RAM: 1-PORT module was used to design this module. Data memory was designed with a memory size of 512, where each word is 16 bits wide.

| Port    | Description                                                  |
|---------|--------------------------------------------------------------|
| clock   | Input signal that contains 50 MHz clock signal from the FPGA |
| wren    | Input signal with the write enable signal                    |
| address | Input signal with the address that should be read            |
| q       | Output signal with the instruction                           |

Table 4: Port description for the DRAM

### 3.3.4 Instruction Memory (IRAM)



Figure 10: IRAM

This module is used to store instructions for the processor. A pre-built module, RAM: 1-PORT module was used to design this module. Instruction memory was designed with a memory size of 256, where each word is 16 bits wide.

| Port    | Description                                                  |
|---------|--------------------------------------------------------------|
| clock   | Input signal that contains 50 MHz clock signal from the FPGA |
| address | Input signal with the address that should be read            |
| q       | Output signal with the instruction                           |

Table 5: Port description for the IRAM

### 3.3.5 Control Unit



Figure 11: Control Unit

This module manages the operations in the processor. The state machine of the processor is implemented in this module. Control unit takes instructions from IRAM and generates control signals to manage other modules in the processor. Input signals and output signals of this module is given in the following table.

| Port        | Description                                                                 |
|-------------|-----------------------------------------------------------------------------|
| clock       | Input signal that contains scaled clock signal from the clock divider       |
| instruction | Input signal with the instruction from the IRAM                             |
| z           | Input signal with the flag from ALU                                         |
| end_core    | Input signal to terminate the processor                                     |
| status      | Input signal to indicate the start of the process                           |
| alu_op      | Output signal with the control signal for ALU                               |
| write_en    | Output signal with the control signal for write enable signal for registers |
| read_en     | Output signal with the control signal for read enable signal for BUS        |
| inc_en      | Output signal with the control signal for increment signal for registers    |
| rst_en      | Output signal with the control signal for reset signal for registers        |
| memory_mode | Output signal with the control signal for memory controller module          |
| end_process | Output signal which shows the end of the process                            |

Table 6: Inputs and outputs of the Control Unit

### 3.3.5.1 Write Enable

This is a 16 bit wide control signal which is used to control input data of the registers and other components connected to the BUS. Each bit is related to 16 different inputs of registers. If bit value in the write\_en[ ] signal equals to 1, the register will take the data from the corresponding input.

| Bit position | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5   | 4  | 3    | 2 | 1                 | 0                   |
|--------------|----|----|----|----|----|----|---|---|---|---|-----|----|------|---|-------------------|---------------------|
| Input        | PC | DR | AR | IR | R  | *  | I | * | * | A | SUM | AC | DRAM | * | AR<br>(from<br>R) | AC<br>(from<br>ALU) |

Table 7: Bit assignment for the write enable signal

\* : bit is not used as a control signal

If there is no specific source mentioned, all the input data comes from BUS.

### 3.3.5.2 Reset Enable

This is a 16 bit wide control signal which is used to reset the value of the registers. If rst\_en[i] equals to 1,  $i^{th}$  register value will be reset to 0 (in ID register the value will be reset to core id).

| Bit position | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5   | 4  | 3 | 2 | 1 | 0 |
|--------------|----|----|----|----|----|----|---|---|---|---|-----|----|---|---|---|---|
| Register     | PC | *  | AR | IR | R  | ID | I | J | K | A | SUM | AC | * | * | * | * |

Table 8: Bit assignment for the Reset enable signal

\* : bit is not used as a control signal

### 3.3.5.3 Increment Enable

This is a 5 bit wide control signal which is used to increment the value of 5 registers by 1. If  $\text{inc\_en}[i]$  equals to 1,  $i^{th}$  register value will be incremented by 1.

| Bit position | 4  | 3  | 2 | 1 | 0 |
|--------------|----|----|---|---|---|
| Register     | PC | AR | R | J | K |

Table 9: Bit assignment for the Increment enable signal

### 3.3.6 Arithmetic And Logic Unit (ALU)



Figure 12: ALU

This module is used to manage all the arithmetic and logic operations in the processor. Input signal and output signals of the module are given below.

| Port    | Description                                                           |
|---------|-----------------------------------------------------------------------|
| clock   | Input signal that contains scaled clock signal from the clock divider |
| in1     | Input signal with data from AC                                        |
| in2     | Input signal with data from the Bus                                   |
| alu_op  | Input signal with the control signal for ALU                          |
| alu_out | Output signal which contains data after the operation                 |
| z       | Output signal which used as a flag after COMP operation               |

Table 10: Port description for the ALU

#### alu\_op:

This is a 2 bit wide control signal which is used to select the operations in ALU.

| alu_op signal | Operation      | Description                                                                  |
|---------------|----------------|------------------------------------------------------------------------------|
| 01            | Addition       | $\text{alu\_out} \leq \text{in1} + \text{in2}$                               |
| 10            | Comparison     | If $(\text{in1} < \text{in2})$ : $\text{z} \leq 1$ , else: $\text{z} \leq 0$ |
| 11            | Multiplication | $\text{alu\_out} \leq \text{in1} * \text{in2}$                               |

Table 11: ALU operation signal

### 3.3.7 Registers



Figure 13: Register

This module is used to implement 16-bit registers in the processor. Seven special purpose registers and general-purpose register R were implemented using this module.

| Port     | Description                                                           |
|----------|-----------------------------------------------------------------------|
| clock    | Input signal that contains scaled clock signal from the clock divider |
| write_en | Input signal with control signal to enable input data from Bus        |
| rst_en   | Input signal with control signal to reset the register value to 0     |
| inc_en   | Input signal with control signal to increment the register value by 1 |
| data_in  | Input signal with the data from Bus                                   |
| data_out | Output signal of the register                                         |

Table 12: Port description for the Register

### 3.3.8 Accumulator



Figure 14: Accumulator

This module is used as a special purpose 16 bit register which stores data from ALU after its operations. In addition to the data input from Bus, output of the ALU is directly connected to the AC.

| Port      | Description                                                           |
|-----------|-----------------------------------------------------------------------|
| clock     | Input signal that contains scaled clock signal from the clock divider |
| write_en  | Input signal with control signal to select input data from Bus        |
| aluout_en | Input signal with control signal to select input data from ALU output |
| rst_en    | Input signal with control signal to reset the value of AC to 0        |
| data_in   | Input signal with the data from Bus                                   |
| alu_out   | Input signal with the data from ALU output                            |
| data_out  | Output signal of the AC this is connected to both ALU and Bus         |

Table 13: Port description for the ALU

### 3.3.8.1 Data Registers



Figure 15: Data Register

This module is used as a special purpose 16-bit register, which stores the data that has been read from the DRAM or that has to be written to the DRAM.

| Port     | Description                                                           |
|----------|-----------------------------------------------------------------------|
| clock    | Input signal that contains scaled clock signal from the clock divider |
| write_en | Input signal with control signal to enable input data from Bus        |
| data_in  | Input signal with the data from Bus                                   |
| data_out | Output signal of the DR                                               |

Table 14: Port description for the Data Register

### 3.3.8.2 Address Registers



Figure 16: Address Registers

This module is used as a special purpose 16-bit register, which stores the 16-bit address of the DRAM block that is being written or read. This register gets its input data from 2 sources: bus

and R register.

| Port      | Description                                                           |
|-----------|-----------------------------------------------------------------------|
| clock     | Input signal that contains scaled clock signal from the clock divider |
| write_en1 | Input signal with control signal to select input data from Bus        |
| write_en2 | Input signal with control signal to select input data from register R |
| rst_en    | Input signal with control signal to reset the value of AR to 0        |
| inc_en    | Input signal with control signal to increment the value of AR by 1    |
| data_in1  | Input signal with the data from Bus                                   |
| data_in2  | Input signal with the data from register R                            |
| data_out  | Output signal of the AR                                               |

Table 15: Port description for the AR

### 3.3.9 Bus



Figure 17: Bus

This module is used to transport data between registers and memory. This module will fetch data from one of the inputs according to the control signal “read\_en”, and keeps it until one of the inputs to the bus(register out values) changes.

#### 3.3.9.1 Read Enable

This is a 4 bit wide control signal which is used to control the input data of the Bus. Outputs of 9 registers, data output from DRAM and data output from IRAM is connected to the Bus.

| read_en | Data fetched from |
|---------|-------------------|
| 0001    | DR                |
| 0010    | R                 |
| 0011    | ID                |
| 0100    | I                 |
| 0101    | J                 |
| 0110    | K                 |
| 0111    | A                 |
| 1000    | SUM               |
| 1001    | AC                |
| 1010    | DRAM              |
| 1011    | IRAM              |

Table 16: Bit assignment for the Read enable signal

### 3.3.10 Memory Controller



Figure 18: Memory Controller

This module was used to manage the data memory access for the multicore processor. As the design consists of 4 parallel working cores, and memory only has one port for read and write operations a new component was introduced to control the memory access process.

| Port         | Description                                                           |
|--------------|-----------------------------------------------------------------------|
| clock        | Input signal that contains scaled clock signal from the clock divider |
| ar_out1      | Input signal with the memory address from core 1                      |
| ar_out2      | Input signal with the memory address from core 2                      |
| ar_out3      | Input signal with the memory address from core 3                      |
| ar_out4      | Input signal with the memory address from core 4                      |
| bus_out1     | Input signal with the data from core 1 to be written in memory        |
| bus_out2     | Input signal with the data from core 2 to be written in memory        |
| bus_out3     | Input signal with the data from core 3 to be written in memory        |
| bus_out4     | Input signal with the data from core 4 to be written in memory        |
| dram_out     | Input signal with the data read from memory                           |
| mode         | Input signal with the data from core 1 to control the memory access   |
| end_process2 | Input signal that indicates end of the process of core 2              |
| end_process3 | Input signal that indicates end of the process of core 3              |
| end_process4 | Input signal that indicates end of the process of core 4              |
| ar_out       | Output signal with the memory address to DRAM                         |
| bus_out      | Output signal with the data to be written in DRAM                     |
| dram_out1    | Output signal with the data from DRAM to core 1                       |
| dram_out2    | Output signal with the data from DRAM to core 2                       |
| dram_out3    | Output signal with the data from DRAM to core 3                       |
| dram_out4    | Output signal with the data from DRAM to core 4                       |

Table 17: Port description for the Memory Controller

This module takes the 4-bit control signal “mode” from the first core and changes memory access settings accordingly.

| Mode | Operation                                    |
|------|----------------------------------------------|
| 0000 | All cores read from the same memory location |
| 0001 | Data read from memory for core 1             |
| 0010 | Data read from memory for core 2             |
| 0011 | Data read from memory for core 3             |
| 0100 | Data read from memory for core 4             |
| 0101 | Memory write for core 1                      |
| 0110 | Memory write for core 2                      |
| 0111 | Memory write for core 3                      |
| 1000 | Memory write for core 4                      |

Table 18: Control Signal for memory access

### 3.3.11 Clock Divider



Figure 19: Register:Clock Divider

This module was used to slow down the clock speed. Since there were some malfunctions with higher clock speeds this was used to get a 10 times slower clock signal from the original 50 MHz clock.

| Port   | Description                                                            |
|--------|------------------------------------------------------------------------|
| clk_in | Input signal that contains 50 MHz clock signal from the FPGA           |
| ena    | Input signal which indicates the start of the operation                |
| clk    | Output signal that contains scaled clock signal from the clock divider |

Table 19: Port description for the Clock Divider

### 3.3.12 CoreID



Figure 20: Register:CoreID

This module is used as a special purpose 16-bit register, which stores the ID of the corresponding core. Core ID is given to this register as an input signal.

| Port     | Description                                                                 |
|----------|-----------------------------------------------------------------------------|
| clock    | Input signal that contains scaled clock signal from the clock divider       |
| rst_en   | Input signal with control signal to reset the register value to the core ID |
| data_in  | Input signal with the core ID                                               |
| data_out | Output signal of the register                                               |

Table 20: Port description for the CoreID

### 3.3.13 End Core



Figure 21: Register:End Core

This module is used to terminate the processing of cores. It compares “PC” values of all the other cores with PC value of core 1 to determine whether or not to end the processes of those cores.

## 4 Algorithms and Design Considerations

The task given was to design a multicore processor to multiply two matrices. As the first step for this task a matrix multiplication algorithm was designed only considering a single core.

### 4.1 Matrix Multiplication Algorithm

#### 4.1.0.1 Single Core Implementation

In addition to the two matrices, their sizes and starting memory addresses of each matrices are stored in the memory to be taken as inputs.

| Address | 0 | 1 | 2 | 3  | 4  | 5  | ap-bp    | bp-cp    | cp.. |
|---------|---|---|---|----|----|----|----------|----------|------|
| Data    | m | n | p | ap | bp | cp | Matrix A | Matrix B | 0    |

Table 21: DRAM (Single Core)

Matrix multiplication algorithm consists of 3 main loops. First loop iterates through number of rows in the final matrix. Second loop iterates through the elements in each row. Third loop is used to calculate one element in the matrix.

Consider matrix multiplication:

$$A(m \times n) \times B(n \times p) = C(m \times p)$$

m = number of rows in matrix A

n = number of columns in matrix A = number of rows in matrix B

p = number of columns in matrix B

ap = address of the first element of matrix A ( = 7 )

bp = address of the first element of matrix B ( = ap + m\*n )

cp = address of the first element of matrix C ( = bp + n\*p )

i = current row of the matrix A

j = current column of the matrix A (current row of the matrix B)

k = current column of the matrix B



Figure 22: Matrix Multiplication



Figure 23: The Single Core Algorithm

#### 4.1.0.2 Multi Core Implementation

The processor was designed with 4 cores and user has the ability to choose number of cores that should be used in the process. In order to achieve that, “noc” (Number of cores) parameter was added into the input data of memory.

|                |     |   |   |   |    |    |    |          |          |      |
|----------------|-----|---|---|---|----|----|----|----------|----------|------|
| <b>Address</b> | 0   | 1 | 2 | 3 | 4  | 5  | 6  | ap-bp    | bp-cp    | cp.. |
| <b>Data</b>    | noc | m | n | p | ap | bp | cp | Matrix A | Matrix B | 0    |

Table 22: DRAM (Multi Core)

All the cores are given a core ID (0,1,2,3) at the beginning of the process. A control unit of each core decides whether to continue its operations or not by comparing its core ID with the “noc”. In the multicore operation each active core will parallelly calculate the elements of a row in the resulting matrix.

A single core is given the task of calculating elements in several rows in each iterations. The row number can be calculated as:

$$\text{Row number} = \text{core\_id} * \text{iteration} + \text{noc}$$

If the row number is larger than the total number of rows, the relevant core will end its operations.

#### Example:

Consider the multiplication of two 5x5 matrices A and B. It will generate a new 5x5 matrix C. If number of cores equals to 3 the processor will use 3 cores to carry out the operations.

After comparing core ID and noc 4<sup>th</sup> core will end its operations. Then in the first iteration the first 3 cores will compute the values of elements in the first 3 rows of matrix C. Then in the second iteration 3<sup>rd</sup> core will end its operations, and first and second cores will calculate the elements in the remaining 2 rows. In the 3<sup>rd</sup> iteration 1<sup>st</sup> and 2<sup>nd</sup> cores will end its operations.



Figure 24: The Multi Core Algorithm

## 4.2 Assembly Code

Detailed assembly code which was used to implement multi core algorithm was given below.

1. // A\*B = C matrix multiplication
2. RSTALL // ID <- Core ID (0,1,2,3,4) // I,j,sum <-0
3. LDAC noc //AC <- M[noc] = No of cores
4. MVIDR //R <- ID
5. INCR // R<-R+1 (Core\_ID+1)
6. COMP // if AC < R, z =1, else z=0 (comparing core ID with number of cores)
7. JMPZ mend
8. RSTI // I<-ID
9. RSTJ //J <- 0
10. **loop:**
11. MVIR //R <- I
12. LDAC n // no.of rows in A = no.of rows in C
13. INCR // I <- I + 1
14. COMP // (comparing current row with total number of rows)
15. JMPZ mend // JUMP if z=1
16. RSTK //K <-> 0
17. **loop1:**
18. MVKR //R <- K
19. LDAC p //no.of columns in B = no.of columns in C
20. INCR
21. COMP
22. JMPZ mend1 // JUMP if z=1
23. RSTSUM //SUM <- 0
24. JUMP loop2
25. **mstore:**
26. MVIR //R <- I
27. LDAC p //
28. MULR // AC <- AC\*R
29. MVKR // R <-K
30. ADDR // AC <-> AC+R
31. MVACR //R <-AC
32. LDAC cp //AC <-M[cp] (addr of first element of C matrix)
33. ADDR // AC <- AC+R
34. MVACR // R <- AC
35. MVSUMAC // AC<-SUM
36. STAC // M[R] <- AC
37. INCK // K<-K+1
38. RSTJ

```
39.      JUMP loop1
40.      loop2:
41.      MVJR    //R<-J
42.      LDAC m   //no.of columns in A = no.of rows in B
43.      INCR
44.      COMP
45.      JMPZ mstore
46.      MVIR
47.      LDAC m
48.      MULR    //AC<-AC*R(i*m)
49.      MVJR    //R<-J
50.      ADDR    //AC<-AC+R
51.      MVACR    //R<-AC
52.      LDAC ap  //AC <- M[ap] (addr of first element of A matrix)
53.      ADDR    //AC<-AC+R (AC+AP)
54.      MVACR    //R <-AC
55.      LDAC R   //AC <- M[R]
56.      MVACA    //A<-AC
57.      MVJR    //R<-J
58.      LDAC p   //AC<-M[p]
59.      MULR    //AC<-AC*R
60.      MVKR    //R<-K
61.      ADDR    //AC<-AC+R
62.      MVACR    //R<-AC
63.      LDAC bp  //AC<-M[bp]
64.      ADDR    //AC<-AC+R (AC+BP)
65.      MVACR
66.      LDACR   //AC <- M[R]
67.      MULA    //AC<-AC*A
68.      ADDSUM   //AC<-AC+SUM
69.      MVACSUM  //SUM <- AC
70.      INCJ    //J<-J+1
71.      JUMP loop2
72.      mend1:
73.      MVIR    //R<-I
74.      LDAC noc
75.      ADDR
76.      MVACI
77.      JUMP loop
78. mend:
79. ENDOP
```

## 5 Design Optimization

### 5.1 Minimizing the number of instructions in ISA

In the initial architecture for the processor 41 instructions were defined for the ISA. In our new optimized ISA these instructions were reduced to 28 instructions. This reduction was achieved by doing the following changes to the processor design.

1. There were dedicated registers to store the dimensions of the matrices ( $n,m,p$ ) and the starting point memory addresses of each matrix( $ap, cp, bp$ ). Therefore when loaded from memory they were transferred from AC register to the dedicated registers using separate instructions. These registers were eliminated from our optimized design and whenever the above values are required it is loaded from the data memory and is temporarily stored in the general purpose register R. Therefore all these operations were replaced using a single instruction (MVACR) and multiple memory accesses.
2. In the initial design our memory was a 8 bit memory and hence the resulting matrix after multiplication was stored in two memory blocks therefore the register S was used to store the number of memory blocks needed for each element of the matrix. As our new design has a 16 bit memory this register and instructions relating to it was omitted.
3. We had dedicated instructions to do subtractions with changing the 'z' flag for the matrix dimensions. In our new design these are replaced by a single COMP instruction which compares the values of registers AC and R and flags 'z' accordingly.
4. The instructions to perform multiplication and addition was also optimized in our new design leaving only the desired instructions.

## 6 Problems Faced & Solutions

- Hardware debugging
  - Debugging using FPGA board was one of the major challenges we faced during the design phase. Model Sim software provides the ability to check any internal signal or register value during simulations. However, in the hardware implementation we had to use inbuilt indicators of FPGA board to check internal signals and register values of the processor. Some of those indicators are,
    - \* Seven Segment Displays : to indicate register values, check memory output, check bus output
    - \* LEDs : check control signals like “write\_enable” and “read\_enable”
    - \* Push buttons: to generate a manual clock
    - \* Switches: give custom inputs



Figure 25: FPGA Development Board

- Glitches due to the high clock speed
  - Some of the modules like ALU was first tested separately prior adding to the processor. These modules worked as intended with a manually generated clock signal. However, whenever the 50 MHz clock signal was used these modules did not generate the expected outcomes. In order to overcome this we used a module named “clock\_divider” to scale down the clock speed by a factor of 10.
- Long compilation times
  - Tested all the modules separately before combining them.
- Difficulty in developing algorithms and testing using FPGA development board when group members are in different places due to Covid'19 pandemic situation.
  - We used online video conference meetings for communication and programmed modules. Also, we used ModelSim simulation tool to simulate the design without the FPGA board.

## 7 Assembler and Simulator

### 7.1 Assembler

The assembler is used to convert the assembly language code to machine code. In our project, the assembler written in python reads a source code file (.s format) which has the algorithm in assembly language code for the project. The assembler python code has the ISA and the relevant opcodes saved as a dictionary and maps each line in the assembly language code to the proper binary code and stores these in the required arrangement in a Memory Initialization File (.mif format). This MIF file has the content for the Instruction Memory of the processor.

```
WIDTH=16;
DEPTH=256;

ADDRESS_RADIX=HEX;
DATA_RADIX=BIN;

CONTENT BEGIN
    000 : 0000000000010111;
    001 : 0000000000000001;
    002 : 0000000000000000;
    003 : 0000000000001100;
    004 : 0000000000010100;
    005 : 0000000000010001;
    006 : 0000000000010101;
    007 : 0000000001011010;
    008 : 0000000000011000;
    009 : 0000000000011001;
    00A : 0000000000001001;
    00B : 0000000000000001;
    00C : 0000000000000001;
    00D : 0000000000010100;
    00E : 0000000000010001;
    00F : 0000000000010101;
    010 : 0000000001011010;
    011 : 0000000000011010;
    012 : 0000000000001010;
    013 : 0000000000000001;
    014 : 0000000000000011;
    015 : 0000000000010100;
    016 : 0000000000010001;
    017 : 0000000000010101;
    018 : 0000000001010011;
    019 : 0000000000011011;
    01A : 0000000000010110;
    01B : 00000000000101101;
    01C : 0000000000001001;
```

Figure 26: Generated MIF file by the Assembler

## 7.2 Simulation

### 7.2.1 Python Simulation

The Python simulation was done at the beginning of the project in order to verify if the algorithm we developed using the defined ISA works properly and gives out the expected results. Here the DRAM was given as list where each element corresponds to a data memory block. The singlecore() function does the exact same process of a single core in our processor design and for a multicore architecture the function is instantiated based on the number of cores in our design. In the defined singlecore() function each line or a couple of lines (when a condition is checked) is written according to the order of the instructions stored in IRAM.

### 7.2.2 Model Sim Simulation

The Modelsim simulation was done through a test bench. Test benches are also used to verify FPGA designs. In test benches the behaviour of the hardware is depicted through waveforms. In our project we have written the test bench for the top module code. As shown in A.19 in the Appendix section, the inputs to the top module are stated as registers and an initial value is hardcoded to start the processor. A clock of period 20ns (50MHz) is defined here which is initially assigned to the value zero. The Modelsim simulator will by default show the waveforms of the inputs of the top module and the relevant outputs.



Figure 27: ModelSim Simulation

## 8 Performance Evaluation

After completing the multi core processor for matrix multiplication it was tested using different matrices. The performance analysis of the multicore processor was done by comparing the total time consumption.

Four square matrices were used in these tests and total number of clock cycles to complete the matrix multiplication was taken as the time measurement. Each matrix multiplication was done using different number of cores (i.e., 1,2,3,4 cores). Those results are given in the following graph.



Figure 28: Performance Analysis

The time consumption for same two matrix multiplication for different number of cores depends on the number of iterations. Therefore, time consumption will decrease with the number of cores.

**Example:** Considering multiplication of two 6x6 matrices,

| Number of cores | Number of iterations |
|-----------------|----------------------|
| 1               | 6                    |
| 2               | 3                    |
| 3               | 2                    |
| 4               | 2                    |

Table 23: Performance Analysis

As a single core has the highest number of iterations, it has the highest time consumption (59775 clock cycles). Number of iterations has its lowest value when number of cores equal to 3 and 4. Therefore, it has the lowest time consumption (19991 clock cycles).

## References

- [1] ComputerGeek. Cpufig, 2021. <https://www.computerhunger.com/what-is-the-cpu-and-its-function-components-and-diagram/>.
- [2] The Editors of Encyclopaedia Britannica. Cpu, 2021. <https://www.britannica.com/technology/central-processing-unit>.
- [3] Xilinx. Fpga, 2021. <https://www.xilinx.com/products/silicon-devices/fpga/what-is-an-fpga.html>.
- [4] Terasic. De2-115, 2003-2013. <https://www.intel.com/content/dam/www/programmable/us/en/portal/dsn/42/doc-us-dsnbk-42-1404062209-de2-115-user-manual.pdf>.
- [5] Intel. Quartus prime lite edition, 2020. <https://fpgasoftware.intel.com/?edition=lite>.
- [6] Intel. Modelsim, 2020. <https://www.intel.com/content/www/us/en/software/programmable/quartus-prime/model-sim.html>.

## A Appendix

### A.1 Assembly Code

```
1 RSTALL
2 LDAC
3 16'D0
4 MVIDR
5 INCR
6 COMP
7 JMPZ
8 16'D90
9 RSTI
10 RSTJ
11 MVIR
12 LDAC
13 16'D1
14 INCR
15 COMP
16 JMPZ
17 16'D90
18 RSTK
19 MVKR
20 LDAC
21 16'D3
22 INCR
23 COMP
24 JMPZ
25 16'D83
26 RSTSUM
27 JUMP
28 16'D45
29 MVIR
30 LDAC
31 16'D3
32 MULR
33 MVKR
34 ADDR
35 MVACR
36 LDAC
37 16'D6
38 ADDR
39 MVACR
40 MVSUMAC
41 STAC
42 INCK
43 RSTJ
44 JUMP
45 16'D18
46 MVJR
47 LDAC
48 16'D2
49 INCR
50 COMP
51 JMPZ
52 16'D28
53 MVIR
```

```
54 LDAC
55 16'D2
56 MULR
57 MVJR
58 ADDR
59 MVACR
60 LDAC
61 16'D4
62 ADDR
63 MVACR
64 LDACR
65 MVACA
66 MVJR
67 LDAC
68 16'D3
69 MULR
70 MVKR
71 ADDR
72 MVACR
73 LDAC
74 16'D5
75 ADDR
76 MVACR
77 LDACR
78 MULA
79 ADDSUM
80 MVACSUM
81 INCJ
82 JUMP
83 16'D45
84 MVIR
85 LDAC
86 16'D0
87 ADDR
88 MVACI
89 JUMP
90 16'D10
91 ENDOP
```

## A.2 Top Level Module

```
1 module top_module(
2     input [1:0] status ,
3     input clk ,
4     input clockm ,
5     input ena ,
6     output z ,
7     output [15:0] write_en ,
8     output [3:0] read_en ,
9     output end_process1 ,
10    output end_process2 ,
11    output end_process3 ,
12    output end_process4 ,
13    output reg[31:0] clk_counter
14 );
15
```

```

16         wire clock;
17         wire [15:0] iram_out;
18         wire [15:0] dram_out; wire [15:0] dram_out1; wire [15:0] dram_out2;
19         wire [15:0] dram_out3; wire [15:0] dram_out4;
20         wire dram_wren;
21         wire iram_wren;
22         wire [15:0] pc_out1; wire [15:0] pc_out2; wire [15:0] pc_out3; wire
23         [15:0] pc_out4;
24
25             wire [1:0] end_core1; wire [1:0] end_core2; wire [1:0] end_core3;
26             wire [1:0] end_core4;
27             wire [3:0] memory_mode;
28
29             wire [15:0] ar_out; wire [15:0] ar_out1; wire [15:0] ar_out2; wire [
30             15:0] ar_out3; wire [15:0] ar_out4;
31             wire [15:0] bus_out; wire [15:0] bus_out1; wire [15:0] bus_out2;
32             wire [15:0] bus_out3; wire [15:0] bus_out4;
33             wire [15:0] ac_out1; wire [15:0] ac_out2; wire [15:0] ac_out3; wire
34             [15:0] ac_out4;
35             wire [15:0] dr_out;
36             wire [15:0] ir_out;
37             wire [15:0] i_out1; wire [15:0] i_out2; wire [15:0] i_out3; wire [15:0]
38             i_out4;
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
//counter for the performance analysis
initial begin
    clk_counter = 32'd0;
end

always @ (posedge clock )
begin
    if(~end_process1)
    begin
        clk_counter = clk_counter + 1;
    end
end

clock_divider clkdvdr(.inclk(clk),.ena(ena),.clk(clock));

core core1(.clock(clock), .iram_out(iram_out),.dram_out(dram_out1),.status(
    status),
.dram_wren(dram_wren),.iram_wren(iram_wren),.pc_out(pc_out1),.ar_out(ar_out1)
),.bus_out(bus_out1),.end_process(end_process1),.ac_out(ac_out1),
.write_en(write_en),.read_en(read_en),.dr_out(dr_out),.ir_out(ir_out),.z(z)
,.core_id(16'd0),.memory_mode(memory_mode),.end_core(end_core1),.i_out(
i_out1));

core core2(.clock(clock), .iram_out(iram_out),.dram_out(dram_out2),.status(
    status),
.dram_wren(),.iram_wren(),.pc_out(pc_out2),.ar_out(ar_out2),.bus_out(
    bus_out2),.end_process(end_process2),.ac_out(ac_out2),
.write_en(),.read_en(),.dr_out(),.ir_out(),.z(),.core_id(16'd1),.end_core(
    end_core2),.i_out(i_out2));

core core3(.clock(clock), .iram_out(iram_out),.dram_out(dram_out3),.status(

```

```

      status),
60 .dram_wren(),.iram_wren(),.pc_out(pc_out3),.ar_out(ar_out3),.bus_out(
    bus_out3),.end_process(end_process3),.ac_out(ac_out3),
61 .write_en(),.read_en(),.dr_out(),.ir_out(),.z(),.core_id(16'd2),.end_core(
    end_core3),.i_out(i_out3);

62
63 core core4(.clock(clock), .iram_out(iram_out),.dram_out(dram_out4),.status(
    status),
64 .dram_wren(),.iram_wren(),.pc_out(pc_out4),.ar_out(ar_out4),.bus_out(
    bus_out4),.end_process(end_process4),.ac_out(ac_out4),
65 .write_en(),.read_en(),.dr_out(),.ir_out(),.z(),.core_id(16'd3),.end_core(
    end_core4),.i_out(i_out4);

66
67
68 end_core end_core(.clock(clock),.pc_out1(pc_out1),.pc_out2(pc_out2),.pc_out3(
    pc_out3),.pc_out4(pc_out4),
69 .end_core1(end_core1),.end_core2(end_core2),.end_core3(end_core3),.end_core4(
    end_core4));

70
71
72 memory_controller memory_controller(.clock(clk),.ar_out1(ar_out1),.ar_out2(
    ar_out2),.ar_out3(ar_out3),.ar_out4(ar_out4),
73 .bus_out1(bus_out1),.bus_out2(bus_out2),.bus_out3(bus_out3),.bus_out4(
    bus_out4),.dram_out(dram_out),.mode(memory_mode),
74 .ar_out(ar_out),.bus_out(bus_out),.dram_out1(dram_out1),.dram_out2(dram_out2),
    ),.dram_out3(dram_out3),.dram_out4(dram_out4),
75 .end_process2(end_process2),.end_process3(end_process3),.end_process4(
    end_process4));

76
77
78
79 IRAM_IP  IRAM(.address(pc_out1),.clock(clk),.data(),.wren(),.q(iram_out));
80
81 DRAM  DRAM(.address(ar_out),.clock(clk),.data(bus_out),.wren(dram_wren),.q(
    dram_out));
82
83
84 endmodule

```

### A.3 The Processor

```

1 module core(
2     input  clock,
3     input [15:0]iram_out ,
4     input [15:0]dram_out ,
5     input [1:0]status ,
6     input [15:0] core_id ,
7     input [1:0] end_core ,
8
9     output dram_wren ,
10    output iram_wren ,
11    output [3:0]memory_mode ,
12    output [15:0]pc_out ,
13    output [15:0]ar_out ,
14    output [15:0]bus_out ,
15    output [15:0]ac_out ,

```

```
16      output [15:0] write_en ,
17      output [3:0]  read_en ,
18      output [15:0] dr_out ,
19      output [15:0] ir_out ,
20      output [15:0] i_out ,
21      output end_process ,
22      output z
23  );
24
25
26      wire[1:0] alu_op;
27      wire[4:0] inc_en;
28      wire[15:0] rst_en;
29      wire[15:0] id_out;
30      wire[15:0] j_out;
31      wire[15:0] k_out;
32      wire[15:0] a_out;
33      wire[15:0] r_out;
34      wire[15:0] sum_out;
35      wire[15:0] alu_out;
36
37
38 //CU//
39 control_unit CU(.clock(clock), .z(z), .instruction(ir_out) ,.alu_op(alu_op), .
40   write_en(write_en) ,.inc_en(inc_en) ,.read_en(read_en),
41   .rst_en(rst_en) ,.end_process(end_process) ,.status(status) ,.memory_mode(
42     memory_mode) ,.end_core(end_core) );
43
44 //BUS//
45 bus BUS( .clock(clock),.read_en(read_en),.id(id_out),.i(i_out),
46   .j(j_out),.k(k_out),.a(a_out),.dr(dr_out),.ac(ac_out),.r(r_out),.sum(sum_out)
47   ),.DRAM(dram_out),.IRAM(iram_out),.busout(bus_out)) ;
48
49 //ALU//
50 alu ALU(.clock(clock),.in1(ac_out),.in2(bus_out),.alu_op(alu_op),.out(
51   alu_out),.z(z));
52
53 //PC//
54 register_16 PC(.clk(clock),.write_en(write_en[15]),.rst_en(rst_en[15]), .
55   inc_en(inc_en[4]),.data_in(bus_out),.data_out(pc_out));
56
57 //DR//
58 DR DR(.clk(clock),.write_en(write_en[14]),.data_in(bus_out),.data_out(dr_out)
59   );
60
61 //AR//
62 AR AR(.clk(clock),.write_en1(write_en[13]),.write_en2(write_en[1]),.rst_en(
63   rst_en[13]),.inc_en(inc_en[3]),
64   .data_in1(bus_out),.data_in2(r_out),.data_out(ar_out));
65
66 //IR//
67 register_16 IR(.clk(clock),.write_en(write_en[12]),.rst_en(rst_en[12]), .
68   inc_en(),.data_in(bus_out),.data_out(ir_out));
69
70 //R//
71 register_16 R(.clk(clock),.write_en(write_en[11]),.rst_en(rst_en[11]), .
```

```

    inc_en(inc_en[2]),.data_in(bus_out),.data_out(r_out));

66 //ID//
67 CoreID ID(.clk(clock),.rst_en(rst_en[10]),.data_in(core_id),.data_out(id_out));
68
69 //I//
70 register_16 I(.clk(clock),.write_en(write_en[9]),.rst_en(rst_en[9]),.inc_en(
71     () ,.data_in(bus_out),.data_out(i_out));
72
73 //J//
74 register_16 J(.clk(clock),.write_en(write_en[8]),.rst_en(rst_en[8]),.inc_en(
75     inc_en[1]),.data_in(),.data_out(j_out));
76
77 //K//
78 register_16 K(.clk(clock),.write_en(write_en[7]),.rst_en(rst_en[7]),.inc_en(
79     inc_en[0]),.data_in(),.data_out(k_out));
80
81 //A//
82 register_16 A(.clk(clock),.write_en(write_en[6]),.rst_en(rst_en[6]),.inc_en(
83     () ,.data_in(bus_out),.data_out(a_out));
84
85 //SUM//
86 register_16 SUM(.clk(clock),.write_en(write_en[5]),.rst_en(rst_en[5]),.
87     inc_en(),.data_in(bus_out),.data_out(sum_out));
88
89 assign dram_wren = write_en[3];
90 assign iram_wren = write_en[2];
91
92 endmodule

```

#### A.4 Control Unit

```

1 module control_unit(input clock,
2 input z,
3 input [1:0]end_core,
4 input [15:0] instruction ,
5 output reg [1:0] alu_op,
6 output reg [15:0] write_en ,
7 output reg [4:0] inc_en,
8 output reg [3:0] read_en,
9 output reg [15:0] rst_en,
10 output reg end_process ,
11 output reg[3:0] memory_mode ,
12 input [1:0] status );
13
14
15 reg [6:0] present = 7'd67;
16 reg [6:0] next = 7'd67;
17
18

```

```
19 parameter
20 fetch0 = 7'd67,
21 fetch1 = 7'd0,
22 fetch2 = 7'd1,
23 fetch3 = 7'd2,
24 fetch4 = 7'd3,
25
26 nop1 = 7'd4,
27
28 ldac1 = 7'd5,
29 ldac2 = 7'd6,
30 ldac3 = 7'd7,
31 ldac4 = 7'd8,
32
33
34 loadac1 = 7'd9,
35 loadac2 = 7'd10,
36 loadac3 = 7'd11,
37
38 stac1 = 7'd12,
39 stac2 = 7'd13,
40
41 mvacr1 = 7'd14,
42 mvaca1 = 7'd15,
43 mvacsum1 = 7'd16,
44 mvaci1 = 7'd17,
45 mvsumac1 = 7'd18,
46 mviri1 = 7'd19,
47 mvkri1 = 7'd20,
48 mvjri1 = 7'd21,
49 mvidr1 = 7'd22,
50
51 mulr1 = 7'd23,
52 mula1 = 7'd24,
53 addr1 = 7'd25,
54 addsum1 = 7'd26,
55 comp1 = 7'd27,
56
57 inck1 = 7'd28,
58 incj1 = 7'd29,
59 incr1 = 7'd30,
60
61 jmpzy1 = 7'd31,
62 jmpzy2 = 7'd32,
63
64 jmpzn1 = 7'd42,
65 jmpzn2 = 7'd43,
66
67 jump1 = 7'd33,
68 jump2 = 7'd34,
69
70 rstall1 = 7'd35,
71 rsti1 = 7'd36,
72 rstj1 = 7'd37,
73 rstk1 = 7'd38,
74 rstsum1 = 7'd39,
75
76 idle = 7'd40,
```

```
77 endop  = 7'd41 ,
78
79 fetchx = 7'd44 ,
80 ldacx = 7'd45 ,
81 ldacy = 7'd46 ,
82 ldacz = 7'd68 ,
83
84 jmpzyx = 7'd47 ,
85 jumpx = 7'd48 ,
86 loadacx1 =7'd49 ,
87 loadacx2 =7'd50 ,
88 loadacx3 =7'd51 ,
89 loadacx4 =7'd52 ,
90 loadacx5 =7'd53 ,
91 loadacx6 =7'd54 ,
92 loadacx7 =7'd55 ,
93 loadacx8 = 7'd69 ,
94 loadacx9 = 7'd70 ,
95
96 stacx1 = 7'd56 ,
97 stacx2 = 7'd57 ,
98 stacx3 = 7'd58 ,
99 stacx4 = 7'd59 ,
100 stacx5 = 7'd60 ,
101 stacx6 = 7'd61 ,
102 stacx7 = 7'd62 ,
103 stacx8 = 7'd63 ,
104 stacx9 = 7'd64 ,
105 stacx10 = 7'd65 ,
106 stacy = 7'd66 ;
107
108 always @(posedge clock)
109 present <= next;
110
111 always @(posedge clock)
112 begin
113 if (present == endop)
114 end_process <= 1'd1;
115 else
116 end_process <= 1'd0;
117 end
118
119 always @(present or z or instruction or status)
120 case(present)
121 idle: begin
122 read_en <= 4'd0;
123 inc_en <= 5'b00000 ; // PC AR R J K
124 write_en <= 16'b0000000000000000 ;
125 rst_en <= 16'b0000000000000000 ;
126 alu_op <= 2'd0;
127 if (status == 2'b01)
128 next <= fetch1;
129 else
130 next <= idle;
131 end
132
133 fetch0: begin
134 rst_en <= 16'b1111111111111111 ;
```

```
135 next <= fetch1;
136 end
137
138 fetch1: begin
139   read_en <= 4'd0;
140   inc_en <= 5'b00000 ;
141   write_en <= 16'b000000000000000000000000 ;
142   rst_en <= 16'b000000000000000000000000 ;
143   alu_op <= 2'd0;
144   memory_mode <= 4'd0;
145   next <= fetch2;
146 end
147
148 fetch2: begin
149   read_en <= 4'd11;
150   inc_en <= 5'b10000 ;
151   write_en <= 16'b0100000000000000 ;
152   rst_en <= 16'b0000000000000000 ;
153   alu_op <= 2'd0;
154   next <= fetchx;
155 end
156
157 fetchx: begin
158   read_en <= 4'd0;
159   inc_en <= 5'b00000 ;
160   write_en <= 16'b000000000000000000000000 ;
161   rst_en <= 16'b0000000000000000 ;
162   alu_op <= 2'd0;
163   next <= fetch3;
164 end
165
166
167 fetch3: begin
168   read_en <= 4'd1;
169   inc_en <= 5'b00000 ;
170   write_en <= 16'b0001000000000000 ;
171   rst_en <= 16'b0000000000000000 ;
172   alu_op <= 2'd0;
173   next <= fetch4;
174 end
175
176 fetch4: begin
177   read_en <= 4'd0;
178   inc_en <= 5'b00000 ;
179   write_en <= 16'b0000000000000000 ;
180   rst_en <= 16'b0000000000000000 ;
181   alu_op <= 2'd0;
182   if (end_core==2'b10) next <= endop;
183   else if (instruction [15:0]==16'd28) next <= nop1;
184   else if (instruction [15:0]==16'd1) next <= ldac1;
185   else if (instruction [15:0]==16'd2) next <= loadac1;
186   else if (instruction [15:0]==16'd3) next <= stac1;
187   else if (instruction [15:0]==16'd4) next <= mvacri;
188   else if (instruction [15:0]==16'd5) next <= mvaca1;
189   else if (instruction [15:0]==16'd6) next <= mvacsum1;
190   else if (instruction [15:0]==16'd7) next <= mvaci1;
191   else if (instruction [15:0]==16'd8) next <= mvsumaci1;
192   else if (instruction [15:0]==16'd9) next <= mvir1;
```

```

193 else if (instruction [15:0]==16'd10)      next <= mvkr1;
194 else if (instruction [15:0]==16'd11)      next <= mvjr1;
195 else if (instruction [15:0]==16'd12)      next <= mvidr1;
196 else if (instruction [15:0]==16'd13)      next <= mulr1;
197 else if (instruction [15:0]==16'd14)      next <= mula1;
198 else if (instruction [15:0]==16'd15)      next <= addr1;
199 else if (instruction [15:0]==16'd16)      next <= addsum1;
200 else if (instruction [15:0]==16'd17)      next <= comp1;
201 else if (instruction [15:0]==16'd18)      next <= inc1;
202 else if (instruction [15:0]==16'd19)      next <= incj1;
203 else if (instruction [15:0]==16'd20)      next <= incr1;
204 else if (instruction [15:0]==16'd22)      next <= jump1;
205 else if (instruction [15:0]==16'd23)      next <= r stalled1;
206 else if (instruction [15:0]==16'd24)      next <= rsti1;
207 else if (instruction [15:0]==16'd25)      next <= r stj1;
208 else if (instruction [15:0]==16'd26)      next <= r stk1;
209 else if (instruction [15:0]==16'd27)      next <= r stsum1;
210 else if ((instruction [15:0]==16'd21) & (z==1)) next <= jmpzy1;
211 else if ((instruction [15:0]==16'd21) & (z==0)) next <= jmpzn1;
212 else next <= endop;
213
214 end
215
216
217 ldac1: begin
218   read_en <= 4'd11;
219   inc_en <= 5'b10000 ;
220   write_en <= 16'b0100000000000000 ;
221   rst_en <= 16'b0000000000000000 ;
222   alu_op <= 2'd0;
223   memory_mode <= 4'd0;
224   next <= ldacx;
225 end
226
227 ldacx: begin
228   read_en <= 4'd0;
229   inc_en <= 5'b00000 ;
230   write_en <= 16'b0000000000000000 ;
231   rst_en <= 16'b0000000000000000 ;
232   alu_op <= 2'd0;
233   next <= ldac2;
234 end
235
236 ldac2: begin
237   read_en <= 4'd1;
238   inc_en <= 5'b00000 ;
239   write_en <= 16'b0010000000000000 ;
240   rst_en <= 16'b0000000000000000 ;
241   alu_op <= 2'd0;
242   next <= ldacz;
243 end
244
245 ldacz: begin
246   read_en <= 4'd0;
247   inc_en <= 5'b00000 ;
248   write_en <= 16'b0000000000000000 ;
249   rst_en <= 16'b0000000000000000 ;
250   alu_op <= 2'd0;

```

```
251 next <= ldac3;
252 end
253
254 ldac3: begin
255 read_en <= 4'd10;
256 inc_en <= 5'b00000 ;
257 write_en <= 16'b0100000000000000 ;
258 rst_en <= 16'b0000000000000000 ;
259 alu_op <= 2'd0;
260 next <= ldacy;
261 end
262
263 ldacy: begin
264 read_en <= 4'd0;
265 inc_en <= 5'b00000 ;
266 write_en <= 16'b0000000000000000 ;
267 rst_en <= 16'b0000000000000000 ;
268 alu_op <= 2'd0;
269 next <= ldac4;
270 end
271
272 ldac4: begin
273 read_en <= 4'd1;
274 inc_en <= 5'b00000 ;
275 write_en <= 16'b0000000000010000 ;
276 rst_en <= 16'b0000000000000000 ;
277 alu_op <= 2'd0;
278 next <= fetch1;
279 end
280
281 //////////////////////////////////////////////////////////////////
282 loadac1: begin
283 read_en <= 4'd2;
284 inc_en <= 5'b00000 ;
285 write_en <= 16'b0010000000000000 ;
286 rst_en <= 16'b0000000000000000 ;
287 alu_op <= 2'd0;
288 next <= loadac2;
289 end
290 //step1
291 loadac2 : begin
292 read_en <= 4'd10;
293 inc_en <= 5'b00000 ;
294 write_en <= 16'b0100000000000000 ;
295 rst_en <= 16'b0000000000000000 ;
296 alu_op <= 2'd0;
297 memory_mode <= 4'd1;
298 next <= loadacx1;
299 end
300
301 loadacx1: begin
302 read_en <= 4'd0;
303 inc_en <= 5'b00000 ;
304 write_en <= 16'b0000000000000000 ;
305 rst_en <= 16'b0000000000000000 ;
306 alu_op <= 2'd0;
307 memory_mode <= 4'd1;
308 next <= loadacx2;
```

```
309 end
310
311
312 //step2
313 loadacx2 : begin
314 read_en <= 4'd10;
315 inc_en <= 5'b00000 ;
316 write_en <= 16'b0100000000000000 ;
317 rst_en <= 16'b0000000000000000 ;
318 alu_op <= 2'd0;
319 memory_mode <= 4'd2;
320 next <= loadacx3;
321 end
322
323 loadacx3: begin
324 read_en <= 4'd0;
325 inc_en <= 5'b00000 ;
326 write_en <= 16'b0000000000000000 ;
327 rst_en <= 16'b0000000000000000 ;
328 alu_op <= 2'd0;
329 memory_mode <= 4'd2;
330 next <= loadacx4;
331 end
332
333 //step3
334 loadacx4 : begin
335 read_en <= 4'd10;
336 inc_en <= 5'b00000 ;
337 write_en <= 16'b0100000000000000 ;
338 rst_en <= 16'b0000000000000000 ;
339 alu_op <= 2'd0;
340 memory_mode <= 4'd3;
341 next <= loadacx5;
342 end
343
344 loadacx5: begin
345 read_en <= 4'd0;
346 inc_en <= 5'b00000 ;
347 write_en <= 16'b0000000000000000 ;
348 rst_en <= 16'b0000000000000000 ;
349 alu_op <= 2'd0;
350 memory_mode <= 4'd3;
351 next <= loadacx6;
352 end
353
354 //step4
355 loadacx6 : begin
356 read_en <= 4'd10;
357 inc_en <= 5'b00000 ;
358 write_en <= 16'b0100000000000000 ;
359 rst_en <= 16'b0000000000000000 ;
360 alu_op <= 2'd0;
361 memory_mode <= 4'd4;
362 next <= loadacx7;
363 end
364
365 loadacx7: begin
366 read_en <= 4'd0;
```

```
367 inc_en <= 5'b00000 ;
368 write_en <= 16'b0000000000000000 ;
369 rst_en <= 16'b0000000000000000 ;
370 alu_op <= 2'd0;
371 memory_mode <= 4'd4;
372 next <= loadacx8;
373 end
374
375 loadacx8: begin
376 read_en <= 4'd10;
377 inc_en <= 5'b00000 ;
378 write_en <= 16'b0100000000000000 ;
379 rst_en <= 16'b0000000000000000 ;
380 alu_op <= 2'd0;
381 memory_mode <= 4'd4;
382 next <= loadacx9;
383 end
384
385 loadacx9: begin
386 read_en <= 4'd0;
387 inc_en <= 5'b00000 ;
388 write_en <= 16'b0000000000000000 ;
389 rst_en <= 16'b0000000000000000 ;
390 alu_op <= 2'd0;
391 memory_mode <= 4'd4;
392 next <= loadac3;
393 end
394
395 loadac3 : begin
396 read_en <= 4'd1;
397 inc_en <= 5'b00000 ;
398 write_en <= 16'b0000000000010000 ;
399 rst_en <= 16'b0000000000000000 ;
400 alu_op <= 2'd0;
401 next <= fetch1;
402 end
403
404
405 /////////////////////////////////
406 stac1: begin
407 read_en <= 4'd9;
408 inc_en <= 5'b00000 ;
409 write_en <= 16'b0100000000000000 ;
410 rst_en <= 16'b0000000000000000 ;
411 alu_op <= 2'd0;
412 next <= stacy;
413 end
414
415 stacy:begin
416 read_en <= 4'd1;
417 inc_en <= 5'b00000 ;
418 write_en <= 16'b0000000000000000 ;
419 rst_en <= 16'b0000000000000000 ;
420 alu_op <= 2'd0;
421 memory_mode <= 4'd5;
422 next <= stac2;
423 end
424
```

```
425 stac2: begin
426   read_en <= 4'd1;
427   inc_en <= 5'b00000 ;
428   write_en <= 16'b00000000000000001000 ;
429   rst_en <= 16'b00000000000000000000 ;
430   alu_op <= 2'd0;
431   memory_mode <= 4'd5;
432   next <= stacx1;
433 end
434
435 stacx1: begin
436   read_en <= 4'd1;
437   inc_en <= 5'b00000 ;
438   write_en <= 16'b00000000000000000000 ;
439   rst_en <= 16'b00000000000000000000 ;
440   alu_op <= 2'd0;
441   memory_mode <= 4'd5;
442   next <= stacx2;
443 end
444 //core2
445 stacx2: begin
446   read_en <= 4'd1;
447   inc_en <= 5'b00000 ;
448   write_en <= 16'b00000000000000000000 ;
449   rst_en <= 16'b00000000000000000000 ;
450   alu_op <= 2'd0;
451   memory_mode <= 4'd6;
452   next <= stacx3;
453 end
454
455 stacx3: begin
456   read_en <= 4'd1;
457   inc_en <= 5'b00000 ;
458   write_en <= 16'b00000000000000001000 ;
459   rst_en <= 16'b00000000000000000000 ;
460   alu_op <= 2'd0;
461   memory_mode <= 4'd6;
462   next <= stacx4;
463 end
464
465 stacx4: begin
466   read_en <= 4'd1;
467   inc_en <= 5'b00000 ;
468   write_en <= 16'b00000000000000000000 ;
469   rst_en <= 16'b00000000000000000000 ;
470   alu_op <= 2'd0;
471   memory_mode <= 4'd6;
472   next <= stacx5;
473 end
474 //core3
475 stacx5: begin
476   read_en <= 4'd1;
477   inc_en <= 5'b00000 ;
478   write_en <= 16'b00000000000000000000 ;
479   rst_en <= 16'b00000000000000000000 ;
480   alu_op <= 2'd0;
481   memory_mode <= 4'd7;
482   next <= stacx6;
```

```
483 end
484
485 stacx6: begin
486   read_en <= 4'd1;
487   inc_en <= 5'b00000 ;
488   write_en <= 16'b00000000000000001000 ;
489   rst_en <= 16'b00000000000000000000 ;
490   alu_op <= 2'd0;
491   memory_mode <= 4'd7;
492   next <= stacx7;
493 end
494
495 stacx7: begin
496   read_en <= 4'd1;
497   inc_en <= 5'b00000 ;
498   write_en <= 16'b00000000000000000000 ;
499   rst_en <= 16'b00000000000000000000 ;
500   alu_op <= 2'd0;
501   memory_mode <= 4'd7;
502   next <= stacx8;
503 end
504 ////core4
505 stacx8: begin
506   read_en <= 4'd1;
507   inc_en <= 5'b00000 ;
508   write_en <= 16'b00000000000000000000 ;
509   rst_en <= 16'b00000000000000000000 ;
510   alu_op <= 2'd0;
511   memory_mode <= 4'd8;
512   next <= stacx9;
513 end
514
515 stacx9: begin
516   read_en <= 4'd1;
517   inc_en <= 5'b00000 ;
518   write_en <= 16'b00000000000000001000 ;
519   rst_en <= 16'b00000000000000000000 ;
520   alu_op <= 2'd0;
521   memory_mode <= 4'd8;
522   next <= stacx10;
523 end
524
525 stacx10: begin
526   read_en <= 4'd1;
527   inc_en <= 5'b00000 ;
528   write_en <= 16'b00000000000000000000 ;
529   rst_en <= 16'b00000000000000000000 ;
530   alu_op <= 2'd0;
531   memory_mode <= 4'd8;
532   next <= fetch1;
533 end
534
535
536
537 /////////////////
538 mvacr1: begin
539   read_en <= 4'd9;
540   inc_en <= 5'b00000 ;
```

```
541 write_en <= 16'b0000100000000000 ;
542 rst_en <= 16'b0000000000000000 ;
543 alu_op <= 2'd0;
544 next <= fetch1;
545 end
546
547 mvacal: begin
548   read_en <= 4'd9;
549   inc_en <= 5'b00000 ;
550   write_en <= 16'b000000000100000 ;
551   rst_en <= 16'b0000000000000000 ;
552   alu_op <= 2'd0;
553   next <= fetch1;
554 end
555
556 mvacsum1: begin
557   read_en <= 4'd9;
558   inc_en <= 5'b00000 ;
559   write_en <= 16'b0000000000100000 ;
560   rst_en <= 16'b0000000000000000 ;
561   alu_op <= 2'd0;
562   next <= fetch1;
563 end
564
565 mvaci1: begin
566   read_en <= 4'd9;
567   inc_en <= 5'b00000 ;
568   write_en <= 16'b0000001000000000 ;
569   rst_en <= 16'b0000000000000000 ;
570   alu_op <= 2'd0;
571   next <= fetch1;
572 end
573
574 mvsumac1: begin
575   read_en <= 4'd8;
576   inc_en <= 5'b00000 ;
577   write_en <= 16'b0000000000010000 ;
578   rst_en <= 16'b0000000000000000 ;
579   alu_op <= 2'd0;
580   next <= fetch1;
581 end
582
583
584 mvir1: begin
585   read_en <= 4'd4;
586   inc_en <= 5'b00000 ;
587   write_en <= 16'b0000100000000000 ;
588   rst_en <= 16'b0000000000000000 ;
589   alu_op <= 2'd0;
590   next <= fetch1;
591 end
592
593 mvkr1 : begin
594   read_en <= 4'd6;
595   inc_en <= 5'b00000 ;
596   write_en <= 16'b0000100000000000 ;
597   rst_en <= 16'b0000000000000000 ;
598   alu_op <= 2'd0;
```

```
599 next <= fetch1;
600 end
601
602 mvjr1: begin
603 read_en <= 4'd5;
604 inc_en <= 5'b00000 ;
605 write_en <= 16'b0000100000000000 ;
606 rst_en <= 16'b0000000000000000 ;
607 alu_op <= 2'd0;
608 next <= fetch1;
609 end
610
611 mvidr1: begin
612 read_en <= 4'd3;
613 inc_en <= 5'b00000 ;
614 write_en <= 16'b0000100000000000 ;
615 rst_en <= 16'b0000000000000000 ;
616 alu_op <= 2'd0;
617 next <= fetch1;
618 end
619
620 mulr1: begin
621 read_en <= 4'd2;
622 inc_en <= 5'b00000 ;
623 write_en <= 16'b0000000000000001 ;
624 rst_en <= 16'b0000000000000000 ;
625 alu_op <= 2'd3;
626 next <= fetch1;
627 end
628
629 mula1: begin
630 read_en <= 4'd7;
631 inc_en <= 5'b00000 ;
632 write_en <= 16'b0000000000000001 ;
633 rst_en <= 16'b0000000000000000 ;
634 alu_op <= 2'd3;
635 next <= fetch1;
636 end
637
638 addr1: begin
639 read_en <= 4'd2;
640 inc_en <= 5'b00000 ;
641 write_en <= 16'b0000000000000001 ;
642 rst_en <= 16'b0000000000000000 ;
643 alu_op <= 2'd1;
644 next <= fetch1;
645 end
646
647 addsum1: begin
648 read_en <= 4'd8;
649 inc_en <= 5'b00000 ;
650 write_en <= 16'b0000000000000001 ;
651 rst_en <= 16'b0000000000000000 ;
652 alu_op <= 2'd1;
653 next <= fetch1;
654 end
655
656 comp1: begin
```

```
657 read_en <= 4'd2;
658 inc_en <= 5'b00000 ;
659 write_en <= 16'b00000000000000000000 ;
660 rst_en <= 16'b00000000000000000000 ;
661 alu_op <= 2'd2;
662 next <= fetch1;
663 end
664
665 inck1: begin
666 read_en <= 4'd0;
667 inc_en <= 5'b00001 ;
668 write_en <= 16'b00000000000000000000 ;
669 rst_en <= 16'b00000000000000000000 ;
670 alu_op <= 2'd0;
671 next <= fetch1;
672 end
673
674 incj1: begin
675 read_en <= 4'd0;
676 inc_en <= 5'b00010 ;
677 write_en <= 16'b00000000000000000000 ;
678 rst_en <= 16'b00000000000000000000 ;
679 alu_op <= 2'd0;
680 next <= fetch1;
681 end
682
683 incr1: begin
684 read_en <= 4'd0;
685 inc_en <= 5'b00100 ;
686 write_en <= 16'b00000000000000000000 ;
687 rst_en <= 16'b00000000000000000000 ;
688 alu_op <= 2'd0;
689 next <= fetch1;
690 end
691
692 jmpzy1: begin
693 read_en <= 4'd11;
694 inc_en <= 5'b00000 ;
695 write_en <= 16'b01000000000000000000 ;
696 rst_en <= 16'b00000000000000000000 ;
697 alu_op <= 2'd0;
698 next <= jmpzyx;
699 end
700
701 jmpzyx: begin
702 read_en <= 4'd0;
703 inc_en <= 5'b00000 ;
704 write_en <= 16'b00000000000000000000 ;
705 rst_en <= 16'b00000000000000000000 ;
706 alu_op <= 2'd0;
707 next <= jmpzy2;
708 end
709
710 jmpzy2: begin
711 read_en <= 4'd1;
712 inc_en <= 5'b00000 ;
713 write_en <= 16'b10000000000000000000 ;
714 rst_en <= 16'b00000000000000000000 ;
```

```
715 alu_op <= 2'd0;
716 next <= fetch1;
717 end
718
719
720 jmpzn1: begin
721 read_en <= 4'd0;
722 inc_en <= 5'b10000 ;
723 write_en <= 16'b0000000000000000 ;
724 rst_en <= 16'b0000000000000000 ;
725 alu_op <= 2'd0;
726 next <= fetch1;
727 end
728
729
730
731 jump1: begin
732 read_en <= 4'd11;
733 inc_en <= 5'b00000 ;
734 write_en <= 16'b0100000000000000 ;
735 rst_en <= 16'b0000000000000000 ;
736 alu_op <= 2'd0;
737 next <= jumpx;
738 end
739
740 jumpx: begin
741 read_en <= 4'd0;
742 inc_en <= 5'b00000 ;
743 write_en <= 16'b0000000000000000 ;
744 rst_en <= 16'b0000000000000000 ;
745 alu_op <= 2'd0;
746 next <= jump2;
747 end
748
749 jump2: begin
750 read_en <= 4'd1;
751 inc_en <= 5'b00000 ;
752 write_en <= 16'b1000000000000000 ;
753 rst_en <= 16'b0000000000000000 ;
754 alu_op <= 2'd0;
755 next <= fetch1;
756 end
757
758
759 r stalled1: begin
760 read_en <= 4'd0;
761 inc_en <= 5'b00000 ;
762 write_en <= 16'b0000000000000000 ;
763 rst_en <= 16'b011011111100000 ;
764 alu_op <= 2'd0;
765 next <= fetch1;
766 end
767
768 r stalled1: begin
769 read_en <= 4'd3;
770 inc_en <= 5'b00000 ;
771 write_en <= 16'b0000001000000000 ;
772 rst_en <= 16'b0000000000000000 ;
```

```
773 alu_op <= 2'd0;
774 next <= fetch1;
775 end

776
777 rstj1: begin
778 read_en <= 4'd0;
779 inc_en <= 5'b00000 ;
780 write_en <= 16'b0000000000000000 ;
781 rst_en <= 16'b0000000100000000 ;
782 alu_op <= 2'd0;
783 next <= fetch1;
784 end

785
786 rstk1: begin
787 read_en <= 4'd0;
788 inc_en <= 5'b00000 ;
789 write_en <= 16'b0000000000000000 ;
790 rst_en <= 16'b0000000010000000 ;
791 alu_op <= 2'd0;
792 next <= fetch1;
793 end

794
795 rstsum1: begin
796 read_en <= 4'd0;
797 inc_en <= 5'b00000 ;
798 write_en <= 16'b0000000000000000 ;
799 rst_en <= 16'b0000000000100000 ;
800 alu_op <= 2'd0;
801 next <= fetch1;
802 end

803
804 nop1: begin
805 read_en <= 4'd0;
806 inc_en <= 5'b00000 ;
807 write_en <= 16'b0000000000000000 ;
808 rst_en <= 16'b0000000000000000 ;
809 alu_op <= 2'd0;
810 next <= fetch1;
811 end

812
813 endop: begin
814 read_en <= 4'd0;
815 inc_en <= 5'b00000 ;
816 write_en <= 16'b0000000000000000 ;
817 rst_en <= 16'b0000000000000000 ;
818 alu_op <= 2'd0;
819 next <= endop;
820 end

821
822 default: begin
823 read_en <= 4'd0;
824 inc_en <= 5'b00000 ;
825 write_en <= 16'b0000000000000000 ;
826 rst_en <= 16'b0000000000000000 ;
827 alu_op <= 2'd0;
828 next <= fetch1;
829 end
830 endcase
```

```
831  
832 endmodule
```

## A.5 ALU

```
1 module alu(input clock,  
2 input [15:0] in1,  
3 input [15:0] in2,  
4 input [1:0] alu_op,  
5 output reg [15:0] out,  
6 output reg z);  
7  
8  
9 always @(*)  
10 begin  
11     case(alu_op)  
12         2'd1: out <= in1 + in2;  
13         2'd2: begin  
14             if(in1<in2)  
15                 begin  
16                     z <= 1;  
17                 end  
18             else  
19                 begin  
20                     z <= 0;  
21                 end  
22             end  
23         2'd3: out <= in1*in2;  
24  
25     endcase  
26  
27 end  
28  
29 endmodule
```

## A.6 Register

```
1 // 16 bit register with read,write,inc,rst  
2 module register_16(clk,write_en,rst_en,inc_en,data_in,data_out);  
3 input clk;  
4 input write_en;  
5 input rst_en;  
6 input inc_en;  
7 input [15:0] data_in;  
8 output reg [15:0] data_out;  
9  
10 always @ (posedge clk)  
11 begin  
12     if (write_en == 1)  
13         data_out <= data_in;  
14     if (rst_en == 1)  
15         data_out <= 16'd0;  
16     if (inc_en == 1)  
17         data_out <= data_out + 16'd1;
```

```
18 end
19 endmodule
```

## A.7 Data Register

```
1 module DR(clk,write_en,data_in,data_out); //DR register
2 input clk;
3 input write_en;
4 input [15:0] data_in;
5 output reg [15:0] data_out;
6
7 always @ (posedge clk)
8 begin
9     if (write_en == 1)
10         data_out <= data_in;
11
12 end
13
14 endmodule
```

## A.8 Address Register

```
1 module AR(clk,write_en1,write_en2,rst_en,inc_en,data_in1,data_in2,data_out);
2     //normal register with read,write,inc,rst
3 input clk;
4 input write_en1;
5 input write_en2;
6 input rst_en;
7 input inc_en;
8 input [15:0] data_in1;
9 input [15:0] data_in2;
10 output reg [15:0] data_out;
11
12 always @ (posedge clk)
13 begin
14     if (write_en1 == 1)
15         data_out <= data_in1;
16     if (write_en2 == 1)
17         data_out <= data_in2;
18     if (rst_en == 1)
19         data_out <= 16'd0;
20     if (inc_en == 1)
21         data_out <= data_out + 16'd1;
22 end
23
24 endmodule
```

## A.9 Accumulator

```
1 module AC(input clk,
2             input write_en,
3             input rst_en,
4             input aluout_en,
```

```

5      input [15:0] data_in ,
6      input [15:0] alu_out ,
7      output reg [15:0] data_out );
8
9 always@ (posedge clk)
10 begin
11     if (write_en == 1)
12         data_out <= data_in;
13     if (aluout_en == 1)
14         data_out <= alu_out;
15     if (rst_en == 1)
16         data_out <= 16'd0;
17 end
18 endmodule

```

## A.10 Bus

```

1 module bus(
2     input clock,
3     input [3:0] read_en,
4     input [15:0] id,
5     input [15:0] i,
6     input [15:0] j,
7     input [15:0] k,
8     input [15:0] a,
9     input [15:0] dr,
10    input [15:0] ac,
11    input [15:0] r,
12    input [15:0] sum,
13    input [15:0] DRAM,
14    input [15:0] IRAM,
15    output reg[15:0] busout ) ;
16
17
18 always @ (id or i or j or k or a or dr or ac or r or sum or read_en)
19 begin
20 case(read_en)
21 4'd1: busout <= dr ;
22 4'd2: busout <= r ;
23 4'd3: busout <= id ;
24 4'd4: busout <= i ;
25 4'd5: busout <= j ;
26 4'd6: busout <= k ;
27 4'd7: busout <= a ;
28 4'd8: busout <= sum;
29 4'd9: busout <= ac;
30 4'd10: busout <= DRAM ;
31 4'd11: busout <= IRAM ;
32 default: busout <= 16'd0;
33 endcase
34 end
35 endmodule
36 endmodule

```

## A.11 Memory Controller

```
1 module memory_controller(
2     input  clock,
3     input [15:0] ar_out1,
4     input [15:0] ar_out2,
5     input [15:0] ar_out3,
6     input [15:0] ar_out4,
7
8     input [15:0] bus_out1,
9     input [15:0] bus_out2,
10    input [15:0] bus_out3,
11    input [15:0] bus_out4,
12
13    input [15:0] dram_out,
14    input [3:0] mode,
15
16    input end_process2,
17    input end_process3,
18    input end_process4,
19
20    output reg[15:0] ar_out,
21    output reg[15:0] bus_out,
22
23    output reg[15:0] dram_out1,
24    output reg[15:0] dram_out2,
25    output reg[15:0] dram_out3,
26    output reg[15:0] dram_out4);
27
28
29 always @(posedge clock)
30 begin
31     if (mode == 4'd1)
32     begin
33         ar_out <= ar_out1;
34         dram_out1 <= dram_out;
35     end
36     else if (mode == 4'd2)
37     begin
38         ar_out <= ar_out2;
39         dram_out2 <= dram_out;
40     end
41     else if (mode == 4'd3)
42     begin
43         ar_out <= ar_out3;
44         dram_out3 <= dram_out;
45     end
46     else if (mode == 4'd4)
47     begin
48         ar_out <= ar_out4;
49         dram_out4 <= dram_out;
50     end
51
52     else if (mode == 4'd5)
53     begin
54         ar_out <= ar_out1;
55         bus_out <= bus_out1;
56     end
```

```

57         else if ((mode == 4'd6) & (end_process2 != 1))
58             begin
59                 ar_out <= ar_out2;
60                 bus_out <= bus_out2;
61             end
62         else if ((mode == 4'd7) & (end_process3 != 1))
63             begin
64                 ar_out <= ar_out3;
65                 bus_out <= bus_out3;
66             end
67         else if ((mode == 4'd8) & (end_process4 != 1))
68             begin
69                 ar_out <= ar_out4;
70                 bus_out <= bus_out4;
71             end
72         else if (mode == 4'd0)
73             begin
74                 ar_out <= ar_out1;
75                 dram_out1 <= dram_out;
76                 dram_out2 <= dram_out;
77                 dram_out3 <= dram_out;
78                 dram_out4 <= dram_out;
79             end
80     end
81
82 endmodule

```

## A.12 Clock Divider

```

1 module clock_divider(inclk,ena,clk);
2
3
4 parameter maxcount=32'd10;// input 50MHz clock into output 5MHz clk
5
6 input inclk;
7 input ena;
8 output reg clk=1;
9
10 reg [31:0] count=32'd0;
11
12 always @ (posedge inclk )
13 begin
14     if (ena)
15         begin
16             if (count==maxcount)
17                 begin
18                     clk=~clk;
19                     count=32'd0;
20                 end
21             else
22                 begin
23                     count=count+1;
24                 end
25         end
26     else

```

```

27          begin
28              clk=0;
29          end
30      end
31
32 endmodule

```

### A.13 CoreID

```

1 module CoreID(clk,rst_en,data_in,data_out); //Core ID register
2
3     input clk;
4     input rst_en;
5     input[15:0] data_in;
6     output reg [15:0] data_out;
7
8     always @ (posedge clk)
9     begin
10         if (rst_en == 1)
11             data_out <= data_in;
12     end
13 endmodule

```

### A.14 End Core

```

1 module end_core(
2     input clock,
3     input[15:0] pc_out1,
4     input[15:0] pc_out2,
5     input[15:0] pc_out3,
6     input[15:0] pc_out4,
7     output reg[1:0] end_core1,
8     output reg[1:0] end_core2,
9     output reg[1:0] end_core3,
10    output reg[1:0] end_core4);
11
12 always @ (posedge clock)
13 begin
14     if (pc_out1 != pc_out2)
15         begin
16             end_core2 <= 2'b10;
17             end_core3 <= 2'b10;
18             end_core4 <= 2'b10;
19             end
20     else if (pc_out1 != pc_out3)
21         begin
22             end_core3 <= 2'b10;
23             end_core4 <= 2'b10;
24             end
25     else if (pc_out1 != pc_out4) end_core4 <= 2'b10;
26     else
27         begin
28             end_core1 <= 2'b00;
29             end_core2 <= 2'b00;

```

```
30                     end_core3 <= 2'b00;
31                     end_core4 <= 2'b00;
32             end
33         end
34
35 endmodule
```

### A.15 ALU Test Bench

```
1 `timescale 1ns/1ps
2
3 module ALU_tb();
4
5 localparam CLK_PERIOD = 2;
6
7 reg clock;
8 reg [15:0] in1;
9 reg [15:0] in2;
10 reg [1:0] alu_op;
11 wire [15:0] out;
12 wire z;
13
14 alu_test1 alu(.clock(clock),.in1(in1),.in2(in2),.alu_op(alu_op),.out(out),.z
15     (z));
16
17 initial begin
18     clock = 1'b0;
19     forever begin
20         #(CLK_PERIOD/2);
21         clock = ~clock;
22     end
23 end
24
25 initial begin
26     #10
27     in1 = 16'd3;
28     in2 = 16'd2;
29     alu_op = 2'd1;
30
31     #10
32     in1 = 16'd3;
33     in2 = 16'd3;
34     alu_op = 2'd1;
35
36     #10
37     in1 = 16'd4;
38     in2 = 16'd2;
39     alu_op = 2'd1;
40
41     #10
42     in1 = 16'd3;
43     in2 = 16'd2;
44     alu_op = 2'd2;
45
46     #10
47     in1 = 16'd3;
```

```
47         in2 = 16'd3;
48         alu_op = 2'd2;
49
50         #10
51         in1 = 16'd4;
52         in2 = 16'd6;
53         alu_op = 2'd2;
54
55         #10
56         in1 = 16'd3;
57         in2 = 16'd2;
58         alu_op = 2'd3;
59
60         #10
61         in1 = 16'd3;
62         in2 = 16'd3;
63         alu_op = 2'd3;
64
65         #10
66         in1 = 16'd4;
67         in2 = 16'd2;
68         alu_op = 2'd3;
69
70     end
71
72 endmodule
```

## A.16 Address Register Test Bench

```
1 `timescale 1ns/1ps
2
3 module AR_tb();
4
5 localparam CLK_PERIOD = 2;
6
7 reg clk, write_en1, write_en2, rst_en, inc_en;
8 reg [15:0] data_in1;
9 reg [15:0] data_in2;
10 wire [15:0] data_out;
11
12 AR AR(.clk(clk),.write_en1(write_en1),.write_en2(write_en2),
13 .rst_en(rst_en),.inc_en(inc_en),.data_in1(data_in1),
14 .data_in2(data_in2),.data_out(data_out));
15
16 initial begin
17     clk = 1'b0;
18     forever begin
19         #(CLK_PERIOD/2);
20         clk = ~clk;
21     end
22 end
23
24 initial begin
25     #10
26     write_en1 = 1;
27     write_en2 = 0;
```

```
28         rst_en = 0;
29         inc_en = 0;
30         data_in1 = 16'd100;
31         data_in2 = 16'd127;
32
33         #10
34         write_en1 = 0;
35         write_en2 = 1;
36         rst_en = 0;
37         inc_en = 0;
38         data_in1 = 16'd15;
39         data_in2 = 16'd10;
40
41         #10
42         write_en1 = 0;
43         write_en2 = 0;
44         rst_en = 0;
45         inc_en = 1;
46         data_in1 = 16'd1;
47         data_in2 = 16'd1;
48
49         #10
50         write_en1 = 0;
51         write_en2 = 0;
52         rst_en = 1;
53         inc_en = 0;
54         data_in1 = 16'd10;
55         data_in2 = 16'd10;
56
57     end
58
59 endmodule
```

### A.17 Bus Test Bench

```
1 `timescale 1ns/1ps
2
3 module bus_tb();
4
5 localparam CLK_PERIOD = 2;
6
7 reg clock;
8 reg [3:0] read_en;
9 reg [15:0] id;
10 reg [15:0] i;
11 reg [15:0] j;
12 reg [15:0] k;
13 reg [15:0] a;
14 reg [15:0] dr;
15 reg [15:0] ac;
16 reg [15:0] r;
17 reg [15:0] sum;
18 reg [15:0] DRAM;
19 reg [15:0] IRAM;
20 wire [15:0] busout;
```

```
22
23 bus1 bus(.clock(clock),.read_en(read_en),.id(id),.i(i),.j(j),.k(k),.a(a),.dr
24     (dr),
25 .ac(ac),.r(r),.sum(sum),.DRAM(DRAM),.IRAM(IRAM),.busout(busout));
26
27 initial begin
28     clock = 1'b0;
29     forever begin
30         #(CLK_PERIOD/2);
31         clock = ~clock;
32     end
33 end
34
35 initial begin
36     #10
37     read_en= 4'd1;
38     id= 16'd1;
39     i= 16'd02;
40     j= 16'd03;
41     k= 16'd04;
42     a= 16'd05;
43
44     dr= 16'd07;
45     ac= 16'd08;
46     r= 16'd09;
47     sum= 16'd10;
48     DRAM= 16'd011;
49     IRAM= 16'd012;
50
51     #10
52     read_en= 4'd2;
53     id= 16'd1;
54     i= 16'd02;
55     j= 16'd03;
56     k= 16'd04;
57     a= 16'd05;
58
59     dr= 16'd07;
60     ac= 16'd08;
61     r= 16'd09;
62     sum= 16'd10;
63     DRAM= 16'd011;
64     IRAM= 16'd012;
65
66     #10
67     read_en= 4'd3;
68     id= 16'd1;
69     i= 16'd02;
70     j= 16'd03;
71     k= 16'd04;
72     a= 16'd05;
73
74     dr= 16'd07;
75     ac= 16'd08;
76     r= 16'd09;
77     sum= 16'd10;
78     DRAM= 16'd011;
    IRAM= 16'd012;
```

```
79      #10
80      read_en= 4'd4;
81      id= 16'd1;
82      i= 16'd02;
83      j= 16'd03;
84      k= 16'd04;
85      a= 16'd05;
86
87      dr= 16'd07;
88      ac= 16'd08;
89      r= 16'd09;
90      sum= 16'd10;
91      DRAM= 16'd011;
92      IRAM= 16'd012;
93
94      #10
95      read_en= 4'd5;
96      id= 16'd1;
97      i= 16'd02;
98      j= 16'd03;
99      k= 16'd04;
100     a= 16'd05;
101
102     dr= 16'd07;
103     ac= 16'd08;
104     r= 16'd09;
105     sum= 16'd10;
106     DRAM= 16'd011;
107     IRAM= 16'd012;
108
109     #10
110     read_en= 4'd6;
111     id= 16'd1;
112     i= 16'd02;
113     j= 16'd03;
114     k= 16'd04;
115     a= 16'd05;
116
117     dr= 16'd07;
118     ac= 16'd08;
119     r= 16'd09;
120     sum= 16'd10;
121     DRAM= 16'd011;
122     IRAM= 16'd012;
123
124     #10
125     read_en= 4'd7;
126     id= 16'd1;
127     i= 16'd02;
128     j= 16'd03;
129     k= 16'd04;
130     a= 16'd05;
131
132     dr= 16'd07;
133     ac= 16'd08;
134     r= 16'd09;
135     sum= 16'd10;
```

```
137     DRAM= 16'd011;
138     IRAM= 16'd012;
139
140     #10
141     read_en= 4'd8;
142     id= 16'd1;
143     i= 16'd02;
144     j= 16'd03;
145     k= 16'd04;
146     a= 16'd05;
147
148     dr= 16'd07;
149     ac= 16'd08;
150     r= 16'd09;
151     sum= 16'd10;
152     DRAM= 16'd011;
153     IRAM= 16'd012;
154
155     #10
156     read_en= 4'd9;
157     id= 16'd1;
158     i= 16'd02;
159     j= 16'd03;
160     k= 16'd04;
161     a= 16'd05;
162
163     dr= 16'd07;
164     ac= 16'd08;
165     r= 16'd09;
166     sum= 16'd10;
167     DRAM= 16'd011;
168     IRAM= 16'd012;
169
170     #10
171     read_en= 4'd10;
172     id= 16'd1;
173     i= 16'd02;
174     j= 16'd03;
175     k= 16'd04;
176     a= 16'd05;
177
178     dr= 16'd07;
179     ac= 16'd08;
180     r= 16'd09;
181     sum= 16'd10;
182     DRAM= 16'd011;
183     IRAM= 16'd012;
184
185     #10
186     read_en= 4'd11;
187     id= 16'd1;
188     i= 16'd02;
189     j= 16'd03;
190     k= 16'd04;
191     a= 16'd05;
192
193     dr= 16'd07;
194     ac= 16'd08;
```

```
195     r= 16'd09;
196     sum= 16'd10;
197     DRAM= 16'd011;
198     IRAM= 16'd012;
199
200
201
202 end
203
204 endmodule
```

### A.18 Register Test Bench

```
1 `timescale 1ns/1ps
2
3 module register_tb();
4
5 localparam CLK_PERIOD = 20;
6
7 reg clk;
8 reg write_en;
9 reg rst_en;
10 reg inc_en;
11 reg [15:0] data_in;
12 wire [15:0] data_out;
13
14 Register1 register(.clk(clk),.write_en(write_en),.rst_en(rst_en),
15 .inc_en(inc_en),.data_in(data_in),.data_out(data_out));
16
17 initial begin
18     clk = 1'b0;
19     forever begin
20         #(CLK_PERIOD/2);
21         clk = ~clk;
22     end
23 end
24
25 initial begin
26     #10
27     write_en = 1;
28     data_in = 16'd10;
29
30     #10
31     write_en = 0;
32     data_in = 16'd15;
33
34     #10
35     inc_en = 1;
36
37     #10
38     inc_en = 0;
39     rst_en = 1;
40
41 end
42
43 endmodule
```

### A.19 Top Module Test Bench

```

1 `timescale 1ns/1ps
2
3 module top_module_tb();
4
5 localparam CLK_PERIOD = 20;
6
7 reg [1:0] status;
8 reg clk;
9 reg clockm;
10 reg ena;
11 wire z;
12 wire [15:0] write_en;
13 wire [3:0] read_en;
14 wire end_process1;
15 wire end_process2;
16 wire end_process3;
17 wire end_process4;
18 wire [31:0] clk_counter;
19
20 top_module top_module (
21     .status(status),
22     .clk(clk),
23     .clockm(clockm),
24     .ena(ena),
25     .z(z),
26     .write_en(write_en),
27     .read_en(read_en),
28     .end_process1(end_process1),
29     .end_process2(end_process2),
30     .end_process3(end_process3),
31     .end_process4(end_process4),
32     .clk_counter(clk_counter));
33
34 initial begin
35     clk = 1'b0;
36     clockm = 1'b0;
37     ena <= 1'b1;
38     status <= 2'b01;
39     forever begin
40         #(CLK_PERIOD/2);
41         clk = ~clk;
42         clockm = ~clockm;
43     end
44 end
45
46
47 endmodule

```

### A.20 Assembler code

```

1 assemblyf = open("C:\FPGA_project\Verilog16_1\Verilog16\multicorefinal.s", "r")
2 iram = open("C:\FPGA_project\Verilog16_1\Verilog16\iram.mif", "w")
3 a = True
4 ISA = {
5     'LDAC' : '0000000000000001',

```

```

6      'LDACR' : '000000000000000010',
7      'STAC' : '000000000000000011',
8      'MVACR' : '0000000000000000100',
9      'MVACA' : '0000000000000000101',
10     'MVACSUM' : '0000000000000000110',
11     'MVACI' : '0000000000000000111',
12     'MVSUMAC' : '00000000000000001000',
13     'MVIR' : '00000000000000001001',
14     'MVKR' : '00000000000000001010',
15     'MVJR' : '00000000000000001011',
16     'MVIDR' : '00000000000000001100',
17     'MULR' : '00000000000000001101',
18     'MULA' : '00000000000000001110',
19     'ADDR' : '00000000000000001111',
20     'ADDSUM' : '000000000000000010000',
21     'COMP' : '000000000000000010001',
22     'INCK' : '000000000000000010010',
23     'INCJ' : '000000000000000010011',
24     'INCR' : '000000000000000010100',
25     'JMPZ' : '000000000000000010101',
26     'JUMP' : '000000000000000010110',
27     'RSTALL' : '000000000000000010111',
28     'RSTI' : '000000000000000011000',
29     'RSTJ' : '000000000000000011001',
30     'RSTK' : '000000000000000011010',
31     'RSTSUM' : '000000000000000011011',
32     'NOP' : '000000000000000011100'
33 }
34 addr = 0
35 lines = assemblyf.readlines()
36 iram.write("WIDTH=16;\nDEPTH=256;\n\nADDRESS_RADIX=HEX;\nDATA_RADIX=BIN;\n\nCONTENT\nBEGIN")
37 iram.write("\n\t")
38 while addr<len(lines):
39     line = "".join(lines[addr].split())
40     if line == "ENDOP":
41         hexaddr = format(addr, '03x')
42         iram.write(str('['+hexaddr.upper()+'..OFF] : 0000000000000000;'))
43         break
44     elif line == "":
45         continue
46     else:
47         opcode = ISA[line]
48         hexaddr = format(addr, '03x')
49         iram.write(str(hexaddr.upper() + ' : ' + opcode + ';'))
50         iram.write("\n\t")
51         if line in ["LDAC", "JUMP", "JMPZ"]:
52             addr += 1
53             hexaddr = format(addr, '03x')
54             memaddr = "".join(lines[addr][4:].split())
55             memaddr = format(int(memaddr), '016b')
56             iram.write(str(hexaddr.upper() + ' : ' + memaddr + ';'))
57             iram.write("\n\t")
58     addr += 1
59
60 iram.write("\nEND;")
61
62 assemblyf.close()
63 iram.close()

```

## A.21 Multicore Python Simulation

```
1 NoC=2
2
3 Memory = [NoC,
4     4,
5     4,
6     4,
7     0,
8     0,
9     0,
10    1,
11
12    1,2,3,4,
13    5,6,7,8,
14    9,10,11,12,
15    13,14,15,16,
16
17    17,18,19,20,
18    21,22,23,24,
19    25,26,27,28,
20    29,30,31,32,
21
22    0,0,0,0,
23    0,0,0,0,
24    0,0,0,0,
25    0,0,0,0,
26    0,0,0,0,
27    0,0,0,0,
28    0,0,0,0,
29    0,0,0,0
30 ]
31
32 n = Memory[1]
33 m = Memory[2]
34 p = Memory[3]
35 s = Memory[7]
36 Memory[4],Memory[5],Memory[6] = 8,(8+n*m),(8+n*m+m*p)
37
38 def singlecore(CoreID,M=Memory):
39     ID = CoreID
40     AC = M[0]
41     R = ID
42     R += 1
43     if AC<R:
44         z = 1
45     else:
46         z = 0
47     if z==1:
48         return None
49     I = ID
50     J = 0
51
52     while True:      #loop
53         R = I
54         AC = M[1]
55         R+=1
56         if AC<R:
57             z = 1
58         else:
59             z = 0
60         if z == 1:
61             return None
```

```
62     K = 0
63
64     while True: #loop1
65         R = K
66         AC = M[3]
67         R+=1
68         #AC = AC-R
69         if AC < R:
70             z=1
71         else:
72             z=0
73         if z==1:
74             break
75     SUM = 0
76
77     while True: #loop2
78         R = J
79         AC = M[2]
80         R+=1
81         if AC < R:
82             z=1
83         else:
84             z=0
85         if z == 1:
86             R = I
87             AC = M[3]
88             AC = AC*R
89             R = K
90             AC = AC+R
91             R = AC
92             #AC = M[7]
93             #AC = AC*R
94             #R = AC
95             AC = M[6]
96             AC = AC+R
97             R = AC
98
99             AC = SUM
100            M[R]=AC
101
102            K = K+1
103            J = 0
104            break
105
106            R = I
107            AC = M[2]
108            AC = AC*R
109            R = J
110            AC += R
111            R = AC
112            AC = M[4]
113            AC += R
114            R = AC
115            AC = M[R]
116            A = AC
117
118            R = J
119            AC = M[3]
120            AC = AC*R
121            R = K
122            AC +=R
123            R = AC
124            AC = M[5]
```

```
125      AC += R
126      R = AC
127      AC = M[R]
128
129      AC = AC*A
130      AC = AC+SUM
131      SUM = AC
132
133      J +=1
134      continue
135      R = I
136      AC = M[0]
137      AC +=R
138      I = AC
139
140 singlecore(0)
141 singlecore(1)
142 #singlecore(2)
143 #singlecore(3)
144
145 print(Memory[40:44])
146 print(Memory[44:48])
147 print(Memory[48:52])
148 print(Memory[52:56])
```