



# DEPARTMENT OF ELECTRONIC & TELECOMMUNICATION ENGINEERING UNIVERSITY OF MORATUWA

EN3030 - CIRCUITS AND SYSTEMS DESIGN

## Multi-Core Processor Design for Matrix Multiplication using FPGA

---

### Group 26

---

*Authors:*

H.U.D.B. Haputhanthri  
D.B. Ishtaweera  
S.N. Liyanagoonawardena  
R.M.S.N.Rajapakshe

*Index Number:*

|         |
|---------|
| 170208K |
| 170242H |
| 170348M |
| 170476D |

This report is submitted as a partial fulfillment for the EN3030 - Circuits and Systems Design  
December 29, 2021

# Contents

|                                                                |           |
|----------------------------------------------------------------|-----------|
| <b>List of Figures</b>                                         | <b>4</b>  |
| <b>List of Tables</b>                                          | <b>5</b>  |
| <b>1 Introduction</b>                                          | <b>6</b>  |
| 1.1 Processor Design & Central Processing Unit (CPU) . . . . . | 6         |
| 1.2 Microprocessor . . . . .                                   | 6         |
| 1.3 Problem Statement . . . . .                                | 6         |
| 1.4 General Overview of the Proposed Solution . . . . .        | 7         |
| <b>2 Processor Design</b>                                      | <b>9</b>  |
| 2.1 Field Programmable Gate Array (FPGA) Overview . . . . .    | 9         |
| 2.2 Instruction Set Architecture (ISA) . . . . .               | 9         |
| 2.2.1 Data Path . . . . .                                      | 10        |
| 2.2.2 Instructions . . . . .                                   | 11        |
| 2.2.3 Micro-instructions . . . . .                             | 12        |
| 2.3 Components & Modules . . . . .                             | 15        |
| 2.3.1 Processor . . . . .                                      | 15        |
| 2.3.2 Arithmetic and Logic Unit (ALU) . . . . .                | 16        |
| 2.3.3 Register . . . . .                                       | 17        |
| 2.3.4 Accumulator (AC) . . . . .                               | 19        |
| 2.3.5 Buffer . . . . .                                         | 20        |
| 2.3.6 Bus . . . . .                                            | 20        |
| 2.3.7 Control Unit (CU) . . . . .                              | 20        |
| 2.3.8 OPs Decoder . . . . .                                    | 23        |
| 2.3.9 Clock . . . . .                                          | 25        |
| 2.3.10 Clock Corrector: Control Unit . . . . .                 | 25        |
| 2.3.11 Data Memory (DM)/ Multi-port Data Memory . . . . .      | 26        |
| 2.3.12 Instruction Memory (IM) . . . . .                       | 27        |
| 2.3.13 Top Module . . . . .                                    | 28        |
| <b>3 Algorithm</b>                                             | <b>29</b> |
| 3.1 Computation inside a core . . . . .                        | 29        |
| 3.2 Assembly Code . . . . .                                    | 32        |
| <b>4 Verilog Implementation</b>                                | <b>35</b> |
| 4.1 Simulation using Modelsim/Questasim . . . . .              | 35        |
| 4.2 Quartus Implementation . . . . .                           | 36        |
| 4.2.1 Design Flow Summary . . . . .                            | 38        |
| 4.2.2 Resource Usage Summary . . . . .                         | 39        |
| 4.2.3 Resource Utilization by Entity . . . . .                 | 39        |
| 4.2.4 RTL Design . . . . .                                     | 41        |
| <b>5 Performance Evaluation</b>                                | <b>46</b> |
| 5.1 Detailed Comparison . . . . .                              | 46        |

|     |                              |    |
|-----|------------------------------|----|
| 5.2 | Summary of results . . . . . | 51 |
| 5.3 | Conclusion . . . . .         | 51 |

|          |                        |           |
|----------|------------------------|-----------|
| <b>6</b> | <b>Acknowledgement</b> | <b>51</b> |
|----------|------------------------|-----------|

|                   |           |
|-------------------|-----------|
| <b>References</b> | <b>52</b> |
|-------------------|-----------|

# List of Figures

|    |                                                                                           |    |
|----|-------------------------------------------------------------------------------------------|----|
| 1  | Flow of the general overview of the proposed solution . . . . .                           | 8  |
| 2  | Data path of the proposed solution . . . . .                                              | 10 |
| 3  | State machine . . . . .                                                                   | 14 |
| 4  | Arithmetic and Logic Unit and Accumulator . . . . .                                       | 16 |
| 5  | An example for registers without increment . . . . .                                      | 18 |
| 6  | An example for registers with increment . . . . .                                         | 19 |
| 7  | CU, OPs Decoder, Clock Corrector and Clock . . . . .                                      | 21 |
| 8  | Implementation of Control Unit (Logic can be found in Fig. 9) . . . . .                   | 22 |
| 9  | Logic block in Control Unit in Fig. 8 . . . . .                                           | 23 |
| 10 | Implementation of Clock corrector for Control Unit . . . . .                              | 26 |
| 11 | Comparison between corrected clock for control unit with the standard clock . . . . .     | 26 |
| 12 | Data Memory . . . . .                                                                     | 27 |
| 13 | Instruction Memory . . . . .                                                              | 28 |
| 14 | Storing $Matrix_1$ to the data memory . . . . .                                           | 30 |
| 15 | Storing $Matrix_2$ to the data memory . . . . .                                           | 31 |
| 16 | Storing $Matrix_{out}$ to data the memory . . . . .                                       | 31 |
| 17 | Variables $i, j, count$ in Algorithm 1 . . . . .                                          | 32 |
| 18 | Identifying allocated computations of core with $core\_id$ . . . . .                      | 32 |
| 19 | Simulation in Questasim . . . . .                                                         | 35 |
| 20 | Using Quartus prime lite software . . . . .                                               | 36 |
| 21 | Using VSCode for coding . . . . .                                                         | 37 |
| 22 | 10 Core top level design . . . . .                                                        | 42 |
| 23 | Single Core RTL design . . . . .                                                          | 43 |
| 24 | Control Unit design . . . . .                                                             | 44 |
| 25 | ALU design . . . . .                                                                      | 45 |
| 26 | Matrix results for 4X4 matrix with 1 core . . . . .                                       | 46 |
| 27 | Calculation time for 4X4 matrix with 1 core . . . . .                                     | 46 |
| 28 | Matrix results for 4X4 matrix with 4 cores . . . . .                                      | 47 |
| 29 | Calculation time for 4X4 matrix with 4 cores . . . . .                                    | 47 |
| 30 | Matrix results for 8X8 matrix with 1 core . . . . .                                       | 48 |
| 31 | Calculation time for 8X8 matrix with 1 core . . . . .                                     | 48 |
| 32 | Matrix results for 8X8 matrix with 10 cores . . . . .                                     | 49 |
| 33 | Calculation time for 8X8 matrix with 10 cores . . . . .                                   | 49 |
| 34 | Performance comparison for different number of cores and different matrix sizes . . . . . | 51 |
| 35 | uOPs for microinstructions Part 1 . . . . .                                               | 53 |
| 36 | uOPs for microinstructions Pert 2 . . . . .                                               | 54 |

## List of Tables

|    |                                                 |    |
|----|-------------------------------------------------|----|
| 1  | Instruction Set . . . . .                       | 11 |
| 2  | Micro-instructions . . . . .                    | 12 |
| 3  | Inputs and Outputs of the processor . . . . .   | 15 |
| 4  | Functions of the ALU . . . . .                  | 17 |
| 5  | Registers and their sizes . . . . .             | 18 |
| 6  | Control signals decoded from the uOPs . . . . . | 24 |
| 7  | Inputs and Outputs of the DM . . . . .          | 27 |
| 8  | Inputs and Outputs of the IM . . . . .          | 28 |
| 9  | Design Flow Summary . . . . .                   | 38 |
| 10 | Resource Usage Summary . . . . .                | 39 |
| 11 | Resource Utilization by Entity . . . . .        | 39 |
| 12 | Results observed for 4x4 matrices . . . . .     | 47 |
| 13 | Results observed for 8x8 matrices . . . . .     | 50 |
| 14 | Results observed for 16x16 matrices . . . . .   | 50 |

# 1 Introduction

## 1.1 Processor Design & Central Processing Unit (CPU)

The goal of this project is to implement a simple processor that is capable of handling single instruction and multiple data processing in order to compute and output the results of a matrix multiplication. The processor is specifically designed and optimized to do matrix multiplication of square matrices of  $n \times n$ . However, the proposed solution also facilitates the multiplication of matrices with different shapes.

The proposed solution is implemented in Verilog Hardware Description Language using the QuartusLite 18.1.0.625.

The CPU can be seen as the brain of the computer that has the ability to retrieve and execute instructions. This processing is done using Control Unit (CU) and the Arithmetic and Logic Unit (ALU) where the CU retrieves the instructions in the correct order and interprets them while performing the logical tasks where the ALU conducts the necessary mathematical operations [1]. CPU consists of different types of registers and is directly associated with the main memory, input and output externally.

## 1.2 Microprocessor

The microprocessor is a small electronic device that directs current through the desired paths to achieve the specific outputs. While it shares many of its functions with the CPU, it includes additional functionalities using other processing units. These control and data processing logic are integrated in a single or lesser number of integrated circuits (ICs). While microprocessor contains both combinational and sequential digital logic it conducts its operations using binary number system representations.

## 1.3 Problem Statement

Matrix multiplication is an essential function used in the linear algebra and is frequently used in applications of statistics, applied mathematics, physics, engineering and economics [2].

This project aims to implement a multi-core processor which multiply two  $p \times q$  and  $q \times r$  matrices and obtain the results in the form of a  $p \times r$  matrix. It is specifically designed and optimized for square matrix multiplication with shape  $(n \times n)$  where  $n = p = q = r$ .

## 1.4 General Overview of the Proposed Solution

The proposed solution facilitates the multiplication of two  $p \times q$  and  $q \times r$  matrices using a number of cores as defined by the user as per his requirement.

Initially, following the input of the two matrices, along with the number of cores and matrix sizes, the implemented Python code will rearrange the matrices to be stored appropriately in the memory as per the proposed memory allocation defined in the algorithm. Next, the Verilog testbench will read the rearranged data which will be stored in the Data Memory (DM). Following the storage, the cores will initialize the operation and complete the multiplication of the two input matrices. After this realization, the Verilog testbench will write the results of the process stored in the DM to an external text file. Finally, the implemented Python code will rearrange the data stored in that text file in a more user readable format and outputs the results of the matrix multiplication while informing the latency of the computation. This is achieved after comparing the obtained results with expected outcome calculated from a matrix multiplication in Python.



Figure 1: Flow of the general overview of the proposed solution.

## 2 Processor Design

### 2.1 Field Programmable Gate Array (FPGA) Overview

FPGA can be seen as a semiconductor Integrated Circuit (IC) which has mutable electrical functionalities. The programmable connectivity between such millions of ICs can be implemented by the user according to the desired tasks. FPGAs provide several benefits such as; flexibility, acceleration of system performance and integration of multiple functions leading to comparatively fewer device failures due to less number of external devices [3]. At present, FPGAs are utilized in numerous field such as; ASIC prototyping, aerospace and defense, automotive, consumer electronics, industrial, medical, video & image processing and wired and wireless communications [4].

### 2.2 Instruction Set Architecture (ISA)

ISA can be seen as the programmers' view of the microprocessor which includes the information required to interact with it such as the information needed by the programmer to write a program for the microprocessor, the specifications needed by the compiler to compile the programs written in high level language.

When designing an ISA, it is crucial to focus on few important factors such as [5];

- The expected operation of the ISA and if the designed Instruction set includes all the instructions required to perform that task.
- The orthogonality of the instructions in order to minimize the overlapping which can lead to inefficient resource allocation and thus, higher cost.
- Trade offs that are related to performance, size and cost. Here it is important to note that, few registers would lead to more accessing of the memory which in turn increases the time consumption and decreases the performance but more registers would lead to a higher cost.
- The backward compatibility if a second design is expected to be released.
- Floating point data processing
- The requirement of interruptions
- The requirement of conditional instructions

### 2.2.1 Data Path

Figure 2 illustrates the optimized final data path diagram of the proposed solution.



Figure 2: Data path of the proposed solution.

### 2.2.2 Instructions

Considering the task is not general purpose but specific for matrix multiplication, we have mainly focused on developing Instruction Set which can allocate resources much faster with minimal number of clock cycles in order to do the calculations need to do the matrix multiplication. Maintaining orthogonality was also focused in order to reduce the overlapping of instructions.

Table 1 summarizes all the instructions, their opcodes and the operations used in the proposed project.

Table 1: Instruction Set

| Instruction | Opcode | Operation                                                           |
|-------------|--------|---------------------------------------------------------------------|
| SETN        | 0      | $n \leftarrow M[\alpha_n]$                                          |
| SETC        | 1      | $c \leftarrow M[\alpha_c]$                                          |
| SETTP1      | 2      | $TP1 \leftarrow n * i$                                              |
| SETTP2      | 3      | $TP2 \leftarrow n * j + k$                                          |
| SETTP3      | 4      | $TP3 \leftarrow (n * i + j) + 2k$ and reset <i>count</i>            |
| RNGI        | 5      | $i_{count} = n // c + n \% c! = 0$ and reset <i>TR2</i>             |
| STRTI       | 6      | $i_{start} = i_{count} \times core\_id$                             |
| ENDI        | 7      | $i_{end} = i_{count} \times (core\_id + 1)$                         |
| LDACTP1     | 8      | $AC \leftarrow M[TP1]$ and increment <i>count</i>                   |
| LDACTP2     | 9      | $AC \leftarrow M[TP2]$                                              |
| LDACTP3     | 10     | $AC \leftarrow M[TP3]$                                              |
| LDN         | 11     | $AC \leftarrow n$                                                   |
| MVACTR      | 12     | $AC \leftarrow TR$                                                  |
| MVCOUNT     | 13     | $TR \leftarrow count$ , increment <i>TP1</i> , increment <i>TP2</i> |
| MVJ         | 14     | $TR \leftarrow j$                                                   |
| MVI         | 15     | $TR \leftarrow i$                                                   |
| MVIE        | 16     | $AC \leftarrow i_{end}$ and increment <i>i</i>                      |
| MUL         | 17     | $AC \leftarrow AC * TR$                                             |
| ADD         | 18     | $AC \leftarrow AC + TR$                                             |
| SUB         | 19     | $AC \leftarrow AC - TR$ if $AC=0$ then $Z=1$ else $Z=0$             |
| STAC        | 20     | $M[TP3] \leftarrow AC$ , reset <i>TR2</i> , increment <i>j</i>      |
| JNPZ        | 21     | if $Z=0$ then jump to $\alpha$                                      |
| JNPZY       | 22     | jump to $\alpha$                                                    |
| JNPZN       | 23     | continue without jump                                               |
| RSTJ        | 24     | $j \leftarrow 0$                                                    |
| END         | 25     | $END \leftarrow 1$                                                  |
| MVTR2       | 26     | $TR \leftarrow TR2$                                                 |
| MVACTR2     | 27     | $TR2 \leftarrow AC$                                                 |
| LDTR2       | 28     | $AC \leftarrow TR2$                                                 |

### 2.2.3 Micro-instructions

Table 2 presents the micro-instructions for each instruction in the proposed solution.

Table 2: Micro-instructions

| Instruction | Micro-instruction                                                            |
|-------------|------------------------------------------------------------------------------|
| FETCH       | ARPC<br>DRMI, PCINC<br>IRDR, ARPC                                            |
| SETN        | DRMI, PCINC<br>ARDR<br>DRMD<br>NDR                                           |
| SETC        | DRMI, PCINC<br>ARDR<br>DRMD<br>CDR                                           |
| SETTP1      | ACN<br>TRI<br>MUL<br>TP1AC                                                   |
| SETTP2      | ACN<br>TRJ<br>MUL<br>ADDK<br>TP2AC                                           |
| SETTP3      | ACN<br>TRI<br>MUL<br>TRJ<br>ADD<br>ADDK<br>ADDK<br>TP3AC, COUNTZ0            |
| RNGI        | ACN<br>TRC<br>INTDIV<br>TR2AC<br>ACN<br>ISMOD<br>TRTR2<br>ADD<br>ICAC, TR2Z0 |
| STRTI       | TRIC<br>ACID                                                                 |

| <b>Instruction</b> | <b>Micro-instruction</b>           |
|--------------------|------------------------------------|
|                    | MUL<br>IAC                         |
| ENDI               | ACID<br>ACINC, TRIC<br>MUL<br>IEAC |
| LDACTP1            | ARTP1,COUNTINC<br>DRMD<br>ACDR     |
| LDACTP2            | ARTP2<br>DRMD<br>ACDR              |
| LDACTP3            | ARTP3<br>DRMD<br>ACDR              |
| LDN                | ACN                                |
| MVACTR             | TRAC                               |
| MVCOUNT            | TRCOUNT, TP1INC, TP2INC            |
| MVJ                | TRJ                                |
| MVI                | TRI                                |
| MVIE               | ACIE, IINC                         |
| MUL                | MUL                                |
| ADD                | ADD                                |
| SUB                | SUB                                |
| STAC               | ARTP3, TR2Z0<br>DRAC<br>MDDR, JINC |
| JNPZ               |                                    |
| JNPZY              | DRMI<br>PCDR                       |
| JNPZN              | PCINC                              |
| RSTJ               | JZ0                                |
| END                | ENDINC                             |
| MVTR2              | TRTR2                              |
| MVACTR2            | TR2AC                              |
| LDTR2              | ACTR2                              |

Figure 3 depicts the state machine of the proposed solution.



Figure 3: State machine

## 2.3 Components & Modules

### 2.3.1 Processor

The processor contains modules and registers used to conduct all processing operations and the controlling and managing of the related instructions. However memory related modules such as Data Memory (DM) and Instruction memory (IM) are situated outside the processor. The inputs and outputs for the processor are as shown in the Table 3;

Table 3: Inputs and Outputs of the processor

|        |          |                                                                                    |
|--------|----------|------------------------------------------------------------------------------------|
| Input  | DM out   | Data coming from the data memory                                                   |
|        | IM out   | Instructions coming from the Instruction memory                                    |
|        | Start    | Signal given to indicate the start of the process                                  |
|        | Clock    | Clock pulse given for synchronization                                              |
|        | Reset    | Signal given to clear all the registers except register in Control unit            |
| Output | DM in    | Data received to Data memory                                                       |
|        | AR out   | Output from the Address Register (AR) goes to DM and IM to point memory locations. |
|        | Read MI  | Read Signal for reading Instruction Memory                                         |
|        | Write MD | Write signal for writing to Data Memory                                            |
|        | Read MD  | Read Signal for reading Data the memory                                            |

The processor consists of several main modules listed below;

- CU - Is responsible for managing and controlling all the processes in the processor.
- ALU - Is responsible for conducting all arithmetic and logical operations.
- AC - A special purpose register directly connected to the ALU and is used to store ALU outputs immediately.
- AR - A special purpose register to point to the address of the data/instruction to be read.
- TR - A general purpose register for temporary storage which is utilized for computations in ALU.
- DR - A special purpose register to store both instruction and data which will be written to/ read from instruction/ data memories.
- PC - A special purpose register to point to the next instruction to be fetched.
- N - A special purpose register to store the dimensions of the input matrices ( $n$ ).
- C - A special purpose register to store the number of cores ( $c$ ).

- IR - A special purpose register to store current instruction.
- I - A special purpose register to store the row index of the position of output matrix which is currently being computed (More details can be found in Fig. 17, Algorithm 1).
- J - A special purpose register to store the column index of the position of the output matrix which is currently being computed (More details can be found in Fig. 17, Algorithm 1).
- ID - A special purpose register to store the core identity (*core\_id*).
- COUNT - A special purpose register to store the variable *count* which indicates the current position of row-I in first matrix and column-J in second matrix (More details can be found in Fig. 17, Algorithm 1).
- TP1 - A special purpose register to store the memory location of the first matrix which is being currently computed.
- TP2 - A special purpose register to store the memory location of the second matrix which is being currently computed.
- TP3 - A special purpose register to store the memory location of the output matrix which will be computed next.
- IC - A special purpose register to indicate the number of rows of output matrix which should be calculated by a particular core ( $i_c$ )
- IE - A special purpose register to indicate the last row of the output matrix which should be calculated by a particular core ( $i_{end}$ )
- TR2 - A general purpose register to store the intermediate values of computations done by ALU
- END - A special purpose register to indicate the end of matrix multiplication process of a particular core.

### 2.3.2 Arithmetic and Logic Unit (ALU)

Figure 4 illustrates the Arithmetic and Logic Unit and Accumulator used in the proposed solution.



Figure 4: Arithmetic and Logic Unit and Accumulator

ALU is responsible for conduction of all arithmetic and logic functions in the processor. As the goal of this project is to multiply two matrices, ALU plays a vital role in the proposed solution.

ALU is located inside the processor where it takes 3 main inputs;

- in-bus - 16 bit input from the bus
- in-AC - 16 bit input from the AC
- Operation - 3 bit control signal from the OPs Decoder (which will be explained in detailed in the section 2.3.8) to select the required specific function. The utilization of 3 bits is to facilitate the 6 different functions of the proposed ALU.

ALU is directly connected with the AC register and AC is connected to Z register through a NOR gate and to the bus via a buffer.

The value of Z will be utilized in the Algorithm 1 of the proposed solution and explained in the section 2.3.7.

In the proposed solution, ALU conducts 6 main functions which are mentioned in Table 4;

Table 4: Functions of the ALU

| Function | Description                                                                 |
|----------|-----------------------------------------------------------------------------|
| Mul      | $\text{data-out} = \text{in-AC} * \text{in-bus};$                           |
| Add      | $\text{data-out} = \text{in-AC} + \text{in-bus};$                           |
| Sub      | $\text{data-out} = \text{in-AC} - \text{in-bus};$                           |
| Int-div  | $\text{data-out} = (\text{in-AC} - \text{in-AC \% in-bus}) / \text{in-bus}$ |
| Is-mod   | $\text{data-out} = ((\text{in-AC \% in-bus}) != 0)$                         |
| Pass     | $\text{data-out} = \text{in-bus};$                                          |

### 2.3.3 Register

When designing the ISA, we have defined the registers in a way that it would bring maximum efficiency to the desired task and that it would need minimum possible references to main memory in between operations in order to avoid the inefficient time consumption.

The registers utilized in the proposed solution and their sizes are shown in the table 5 .

Table 5: Registers and their sizes

| Register | Size (bits) |
|----------|-------------|
| AR       | 16          |
| TR       | 16          |
| TR2      | 16          |
| DR       | 16          |
| IR       | 16          |
| PC       | 16          |
| AC       | 16          |
| N        | 16          |
| C        | 16          |
| J        | 16          |
| ID       | 16          |
| COUNT    | 16          |
| TP1      | 16          |
| TP2      | 16          |
| TP3      | 16          |
| IC       | 16          |
| I        | 16          |
| IE       | 16          |

Registers in general can mainly be divided into two groups that are registers with and without increment. The importance of those are explained below.

**Registers without increment** - Figure 5 shows an example for registers without increment used in the proposed solution.



Figure 5: An example for registers without increment

The purpose of these registers is to temporarily store data during each processing tasks for a short period of time. These registers are not given an increment control signal and thus any increment related to these must be done by passing the register data through the ALU and feeding it back to the same register through the bus.

They consists of following control signals;

- Load enable - load the data to register
- Reset - Reset the data of register to 0

The data stored in these registers are always available at the input of buffer as shown in Figure 5 and will fed into the bus whenever a buffer control signal is issued. In the proposed solution the registers can be written in each positive edge if the load enable control signal specific to each register is given.

**Registers with increment** Figure 6- shows an example for registers with increment used in the proposed solution.



Figure 6: An example for registers with increment

While these registers are similar to registers without increment they have the capability to increment its value by 1 without passing the values through the ALU. This is especially useful in tracking the interactions in a loop while maintaining efficiency. The registers TP1, TP2, AC, J, I, COUNT are such examples.

Apart from that, we have utilized registers with several other types of control signals which helps to improve the performance of the proposed hardware solution. All the registers consists of one or more following control signals (More details can be found in Fig. 2);

- Load enable - Load the data to register
- Reset - Reset the data of register to 0
- Increment - Increment the data in register by 1
- Increment-K - Increment the data in register by K. This is a special type of register only used for AC to reduce the number of clock cycles.

#### 2.3.4 Accumulator (AC)

The Accumulator is a register with increment, which has a unique combination of connections where it receives its input from the ALU and sends its output to both back to the ALU and its connected buffer, after conducting the necessary arithmetic and logical operations.

AC also passes its output into the Z flag through a NOT gate. This mainly utilized to find the end of a loop by comparing the current position with the maximum length of the particular loop. (More details can be found in the section (1) where the ALU output will be equal to 0 which will in turn make the Z flag 1.

It is also given the previously mentioned control signals such as LDAC, RSTAC and INCAC in addition to the ADDKAC which increment the data in the register by a value K.

### 2.3.5 Buffer

Buffer is a single input single output unit which passes the input with/without the enabling of the buffer as per its implementation. This is used select the which register output should be passed to bus. In the proposed solution the buffer is mainly used to control the flow of 16 bits input to the bus from the registers, Data Memory and the Instruction Memory. These buffers are enabled using control signals issued from the OPs Decoder as explained below.

### 2.3.6 Bus

Bus refers to the set of wires connecting all registers and modules within the processor except the CU. It transfers data to and from all other modules inside the processor and can even exchange data with the Instruction Memory (IM) and Data Memory (DM).

In the proposed solution the LD $x$  (load enable of  $x$ ) control signal given to registers allows the data in the bus to be written into the registers and the control signal  $x$ BUS given to the Buffers allows the data sent out from the registers to pass through the buffer and to be fed into the bus.

### 2.3.7 Control Unit (CU)

Figure 7 shows the CU, OPs Decoder, Clock Corrector and Clock of the proposed solution.



Figure 7: CU, OPs Decoder, Clock Corrector and Clock

CU generates the control signals and handles all the operations within the processor and their respective instructions. The main source of input for the CU is the IR, in addition to the corrected clock signal, where it fetches the respective instructions in addition to the Z flag. Finally, it outputs the command signals(OPs) in the negative edge of the clock. This implementation is influenced from [5].

After obtaining the appropriate instruction, it decodes them into commands(OPs), conditions for the multiplexer within the CU, BT and the next Address. Finally, it outputs the uOPS into the OPs Decoder unit to issue the respective control signals to perform ALU operations, load registers, write register content into the bus, memory read and memory write.

The Z flag denotes the status of the ALU output where  $Z=1$  in case where the ALU output equals zero. Inside the CU, after selecting the relevant micro instruction from the microcode memory, CU issues another 2 bit conditional signal and a 1 bit BT signal which will use to get the output from the two multiplexers inside the CU. One multiplexer will use those values to select 1, Z, or Z' as its input and the other one will use it to select the correct addresses to be used in the next step. This can either lead to the incremented next location of the micro code memory (current address +1), to the address from the microcode memory (address specified by ADDR) or to a new instruction address (MAP address) from the IR.

Overall flow of the CU is shown in Fig. 8.

Note that, the clock received for the CU is returned from Clock Corrector module explained below. The release of uOPs from control unit is done at negative clock edge cycles to avoid the conflicts with operations in the other modules.



Figure 8: Implementation of Control Unit (Logic can be found in Fig. 9)



Figure 9: Logic block in Control Unit in Fig. 8

**Mapping Logic in Control Unit** In order to obtain the corresponding starting microcode memory address given a particular instruction, mapping logic is introduced. This is implemented as a look-up table which contains the mapping between IR and MAP address. Implementation details can be found in Appendix (Fig. 35, 36).

### 2.3.8 OPs Decoder

This unit generates the control signals using vertical microcodes methods [5]. The implementation of a OPs Decoder instead of using horizontal microcodes method is due to;

Horizontal microcodes consists majorly of zeros. Thus, more than 85 percent bits are inactive. However, vertical microcode reduces this number of bits.

These control signals are used to manage several processes such as;

- ALU operations - 3 bits signals to facilitate up to 8 arithmetic and logical functions. The proposed solution utilizes 6 different functions. The functions are summarized in the table 4
- LD<sub>x</sub> - Signals to load the registers with the content in the bus or another register, as per their connectivity.
- xBUS - Control signal sent to each buffer to feed the 16 bit output from each register to be written into the bus.
- RST<sub>x</sub> - Control signal to clear and reset the values in a register.
- INC<sub>x</sub> - Control signal to increment the values in a register.

The table 6 summarizes the decoding of the uOPs and their respective control signals.

Table 6: Control signals decoded from the uOPs

| Control Signal | uOPs                                                       |
|----------------|------------------------------------------------------------|
| DREAD          | DRMD                                                       |
| IREAD          | DRMI                                                       |
| DWRITE         | MDDR                                                       |
| BUSMEM         | MDDR                                                       |
| MEMBUSD        | DRMD                                                       |
| MEMBUSI        | DRMI                                                       |
| TRBUS          | ISMOD, INTDIV, ADD, MUL, SUB                               |
| DRBUS          | ARDR, ACDR, MDDR                                           |
| PCBUS          | ARPC                                                       |
| NBUS           | ACN                                                        |
| CBUS           | TRC                                                        |
| JBUS           | TRJ                                                        |
| IDBUS          | ACID                                                       |
| COUNTBUS       | TRCOUNT                                                    |
| TP1BUS         | ARTP1                                                      |
| TP2BUS         | ARTP2                                                      |
| TP3BUS         | ARTP3                                                      |
| ICBUS          | TRIC                                                       |
| IBUS           | TRI                                                        |
| IEBUS          | ACIE                                                       |
| TR2BUS         | ACTR2, TRTR2                                               |
| ACBUS          | TRAC, DRAC                                                 |
| LDAR           | ARDR, ARPC, ARTP1, ARTP2, ARTP3                            |
| LDTR           | TRAC, TRJ, TRTR2, TRIC, TRC, TRI, TRCOUNT                  |
| LDDR           | DRAC, DRMI, DRMD                                           |
| LDPC           | PCDR                                                       |
| LDN            | NDR                                                        |
| LDC            | CDR                                                        |
| LDIR           | IRDR                                                       |
| LDTP1          | TP1AC                                                      |
| LDTP2          | TP2AC                                                      |
| LDTP3          | TP3AC                                                      |
| LDIC           | ICAC                                                       |
| LDI            | IAC                                                        |
| LDIE           | IEAC                                                       |
| LDTR2          | TR2AC                                                      |
| LDAC           | ISMOD, INTDIV, ACN, ACIE, ACTR2, ADD, ACID, ACDR, MUL, SUB |
| RSTAR          | RESET                                                      |
| RSTTR          | RESET                                                      |
| RSTDTR         | RESET                                                      |
| RSTPC          | RESET                                                      |
| RSTN           | RESET                                                      |

| Control Signal | uOPs            |
|----------------|-----------------|
| RSTC           | RESET           |
| RSTIR          | RESET           |
| RSTJ           | JZ0, RESET      |
| RSTID          | RESET           |
| RSTCOUNT       | COUNTZ0, RESET  |
| RSTTP1         | RESET           |
| RSTTP2         | RESET           |
| RSTTP3         | RESET           |
| RSTIC          | RESET           |
| RSTI           | RESET           |
| RSTIE          | RESET           |
| RSTTR2         | TR2Z0, RESET    |
| RSTAC          | RESET           |
| RSTEND         | START, RESET    |
| INCPC          | PCINC           |
| INCJ           | JINC            |
| INCCOUNT       | COUNTINC        |
| INCTP1         | TP1INC          |
| INCTP2         | TP2INC          |
| INCI           | IINC            |
| INCAC          | ACINC           |
| INCEND         | ENDINC          |
| ALU0           | MUL, SUB, ISMOD |
| ALU1           | ADD, SUB        |
| ALU2           | INTDIV, ISMOD   |
| ADDKAC         | ADDK            |

### 2.3.9 Clock

The clock signal is used to synchronize and coordinate the actions of the process where it oscillates in between the high and low states. This is especially used in synchronous sequential logic which changes its state changes only at discrete times of the clock signals. In the proposed solution, all the sequential implementations operate in either positive edge or the negative edge of the clock cycle and the process initiation in each core occurs at a positive edge.

### 2.3.10 Clock Corrector: Control Unit

Before the issuing of START signal by user and after getting END signal from the core, core should remain in the *idle* state. To achieve this, we have implemented Clock Corrector module as shown in Fig. 10. This makes the clock for control unit sensitive to START and END signals of the process. In a nutshell, this module is responsible to supply the standard clock as the clock for CU from the start to end of the process and otherwise remains 0.

Particularly this method will give the capability to RESET all other registers except control unit register without starting the computation process and keep the processor in idle state when there is no computations.

Apart from that, this module ensure that regardless of the instance the user starts the process, the initiation of the processing of each core occurs at a positive edge of the clock signal.



Figure 10: Implementation of Clock corrector for Control Unit

The corrected clock for CU when compares to the standard clock is shown in Figure 11



Figure 11: Comparison between corrected clock for control unit with the standard clock

### 2.3.11 Data Memory (DM)/ Multi-port Data Memory

DM is the main memory storage used to store the two input matrices and the resultant matrix. This is located outside the processor and has a width of 16 bits and a height of 1000 lines.

Since this is a multi-core solution, our DM is made such that it can be accessed parallelly each core is directed towards separate locations to avoid any memory conflict. Hence, this implemented by multi-port memory structure where there is multiple AR ports, read-enable ports and data-out ports, one for each core. Currently, it is implemented to ten cores, however it can be expanded to any number of cores due to its scalability. In the proposed solution, at the initiation of the process, the data is always written into the memory in a positive clock edge.

The inputs and output of the DM are as shown in table 7.

Table 7: Inputs and Outputs of the DM

|        |            |                                                                                                                                                                                                                                     |
|--------|------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Input  | AR-out     | Output from the AR where it goes into IM and DM. This is a 16 bit address of a DM location in which data must be stored or retrieved. This input is loaded into the DM when the control signal DREAD is issued.                     |
|        | output-BUS | Output from the Bus which includes the data to be stored which are sent from the rest of the registers via bus, in addition to AR. This input is stored into the DM when the control signal BUSMEM is issued along with the DWRITE. |
|        | Clk        | Clock Signal                                                                                                                                                                                                                        |
| Output | input-BUS  | Input to the bus where the requested data from DM is fed into the bus when the buffer is enabled using the control signal MEMBUSD                                                                                                   |

Figure 12 illustrates the Data Memory of the proposed solution.



Figure 12: Data Memory

### 2.3.12 Instruction Memory (IM)

The IM is stored with the instructions to be executed in the correct order and is located outside the processor. In the proposed solution it is stored with 36 lines (16 bits wide), in the order as implemented in the assembly code (Section 3.3). It is a Read Only Memory (ROM) where new information cannot be written into it.

The input and output of the IM are as shown in table 8.

Table 8: Inputs and Outputs of the IM

|        |            |                                                                                                                                          |
|--------|------------|------------------------------------------------------------------------------------------------------------------------------------------|
| Input  | AR-out out | Output from the AR where it goes to IM and DM. This is a 16 bit address of a IM location.                                                |
| Output | input-BUS  | Input to the bus where the requested instruction from IM is fed into the bus when the buffer is enabled using the control signal MEMBUSI |

Figure 13 depicts the Instruction Memory of the proposed solution.



Figure 13: Instruction Memory

### 2.3.13 Top Module

The top module facilitates the interconnections and the initiations of the multiple cores, Multi-port Data Memory and the Instruction Memory. It also allows to take out the final output matrix from the system.

### 3 Algorithm

#### 3.1 Computation inside a core

##### 1. Initialization

- (a) Store data to data memory
  - i. **Matrix 1:** Flatten matrix will be saved in the Data memory ( $M_D$ / DM) as shown in Fig. 14. Saving will be started from  $0^{th}$  memory location.
  - ii. **Matrix 2:** Flatten matrix of Transpose of Matrix 2 will be saved in the Data memory ( $M_D$ ) as shown in Fig. 15. Saving will be started from  $K^{th}$  memory location. Here  $K = (\text{mem}_\text{size} - 2) // 3$ .
  - iii. **Number of cores ( $c$ ):** Number of cores ( $c$ ) will be saved in the memory. This will be used to identify allocated computations of each core.
  - iv. **Matrix size ( $n$ ):** Matrix size ( $n$ ) will be saved in the memory. This will be used to identify allocated computations of each core.
- (b) Initialize the core with  $\text{core\_id}$   
Here the  $\text{core\_id}$  represents a unique identity of each core. This is vary from  $[0, c-1]$ .  
Here  $c$  is the total number of cores.

##### 2. Identifying the allocated computations of the core utilizing the $\text{core\_id}$

As shown in Fig. 18 core identifies the allocated computations through the  $\text{core\_id}$ .

##### 3. Computing allocated locations of output matrix

---

#### Algorithm 1: Computation in $p^{th}$ core

---

**Result:**  $M_D$  with computed output matrix

```

Initialization:  $\text{core\_id} \leftarrow p, c \leftarrow M_D, n \leftarrow M_D, \text{Matrix}_1 \leftarrow M_D, \text{Matrix}_2 \leftarrow M_D ;$ 
 $i_{\text{count}} \leftarrow \text{calculate using } n, c ;$ 
 $i_{\text{start}} \leftarrow \text{core\_id} \times i_c ;$ 
 $i_{\text{end}} \leftarrow (\text{core\_id} + 1) \times i_c ;$ 
for  $i; i_{\text{start}} \leq i < i_{\text{end}}; i++ \text{ do}$ 
  for  $j; 0 \leq j < n; j++ \text{ do}$ 
     $\text{output\_value} = 0 ;$ 
    for  $count; 0 \leq count < n; count++ \text{ do}$ 
       $\text{output\_value} += \text{matrix}_1[i][count] \times \text{matrix}_2[j][count];$ 
    end
    save  $\text{Matrix}_{\text{out}}[i, j] = \text{output\_value}$  to  $M_D;$ 
  end
end

```

---

Fig. 17 shows the clarification regarding the variables  $i, j, count$ .

#### 4. Saving to memory

Output matrix will be saved in the Data memory ( $M_D$ ) starting from  $2 \times K$  memory location as shown in Fig. 16. Here  $K = (\text{mem}_{\text{size}} - 2) // 3$ .

Figure 18 depicts the process of memory access in the proposed solution.



Figure 14: Storing  $\text{Matrix}_1$  to the data memory

Figure 15: Storing  $Matrix_2$  to the data memoryFigure 16: Storing  $Matrix_{out}$  to data the memory

Figure 17: Variables  $i, j, count$  in Algorithm 1Figure 18: Identifying allocated computations of core with *core\_id*

### 3.2 Assembly Code

The Assembly code which is stored in IM for the proposed solution is as follows;

1. SETN
2.  $\alpha_n$
3. SETC
4.  $\alpha_c$

5. RNGI
6. ENDI
7. STRTI
8. RSTJ
9. SETTP1
10. SETTP2
11. SETTP3
12. LDACTP1
13. MVACTR
14. LDACTP2
15. MUL
16. MVTR2
17. ADD
18. MVACTR2
19. MVCOUNT
20. LDN
21. SUB
22. JNPZ
23. 11
24. LDTR2
25. STAC
26. MVJ
27. LDN
28. SUB
29. JNPZ
30. 8
31. MVIE
32. MVI

33. SUB

34. JNPZ

35. 7

36. END

## 4 Verilog Implementation

In this section, verilog implementation details, simulation details and final RTL design is discussed.

### 4.1 Simulation using Modelsim/Questasim

Simulation is really vital when testing the verilog implementation of the design. It is quite fast and easy compared to fully fabricating the design in an actual FPGA board. In our team we have used both a customized version of Modelsim [6] for Intel boards and Questasim 10.6 [7] by Siemens. When configuring a new project in Quartus software we can add the simulator choice accordingly.

Each module is tested thoroughly using separate test benches. This approach enabled to detect bugs within the module and to move to higher hierarchical designs later. Then full core is tested from another test bench. Multi core implementation of 10 cores is finally tested and evaluated which is included in the results section.

Waveform feature in the Questasim was really helpful to debug the errors. Using correct wires and registers to the waveform allowed to detect memory errors, timing errors, etc. Also \$display command is used to print important values on the terminal in the software for debugging. To measure runtime \$realtime command is used. Also in Questasim once the verilog code of modules are compiled, any further change done in a test bench doesn't require you to compile the whole design again. Simply updating, recompiling and simulating the edited test bench is sufficient. This reduced testing time in a huge amount.



Figure 19: Simulation in Questasim

## 4.2 Quartus Implementation

Our processor is developed using Verilog HDL language and used Intel Quartus [8] Prime software to fabricate the design. The Verilog files can be imported to the Quartus project easily. Also Analysis and Synthesis feature can be used to compile the code and view the final RTL diagram. Quartus software allows to easily integrate Questasim simulation software which enables to open it directly from Quartus. Default testbenches can be added and configured too. Furthermore errors and warnings are shown after the Analysis and Synthesis is completed. We can go to the error directly from double clicking on the error shown on the terminal. In Quartus selecting the top module correctly is vital for the final RTL design view. Figure 20 shows a screenshot of the Quartus prime lite software with the compiled final design.



Figure 20: Using Quartus prime lite software

For Verilog programming we have used Visual Studio Code as it has intellisense feature and code auto-completion. Furthermore we used version control systems such as GitHub to collaboratively manage work among the team as we all contributed through online. Figure 21 shows a snapshot of VSCode.

Furthermore Altera Cyclone IV EP4CE115F23C7 is selected as the chip for the final design fabrication. Final task was to use multi core design to multiply two matrices together and display the results. For this we have used text file based method to write the initial matrices to the memory and to retrieve the final results from the memory. This is achieved through

the test bench by accessing the main memory using a separate Mux. Our algorithm explained above allows to utilise specified number of cores even though we have fabricated 8-10 cores in the analysis stage.

```

top_tb.v -- FPGA-Multicore-Matrix-Multiplication-Sem5
EXPLORER    top_tb.v M  top.v  _details.txt M
OPEN EDITORS top_tb.v
FPGA-MULTICORE-MATRIX-MU...
simulation
control-unit.pdf
FPGA-Multicore-Matri...
FPGA-Multicore-Matri...
FPGA-Multicore-Matri...
rtl-alu.pdf
rtl-single-core.pdf
rtl-top.pdf
src
.gitignore
alu_tb.v
buffer_tb.v
clock_corrector_tb.v
controller_tb.v
data_memory_multi_p...
instruction_memory_tb.v
logic_block_tb.v
mapping_block_tb.v
microcode_tb.v
mux1_control_unit_tb.v
ops_decoder_tb.v
README.md
reg_ac_tb.v
reg_pc_tb.v
reg_pc_tb.v.bak
reg_rst_inc_tb.v
reg_rst_load_tb.v
top_tb.v
PROBLEMS  OUTPUT  DEBUG CONSOLE  TERMINAL
[5 9 8 9 6 5]
[5 7 0 7 4 9]
[8 3 9 6 3 7]
[0 4 4 5 6 5]
[[2 7 8 8 2 0]
[7 3 9 6 9 5]
[1 2 6 5 9 0]
[2 0 4 3 2 1]
[4 7 4 6 4 8]
[0 9 4 7 2 8]]
Calculated Answer:
[[ 19  60  74  79  67  36]
[ 94 108 158 140 154  34]
[123 165 249 232 215  94]
[ 89 165 183 198 121 114]
[ 78 167 289 212 162  77]
[ 66 107 124 130 116  65]]
Correct Answer:
[[ 19  60  74  79  67  36]
[ 123 165 249 232 215  94]
[ 89 165 183 198 121 114]
[ 78 167 289 212 162  77]
[ 66 107 124 130 116  65]]
Ln 108, Col 1  Spaces: 4  UTF-8  CRLF  Verilog  ⌂

```

Figure 21: Using VSCode for coding

#### 4.2.1 Design Flow Summary

Table 9 shows the design flow summary of 10 core design with a memory of 1000 length and a width of 2 bytes.

Table 9: Design Flow Summary

| Flow Status                        | Successful - Wed Jul 07 08:57:46 2021       |
|------------------------------------|---------------------------------------------|
| Quartus Prime Version              | 18.1.0 Build 625 09/12/2018 SJ Lite Edition |
| Revision Name                      | FPGA-Multicore-Matrix-Multiplication-Sem5   |
| Top-level Entity Name              | top                                         |
| Family                             | Cyclone IV E                                |
| Device                             | EP4CE115F23C7                               |
| Timing Models                      | Final                                       |
| Total logic elements               | 235,764                                     |
| Total pins                         | 71                                          |
| Total virtual pins                 | 0                                           |
| Total memory bits                  | 0                                           |
| Embedded Multiplier 9-bit elements | 20                                          |
| Total PLLs                         | 0                                           |

#### 4.2.2 Resource Usage Summary

Table 10 shows the resource usage summary of 10 core design with a memory of 1000 length and a width of 2 bytes.

Table 10: Resource Usage Summary

| Resource                                    | Usage     |
|---------------------------------------------|-----------|
| Total Registers                             | 18810     |
| - I/O Registers                             | 0         |
| - Dedicated logic registers                 | 18810     |
| Total fan-out                               | 982304    |
| Total combinational functions               | 234794    |
| Maximum fan-out node                        | clk~input |
| Maximum fan-out                             | 18750     |
| Logic elements by node                      |           |
| - normal node                               | 230444    |
| - arithmetic node                           | 4350      |
| Logic element usage by number of LUT inputs |           |
| - 3 input functions                         | 6736      |
| - 4 input functions                         | 224371    |
| - <=2 input functions                       | 3687      |
| I/O Pins                                    | 71        |
| Estimated Total logic elements              | 235,764   |
| Embedded Multiplier 9-bit elements          | 20        |
| Average fan-out                             | 3.87      |

#### 4.2.3 Resource Utilization by Entity

Table 11 shows the Resource Utilization by Entity of 10 core design with a memory of 1000 length and a width of 2 bytes.

Table 11: Resource Utilization by Entity

| Compilation Hierarchy Node    | Combinational ALUTs | Dedicated Logic Registers |
|-------------------------------|---------------------|---------------------------|
| top                           | 234794 (0)          | 18810 (0)                 |
| core:processor1               | 1407 (5)            | 281 (0)                   |
| alu:ALU                       | 632 (63)            | 0 (0)                     |
| lpm_divide:Div0               | 270 (0)             | 0 (0)                     |
| lpm_divide_lkm:auto_generated | 270 (0)             | 0 (0)                     |
| sign_div_unsign_dnh:divider   | 270 (0)             | 0 (0)                     |
| alt_u_div_eaf:divider         | 270 (269)           | 0 (0)                     |

Continued on next page

Table 11 : Continued from previous page

| <b>Compilation Hierarchy Node</b>      | <b>Combinational ALUTs</b> | <b>Dedicated Logic Registers</b> |
|----------------------------------------|----------------------------|----------------------------------|
| add_sub_8pc:add_sub_1                  | 1 (1)                      | 0 (0)                            |
| lpm_divide:Mod0                        | 299 (0)                    | 0 (0)                            |
| lpm_divide_ocm:auto_generated          | 299 (0)                    | 0 (0)                            |
| sign_div_unsign_dnh:divider            | 299 (0)                    | 0 (0)                            |
| alt_u_div_eaf:divider                  | 299 (298)                  | 0 (0)                            |
| add_sub_8pc:add_sub_1                  | 1 (1)                      | 0 (0)                            |
| lpm_mult:Mult0                         | 0 (0)                      | 0 (0)                            |
| mult_7dt:auto_generated                | 0 (0)                      | 0 (0)                            |
| buffer:DR_buf                          | 1 (1)                      | 0 (0)                            |
| buffer:ID_buf                          | 120 (120)                  | 0 (0)                            |
| buffer:MEMBUSI_buf                     | 67 (67)                    | 0 (0)                            |
| buffer:TR_buf                          | 11 (11)                    | 0 (0)                            |
| clock_corrector_module:clock_corrector | 1 (1)                      | 1 (1)                            |
| controller:controller                  | 186 (0)                    | 7 (0)                            |
| cu_reg_rst_load:register               | 7 (7)                      | 7 (7)                            |
| logic_block:logic_box                  | 1 (1)                      | 0 (0)                            |
| mapping_block:mapping_func             | 71 (71)                    | 0 (0)                            |
| microcode:micromemory                  | 81 (81)                    | 0 (0)                            |
| mux1_control_unit:mux1                 | 25 (25)                    | 0 (0)                            |
| mux2_control_unit:mux2                 | 1 (1)                      | 0 (0)                            |
| ops_decoder:decoder                    | 19 (19)                    | 0 (0)                            |
| reg_ac:AC                              | 158 (158)                  | 16 (16)                          |
| reg_pc:PC                              | 48 (48)                    | 16 (16)                          |
| reg_rst_inc:COUNT                      | 16 (16)                    | 16 (16)                          |
| reg_rst_inc:END                        | 2 (2)                      | 1 (1)                            |
| reg_rst_inc:J                          | 16 (16)                    | 16 (16)                          |
| reg_rst_load:AR                        | 18 (18)                    | 16 (16)                          |
| reg_rst_load:C                         | 1 (1)                      | 16 (16)                          |
| reg_rst_load:DR                        | 1 (1)                      | 16 (16)                          |
| reg_rst_load:IC                        | 1 (1)                      | 16 (16)                          |
| reg_rst_load:IE                        | 1 (1)                      | 16 (16)                          |
| reg_rst_load:IR                        | 17 (17)                    | 16 (16)                          |
| reg_rst_load:N                         | 1 (1)                      | 16 (16)                          |
| reg_rst_load:TP3                       | 16 (16)                    | 16 (16)                          |
| reg_rst_load:TR2                       | 17 (17)                    | 16 (16)                          |
| reg_rst_load:TR                        | 4 (4)                      | 16 (16)                          |
| reg_rst_load_inc:I                     | 16 (16)                    | 16 (16)                          |
| reg_rst_load_inc:TP1                   | 16 (16)                    | 16 (16)                          |
| reg_rst_load_inc:TP2                   | 16 (16)                    | 16 (16)                          |
| core:processor2                        | 1438 (5)                   | 281 (0)                          |

Continued on next page

Table 11 : Continued from previous page

| <b>Compilation Hierarchy Node</b>   | <b>Combinational ALUTs</b> | <b>Dedicated Logic Registers</b> |
|-------------------------------------|----------------------------|----------------------------------|
| core:processor3                     | 1438 (5)                   | 281 (0)                          |
| core:processor4                     | 1436 (5)                   | 281 (0)                          |
| core:processor5                     | 1435 (5)                   | 281 (0)                          |
| core:processor6                     | 1438 (5)                   | 281 (0)                          |
| core:processor7                     | 1432 (5)                   | 281 (0)                          |
| core:processor8                     | 1433 (5)                   | 281 (0)                          |
| core:processor9                     | 1430 (5)                   | 281 (0)                          |
| core:processor10                    | 1431 (5)                   | 281 (0)                          |
| data_memory_multi_port:DM_4         | 220413 (220413)            | 16000 (16000)                    |
| dmem_mux:data_mem_mux               | 21 (21)                    | 0 (0)                            |
| instruction_memory:IM               | 26 (26)                    | 0 (0)                            |
| tb_mux_mem_data:mem_data_select_mux | 16 (16)                    | 0 (0)                            |

#### 4.2.4 RTL Design

RTL viewer in Intel Quartus Prime software allows to visualize our Verilog modules and their interconnections. This comes in handy to find miss connections and wrong connections between modules. Below are some RTL views of the core components of our design.

## 10 Core Top level design



Figure 22: 10 Core top level design

### Single Core RTL design



Figure 23: Single Core RTL design

## Control Unit design



Figure 24: Control Unit design

## ALU design



Figure 25: ALU design

## 5 Performance Evaluation

### 5.1 Detailed Comparison

To evaluate the performance of the designed multi-core processor we have selected random initialized matrices with sizes 4X4, 8X8 and 16X16 and analysed the time taken for matrix multiplication when number of cores are increased from 1 to 10.

The results are listed below.

1. For N = 4,

Some screenshots taken during the testing stage.

```
PS D:\FPGA\Project\FPGA based Multicore Matrix Multiplier\io_txt_files> python save_matrix2.py --n 4 --c 1
written to data_to_mem.txt
PS D:\FPGA\Project\FPGA based Multicore Matrix Multiplier\io_txt_files> python check_matrix.py --n 4
matrix 1:
[[ 1 10  2  7]
 [10  4  7  6]
 [ 7  4 19 14]
 [ 2  4 12 13]]
matrix 2:
[[ 5  0 10 13]
 [ 2 11  0 19]
 [14  7 11  5]
 [14 12  7 16]]

Calculated Answer:
[[151 208  81 325]
 [240 165 219 337]
 [505 345 377 486]
 [368 284 243 370]]
Correct Answer:
[[151 208  81 325]
 [240 165 219 337]
 [505 345 377 486]
 [368 284 243 370]]

Is answer correct : True
```

Figure 26: Matrix results for 4X4 matrix with 1 core

```
#      4
#      1
# Reading data - completed!
# Time for multiplication (us):          4
# File Opened - results_from_mem.txt
# Writing Results - completed!
# ** Note: $stop  : D:/FPGA/Project/FPGA based Multicore Matrix Multiplier/top_tb.v(127)
#   Time: 191705 ns  Iteration: 0  Instance: /top_tb
# Break in Module top_tb at D:/FPGA/Project/FPGA based Multicore Matrix Multiplier/top_tb.v line 127
```

Figure 27: Calculation time for 4X4 matrix with 1 core

```

PS D:\FPGA\Project\FPGA based Multicore Matrix Multiplier\io_txt_files> python save_matrix2.py --n 4 --c 4
written to data_to_mem.txt
PS D:\FPGA\Project\FPGA based Multicore Matrix Multiplier\io_txt_files> python check_matrix.py --n 4
matrix 1:
[[15 10 18 12]
 [16 3 8 4]
 [18 13 6 7]
 [18 0 14 6]]
matrix 2:
[[16 18 6 6]
 [2 7 13 0]
 [17 2 0 14]
 [8 4 13 16]]

Calculated Answer:
[[662 424 376 534]
 [430 341 187 272]
 [472 455 368 304]
 [574 376 186 400]]
Correct Answer:
[[662 424 376 534]
 [430 341 187 272]
 [472 455 368 304]
 [574 376 186 400]]

Is answer correct : True

```

Figure 28: Matrix results for 4X4 matrix with 4 cores

```

#      4
#      4
# Reading data - completed!
# Time for multiplication (us):          1
# File Opened - results_from_mem.txt
# Writing Results - completed!
# ** Note: $stop  : D:/FPGA/Project/FPGA based Multicore Matrix Multiplier/top_tb.v(127)
#   Time: 160715 ns Iteration: 0 Instance: /top_tb
# Break in Module top_tb at D:/FPGA/Project/FPGA based Multicore Matrix Multiplier/top_tb.v line 127

```

Figure 29: Calculation time for 4X4 matrix with 4 cores

Table 12: Results observed for 4x4 matrices

| N | No. of Cores | Time(us) |
|---|--------------|----------|
| 4 | 1            | 4        |
|   | 2            | 2        |
|   | 3            | 2        |
|   | 4            | 1        |
|   | 5            | 1        |
|   | 6            | 1        |
|   | 7            | 1        |
|   | 8            | 1        |
|   | 9            | 1        |
|   | 10           | 1        |

2. For N = 8,

Some screenshots taken during the testing stage.

```
PS D:\FPGA\Project\FPGA based Multicore Matrix Multiplier\io_txt_files> python save_matrix2.py --n 8 --c 1
written to data_to_mem.txt
PS D:\FPGA\Project\FPGA based Multicore Matrix Multiplier\io_txt_files> python check_matrix.py --n 8
matrix 1:
[[14  0  8  19 10 13 15  0]
 [ 7  0 15 12 19 17  0  2]
 [15  2 15  7 11  9  7 13]
 [15  3  7 13  8 18 17  2]
 [12  3  1 12 12 17 18  6]
 [13  1 10 16 14  5 15  9]
 [ 9  7  2 13 18 10 11 14]
 [ 1 12  2 11  8  3 18  0]]
matrix 2:
[[ 9  8  8 16 13  7  8 15]
 [14 18 12 19  7  4 17 10]
 [ 2 14 16 10 12  1 11  0]
 [10 17 16 17  1 11 18 18]
 [ 9  6  2  9 13  4  9 14]
 [11 12  1 17  2 17 10 15]
 [13  3 15 14 19  2  0  3]
 [ 6 16 11 16  6  6 13  8]]
Calculated Answer:
[[ 760  808  802 1148  738  606  762  932]
 [ 583  820  565  958  576  573  804  858]
 [ 630  888  775 1105  768  494  803  785]
 [ 824  840  787 1236  788  651  760  938]
 [ 837  794  717 1208  769  638  730  951]
 [ 741  867  889 1164  843  506  812  881]
 [ 812  932  761 1228  750  584  891  986]
 [ 630  577  649  826  584  297  534  544]]
Correct Answer:
[[ 760  808  802 1148  738  606  762  932]
 [ 583  820  565  958  576  573  804  858]
 [ 630  888  775 1105  768  494  803  785]
 [ 824  840  787 1236  788  651  760  938]
 [ 837  794  717 1208  769  638  730  951]
 [ 741  867  889 1164  843  506  812  881]
 [ 812  932  761 1228  750  584  891  986]
 [ 630  577  649  826  584  297  534  544]]
Is answer correct : True
```

Figure 30: Matrix results for 8X8 matrix with 1 core

```
#      8
#      1
# Reading data - completed!
# Time for multiplication (us):          29
# File Opened - results_from_mem.txt
# Writing Results - completed!
# ** Note: $stop    : D:/FPGA/Project/FPGA based Multicore Matrix Multiplier/top_tb.v(127)
#           Time: 441985 ns Iteration: 0 Instance: /top_tb
# Break in Module top_tb at D:/FPGA/Project/FPGA based Multicore Matrix Multiplier/top_tb.v line 127
```

Figure 31: Calculation time for 8X8 matrix with 1 core

```

PS D:\FPGA\Project\FPGA based Multicore Matrix Multiplier\io_txt_files> python save_matrix2.py --n 8 --c 10
written to data_to_mem.txt
PS D:\FPGA\Project\FPGA based Multicore Matrix Multiplier\io_txt_files> python check_matrix.py --n 8
matrix 1:
[[12 10 18 4 12 0 0 13]
 [ 5 19 7 1 15 3 14 6]
 [ 3 19 13 5 14 9 13 7]
 [12 11 2 12 19 12 10 1]
 [ 9 2 0 6 0 16 4 5]
 [ 7 15 13 2 19 0 8 12]
 [19 6 17 9 17 13 10 0]
 [ 5 4 14 6 1 14 9 2]]
matrix 2:
[[19 6 2 8 6 9 1 10]
 [ 2 17 10 1 5 5 18 1]
 [ 7 2 5 0 12 11 1 8]
 [ 1 14 11 17 13 16 8 9]
 [15 8 17 12 1 10 9 4]
 [ 2 11 3 3 11 14 18 16]
 [13 16 19 12 9 7 15 1]
 [15 17 11 8 0 7 13 5]]
Calculated Answer:
[[753 651 605 422 402 631 519 423]
 [686 860 842 481 396 565 839 286]
 [693 975 905 535 564 751 972 446]
 [730 892 836 703 548 812 858 538]
 [340 497 283 312 354 474 506 431]
 [825 835 858 525 390 643 753 351]
 [912 815 800 674 715 954 753 699]
 [397 550 438 324 532 598 561 467]]
Correct Answer:
[[753 651 605 422 402 631 519 423]
 [686 860 842 481 396 565 839 286]
 [693 975 905 535 564 751 972 446]
 [730 892 836 703 548 812 858 538]
 [340 497 283 312 354 474 506 431]
 [825 835 858 525 390 643 753 351]
 [912 815 800 674 715 954 753 699]
 [397 550 438 324 532 598 561 467]]
Is answer correct : True

```

Figure 32: Matrix results for 8X8 matrix with 10 cores

```

#      8
#      10
# Reading data - completed!
# Time for multiplication (us):          4
# File Opened - results_from_mem.txt
# Writing Results - completed!
# ** Note: $stop    : D:/FPGA/Project/FPGA based Multicore Matrix Multiplier/top_tb.v(127)
#   Time: 186835 ns  Iteration: 0  Instance: /top_tb
# Break in Module top_tb at D:/FPGA/Project/FPGA based Multicore Matrix Multiplier/top_tb.v line 127

```

Figure 33: Calculation time for 8X8 matrix with 10 cores

Table 13: Results observed for 8x8 matrices

| <b>N</b> | <b>No. of Cores</b> | <b>Time(us)</b> |
|----------|---------------------|-----------------|
| 8        | 1                   | 29              |
|          | 2                   | 15              |
|          | 3                   | 11              |
|          | 4                   | 7               |
|          | 5                   | 7               |
|          | 6                   | 7               |
|          | 7                   | 7               |
|          | 8                   | 4               |
|          | 9                   | 4               |
|          | 10                  | 4               |

3. For  $N = 16$ ,

Table 14: Results observed for 16x16 matrices

| <b>N</b> | <b>No. of Cores</b> | <b>Time(us)</b> |
|----------|---------------------|-----------------|
| 16       | 1                   | 219             |
|          | 2                   | 109             |
|          | 3                   | 82              |
|          | 4                   | 55              |
|          | 5                   | 55              |
|          | 6                   | 41              |
|          | 7                   | 41              |
|          | 8                   | 27              |
|          | 9                   | 27              |
|          | 10                  | 27              |

## 5.2 Summary of results

Fig. 34 shows the quantitative comparison of the performance results for different number of cores. 3 diagrams shows the performance results for different matrix sizes.



Figure 34: Performance comparison for different number of cores and different matrix sizes

## 5.3 Conclusion

To this end, we have implemented and tested 10-core processor for paralleled matrix multiplication with 16-bit valued data. Because of the independent processing of each core, the capability of adding more cores is straight forward.

With the results shown in Fig. 34, it can be seen that the computation time for matrix multiplication is decreasing when increasing the number of cores for all matrix sizes. The implemented processor is capable of doing;

- (4x4) 16-bit valued matrix multiplication within around **1 us** utilizing 4 cores.
- (8x8) 16-bit valued matrix multiplication within around **4 us** utilizing 8 cores.
- (16x16) 16-bit valued matrix multiplication within around **27 us** utilizing 10 cores.

Even though initial reduction of computation time is significant, number of cores will no longer be affected to the computation time when the number of cores passes the size of matrices.

## 6 Acknowledgement

A special thanks goes to our lecturer, Dr. Jayathu Samarawickrama for not only providing us with all the necessary theoretical knowledge regarding processor design but also for advising us on the potential issues, trade-offs and solutions when designing an Instruction Set Architecture.

## References

- [1] “Cpu: Central processing unit | ap csp (article).”
- [2] “Matrix multiplication,” May 2021.
- [3] “What is an fpga? programming and fpga basics - intel® fpgas.”
- [4] “What is an fpga? field programmable gate array.”
- [5] J. D. Carpinelli, *Computer systems organization & architecture*. Addison-Wesley, 2001.
- [6] “Modelsim software.” <https://www.intel.com/content/www/us/en/software/programmable/quartus-prime/model-sim.html>.
- [7] “Questasim software.” <https://eda.sw.siemens.com/en-US/ic/questa/simulation/advanced-simulator/>.
- [8] “Intel quartus prime software.” <https://www.intel.com/content/www/us/en/software/programmable/quartus-prime/overview.html>.

# Appendix

The full implementation can be found in [GitHub repository](#).

| IR | Map Instructions | Micro-instructions | Instruction Memory |
|----|------------------|--------------------|--------------------|
|    | FETCH            |                    |                    |
|    | ARPC             |                    |                    |
|    | DRMI PCINC       |                    |                    |
|    | INDR ARPC        |                    |                    |
| 0  | 3 SETN           | DRMI PCINC         |                    |
|    |                  | SETN1              |                    |
|    |                  | SETN2              |                    |
|    |                  | SETN3              |                    |
|    |                  | SETN4              |                    |
|    |                  | SETN5              |                    |
|    |                  | SETN6              |                    |
|    |                  | SETN7              |                    |
| 1  | 7 SETC           | DRMI PCINC         |                    |
|    |                  | SETC1              |                    |
|    |                  | SETC2              |                    |
|    |                  | SETC3              |                    |
|    |                  | SETC4              |                    |
|    |                  | SETC5              |                    |
|    |                  | SETC6              |                    |
|    |                  | SETC7              |                    |
| 2  | 11 SETTP1        | ACN                |                    |
|    |                  | SETTP11            |                    |
|    |                  | SETTP12            |                    |
|    |                  | SETTP13            |                    |
|    |                  | SETTP14            |                    |
|    |                  | SETTP15            |                    |
| 3  | 15 SETTP2        | ACN                |                    |
|    |                  | SETTP21            |                    |
|    |                  | SETTP22            |                    |
|    |                  | SETTP23            |                    |
|    |                  | SETTP24            |                    |
|    |                  | SETTP25            |                    |
|    |                  | SETTP26            |                    |
| 4  | 20 SETTP3        | ACN                |                    |
|    |                  | SETTP27            |                    |
|    |                  | SETTP28            |                    |
|    |                  | SETTP29            |                    |
| 5  | 28 RNG1          | ACN                |                    |
|    |                  | RNG1               |                    |
|    |                  | RNG2               |                    |
|    |                  | RNG3               |                    |
|    |                  | RNG4               |                    |
|    |                  | RNG5               |                    |
|    |                  | RNG6               |                    |
|    |                  | RNG7               |                    |
|    |                  | RNG8               |                    |
| 6  | 37 SRT11         | ACN                |                    |
|    |                  | SRT11              |                    |
|    |                  | SRT12              |                    |
|    |                  | SRT13              |                    |
|    |                  | SRT14              |                    |
| 7  | 41 END1          | ACID               |                    |
|    |                  | END1               |                    |
|    |                  | END2               |                    |
|    |                  | END3               |                    |
|    |                  | END4               |                    |

Figure 35: uOPs for microinstructions Part 1

Figure 36: uOPs for microinstructions Pert 2

## Modules

### Top Module

```
1 module top(
2     input  clk,
3     input  START,
4     input  RESET,
5
6     //For writing and reading data memory from external files
7     input [1:0] addr_mux_select,
8     input [15:0] ar_in,
9     input [15:0] current_addr,
10    input write_from_tb,
11    input [15:0] mem_data,
12
13    output END,
14    output [15:0] dmem_out_disp
15 );
16
17 //Wires for core connections
18
19 // Intermediate wires for core 1
20 wire [15:0] dmem_in;
21 wire [15:0] imem_in;
22 wire [15:0] dmem_out;
23 wire [15:0] ar_out;
24
25 // Intermediate wires for core 2
26 wire [15:0] dmem_in2;
27 wire [15:0] dmem_out2;
28 wire [15:0] ar_out2;
29
30 // Intermediate wires for core 3
31 wire [15:0] dmem_in3;
32 wire [15:0] dmem_out3;
33 wire [15:0] ar_out3;
34
35 // Intermediate wires for core 4
36 wire [15:0] dmem_in4;
37 wire [15:0] dmem_out4;
38 wire [15:0] ar_out4;
39
40 // Intermediate wires for core 5
41 wire [15:0] dmem_in5;
42 wire [15:0] dmem_out5;
43 wire [15:0] ar_out5;
44
45 // Intermediate wires for core 6
46 wire [15:0] dmem_in6;
47 wire [15:0] dmem_out6;
48 wire [15:0] ar_out6;
49
50
```

```

51 // Intermediate wires for core 7
52 wire [15:0] dmem_in7;
53 wire [15:0] dmem_out7;
54 wire [15:0] ar_out7;
55
56 // Intermediate wires for core 8
57 wire [15:0] dmem_in8;
58 wire [15:0] dmem_out8;
59 wire [15:0] ar_out8;
60
61 // Intermediate wires for core 9
62 wire [15:0] dmem_in9;
63 wire [15:0] dmem_out9;
64 wire [15:0] ar_out9;
65
66 // Intermediate wires for core 10
67 wire [15:0] dmem_in10;
68 wire [15:0] dmem_out10;
69 wire [15:0] ar_out10;
70
71
72 //Control signals
73 wire read_MI_F;
74 wire read_MD, read_MI, write_MD;      //core 1
75 wire read_MD2, read_MI2, write_MD2;  //core 2
76 wire read_MD3, read_MI3, write_MD3;  //core 3
77 wire read_MD4, read_MI4, write_MD4;  //core 4
78 wire read_MD5, read_MI5, write_MD5;  //core 5
79 wire read_MD6, read_MI6, write_MD6;  //core 6
80 wire read_MD7, read_MI7, write_MD7;  //core 7
81 wire read_MD8, read_MI8, write_MD8;  //core 8
82 wire read_MD9, read_MI9, write_MD9;  //core 9
83 wire read_MD10, read_MI10, write_MD10; //core 10
84
85 //Used to write data in data memory to an external file
86 assign dmem_out_disp = dmem_in;
87
88 wire [15:0] d_mem_addr;
89 assign memory_in_addr = d_mem_addr;
90
91 // [Instruction Memory]
92 instruction_memory IM(
93     .read(read_MI_F),
94     .address(ar_out),
95     .instruction_out(imem_in)
96 );
97
98 // To select addresses coming from cores and test bench for writing and
99 // reading from external file
100 dmem_mux data_mem_mux(
101     .in_core(ar_out),
102     .in_d_write(current_addr),
103     .in_d_read(ar_in),
104     .addr_mux_select(addr_mux_select),
105     .mux_out(d_mem_addr)

```

```

105 );
106
107 // To select write enable signals coming from cores and test bench for
108 // writing and reading from external files
109 wire write_to_data_mem;
110
111 tb_mux data_mem_write_mux(
112     .in_1(write_from_tb),
113     .in_2(write_MD),
114     .mux_out(write_to_data_mem)
115 );
116
117 // To select data coming from cores and test bench for writing and reading
118 // from external files
119 wire [15:0] data_for_mem;
120
121 tb_mux_mem_data mem_data_select_mux(
122     .in_1(mem_data),
123     .in_2(dmem_out),
124     .mem_data_select(write_from_tb),
125     .mux_out(data_for_mem)
126 );
127
128 //Allowing all cores to access the instructions
129 assign read_MI_F = read_MI || read_MI2 || read_MI3 || read_MI4 || read_MI5
130     || read_MI6 || read_MI7 || read_MI8 || read_MI9 || read_MI10 ;
131
132 //Initiating Cores
133 // [CORE 1]
134 core #(.core_id(0)) processor1 (
135     .clk(clk),
136     .START(START),
137     .RESET(RESET),
138     .read_MI(read_MI),
139     .imem_in(imem_in),
140     .dmem_in(dmem_in),
141     .dmem_out(dmem_out),
142     .ar_out(ar_out),
143     .read_MD(read_MD),
144     .write_MD(write_MD),
145     .end_i(END)
146 );
147
148 // [CORE 2]
149 core #(.core_id(1)) processor2 (
150     .clk(clk),
151     .START(START),
152     .RESET(RESET),
153     .read_MI(read_MI2),
154     .imem_in(imem_in),
155     .dmem_in(dmem_in2),
156     .dmem_out(dmem_out2),
157     .write_MD(write_MD2),
158     .ar_out(ar_out2)
159 );

```

```
157
158 // [CORE 3]
159 core #(.core_id(2)) processor3 (
160     .clk(clk),
161     .START(START),
162     .RESET(RESET),
163     .read_MI(read_MI3),
164     .imem_in(imem_in),
165     .dmem_in(dmem_in3),
166     .dmem_out(dmem_out3),
167     .write_MD(write_MD3),
168     .ar_out(ar_out3)
169 );
170
171 // [CORE 4]
172 core #(.core_id(3)) processor4 (
173     .clk(clk),
174     .START(START),
175     .RESET(RESET),
176     .read_MI(read_MI4),
177     .imem_in(imem_in),
178     .dmem_in(dmem_in4),
179     .dmem_out(dmem_out4),
180     .write_MD(write_MD4),
181     .ar_out(ar_out4)
182 );
183
184 // [CORE 5]
185 core #(.core_id(4)) processor5 (
186     .clk(clk),
187     .START(START),
188     .RESET(RESET),
189     .read_MI(read_MI5),
190     .imem_in(imem_in),
191     .dmem_in(dmem_in5),
192     .dmem_out(dmem_out5),
193     .write_MD(write_MD5),
194     .ar_out(ar_out5)
195 );
196
197 // [CORE 6]
198 core #(.core_id(5)) processor6 (
199     .clk(clk),
200     .START(START),
201     .RESET(RESET),
202     .read_MI(read_MI6),
203     .imem_in(imem_in),
204     .dmem_in(dmem_in6),
205     .dmem_out(dmem_out6),
206     .write_MD(write_MD6),
207     .ar_out(ar_out6)
208 );
209
210 // [CORE 7]
211 core #(.core_id(6)) processor7 (
```

```

212     .clk(clk),
213     .START(START),
214     .RESET(RESET),
215     .read_MI(read_MI7),
216     .imem_in(imem_in),
217     .dmem_in(dmem_in7),
218     .dmem_out(dmem_out7),
219     .write_MD(write_MD7),
220     .ar_out(ar_out7)
221 );
222
223 // [CORE 8]
224 core #(.core_id(7)) processor8 (
225     .clk(clk),
226     .START(START),
227     .RESET(RESET),
228     .read_MI(read_MI8),
229     .imem_in(imem_in),
230     .dmem_in(dmem_in8),
231     .dmem_out(dmem_out8),
232     .write_MD(write_MD8),
233     .ar_out(ar_out8)
234 );
235
236
237 // [CORE 9]
238 core #(.core_id(8)) processor9 (
239     .clk(clk),
240     .START(START),
241     .RESET(RESET),
242     .read_MI(read_MI9),
243     .imem_in(imem_in),
244     .dmem_in(dmem_in9),
245     .dmem_out(dmem_out9),
246     .write_MD(write_MD9),
247     .ar_out(ar_out9)
248 );
249
250 // [CORE 10]
251 core #(.core_id(9)) processor10 (
252     .clk(clk),
253     .START(START),
254     .RESET(RESET),
255     .read_MI(read_MI10),
256     .imem_in(imem_in),
257     .dmem_in(dmem_in10),
258     .dmem_out(dmem_out10),
259     .write_MD(write_MD10),
260     .ar_out(ar_out10)
261 );
262
263
264 // Initiating Multi Port Memory
265 data_memory_multi_port DM_4(
266     .clk(clk),

```

```

267     .we_1(write_to_data_mem),
268     .addr_1(d_mem_addr),
269     .data_in_1(data_for_mem),
270     .data_out_1(dmem_in),
271     .we_2(write_MD2),
272     .addr_2(ar_out2),
273     .data_in_2(dmem_out2),
274     .data_out_2(dmem_in2),
275     .we_3(write_MD3),
276     .addr_3(ar_out3),
277     .data_in_3(dmem_out3),
278     .data_out_3(dmem_in3),
279     .we_4(write_MD4),
280     .addr_4(ar_out4),
281     .data_in_4(dmem_out4),
282     .data_out_4(dmem_in4),
283     .we_5(write_MD5),
284     .addr_5(ar_out5),
285     .data_in_5(dmem_out5),
286     .data_out_5(dmem_in5),
287     .we_6(write_MD6),
288     .addr_6(ar_out6),
289     .data_in_6(dmem_out6),
290     .data_out_6(dmem_in6),
291     .we_7(write_MD7),
292     .addr_7(ar_out7),
293     .data_in_7(dmem_out7),
294     .data_out_7(dmem_in7),
295     .we_8(write_MD8),
296     .addr_8(ar_out8),
297     .data_in_8(dmem_out8),
298     .data_out_8(dmem_in8),
299     .we_9(write_MD9),
300     .addr_9(ar_out9),
301     .data_in_9(dmem_out9),
302     .data_out_9(dmem_in9),
303     .we_10(write_MD10),
304     .addr_10(ar_out10),
305     .data_in_10(dmem_out10),
306     .data_out_10(dmem_in10)
307 );
308
309 endmodule

```

Listing 1: Top Module (Corresponding testbench is top\_tb.v)

## Core Module

```

1 module core #(parameter core_id = 0) (
2     input clk, //Corrected clock
3     input START,
4     input RESET,
5
6     //Outputs related with external memories
7     input [15:0] dmem_in, imem_in,

```

```

8      output [15:0] dmem_out ,
9      output [15:0] ar_out ,
10
11 //output control_signals
12      output read_MD , read_MI , write_MD ,
13      output end_i
14
15 );
16
17 wire corrected_clk;
18 clock_corrector_module clock_corrector(
19     .start(START),
20     .clk(clk),
21     .end_signal(end_i),
22     .corrected_clk(corrected_clk)
23 );
24
25
26 //Bus
27 wire [15:0] bus;
28
29 wire [15:0] ac_out;
30 wire [15:0] alu_out;
31
32 //Intermediate Control signals
33 wire DREAD , IREAD , DWRITE , BUSMEM , MEMBUSD , MEMBUSI , TRBUS , DRBUS , PCBUS ,
   NBUS , CBUS , JBUS ;
34 wire IDBUS , COUNTBUS , TP1BUS , TP2BUS , TP3BUS , ICBUS , IBUS , IEBUS , TR2BUS ,
   ACBUS , LDAR ;
35 wire LDTR , LDDR , LDPC , LDN , LDC , LDIR , LDTP1 , LDTP2 , LDTP3 , LDIC , LDI , LDIE ,
   LDTR2 ;
36 wire LDAC , RSTAR , RSTTR , RSTDRA , RSTPC , RSTN , RSTC , RSTIR , RSTJ , RSTID ,
   RSTCOUNT , RSTTP1 ;
37 wire RSTTP2 , RSTTP3 , RSTIC , RSTI , RSTIE , RSTTR2 , RSTAC , RSTEND , INCPC , INCJ ,
   INCCOUNT ;
38 wire INCTP1 , INCTP2 , INCI , INCAC , INCEND , ALU0 , ALU1 , ALU2 , ADDKAC ;
39
40 //Memory read write outputs from the Core
41 assign read_MD = DREAD ;
42 assign read_MI = IREAD ;
43 assign write_MD = DWRITE ;
44
45 //Buffers to access data and instruction memory
46 buffer MEMBUSD_buf(.data_in(dmem_in) , .select(MEMBUSD) , .data_out(bus));
47 buffer BUSMEMD_buf(.data_in(bus) , .select(BUSMEM) , .data_out(dmem_out));
48 buffer MEMBUSI_buf(.data_in(imem_in) , .select(MEMBUSI) , .data_out(bus));
49
50 //Initialise registers and buffers
51
52 reg_rst_load AR(.clk(clk) , .reset(RSTAR) , .load_enable(LDAR) , .data_in(bus) ,
   .data_out(ar_out));
53
54 wire [15:0] tr_out;
55 reg_rst_load TR(.clk(clk) , .reset(RSTTR) , .load_enable(LDTR) , .data_in(bus) ,
   .data_out(tr_out));

```

```

56 buffer TR_buf(.data_in(tr_out), .select(TRBUS), .data_out(bus));
57
58 wire [15:0] dr_out;
59 reg_rst_load DR(.clk(clk), .reset(RSTDR), .load_enable(LDDR), .data_in(bus),
60 .data_out(dr_out));
61
62 buffer DR_buf(.data_in(dr_out), .select(DRBUS), .data_out(bus));
63
64 wire [15:0] pc_out;
65 reg_pc PC(.clk(clk), .reset(RSTPC), .load_enable(LDPC), .inc(INCPC), .
66 data_in(dr_out), .data_out(pc_out));
67 buffer PC_buf(.data_in(pc_out), .select(PCBUS), .data_out(bus));
68
69 wire [15:0] n_out;
70 reg_rst_load N(.clk(clk), .reset(RSTN), .load_enable(LDN), .data_in(dr_out),
71 .data_out(n_out));
72 buffer N_buf(.data_in(n_out), .select(NBUS), .data_out(bus));
73
74 wire [15:0] c_out;
75 reg_rst_load C(.clk(clk), .reset(RSTC), .load_enable(LDC), .data_in(dr_out),
76 .data_out(c_out));
77 buffer C_buf(.data_in(c_out), .select(CBUS), .data_out(bus));
78
79 wire [15:0] ir_out;
80 reg_rst_load IR(.clk(clk), .reset(RSTIR), .load_enable(LDIR), .data_in(
81 dr_out), .data_out(ir_out));
82
83 wire [15:0] j_out;
84 reg_rst_inc J(.clk(clk), .reset(RSTJ), .inc(INCJ), .data_out(j_out));
85 buffer J_buf(.data_in(j_out), .select(JBUS), .data_out(bus));
86
87 //Core ID register and buffer
88 wire [15:0] id_out;
89 reg_rst_load ID(.clk(clk), .reset(RSTID), .load_enable(1), .data_in(core_id),
90 .data_out(id_out));
91 buffer ID_buf(.data_in(id_out), .select(IDBUS), .data_out(bus));
92
93 wire [15:0] count_out;
94 reg_rst_inc COUNT(.clk(clk), .reset(RSTCOUNT), .inc(INCCOUNT), .data_out(
95 count_out));
96 buffer COUNT_buf(.data_in(count_out), .select(COUNTBUS), .data_out(bus));
97
98 wire [15:0] tp1_out;
99 reg_rst_load_inc TP1(.clk(clk), .reset(RSTTP1), .load_enable(LDTP1), .inc(
100 INCTP1), .data_in(ac_out), .data_out(tp1_out));
101 buffer TP1_buf(.data_in(tp1_out), .select(TP1BUS), .data_out(bus));
102
103 wire [15:0] tp2_out;
104 reg_rst_load_inc TP2(.clk(clk), .reset(RSTTP2), .load_enable(LDTP2), .inc(
105 INCTP2), .data_in(ac_out), .data_out(tp2_out));
106 buffer TP2_buf(.data_in(tp2_out), .select(TP2BUS), .data_out(bus));
107
108 wire [15:0] tp3_out;
109 reg_rst_load TP3(.clk(clk), .reset(RSTTP3), .load_enable(LDTP3), .data_in(
110 ac_out), .data_out(tp3_out));
111 buffer TP3_buf(.data_in(tp3_out), .select(TP3BUS), .data_out(bus));

```

```

101
102 wire [15:0] ic_out;
103 reg_rst_load IC(.clk(clk), .reset(RSTIC), .load_enable(LDIC), .data_in(
104   ac_out), .data_out(ic_out));
105 buffer IC_buf(.data_in(ic_out), .select(ICBUS), .data_out(bus));
106
107 wire [15:0] i_out;
108 reg_rst_load_inc I(.clk(clk), .reset(RSTI), .load_enable(LDI), .inc(INCI), .
109   data_in(ac_out), .data_out(i_out));
110 buffer I_buf(.data_in(i_out), .select(IBUS), .data_out(bus));
111
112 wire [15:0] ie_out;
113 reg_rst_load IE(.clk(clk), .reset(RSTIE), .load_enable(LDIE), .data_in(
114   ac_out), .data_out(ie_out));
115 buffer IE_buf(.data_in(ie_out), .select(IEBUS), .data_out(bus));
116
117 wire [15:0] tr2_out;
118 reg_rst_load TR2(.clk(clk), .reset(RSTTR2), .load_enable(LDTR2), .data_in(
119   ac_out), .data_out(tr2_out));
120 buffer TR2_buf(.data_in(tr2_out), .select(TR2BUS), .data_out(bus));
121
122 reg_rst_inc END(.clk(clk), .reset(RSTEND), .inc(INCEND), .data_out(end_i));
123
124 //ALU
125 wire [2:0] operation;
126 assign operation[0] = ALU0;
127 assign operation[1] = ALU1;
128 assign operation[2] = ALU2;
129
130 alu ALU(.in_bus(bus), .in_AC(ac_out), .operation(operation), .data_out(
131   alu_out));
132
133 //Z
134 wire z;
135 assign z= ~(ac_out[0]||ac_out[1]||ac_out[2]||ac_out[3]||ac_out[4]||ac_out
136   [5]||ac_out[6]||ac_out[7]||ac_out[8]||ac_out[9]||ac_out[10]||ac_out[11]||
137   ac_out[12]||ac_out[13]||ac_out[14]||ac_out[15]);
138
139 //Control Unit
140 wire [48:0] OPs;
141 controller controller(.clk(corrected_clk), .IR(ir_out), .z(z), .OPs(OPs));
142
143 //OPs decoder used to obtain control signals
144 ops_decoder decoder(.uOPs(OPs), .START(START), .RESET(RESET), .DREAD(DREAD),
145   .IREAD(IREAD), .DWRITE(DWRITE), .BUSMEM(BUSMEM), .MEMBUSD(MEMBUSD), .
146   MEMBUSI(MEMBUSI),
147   .TRBUS(TRBUS), .DRBUS(DRBUS), .PCBUS(PCBUS), .NBUS(NBUS), .CBUS(CBUS), .
148   JBUS(JBUS), .IDBUS(IDBUS), .COUNTBUS(COUNTBUS), .TP1BUS(TP1BUS), .TP2BUS
149   (TP2BUS),
150   .TP3BUS(TP3BUS), .ICBUS(ICBUS), .IBUS(IBUS), .IEBUS(IEBUS), .TR2BUS(TR2BUS),
151   .ACBUS(ACBUS), .LDAR(LDAR), .LDTR(LDTR), .LDDR(LDDR), .LDPC(LDPC), .LDN(

```

```

143 LDN), .LDC(LDC),
143 .LDIR(LDIR), .LDTP1(LDTP1), .LDTP2(LDTP2), .LDTP3(LDTP3), .LDIC(LDIC), .LDI(
143 LDI), .LDIE(LDIE), .LDTR2(LDTR2), .LDAC(LDAC), .RSTAR(RSTAR), .RSTTR(
143 RSTTR), .RSTDRA(RSTDRA),
144 .RSTPC(RSTPC), .RSTN(RSTN), .RSTC(RSTC), .RSTIR(RSTIR), .RSTJ(RSTJ), .RSTID(
144 RSTID), .RSTCOUNT(RSTCOUNT), .RSTTP1(RSTTP1), .RSTTP2(RSTTP2), .RSTTP3(
144 RSTTP3), .RSTIC(RSTIC),
145 .RSTI(RSTI), .RSTIE(RSTIE), .RSTTR2(RSTTR2), .RSTAC(RSTAC), .RSTEND(RSTEND),
145 .INCPC(INCPC), .INCJ(INCJ), .INCCOUNT(INCCOUNT), .INCTP1(INCTP1), .
145 INCTP2(INCTP2), .INCI(INCI),
146 .INCAC(INCAC), .INCEND(INCEND), .ALU0(ALU0), .ALU1(ALU1), .ALU2(ALU2), .
146 ADDKAC(ADDKAC)
147 );
148
149
150 endmodule

```

Listing 2: Core Module

## Multi Port Memory Module

```

1 module data_memory_multi_port
2 (
3     input [15:0] data_in_1, data_in_2, data_in_3, data_in_4, data_in_5,
3     data_in_6, data_in_7, data_in_8, data_in_9, data_in_10,
4     input [15:0] addr_1, addr_2, addr_3, addr_4, addr_5, addr_6, addr_7,
4     addr_8, addr_9, addr_10,
5     input we_1, we_2, we_3, we_4, we_5, we_6, we_7, we_8, we_9, we_10, clk,
6     output [15:0] data_out_1, data_out_2, data_out_3, data_out_4, data_out_5,
6     data_out_6, data_out_7, data_out_8, data_out_9, data_out_10
7 );
8
9 // Declare the RAM variable
10 reg [15:0] ram[0:999];
11
12 assign data_out_1 = ram[addr_1];
13 assign data_out_2 = ram[addr_2];
14 assign data_out_3 = ram[addr_3];
15 assign data_out_4 = ram[addr_4];
16 assign data_out_5 = ram[addr_5];
17 assign data_out_6 = ram[addr_6];
18 assign data_out_7 = ram[addr_7];
19 assign data_out_8 = ram[addr_8];
20 assign data_out_9 = ram[addr_9];
21 assign data_out_10 = ram[addr_10];
22
23
24 always @ (posedge clk)
25 begin
26     if (we_1) ram[addr_1] <= data_in_1;
27     if (we_2) ram[addr_2] <= data_in_2;
28     if (we_3) ram[addr_3] <= data_in_3;
29     if (we_4) ram[addr_4] <= data_in_4;
30     if (we_5) ram[addr_5] <= data_in_5;
31     if (we_6) ram[addr_6] <= data_in_6;

```

```

32     if (we_7) ram[addr_7] <= data_in_7;
33     if (we_8) ram[addr_8] <= data_in_8;
34     if (we_9) ram[addr_9] <= data_in_9;
35     if (we_10) ram[addr_10] <= data_in_10;
36
37   end
38
39 endmodule

```

Listing 3: Multi Port Data Memory (Corresponding testbench is data\_memory\_multi\_port\_tb.v)

## Instruction Memory Module

```

1 module instruction_memory(
2   input  read,
3   input [15:0] address,
4   output [15:0] instruction_out
5 );
6
7 //Initiating the instruction memory
8 reg [15:0] inst_memory [0:99];
9
10//Assembly code
11initial
12begin
13   inst_memory [0] = 0;           //SETN
14   inst_memory [1] = 998 ;        //Address N
15   inst_memory [2] = 1 ;          //SETC
16   inst_memory [3] = 999;         //Address C
17   inst_memory [4] = 5 ;          //RNGI
18   inst_memory [5] = 7 ;          //ENDI
19   inst_memory [6] = 6 ;          //STRTI
20   inst_memory [7] = 24 ;         //RSTJ
21   inst_memory [8] = 2 ;          //SETTP1
22   inst_memory [9] = 3 ;          //SETTP2
23   inst_memory [10] = 4 ;         //SETTP3
24   inst_memory [11] = 8 ;         //LDACTP1
25   inst_memory [12] = 12;         //MVACTR
26   inst_memory [13] = 9 ;         //LDACTP2
27   inst_memory [14] = 17 ;        //MUL
28   inst_memory [15] = 26 ;        //MVTR2
29   inst_memory [16] = 18 ;        //ADD
30   inst_memory [17] = 27 ;        //MVACTR2
31   inst_memory [18] = 13 ;        //MVCOUNT
32   inst_memory [19] = 11 ;        //LDN
33   inst_memory [20] = 19 ;        //SUB
34   inst_memory [21] = 21;         //JNPZ
35   inst_memory [22] = 11 ;        // 11
36   inst_memory [23] = 28 ;        //LDTR2
37   inst_memory [24] = 20 ;        //STAC
38   inst_memory [25] = 14 ;        //MVJ
39   inst_memory [26] = 11 ;        //LDN
40   inst_memory [27] = 19 ;        //SUB
41   inst_memory [28] = 21 ;        //JNPZ

```

```

42     inst_memory[29] = 8 ;      //8
43     inst_memory[30] = 16 ;     //MVIE
44     inst_memory[31] = 15 ;     //MVI
45     inst_memory[32] = 19 ;     //SUB
46     inst_memory[33] = 21 ;     //JNPZ
47     inst_memory[34] = 7 ;      //7
48     inst_memory[35] = 25 ;     //END
49   end
50
51 assign instruction_out = inst_memory[address];
52
53
54 endmodule

```

Listing 4: Instruction Memory Module (Corresponding testbench is instruction\_memory\_tb.v)

**Multiplexer module used for selecting data for data memory from cores and an external file**

```

1 module tb_mux_mem_data (
2   input [15:0] in_1,
3   input [15:0] in_2,
4   input mem_data_select ,
5
6   output [15:0] mux_out
7 );
8
9 assign mux_out = mem_data_select ? in_1 : in_2;
10
11 endmodule

```

Listing 5: Multiplexer Module used for selecting data for data memory

**Multiplexer module used for selecting the write enable signals coming for data memory from cores and an external file**

```

1 module tb_mux (
2   input in_1 ,
3   input in_2 ,
4
5   output mux_out
6 );
7
8 assign mux_out = in_1 ? in_1 : in_2;
9
10 endmodule

```

Listing 6: Multiplexer Module used for selecting write enables signal for data memory

**Multiplexer module used for selecting reading /writing addresses for data memory from cores and an external file**

```
1 module dmem_mux (
```

```

2   input [15:0] in_core,
3   input [15:0] in_d_write,
4   input [15:0] in_d_read,
5   input [1:0] addr_mux_select,
6   output reg [15:0] mux_out
7 );
8
9 always @(in_core or in_d_write or in_d_read or addr_mux_select)
10 begin
11   case(addr_mux_select)
12     0: mux_out <= in_core;
13     1: mux_out <= in_d_write;
14     2: mux_out <= in_d_read;
15     default: mux_out <= in_core;
16
17   endcase
18 end
19
20 endmodule

```

Listing 7: Multiplexer Module used for selecting reading/writing addresses for data memory

## Control Unit Module

```

1 module controller(
2   input clk,
3   input [15:0] IR,
4   input z,
5   output [48:0] OPs
6 );
7
8
9 //Map function
10 wire [15:0] map_addr;
11
12 mapping_block mapping_func(
13   .IR(IR),
14   .map_addr(map_addr)
15 );
16
17 wire [6:0] jump_addr;
18 wire [15:0] inc_addr;
19 wire [15:0] mux_out;
20 wire [15:0] reg_out;
21 wire [1:0] addr_select;
22
23 assign inc_addr = reg_out+1;
24
25 mux1_control_unit mux1(
26   .map_addr(map_addr),
27   .jump_addr(jump_addr),
28   .inc_addr(inc_addr),
29   .reg_in(mux_out),
30   .select(addr_select)
31 );

```

```

32
33 cu_reg_rst_load register(
34   .clk(clk),
35   .data_in(mux_out),
36   .data_out(reg_out),
37   .load_enable(1),
38   .reset(0)
39 );
40
41 wire [1:0] condition;
42 wire BT;
43
44 microcode micromemory(
45   .reg_out(reg_out),
46   .condition(condition),
47   .BT(BT), .OPs(OPs),
48   .jump_addr(jump_addr)
49 );
50
51 wire [15:0] logic_in;
52
53 mux2_control_unit mux2(
54   .z(z),
55   .select(condition),
56   .logic_in(logic_in)
57 );
58
59 logic_block logic_box(
60   .BT_in(BT),
61   .mux2_out(logic_in),
62   .signal_out(addr_select)
63 );
64
65
66 endmodule
67
68
69
70
71
72
73

```

Listing 8: Control Unit Module (Corresponding testbench is controller\_tb.v)

## Clock Corrector Module

```

1 module clock_corrector_module(
2   input start,
3   input clk,
4   input end_signal,
5   output corrected_clk
6 );
7
8 reg p = 0;

```

```

9
10 always @(posedge clk)
11   begin
12     if (start)
13       p <= 1;
14     end
15
16 assign corrected_clk = (p && !end_signal) ? clk : 0;
17
18 endmodule

```

Listing 9: Clock Corrector Module (Corresponding testbench is clock\_corrector\_tb.v)

## Arithmetic and Logic Unit Module

```

1 module alu(
2   input [15:0] in_bus,
3   input [15:0] in_AC,
4   input [2:0] operation,
5   output reg [15:0] data_out
6 );
7
8 //Defining symbols to identify logical opeartors
9 localparam pass = 0;
10 localparam mul = 1;
11 localparam add = 2;
12 localparam sub = 3;
13 localparam int_div = 4;
14 localparam is_mod = 5;
15
16
17 always @(in_bus or in_AC or operation)
18   begin
19     case(operation)
20       pass: data_out = in_bus;
21       mul: data_out = in_AC * in_bus;
22       add: data_out = in_AC + in_bus;
23       sub: data_out = in_AC - in_bus;
24       int_div: data_out= (in_AC - in_AC % in_bus)/in_bus;
25       is_mod: data_out= ((in_AC % in_bus) != 0);
26
27       default : data_out = in_bus;
28     endcase
29   end
30 endmodule

```

Listing 10: Arithmetic and Logic Unit Module (Corresponding testbench is alu\_tb.v)

## Ops Decoder Module

```

1 module ops_decoder(
2   input wire [48:0] uOPs,
3   input START, RESET,
4   //Control signals

```

```

5   output wire DREAD, IREAD, DWRITE, BUSMEM, MEMBUSD, MEMBUSI, TRBUS, DRBUS,
6     PCBUS, NBUS, CBUS, JBUS,
7   output wire IDBUS, COUNTBUS, TP1BUS, TP2BUS, TP3BUS, ICBUS, IBUS, IEBUS,
8     TR2BUS, ACBUS, LDAR,
9   output wire LDTR, LDDR, LDPC, LDN, LDC, LDIR, LDTP1, LDTP2, LDTP3, LDIC,
10    LDI, LDIE, LDTR2,
11  output wire LDAC, RSTAR, RSTTR, RSTDRA, RSTPC, RSTN, RSTC, RSTIR, RSTJ,
12    RSTID, RSTCOUNT, RSTTP1,
13  output wire RSTTP2, RSTTP3, RSTIC, RSTI, RSTIE, RSTTR2, RSTAC, RSTEND,
14    INCPC, INCJ, INCCOUNT,
15  output wire INCTP1, INCTP2, INCI, INCAC, INCEND, ALU0, ALU1, ALU2, ADDKAC
16 );
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
assign DREAD = uOPs[0];
assign IREAD = uOPs[1];
assign DWRITE = uOPs[10];
assign BUSMEM = uOPs[10];
assign MEMBUSD = uOPs[0];
assign MEMBUSI = uOPs[1];
assign TRBUS = uOPs[45] || uOPs[44] || uOPs[34] || uOPs[9] || uOPs[3];
assign DRBUS = uOPs[39] || uOPs[27] || uOPs[10];
assign PCBUS = uOPs[38];
assign NBUS = uOPs[37];
assign CBUS = uOPs[18];
assign JBUS = uOPs[21];
assign IDBUS = uOPs[33];
assign COUNTBUS = uOPs[12];
assign TP1BUS = uOPs[30];
assign TP2BUS = uOPs[29];
assign TP3BUS = uOPs[28];
assign ICBUS = uOPs[19];
assign IBUS = uOPs[17];
assign IEBUS = uOPs[36];
assign TR2BUS = uOPs[35] || uOPs[20];
assign ACBUS = uOPs[22] || uOPs[2];
assign LDAR = uOPs[39] || uOPs[38] || uOPs[30] || uOPs[29] || uOPs[28];
assign LDTR = uOPs[22] || uOPs[21] || uOPs[20] || uOPs[19] || uOPs[18] || uOPs[17] ||
  uOPs[12];
assign LDDR = uOPs[2] || uOPs[1] || uOPs[0];
assign LDPC = uOPs[25];
assign LDN = uOPs[7];
assign LDC = uOPs[4];
assign LDIR = uOPs[47];
assign LDTP1 = uOPs[11];
assign LDTP2 = uOPs[24];
assign LDTP3 = uOPs[14];
assign LDIC = uOPs[48];
assign LDI = uOPs[41];
assign LDIE = uOPs[43];
assign LDTR2 = uOPs[13];
assign LDAC = uOPs[45] || uOPs[44] || uOPs[37] || uOPs[36] || uOPs[35] || uOPs[34] ||
  uOPs[33] || uOPs[27] || uOPs[9] || uOPs[3];
assign RSTAR = RESET;
assign RSTTR = RESET;

```

```

53 assign RSTDR = RESET;
54 assign RSTPC = RESET;
55 assign RSTN = RESET;
56 assign RSTC = RESET;
57 assign RSTIR = RESET;
58 assign RSTJ = uOPs[46] || RESET;
59 assign RSTID = 0; // Check thi
60 assign RSTCOUNT = uOPs[6] || RESET;
61 assign RSTTP1 = RESET;
62 assign RSTTP2 = RESET;
63 assign RSTTP3 = RESET;
64 assign RSTIC = RESET;
65 assign RSTI = RESET;
66 assign RSTIE = RESET;
67 assign RSTTR2 = uOPs[16] || RESET;
68 assign RSTAC = RESET;
69 assign RSTEND = START || RESET;
70 assign INCPC = uOPs[26];
71 assign INCJ = uOPs[8];
72 assign INCCOUNT = uOPs[5];
73 assign INCTP1 = uOPs[15];
74 assign INCTP2 = uOPs[23];
75 assign INCI = uOPs[42];
76 assign INCAC = uOPs[31];
77 assign INCEND = uOPs[40];
78 assign ALU0 = uOPs[9] || uOPs[3] || uOPs[45];
79 assign ALU1 = uOPs[34] || uOPs[3];
80 assign ALU2 = uOPs[44] || uOPs[45];
81 assign ADDKAC = uOPs[32];
82
83
84 endmodule

```

Listing 11: OPs Decoder Module (Corresponding testbench is ops\_decoder\_tb.v)

## Buffer Module

```

1 module buffer(
2   input [15:0] data_in,
3   input select,
4   output [15:0] data_out
5 );
6
7 assign data_out = select ? data_in : 16'bz;
8
9 endmodule

```

Listing 12: Buffer Module (Corresponding testbench is buffer\_tb.v)

## Register Module with Reset and Load

```

1 module reg_rst_load(
2   input clk,
3   input reset,

```

```

4     input  load_enable ,
5     input  [15:0]  data_in ,
6
7     output reg [15:0]  data_out
8 );
9
10    always @ (posedge clk)
11        begin
12            if (load_enable)
13                data_out <= data_in;
14            if (reset)
15                data_out <= 16'b0;
16        end
17
18 endmodule

```

Listing 13: Register Module with Reset and Load (Corresponding testbench is reg\_rst\_load\_tb.v)

### Register Module with Reset and Increment

```

1 module reg_rst_inc(
2     input clk,
3     input inc,
4     input reset,
5
6     output reg [15:0] data_out
7 );
8
9 always @ (posedge clk)
10    begin
11        if (inc)
12            data_out <= data_out + 16'd1;
13        if (reset)
14            data_out <= 16'b0;
15    end
16
17 endmodule

```

Listing 14: Register Module with Reset and Increment (Corresponding testbench is reg\_rst\_inc\_tb.v)

### Register module with reset, load and increment

```

1 module reg_rst_load_inc(
2     input clk,
3     input inc,
4     input load_enable ,
5     input reset ,
6     input [15:0] data_in ,
7
8     output reg [15:0] data_out
9 );
10

```

```

11 always @(posedge clk)
12   begin
13     if (load_enable)
14       data_out <= data_in;
15     else if (inc)
16       data_out <= data_out + 16'd1;
17     if (reset)
18       data_out <= 16'b0;
19   end
20
21 endmodule

```

Listing 15: Register module with reset

## AC register module

```

1 module reg_ac(
2   input clk,
3   input inc,
4   input inck,
5   input load_enable,
6   input reset,
7
8   input [15:0] data_in,
9   output reg [15:0] data_out
10 );
11
12 localparam k=332;
13
14 always @(posedge clk)
15   begin
16     if (load_enable)
17       data_out <= data_in;
18     if (inc)
19       data_out <= data_out + 16'd1;
20     if (inck)
21       data_out <= data_out + k;
22     if (reset)
23       data_out <= 16'b0;
24   end
25
26 endmodule

```

Listing 16: AC register (Corresponding testbench is reg\_ac\_tb.v)

## PC register module

```

1 module reg_pc(
2   input clk,
3   input inc,
4   input load_enable,
5   input reset,
6
7   input [15:0] data_in,

```

```

8     output reg [15:0] data_out
9 );
10
11 always @(posedge clk)
12 begin
13     if (load_enable)
14         data_out <= data_in;
15     if (inc)
16         data_out <= data_out + 16'd1;
17     if (reset)
18         data_out <= 16'b0;
19 end
20
21 initial begin
22     data_out = 0;
23 end
24
25 endmodule

```

Listing 17: PC register (Corresponding testbench is reg\_pc\_tb.v)

### Module that maps instructions with addresses

```

1 module mapping_block(
2     input [15:0] IR,
3     output reg [15:0] map_addr
4 );
5
6 //Instructions
7 localparam SETN = 0;
8 localparam SETC = 1;
9 localparam SETTP1 = 2;
10 localparam SETTP2 = 3;
11 localparam SETTP3 = 4;
12 localparam RNGI = 5;
13 localparam STRTI = 6;
14 localparam ENDI = 7;
15 localparam LDACTP1 = 8;
16 localparam LDACTP2 = 9;
17 localparam LDACTP3 = 10;
18 localparam LDN = 11;
19 localparam LDTR2 = 28;
20 localparam MVACTR = 12;
21 localparam MVACTR2 = 27;
22 localparam MVCOUNT = 13;
23 localparam MVJ = 14;
24 localparam MVI = 15;
25 localparam MVIE = 16;
26 localparam MVTR2 = 26;
27 localparam STAC = 20;
28 localparam JNPZ = 21;
29 localparam JNPZY = 22;
30 localparam JNPZN = 23;
31 localparam RSTJ = 24;
32 localparam END = 25;

```

```

33 localparam MUL_ = 17;
34 localparam ADD_ = 18;
35 localparam SUB_ = 19;
36
37 //Mapping algorithm
38 always @(IR)
39 begin
40   case(IR)
41     SETN: map_addr <= 3;
42     SETC: map_addr <= 7;
43     SETTP1: map_addr <= 11;
44     SETTP2: map_addr <= 15;
45     SETTP3: map_addr <= 20;
46     RNGI: map_addr <= 28;
47     STRTI: map_addr <= 37;
48     ENDI: map_addr <= 41;
49     LDACTP1: map_addr <= 45;
50     LDACTP2: map_addr <= 48;
51     LDACTP3: map_addr <= 51;
52     LDN: map_addr <= 54;
53     LDTR2: map_addr <= 55;
54     MVACTR: map_addr <= 56;
55     MVACTR2: map_addr <= 57;
56     MVCOUNT: map_addr <= 58;
57     MVJ: map_addr <= 59;
58     MVI: map_addr <= 60;
59     MVIE: map_addr <= 61;
60     MVTR2: map_addr <= 62;
61     STAC: map_addr <= 63;
62     JNPZ: map_addr <= 66;
63     JNPZY: map_addr <= 67;
64     JNPZN: map_addr <= 69;
65     RSTJ: map_addr <= 70;
66     END: map_addr <= 71;
67     MUL_: map_addr <= 72;
68     ADD_: map_addr <= 73;
69     SUB_: map_addr <= 74;
70   endcase
71 end
72
73 endmodule

```

Listing 18: Module that maps instructions with addresses (Corresponding testbench is mapping\_block\_tb.v)

### Module that contains micro instructions

```

1 module microcode(
2   input [15:0] reg_out,
3   output reg [1:0] condition,
4   output reg BT,
5   output reg [48:0] OPs,
6   output reg [6:0] jump_addr
7 );
8

```

```

9 reg [58:0] ucode [0:74];
10
11 initial
12 begin
13   ucode[0] = {3'b000,7'd1,49'd274877906944};
14   ucode[1] = {3'b000,7'd2,49'd67108866};
15   ucode[2] = {3'b100,7'd0,49'd141012366262272};
16   ucode[3] = {3'b000,7'd4,49'd67108866};
17   ucode[4] = {3'b000,7'd5,49'd549755813888};
18   ucode[5] = {3'b000,7'd6,49'd1};
19   ucode[6] = {3'b000,7'd0,49'd128};
20   ucode[7] = {3'b000,7'd8,49'd67108866};
21   ucode[8] = {3'b000,7'd9,49'd549755813888};
22   ucode[9] = {3'b000,7'd10,49'd1};
23   ucode[10] = {3'b000,7'd0,49'd16};
24   ucode[11] = {3'b000,7'd12,49'd137438953472};
25   ucode[12] = {3'b000,7'd13,49'd131072};
26   ucode[13] = {3'b000,7'd14,49'd512};
27   ucode[14] = {3'b000,7'd0,49'd2048};
28   ucode[15] = {3'b000,7'd16,49'd137438953472};
29   ucode[16] = {3'b000,7'd17,49'd2097152};
30   ucode[17] = {3'b000,7'd18,49'd512};
31   ucode[18] = {3'b000,7'd19,49'd4294967296};
32   ucode[19] = {3'b000,7'd0,49'd16777216};
33   ucode[20] = {3'b000,7'd21,49'd137438953472};
34   ucode[21] = {3'b000,7'd22,49'd131072};
35   ucode[22] = {3'b000,7'd23,49'd512};
36   ucode[23] = {3'b000,7'd24,49'd2097152};
37   ucode[24] = {3'b000,7'd25,49'd17179869184};
38   ucode[25] = {3'b000,7'd26,49'd4294967296};
39   ucode[26] = {3'b000,7'd27,49'd4294967296};
40   ucode[27] = {3'b000,7'd0,49'd16448};
41   ucode[28] = {3'b000,7'd29,49'd137438953472};
42   ucode[29] = {3'b000,7'd30,49'd262144};
43   ucode[30] = {3'b000,7'd31,49'd17592186044416};
44   ucode[31] = {3'b000,7'd32,49'd8192};
45   ucode[32] = {3'b000,7'd33,49'd137438953472};
46   ucode[33] = {3'b000,7'd34,49'd35184372088832};
47   ucode[34] = {3'b000,7'd35,49'd1048576};
48   ucode[35] = {3'b000,7'd36,49'd17179869184};
49   ucode[36] = {3'b000,7'd0,49'd281474976776192};
50   ucode[37] = {3'b000,7'd38,49'd524288};
51   ucode[38] = {3'b000,7'd39,49'd8589934592};
52   ucode[39] = {3'b000,7'd40,49'd512};
53   ucode[40] = {3'b000,7'd0,49'd2199023255552};
54   ucode[41] = {3'b000,7'd42,49'd8589934592};
55   ucode[42] = {3'b000,7'd43,49'd2148007936};
56   ucode[43] = {3'b000,7'd44,49'd512};
57   ucode[44] = {3'b000,7'd0,49'd8796093022208};
58   ucode[45] = {3'b000,7'd46,49'd1073741856};
59   ucode[46] = {3'b000,7'd47,49'd1};
60   ucode[47] = {3'b000,7'd0,49'd134217728};
61   ucode[48] = {3'b000,7'd49,49'd536870912};
62   ucode[49] = {3'b000,7'd50,49'd1};
63   ucode[50] = {3'b000,7'd0,49'd134217728};

```

```

64     ucode[51] = {3'b000,7'd52,49'd268435456};
65     ucode[52] = {3'b000,7'd53,49'd1};
66     ucode[53] = {3'b000,7'd0,49'd134217728};
67     ucode[54] = {3'b000,7'd0,49'd137438953472};
68     ucode[55] = {3'b000,7'd0,49'd34359738368};
69     ucode[56] = {3'b000,7'd0,49'd4194304};
70     ucode[57] = {3'b000,7'd0,49'd8192};
71     ucode[58] = {3'b000,7'd0,49'd8425472};
72     ucode[59] = {3'b000,7'd0,49'd2097152};
73     ucode[60] = {3'b000,7'd0,49'd131072};
74     ucode[61] = {3'b000,7'd0,49'd4466765987840};
75     ucode[62] = {3'b000,7'd0,49'd1048576};
76     ucode[63] = {3'b000,7'd64,49'd268500992};
77     ucode[64] = {3'b000,7'd65,49'd4};
78     ucode[65] = {3'b000,7'd0,49'd1280};
79     ucode[66] = {3'b001,7'd69,49'd0};
80     ucode[67] = {3'b000,7'd68,49'd2};
81     ucode[68] = {3'b000,7'd0,49'd33554432};
82     ucode[69] = {3'b000,7'd0,49'd67108864};
83     ucode[70] = {3'b000,7'd0,49'd70368744177664};
84     ucode[71] = {3'b000,7'd0,49'd1099511627776};
85     ucode[72] = {3'b000,7'd0,49'd512};
86     ucode[73] = {3'b000,7'd0,49'd17179869184};
87     ucode[74] = {3'b000,7'd0,49'd8};

88
89 end
90
91 //Issuing first micro code at the start
92 initial {BT, condition, jump_addr, OPs} = ucode[0];
93
94 always @(reg_out)
95   {BT, condition, jump_addr, OPs}= ucode[reg_out];
96
97 endmodule

```

Listing 19: Module that contains micro instructions (Corresponding testbench is microcode\_tb.v)

## Register Module in Control Unit

```

1 module cu_reg_RST_load(
2   input clk,
3   input reset,
4   input load_enable,
5   input [15:0] data_in,
6   output reg [15:0] data_out
7 );
8
9 always @ (negedge clk)
10 begin
11   if (load_enable)
12     data_out <= data_in;
13   if (reset)
14     data_out <= 16'b0;
15 end

```

```

16
17 endmodule

```

Listing 20: Register Module in Control Unit

**Multiplexer module in control unit that select addresses accordingly**

```

1 module mux1_control_unit(
2   input [15:0] map_addr,
3   input [6:0] jump_addr,
4   input [15:0] inc_addr,
5   input [1:0] select,
6   output reg [15:0] reg_in
7 );
8
9 always @(map_addr or jump_addr or inc_addr or select)
10 begin
11   case(select)
12     0:reg_in    <=    inc_addr;
13     1:reg_in    <=    jump_addr;
14     2:reg_in    <=    map_addr;
15     default: reg_in <=    jump_addr;
16   endcase
17 end
18
19 endmodule

```

Listing 21: Multiplexer module in control unit that select addresses accordingly (Corresponding testbench is mux1\_control\_unit\_tb.v)

**Multiplexer module in control unit that select logic values accordingly**

```

1 module mux2_control_unit(
2   input z,
3   input [1:0] select,
4   output reg logic_in
5 );
6
7 always @(z or select)
8 begin
9   case(select)
10     0:logic_in=1;
11     1:logic_in=z;
12     2:logic_in=~z;
13     default:logic_in= 1;
14   endcase
15 end
16
17 endmodule

```

Listing 22: Multiplexer module in control unit that select logic values accordingly

**Logic module in control unit**

```

1 module logic_block(
2     input BT_in,
3     input mux2_out,
4     output [1:0] signal_out
5 );
6
7 assign signal_out[1] = BT_in;
8 assign signal_out[0]= ~BT_in && mux2_out;
9
10 endmodule

```

Listing 23: Logic module in control unit (Corresponding testbench is logic\_block\_tb.v)

## Test Benches

### Top module testbench

```

1 'timescale 1 ns/10 ps
2 module top_tb;
3
4     //For file reading and writing
5     integer data_out;
6     integer data_file;
7     integer scan_file;
8     integer i;
9
10
11    //Registers used for reading an external file
12    reg [15:0] captured_data;
13    'define NULL 0
14    reg [15:0] mem_data;
15    reg [15:0] current_addr = 16'd0;
16    reg [15:0] dmem_size = 16'd1000;
17    reg write_from_tb;
18
19    //Ports for Top module
20    reg clk;
21    reg START;
22    reg RESET;
23    wire END;
24    wire [15:0] DMEMBUS;
25
26    //Used to take data out from data memory
27    reg [15:0] ar_out_start_addr;
28    reg [15:0] end_addr = 16'd997;
29    reg [1:0] addr_mux_select;
30
31    //Initiating the clock
32    localparam CLK_PERIOD = 10;
33
34    initial begin
35        clk = 0;
36        forever #(CLK_PERIOD/2) clk <= ~clk;
37    end

```

```

38
39 //Initiating the TOP module
40 top top_module(
41     .clk(clk),
42     .START(START),
43     .RESET(RESET),
44     .END(END),
45     .addr_mux_select(addr_mux_select),
46     .ar_in(ar_out_start_addr),
47     .current_addr(current_addr),
48     .write_from_tb(write_from_tb),
49     .mem_data(mem_data),
50     .dmem_out_disp(DMEMBUS)
51 );
52
53 //Used to calculate time for matrix multiplication
54 realtime capture = 0.0;
55
56 initial begin
57
58     // Opening the file that contains matrix data
59     data_file = $fopen("../io_txt_files/data_to_mem.txt", "r");
60
61     if (data_file == 'NULL) begin
62         $display("data_file handle was NULL");
63         $finish;
64     end
65
66     //Commanding the data memory to use the address provided by
67     testbench
68     #10 addr_mux_select <= 1;
69
70     for (current_addr = 0; current_addr < dmem_size+1; current_addr =
71         current_addr+1) begin
72         @(posedge clk) begin
73
74             scan_file = $fscanf(data_file, "%d\n", captured_data);
75
76             if (!$feof(data_file)) begin
77                 #10 mem_data <= captured_data;
78                 #5 write_from_tb <= 1;
79                 #20 write_from_tb <= 0;
80                 $display(mem_data);
81             end
82         end
83
84     $fclose(data_file);
85
86     $display("Reading data - completed!");
87
88     //Commanding the data memory to use the addresses and data provided
89     by the internal testbenches
90     #10 write_from_tb = 0; addr_mux_select <= 0;

```

```

90
91
92     //Starting Core operations
93     #200
94     capture = $realtime;
95
96     //Issuing start signal to start the cores
97     START = 1; RESET = 0; addr_mux_select = 0;
98     #10 START = 0; RESET = 0; addr_mux_select = 0;
99
100    //Waiting untill all core operations are completed
101    while (END!=1) #10;
102
103    $display("Time for multiplication (us): %t", ($realtime - capture)
104    /1000000);
105
106    // Opening a file to write the results
107    data_out = $fopen("../io_txt_files/results_from_mem.txt","w");
108    $display("File Opened - results_from_mem.txt");
109
110    //Point the AR to the starting location of data memory
111    #10 ar_out_start_addr=16'd0;
112    #10 addr_mux_select <= 2;
113
114    //Writing data from data memory to file line by line
115    while( ar_out_start_addr < end_addr) begin
116        @(posedge clk);
117        #50 $fwrite(data_out,"%d\n",DMEMBUS);
118        #50 ar_out_start_addr <= ar_out_start_addr + 16'd1; // check AR
119        #10;
120    end
121
122    $display("Writing Results - completed!");
123
124    #20;
125    $fclose(data_out);
126
127    #20;
128    $stop;
129
130
131
132 endmodule

```

Listing 24: Top module testbench

## Data memory testbench

```

1 `timescale 1 ns/10 ps
2
3 module data_memory_multi_port_tb;
4
5     reg [15:0] data_in_1;

```

```

6   reg [15:0] data_in_2;
7   reg [15:0] data_in_3;
8   reg [15:0] data_in_4;
9   reg [15:0] addr_1;
10  reg [15:0] addr_2;
11  reg [15:0] addr_3;
12  reg [15:0] addr_4;
13  reg we_1;
14  reg we_2;
15  reg we_3;
16  reg we_4;
17  reg clk;
18  wire [15:0] data_out_1;
19  wire [15:0] data_out_2;
20  wire [15:0] data_out_3;
21  wire [15:0] data_out_4;
22
23  localparam CLK_PERIOD = 10;
24
25  initial begin
26    clk = 0;
27    forever #(CLK_PERIOD/2) clk <= ~clk;
28  end
29
30  data_memory_multi_port multiport_mem(
31    .data_in_1(data_in_1),
32    .data_in_2(data_in_2),
33    .data_in_3(data_in_3),
34    .data_in_4(data_in_4),
35    .addr_1(addr_1),
36    .addr_2(addr_2),
37    .addr_3(addr_3),
38    .addr_4(addr_4),
39    .we_1(we_1),
40    .we_2(we_2),
41    .we_3(we_3),
42    .we_4(we_4),
43    .clk(clk),
44    .data_out_1(data_out_1),
45    .data_out_2(data_out_2),
46    .data_out_3(data_out_3),
47    .data_out_4(data_out_4)
48  );
49
50  initial begin
51
52    #10 we_1 = 1; we_2 = 0; we_3 = 0; we_4 = 0; data_in_1= 1; addr_1 = 10;
53    data_in_2= 0; addr_2 = 0; data_in_3= 0; addr_3 = 0; data_in_4= 0; addr_4
54    = 0;
55
56    #10 we_1 = 1; we_2 = 1; we_3 = 1; we_4 = 1; data_in_1= 1; addr_1 = 10;
      data_in_2= 2; addr_2 = 20; data_in_3= 3; addr_3 = 30; data_in_4= 4;
      addr_4 = 40;
57
58    #10 we_1 = 0; we_2 = 0; we_3 = 0; we_4 = 0; data_in_1= 0; addr_1 = 20;

```

```

1  data_in_2= 2; addr_2 = 20; data_in_3= 3; addr_3 = 30; data_in_4= 4;
2  addr_4 = 40;
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43

```

Listing 25: Data memory testbench

## Instruction memory testbench

```

1  `timescale 1 ns/10 ps
2
3 module instruction_memory_tb;
4
5   reg read;
6   reg [15:0] address;
7   wire [15:0] instruction_out;
8
9   instruction_memory IM(.read(read), .address(address), .instruction_out(
10    instruction_out));
11
12   initial
13   begin
14     #10 read <= 1 ; address <= 0;
15     #10 read <= 1 ; address <= 1;
16     #10 read <= 1 ; address <= 2;
17     #10 read <= 1 ; address <= 3;
18     #10 read <= 1 ; address <= 4;
19     #10 read <= 1 ; address <= 5;
20     #10 read <= 1 ; address <= 6;
21     #10 read <= 1 ; address <= 7;
22     #10 read <= 1 ; address <= 8;
23     #10 read <= 1 ; address <= 9;
24     #10 read <= 1 ; address <= 10;
25     #10 read <= 1 ; address <= 11;
26     #10 read <= 1 ; address <= 12;
27     #10 read <= 1 ; address <= 13;
28     #10 read <= 1 ; address <= 14;
29     #10 read <= 1 ; address <= 15;
30     #10 read <= 1 ; address <= 16;
31     #10 read <= 1 ; address <= 17;
32     #10 read <= 1 ; address <= 18;
33     #10 read <= 1 ; address <= 19;
34     #10 read <= 1 ; address <= 20;
35     #10 read <= 1 ; address <= 21;
36     #10 read <= 1 ; address <= 22;
37     #10 read <= 1 ; address <= 23;
38     #10 read <= 1 ; address <= 24;
39     #10 read <= 1 ; address <= 25;
40     #10 read <= 1 ; address <= 26;
41     #10 read <= 1 ; address <= 27;
42     #10 read <= 1 ; address <= 28;
43     #10 read <= 1 ; address <= 29;
44     #10 read <= 1 ; address <= 30;

```

```

44      #10 read <= 1 ; address <= 31;
45      #10 read <= 1 ; address <= 32;
46      #10 read <= 1 ; address <= 33;
47      #10 read <= 1 ; address <= 34;
48      #10 read <= 1 ; address <= 35;
49
50    end
51
52 endmodule

```

Listing 26: Instruction memory testbench

## Buffer testbench

```

1 `timescale 1 ns/10 ps
2
3 module buffer_tb;
4
5   reg [15:0] data_in;
6   reg select;
7   wire [15:0] data_out;
8
9   buffer tri_state_buffer(
10     .data_in(data_in),
11     .select(select),
12     .data_out(data_out)
13 );
14
15 initial begin
16   #10 data_in = 0; select = 0;
17   #10 data_in = 1; select = 0;
18   #10 data_in = 0; select = 1;
19   #10 data_in = 1; select = 1;
20 end
21
22 endmodule

```

Listing 27: Buffer testbench

## OPs decoder testbench

```

1 `timescale 1 ns/10 ps
2
3 module ops_decoder_tb;
4
5   reg [48:0] OPs;
6   reg START, RESET;
7   wire DREAD, IREAD, DWRITE, BUSMEM, MEMBUSI;
8
9   ops_decoder decoder(
10     .uOPs(OPs),
11     .START(START),
12     .RESET(RESET),
13     .DREAD(DREAD),

```

```

14      .IREAD(IREAD),
15      .DWRITE(DWRITE),
16      .BUSMEM(BUSMEM),
17      .MEMBUSI(MEMBUSI)
18  );
19
20  initial
21    begin
22      #10 OPs <= 0;
23      #10 OPs <= 1;
24      #10 OPs <= 2;
25      #10 OPs <= 2678;
26    end
27
28 endmodule

```

Listing 28: OPs decoder testbench

### Clock corrector module testbench

```

1 `timescale 1 ns/10 ps
2
3 module clock_corrector_tb;
4
5   reg start;
6   reg clk;
7   wire corrected_clk;
8
9   clock_corrector_module clock_corrector(
10     .start(start),
11     .clk(clk),
12     .corrected_clk(corrected_clk)
13   );
14
15   localparam CLK_PERIOD = 10;
16
17   initial begin
18     clk = 0;
19     forever #(CLK_PERIOD/2) clk <= ~clk;
20   end
21
22   initial
23   begin
24     #20 start <= 1 ;
25   end
26
27 endmodule

```

Listing 29: Clock corrector module testbench

### Register module with reset and load testbench

```

1 `timescale 1 ns/10 ps
2

```

```

3 module reg_rst_load_tb;
4
5     reg clk;
6     reg reset;
7     reg load_enable;
8     reg [15:0] data_in;
9     wire [15:0] data_out;
10
11    reg_rst_load register(
12        .clk(clk),
13        .reset(reset),
14        .load_enable(load_enable),
15        .data_in(data_in),
16        .data_out(data_out)
17    );
18
19    localparam CLK_PERIOD = 10;
20
21    initial begin
22        clk = 0;
23        forever #(CLK_PERIOD/2) clk <= ~clk;
24    end
25
26    initial begin
27        #10 reset <= 0 ; load_enable <= 0; data_in <= 16'd2;
28
29        #10 reset <= 0 ; load_enable <= 1; data_in <= 16'd2;
30
31        #10 reset <= 0 ; load_enable <= 1; data_in <= 16'd5;
32
33        #10 reset <= 1 ; load_enable <= 1; data_in <= 16'd2;
34
35    end
36
37 endmodule

```

Listing 30: Register module with reset and load testbench

### Register module with reset and increment testbench

```

1 `timescale 1 ns/10 ps
2
3 module reg_rst_inc_tb;
4
5     reg clk;
6     reg reset;
7     reg inc;
8
9     wire [15:0] data_out;
10
11    reg_rst_inc register(
12        .clk(clk),
13        .reset(reset),
14        .inc(inc),
15        .data_out(data_out)

```

```

16 );
17
18 localparam CLK_PERIOD = 10;
19
20 initial begin
21     clk = 0;
22     forever #(CLK_PERIOD/2) clk <= ~clk;
23 end
24
25
26 initial begin
27     #10 reset <= 1 ; inc <= 0;
28
29     #10 reset <= 0 ; inc <= 1;
30
31     #10 reset <= 0 ; inc <= 1;
32
33 end
34
35 endmodule

```

Listing 31: Register module with reset and increment testbench

## AC Register testbench

```

1 `timescale 1 ns/10 ps
2
3 module reg_ac_tb;
4
5     reg clk;
6     reg inc;
7     reg inck;
8     reg load_enable;
9     reg reset;
10
11    reg [15:0] data_in;
12    wire [15:0] data_out;
13
14    reg_ac register(
15        .clk(clk),
16        .inc(inc),
17        .inck(inck),
18        .load_enable(load_enable),
19        .reset(reset),
20        .data_in(data_in),
21        .data_out(data_out)
22    );
23
24    localparam CLK_PERIOD = 10;
25
26    initial begin
27        clk = 0;
28        forever #(CLK_PERIOD/2) clk <= ~clk;
29    end
30

```

```

31
32     initial begin
33         #10 reset <= 0 ; load_enable <= 0; data_in <= 16'd2; inc <= 0; inck
34         <= 0;
35
36         #10 reset <= 0 ; load_enable <= 1; data_in <= 16'd2; inc <= 0; inck
37         <= 0;
38
39         #10 reset <= 0 ; load_enable <= 0; data_in <= 16'd2; inc <= 1; inck
40         <= 0;
41
42         #10 reset <= 1 ; load_enable <= 0; data_in <= 16'd2; inc <= 1; inck
43         <= 0;
44
45     end
46
47 endmodule

```

Listing 32: AC Register testbench

## PC Register testbench

```

1  `timescale 1 ns/10 ps
2
3 module reg_pc_tb;
4
5     reg clk;
6     reg inc;
7     reg load_enable;
8     reg reset;
9
10    reg [15:0] data_in;
11    wire [15:0] data_out;
12
13    reg_pc register(
14        .clk(clk),
15        .reset(reset),
16        .load_enable(load_enable),
17        .inc(inc),
18        .data_in(data_in),
19        .data_out(data_out)
20    );
21
22    localparam CLK_PERIOD = 10;
23
24    initial begin
25        clk = 0;
26        forever #(CLK_PERIOD/2) clk <= ~clk;
27    end
28
29    initial begin
30        #50 reset <= 0 ; inc <= 1; data_in <= 0; load_enable = 0;

```

```

31      #10 reset <= 0 ; inc <= 0; data_in <= 0; load_enable = 0;
32
33      #10 reset <= 1 ; inc <= 0; data_in <= 0; load_enable = 0;
34
35 end
36
37 endmodule

```

Listing 33: PC Register testbench

## Arithmetic and logic unit testbench

```

1 `timescale 1 ns/10 ps
2
3 module alu_tb;
4
5   reg [15:0] in_bus;
6   reg [15:0] in_AC;
7   reg [2:0] operation;
8   wire [15:0] data_out;
9
10  alu alu_test(
11    .in_bus(in_bus),
12    .in_AC(in_AC),
13    .operation(operation),
14    .data_out(data_out)
15  );
16
17 initial begin
18   #100 in_bus=15'd2 ; in_AC= 15'd5; operation=0 ;
19   #100 in_bus=15'd2 ; in_AC= 15'd5; operation=1 ;
20   #100 in_bus=15'd2 ; in_AC= 15'd5; operation=2 ;
21   #100 in_bus=15'd2 ; in_AC= 15'd5; operation=3 ;
22   #100 in_bus=15'd2 ; in_AC= 15'd5; operation=4 ;
23   #100 in_bus=15'd2 ; in_AC= 15'd5; operation=5 ;
24
25 end
26
27 endmodule

```

Listing 34: Arithmetic and logic unit testbench

## Control unit testbench

```

1 `timescale 1 ns/10 ps
2
3 module controller_tb;
4
5   reg clk;
6   reg [15:0] IR;
7   reg z;
8   wire [48:0] OPs;
9
10

```

```

11 localparam CLK_PERIOD = 10;
12
13 controller controller(.clk(clk), .IR(IR), .z(z), .OPs(OPs));
14
15 initial begin
16     clk = 0;
17     forever #(CLK_PERIOD/2) clk <= ~clk;
18 end
19
20
21 initial begin
22     #10 IR <= 0 ; z <= 0;
23     #10 IR <= 21 ; z <= 0;
24     #10 IR <= 21 ; z <= 0;
25     #10 IR <= 21 ; z <= 0;
26
27 end
28
29
30 endmodule

```

Listing 35: Control testbench

## Microcode testbench

```

1 `timescale 1 ns/10 ps
2
3 module microcode_tb;
4
5     reg [15:0] reg_out;
6     wire [1:0] condition;
7     wire BT;
8     wire [58:0] OPs;
9     wire [6:0] jump_addr;
10
11     microcode micromemory(
12         .reg_out(reg_out),
13         .condition(condition),
14         .BT(BT),
15         .OPs(OPs),
16         .jump_addr(jump_addr)
17     );
18
19     initial begin
20
21         #10 reg_out= 15'd0;
22         #10 reg_out= 15'd1;
23         #10 reg_out= 15'd2;
24         #10 reg_out= 15'd3;
25         #10 reg_out= 15'd4;
26         #10 reg_out= 15'd5;
27         #10 reg_out= 15'd6;
28
29     end
30

```

```
31 endmodule
```

Listing 36: Microcode testbench

## Multiplexer testbench

```
1 'timescale 1 ns/10 ps
2
3 module mux1_control_unit_tb;
4
5     reg [15:0] map_addr;
6     reg [6:0] jump_addr;
7     reg [15:0] inc_addr;
8     reg [1:0] select;
9     wire [15:0] reg_in;
10
11 mux1_control_unit mux1_control_unit(
12     .map_addr(map_addr),
13     .jump_addr(jump_addr),
14     .inc_addr(inc_addr),
15     .select(select),
16     .reg_in(reg_in)
17 );
18
19
20 initial begin
21     #10 map_addr=1 ; jump_addr= 45; inc_addr= 3 ; select= 0 ;
22     #10 map_addr=1 ; jump_addr= 45; inc_addr= 3 ; select= 1 ;
23     #10 map_addr=1 ; jump_addr= 45; inc_addr= 3 ; select= 2 ;
24
25 end
26
27 endmodule
```

Listing 37: Multiplexer testbench

## Logic block testbench

```
1 'timescale 1 ns/10 ps
2
3 module logic_block_tb;
4
5     reg BT_in;
6     reg mux2_out;
7     wire [1:0] signal_out;
8
9     logic_block logic(.BT_in(BT_in), .mux2_out(mux2_out), .signal_out(
10      signal_out));
11
12 initial
13 begin
14     #10 BT_in <= 0 ; mux2_out <= 0;
15     #10 BT_in <= 0 ; mux2_out <= 1;
16     #10 BT_in <= 1 ; mux2_out <= 0;
```

```

16      end
17
18 endmodule

```

Listing 38: Logic block testbench

## Mapping block testbench

```

1 `timescale 1 ns/10 ps
2
3 module mapping_block_tb;
4
5   reg [15:0] IR;
6   wire [15:0] map_addr;
7
8   mapping_block map_function(
9     .IR(IR),
10    .map_addr(map_addr)
11  );
12
13 initial begin
14   #10 IR = 0;
15   #10 IR = 1;
16   #10 IR = 2;
17   #10 IR = 3;
18   #10 IR = 4;
19   #10 IR = 5;
20   #10 IR = 6;
21   #10 IR = 7;
22   #10 IR = 8;
23   #10 IR = 9;
24   #10 IR = 10;
25   #10 IR = 11;
26   #10 IR = 12;
27   #10 IR = 13;
28   #10 IR = 14;
29   #10 IR = 15;
30   #10 IR = 16;
31   #10 IR = 17;
32   #10 IR = 18;
33   #10 IR = 19;
34   #10 IR = 20;
35   #10 IR = 21;
36   #10 IR = 22;
37   #10 IR = 23;
38   #10 IR = 24;
39   #10 IR = 25;
40   #10 IR = 26;
41   #10 IR = 27;
42   #10 IR = 28;
43
44 end
45 endmodule

```

Listing 39: Mapping block testbench

## Python codes

### Code for rearranging user data into algorithm expected format

```

1 import numpy as np
2 import argparse
3
4 from numpy.core.fromnumeric import size
5
6
7 def pad_matrix(matrix1, matrix2, n):
8     n1, n2 = matrix1.shape
9     n3, n4 = matrix2.shape
10
11 #n_max= max(n1, n2, n3, n4)
12
13 matrix1_pad= np.zeros((n,n))
14 matrix2_pad= np.zeros((n,n))
15
16 matrix1_pad[:n1, :n2]= matrix1
17 matrix2_pad[:n3, :n4]= matrix2
18
19 return matrix1_pad, matrix2_pad
20
21
22 def save_matrix(matrix1, matrix2, mem_size=1000, K=332, n=3, c=1,
23                 write_filename= 'out.txt'):
24     matrix1 = np.array(matrix1)
25     matrix2 = np.array(matrix2)
26
27     matrix1, matrix2 = pad_matrix(matrix1, matrix2, n)
28
29     matrix1_flat = matrix1.flatten().astype('int')
30     matrix2_flat = matrix2.T.flatten().astype('int')
31
32     to_mem=np.zeros((mem_size+1,)).astype('int')
33     for idx in range(0, n*n):
34         to_mem[idx] = matrix1_flat[idx]
35
36     for idx in range(0, n*n):
37         to_mem[K+idx] = matrix2_flat[idx]
38
39     to_mem[-2]=c
40     to_mem[-3]=n
41
42     to_mem= '\n'.join(to_mem.astype('str'))
43
44     with open(write_filename, 'w') as f:
45         f.write(to_mem)
46         print(f'written to {write_filename}')
47
48 parser = argparse.ArgumentParser(description='Write matrices')
49 # parser.add_argument('--matrix1', type=str)

```

```

50 # parser.add_argument('--matrix2', type=str)
51 parser.add_argument('--mem_size', type=int, default= 1000)
52 parser.add_argument('--K', type= int, default= 332)
53 parser.add_argument('--n', type= int)
54 parser.add_argument('--c', type= int)
55 parser.add_argument('--filename', type=str, default= 'data_to_mem.txt')
56
57 args = parser.parse_args()
58
59
60 matrix1 = np.random.randint(20, size= (args.n,args.n))
61 matrix2 = np.random.randint(20, size= (args.n,args.n))
62
63
64 save_matrix(matrix1,matrix2,args.mem_size, args.K, args.n, args.c, args.
   filename)

```

Listing 40: save\_matrix.py

## Code for displaying matrix output

```

1
2 import numpy as np
3 import argparse
4
5 def check_matrices(file_name= 'results_from_mem.txt', n=3, K=100):
6     with open(file_name) as f:
7         data = f.read()
8
9     all_data = data.split('\n')[:-1]
10
11    matrix1 = np.array([int(x) for x in all_data[0:n*n]]).reshape(n,n)
12    matrix2 = np.array([int(x) for x in all_data[K:K+n*n]]).reshape(n,n).T
13    out_matrix = np.array([int(x) for x in all_data[2*K:2*K+n*n]]).reshape(n,n)
14
15    correct_matrix= np.matmul(matrix1, matrix2)
16
17    print(f'matrix 1: \n{matrix1}\nmatrix 2: \n{matrix2}\n\nCalculated Answer:
18      \n{out_matrix}\nCorrect Answer: \n{correct_matrix}')
19    print(f'\nIs answer correct : {(correct_matrix== out_matrix).all()}')
20
21 parser = argparse.ArgumentParser(description='Check matrices')
22 parser.add_argument('--filename', type=str, default= 'results_from_mem.txt')
23 parser.add_argument('--n', type=int)
24 parser.add_argument('--K', type= int, default= 332)
25
26 args = parser.parse_args()
27 check_matrices(args.filename, args.n, args.K)

```

Listing 41: check\_matrix.py