



**Universidade do Minho**  
Escola de Engenharia

Hugo José Duarte Ribeiro

**Adaptive Quantum for Parallel Full System  
Simulation**

Hugo José Duarte Ribeiro    **Adaptive Quantum for Parallel Full System  
Simulation**





**Universidade do Minho**  
Escola de Engenharia

Hugo José Duarte Ribeiro

**Adaptive Quantum for Parallel Full System  
Simulation**

Dissertação de Mestrado  
Mestrado em Engenharia Eletrónica Industrial e  
Computadores

Trabalho efetuado sob a orientação de  
**Professor Doutor Jorge Miguel Nunes Santos Cabral**  
**Professor Doutor Rainer Leupers**

## **DIREITOS DE AUTOR E CONDIÇÕES DE UTILIZAÇÃO DO TRABALHO POR TERCEIROS**

Este é um trabalho académico que pode ser utilizado por terceiros desde que respeitadas as regras e boas práticas internacionalmente aceites, no que concerne aos direitos de autor e direitos conexos.

Assim, o presente trabalho pode ser utilizado nos termos previstos na licença abaixo indicada.

Caso o utilizador necessite de permissão para poder fazer um uso do trabalho em condições não previstas no licenciamento indicado, deverá contactar o autor, através do RepositóriUM da Universidade do Minho.



**Atribuição-NãoComercial-Compartilhagual  
CC BY-NC-SA**

<https://creativecommons.org/licenses/by-nc-sa/4.0/>

# Acknowledgements

First of all, I would like to thank my supervisors, Prof. Dr. Jorge Cabral and Prof. Dr. Rainer Leupers, for the support and motivation that they gave during this journey.

Then, I would like to express my gratitude to Niko Zurstraßen. Niko was the first contact that I had when I arrived in Germany. He helped me not only in the development of this work but also in the adaptation to a new city and new people. Also, a special word for Rui Almeida and Rui Machado, who supported me all the time with their knowledge and expertise. Furthermore, and most importantly, thank you all for your friendship.

I want to acknowledge also my family and friends. They were crucial in this part of my life, not only backing me up, but keeping me motivated and relaxed as well.

Lastly, I want to thank all the professors that I had since I started studying, for teaching me their knowledge and guiding me to achieve this goal. Just a special note to Prof. Mário Roque, Prof. Rosário Freitas, and Prof. Rui Barreiro, who with their teaching and empathy guided me to choose the electrical engineering field, for which I am grateful.

## **STATEMENT OF INTEGRITY**

I hereby declare having conducted this academic work with integrity. I confirm that I have not used plagiarism or any form of undue use of information or falsification of results along the process leading to its elaboration. I further declare that I have fully acknowledged the Code of Ethical Conduct of the University of Minho.

# Abstract

In recent years, the MultiProcessor System on a Chips (MPSoCs) complexity has been growing exponentially nevertheless, the performance of simulation tools is not following this growth, mainly because of their sequential simulation type. Therefore, simulation time increases each time a new MPSoC is developed. Concerning this problem, the Institute for Communication Technologies and Embedded Systems (ICE) Rheinisch-Westfälische Technische Hochschule (RWTH) Aachen team developed a parallel version of the atomic mode of Gem5, par-gem5. It is based on a synchronous Parallel Discrete Event Simulation (PDES) which allows each simulation thread to run independently from the rest of the system for a time  $t_{\Delta q}$  - the so-called quantum. Although this is a huge improvement, it carries a challenge in the quantum definition.

This dissertation aims to solve the problem of finding the optimal quantum, which can lead to the best trade-off between accuracy and performance. Nowadays, par-gem5 is bound to a static quantum which cannot be changed during run-time however, an adaptive version can overcome this limitation. It was conceptualized and developed a flexible algorithm that can operate independently of the used benchmark and system. Further, it was also tested and verified its functionality in a co-simulation environment.

In the end, the dynamic version of the quantum choice brings greater benefits when compared with the present. It was possible to achieve, on average, a performance gain of almost 10%, only sacrificing 0.5% of accuracy. Nevertheless, when a-priori information about the test is given, the tradeoff can be improved with the static approach. It becomes possible to calculate the optimal quantum before initiating the simulation, obtaining higher performances. If perfect accuracy is a requirement, the sequential version must be used, since the usage of both methods implies a loss in accuracy. Also, simulations with a single simulated core should always be executed with the later mode.

**Keywords:** **Parallel Discrete Event Simulation, Gem5, Full-System Simulation, Quantum**

# Resumo

Nos ultimos anos, a complexidade dos MPSoCs tem crescido exponencialmente, contudo as ferramentas de simulacao destes nao tem seguido este crescimento, principalmente devido ao seu tipo de simulacao sequencial. Nesse sentido, os tempos de simulacao aumentam sempre que um novo MPSoCs é desenvolvido. A equipa do ICE RWTH Aachen desenvolveu uma versao simulacao paralela to modo atomico do Gem5. Esse modo é baseado numa simulacao paralela sincrona de eventos discretos (PDES) que permite cada thread de simulacao executar independentemente do resto do sistema por um tempo  $t_{\Delta q}$ , cujo nome é quantum.

Esta dissertacao tem como objetivo solucionar o problema de encontrar o quantum otimo, ou seja, o quantum que permita obter o melhor trade-off entre a performance e a precisao da simulacao. De momento, a versao oficial do Gem5 apenas permite a defenicao de um quantum estatico, pelo que este nao pode ser alterado durante o decorrer da simulacao, porém um quantum dinamico consegue ultrapassar as limitacoes do anterior. Portanto, o principal objetivoo é desenvolver um algoritmo flexivel que consiga operar independentemente do benchmark utilizado.

**Palavras-chave:**Simulacao Paralela de Eventos Discretos, Gem5, Simulacao de Sistemas Completos, Quantum

METER NO WORD PARA CORRIGIR ERROS E ACENTOS

# Contents

|                                             |            |
|---------------------------------------------|------------|
| <b>List of Figures</b>                      | <b>x</b>   |
| <b>List of Tables</b>                       | <b>xi</b>  |
| <b>Glossary</b>                             | <b>xii</b> |
| <b>1 Introduction</b>                       | <b>1</b>   |
| 1.1 Motivation . . . . .                    | 1          |
| 1.2 Goals and Contributions . . . . .       | 2          |
| 1.3 Dissertation Outline . . . . .          | 2          |
| <b>2 State of the Art</b>                   | <b>4</b>   |
| 2.1 Simulation . . . . .                    | 4          |
| 2.1.1 Definition . . . . .                  | 5          |
| 2.1.2 Importance of Simulation . . . . .    | 6          |
| 2.2 Embedded Systems . . . . .              | 7          |
| 2.2.1 Definition . . . . .                  | 8          |
| 2.2.2 Operating Systems . . . . .           | 9          |
| 2.2.3 Development Models . . . . .          | 13         |
| 2.2.4 Embedded Simulation . . . . .         | 16         |
| 2.3 Discrete Event Simulation . . . . .     | 18         |
| 2.3.1 Continuous Event Simulation . . . . . | 20         |
| 2.4 Simulation Modes . . . . .              | 21         |
| 2.4.1 Sequential Simulation . . . . .       | 21         |
| 2.4.2 Temporal Decoupling . . . . .         | 22         |
| 2.4.3 Parallel Simulation . . . . .         | 24         |
| 2.5 Gem5 . . . . .                          | 26         |
| 2.5.1 Overview . . . . .                    | 27         |
| 2.5.2 Simulation Capabilities . . . . .     | 28         |
| 2.5.3 Usage . . . . .                       | 30         |

|          |                                                 |           |
|----------|-------------------------------------------------|-----------|
| 2.5.4    | Par-gem5 . . . . .                              | 31        |
| 2.5.5    | Other Simulators . . . . .                      | 33        |
| 2.6      | Machine Learning . . . . .                      | 36        |
| 2.6.1    | Artificial Neural Networks . . . . .            | 36        |
| 2.6.2    | Learning Rules . . . . .                        | 38        |
| 2.6.3    | Machine Learning in Simulation . . . . .        | 39        |
| 2.7      | Co-Simulation . . . . .                         | 40        |
| 2.7.1    | Co-Simulation on Gem5 . . . . .                 | 41        |
| <b>3</b> | <b>Dynamic Quantum Extension</b>                | <b>42</b> |
| 3.1      | Par-gem5 Changes and Benchmarks . . . . .       | 42        |
| 3.1.1    | Par-gem5 Changes . . . . .                      | 43        |
| 3.1.2    | Benchmarks . . . . .                            | 44        |
| 3.2      | ADALINE-Based Algorithm . . . . .               | 46        |
| 3.2.1    | Adaptive Filter . . . . .                       | 47        |
| 3.2.2    | TDL . . . . .                                   | 49        |
| 3.2.3    | Quantum Increment . . . . .                     | 50        |
| 3.2.4    | Results . . . . .                               | 51        |
| 3.3      | Step Ladder Algorithm . . . . .                 | 53        |
| 3.3.1    | Results . . . . .                               | 56        |
| 3.4      | Instruction Flow Prediction Algorithm . . . . . | 56        |
| 3.4.1    | Forecast . . . . .                              | 58        |
| 3.4.2    | Step Ladder Update . . . . .                    | 60        |
| 3.4.3    | Results . . . . .                               | 60        |
| 3.5      | Loop Detection Algorithm . . . . .              | 62        |
| 3.5.1    | Hare-Tortoise Algorithm . . . . .               | 63        |
| 3.5.2    | Quantum Calculation . . . . .                   | 64        |
| 3.5.3    | Results . . . . .                               | 65        |
| 3.6      | Improved Baseline Algorithm . . . . .           | 67        |
| 3.6.1    | Results . . . . .                               | 67        |
| <b>4</b> | <b>Co-Simulation on Gem5</b>                    | <b>70</b> |
| 4.1      | Framework Proposal . . . . .                    | 70        |
| 4.1.1    | SystemC . . . . .                               | 71        |
| 4.1.2    | Gem5 . . . . .                                  | 72        |
| 4.1.3    | TLM Wrapper . . . . .                           | 74        |
| 4.2      | CRC as a Case Study . . . . .                   | 78        |
| 4.2.1    | Peripheral Development . . . . .                | 78        |

|                   |                                       |           |
|-------------------|---------------------------------------|-----------|
| 4.2.2             | Peripheral Validation . . . . .       | 81        |
| 4.2.3             | Memory Integrity . . . . .            | 84        |
| 4.3               | Dynamic Quantum Integration . . . . . | 88        |
| <b>5</b>          | <b>Conclusions</b>                    | <b>91</b> |
| 5.1               | Developed Work . . . . .              | 91        |
| 5.2               | Future Work . . . . .                 | 92        |
| <b>References</b> |                                       | <b>94</b> |

# List of Figures

|      |                                                                                            |    |
|------|--------------------------------------------------------------------------------------------|----|
| 2.1  | Evolution of the simulation topic in the literature by Google Books ngram viewer . . . . . | 4  |
| 2.2  | Market research report by ZION about simulation [4] . . . . .                              | 5  |
| 2.3  | Taxonomy of virtual simulations (Adapt from [6]) . . . . .                                 | 6  |
| 2.4  | Typical embedded system (adapted from [10]) . . . . .                                      | 8  |
| 2.5  | Layers of an Operating System (OS) [17] . . . . .                                          | 10 |
| 2.6  | Concurrency between processes . . . . .                                                    | 10 |
| 2.7  | Process and threads block diagram . . . . .                                                | 11 |
| 2.8  | Context switch . . . . .                                                                   | 12 |
| 2.9  | Two different tasks using the same resource . . . . .                                      | 12 |
| 2.10 | Two different tasks using the same resource with mutex . . . . .                           | 13 |
| 2.11 | The waterfall model . . . . .                                                              | 14 |
| 2.12 | The V-Model . . . . .                                                                      | 15 |
| 2.13 | The agile model . . . . .                                                                  | 15 |
| 2.14 | The spiral model [28] . . . . .                                                            | 16 |
| 2.15 | Full system hardware simulator [30] . . . . .                                              | 17 |
| 2.16 | Full system software simulator [30] . . . . .                                              | 18 |
| 2.17 | Supermarket flow schematic . . . . .                                                       | 19 |
| 2.18 | Updating state over simulated time in continuous and discrete simulation [41] . . . . .    | 21 |
| 2.19 | Example of a sequential simulation . . . . .                                               | 22 |
| 2.20 | The principal of temporal decoupling . . . . .                                             | 23 |
| 2.21 | Sequential VS parallel simulation with and without Temporal Decoupling (TD) . . . . .      | 24 |
| 2.22 | The causality problem . . . . .                                                            | 25 |
| 2.23 | Speed vs. accuracy spectrum [34] . . . . .                                                 | 27 |
| 2.24 | Atomic mode . . . . .                                                                      | 29 |
| 2.25 | Timing mode . . . . .                                                                      | 29 |
| 2.26 | Simple system configuration diagram . . . . .                                              | 30 |
| 2.27 | Example of scheduling events in Gem5 [3] . . . . .                                         | 32 |
| 2.28 | Final assessment charts [67] . . . . .                                                     | 34 |
| 2.29 | Modelling solutions spectrum [68] . . . . .                                                | 34 |

---

|      |                                                                                   |    |
|------|-----------------------------------------------------------------------------------|----|
| 2.30 | Feature comparison between simulators [70] . . . . .                              | 35 |
| 2.31 | Examples of Machine Learning (ML) algorithms [78] . . . . .                       | 36 |
| 2.32 | Comparison diagram between a biological and an artificial neuron . . . . .        | 37 |
| 2.33 | Schematic of ADALINE [80] . . . . .                                               | 39 |
| 2.34 | Subfields of combining ML and simulation [84] . . . . .                           | 39 |
| 2.35 | Different domains of a complex system . . . . .                                   | 40 |
| 2.36 | Research publications of co-simulation applications over five years[89] . . . . . | 41 |
| 3.1  | High-level diagram of par-gem5 operation mode . . . . .                           | 43 |
| 3.2  | Bubble sort example . . . . .                                                     | 45 |
| 3.3  | Adaptive Noise Cancellation (ANC) scheme . . . . .                                | 46 |
| 3.4  | Control action with different learning rate values . . . . .                      | 48 |
| 3.5  | Reset system in ADALINE-based algorithm . . . . .                                 | 48 |
| 3.6  | Tapped Delay Line (TDL) working method . . . . .                                  | 49 |
| 3.7  | Reset vs. Host . . . . .                                                          | 50 |
| 3.8  | ADALINE flowchart . . . . .                                                       | 51 |
| 3.9  | ADALINE-based algorithm results . . . . .                                         | 52 |
| 3.10 | ADALINE-based and step ladder algorithms combined . . . . .                       | 54 |
| 3.11 | Example of the increment value evolution with a threshold . . . . .               | 54 |
| 3.12 | Step ladder algorithm flowchart . . . . .                                         | 55 |
| 3.13 | Dynamic increment results . . . . .                                               | 57 |
| 3.14 | Instruction Flow Prediction (IFP) Algorithm . . . . .                             | 58 |
| 3.15 | Possible scenarios after the forecast . . . . .                                   | 59 |
| 3.16 | IFP algorithm results . . . . .                                                   | 61 |
| 3.17 | Quantum definition with loop detection algorithm . . . . .                        | 62 |
| 3.18 | Hare-Tortoise Algorithm . . . . .                                                 | 63 |
| 3.19 | Loop detection flowchart . . . . .                                                | 64 |
| 3.20 | Quantum calculation flowchart . . . . .                                           | 65 |
| 3.21 | Repetitions algorithm results . . . . .                                           | 66 |
| 3.22 | New quantum definition in the synchronization process . . . . .                   | 67 |
| 3.23 | Improved baseline and static mode comparison . . . . .                            | 68 |
| 3.24 | Simulation issue in par-gem5 . . . . .                                            | 69 |
| 4.1  | High-level proposal design . . . . .                                              | 70 |
| 4.2  | General SystemC design . . . . .                                                  | 71 |
| 4.3  | Flowcharts of the available commands . . . . .                                    | 72 |
| 4.4  | Redefinition of <i>BasicPioDevice</i> functions . . . . .                         | 74 |
| 4.5  | TLM wrapper payload definition . . . . .                                          | 74 |

|      |                                                                                                       |    |
|------|-------------------------------------------------------------------------------------------------------|----|
| 4.6  | TLM wrapper in Gem5 . . . . .                                                                         | 76 |
| 4.7  | TLM wrapper in SystemC . . . . .                                                                      | 77 |
| 4.8  | SystemC design with CRC . . . . .                                                                     | 78 |
| 4.9  | CRC block diagram . . . . .                                                                           | 79 |
| 4.10 | CRC write operation . . . . .                                                                         | 80 |
| 4.11 | Class diagram for the Cyclic Redundancy Check (CRC) Application Programming Interface (API) . . . . . | 81 |
| 4.12 | Class diagram for the CRC API . . . . .                                                               | 82 |
| 4.13 | CRC validation test results . . . . .                                                                 | 83 |
| 4.14 | Co-simulation design . . . . .                                                                        | 84 |
| 4.15 | State machine diagram for the memory integrity test . . . . .                                         | 84 |
| 4.16 | Sequence diagram diagram for the memory comparison . . . . .                                          | 85 |
| 4.17 | Application execution timely diagram . . . . .                                                        | 86 |
| 4.18 | Success memory integrity test . . . . .                                                               | 86 |
| 4.19 | Application execution timely diagram with a failure . . . . .                                         | 87 |
| 4.20 | Failure memory integrity test . . . . .                                                               | 88 |
| 4.21 | Co-simulation results . . . . .                                                                       | 89 |

# List of Tables

|     |                                                                            |    |
|-----|----------------------------------------------------------------------------|----|
| 2.1 | Example of a sequence of events in a supermarket . . . . .                 | 20 |
| 2.2 | Overview of the supported architectures in Gem5 [61] . . . . .             | 28 |
| 2.3 | Overview of the different works regarding the quantum definition . . . . . | 33 |
| 2.4 | Comparison table between a biological and an artificial neuron . . . . .   | 37 |
| 3.1 | System Configurations . . . . .                                            | 44 |
| 4.1 | TLM wrapper commands . . . . .                                             | 75 |
| 4.2 | Reverse operation . . . . .                                                | 79 |

# Glossary

|             |                                                               |
|-------------|---------------------------------------------------------------|
| <b>ADC</b>  | Analog-to-Digital Converter                                   |
| <b>ANC</b>  | Adaptive Noise Cancellation                                   |
| <b>ANN</b>  | Artificial Neural Network                                     |
| <b>API</b>  | Application Programming Interface                             |
| <b>ASIC</b> | Application Specific Integrated Circuit                       |
| <b>BNN</b>  | Biological Neural Network                                     |
| <b>CES</b>  | Continuous Event Simulation                                   |
| <b>CPU</b>  | Central Processing Unit                                       |
| <b>CRC</b>  | Cyclic Redundancy Check                                       |
| <b>CS</b>   | Context Switch                                                |
| <b>DAC</b>  | Digital-to-Analog Converter                                   |
| <b>DES</b>  | Discrete Event Simulation                                     |
| <b>DMA</b>  | Direct Memory Access                                          |
| <b>DMI</b>  | Direct Memory Interface                                       |
| <b>FFT</b>  | Fast Fourier Transform                                        |
| <b>FIFO</b> | First-In First-Out                                            |
| <b>FPC</b>  | Forecasted Program Counter                                    |
| <b>FS</b>   | Full-System                                                   |
| <b>FSS</b>  | Full System Simulator                                         |
| <b>GDB</b>  | GNU Debugger                                                  |
| <b>GEMS</b> | General Execution-driven Multiprocessor Simulator             |
| <b>GPIO</b> | General Purpose Input Output                                  |
| <b>GUI</b>  | Graphical User Interface                                      |
| <b>HDL</b>  | Hardware Description Language                                 |
| <b>I/O</b>  | Input/Output                                                  |
| <b>ICE</b>  | Institute for Communication Technologies and Embedded Systems |
| <b>IFP</b>  | Instruction Flow Prediction                                   |
| <b>IP</b>   | Intellectual Property                                         |

|               |                                              |
|---------------|----------------------------------------------|
| <b>IPC</b>    | Inter-Process Communication                  |
| <b>ISA</b>    | Instruction Set Architecture                 |
| <b>KPI</b>    | Key Performance Indicator                    |
| <b>KVM</b>    | Kernel Virtual Mode                          |
| <b>LED</b>    | Light-Emitting Diode                         |
| <b>LMS</b>    | Least Mean Square                            |
| <b>LoC</b>    | Lines of Code                                |
| <b>MIPS</b>   | Millions-of-Instructions-Per-Second          |
| <b>ML</b>     | Machine Learning                             |
| <b>MPSoC</b>  | MultiProcessor System on a Chip              |
| <b>NIC</b>    | Network Interface Controller                 |
| <b>NN</b>     | Neural Network                               |
| <b>NPB</b>    | NAS Parallel Benchmarks                      |
| <b>O3</b>     | Out-Of-Order                                 |
| <b>OS</b>     | Operating System                             |
| <b>PC</b>     | Program Counter                              |
| <b>PDES</b>   | Parallel Discrete Event Simulation           |
| <b>POSIX</b>  | Portable Operating System Interface          |
| <b>PTE</b>    | Page Table Entry                             |
| <b>RAM</b>    | Random Access Memory                         |
| <b>RTL</b>    | Register-Transfer Level                      |
| <b>RWTH</b>   | Rheinisch-Westfälische Technische Hochschule |
| <b>SDES</b>   | Sequential Discrete Event Simulation         |
| <b>SE</b>     | System-call Emulation                        |
| <b>SoC</b>    | System-on-a-Chip                             |
| <b>SP</b>     | Stack Pointer                                |
| <b>SWaP-C</b> | Size, Weight, Power, and Cost                |
| <b>TD</b>     | Temporal Decoupling                          |
| <b>TDL</b>    | Tapped Delay Line                            |
| <b>TLM</b>    | Transaction-Level Modeling                   |
| <b>UART</b>   | Universal Asynchronous Receiver-Transmitter  |
| <b>VP</b>     | Virtual Platform                             |

# 1 | Introduction

A normal industrial design process is divided into five stages: define, ideate, prototype, production and deliver. One characteristic of this process is its non-linearity. There is an interaction between the ideate, prototype, and production stages, the so-called interaction design. This happens because problems or upgrades can be found, leading to a step back in the design process [1].

One example where this design process happens is in the development of an Application Specific Integrated Circuit (ASIC). As the name suggests, an ASIC is used for specific applications where dedicated hardware is required, for example, a critical system of a car. After the design, the prototype is developed, on which will be running tests, or benchmarks, in order to understand if the developed ASIC satisfies all the requirements. Only in this stage, it is possible to obtain some indicators such as power consumption or compute performance to evaluate if Size, Weight, Power, and Cost (SWaP-C) requirements match. However, with Virtual Platforms (VPs) or Full System Simulators (FSSs) it is possible to have those without the physical prototype. Therefore, the time to market can be accelerated because problems or upgrades can be spotted much sooner.

Thus, these simulation tools are useful for the design of modern massively parallel and complex multicore systems. Nevertheless, the major problem is that, typically, many of these simulators cannot execute a parallel simulation, in other words, they only execute the workload in one thread. Also, the complexity of the new systems increases due to the integration of more and more applications on a single chip [2], leading to unacceptable simulation times. For example, the case of the SPEC2017 integer benchmark, where it may take up to two years to complete the simulation [3].

## 1.1 Motivation

Gem5 is one FSS that cannot only execute a simulation with multiple threads. To solve this, the ICE RWTH Aachen team developed par-gem5 [3], a parallel version of the atomic mode of Gem5, that exploits the multithreading capabilities of modern host systems. It is based on a synchronous PDES which allows the parallelization of the system. Synchronizations are done periodically, according to a defined time, so-called quantum or quanta.

High quantum allows for high simulation speeds but negatively impacts the simulation's accuracy, or, in the worst case, can even break the system's functionality. If the quanta is too small, the accuracy is

perfect, although the simulation performance will be unsatisfactory. Thus, there is a tradeoff between accuracy and performance, and finding an optimal quantum is one of the main challenges when running synchronous PDES.

In the current state of par-gem5 (and as in other frameworks), the quantum is set once and then kept for the rest of the simulation. This brings several different problems. First of all, to know which is the best quantum, it is necessary to do the simulation in order to obtain the simulation results, and further evaluate if it was the best choice or not. Moreover, the quantum varies from simulation to simulation, therefore one can be the best for one case, but for another not so much. All this try-and-error consumes a lot of time, thus this is not the optimal option.

## 1.2 Goals and Contributions

A dynamic quantum could address these issues by adjusting the quantum value for each simulation in real-time, leading to improved results. This approach is particularly beneficial for typical benchmarks that consist of multiple phases with distinct computing and synchronization characteristics, that is, for the computational part, the quantum can be increased, for the synchronization part, the quantum should be reduced.

In the context of par-gem5, with dissertation development, it will be possible to automatically tune to the best quantum, without any user inputs or feedback. Furthermore, the quantum adaptation must be "on-the-fly" and be independent of the simulated system or benchmark, hence the algorithm must be flexible.

On top of that, the simulator, at the end of the benchmark execution, should give feedback to the user, by the creation of a statistics document. It also must include information related to the adaptive quantum, for example, the mean of the used quantum, by the reason of understanding how the algorithm performed.

In the end, the dynamic quantum should bring more advantages than the static version. It is expected that this algorithm solves the problem of finding the best compromise between performance and accuracy, allowing speedups in different simulations, and making it possible to simulate massively parallel and complex systems faster, without a break in the accuracy.

From an industry point of view, this work will grant a faster development of new products, in such a way companies can be the market leaders of the technology. Time-to-market can be optimized since the product is finished earlier, giving room for maneuvering to commercialize it at the right moment, and increasing the revenues.

## 1.3 Dissertation Outline

This document provides the development of a simulation extension that allows par-gem5 to automatically address the best quantum for the desired benchmark, together with a case study. Its content was

divided into six chapters, which are going to be briefly described.

The second chapter introduces the concepts and methodologies used. The reader will receive notions about simulation in a general view, where later these will be specified in the embedded systems context. Themes such as simulation modes, simulation methodologies, and simulation tools will be covered, with special attention to Gem5, the framework on which this dissertation will focus. ML will be exploited, having in consideration Neural Networks (NNs) and the simulation context. Finally, co-simulation practices are discussed, emphasizing their significance in the simulation world.

The third chapter presents the developed work done regarding the dynamic quantum. It starts by giving a short explanation of the used benchmarks to test the algorithms. Then, the four developed algorithms will be covered, providing an insight into the need to develop such algorithms and how they actually work. It ends with the final version, where a direct comparison with the static mode is made. The simulation results will be provided over the chapter in order to help the reader have a better idea of the algorithm's evolution.

The fourth chapter contains a proposal for a co-simulation environment, providing an interface where the two frameworks can cooperate. A case study was chosen to validate the developed work, giving a practical example of how this technique might be useful. It was based on a CRC peripheral, which is used for communication error detection and memory integrity checks. Since in a real-world scenario, there are other devices to manage, the co-simulation was tested alongside other tasks. In the end, the integration of the par-gem5 with the dynamic quantum was done, in order to evaluate its advantages and disadvantages in this situation.

The final chapter discusses the conclusions regarding the developed work. It also includes future work, where can be found a set of aspects on which this work can have continuity and further improve par-gem5.

## 2 | State of the Art

This chapter establishes the needed concepts to accomplish and understand this work. First of all, it will start with the simulation topic, what it is, and why it is so important. Embedded systems are briefly contextualized, highlighting the connection they share with the previous topic. Then, simulation types and modes are introduced, with a special focus on the ones that will be used in this dissertation. The Gem5 simulator will be presented in detail since it is the framework where all the development will be done. To conclude, machine learning and co-simulation will end this chapter, providing a short view of these topics in the project's context.

### 2.1 Simulation

Simulation is a very recent topic in the literature. The graph presented below attests to the fact that this subject began to be studied solely in the early 1950s. Prior to the advent of computers, it was unfeasible to replicate anything without the actual object or prototype, thereby rendering this subject of relatively low significance in the industry. However, with the introduction of computers, a new realm of opportunities emerged as a "virtual world" was conceived. As computers became smaller, more robust, and more cost-effective, a panoply of novel technologies and topics began to flourish, including simulation.



**Figure 2.1:** Evolution of the simulation topic in the literature by Google Books ngram viewer

Analyzing the market research report by ZION, is possible to conclude that the global virtual training and simulation market size was valued at \$332.6 Billion in 2022 and is projected to reach \$973.4 Billion by 2030.



**Figure 2.2:** Market research report by ZION about simulation [4]

This growth and demand in the industry turn this topic into an important subject. For this reason, it is important to understand what is simulation and why is it necessary. In the following points, this will be covered in detail.

### 2.1.1 Definition

Accordingly the Cambridge dictionary, simulation can be defined as "*a situation or event that seems real but is not real, used especially in order to help people deal with such situations or events*". Moreover, simulation can be also defined, in a more general way, as "*an imitation of a system*" [5]. Therefore, simulation is not exclusively confined to computers but also encompasses a diverse range of applications, whether physical or virtual. For instance, in the automotive industry, each new model that is designed must undergo a security test. One of the numerous examinations performed involves simulating a car crash into a wall and assessing the driver's resulting damages. As a result, conclusive determinations can be made, specifically in this case, regarding whether the car has successfully met the test's requirements or not.

Hence, simulation is a controlled verification technique that helps to understand how a system would behave in a real situation. As previously mentioned, simulation can be physical or virtual. Nowadays, the last is the preferable one, since it brings huge advantages when compared to the other. First of all, the cost. It is clear that physical simulations, like the one previously described, have a lot of associated costs. The car that is going to be destroyed, the workers to make sure everything goes as planned and to clear all the wreckage, and even the infrastructure, that require a designated area which involves costs and space. The second reason is the time. In a competitive industry, the first to have the product in the market can be the one that will have more success. Physical simulation can take a lot of time. It may be necessary, beyond the simulation itself, preparation, post-cleaning, license acquisition, weather conditions, etc. delaying the

process. The last factor is the computer's evolution. As noted earlier, its evolution was crucial in the way that more complex simulation environments can be tested and obtain more accurate results.

To elaborate on the concept of virtual simulation, it can be classified into three categories. This classification is illustrated in the Figure 2.3, which outlines the taxonomy of virtual simulations.



**Figure 2.3:** Taxonomy of virtual simulations (Adapt from [6])

As shown, it can be divided into physical simulation facilitated virtually, virtual simulation modalities, and in-person simulation in a virtual setting.

Physical simulation facilitated virtually is the process of utilizing virtual simulation to replicate a physical phenomenon or system. This is done through the creation of a digital model that imitates the physical object, and through various algorithms and computational methods, simulates its behavior in a virtual environment. In this way, it is possible to predict how the physical system will behave in various scenarios, without the need for physical testing, thereby saving time and resources. It is used, for example, in the development of an embedded system. Without it, it would be mandatory to have the physical prototype to evaluate if the system matches all the requirements.

This dissertation will focus on computer architecture simulation. Therefore, the concept of simulation can be redefined as "*the imitation of the operation of a real-world process or system over time*" [7]. The computer architecture itself becomes the system under study, and simulation enables the user to explore its capabilities and limitations in a simulated environment.

## 2.1.2 Importance of Simulation

To have a full scope of why simulation is important, three topics should be considered: how the nature of the system affects the simulation, and the advantages and disadvantages of this technique compared to other validation methods [5].

The first topic exhibits three characteristics: variability, interconnections, and complexity. Consider the production of a Central Processing Unit (CPU) chip as an example. The production process may have

interconnections with other factors such as the chip manufacturers or the chip designers. These interconnections can make it difficult to predict the overall performance of the system, which in this example is the CPU chip production, especially when variability is present. Furthermore, systems can be complex, and these complexities make it challenging to predict the performance of a system when changes are made or actions are taken. With simulation models, these problems can be solved in a way that they can explicitly represent the variability, interconnectedness, and complexity of a wanted system.

Keeping the example of CPU chip development, the process goes through various stages and, in the end, it is crucial to test the product before mass production. Thus, a prototype is built so tests can be made. However, building a prototype takes time and money. In the majority of cases, the prototype can be improved, so the developers need to go through the process again. A new cycle begins, which means spending more time and money. Simulation can solve this problem in the way that development and testing can be done in parallel. In other words, while the developers are creating the chip, it can be tested in simulation, even without the physical prototype. Moreover, simulation is flexible, which means small changes can be done easily, is controlled, that is, it can control the experimental conditions in the way direct comparisons can be done, and is transparent, so there are no subjective results.

In terms of drawbacks, it is possible that the development team consists solely of individuals specialized in CPU chip design. Therefore, the team would need to hire a new employee with expertise in simulation to fulfill the requirements. Furthermore, it requires powerful computers and a huge sufficient storage capacity, as simulation results can be in the order of dozens of gigabytes. The software to run the simulation and the creation of the model to test require investments. All these aspects represent an enormous investment for the company, where it will not get a direct profit. On top of that, simulations are not 100% accurate, meaning that they can produce results that are not the correct ones.

Considering all the aspects, simulation brings a lot of benefits when compared to other techniques, such as experimentation in real life. Nonetheless, the downsides must be taken into account and concluded whether this solution fits the needs or not. Banks in [8] mentions examples of applications where it can be used, like computer system performance, food processing, transportation systems, and embedded systems.

## 2.2 Embedded Systems

Embedded systems have a ubiquitous presence in modern life, from cars, computers, and televisions to smartphones and even toothbrushes. It is so common and present everywhere that the world without it wouldn't be the same. While this topic is not new to academia, with books like [7] mentioning it as early as the 19th century, it has gained more attention and deeper analysis since the 1980s until nowadays.

The following topics explain more about how can embedded systems be defined, and which are the development processes used in the industry. In the end, a specific topic about simulation in embedded systems will be covered.

### 2.2.1 Definition

The definition of embedded systems is not straightforward, that is, there is no agreement in the literature mostly because this term is being under development every day. One possible definition can be a combination of computer hardware and software, and perhaps additional mechanical or other parts, designed to perform a dedicated function [9]. Other authors prefer to highlight their relevant properties, like Raul Camposano in [10], where he focuses on correctness, real-time, and dedicated functions, among others, as the important aspects to retain when an embedded system is mentioned.



**Figure 2.4:** Typical embedded system (adapted from [10])

In the same work, he defined how a typical embedded system is, which is shown in the Figure 2.4. The colors depicted in the figure indicate the significance of each component in the embedded system. Red denotes an indispensable part that must be present. Yellow means that it is not mandatory, although it is recommended. Green stands for optional, as it can be used or not depending on the application. The arrows are used to demonstrate if the communications between the different parts are one-directional or bidirectional.

The processor core and memory are the two crucial components in an embedded system, as they are interdependent and cannot function without each other. Specifically, the processor requires the memory to access the data that needs to be processed, while the memory itself cannot perform any processing on its own. Therefore, the existence of both components is essential for the development of a functional embedded system. The Direct Memory Access (DMA) is a very important part of an embedded system. It is a hardware mechanism that allows peripheral components to transfer their Input/Output (I/O) data

directly to and from main memory without the need to involve the system processor. Thus, the significance of this lies in its ability to avoid a substantial processor overhead [11]. ASIC, as the name suggests, is used for specific applications where dedicated hardware is required. It can reduce significantly the execution time of certain applications [12] [13] [14]. For this reason, many systems require dedicated hardware however, it will become a smaller fraction as time passes [10]. General Purpose Input Output (GPIO), Anallog-to-Digital Converter (ADC), and Digital-to-Analog Converter (DAC) are examples of parts that establish an interface between the embedded system and the external world, and so they can be needed or not depending on the requirements.

The range of embedded systems is very wide. It starts from low power, like receiving data from ADC and sending it to a database, to multi-core applications, for example, laptops, smartphones, or even tables. Hence, when designing a system, it is important to comply with the SWaP-C constraints, which means that the system can only use the essential resources, commonly referred to as "resource-constrained devices" [15].

Furthermore, embedded systems should be dependable, that is, they should perform as and when required [16]. The following characteristics must be fulfilled for dependability, nevertheless, different applications can attend better to different topics. In the case of cars, trains, or planes, the focus should be more on safety, whereas databases or banks should prioritize security.

1. **Reliability:** Continuity of service delivery while in use, e.g., the probability of the system working properly since it worked after startup
2. **Maintainability:** Capability to be retained in, or restored to a state to perform as required, under given conditions of use and maintenance
3. **Safety:** Ability to not cause catastrophic effects on the environment as a consequence of a failure
4. **Availability:** Capacity to be in a state to perform as required
5. **Security:** Capability to provide communication confidentiality and authentication

## 2.2.2 Operating Systems

An embedded system can be complex. Take a laptop as an example. There are lots of different resources to manage, each one with its characteristics. Screen, mouse, keyboard, I/O devices, and network interfaces are some examples. Creating code to control all of them efficiently is almost impossible since the programmer needs to understand in detail every component. Operating Systems (OSs) came to solve this problem by providing a software layer that not only abstracts all the details for the user but can also optimally manage all the resources. The next picture presents how an embedded system with a OS can be divided. It is important to point out that every embedded system does not need to have an OS, for example, a device that only reads a ADC port and sends the information does not need that.

**Figure 2.5:** Layers of an OS [17]

Going into detail in the OSs, one important concept to keep in mind is the **process**, which is an instance of an executing program. Processes are crucial in a way that they allow the system to have multiple jobs at the same time. Following the previous example, a user with a laptop can play music and surf the internet at the same time. Even with one CPU, OSs have the capability of keeping track of multiple processes, by switching from process to process speedily, creating the illusion of parallelism. This concept is called **concurrency** and it is illustrated in the Figure 2.6

**Figure 2.6:** Concurrency between processes

Note that, in this example, it is considered a system with one CPU core. It is clear that if the system has more cores, which is common nowadays, the OS can split the work among them and have real parallelism. If the system has four CPU cores, each core can execute a process at a time, increasing the performance even more. To do this concurrency, the OS, more precisely the **scheduler**, manages the processes by attributing states to them. There are three states: running (the process is in execution at that time), ready (the process is not running at the moment but it is ready), and blocked (the process is unable to run until some external event happens).

When thinking about programs, it is clear that one can have multiple works to do at the same time, for instance, a web browser. It needs to display the content, receive the keyboard and mouse information, receive web packages, and so on. If it was only receiving the keyboard information, it wouldn't do the remaining jobs making the process inefficient. Therefore, processes can have multiple **tasks** that run within the process, which are called **threads**. Threads are essential for the following reasons [17].

1. They have the ability for the parallel entities to share address space and all of its data among themselves
2. They are lightweight and easier/faster to create and destroy than processes (Around 10-100 times faster)
3. They can increase performance when there is substantial computing and I/O because it allows these activities to overlap
4. They are useful on systems with multiple CPUs, where real parallelism is possible

Processes are assigned independent memory regions, and the threads within the process share the process memory between them. A process can have multiple threads, but a thread can have only one process associated. Without threads, a process contains one single stack and a set of registers. With threads, each of them has a stack and a set of registers. **Execution context** is represented with a combination of registers and stack, representing the CPU state for each task. Figure 2.7 illustrates the previous notion.



**Figure 2.7:** Process and threads block diagram

With threads can also be applied concurrency, in order to have parallelism and better computational performance. As processes, the scheduler defines which task will be executed. There are four states: run (the task is in execution at that time), ready (the task is not running at the moment but it is ready), wait (the task is waiting for data), and null (the task is created or killed). Figure 2.6 can also be applied to this context.

Nevertheless, in reality, when a task is running and another replaces it, there are some steps to pursue. First of all, it is necessary to save the CPU state of the running task. After that, the scheduler needs to select another to get running. Then, it restores the context of the selected one, so it can transfer the execution control. All this process is called **Context Switch (CS)**. The time consumption of the CSs are a problem, as present in the figure below. However, the benefits of having concurrency, as explained earlier in this section, are massive, and thus the trade-off is positive.

**Figure 2.8:** Context switch

Another issue that can take place is when two different processes are using the same memory. By definition, they are assigned independent memory regions, but there are situations where this is not true. The most common is when two or more different processes interact with each other, where one, for example, writes in the memory and the others read from it. To control this **Inter-Process Communication (IPC)**, two mechanisms were created: data transfer and shared data.

Data transfer involves the concepts of writing and reading. The most typical IPCs are pipes and message queues. The main difference between both is pipes are uni-directional communication channels, while message queues are bidirectional.

Shared data is a memory region that is shared by multiple processes. Thus, if a process wants to share some data, it only needs to make it available in the shared memory region. For this reason, it is the fastest IPC available because the data transfer occurs at the speed of memory access, but it is dangerous because it has the same meaning as global variables, so there is no control to access those. The next figure shows a possible problem that can happen with shared memory.

**Figure 2.9:** Two different tasks using the same resource

In this example, task A reads the value from the shared memory region, executes an equation, and writes the result in the same variable. For instance, this equation can be as simple as summing 0xFF ( $Y = X + 0xFF$ ). Task B reads that value and prints it on the terminal. As demonstrated in the figure, task B will read 0x0000 even though task A was using the variable. In the case of multiple tasks and the result

of one matters to another, this issue can be a serious problem. This problem is called **race conditions**, and it is more critical with the increasing parallelism due to increasing numbers of cores.

To avoid that were created **task synchronization objects**. The most important methods are semaphores and mutexes. The first one can be divided into binary semaphores and counter semaphores, and they are useful methods to synchronize interrupts with a given task. The second one is a shared variable that can be in one of two states: owned or free. Binary semaphores and mutexes share many characteristics. Both can be utilized for mutual exclusion, ensuring that only one thread accesses a shared resource at a time, but only the first can be used for synchronization [18]. Regarding the previous example, Figure 2.10 illustrates the same situation but using a mutex.



**Figure 2.10:** Two different tasks using the same resource with mutex

Now, both tasks, before using the shared variable, need to take the mutex. If the mutex was already taken (first trial of task B), the task will receive a blocked resource response, meaning it is unavailable until the owner releases the mutex.

### 2.2.3 Development Models

When building an embedded system, several aspects should be taken into account, like its requirements, constraints, and levels of complexity, therefore one development model that works well for one

project may not work well for another with different requirements or constraints. This section will present some examples of models typically used in the industry and research teams. There are a lot more models with different characteristics, thus before the beginning of the project, all possibilities must be considered.

### **Waterfall model**

The traditional waterfall model is linear, that is, there are well-defined stages and once the project moves to the next stage, it can not go back [19]. Figure 2.11 represents the different stages of the model.



**Figure 2.11:** The waterfall model

Because of this linearity, no previous stages can be reviewed or improved, which makes the traditional waterfall model a good option for projects that have well-defined requirements and a short development cycle. Adetokunbo in [19] goes further and states that it is also viable to use it when altering the software after coding is very much prohibited.

Some variations allow for an iterative relationship between phases and even add more phases. Royce in [20] presents and explains some of them in detail.

### **V-Model**

The verification and validation model, more known as V-model, is a modified version of the Waterfall method [21]. It is non-linear, which means that allows step-backs in the development process. An important aspect of this model is that testing activities like planning, and test designing happen well before coding, preventing bugs or bad-functional systems. [22]. The following picture shows a typical representation of this model.

**Figure 2.12:** The V-Model

The main advantages are the proactive tracking of defects in the various phases, cost reduction in the correction of defects since these are spotted much sooner, and it is simple to understand and apply. On the other hand, it lacks flexibility as any changes require updating the majority of documents, demanding significant time and resources, which can make it challenging for companies to adopt. [21] [22]. Nevertheless, it is arguably the most traditional model utilized for software test management. [23].

Other variations try to improve these downsides. Some examples are the shark tooth and W-model [22].

## Agile Model

Although there are lots of agile techniques, they share common characteristics, including iterative development and a focus on interaction, communication, and the reduction of resource-intensive intermediate artifacts [24]. Therefore, this model states that the project should be divided into mini-projects to remove unnecessary activities that waste time and effort. These mini-projects have requirements analysis, design, implementation, and test [24], and should not exceed 30 days [25]. In the end, all are combined to obtain the final project.

**Figure 2.13:** The agile model

Figure 2.13 shows the agile model. In the review stage, an important aspect happens, which is customer interaction. The customer adaptively specifies his requirements for the next release based on the observation of the current release, rather than speculating at the start of the project [26].

### Spiral Model

The spiral model is similar to the agile model, nevertheless, this one focuses on risk assessment and minimizing project risk. This can be achieved by breaking a project into smaller segments, which then provide more ease of change during the development process [27]. In Figure 2.14 is presented the spiral model. It is possible to notice that it requires a lot of time spent to evaluate the risks and planning or reevaluating the project every time a cycle is complete.



**Figure 2.14:** The spiral model [28]

Hence, this model is recommended for medium to high-risk projects, when risk evaluation and costs are important, and when significant changes are expected [27]. Moreover, it is also applicable to very large, complex, and ambitious projects [28].

#### 2.2.4 Embedded Simulation

All the models presented in the subsection 2.2.3 have the verification stage, which is crucial to validate if the developed system meets the desired specifications. This validation normally requires the construction of a prototype in a way that engineers can evaluate its performance, test its functionality, and identify any potential issues or shortcomings. For example, in the automotive industry, where the most commonly

adopted model is the V-model [29], the construction of a mock-up is necessary to identify flaws, delaying the overall development process.

Therefore, simulation in the context of embedded systems is essential since the system can be tested without having the physical prototype, and some requirements, such as cache hits/misses, and computer performance, can be evaluated [3]. A simulator of this type is commonly referred to as a Full System Simulator (FSS) or a Virtual Platform (VP), and they can be described as a computer architecture simulator that simulates software in an electronic system, being this independent of the nature of the host computer. Usually, this term is mixed with emulation because although it can also run software tests inside flexible, software-defined environments, an emulator goes beyond by simulating both software and hardware configurations, being useful to test how software interacts with underlying hardware or a combination of hardware and software.

There are two types of FSSs, full system hardware simulator and full system software simulator. The hardware version offers a cycle-accurate simulation. As the name suggests, this technique is employed to perform a comprehensive analysis of the simulated embedded system at the clock level. Therefore, it is possible to accurately simulate hardware state transitions, obtaining specific information as the state of all logic gates or registers. Nevertheless, it can have very poor simulation performance because hardware-level simulators are not suited for the simulation of such complex systems as the embedded ones [30]. Figure 2.15 shows the simulation of a system using a hardware simulator. Some examples of this kind of simulator are GHDL [31] and Icarus Verilog [32].



**Figure 2.15:** Full system hardware simulator [30]

The software version typically employs instruction-accurate simulation, where computations are performed according to the instruction set. However, this type of simulation does not provide information about the execution time of each instruction, resulting in a less detailed simulation that runs faster. Because of that, the hardware is not described in Hardware Description Language (HDL), unlike the previous version, as presented in the Figure 2.16.



**Figure 2.16:** Full system software simulator [30]

The hardware model must be defined for the software simulation, thus these may or may not be available for the target machine. It requires the selection of a simulator that aligns with the application scenario or offers the flexibility to expand the range of supported hardware modules, such as QEMU [33] and Gem5 [34].

As modern embedded and integrated systems become increasingly complex, an escalating number of design companies are embracing virtual prototyping methods [35]. VPs should follow this trend, that is, as embedded systems are becoming more complex, VPs should be faster yet accurate, so they can be a reliable option.

The problem is VPs are not following this evolution, which results in low performance, and thus high simulation times [3] [35] [36]. Although several works try to solve the problem, none of them have been able to provide an effective solution.

## 2.3 Discrete Event Simulation

Simulations, while valuable, cannot guarantee 100% reliability due to certain scenarios that may be challenging to evaluate. For instance, certain physical phenomenon modeling may not be fully accurate when compared with real-world behavior. However, to enhance reliability, a simulator should strive to accurately represent real-world conditions. Discrete Event Simulation (DES) aligns with these requirements by simulating system dynamics event by event and providing comprehensive performance reports.

A DES can be defined as a simulation technique where state changes (events) happen at discrete instances in time. Events take zero time to happen, and it is assumed that nothing happens between two consecutive events that is, no state change takes place in the system between the events. A group of events organized by execution order is called an event queue or process [37] [5]. From this point forward, whenever the term "process" is mentioned, it specifically refers to the event queue.

Consider a supermarket as an illustrative example. A supermarket system consists of one employee and two cashiers, PAY1 and PAY2. In this context, three events can be specified, and they are related to each other as shown in the Figure 2.17.

- Customer arrives at the supermarket (CA)
- Customer goes to the cashier (PAY1 / PAY2)
- Customer leaves the supermarket (CL)



**Figure 2.17:** Supermarket flow schematic

Clients arrive randomly at the facility, and an attendant directs them to a cashier, according to the following rules: Whether both cashiers are available, the customer goes to PAY1; If only one of them is free, the client goes there; If both are occupied, the buyer waits until one is ready; If another buyer wants to pay, a queue is created following a First-In First-Out (FIFO) style. The next table presents a situation assuming five clients and a payment time of five units of time.

**Table 2.1:** Example of a sequence of events in a supermarket

| Event number | Time | Event description |
|--------------|------|-------------------|
| 1            | 1    | CA1               |
| 2            | 1    | CA1 goes to PAY1  |
| 3            | 3    | CA2               |
| 4            | 3    | CA2 goes to PAY2  |
| 5            | 4    | CA3               |
| 6            | 5    | CA4               |
| 7            | 6    | CL1               |
| 8            | 6    | CA3 goes to PAY1  |
| 9            | 7    | CA5               |
| 10           | 8    | CL2               |
| 11           | 8    | CA4 goes to PAY2  |
| 12           | 11   | CL3               |
| 13           | 11   | CA5 goes to PAY1  |
| 14           | 13   | CL4               |
| 15           | 16   | CL5               |

When the time between arrival and reaching the attendant is considered to be zero, it indicates that there is no significant activity occurring. Otherwise, an additional event would have been included. In addition, when an event occurs, it is referred to as an event timestamp, typically measured in the same unit as the time within the model, known as the simulation time. Wall clock time or CPU time refers to how long the simulation program has been running and how much CPU time it has consumed on the target system. On the other hand, the amount of time spent, known as host or real-time, represents the actual duration taken to perform the simulation.

In this example, some events depend on others, for instance, event number eight is executed only when event seven has been executed as well. However, others are independent, like events one and two. So, even in this simple case, there are relationships between events that can complicate the concurrent execution of events, not allowing for a reduction in wall clock time.

This method has applications not only in the context of embedded systems but also in a wide area of topics. It is traditionally used for industrial applications, for example, in the design and evaluation of new manufacturing processes, and in the establishment of optimum operational policies [38].

### 2.3.1 Continuous Event Simulation

Besides DES, there is another simulation technique, which is the Continuous Event Simulation (CES). The perfect one does not exist, since they should be chosen to take into account the application, therefore

a brief explanation will be made to understand when one is better than another.

Continuous simulation is used for modeling systems with continuous or analog behaviors. The state changes are continuous since they are modeled by differential equations. [39] and [40] are examples of applications where the authors decided to use this technique.



**Figure 2.18:** Updating state over simulated time in continuous and discrete simulation [41]

## 2.4 Simulation Modes

Different FSSs can use different modes to perform the simulation. These modes have an impact on the performance and accuracy of the simulation. For example, while Gem5 is limited to sequential simulation [34], QEMU offers the flexibility to be used in both sequential and parallel simulation [42]. Depending on the application and the requirements of the project one option can be better than another.

According to [43], there are two dominant approaches to advance time within a simulation, asynchronous and synchronous methods. In the first one, each thread communicates with the others in a way that each one can determine when it is secure to execute an event. The number of synchronizations can be reduced by far, although knowledge about the communication behavior of models and deadlock mechanisms, such as roll-back, is required. The second one uses global synchronization times to synchronize all threads. Between these times, all events that will be executed between them can be parallelized. This approach is simpler, nevertheless, it may incur high simulation overhead.

This section will present the two available modes, the sequential and the parallel. Nonetheless, it will also explain the TD technique, since both can use it in order to obtain more performance. Moreover, all further context will be related to the DES technique and the synchronous approach.

### 2.4.1 Sequential Simulation

Sequential simulation, or Sequential Discrete Event Simulation (SDES), is the most simple and accurate simulation mode. It executes the workload sequentially, that is, each event executes at its simulation

time, resulting in a perfect simulation without any error.

Going into detail, the simulator runs the process corresponding or sensitive to an event that should be executed at a specific timestamp. This process will run until a certain time when another event must be executed, and it is not related to the process in execution. In this exchange, there is a CS, where happens a synchronization with the rest of the system. All variables are updated, thus, when a process reads or writes a variable, it accesses the state of the variable as it would be at the current simulation time. The coming image shows a sequential simulation with two different processes.



**Figure 2.19:** Example of a sequential simulation

Picture a scenario of a dual-core system with two processes where each event corresponds to an execution of an instruction with constant time (five nanoseconds), and each CS takes one nanosecond. If the simulation lasts for one minute, ten seconds would be just for CSs. It is visible that it spends a lot of simulation time, and it tends to get worse if more processes are needed. The overhead of the event scheduling and process context switching become the dominant factor in simulation speed, leading to a huge host time.

### 2.4.2 Temporal Decoupling

A technique used by SystemC, which is a standard C++ class library for system and hardware design, to improve performance is TD [44]. TD is a technique where individual processes are permitted to run ahead in a local time, without actually advancing simulation time, until they reach the point when they need to synchronize with the rest of the system. This time slice, from the beginning of the execution until the synchronization, is called **quantum**. The usage of TD may result in a faster simulation in some cases since it increases the data and code locality and reduces the scheduling overhead of the simulator. Figure 2.20 shows the application of TD to the previous example.



**Figure 2.20:** The principal of temporal decoupling

Now, for the same process execution time, the host time was smaller ensuring the existence of less CSs. In this example, the reduction was four nanoseconds (four CSs), circa 11% of host time, compared to the approach without TD. It can be even higher if more processes and CSs were being simulated.

As mentioned earlier, the process is permitted to advance beyond the current simulation time until it requires interaction with another process. This interaction could involve reading or updating a variable that belongs to another. At the moment, two things can happen: either access the current value and proceed, sacrificing accuracy or request synchronization and continue its process when the simulation time aligns with its local time. Proceeding with the current value entails making assumptions about communication and timing within the modeled system. It assumes there will be no adverse consequences when sampling or updating the value either too early or too late. Typically, this assumption holds in the context of a virtual platform simulation, where the software stack is designed to be independent of the underlying hardware timing intricacies.

Each process is responsible for evaluating whether it can progress beyond the current simulation time without compromising the functionality of the model. This is a SystemC characteristic because it guarantees functional congruency with the standard semantics of the simulator thus, it is not a mandatory feature to implement in other situations.

Nevertheless, a problem can occur if the process does not respect the previous rule and runs with no restrictions. Other processes will not be able to run, compromising the system's functionality. A solution is to define a global quantum that forces the synchronizations. This global quantum, in the original version of SystemC, is static, which means it is defined by the user before the simulation and never changes. Forcing the synchronization results in another problem which is the trade-off between speed and accuracy. A small global quantum guarantees that the processes are using always the updated values and the simulator does not crash, therefore the accuracy will be high. However, the simulation speed will be sacrificed as more CSs are occurring. On the opposite side, if the time slice is big, means that the system might introduce timing inconsistencies, which can lead, in the worst case, to a crash in the simulator. Hence, this value must be chosen carefully in order to have a fast yet accurate simulation.

The SystemC reference manual [44] enforces that some processes cannot be temporally decoupled due to their characteristics, thus they might become a simulation speed bottleneck.

From here onwards, whenever the word "quantum" is mentioned, it refers to the global quantum.

### 2.4.3 Parallel Simulation

In the parallel mode, or in a PDES, as the name suggests, the simulation uses more than one simulation thread in order to have real parallelism. In consequence, the host time will be smaller compared to the sequential mode, since multiple processes can be running at the same time. The next figure shows the speed difference between the two modes.



**Figure 2.21:** Sequential VS parallel simulation with and without TD

Continuing the previous scenario, with the parallel mode, the simulation can be fully optimized in a way that each process can have its execution thread. For this reason, the performance of the simulation increases a lot. In this case, for the same number of executed processes, the parallelized version was able to reduce eighteen nanoseconds of host time, which means a reduction of 50%. With TD, the gain was pushed further when compared to the approach without it. However, it does not only offer advantages due to the several challenges that make achieving PDES difficult. Typical problems are load imbalance, limited parallel work, and causality errors [45] [46].

When working with tasks there is always the problem of balancing the workload among them to make an efficient use of today's multicore computers. An inefficient distribution can create performance bottlenecks. For instance, a system is programmed to read two different GPIO ports, do a logical XOR between both, and write in another GPIO port the result (to turn on a Light-Emitting Diode (LED)) and send it to the terminal. This workload results in four events, which will be assigned to two processes. If process A is attached with only one event, e.g. sending the information to the terminal, process B will be overloaded,

and vice-versa. This issue is easy to solve in this simple example, but in programs with thousands Lines of Code (LoC), it is a challenging job. [47] and [48] are examples of works that help to detect and measure the work imbalance.

It is rational to think that, after observing Figure 2.21, more parallelism will provide an even faster simulation, but this is not true. Normally programs can be parallelized until some point, like the previous example. It is possible to simulate this system with four cores however, it is clear extra cores will not execute anything because there are no more processes to run. For this reason, the performance will not be improved, and the opposite can happen, that is, it becomes even worse [49]. Some works identify this scalability problem to show the programmer how it can be improved [50] [51].

The last problem arises when there are causality errors. Causality errors happen when one event in the future affects an event in the past. Sequential simulation ensures that event B will execute after event A if the timestamp of event A is smaller than the timestamp of event B. In this mode that cannot happen, and if event B tries to change the state variables used by event A, a causality error happens. The following image illustrates the previous scenario, with events colored in green denoting "in execution," and events in gray indicating "scheduled".



**Figure 2.22:** The causality problem

When the simulation starts, event A is executed. Meanwhile, it triggers event B, which has an event timestamp of five nanoseconds. In the sequential mode, since it executes one event at a time, this is not a problem. In the parallel version, it can be, because each execution thread has its own time. Can happen

that one thread is ahead of another, e.g. thread 1, which is executing process A, is at three nanoseconds, and thread 2, which is executing process B, is at ten nanoseconds, and when event A schedules event B to five nanoseconds, this event will not be executed for the reason that a DES cannot execute events in the past. On top of that, as mentioned before, if event B modifies the state variables of event C, the problem arises.

The usage of TD in both modes, may increase the occurrence of causality errors. Not only that, it can create this type of error in another way. Take the coming scenario as an example. A simulator uses the parallel mode to test a system that is writing values into a DAC to play a song in an analog speaker, and, at the same time, solve complex mathematical equations to perform that audio. Supposing the wanted output frequency was the gold standard audio (44.1 kHz) [52], the associated event needed to update the GPIO port every 2.27 microseconds. If the defined quantum is higher than that value, the event will not be executed at this event timestamp. Hence, the causality error was caused by the chosen quantum time. In this particular example, the consequence would be the production of audio that differs from the expected output. Although, in other cases, these errors can break the system functionality and crash the simulator.

As peer with Fujimoto in [53], there are two mechanisms to solve the problem: optimistic and conservative methods. In the first one, causality errors are not prevented at all, although they use detection and recovery strategies to correct that. The correction can be done with a rolling back method, which returns the simulation to a point before the causality error and sets the tighter synchronization. A problem with this method is the performance cost, because of this rewind in the simulation. The second option, on the other hand, avoids the possibility of any occurrence of these errors. An evaluation is done on the event, thus can be identified if the event is safe to process, that is, it is ensured that all events that should influence the given event have been processed before its execution. Some works with optimist approaches are [54] and [55]. For [56] and [57] are used conservative methods.

## 2.5 Gem5

One available FSS used in academia and industry is the Gem5 simulator [34][58]. It had its roots in 2011, from the merge between M5 [59], a CPU simulation framework, and the multifacet General Execution-driven Multiprocessor Simulator (GEMS) toolset [60]. With this combination, it is possible to use the best of both frameworks, the effective support of multiple Instruction Set Architectures (ISAs) and the cache memory and cache coherence models. It is also the result of the work between AMD, ARM, HP, MIPS, Princeton, MIT, the Universities of Michigan, Texas, Wisconsin, and many other institutions.

In addition, Gem5 came to overcome a demand in the market at the time. The main highlights are the flexible framework for researchers to evaluate their design in several different ways; Licensing terms that allow researchers and academia to work together without the pressure of revealing the proprietary information, in the industry case, or not getting credits for their contributions; Defined code style that ensures the code quality remains good so new collaborators understand faster and better the code [34].

This section will explain in detail what are its capabilities and how can it be used. Then will be presented a work done by the ICE team of the RWTH Aachen University, which will be the reference work for this dissertation. To conclude, other simulators will be given to have a wide perspective of what exists in the market and when and where one can be better than another.

### 2.5.1 Overview

Gem5 is a DES platform that has as its main goal to be a community tool focused on architectural modeling. It has a set of CPU models and memory systems, and two different system modes, the System-call Emulation (SE) and Full-System (FS). Figure 2.23 represents the different simulation configurations regarding the trade-off between speed and accuracy.

| Processor     |             | Memory System |        |          |
|---------------|-------------|---------------|--------|----------|
| CPU Model     | System Mode | Classic       | Ruby   |          |
|               |             | Simple        | Garnet |          |
| Atomic Simple | SE          | Speed         |        |          |
|               | FS          |               |        |          |
| Timing Simple | SE          |               |        |          |
|               | FS          |               |        |          |
| In-Order      | SE          |               |        |          |
|               | FS          |               |        |          |
| O3            | SE          |               |        |          |
|               | FS          |               |        | Accuracy |

**Figure 2.23:** Speed vs. accuracy spectrum [34]

As mentioned in the introduction of this section, one characteristic of Gem5 is flexibility. Several simulation configurations allow an adaptation to fit the requirements of the project or the level of wanted detail. A project where scalability is a must-have feature, for example, will not require a detailed CPU model. The CPU models will be explained with more detail in subsection 2.5.2.

In this simulator, components can be rearranged, parameterized, extended, or replaced easily to suit the project's needs. This is possible thanks to its pervasive object-oriented design, with the mix of Python and C++. All major components, like the memory, are SimObjects and share common behaviors for configuration, initialization, statistics, and serialization. Moreover, a user can create a customized SimObject, increasing even more the flexibility. To do that, it is required to define in a Python file the SimObject's parameters, such as instantiation, naming, etc., and in the C++ files define its behavior.

Another remarkable characteristic is the support of different ISAs, counting with Alpha, ARM, SPARC, MIPS, POWER, x86, and recently RISC-V. Currently, not all possible combinations of ISAs and other components are known to be functional, as demonstrated in the Table 2.2

**Table 2.2:** Overview of the supported architectures in Gem5 [61]

| <b>ISA</b> | <b>Level of ISA support</b> | <b>Full-system OS support</b> |
|------------|-----------------------------|-------------------------------|
| Alpha      | High                        | Linux                         |
| ARM        | High                        | Linux, BSD, Android           |
| MIPS       | Low                         | None                          |
| Power      | Low                         | None                          |
| RISC-V     | Medium                      | None                          |
| SPARC      | Low                         | None                          |
| x86        | Medium                      | Linux, BSD                    |

Furthermore, it is also possible to simulate more than one computer system in various ways. It is done by instantiating another system and connecting it via a network interface, creating a client/server pair that communicates with the TPC/IP protocol. The results of the simulation remain deterministic, in other words, a particular input will produce always the same output.

Gem5 uses other tools to perform specific jobs. The most important ones are the pybind11 [62], a lightweight header-only library that exposes C++ types in Python and vice versa, and SCons [63], an open-source software construction tool that orchestrates the construction of software in a way that solves several problems compared to the other build tools. With the SCons scripts present in Gem5 is possible to build different binaries for different purposes.

- **Debug:** This mode has no optimizations, and thus it is the slowest one. It is mostly used when the opt version does not provide enough detail in the debug session.
- **Opt:** It has most of the available optimizations but still with same debug information. Compared to the debug version, it is much faster.
- **Fast:** The fastest binary available. All optimizations are on, and there are no debug symbols. In addition, asserts are removed, although panics and fatsals are still included. This binary should only be used when it is desirable for maximum performance, and the code is very unlikely to have bugs.

### 2.5.2 Simulation Capabilities

As shown in the Figure 2.23, this simulator provides different features that should be used regarding the project requirements. Those are the support for different ISAs, CPU models, and execution modes. Also, it has the capability of modeling multiple systems at the same time, supports multiple devices, two different network models, and different cache coherence protocols [34].

While the event queue is responsible for executing its associated events, the CPU is answerable to schedule the events thereby, it is possible to keep running the simulation even if no CPUs are working.

There are four types of CPU models. Atomic simple, timing simple, in-order, and Out-Of-Order (O3). The atomic and timing models are the simplest ones. The main difference between them can be seen in the figures below. Meanwhile the first completes all memory accesses immediately, which makes it a proper option for tasks like fast-forwarding (a technique used to warm up micro-architectural state), the timing mode models the timing of memory accesses, providing a more accurate simulation.



**Figure 2.24:** Atomic mode



**Figure 2.25:** Timing mode

The in-order and O3 models are "execute-in-execute" models, which means that instructions are executed only in the execute stage once all dependencies have been resolved. For this reason, those are highly accurate.

Gem5 can operate either in FS and SE mode. The FS is more complex and complete, in a way that it supports interrupts, exceptions, I/O devices, and so on. As the name suggests, the SE simulates system calls, like `write()`. This characteristic can be a bottleneck to multi-thread applications nevertheless, results in a faster simulation. If the workload has lots of I/O iterations or OS services, the FS is the proper choice, otherwise the SE may be considered, even because some ISAs do not support the FS mode yet.

Another important capability is the support to the classic and ruby memory system, due to the integration of both works (M5 and GEMS). On one hand, the classic system is fast and easy to configure. On the other hand, the ruby system is more flexible and accurate. Gem5 also offers an extension for the simple version named Garnet. At the time, Gem5 supports HeteroGarnet or Garnet 3.0 which brings improvements compared to the previous version, for example, it can enable accurate simulation of emerging interconnect systems.

Even though Gem5 is a very flexible simulator, according to [34], there are other capabilities the developers want to include. A first-class power model, full cross-product ISA/ CPU/memory system support, and parallelization are some examples. It is evident that there is still much work to be done, but compared to the first review, there are lots of advances and improvements, like the support for the RISC-V architecture and the Kernel Virtual Mode (KVM) [58].

### 2.5.3 Usage

To start using Gem5 it is necessary two things. First of all, the simulator's binary for the wanted ISA is necessary. As mentioned before, there are three different types of binaries, therefore it should be chosen mindfully. Then, it is mandatory the definition of the system. It describes the components comprising the System-on-a-Chip (SoC) and details their interconnections. Gem5 provides in its tutorials the simplest system a developer can use, and it is presented in the following picture.



**Figure 2.26:** Simple system configuration diagram

All the configuration is performed in a Python script, while the functionality of the simulator is implemented in C++. Making changes to the latter requires creating a new binary, whereas in the former case, it does not. The next code demonstrates the instantiation required in the Python script to execute the simulation

```

1 #create the system to simulate
2 system = MySystem ( opts )
3
4 #set up the root SimObject and start the simulation
5 root = Root(full_system = False, system = system)
6 m5.instantiate()
7
8 #instantiate all objects above
9 print("Beginning simulation!")
10 exit_event = m5.simulate()
11
12 #After execution
13 print('Exiting @ tick {} because {}'
14 .format(m5.curTick(), exit_event.getCause()))

```

**Code 2.1:** Script to instantiate and execute the simulation

Finally, to run Gem5, the command should first have the directory of the binary, and then the directory of the script. Furthermore, it can have more arguments to specify the system or the simulation modes, e.g. the CPU model.

## 2.5.4 Par-gem5

A huge problem of Gem5 is its speed. At the moment, its official version only offers a single-threaded simulation. Even with the best CPU or the fastest memory on the host machine, the simulation can not get close to one Millions-of-Instructions-Per-Second (MIPS), which results in long execution times. For instance, while the SPEC2017 integer benchmark can take ten minutes in a host computer, with Gem5 the same workload can take more than two years [3]. As referred in the subsection 2.5.2, parallelize Gem5 is one desired feature to have however, more than ten years passed and there are no developments on this topic, only the support for KVM.

Par-gem5 [3] arises to solve this problem. It was developed by the ICE team of the RWTH Aachen University, with the collaboration of Huawei. It utilizes the multi-threading capabilities of the host system through a modified conservative, synchronous PDES approach, which allows the execution to be dispatched to multiple simulation threads that run independently from the rest of the system for a time  $t_{\Delta q}$  - the so-called quantum or quanta. At the time, only dist-gem5 [56], a work that focuses on simulating distributed systems connected via a Network Interface Controller (NIC), has implemented a parallel extension. Nonetheless, its application is very strict, since can only be used in this type of system thus, par-gem5 comes to overlap this problem and be a generalized parallel version.

In the initialization phase, each CPU is assigned to a dedicated event queue, and all other objects are assigned to the default event queue (q0). Thereby, the total number of threads will be the number of cores plus one. This version cannot be classified either as conservative or optimistic for the reason that it allows causality errors to occur.



**Figure 2.27:** Example of scheduling events in Gem5 [3]

Scenario A represents the typical sequential simulation, and scenario B the par-gem5 scheduling method. When an event is scheduled to another CPU, it can either be executed or postponed to the following synchronization. The event is postponed when its timestamp has passed. Consequently, it can lead to a chronologically incorrect execution order of events, affecting the simulation's accuracy, or in the worst case, flaws in the functionality of the simulated system. Nevertheless, in [3] is proven that the inaccuracy can be kept within a single-digit percentage.

The results of this work were very positive. For example, a speedup of 24.7x was achievable without losing accuracy significantly. Moreover, it has been demonstrated that there exists a saturation point in the definition of the quantum. Based on their tests, this point falls between ten and a thousand microseconds, indicating that increasing the quantum beyond this range does not result in performance gains. Another important conclusion was the relationship between cores and inaccuracy. If the quantum does not change, when the number of cores increases, the inaccuracy also increases.

One way to solve the problem is to set a smaller quantum, but getting the perfect quantum is hard. One can be perfect for one benchmark and bad for another simultaneously. Zurstraßen et al. [64] go further and perform a study where this trade-off is evaluated. In the end, it was concluded the speedup of the simulation as a function of the quantum behaves similarly to a sigmoidal or "S"-shaped curve, while the simulation inaccuracy, on the other hand, grows linearly. Moreover, for the cases studied, 95% of the maximum attainable speedup was obtained at a quantum between 1000 and 10000 instructions, as different quantums suit better for different workloads.

Finding a quantum that allows high accuracy with high speedup is one of the main challenges when running a PDES. There are some works that try to explore this problem and come up with solutions, but none of them at the moment, as shown in the table below, is generalized, real-time, and for PDES simultaneously.

**Table 2.3:** Overview of the different works regarding the quantum definition

| Work                      | Real-Time | Supports PDES | Generalized |
|---------------------------|-----------|---------------|-------------|
| Jung et al. [55] (2019)   | X         |               | X           |
| Glaser et al. [65] (2015) | X         |               | X           |
| Jünger et al. [36] (2021) |           | X             | X           |
| dist-gem5 [56]            | X         | X             |             |

The first work uses a technique that tries to achieve 100% accuracy by rectifying causal errors when they happen, forcing the system to simulate again the faulty part with a smaller quantum. This approach is called a rollback mechanism. One of the best-known methods of this type is the “Time Warp algorithm” [66]. Despite this concept is not new, it is still very present nowadays [54]. Beyond the non-PDES support, a criticism of this method is the performance cost. Because it “rolls back” the simulation every time a causality error happens, the execution time will be much higher.

The second work uses a wiener filter to update the quantum within the simulation. This work was able to obtain great performance since it does not require lots of computational resources. Here, there are two simulators, SystemC, and Verilog, that communicate between them by a ADC. Hence, the quantum that leads to high accuracy should be equal to the smaller latency. To get that, it assumes a stationary process and model order is known, which sometimes is not possible. Moreover, this work is done in the context of a single-threaded simulation, so there is no IPCs being evaluated.

In the context of PDES, Jünger developed a method whose principle is to classify events as relevant or irrelevant. An event is considered irrelevant to others if its synchronization is not related to them, for example, a timer interrupt. With this information, each local quantum is adapted accordingly to the following event. In consequence of that, it is possible to obtain perfect accuracy, without performance costs. To do this classification, firstly the simulation must be run, and then with the results, the events can be classified. This makes the technique not so desirable since the user needs to execute twice the same workload.

## 2.5.5 Other Simulators

Like Gem5 other simulators are used for the same purpose. This subsection will discuss some of them, their advantages and disadvantages compared to the simulator under study, to have a better perspective of what is offered about this subject at the moment.

Starting with QEMU (Quick EMULATOR) [33], as the name suggests, it is a versatile full-system emulator that can emulate various architectures, including x86, ARM, PowerPC, and more. As explained earlier in the subsection 2.2.4, there is a difference between an emulator and a simulator, yet QEMU was considered due to other similarities, like the open source availability, the support to different architectures, and the

active development community. José Morales in [67] compared these two simulators in detail and, in the end, the conclusion of his study can be summed up in the next figure.



**Figure 2.28:** Final assessment charts [67]

One characteristic of QEMU is its loosely timed coding style, which provides high simulation speed environments. Cycle-accurate coding style, as Register-Transfer Level (RTL) simulators, on the other hand, cover fine-level Intellectual Property (IP) tuning. This is useful to validate hardware design. Gem5 can be positioned within the spectrum as an intermediate solution, offering a balance between performance and power exploration for early system-level solutions.



**Figure 2.29:** Modelling solutions spectrum [68]

Another platform already mentioned is SystemC [44]. It is a set of C++ classes and macros which provide an event-driven simulation interface for system and hardware design. Its main purpose is to provide a C++-based standard for designers and architects who need to address complex systems that

are a hybrid between hardware and software. Furthermore, it can be used also as an HDL, still, it may require an evaluation due to its syntactical overhead compared to other options like Verilog or VHDL. It also supports Transaction-Level Modeling (TLM)-2.0, which allows us to model the communication as function calls. Events are applied to an entire transaction payload, rather than to individual bus signals, and at the protocol phase boundaries, rather than at clock edges [69]. Thus, the simulations have much better performance, 100-10000 times faster, when compared to the cycle-accurate version. The main advantages are its standardization in the industry, lots of documentation and work, and the capability to quickly simulate hardware and software systems on different levels of abstraction. Contrarily, it is not so flexible hence, it requires external modules to complement itself, e.g., it can not perform cycle-accurate simulations, and most of the modules/ IPs are not free of charge, meaning extra costs.

Ayaz Akram and Lina Sawalha developed a work where a comparison of x86 computer architecture simulators is done [70]. In this study are evaluated four simulators: Gem5 [34], Multi2Sim [71], Sniper [72], and PTLsim [73]. The authors selected these simulators based on their diverse design approaches regarding the level of detail and abstraction. All of them are modern simulators that are actively being developed, except for PTLsim, which is currently not undergoing active development but is still widely utilized. The following image presents a features comparison between the previous simulators and ZSim [74].

| Feature                                 | Gem5  | Sniper | PTLsim | Multi2Sim | ZSim |
|-----------------------------------------|-------|--------|--------|-----------|------|
| Platform support                        | P++   | P      | P      | P+        | P    |
| Target support                          | T++   | T      | T      | T+        | T    |
| Full system                             | ✓     | X      | ✓      | X         | X    |
| Fast forwarding & cache warmup          | ✓     | ✓      | X      | ✓         | ✓    |
| Checkpointing                           | ✓     | X      | X      | ✓         | X    |
| Trace generation                        | ✓     | ✓      | ✓      | ✓         | ✓    |
| Details of generated performance stats. | D++   | D      | D+     | D+        | D+   |
| Pipeline depth configuration            | ✓     | X      | ✓      | X         | ✓    |
| Energy and power modeling               | E++   | E      | E      | E-        | E    |
| In-order pipeline support               | ✓     | ✓      | X      | X         | ✓    |
| HMP support                             | M,G,S | S      | X      | M,G       | S    |
| GPU-Modelling                           | ✓     | X      | X      | ✓         | X    |
| Multi-threaded app. support             | ✓     | ✓      | ✓      | ✓         | ✓    |
| Community support                       | C++   | C++    | C-     | C         | C+   |

Note: [feature's 1st letter]++ is better than [feature's 1st letter]+ which is better than [feature's 1st letter]  
which is better than [feature's 1st letter]- , S=Single-ISA, M=Multi-ISA, G=GPU

**Figure 2.30:** Feature comparison between simulators [70]

In their work, these platforms were tested with three different benchmarks. In the end, was possible to conclude that Gem5 is a good option when very detailed results and multiple ISAs support are wanted. Sniper is specifically designed for multi-core simulations and is considered the most accurate among the simulators examined. However, it lacks the capability to generate detailed performance statistics for the simulated system and is comparatively less flexible than the other simulators. Then, if the focus is a CPU-GPU architecture simulation, Multi2Sim proves to be a great preference. Finally, PTLsim did not get

a best-case scenario to be used nevertheless, it is the base of other x86 diverse simulators, such as the MARSSx86 simulator [75].

## 2.6 Machine Learning

Machine Learning (ML) is at present-day an extremely important subject, encompassing applications in both smaller devices like watches and larger ones such as cars. Autonomous driving is one example of what can be done with machine learning [76]. Although this theme emerged recently, with the evolution of computers, it was first defined by Arthur Samuel in 1959, who states that *it is a field of computer science that gives computers the ability to learn without being explicitly programmed* [77].

There are various types of ML algorithms, such as classification, error cancellation, and prediction. Depending on the project requirements, these algorithms can be utilized individually or combined in a manner that complements each other to address complex problems, like [76]. The following picture shows some of the commonly used algorithms in ML.



**Figure 2.31:** Examples of ML algorithms [78]

Within this dissertation, only neural networks will be considered. They can be divided into Artificial Neural Networks (ANNs) and Biological Neural Networks (BNNs). A Biological Neural Network (BNN) is a network of neurons that are connected by axons and dendrites, and the connections between neurons are made by synapses. An Artificial Neural Network (ANN) will be explained in detail in the next subsection. Also, it will be presented how ML can be applied to simulation and the improvements it brought.

### 2.6.1 Artificial Neural Networks

ANNs are types of ML that are inspired by the operation of the human brain. They try to model it to implement a particular task or function of interest. To do that, the biological neuron was studied in detail to understand how it works. How the information flows from one to another, how it adjusts the

**Table 2.4:** Comparison table between a biological and an artificial neuron

| BNN              | ANN                             |
|------------------|---------------------------------|
| Dendrites        | Inputs                          |
| Soma / Body cell | Summing and activation junction |
| Synapses         | Weights or interconnections     |
| Axon             | Output                          |

importance of several inputs, and so on. Figure 2.32 and Table 2.4 represent how a biological neuron can be transformed into an artificial one.

**Figure 2.32:** Comparison diagram between a biological and an artificial neuron

Like the human brain, ANNs also need to learn. There are three ways to train an ANN, which are also inspired by human actions. Supervised learning is a method that teaches the NN by providing the outputs for certain inputs. The predicted outputs are compared to the real ones, and then weights are adjusted according to the error. In the real world, it can be compared to heating food. The human knows how warm wants his food, but he cannot calculate how much time should it be warming in the microwave. Thus, he makes a prediction, and based on what result, time may be redefined.

In the opposite way, unsupervised learning is a technique where the NN only knows the inputs. The training is done only with the input information, which is a good solution for clustering or density problems, where the data is categorized based on similarities. A great example is when it is required for a person to sort apples. Since there is no information, for instance, the person uses the color to distinguish them and perform the sort operation.

The last method is the reinforced NN. There is an agent that controls the outputs of the system. The NN may receive a reward or a penalty for each action it performs. With this information, it adapts its behavior to receive the fewest penalties possible. A comparison can be made to a relationship between a mother and her son, where the mother supervises the actions of her son, and reward or alert him when he does something.

Also, ANNs can be classified into two different classes. The Feed-Forward NNs and the Recurrent NNs. In the first one, the information pursuers a linear path, where the output of one neuron will be the input of another. The other way around, the output of a neuron is used as input of the same neuron, which is appropriate for non-linear applications.

## 2.6.2 Learning Rules

As illustrated in the Figure 2.32, in an ANN the inputs, before summing, suffer an adjustment by the synaptic weights. To define these weights, learning rules are used, that is, self-adaptive equations that update the weights and bias levels of neurons, enabling a neural network to learn from its input data and enhance its performance. There are diverse types of learning [79]. Some examples are:

- Hebbian learning rule
- Perceptron learning rule
- Delta learning rule (Widrow-Hoff rule)
- Competitive/correlation learning rule (Winner-takes-all)
- Outstar learning rule (Grossberg learning)

For this thesis, only the delta learning rule or Least Mean Square (LMS) algorithm will be considered. It was introduced by Widrow and Hoff in 1960 in [80], and its objective is to minimize the sum of squares of the linear errors over the training set. Additionally, a typical neuron used in NNs is the "ADaptive LInear NEuron" or ADALINE. It consists of an adaptive linear combiner cascaded with a hard-limiting quantizer, which generates a binary output (-1 or 1) used, e.g., for classification problems. A schematic can be seen in the figure below. If it is removed, the output will be analog, which may be useful, for instance, to noise-canceling applications [81]. Further, it is possible to have multiple layers of neurons to address more complex systems (Multiple ADaptive LInear NEuron, or MADALINE).



**Figure 2.33:** Schematic of ADALINE [80]

The linear error is defined to be the difference between the desired response and the linear output. After that, this error signal is used to train the NN, regarding the Equation 2.1 [82].

$$W_{i+1} = W_i + 2\mu\varepsilon_k X_{i_k} \quad (2.1)$$

Where  $W_i$  is the weight associated with the input  $i$ ,  $k$  defines the time,  $\mu$  is a parameter that controls stability, called learning rate,  $\varepsilon_k$  is the linear error, and  $X_{i_k}$  is the input  $i$ .

### 2.6.3 Machine Learning in Simulation

ML and simulation are two different topics that can complement each other. For example, when evaluating risk management, where the behavior of the system is based on causal relationships, hidden dependencies, etc. [83], a combination of both worlds could be beneficial. L. von Rueden in [84] presents a work that defines three subfields when combining both, simulation-assisted ML, machine-learning-assisted simulation, and hybrid.



**Figure 2.34:** Subfields of combining ML and simulation [84]



**Figure 2.35:** Different domains of a complex system

Machine learning in simulation is often used to detect patterns in data or support the solution process. It can integrate all four components of the simulation. These components are: model reduction [85], input parameters [86], numerical method [87] and simulation results [88]. Because of this versatility, ML is an important part of the simulation world and should be used to obtain better conclusions and performance. Yet, it is not being fully exploited, that is, simulations, especially in the industry, are run with very specific analysis goals, ignoring further analysis that could be interesting for other contexts [84].

## 2.7 Co-Simulation

The development of truly complex engineered systems that integrate physical, software and network aspects is on the rise [89]. The simulation of these all together is not reliable hence, the need to divide the system into different teams arises. Each one develops a part of the final solution, that, in the end, is integrated into the others. Different abstraction layers may demand distinct simulation tools, as some are finely tuned for specific domains, yielding results that closely mock real-world behavior. The Figure 2.35 shows an example of different domains in a complex system.

Normally, each part is tested and verified alone, that is, without iterations from other modules [89]. However, information on another part of the system may be needed, for example, a wave input to produce a sound. A possible solution can be the generation of local files containing the necessary information nevertheless, interactions are not thoroughly tested. This may create validation failures, which can result in delays in the development and extra costs in the development.

Co-simulation is proposed as a solution to overcome the latter issue [89]. It consists of a combination of simulators that execute concurrently, with one providing information to the others and vice versa. This approach is used in different domains, as shown in the Figure 2.36. To have the co-simulation environment



**Figure 2.36:** Research publications of co-simulation applications over five years[89]

it is required an API that creates the interface between the simulators like LibSystemCTL-M-SoC, which contains various SystemC/ TLM-2.0 modules that enable co-simulation of Xilinx QEMU, SystemC/ TLM-2.0 and RTL [90]. If no API is provided, the simulator can also be extended to allow such interactions, as long as it is open-source.

One of the challenges when running a DES co-simulation is synchronization. As previously mentioned, there are different iterations of different abstraction levels, which can result in temporal misalignment. In other words, causality errors can happen due to a distinct time granularity. This problem is similar to the one faced when executing a parallel simulation, and the solutions applied to address them are remarkably consistent.

### 2.7.1 Co-Simulation on Gem5

Until the present date, few works have used gem5 in a co-simulation environment. The primary reason for this could be that official support is available only for SystemC [58]. However, the open-source nature of Gem5 allows for extensions and integrations of additional tools. An example of that is the work developed by Manu Komalan in [91], where gem5 and NVMain work together to obtain a good simulation accuracy.

SystemC is being widely used in the industry and research hardware design space exploration [92]. Nonetheless, there is a notable scarcity of precise, freely available, adaptable, and lifelike SystemC models for contemporary CPUs. Christian Menard in [92] developed an API that allows full interoperability between the two simulators, which is now part of the official version of Gem5 [58].

In the paper, there are presented three possible scenarios for the co-simulation environment. The issue here is that achieving this co-simulation requires compiling Gem5 as a library and then using it in SystemC. This does not fully adhere to the previously mentioned co-simulation definition because the two simulators are not running simultaneously in separate processes. This limitation can lead to verification failures as direct inputs from other simulators to Gem5 are not possible.

# **3 | Dynamic Quantum Extension**

The development of par-gem5 [3] solved one problem of Gem5 [34], which is low performance even with a power full host machine. As exhibited in the subsection 2.5.4, the accuracy is no longer perfect, and a trade-off between these occurs. Concerning the current works on this subject, none of them fulfilled the three characteristics presented in the Table 2.3, therefore they can not be employed in the par-gem5.

To overcome these limitations and find the optimal quantum automatically, some algorithms in the scope of this dissertation were developed. During the chapter, these algorithms, which are called ADALINE-based, step ladder, IFP, and loop detection, will be described and explained in detail. Regarding the overall simulator, these will operate on the synchronization process, more precisely in the quantum definition, as presented in the Figure 3.1.

All CPUs are running independently from each other until they reach the barrier, which is the defined time to execute synchronization. Due to this independence, each one has its local time thus, the CPUs will not hit the barrier at the same moment. To avoid parallelism issues, the synchronization process is only executed when all have reached the defined time. When starting this process, the simulator verifies if there are more events to carry out and establishes whether the simulation should continue or not. If affirmative, the postponed events are scheduled to their respective event queues and the next synchronization is programmed for the current time plus the result from the developed algorithm. If not, it finishes the workload.

The first section will present an introduction to the developed extension, where the changes in par-gem5 and the used benchmarks will be exploited. After that, the remaining sections will outline the advantages and disadvantages of each algorithm, consistently comparing their performance with the previous scenario. In the end, an evaluation will be done, resulting in the definition of the definitive algorithm and comparing its results with the static version already present in the par-gem5.

## **3.1 Par-gem5 Changes and Benchmarks**

Before going into the algorithms' design, a brief introduction about the changes in par-gem5 and the used benchmarks must be done. These changes were done to integrate the usage of the dynamic quantum, and to better evaluate, at the end of the simulation, its performance.



**Figure 3.1:** High-level diagram of par-gem5 operation mode

### 3.1.1 Par-gem5 Changes

The first and most important one was the creation of the *QuantumMode*, and the other was the inclusion of more simulation statistics regarding the dynamic algorithm.

Par-gem5 has two possible simulation modes, sequential and parallel with a static quantum. When the user chooses a parallel mode, the simulator already knows that it will use a static quantum. If the user previously provides a value, it will be applied as the new synchronization time otherwise, a default value is selected (1 microsecond). By adding the dynamic version, the user needs to specify which one will be employed, hence the prior approach does not work.

Concerning this, the *QuantumMode* type was created to adhere to the new situation. This variable type can have two states, static or dynamic, representing the two available versions. To integrate it into the simulator and be able to recognize it, a couple of steps were done. First of all, in the *params.py* file, the *QuantumMode* was defined and added to the list of the simulator's available types. After that, the *quantum.py* file was created to implement the name translation, due to the fact that the names used on the user side are different from the simulator side. Then, in the *Root.py* file, it was added the *QuantumMode* parameter in order to allow the user to select the desired option. By last, the *simulate.cc* was modified to be able to select between the two versions. Furthermore, an alteration was required on the event responsible for the synchronization process, *QuantumGlobalSyncEvent*, to be capable of identifying which version is being used.

At the end of each simulation, par-gem5 provides a set of statistics that reflect the behavior of the performed simulation. Examples are the number of quanta with at least one postponed event (*errQuanta*), the absolute error of par-gem5 in seconds (*errSimSeconds*), and the number of events postponed (*eventsPostponed*). However, to have a better view of how the dynamic algorithm was executed, the *quantumMean*

and *DynReset* statistics were included in the final simulation results.

The first, as the name suggests, refers to the mean of applied quantum. During the simulation, the dynamic algorithm may vary the quantum value over the simulation process hence, it may be useful to have an idea of how the simulation was conducted. For instance, if the workload was focused on performance and the *quantumMean* gives a small value, it might be a sign that something went badly. The development of a dynamic algorithm can be another example since this data can reveal if, in some tests, the quantum was or was not the most adequate.

The *DynReset* gives information about how many times the dynamic algorithm had to reset. This might happen when the dynamic system becomes unstable, providing illogical quantum values. Later in this chapter will be explained better this concept.

### 3.1.2 Benchmarks

When an engineer designs an ASIC, for example, in the end, he needs to verify if the project meets the requirements. Therefore, the engineer needs to execute tests, or benchmarks, to perform the evaluation. They are designed to measure performance, capabilities, efficiency, and other system components. There are different types of benchmarks, each one emphasizing an area of interest. Some examples are CPU, network, and storage.

In the context of this dissertation, it will be used two CPU benchmarks, the bare-metal bubble sort and the NAS Parallel Benchmarks (NPB). These will be executed on a host and target system with the configurations exhibited in the Table 3.1a and Table 3.1b, respectively.

**Table 3.1:** System Configurations

| (a) Host |                                         | (b) Target         |                                        |
|----------|-----------------------------------------|--------------------|----------------------------------------|
| CPU      | AMD Ryzen 3990x (64 cores, 128 threads) | CPU                | ARM64, AtomicSimpleCPU @ 2GHz          |
| RAM      | 128GB of 3200MHz DDR4-DRAM              | Caches             | 64kiB L1-D, 32kiB L1-I, 2MiB L2 shared |
| OS       | Ubuntu Linux 20.04                      | Main Memory        | 2GB of DDR3 RAM @ 1600MHz              |
|          |                                         | Periph. Sub-system | Real View Virtual Express V1           |

#### Bare-metal Bubble Sort

The main objective of this benchmark, as implied by its name, is to rearrange an array in such a way that elements with higher values are positioned at the top. The algorithm employed for this task is quite straightforward. It involves a loop that iterates through the input list, examining each element in turn. In each iteration, the algorithm compares the current element with the subsequent one, and if the latter is smaller, the values are swapped. This process continues until the entire array is sorted according to the desired criterion. It was designed to attain a near-best-case simulation throughput, meaning that thread synchronizations and accesses to shared memory are reduced to a minimum.



**Figure 3.2:** Bubble sort example

The test will run a pre-defined number of iterations, and once all the cores have completed their tasks, the simulation will conclude successfully. At the first moment, the array is initialized with random numbers, where the lower and bigger numbers are saved in separate values. Then, the bubble-sort the algorithm is executed. To ensure that the simulation ran smoothly, the program compares the first and last values of the sorted array with previously saved values. If there is a match, the execution is considered successful. In case of a mismatch, a failure is detected, and a notification is sent to the user.

This benchmark operates in a bare-metal environment, which means it is implemented without the use of an OS. In a bare-metal setup, the program runs directly on the hardware without the mediation or assistance of an operating system. This approach is often chosen for performance-critical or resource-constrained applications where direct control over the hardware is necessary, and there is no need for the additional services and abstractions provided by an operating system.

## NPB

NPB [93] is a group of a small set of programs designed to help the performance evaluation of parallel supercomputers. In total, there are eight benchmark specifications, five kernels, and three pseudo-applications. The kernel benchmarks are:

- **Integer Sort (IS):** Performs a sort operation where both integer computation speed and communication performance are tested.
- **Fast Fourier Transform (FT):** Solves a three-dimensional partial differential equation using Fast Fourier Transforms (FFTs). Long-distance communication performance is the main evaluation point.
- **MultiGrid (MG):** It is a simplified multigrid kernel that requires highly structured long-distance communication. Thus, short and long-distance data communications are evaluated.

- **Conjugate Gradient (CG):** Computes an approximation to the smallest eigenvalue of a large, sparse, symmetric positive definite matrix. The main goal is to test irregular long-distance communication.
- **Embarrassingly Parallel (EP):** It provides an estimate of the upper achievable limits for floating point performance, with no significant inter-process communications.

The pseudo-applications are the **Block Tridiagonal (BT)**, **Scalar Pentadiagonal (SP)**, and **Lower-Upper symmetric Gauss-Seidel(LU)** benchmarks. Each of these benchmarks is designed to solve a synthetic system of nonlinear partial differential equations using a distinct algorithm. The names of the benchmarks correspond to the specific algorithms employed in solving these mathematical problems.

Moreover, NPB implemented benchmark classes, that is, the problem size of each workload can be modified. There were implemented 8 problem sizes (S, W, A, B, C, D, E, and F), where S is the smaller one and F is the bigger one. In the context of this dissertation, only the W size will be considered.

## 3.2 ADALINE-Based Algorithm

Adaptive filters are commonly used for the reason that they possess the ability to automatically adjust their own parameters and their design necessities with minimal or no prior knowledge of signal [94]. One application of these filters can be in the development of a ANC algorithm. A generic scheme can be found in the image below.



**Figure 3.3:** ANC scheme

$S(n)$  is the input that has the signal mixed with the noise, while  $X(n)$  receives only the noise hence, it is a noise reference. If the noise in  $S(n)$  was the same as the one presented in  $X(n)$ , it would only be needed to subtract the  $X(n)$  to  $S(n)$  nevertheless, this noise is different because it is attenuated, delayed, and filtered by noise path. For these reasons, an adaptive filter should be utilized, allowing  $Y(n)$  to closely resemble  $X(n)$ .  $E(n)$  will be the output without the error. It is also used by the filter in order to adapt their weights.

A comparison can be made to the par-gem5 world. The signal is the quantum, the noise is the simulation error, and the output is the quantum without the error. Also, the  $X_1(n)$  and  $X(n)$  are different due to the impossibility of calculate exactly  $X_1(n)$  in runtime. When a benchmark is running, the simulation error obtained is referred to as the worst-case scenario. In other words, the simulator analyses when there is a cross-schedule event, and the time difference between when that happens and the next synchronization is considered the error since it considers that the event must be postponed. However, most of the time, it is not true, and the event is not postponed, causing a smaller error. To accurately analyse what was the impact of inaccuracy ( $X_1(n)$ ), it would be needed to run the sequential simulation first, killing all the benefits of par-gem5 [3]. The next equation describes how the worst-case error estimation is calculated.

$$e_{rel,t} = \frac{t_{sim,meas}}{t_{sim,meas} - \sum_{i=0}^Q t_{i,max\_pp}} - 1 = \frac{t_{sim,meas}}{t_{sim,est}} - 1 \quad (3.1)$$

In each quantum the postpone-mechanism records which event experienced the most significant time shift caused by the postponement,  $t_{i,max\_pp}$ .  $Q$  is the number of simulated quanta and  $t_{sim,meas}$  is the measured simulation time.

### 3.2.1 Adaptive Filter

Following the Figure 2.33, the adaptive filter will be responsible for adapting the weights of the ADALINE NN. It uses the Equation 2.1 for the training, where it is essential to carefully choose an appropriate learning rate.

The learning rate is a constant that is chosen at the beginning of the simulation and it is not modified. This parameter compromises a trade-off between control speed and stability. With lower values, the learning process is fast nevertheless, it is likely to become an unstable control system. With higher values, stability is granted, but the learning process is very slow, taking a lot of time to be operational. Finding the best value is not linear, therefore an iterative approach was used.

A small simulation was done, and in each quantum synchronization, the values of the quantum and the error were recorded. Furthermore, in this test, the quantum was incremented 1 microsecond every time a record was done. Then, the ADALINE-based algorithm was implemented in Matlab, obtaining the results illustrated on Figure 3.4.

The aforementioned mention trade-off can be observed, where for  $\mu > 1 * 10^{-16}$ , the results were similar to Figure 3.4a, and for  $\mu < 1 * 10^{-18}$ , the system become even more unstable. To choose the learning rate, the logarithmic scale is used, as small variations in the  $\mu$  do not result in a significant change. From the results was concluded that the best value for  $\mu$  is  $1 * 10^{-17}$ .

Although on this test when  $\mu$  was equal to  $1 * 10^{-17}$  did not result in an unstable control system, there is a possibility that it could happen, for example, when there are huge error variations. Typically, these variations occur when the synchronization time is much longer than the ideal case, which requires



**Figure 3.4:** Control action with different learning rate values

resetting the algorithm. Regarding the results on [3], where it was observed the inaccuracy grows with increasing quantum and number of cores, it was also implemented a *safeQuanta*, a value that is smaller than the actual quantum. It depends on the previously mentioned characteristics hence, it can adapt to the present situation. The next flowchart illustrates how the reset system works.



**Figure 3.5:** Reset system in ADALINE-based algorithm

If the ADALINE output is negative, means it becomes unstable and needs a reset. In the reset, all the variables used in the calculation are set into their initial conditions. Although it will sacrifice simulation

performance, this action is lightweight, meaning a negligible variation. Yet, it is desirable to have the least resets possible.

### 3.2.2 TDL

As the simulation precedes, there is more data to analyse, taking away performance from the NN. Stella et al. [95] presented one example of an application where this problem is also present. The solution to make full use of the ADALINE network as an adaptive filter was the implementation of a TDL. Its operating method is described in the figure below.



**Figure 3.6:** TDL working method

One input is considered  $N$  times, being  $N$  the size of the TDL. When a new input arrives, the oldest one is discarded. In the end, the TDL have always the signal at the current time, the previous input signal, and so on. Combining the TDL with the ADALINE, efficiency, and performance are no longer sacrificed. This approach was implemented in the noise source since it receives new data in each new quantum evaluation.

TDL size varies from application to application. In this case, when the size is small, the learning process may be subpar, but it also becomes less sensitive to variations in the noise. Conversely, a larger size leads to more effective weight adjustments through improved learning, but it makes the NN more susceptible to noise variations. To choose the best size for this scenario, a small test was done. The results are presented in the following graph.

**Figure 3.7:** Reset vs. Host

To have maximum performance on the simulator, only base  $2^n$  sizes were chosen, where  $n \in N$ . Observing the outcome, it can be seen that there is a point where the host time does not improve significantly however, the number of resets continues increasing. Each reset has a performance impact, as it requires resetting every variable to its initial conditions. Therefore, it is desirable to minimize the number of resets whenever possible. In the end, it can be concluded that the best size to be used is 32, and for this reason, it was used for the subsequent tests.

### 3.2.3 Quantum Increment

The ADALINE-based algorithm output, that is, the filtered quantum, will be always lower than the source. Using this new quantum implies there will be a point in time when it becomes too small, primarily because it fails to exhibit a value increment. This situation inevitably leads to a trade-off, wherein performance is compromised. To strike the optimal balance, it becomes necessary to increment the quantum value. This adjustment should occur when no errors are detected in the system. Figure 3.8 presents the flowchart for the ADALINE-based algorithm.

**Figure 3.8:** ADALINE flowchart

An error exists when an event must be postponed. Par-gem5 [3] introduces a flag called *postponedInThisQuanta*, which triggers if the previous scenario occurs. When the simulator enters in synchronization state, this flag is verified to decide if the quantum should be incremented or not. The increment value should have a balance, in the way that with a smaller increment, the performance may not be improved, and with a high value the accuracy may be ruined. According to [3], the quantum of 1 microsecond gives a good trade-off between those two thus, it was chosen to use an increment 10 times lower. This value enables a significant balance without causing any harm.

### 3.2.4 Results

The aforementioned benchmarks executed the developed algorithm with different numbers of cores. The results obtained are depicted in the Figure 3.9.

The horizontal subtitle illustrates the executed benchmark along with the number of cores and the corresponding algorithm used, hence 4 ADA means four cores with the ADALINE-based algorithm were simulated. To assess the ADALINE performance, the static version was also executed with a quantum set to one microsecond. In all cases, the comparison reference is the result obtained from sequential simulation. Additionally, the results for a single simulated core are not presented because, in all conducted tests, the sequential simulation exhibited similar performance. Hence, this scenario is not relevant for evaluating the algorithms.

When it comes to performance, it is noticeable that the algorithm yielded better results in certain benchmarks such as bare-metal, NPB EP, and SP. However, in other cases like NPB IS and CG, it did not show significant improvements. Shifting the focus to accuracy, the static version generally outperformed the dynamic approach. Nevertheless, it systematically maintained an accuracy of under 2%, with the



Figure 3.9: ADALINE-based algorithm results

exception of some cases, like NPB CG, where both approaches performed poorly. Lastly, when comparing the host time to performance, similar outcomes were observed, though certain benchmarks exhibited superior results to others.

Concerning the Figure 2.27, it can be stated that the target time obtained with the parallel mode can only be higher than the one obtained with the sequential mode. Two reasons justify the previous sentence. The first one is related to TD, where its use may cause a higher target time due to the obligation of only ending the simulation in the synchronization process. The second one is connected to postponed events since these may delay the simulation. For these reasons, inaccuracy can only be positive. Nevertheless, as shown on the graph, this is not the case. It is worth mentioning that these occurrences are not related to the algorithm itself but rather to par-gem5, as simulation with the static version also evidences this dilemma.

The performance issue may be related to the previous quantum increment consideration. It is clear this value for some workloads is good, but for others may not be the most adequate. Therefore, having a dynamic value would solve this problem, resulting in a more flexible algorithm. Moreover, special attention should be given to its design, as excessively large values can lead to an unfavorable trade-off.

### 3.3 Step Ladder Algorithm

As shown, it was verified in some cases where the performance was equal or lower when compared with the static approach. Concerning the quantum increment, a fixed value was defined following the aforementioned philosophy however, a dynamic approach may bring better results, as different benchmarks require different needs.

Therefore, the ADALINE-based algorithm was updated to integrate this new functionality. Observing the Figure 3.8, the calculation of the increment value (*incValue*) will be done before the if statement, so whether there is a postponed event or not, the *incValue* is always controlled. The new flowchart is represented in the next figure.



**Figure 3.10:** ADALINE-based and step ladder algorithms combined

The concept is to commence with a gradual increase in value initially. If there are no postponed events during this period, then the increment can be escalated. It can be seen as an exponential however, an approach like this would scale too fast, affecting the accuracy. A solution can be the definition of a threshold, where the increment value only increases if no accuracy issues have happened N times in a row. An example is illustrated in the Figure 3.11. It is possible to observe that there is a point, marked in red, where there was no variation in the *incValue*, indicating the occurrence of a postponed event.



**Figure 3.11:** Example of the increment value evolution with a threshold

The *incValue* starts in the smaller value possible (*minValue*), which is equal to the clock period, and can grow until a maximum (*maxValue*) of a hundred microseconds. This value was defined regarding the results on [3] and [64], where the benefits of a quantum of this magnitude are few. The addition method respects the subsequent equation.

$$incValue = 2 * incValue \quad (3.2)$$

Oppositely, the decrement is done by dividing by two the actual *incValue*, until it reaches the minimum. The threshold value depends on how many cores are used by the simulation since the inaccuracy grows with the increasing number of cores. The more cores are being used, the greater the threshold, making it harder to increase the *incValue*. Moreover, the increment value only changes (increases or decreases) if it fulfills the condition for N consecutive times, as shown in the later image. The algorithm starts checking if any event was postponed. If yes, it is considered that an "inaccuracy problem" occurred. If not, it is considered that the quantum value was the right one. Then, the *maxCondition* and *minCondition* ensure that the *incValue* does not cross the previously set boundaries. To control the increment or decrement sequence, it was created the *incIndex* and *decIndex*, respectively. These are incremented or reset, as shown in the flowchart, and they are used to check if the increment or decrement threshold has been reached.



**Figure 3.12:** Step ladder algorithm flowchart

### **3.3.1 Results**

The previous benchmarks executed the new approach with different numbers of cores. The results obtained are depicted in the Figure 3.13.



Figure 3.13: Dynamic increment results

The ADA refers to the first results, while A-SL is related to the new configuration. In general, the dynamic approach with the increment value resulted in better performance. Due to the dynamic threshold, when simulating with 32 and 64 cores, there are cases when the previous approach yielded a better performance. Examples of that are NPB BT, NPB LU, and NPB SP. However, focusing on the remaining cases, on average, the performance increment was about 11%. Additionally, as a consequence of that, the overall host time was lower, reducing it by 1%

Unfortunately, the same did not occur in terms of accuracy. Negative inaccuracy is still present, and in all cases, just with a few exceptions, there was an accuracy reduction, where, in some cases, such as the NPB FT and NPB EP with sixteen simulated cores, it fell below 95%. Regarding the first topic, the reasons are the same as presented in the previous section. About the other, a possible explanation for this could be the nature of the algorithm itself. The adaptive filter may not be designed to handle drastic interventions. As shown in Figure 3.4b, the control action appears to be smooth, even though there are moments, for example, in inter-process communications when it should be more aggressive to prevent postponed events in cross-scheduling. The problem was not evident at the beginning however, with the dynamic increment, it becomes more notable.

One of the requirements for this dissertation is to maintain a maximum inaccuracy of 5% thus, the current approach does not meet this criterion. It is crucial to explore alternative methods and optimizations to ensure that the desired level of accuracy is achieved.

### 3.4 Instruction Flow Prediction Algorithm

As mentioned before, one problem of the previous algorithm is the incapability to reduce the quantum significantly in a short period. A potential solution for this issue could involve predicting when this situation is likely to occur. In other words, an attempt can be made to assess when a postponed event will take place. If it becomes possible to forecast when such an event will be postponed, adjustments can be made to the quantum before it happens. This preemptive action can help prevent significant inaccuracies from occurring.

Before the simulation initiates, the commands destined for execution are loaded into memory and remain unaltered throughout the simulation. These, when executed, can lead to a postponed event. An example of that is the Wait For Event (WFE) instruction, where the CPU enters in low-power standby state. Furthermore, in general, loops are the backbone of benchmarks, whether due to multiple iterations or the benchmark's inherent characteristics, such as testing the transfer of memory blocks between different locations. As a result, the same memory address can be encountered more than one time.

By evaluating the Program Counter (PC), it is possible to understand if the processing of one memory address can result in a postponed event (*PPaddr*), and so avoid its execution with a large quantum. With this idea in mind, the IFP algorithm, as the name suggests, will analyse the PC during the simulation,

in order to perform the prediction of when these will be executed. Figure 3.14 exemplifies an example of how the algorithm works.



**Figure 3.14:** IFP Algorithm

In the beginning, every memory address is considered risk-free, but as the simulation proceeds, it can be changed. When it happens, the processed address becomes a spotted address, marked in red in the figure, until the end of the simulation. All these addresses will be taken into account in the new quantum calculation, thus to avoid possible noise on the attribution, these will be acknowledged as genuine "postponed event generators" only when they are identified regularly.

### 3.4.1 Forecast

Workloads are not straightforward in a way that the PC do not follow a linear path. Instructions like "jump" (JMP) or interruptions have the potential to redirect the execution flow to different memory locations. Understanding exactly what type of instructions the simulator will execute and tracking them is computationally heavy, meaning performance costs. Since the algorithm should be lightweight, it is considered that the PC is linear, that is, the next PC will be the actual plus the instruction width.

First of all, is used an analytic method to find where the program counter will be in the future. As a result of the previous consideration, the Equation 3.3 was developed.

$$FPC = PC + \left( \frac{Quantum * InstWidth}{CyclePeriod} \right) \quad (3.3)$$

$$Quantum = \frac{(FPC - PC) * CyclePeriod}{InstWidth} \quad (3.4)$$

Thereafter, one of two scenarios can unfold, as depicted in the next pictures. First, nothing may transpire if no addresses are identified between the PC and Forecasted Program Counter (FPC). Alternatively, if the FPC corrects its value to the nearest identified address, a quantum recalibration is necessary, achieved using Equation 3.4. Hence, the synchronization will occur right after the execution of the spotted address, avoiding inaccuracy as the cross-schedule event is inserted in the target event queue at the expected time.



**Figure 3.15:** Possible scenarios after the forecast

### **3.4.2 Step Ladder Update**

With the development of this new algorithm, there is a new way to verify whether the quantum is reduced, meaning the old one was greater than the desirable. For this reason, the step ladder algorithm will also verify this situation, counting as an "inaccuracy problem", contributing to the reduction of the increment value.

To determine whether the previous scenario occurred, the quantum value before the IFP algorithm's action is stored and compared to the current. If the stored value is equal to or higher than the current quantum, no action is taken. However, if the stored value is lower, it indicates that the quantum was reduced, triggering the flag.

### **3.4.3 Results**

The results of the bare-metal bubble-sort and NPBs benchmarks execution are presented in the Figure 3.16. As previously, A-SL refers to the results of the combination of the ADALINE-based and step ladder algorithms, and A-SL-IFP points to the IFP algorithm intervention results.

As expected, there was a drop in the performance gain when compared to the previous. This was about 5%, which in practice, reflected in an increment of the host time of around 0.3%. Nevertheless, there are a few cases in the algorithm's action conducted into better performance, for instance, the NPB BT over two and eight simulated cores.

The most distinguished side is the accuracy, where the PC analysis inclusion allowed the removal of the inaccuracy peaks that existed in the A-SL version hence, the previously mentioned cases that passed the 5% are now within the limit. Only the NPB FT with 32 and NPB CG with 64 simulated cores are exceptions. When considering up to 16 simulated cores, there was approximately an increment of about 0.5% in accuracy.

Negative inaccuracy persists in the simulations. While the occurrence of this issue has been reduced in comparison to the initial version, it is important to retain that this problem remains associated with par-gem5, as aforementioned.

It can be concluded the IFP algorithm presence allowed the quantum to immediately reduce its value. It also can be identified when analysing the worst-case scenario of inaccuracy, where drops can reach up to 10%. The tradeoff was even improved since the enhancements in accuracy outweighed the increase in host time.



Figure 3.16: IFP algorithm results

## 3.5 Loop Detection Algorithm

As mentioned earlier in this chapter, loops are strongly present in benchmarks. Thinking in this perspective, if a record of every executed memory address is made, a loop can be identified in real-time. Combining this with the previous algorithms, the quantum could be adapted in a way to have the highest value possible. Furthermore, the accuracy would be nearly perfect since all cross-schedule events would be known, allowing for precise adaptation.



**Figure 3.17:** Quantum definition with loop detection algorithm

When the simulation starts, there is no information about loops or *PPaddrs*. To achieve this, it would be necessary to execute the simulation once and then obtain the results, which do not meet the requirements. One solution is to conduct the simulation in a regular manner while simultaneously recording the executed memory addresses. When a loop is detected, the algorithm begins utilizing the loop to calculate the quantum until the PC no longer follows the loop's path. At that point, the loop is reset, and the process starts anew.

### 3.5.1 Hare-Tortoise Algorithm

The detection method must be light otherwise, the simulation performance is set in danger. One algorithm with this characteristic is Floyd's Tortoise and Hare. It consists of two pointers, one twice faster than the other. If both match at some point, means that there is a loop, as shown in the figure below.



**Figure 3.18:** Hare-Tortoise Algorithm

The flowchart present in the Figure 3.19 describes the loop detection flow. In this case, the faster pointer is the PC, and the slower one is the FTHptr. A match occurs when both are pointing to the same memory at the same time. Going into the process in detail, loop delineation can be achieved through two methods. Firstly, an analysis can be conducted on the preceding memory address to precisely define the loop boundaries. Alternatively, after the initial match, the FTHptr remains in the same position while the PC continues execution. When they match once more, the loop becomes well-defined. In terms of performance, it is trivial the second approach is more underweight, reason why it was chosen. After the detection, it is necessary to keep track of the PC, as mentioned earlier. It is done by comparing the actual PC with the expected one. Whether matches, nothing happens. In the opposite way, the loop is discarded and the entire process of detection starts again.

Although this technique is simple and very effective, it has some inherent problems. The first issue pertains to the inability to detect nested loops. Another challenge is the incapability to identify subsequent loops following the one that has been detected. Lastly, the most crucial problem is the difficulty in detecting conditions within loops. All of these can be solved with the PC track nevertheless, this solution may cut the benefits of the algorithm, since these problems can be recurrent in the benchmark.

Another problem is the multi-thread environment. When more than one CPU is active, the PC executes more than one event queue, meaning an unpredictable pointer. In this situation is almost impossible to identify any loop, thus a solution can be detection application only when one CPU is on. If there is a loop defined and the other CPUs wakes up, this is automatically discarded, and the algorithm stops, restarting again when the later condition reverts.



**Figure 3.19:** Loop detection flowchart

### **3.5.2 Quantum Calculation**

The quantum calculation is done at the synchronization point, as shown in detail in the Figure 3.20. It can be divided into 2 distinct tasks. The first one is to classify the loop addresses. The second is the calculation of the new quantum.

Regarding the primary task, it is only executed when the loop is new, so it is only done once per loop. The loop's size is utilized to determine whether it is the same loop or not. This approach is chosen for its simplicity and the low probability of encountering two different loops with identical sizes. After this verification, an iteration within the loop is made, comparing these with the spotted addresses. If there is a match, that value is associated with a *PPaddr*.



**Figure 3.20:** Quantum calculation flowchart

To calculate the quantum, the last task uses an iteration technique, as demonstrated in the flowchart. The iterator starts where the PC is, and pursues the loop until it either finds a *PPaddr* or reaches the starting point. The quantum starts with the minimum value, which is the clock period, and increments that value in each iteration.

### 3.5.3 Results

Figure 3.21 exhibits the results of the executed benchmarks with the previous (A-SL-IFP) and new (LD) configurations.



Figure 3.21: Repetitions algorithm results

## 3.6 Improved Baseline Algorithm

After the development and testing of all aforementioned algorithms, it was determined the better approach is the combination of the ADALINE-based, the dynamic increment, and the PC analyses. The results show they complement each other, providing a great trade-off between performance and accuracy. The repetition algorithm was not integrated into the final solution due to its weak performance and accuracy gain.

The synchronization process represented in the ?? can now be redefined to integrate the dynamic approach, as illustrated in the following flowchart. The static version was not removed because, with a-priory information about the benchmark, the best quantum can be calculated before simulating. One example is a network application, where the communications delays are well-defined. If the quantum is set with the smaller delay, a perfect accuracy can be achieved [56].



**Figure 3.22:** New quantum definition in the synchronization process

### 3.6.1 Results

The next graphs present a comparison between the improved baseline algorithm and the static mode with one microsecond synchronization time.



Figure 3.23: Improved baseline and static mode comparison

Starting with performance, the dynamic version outperformed the other one in most of the cases. Nevertheless, in the cases where it did not happen, in particular, NPB BT, LU, and SP with 32 and 64 simulated cores, the performance difference was huge. If these are considered, when calculating the mean of performance gain and comparing it with the static version, the result would be -9.1%. Without the previous tests, the gain average would be almost 10%.

Moving to accuracy, in all workloads, the dynamic approach only exceeded the 5% inaccuracy in two cases. On the other hand, the static version performed worst, having six occurrences. In the remaining cases, on average, there was an increment of 0.5% of inaccuracy when using the developed algorithm.

It is important to mention that, in the dynamic case, all of these occurrences are related to simulation issues inherent in par-gem5 nature. In other words, the sum/subtraction of the sequential target time with the worst-case scenario may give a smaller/higher value than the obtained target time. Figure 3.24 demonstrates the previously described scenario. The target time obtained must be within the maximum inaccuracy, which sometimes does not verify. In these cases, the simulation result should not be considered, and for this reason, the dynamic version occurrences were discarded. Meanwhile, the static version also had cases regarding this subject but does not represent its totality.



**Figure 3.24:** Simulation issue in par-gem5

Finally, in terms of host time, the results are again very positive for the dynamic version. Until 4 simulated cores, only one case did not show improvements, and in the left cases, the major part show also gains. There are still situations where it does not verify, especially when 32 and 64 simulated cores. However, it is crucial to emphasize that accuracy should exceed 95%, a criterion that is not met in certain cases, such as NPB IS.

# 4 | Co-Simulation on Gem5

This chapter presents a co-simulation environment using gem5 and SystemC. Since none of the previously mentioned works fully adhere to this topic, it was designed and developed an interface where both frameworks can work in parallel. Further, it was selected a case study, which serves as a stimulus for system design and demonstrates the practical application of the developed work. The chosen case study focuses on the CRC peripheral, which is present in several microcontroller families, such as STM32 [96] or Xilinx Zynq [97] nevertheless, VExpress\_gem5, which is a board introduced by Gem5, does not include it.

In conclusion, par-gem5 with both static and dynamic versions will be tested with the developed interface. The previous tests, which were all done in the sequential mode, will be the reference for these. The main objective is to determine which approach is better in the different situations.

## 4.1 Framework Proposal

As mentioned earlier in this dissertation, co-simulation involves the integration of multiple simulation tools from different domains. Because of that, it is required to develop a system that could execute and communicate among them. The following picture demonstrates how the different tools will communicate with each other.



**Figure 4.1:** High-level proposal design

The communication will be established using a remote port, utilizing the TCP/IP IPC protocol. It was chosen due to its application independence and its versatility for various applications and purposes. The

remote connection must be created and connected before the beginning of the simulation, in a way that transactions occur. Each hardware device will have a unique connection to avoid concurrency problems.

### 4.1.1 SystemC

The peripheral development was carried out using the SystemC tool. It has the capability to simulate multiple hardware devices at the same time, and these can be different from each other. In order to integrate this flexibility, the subsequent design was developed.



**Figure 4.2:** General SystemC design

SystemC design is composed of three main components, which will communicate under the blocking transport method. The initiator, the bus, and the targets. Beginning with the first, it serves as the initial point of interaction for the tool, as it handles incoming frames received through the remote port. The remote port operates independently of the initiator, enabling asynchronous reception and transmission of bytes through the receiver and transmit buffers, respectively. Additionally, it also performs an interpretation of the received trama and creates the TLM transaction. It analyses various parameters such as the operation type, the desired target, and the memory region, among others, with the help of TLM wrapper, which will be presented later on.

The bus is responsible for forwarding the TLM transaction to the right target. Each target has a unique ID, that is attributed to it at the beginning of the simulation. Gem5's board is a 32-bit processor, meaning that each transaction is 32-bit in length. However, SystemC TLM transactions are 64-bit width, leaving 32 bits unused. Before the transmission, the initiator uses these bytes to set the target\_ID, which will be utilized by the bus to identify the targets.

Lastly, the targets are the peripherals, which can be different from each other. These receive the TLM commands and act accordingly. The next images demonstrate how peripherals respond to write and read commands

**Figure 4.3:** Flowcharts of the available commands

The first step is to verify the offset/size out-of-bounds, and permissions in a way that the device's integrity is maintained. If no errors occurred in this process, the desired operation is done otherwise, an error response is created, reporting where the problem was. At the end of each operation, the correspondent latency is returned, which may vary regarding the desired work.

### 4.1.2 Gem5

As aforementioned, Gem5 has available as a target platform the VExpress\_gem5 board. It is based on the ARM Versatile Express RS1, which consists of a motherboard and two daughterboards. The system provides a range of both on-chip and off-chip devices. On-chip devices include the Generic Interrupt Controller (GIC), an LCD controller, and system timers. Off-chip devices contain the Keyboard and Mouse Interface (KMI), Real-Time Clock (RTC), and Universal Asynchronous Receiver-Transmitter (UART). The platform's memory map is divided into the next sections.

1. Boot memory ..... 0x00000000 to 0x03FFFFFF
2. Reserved ..... 0x04000000 to 0x0FFFFFFF
3. Off-chip peripherals ..... 0x10000000 to 0x1FFFFFFF
  - (a) Gem5-specific peripherals ..... 0x10000000 to 0x13FFFFFF
    - i. Energy controller ..... 0x10000000 to 0x1000FFFF
    - ii. Pseudo-ops ..... 0x10010000 to 0x1001FFFF
    - iii. MHU ..... 0x10020000 to 0x1002FFFF

|                              |                          |
|------------------------------|--------------------------|
| (b) Reserved .....           | 0x14000000 to 0x17FFFFFF |
| (c) VRAM .....               | 0x18000000 to 0x19FFFFFF |
| (d) Reserved .....           | 0x1A000000 to 0x1BFFFFFF |
| (e) Peripheral block 1 ..... | 0x1C000000 to 0x1FFFFFFF |
| 4. On-chip peripherals ..... | 0x20000000 to 0x3FFFFFFF |
| 5. PCI memory .....          | 0x40000000 to 0x7FFFFFFF |
| 6. DRAM .....                | 0x80000000 to 0xFFFFFFFF |

It is possible to notice that the Gem5-specific peripherals memory region is not fully occupied, therefore hardware should be implemented in this memory region. In accordance with the remaining devices, each peripheral should occupy 0xFFFF of memory space, unless it requires more space. In this case, another block of 0xFFFF should be used until its needs are fulfilled. After planning this, Gem5 must recognize this hardware to enable communication with the other framework. To achieve that, every peripheral should be defined and added to the list of off-chip devices in the board's configuration. In addition, it is also required to create a Page Table Entry (PTE) for each device, which can be done by following the code on 4.1. However, it's important to note that even after this process, the devices are only recognized as part of the VExpress\_gem5 board, as their implementation still needs to be completed.

```

1 LDR  r1,= DEVICE_ADDR    // Device address
2 LSR  r1, r1, #20        // Find which 1MB block it is in
3 LSL  r2, r1, #2          // Gives offset into the page tables
4 LSL  r1, r1, #20        // Put back in address format
5 LDR  r3, =L1_DEVICE     // Descriptor template
6 ORR  r1, r1, r3          // Combine address and template
7 STR  r1, [r0, r2]        // Store table entry

```

**Code 4.1:** Template for a PTE

To implement a device in Gem5, a few aspects are mandatory. First of all, the peripheral's Python interface. In this document, it is described the type, where implementation is, and, optionally, some parameters to be customized. In this context, these can be the port number, for the remote connection, or the action time delay. In second place, is the implementation itself. Gem5 has implemented a device template, that is present in the class *BasicPioDevice*. By inheriting this, the default configurations are placed nevertheless, read, and write functions still need to be defined, as these change from device to device. The figures 4.4a and 4.4b present its implementation. Further, other aspects must be present, like the device registers, the socket\_ID, and the remote port functions(listen, accept, and detach). The remote connection is done before the simulation starts, meaning that it does not continue until a connection

**Figure 4.4:** Redefinition of *BasicPioDevice* functions

is made. As soon as it happens, an init function is called to state the initialization, with the device's information. It is required due to the possibility of various devices in SystemC. The last point is the *Sconsript* file, which is mandatory by the compiler to understand the available SimObjects. Here should be discriminated the SimObject, that is, the Python file, the SimObject's source file, and the debug flags, if presented.

### 4.1.3 TLM Wrapper

Although Gem5 and SystemC are two frameworks that use C++, they cannot communicate directly with each other. Gem5 is unable to interpret the TLM commands, just as SystemC doesn't understand gem5 packages. For these reasons, it is necessary a wrapper to enable communication between the two platforms. This will be available in both, and every transaction must pass through it otherwise, the channel could be corrupted.

**Figure 4.5:** TLM wrapper payload definition

The main part of the wrapper is the payload definition. It provides coherent communication between platforms, keeps the transactions easy to understand, and improves efficiency. For this case, its definition

can be found in the prior image. Each operation must have a command and a slot, the reason why both have the color red. TCP/IP reads and writes eight bits at a time consequently, even if the command list does not use all the bits, a byte is allocated for their size. The list of commands can be found on Table 4.1. The slot is used as an ID. Regarding the Figure 4.1, each peripheral has a remote connection associated hence, with this, is possible to decode the port where the transaction should be done. Offset and size are marked as blue since they are mandatory for both read and write operations. Data is with green color because only is necessary for the write action. Four bytes are demanded for these parameters, due to the 32-bit processor present in the VExpress\_gem5 board.

**Table 4.1:** TLM wrapper commands

| Bits | Command   |
|------|-----------|
| 00   | TLM_INIT  |
| 01   | TLM_CLOSE |
| 10   | TLM_READ  |
| 11   | TLM_WRITE |

Every transaction is kept up with a response, TLM\_ACK (0) for success, or TLM\_NACK (1) for error, and the operation latency. In the case of reading, is added the bytes of the selected memory region. The next figures demonstrate how the wrapper is implemented in both frameworks.



Figure 4.6: TLM wrapper in Gem5



Figure 4.7: TLM wrapper in SystemC

## 4.2 CRC as a Case Study

To validate the developed interface was chosen, as a case study, the CRC peripheral. The CRC was created in 1961 by William Wesley Peterson [98]. As the name suggests, it utilizes systematic cyclic codes to encode messages by incorporating a fixed-length check value. In the end, his work contributed significantly to simplifying and enhancing the detection of accidental errors/changes in communication networks. CRC uses a generator polynomial, which is known by the sender and receiver, and it is used to perform the calculation. There are different standards however, the most common ones are the CRC-8, CRC-12, CRC-16, CRC-32, and CRC-CCIT [99]

Another application for the CRC is the storage integrity. Due to defective components or electromagnetic fields, bits can change their value without notice. In the presented case study, this scenario will be explored, where the CRC peripheral is used to maintain a specific memory state and verify if there have been any changes. Furthermore, the board will not be only engaged in this operation, as in a real-world scenario, it has additional peripherals to manage. Hence, as a proof-of-concept, parallel tasks will be included in the execution.

It is worth noting that the design process took as reference the STM32 microcontroller family's reference manual [96].

### 4.2.1 Peripheral Development

The CRC development was required in both simulators. Starting with SystemC, first of all, it was defined which hardware will take place. In this example, since the focus will be the CRC, only it will be simulated. Hence, Figure 4.2 can be redefined to the Figure 4.8.



**Figure 4.8:** SystemC design with CRC

The peripheral's characteristics can be observed in the bullet points below, and Figure 4.9 demonstrates how it will function.

- CRC-32 (Ethernet) polynomial: 0x4C11DB7


**Figure 4.9:** CRC block diagram

- Programmable CRC initial value
- Single input/output 32-bit data register
- CRC computation done in 4 clock cycles
- General-purpose 8-bit register (can be used for temporary storage)
- Reversibility option on I/O data

First of all, the user needs to write the input value in the data register (CRC\_DR). After four CRC clock cycles, the correspondent CRC value is fully calculated, and its value is stored in the CRC\_DR. In this case, the CRC frequency will be equal to the CPU. The data width must be 32-bit, hence whether the user needs a CRC for five bytes, for example, two different computations will be needed.

Moreover, the input and output data can be reversed, to manage the various endianness schemes. For the input data, the reverse operation is controlled by the REV\_IN[1:0] bits, and for the output data, the REV\_OUT bit is used. These, along with the reset bit, which is used to reset the CRC, are located in the control register (CRC\_CR). This bit must be set by software, and it is automatically cleared by the hardware. An example of a reverse operation can be found on Table 4.2.

**Table 4.2:** Reverse operation

| (a) Input   |            |                                |               | (b) Output |            |                           |                |
|-------------|------------|--------------------------------|---------------|------------|------------|---------------------------|----------------|
| REV_IN[1:0] | Input      | Reverse Action                 | Reverse Input | REV_OUT    | Output     | Reverse Action            | Reverse Output |
| 0 0         | 0x1A2B3C4D | Not affected                   | 0x1A2B3C4D    | 0          | 0x11223344 | Not affected              | 0x11223344     |
| 0 1         | 0x1A2B3C4D | Bit-reversal done by byte      | 0x58D43CB2    |            |            |                           |                |
| 1 0         | 0x1A2B3C4D | Bit-reversal done by half-word | 0xD458B23C    |            |            |                           |                |
| 1 1         | 0x1A2B3C4D | Bit-reversal done by word      | 0xB23CD458    | 1          | 0x11223344 | Bit-reversal done by word | 0x22CC4488     |



**Figure 4.10:** CRC write operation

By default, the polynomial coefficients are defined by 0x4C11DB7 nevertheless, it can be fully programmable through the CRC\_POL register. It is important to mention that modifications in this register when a CRC computation is ongoing are not permitted, as it would compromise the output value. To conclude the available registers, are missing the CRC\_INIT and CRC\_IDR, which are used to initialize the CRC calculator in the reset, and to hold a temporary 8-bit value related to CRC calculation, respectively.

Finally, the write command must be redefined to adhere to the peripheral characteristics, as demonstrated in Figure 4.10. Since the read operation does not require a specific behavior depending on the desired registers, the aforementioned template does not demand any action.

Moving to Gem5, the first aspect to take care of is the definition of the memory region. Following the previously defined rule, for the under study device memory will be reserved from 0x10030000 to 0x1003FFFF. Then, the PTE entry must be created, with the *DEVICE\_ADDR* parameter equal to the beginning of the device's memory, which is 0x10030000. The last part is the implementation of the write and read functions. Concerning the template present in the Figure 4.4, no modifications were needed to integrate this peripheral, therefore these were employed.

At this point, the user on the application side can write and read directly from memory without any

restriction. However, it can be dangerous, in the way that the user can, by mistake, do an unpermitted operation. For example, write in reversed memory, causing a segmentation fault. To avoid this, it was developed an API for the application side. Similar to STM32 microcontrollers that utilize the HAL library, the CRC device will also have its own hardware abstraction layer. This layer is designed to speed up development and enhance code clarity.



**Figure 4.11:** Class diagram for the CRC API

Primarily, initializing the peripheral is a mandatory step. Without initialization, any attempt to execute an operation will result in an error, and no action will take place. To calculate the CRC value of a given number, the *CRC\_SC\_VALUE\_INPUT* function must be called, with the corresponding number as a parameter. After that, the result can be obtained by calling the *CRC\_SC\_VALUE\_OUTPUT* function. It is important to mention that the calculation takes four clock cycles, thus the function can either return the CRC result or an error, warning that the value is not calculated yet. All the remaining functions are used to control the device.

## 4.2.2 Peripheral Validation

After development, the next step was to simulate and validate the peripheral. For this purpose, it was developed a validation test where all features of the device will be tested. Two cores will be employed for this purpose: one will be dedicated to the CRC, while the other will execute the bubble-sort benchmark. The simulation will be conducted in sequential mode since the main objective is not performance, but accuracy.

Figure 4.12 presents the design of the co-simulation environment. It will be used one CRC and the UART, to communicate with the user by the terminal. As referred, UART is already implemented, hence it only requires its initialization and configuration. For the test, it will be used the UART0, which is present in the memory map from 0x1C090000 to 0x1C09FFFF. To connect to the UART, the m5term can be utilized. It is a dedicated program that allows the user to connect to the simulated console interface. When executing the simulation, this program does not launch automatically, therefore it must be manually called, with `./m5term <host> <port>`.



**Figure 4.12:** Class diagram for the CRC API

The application will run the code present in 4.2. It can be divided into three parts. In the first one, the reverse output will be tested. Then, the settings will change, and the calculation will use a reverse input and a different polynomial. To conclude, the reset function will be checked. By executing this benchmark, every functionality of the peripheral is tested. The expected outputs are:

1. Return from CRC\_DR: 0x22CC4488
2. Return from CRC\_DR: 0xC66CE444
3. Reset was done with success!

```

1 CRC_SC_INIT();
2
3 CRC_SC_SET_REVERSE_OUTPUT(REV_WORD_OUT);
4 CRC_SC_SET_IDR(0x4C);
5 CRC_SC_VALUE_INPUT(0x15e32ef3);
6
7 do //Wait 4 tick
8 {
9     CRC_value = CRC_SC_VALUE_OUTPUT();
10 } while (CRC_value == -EBUSY);
11
12 printf("Return from CRC_DR: %x \n", (uint32_t) CRC_value);
13

```

```

14 CRC_SC_SET_REVERSE_OUTPUT(NOT_REV_OUT);
15 CRC_SC_SET_REVERSE_INPUT(REV_HALF_WORD_IN);
16 CRC_SC_SET_POLYNOMIAL(0x12345678);
17 CRC_SC_VALUE_INPUT(0x1A2B3C4D);

18
19 do //Wait 4 tick
20 {
21     CRC_value = CRC_SC_VALUE_OUTPUT();
22 } while (CRC_value == -EBUSY);

23
24 printf("Return from CRC_DR: %x \n", (uint32_t) CRC_value);

25
26 CRC_SC_SET_INIT(0x4C11DB7);

27
28 if(CRC_SC_GET_IDR() == 0x4C)
29     CRC_SC_RESET();

30
31 if(CRC_SC_VALUE_OUTPUT() == CRC_SC_GET_POLYNOMIAL())
32     printf("Reset was done with success! \n");
33 else
34     printf("Failure in Reset \n");

35
36 break;

```

**Code 4.2:** CRC validation code

After executing the validation benchmark, the real results were the ones present in the Figure 4.13. It is possible to conclude that the peripheral passed the validation since every expected output matches the real ones.

```

root@icelab-16:~/workspace
/usr/bin/arm-none-eabi-cpp boot.s -g -O3 -lunistd -I . -DNUM_CORES=2 -DARR_SIZE=1024 -DREPEAT_WL=100 -DMEM_SIZE=16 | /usr/bin/arm-none-eabi-as -EL -o boot.o
/usr/bin/arm-none-eabi-gcc -g -O3 -lunistd -I . -DNUM_CORES=2 -DARR_SIZE=1024 -DREPEAT_WL=100 -DMEM_SIZE=16 -c -o main.o main.c
/usr/bin/arm-none-eabi-gcc -g -O3 -lunistd -I . -DNUM_CORES=2 -DARR_SIZE=1024 -DTLM_READ -DREPEAT_WL=100 -DMEM_SIZE=16 -c -o workloads.o workloads.c
/usr/bin/arm-none-eabi-cpp armv7.s -g -O3 -lunistd -I . -DNUM_CORES=2 -DARR_SIZE=1024 -DREPEAT_WL=100 -DMEM_SIZE=16 | /usr/bin/arm-none-eabi-as -EL -o armv7.o
/usr/bin/arm-none-eabi-gcc -g -O3 -lunistd -I . -DNUM_CORES=2 -DARR_SIZE=1024 -DTLM_WRITE -DREPEAT_WL=100 -DMEM_SIZE=16 -c -o CRC_SC.o CRC_SC.c
/usr/bin/arm-none-eabi-gcc -xlinker --defsym=NUM_CORES=2 -nostartfiles -xlinker -Map=linkmap.txt -Xlinker -Tmulticore.ld -o main.elf boot.o ../common/syscalls.o main.o workloads.o armv7.o CRC_SC.o --specs=nosys.specs
/workspace/gem5.1ce/build/ARM/gem5.0pt -r -d /workspace/tests/results/atomic/bar e-metal/2c_reg --debug-flags=Crc /workspace/gem5.1ce/configs/example/fs.py --b Received Slot: 0
are-metal --cpu-type=AtomicSimpleCPU --caches -l2cache --machine-type=VExpress_Received offset: 4
GEM5_V1 --kernel=/workspace/tests/build/main_2c.elf --num-cpus=2 Received Size: 4
build/ARM/base/statistics.hh:277: warn: One of the stats is a legacy stat. Legacy stat is deprecated.
y stat is a stat that does not belong to any statistics::Group. Legacy stat is deprecated.
Redirecting stdout and stderr to /workspace/tests/results/atomic/bare-metal/2c_legsimout
root@icelab-16:/workspace# 

ribeiro@icelab-16.ice.rwth-aachen.de: /scratch/ribeiro/workspace/pad-gem5/gem5.1ce/util/term$ ./m5term 127.0.0.1 3456
Alma8 :ribeiro@icelab-16: /scratch/ribeiro/workspace/pad-gem5/gem5.1ce/util/term$ Return from CRC_DR: 22cc4488
Return from CRC_DR: c66ce444
Reset was done with success!
CPU0: done
CPU1: Bubble sort done
CPU2: done
Alma8 :ribeiro@icelab-16: /scratch/ribeiro/workspace/pad-gem5/gem5.1ce/util/term$ 

```

**Figure 4.13:** CRC validation test results

### 4.2.3 Memory Integrity

With the CRC's validation, it is possible to simulate the scenario described at the beginning of this section. Similar to the validation test, from the application point of view, the system will execute 2 distinct jobs. CPU0 will be responsible for performing the memory integrity checks, while the remaining ones will execute a bubble sort algorithm, as presented in the Figure 4.14.



**Figure 4.14:** Co-simulation design

When the benchmark starts, there is the creation of the intended CPUs. Each one has an ID, that allows one to identify themselves. To conduct the memory integrity test, CPU0 will start initializing the CRC and the memory that will be utilized to compare. Afterward, periodically, it will perform a memory comparison, having two possible outcomes. Or everything is all right, and the operation is a success. Or there is a flaw and the simulation ends immediately. A detailed view can be observed in 4.16.



**Figure 4.15:** State machine diagram for the memory integrity test

The bubble-sort test run in the remaining instantiated cores. Detail information about this benchmark can be found on subsection 3.1.2. If its workload is concluded successfully and still there are memory verifications to carry out, the simulation continues since the primary objective is not to evaluate the bubble-sort algorithm.



**Figure 4.16:** Sequence diagram diagram for the memory comparison

## Success Modeling

In the first memory integrity test, the perfect scenario will be simulated, in other words, the memory will operate without any failures, ensuring a well-functioning system. As represented in the Figure 4.15, there will be an interval between both tasks, and this can be different from each other (4.17).



**Figure 4.17:** Application execution timely diagram

To perform this benchmark, the following parameters were defined. After completing the bubble-sort workload, the system executes a final memory check to ensure that no problems occurred between the last one, allowing the simulation to conclude safely. The simulation results are available in the subsequent images. As expected, the system executed normally the benchmark without any problems. Both CPUs conclude the designated tasks and the last memory comparison gives positive feedback.

- Repeat = 100
  - $t_{CMP}$  = 100 ms
  - $t_{BS}$  = 10 us
  - Number of simulated cores = 2
  - Memory size = 16
  - Simulation mode = sequential
  - Reverse input CRC = REV\_HALF\_WORD\_IN
  - Reverse output CRC = REV\_WORD\_OUT
  - Polynomial = 0x04C11DB7

```
root@icelab-16: /workspace
REPEAT_WL=100 -DMEM_SIZE=16 -c -o workloads.o workloads.c
/usr/bin/arm-none-eabi-cpp armv7.s -g -O3 -lunistd -I . -DNUM_CORES=2 -DARR_SIZE
=1024 -DREPEAT_WL=100 -DMEM_SIZE=16 | /usr/bin/arm-none-eabi-as -EL -o armv7.o
/usr/bin/arm-none-eabi-gcc -g -O3 -lunistd -I . -DNUM_CORES=2 -DARR_SIZE=1024 -T
REPEAT_WL=100 -DMEM_SIZE=16 -c -o CRC_SC.o CRC_SC.c
/usr/bin/arm-none-eabi-gcc -fdefsym=NUM_CORES=2 -nostartfiles -L
-Map=linkmap.txt -Lxlinker -Tmulticore.ld -o main.elf boot.o ./common/syscalls.o
main.o workloads.o armv7.o CRC_SC.o --specs=nosys.specs
/workspace
[workspace]gem5-ice/build/ARM/gem5opt -r -d /workspace/tests/results/atomic/bar
e-metal/2c_reg /workspace/gem5-ice/configs/example/fs.py --bare-metal --cpu-type=AtomicSimpleCPU --caches --l2cache --machine-type=VExpress_GEMS_V1 --kernel=/
workspace/tests/build/main_2c.elf --num-cpus=2
build/ARM/base/statistics.hh:27: warn: One of the stats is a legacy stat. Legacy
stat is a stat that does not belong to any statistics::Group. Legacy stat is deprecated.
Redirecting stdout and stderr to /workspace/tests/results/atomic/bare-metal/2c_
reg/simout
root@icelab-16:/workspace# [ ] ribeiro@icelab-16.ice.rwth-aachen.de: /scratch/ribeiro/workspace/pad-gem5/gem5-ice/util/term
CPU1: Buble sort done, more 11 to go
CPU1: Buble sort done, more 10 to go
CPU1: Buble sort done, more 9 to go
CPU1: Buble sort done, more 8 to go
CPU1: Buble sort done, more 7 to go
CPU1: Buble sort done, more 6 to go
CPU1: Buble sort done, more 5 to go
CPU1: Buble sort done, more 4 to go
CPU1: Buble sort done, more 3 to go
CPU1: Buble sort done, more 2 to go
CPU1: Buble sort done, more 1 to go
CPU1: Buble sort done, more 0 to go
CPU1: done
Memory verified -> Everything is OK
CPU0: done
Alma8 :ribeiro@icelab-16: /scratch/ribeiro/workspace/pad-gem5/gem5-ice/util/term$ [ ]
```

**Figure 4.18:** Success memory integrity test

## Fault Modeling

In opposition to the first, this test will simulate a corrupted memory scenario, as shown in the Figure 4.19. In order to achieve that, a failure will be injected into the memory with the GNU Debugger (GDB) tool. GDB is a debugger that is supported by Gem5 and SystemC. It offers a variety of features but, for this purpose, the `set var` command will be utilized. This command allows a value modification of a variable in real-time thus, in this way, the failure can be simulated.



**Figure 4.19:** Application execution timely diagram with a failure

The test will be done under the same conditions as the previous one nevertheless, in this case, for comprehensive debugging capabilities, the debug version of the gem5 binary will be employed. In the end, the workload is expected to finish with an error from Gem5, and it can occur in two different moments. Either when the CPU1-N are still running, or when these already completed their tasks. Both cases will be tested, and in either case, the program should terminate promptly. The Figure 4.20 demonstrates the obtained results, and it can be concluded that the system behaves in accordance with expectations.

```

rbeiro@icelab-16.ice.rwth-aachen.de: /net/home/ribeiro
build/ARM/base/statistics.hh:277: warn: One of the stats is a legacy stat. Legacy stat is a static stat that does not belong to any statistics::Group. Legacy stat is deprecated.
Redirecting stdout and stderr to /workspace/tests/results/atomicc/bare-metal/2c_reg/simout
[Detaching after fork from child process 4016]
[Detaching after fork from child process 4017]
^C
Program received signal SIGINT, Interrupt.
gdb:::doSimLoop (event=0x562efac55560) at build/ARM/sim/simulate.cc:289
289      assert(!event->empty());
(gdb) b crc.cc:191
Breakpoint 1 at 0x562ef876350b: file build/ARM/dev/systemC/crc.cc, line 191.
(gdb) c
Continuing.

Breakpoint 1, gem5::Crc::write (this=0x562efbf16e00, pkt=0x7ffc48a33a50)
at build/ARM/dev/systemC/crc.cc:191
warning: Source file is more recent than executable.
191          if( write_wrapper_tlm(CrcSlot, CRC_DR, sizeof(int), data, CRC_op_delay) != Read_remote_port_failed
() )
(gdb) set var data = 0x12345678
(gdb) c
Continuing.
[Inferior 1 (process 4012) exited normally]
(gdb)

rbeiro@icelab-16.ice.rwth-aachen.de: /scratch/ribeiro/workspace/pad-gem5/gem5-ice/util/term
CPU1: Bubble sort done, more 67 to go
CPU1: Bubble sort done, more 66 to go
CPU1: Bubble sort done, more 65 to go
CPU1: Bubble sort done, more 64 to go
CPU1: Bubble sort done, more 63 to go
CPU1: Bubble sort done, more 62 to go
CPU1: Bubble sort done, more 61 to go
CPU1: Bubble sort done, more 60 to go
CPU1: Bubble sort done, more 59 to go
CPU1: Bubble sort done, more 58 to go
Broken memory in position: 0 | Expected CRC MEM -> edb88320 | CRC MEM -> bbc09114
Alma8 : rbeiro@icelab-16: /scratch/ribeiro/workspace/pad-gem5/gem5-ice/util/term$ 

```

(a) Case 1

```

root@icelab-16: /workspace
at that does not belong to any statistics::Group. Legacy stat is deprecated.
Redirecting stdout and stderr to /workspace/tests/results/atomic/bare-metal/2c_reg/simout
[Detaching after fork from child process 4072]
[Detaching after fork from child process 4073]
^C
Program received signal SIGINT, Interrupt.
0x00007f65913be392 in __libc_read ({fd=6, buf=buf@entry=0x7ffd5e17909f, nbytes=nbytes@entry=1})
  at ../sysdeps/unix/sysv/linux/read.c:26
(gdb) b crc.cc:191
Breakpoint 1 at 0x55970908550b: file build/ARM/dev/systemC/crc.cc, line 191.
(gdb) c
Continuing.

Breakpoint 1, gem5::Crc::write (this=0x55970d136e00, pkt=0x7ffd5e17afb0)
  at build/ARM/dev/systemC/crc.cc:191
warning: Source file is more recent than executable.
191      if( write_wrapper_tlm(CrcSlot, CRC_DR, sizeof(int), data, CRC_op_delay) != Read_remote_port_failed
() )
(gdb) set var data = 0xabcd
(gdb) c
Continuing.
[Inferior 1 (process 4068) exited normally]
(gdb) []

```

```

CPU1: Bubble sort done, more 8 to go
CPU1: Bubble sort done, more 7 to go
CPU1: Bubble sort done, more 6 to go
CPU1: Bubble sort done, more 5 to go
CPU1: Bubble sort done, more 4 to go
CPU1: Bubble sort done, more 3 to go
CPU1: Bubble sort done, more 2 to go
CPU1: Bubble sort done, more 1 to go
CPU1: Bubble sort done, more 0 to go
CPU1: done
Broken memory in position: 13 | Expected CRC_MEM -> edb58320 | CRC_MEM -> 2057838b
Alma8 :ribeiro@icelab-16: /scratch/ribeiro/workspace/systemC/co_sim_gen5_systemc$ 

```

(b) Case 2

**Figure 4.20:** Failure memory integrity test

## 4.3 Dynamic Quantum Integration

Up to this point in the co-simulation, accuracy has been the primary concern. With this criteria, performance is sacrificed, giving a higher host time. In the co-simulation itself, the use of par-gem5 would not provide greater benefits, as communication between the tools is the most time-consuming aspect. On the other hand, the remaining workload would take advantage of the parallel mode, as verified in the previous chapter (3.6).

To assess the advantages of par-gem5 and the dynamic quantum, a series of tests were conducted with various configurations. Compared to the prior ones, the differences will be resumed to the simulation modes, with the addition of static and dynamic parallel modes, and the number of simulated cores, which will range from 2, 4, 8, 16, 32, to 64 cores. The next figures exhibit the results obtained.

The graphs illustrate the gains in comparison to the sequential simulation. Observing the images can be settled that the parallel version, either with the static or dynamic version had a better performance. With 64 simulated cores, the gains of the dynamic version overpassed 250%. In fact, it can be affirmed that the performance gains grow with the increasing number of cores, due to the rise of workload amount. In terms of accuracy, the static version had almost a perfect result, while the dynamic one had a maximum inaccuracy of 0,98% nevertheless, it is considerably far from the 5%. Finally, as a consequence of the previously mentioned gains, the host time was reduced, getting, in the majority of the cases, at least a reduction of 40%.

From these benchmarks other conclusions can be taken. Depending on the benchmark, the better simulation mode can differ from the sequential and the parallel ones. It was verified that when the parallel


**Figure 4.21:** Co-simulation results

tasks have considerable time consumption, the parallel mode gives more advantages since it can accelerate these, keeping the inaccuracy below 5%. On the opposite way, if the communication between tools is where the simulation spends most of the time, the sequential mode will fit better, as it has a greater trade-off between accuracy and performance. Further, the delay times between the tasks also play a significant role. In cases where  $t_{CoSim} \gg t_{ParTask}$ , the parallel version is the most appropriate otherwise, the sequential one should be chosen.

In the end, it is trivial that to decide which mode is better, apriori information about the workload is needed. If it does not exist, either because there is no documentation or the access to the source code is restricted, the better simulation mode to use, based on this case study, is the parallel mode with a dynamic quantum. The reason is that it had always an inaccuracy below 5% and had a higher performance gain when compared to the sequential version. Nevertheless, if perfect accuracy is the main requirement, the sequential simulation mode should always be chosen.

# 5 | Conclusions

This chapter concludes this dissertation, summarizing the former developed work, and giving up some topics that could be interesting for future work.

From a general perspective, this project development contributed to solve the principal problem verified in par-gem5, which is the quantum definition to get the best tradeoff between performance and accuracy. With this advancement, hardware developers do not need to study and understand in detail the target system and the benchmarks since the introduced dynamic version is able to automatically tune to the best synchronization time. For this reason, the dissertation goal was accomplished.

## 5.1 Developed Work

Concerning the development work, all algorithms were used in the final approach, except for the repetition one. The ADALINE with the main algorithm, and the increment and PC as support for a better quantum adjustment. Due to its overhead in the simulation performance and weak accuracy gain, the repetition algorithm was excluded from the final solution.

Several tests were performed to evaluate it. In the end, it can be concluded the use of the dynamic version for the quantum choice brings greater benefits. When the NPB BT, LU, and SP with 32 and 64 simulated cores are not considered, there are gains in performance almost reaching 10%, only sacrificing 0.5% of accuracy. As mentioned on subsection 3.1.2, the previous tests have a particular characteristic, which is they executed nonlinear partial differential equations. This workload does not require inter-process communications thus, quantum can be increased without losing significant accuracy. The dynamic algorithm was designed to make it harder to increase the quantum value as the number of cores increases, due to the conclusions obtained in the par-gem5 [3] work. For this reason, this type of benchmark obtains a meaningful drop in performance when more than 16 cores are being simulated. However, the algorithm's design must be flexible, in a way that all benchmarks can have the best performance within accuracy limits.

In conclusion, the better algorithm depends on the a-priori available information. If there is nothing about it, the dynamic version should be used because, even if the performance might be lower than it could be, it is guaranteed that accuracy will be above 95%. On the other side, knowing the details of the benchmark enables us to calculate the optimal quantum before initiating the simulation. It is important to

remember that when increasing the quantum, while performance as a function of the quantum behaves similarly to a sigmoidal, inaccuracy grows linearly [64].

The developed co-simulation interface between SystemC and Gem5 provided a new work environment between the two platforms. Both can communicate easily and to multiple targets within the same simulation. As a case study to validate this cooperation, the CRC device was chosen, passing all the requirements successfully. In addition, as part of the case study, memory integrity tests were designed and performed, where either for the success or the failure modeling, the expected results matched with the real ones. On the other part, it was possible to verify that the dynamic quantum yielded improved results even with a significant time consumption in the communication between tools. Further, accuracy was always above 99%, enchanting its performance.

In the final analysis, it is important to mention that the usage of par-gem5, and consequently the dynamic quantum, has an accuracy cost. If perfect accuracy is mandatory, sequential mode should be selected. Also, simulations with a single simulated core should always be executed with the later mode. It was concluded that parallelization with only one simulated core results in a loss of accuracy without a significant gain in performance.

## 5.2 Future Work

Concerning future work, there are some aspects where improvements and new additions can be developed in order to have a more robust and versatile algorithm.

The first aspect is the execution of more benchmarks. Although the algorithm was tested with a set of different and distinct benchmarks, more tests are required to have more performance results. These may reveal points where it can be improved, either with the development of a new support algorithm or with a better tuning of the available parameters. PARSEC, SPEC2017, and STREAM are some examples of benchmarks where experiments can be conducted.

The second aspect is the adaptation for the timing CPU model. A future work of par-gem5 [3] is to extend this solution to the timing mode thus, the developed approach should also function in this condition. Only the IFP algorithm requires an adaptation, due to its PC analysis method. The timing mode uses multiple events to model a transaction, which makes it difficult to perform a correct examination of the executed instructions. In-order, and O3 demand an adaptation as well however, until the present date, no publications have been made on this matter regarding par-gem5.

Another point is the usage of other targets. The developed algorithm was designed to tackle any target board nevertheless, as aforementioned, only VExpress\_gem5 was used for all the tests. Different boards with different characteristics may highlight flaws that could not be spotted with the present one. Such usage may require its design and development in the platform, with the creation of all the necessary devices and connections.

Finally, improvements in the co-simulation environment can be implemented. The main problem observed was the time consumption in the communication between tools. From a host time perspective, the workload on the memory comparison task corresponds to 87 % of the whole simulation, where 82.2% of this value, on average, corresponds to the waiting time for the payload response. As part of the solution to speed up co-simulation, other IPCs techniques can be used, like shared memory. Although it is the fastest IPC available, it is the unsafest one due to the lack of permission verifications hence, for this reason, a careful design must be employed. Nevertheless, it might not be enough to solve the problem, and other techniques must be considered, for example, the replacement of the blocking transport for Direct Memory Interface (DMI) in SystemC, which opens the possibility of new work.

# References

- [1] R. P. A. A. F. A. N. C. O. O. Sugiono Sugiono Andi S. Putra, "New concept of product design by involving emotional factors using eeg: A case study of xomputer mouse design," *ACTA NEUROPSYCHOLOGICA*, vol. 19, no. 1, pp. 63–80, 2021.
- [2] D. N. J. A. K. P. K. S. P. I. S. A. S. V. Mani Azimi Naveen Cherukuri, "Tera-scale computing," *Intel Technology Journal*, vol. 11, no. 3, pp. 173–184, 2007.
- [3] N. Zurstraßen, C.-C. Jose, J. M. Joseph, R. Leupers, X. Xinghua, and L. Yichao, "Par-gem5: Parallelizing gem5's atomic mode," English, in *2023 Design, Automation and Test in Europe Conference (DATE)*, 2023.
- [4] Z. M. Research, *Virtual training and simulation market size, growth, trends, share*.
- [5] S. Robinson, *Simulation: The Practice of Model Development and Use*. Hoboken, NJ, USA: John Wiley & Sons, Inc., 2004, ISBN: 0470847727.
- [6] M. Verkuyl, N. Dubois, S. Goldsworthy, T. Merwin, T. Willet, and T. Job, *Virtual Simulation: An Educator's Toolkit*. 2022.
- [7] J. Banks, "Introduction to simulation," in *Proceedings of the 31st conference on Winter simulation: Simulation—a bridge to the future-Volume 1*, 1999, pp. 7–13.
- [8] J. Banks, J. Carson, B. Nelson, and D. Nicol, *Discrete-Event System Simulation*, English, 5th ed. Prentice Hall, 2010, ISBN: 0136062121.
- [9] M. Barr, *Programming Embedded Systems in C and C++* (O'Reilly Series). O'Reilly, 1999, ISBN: 9781565923546.
- [10] R. Camposano and J. Wilberg, "Embedded system design," *Design Automation for Embedded Systems*, vol. 1, pp. 5–50, 1996.
- [11] J. Corbet, A. Rubini, and G. Kroah-Hartman, *Linux Device Drivers, 3rd Edition*. O'Reilly Media, Inc., 2005, ISBN: 0596005903.
- [12] X. Zhai, F. Bensaali, and K. McDonald-Maier, "Automatic number plate recognition on fpga," in *2013 IEEE 20th International Conference on Electronics, Circuits, and Systems (ICECS)*, 2013, pp. 325–328. DOI: 10.1109/ICECS.2013.6815420.

- [13] J. Cong and J. Peck, "On acceleration of the check tautology logic synthesis algorithm using an fpga-based reconfigurable coprocessor," in *Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines Cat. No.97TB100186*, 1997, pp. 246–247. DOI: 10.1109/FPGA.1997.624629.
- [14] J. Cong and J. Peck, "On acceleration of the check tautology logic synthesis algorithm using an fpga-based reconfigurable coprocessor," in *Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines Cat. No.97TB100186*, 1997, pp. 246–247. DOI: 10.1109/FPGA.1997.624629.
- [15] M. Silva, D. Cerdeira, S. Pinto, and T. Gomes, "Operating systems for internet of things low-end devices: Analysis and benchmarking," *IEEE Internet of Things Journal*, vol. 6, no. 6, pp. 10 375–10 383, 2019.
- [16] IEC, *iec, 192-01-22 dependability*.
- [17] A. Tanenbaum and H. Bos, *Modern Operating Systems, Global Edition*. Pearson Education, 2015, ISBN: 9781292061955.
- [18] R. Barry, *Mastering the freertos real time kernel. real time engineers ltd*, 2016.
- [19] A. A. Adenowo and B. A. Adenowo, "Software engineering methodologies: A review of the waterfall model and object-oriented approach," *International Journal of Scientific & Engineering Research*, vol. 4, no. 7, pp. 427–434, 2013.
- [20] W Royce, "Winston," *Proceedings, Managing the Development of Large Software Systems, IEEE WESCON*, 1970.
- [21] S. Balaji and M. S. Murugaiyan, "Waterfall vs. v-model vs. agile: A comparative study on sdlc," *International Journal of Information Technology and Business Management*, vol. 2, no. 1, pp. 26–30, 2012.
- [22] G. B. Regulwar, P. Deshmukh, R. Tugnayat, P. Jawandhiya, and V. Gulhane, "Variations in v model for software development," *International Journal of Advanced Research in Computer Science*, vol. 1, no. 2, pp. 134–135, 2010.
- [23] S. Mathur and S. Malik, "Advancements in the v-model," *International Journal of Computer Applications*, vol. 1, no. 12, pp. 29–34, 2010.
- [24] W. Van Casteren, "The waterfall model and the agile methodologies: A comparison by project characteristics," *Research Gate*, vol. 2, pp. 1–6, 2017.
- [25] B. Boehm, "A survey of agile development methodologies," *Laurie Williams*, vol. 45, p. 119, 2007.
- [26] B. W. Boehm, "A spiral model of software development and enhancement," *Computer*, vol. 21, no. 5, pp. 61–72, 1988.

- [27] A. Alshamrani and A. Bahattab, "A comparison between three sdlc models waterfall model, spiral model, and incremental/iterative model," *International Journal of Computer Science Issues (IJCSI)*, vol. 12, no. 1, p. 106, 2015.
- [28] B. Boehm, "A spiral model of software development and enhancement," *ACM SIGSOFT Software engineering notes*, vol. 11, no. 4, pp. 14–24, 1986.
- [29] B. Liu, H. Zhang, and S. Zhu, "An incremental v-model process for automotive development," in *2016 23rd Asia-Pacific Software Engineering Conference (APSEC)*, IEEE, 2016, pp. 225–232.
- [30] W. M. Zabołotny, "Development of embedded pc and fpga based systems with virtual hardware," in *Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2012*, SPIE, vol. 8454, 2012, pp. 259–265.
- [31] *Ghdl main/home page*.
- [32] S. Williams and M. Baxter, "Icarus verilog: Open-source verilog more than a year later," *Linux Journal*, vol. 2002, no. 99, p. 3, 2002.
- [33] F. Bellard, "Qemu, a fast and portable dynamic translator.," in *USENIX annual technical conference, FREENIX Track*, California, USA, vol. 41, 2005, p. 46.
- [34] N. Binkert *et al.*, "The gem5 simulator," *SIGARCH Comput. Archit. News*, vol. 39, no. 2, pp. 1–7, 2011, ISSN: 0163-5964. DOI: 10.1145/2024716.2024718.
- [35] O. Bringmann *et al.*, "The next generation of virtual prototyping: Ultra-fast yet accurate simulation of hw/sw systems," in *2015 Design, Automation & Test in Europe Conference & Exhibition (DATE)*, IEEE, 2015, pp. 1698–1707.
- [36] L. Jünger, C. Bianco, K. Niederholtmeyer, D. Petras, and R. Leupers, "Optimizing temporal decoupling using event relevance," in *Proceedings of the 26th Asia and South Pacific Design Automation Conference*, 2021, pp. 331–337.
- [37] A. Varga, "Discrete event simulation system," in *Proc. of the European Simulation Multiconference (ESM'2001)*, 2001, pp. 1–7.
- [38] E. Babulak and M. Wang, "Discrete event simulation," *Aitor Goti (Hg.): Discrete Event Simulations. Rijeka, Kroatien: Sciyo*, p. 1, 2010.
- [39] W Boughton and O Droop, "Continuous simulation for design flood estimation—a review," *Environmental Modelling & Software*, vol. 18, no. 4, pp. 309–318, 2003.
- [40] F. Henning, L. Kärger, D. Dörr, F. J. Schirmaier, J. Seuffert, and A. Bernath, "Fast processing and continuous simulation of automotive structural composite components," *Composites Science and Technology*, vol. 171, pp. 261–279, 2019.
- [41] M. Helal, *A hybrid system dynamics-discrete event simulation approach to simulating the manufacturing enterprise*. University of Central Florida, 2008.

- [42] T. Q. P. Developers, *Qemu's documentation*.
- [43] C. Schumacher, R. Leupers, D. Petras, and A. Hoffmann, "Parsc: Synchronous parallel systemc simulation on multi-core host architectures," in *Proceedings of the eighth IEEE/ACM/IFIP international conference on hardware/software codesign and system synthesis*, 2010, pp. 241–246.
- [44] "Ieee standard for standard systemc language reference manual," *IEEE Std 1666-2011 (Revision of IEEE Std 1666-2005)*, pp. 1–638, 2012. DOI: 10.1109/IEEESTD.2012.6134619.
- [45] A. Yoga and S. Nagarakatte, "Parallelism-centric what-if and differential analyses," in *Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation*, 2019, pp. 485–501.
- [46] T. Zhou, *Sequential and parallel discrete event simulation on computer communication networks*. Western Michigan University, 1992.
- [47] N. R. Tallent, L. Adhianto, and J. M. Mellor-Crummey, "Scalable identification of load imbalance in parallel executions using call path profiles," in *SC '10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis*, 2010, pp. 1–11. DOI: 10.1109/SC.2010.47.
- [48] L. Rose, B. Homer, and D. Johnson, "Detecting application load imbalance on high end massively parallel systems," Aug. 2007, pp. 150–159, ISBN: 978-3-540-74465-8. DOI: 10.1007/978-3-540-74466-5\_17.
- [49] K.-Y. Chen, J. M. Chang, and T.-W. Hou, "Multithreading in java: Performance and scalability on multicore systems," *IEEE Transactions on Computers*, vol. 60, no. 11, pp. 1521–1534, 2010.
- [50] K. Du Bois, J. B. Sartor, S. Eyerman, and L. Eeckhout, "Bottle graphs: Visualizing scalability bottlenecks in multi-threaded applications," *ACM SIGPLAN Notices*, vol. 48, no. 10, pp. 355–372, 2013.
- [51] S. Eyerman, K. Du Bois, and L. Eeckhout, "Speedup stacks: Identifying scaling bottlenecks in multi-threaded applications," Apr. 2012. DOI: 10.1109/ISPASS.2012.6189221.
- [52] R. Ciesla, "Bits, sample rates, and other fundamentals of digital audio," in *Sound and Music for Games: The Basics of Digital Audio for Video Games*, Springer, 2022, pp. 1–24.
- [53] R. M. Fujimoto, "Parallel discrete event simulation," *Communications of the ACM*, vol. 33, no. 10, pp. 30–53, 1990.
- [54] G. Busnot, T. Sassolas, N. Ventroux, and M. Moy, "Standard-compliant parallel systemc simulation of loosely-timed transaction level models," in *2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC)*, IEEE, 2020, pp. 363–368.

- [55] M. Jung, F. Schnicke, M. Damm, T. Kuhn, and N. Wehn, "Speculative temporal decoupling using fork ()," in *2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)*, IEEE, 2019, pp. 1721–1726.
- [56] A. Mohammad, U. Darbaz, G. Dozsa, S. Diestelhorst, D. Kim, and N. S. Kim, "Dist-gem5: Distributed simulation of computer clusters," in *2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)*, IEEE, 2017, pp. 153–162.
- [57] J. H. Weinstock, R. Leupers, G. Ascheid, D. Petras, and A. Hoffmann, "Systemc-link: Parallel systemc simulation using time-decoupled segments," in *2016 Design, Automation & Test in Europe Conference & Exhibition (DATE)*, IEEE, 2016, pp. 493–498.
- [58] J. Lowe-Power *et al.*, "The gem5 simulator: Version 20.0+," *arXiv preprint arXiv:2007.03152*, 2020.
- [59] N. Binkert, R. Dreslinski, L. Hsu, K. Lim, A. Saidi, and S. Reinhardt, "The m5 simulator: Modeling networked systems," *IEEE Micro*, vol. 26, no. 4, pp. 52–60, 2006. DOI: 10.1109/MM.2006.82.
- [60] M. M. K. Martin *et al.*, "Multifacet's general execution-driven multiprocessor simulator (gems) toolset," vol. 33, no. 4, pp. 92–99, 2005, ISSN: 0163-5964. DOI: 10.1145/1105734.1105747.
- [61] D.-I. G. Hempel and I. J. Castrillon, "Simulation of risc-v based systems in gem5,"
- [62] W. Jakob, J. Rhinelander, and D. Moldovan, "Pybind11—seamless operability between c++ 11 and python," URL <https://github.com/pybind/pybind11>, 2022.
- [63] S. Knight, "Scons design and implementation," in *Tenth int'l python conf*, 2002.
- [64] N. Zurstraßen, R. Brandhofer, J. C. Cascante, J. M. Joseph, N. Bosbach, and R. Leupers, *Beyond the Quantum of Temporally-Decoupled Simulations*,
- [65] G. Glaser, G. Nitsche, and E. Hennig, "Temporal decoupling with error-bounded predictive quantum control," in *2015 Forum on Specification and Design Languages (FDL)*, 2015, pp. 1–6.
- [66] D. R. Jefferson, "Virtual time," *ACM Transactions on Programming Languages and Systems (TOPLAS)*, vol. 7, no. 3, pp. 404–425, 1985.
- [67] F. Morales and J. L. Bismarck, *Evaluating gem5 and qemu virtual platforms for arm multicore architectures*, 2016.
- [68] A. Herrera, "running trusted firmware-a on gem5", 2020.
- [69] T. Wieman, B. Bhattacharya, T. Jeremiassen, C. Schroder, and B. Vanthournout, "An overview of open systemc initiative standards development," *IEEE Design & Test of Computers*, vol. 29, no. 2, pp. 14–22, 2012.
- [70] A. Akram and L. Sawalha, "A comparison of x86 computer architecture simulators," 2016.
- [71] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli, "Multi2sim: A simulation framework for cpu-gpu computing," in *Proceedings of the 21st international conference on Parallel architectures and compilation techniques*, 2012, pp. 335–344.

- [72] T. E. Carlson, W. Heirman, and L. Eeckhout, "Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation," in *Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis*, 2011, pp. 1–12.
- [73] M. T. Yourst, "Ptlsim: A cycle accurate full system x86-64 microarchitectural simulator," in *2007 IEEE International Symposium on Performance Analysis of Systems & Software*, IEEE, 2007, pp. 23–34.
- [74] D. Sanchez and C. Kozyrakis, "Zsim: Fast and accurate microarchitectural simulation of thousand-core systems," *ACM SIGARCH Computer architecture news*, vol. 41, no. 3, pp. 475–486, 2013.
- [75] A. Patel, F. Afram, and K. Ghose, "Marss-x86: A qemu-based micro-architectural and systems simulator for x86 multicore processors," in *1st International Qemu Users' Forum*, Citeseer, 2011, pp. 29–30.
- [76] M. R. Bachute and J. M. Subhedar, "Autonomous driving architectures: Insights of machine learning and deep learning algorithms," *Machine Learning with Applications*, vol. 6, p. 100 164, 2021.
- [77] A. L. Samuel, "Some studies in machine learning using the game of checkers," *IBM Journal of research and development*, vol. 3, no. 3, pp. 210–229, 1959.
- [78] B. Mahesh, "Machine learning algorithms-a review," *International Journal of Science and Research (IJSR). [Internet]*, vol. 9, pp. 381–386, 2020.
- [79] S. Haykin, *Neural networks and learning machines*, 3/E. Pearson Education India, 2009.
- [80] B. Widrow and M. E. Hoff, "Adaptive switching circuits," Stanford Univ Ca Stanford Electronics Labs, Tech. Rep., 1960.
- [81] B. Widrow and S. D. Stearns, "Adaptive signal processing prentice-hall," *Englewood Cliffs, NJ*, p. 52, 1985.
- [82] B. Widrow and M. A. Lehr, "Perceptrons, adalines, and backpropagation," *Arbib*, vol. 4, pp. 719–724, 1995.
- [83] K. Mitchell-Wallace, M. Jones, J. Hillier, and M. Foote, *Natural catastrophe risk management and modelling: A practitioner's guide*. John Wiley & Sons, 2017.
- [84] L. von Rueden, S. Mayer, R. Sifa, C. Bauckhage, and J. Garcke, "Combining machine learning and simulation to a hybrid modelling approach: Current and future directions," in *Advances in Intelligent Data Analysis XVIII: 18th International Symposium on Intelligent Data Analysis, IDA 2020, Konstanz, Germany, April 27–29, 2020, Proceedings 18*, Springer, 2020, pp. 548–560.
- [85] P. Benner, S. Gugercin, and K. Willcox, "A survey of projection-based model reduction methods for parametric dynamical systems," *SIAM review*, vol. 57, no. 4, pp. 483–531, 2015.
- [86] E. Tsymbalov, S. Makarychev, A. Shapeev, and M. Panov, "Deeper connections between neural networks and gaussian processes speed-up active learning," *arXiv preprint arXiv:1902.10350*, 2019.

- [87] F. Noé, A. Tkatchenko, K.-R. Müller, and C. Clementi, “Machine learning for molecular simulation,” *Annual review of physical chemistry*, vol. 71, pp. 361–390, 2020.
- [88] K. Albertsson *et al.*, “Machine learning in high energy physics community white paper,” in *Journal of Physics: Conference Series*, IOP Publishing, vol. 1085, 2018, p. 022 008.
- [89] C. Gomes, C. Thule, D. Broman, P. G. Larsen, and H. Vangheluwe, “Co-simulation: State of the art,” *arXiv preprint arXiv:1702.00686*, 2017.
- [90] Xilinx, *Xilinx/libsystemctlm-soc: Systemc/tlm-2.0 co-simulation framework*.
- [91] M. Komalan *et al.*, “Main memory organization trade-offs with dram and stt-mram options based on gem5-nvmain simulation frameworks,” in *2018 Design, Automation & Test in Europe Conference & Exhibition (DATE)*, IEEE, 2018, pp. 103–108.
- [92] C. Menard, J. Castrillon, M. Jung, and N. Wehn, “System simulation with gem5 and systemc: The keystone for full interoperability,” in *2017 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)*, IEEE, 2017, pp. 62–69.
- [93] D. Bailey *et al.*, “The nas parallel benchmarks rnr-94-007,” *NASA Advanced Supercomputing Division, Tech. Rep*, 1994.
- [94] S. Haykin, “Linear prediction,” *Adaptive filter theory*, pp. 562–588, 1996.
- [95] M. Stella, D. Begusic, and M. Russo, “Adaptive noise cancellation based on neural network,” in *2006 International Conference on Software in Telecommunications and Computer Networks*, 2006, pp. 306–309. DOI: 10.1109/SOFTCOM.2006.329765.
- [96] *Rm0385 reference manual -based 32-bit mcus.*
- [97] G Xilinx and S Guide, “Zynq-7000 all programmable soc technical reference manual (ug585),” Tech. rep., Xilinx, 2014, <https://www.xilinx.com/support/documentation...>, Tech. Rep., 2014.
- [98] W. W. Peterson and D. T. Brown, “Cyclic codes for error detection,” *Proceedings of the IRE*, vol. 49, no. 1, pp. 228–235, 1961.
- [99] C. Borrelli, “Ieee 802.3 cyclic redundancy check,” *application note: Virtex Series and Virtex-II Family, XAPP209 (v1. 0)*, 2001.