

# SystemC and Virtual Prototyping

Dr. Matthias Jung, Fraunhofer Institute IESE  
matthias.jung@iese.fraunhofer.de



Fraunhofer  
IESE

Your notes:

TLM has several Topics, we will talk more about the trades off between simulation speed and accuracy, in TLM we are raising the abstraction to higher level which gives us higher speed But accuracy goes down , But we can also find some cases that accuracy doesn't go down and also I will show you some examples that they did where we can get up to 10x speed by having same accuracy by just applying some clever modeling techniques



## Transaction Level Modeling (TLM)

Your notes:

---

---

---

---

---

---

---

---

---

---

---

---

---



Your notes:

---



---



---



---



---



---



---



---



---



---



---



Your notes:

---



---



---



---



---



---



---



---



---



---



---



---

## How to build a virtual Smartphone?



6

Fraunhofer  
IESE

© Fraunhofer IESE

- Your notes:
  - If we model the complete SmartPhone in RTL, if you model everything and everything clocked and everything is detailed the simulation will take too long to complete Linux Boot.
  - So for very accurate models it takes so long time because we have so many events that are happening.
  - So what we are trying to do in TLM is to reduce the amount of events.

We don't want to simulate the CLK because every CLK is an event it's signal change in our system. Kernel and our kernel needs to do something for that change and this gives overhead.

- By the way there is a hint to speed up our simulation is to CLK gate the CLK for ex, that's something we did in one project, so for ex<sup>TF</sup> there is a component is not required for simulation we shut down the CLK of the component and the simulation gets faster, so the simulation gets faster, that's kind of optimization in RTL simulation.

## Typical Smartphone SoC Consists of the Following



© Fraunhofer IESE

Fraunhofer  
IESE

7

Your notes:  
When you see now such a system on chip how you classify this components

- ① Initiators because they initiates communication
- ② Interconnects initiators and Targets at same time

- ③ Targets because a memory controller will not suddenly send data to CPU, No CPU requested this data - it's still good that it's possible

# oots active component over its pass From Coniculation Point of view

## Recap: Accuracy vs. Speed Trade-Off

This shows we  
Classify our  
models in  
TLM -



© Fraunhofer IESE

8  
 **Fraunhofer**  
IESE

Your notes: You should always ask  
what you want to have what  
you want to answer by  
this simulation.

## Recap: Accuracy vs. Speed Trade-Off

- E.g. RTL Simulations:
- VHDL / Verilog / SystemC
  - Very accurate
  - Very very very ... slow
  - Inflexible
- We have thousands of signals that we are simulating we won't be able to change things*

### E.g. System Level Simulations:

- Fast ↑
- Large flexibility ↑
- Inaccurate



9  
access  
this  
with  
detailed

Your notes:

---

---

---

---

---

---

---

---

---

In SystemC we can model like RTL  
our models can have thousands of pins we simulate all the events

## Remember: Cycle Accurate Simulation



We also simulate all the events  
CLK

This makes simulation slow  
if there is nothing happening  
CLK is still simulating

CA  
SystemC

Pin Accurate

CA  
SystemC

- RTL models can have thousands of pins
- Simulate all events on all pins
  - RTL bus access has ~75 events

Source: Doulos Ltd. www.doulos.com

© Fraunhofer IESE

Fraunhofer  
IESE

Your notes:

---

---

---

---

---

---

---

---

---

---

## The idea of Transaction Level Modeling (TLM)



Simple Method Call



- we Reduce structural accuracy by replacing signals by single function calls

- e.g. AXI/AHB require more than 100 signals
- TLM is communication centric
  - Concentrate only on the important events
  - i.e. the Transactions "Data"
  - TLM bus access has 1-4 events
- TLM Simulations 100-10,000 times faster than RTL

Source: Doulos Ltd. www.doulos.com

© Fraunhofer IESE

 **Fraunhofer**  
not one  
by bit  
11  
IESE

Transaction level modeling is speeding up simulation by replacing all pin-level events with a single method call. In an RTL simulation the communication requires multiple events at the level of individual bits. In a transaction-level model, complete bus cycles can be modeled with one method call. Therefore, TLM models can be 100 to 10000 times faster than their RTL counter parts. However, there is no free lunch: the increased speedup of simulation is payed with reduced accuracy of the results.

Your notes:

What is actually based on  
Target will have a function  
and the initiator will call  
a function of this component and  
in the parameter we will  
put a packet and we will not  
do call by value here

but we will do this by  
call by reference (pointer)

- so we will create in the initiator  
this data packet and we don't  
transfer the whole packet as copy  
we just transfer a pointer, just  
64 bit number transferred and this make  
simulation faster.

## Transaction: A custom TLM Implementation

```
#include <iostream>
#include <systemc.h>
#include <queue>

using namespace std;

enum cmd {READ, WRITE};

struct transaction {
    unsigned int data;
    unsigned int addr;
    cmd command;
};

class transactionInterface : public sc_interface {
public:
    virtual void transport(transaction &trans) = 0;
};
```

→ We have a read command  
↓ write  
→ packet of bit data

Try code on github:  
[https://github.com/TUK-SCVP/SCVP.artifacts/tree/master/custom\\_tlm](https://github.com/TUK-SCVP/SCVP.artifacts/tree/master/custom_tlm)



© Fraunhofer IESE

Fraunhofer  
IESE

Function and there we pass the reference  
to such a transaction object we call trans-  
that's the main idea or the definition of the interface that  
Your notes: our component will have later on

→ As you remember Hierarchical channel:

it's channel that is not only inherit from SC\_Interface but it also inherit from SC\_Module, so it's a channel that can have processes, that's the trick in TLM that we say we have some components but the components have an interface like channel and we can connect and bind them.

## Transaction: A custom TLM Implementation

```

SC_MODULE(PRODUCER)
{
    sc_port< transactionInterface > master;
    SC_CTOR(PRODUCER)
    {
        SC_THREAD(process);
    }
    void process() write to memory
    {
        for(unsigned int i=0; i < 4; i++) {
            wait(1,SC_NS);
            transaction trans;
            trans.addr = i;
            trans.data = rand();
            trans.command = cmd::WRITE;
            master->transport(trans);
        }
        read from memory
        for(unsigned int i=0; i < 4; i++) {
            wait(1,SC_NS);
            transaction trans;
            trans.addr = i;
            trans.data = 0;
            trans.command = cmd::READ;
            master->transport(trans);
            cout << trans.data << endl;
        }
    }
};

```

Write

Read

© Fraunhofer IESE

```

class CONSUMER : public sc_module,
public transactionInterface
{
    private:
    unsigned int memory[1024];
public:
SC_CTOR(CONSUMER)
{
    for(unsigned int i=0; i < 1024; i++) {
        memory[i] = 0; // Initialize memory
    }
}

void transport(transaction &trans) {
    if(trans.command == cmd::WRITE) {
        memory[trans.addr] = trans.data;
    }
    else /* cmd::READ */ {
        trans.data = memory[trans.addr];
    }
};

int sc_main(...) {
    PRODUCER cpu("cpu");
    CONSUMER mem("memory");
    cpu.master.bind(mem);
    sc_start();
    return 0;
}

```

The consumer  
has to be derived from  
sc\_module and  
transactionInterface  
Target

We check if  
it receives  
read or write

The Producer is  
active, the  
consumer is passive

13

Fraunhofer  
IESE

Your notes:

You would notice that we don't have delta cycle, This makes simulation faster because we just communicate with function calls and the transport call is blocking call, your simulation is trapped inside this fn, that's why we call it Blocking Transport.

- Later on we will learn that calling wait statements in Targets in this Transport Function within the Target, it's not a good practice as it make simulation slower we will see that today

## TLM Basic Concept

### TLM2.0 Interoperability Layer:

- Interfaces & Sockets
- Generic Payload
- Base Protocol

Ports in TLM  
(called sockets)

Payload reference



① ensures that we can connect different components, different models from different vendors

② it defines

③ is the transaction object we saw before it contains:

- ① The Command
- ② The Address
- ③ Data
- ④ Some Byte enables
- ⑤ Some streaming things
- ⑥ Response status
- ⑦ The cool thing is you can extend it, you can put your own data fields, for example if you do memory access

④ If you want to transfer this payload to the target the initiator have to call the transport fn of the target socket the interconnect

⑧ your memory controller wants to know if this instruction fetch or data access or is this TLP miss or whatever because you want to deal with this transactions with different priorities soon, that's cool thing you can extend it easily.

Your notes:

② Forex: if you get a System-C module from ARM for a Cortex M0, you can plug it with a sensor model that you have developed, just plug and play because of Interoperability layer, it's defined in standard and everyone follows the rules of standard, so we are interoperable and we can plug models.

⑥ The interconnect has to check the address which target it will go then the initiator socket of the interconnect will call the Transport function of the Target socket here.

Note:



Target Socket Symbol

## Initiator Socket Symbols

7

Now we transfer things, through function calls we get the reference over there in the memory then we do some access on the data of Generic Payload object then it goes back to initiator

Base Protocol : it describes how this transport and these function calls should happen

Note

We create the Payload in the initiator and have a pointer to it in the Generic Payload object.

⑥ We usually call it untimed, we don't simulate time when we build Algorithm like

## Accelerator Algorithm



Fraunhofer  
IESE

⑦ Your notes:  
TLM have mainly 2 different views:

### ① Arch view.

it's called APPROXIMATELY TIMED because it's very very CLOSE to the RTL Timing but it's NOT accurate 100% but 99%

### ② Programmers view

it's called LOOSELY TIMED, it has some timing information but it's not accurate as the APPROXIMATELY timed but it's accurate to write code



Your notes:

---



---



---



---



---



---



---



---



---



---



---



---



---

## TLM Coding Styles and Mechanisms



### TLM 2.0 Coding Style (Just Guidelines)

#### Loosely-Timed (LT)

Single-phase, blocking API

Multi-phase, non-blocking API

#### Approximately-Timed (AT)

### TLM Mechanisms (Definitive API for enabling Interoperability)

Blocking transport

DMI

Quantum (Keeper)

Sockets

Generic payload

Extensions

Phases

Non-blocking transport

Source: Doulos Ltd. www.doulos.com

© Fraunhofer IESE

18

Fraunhofer  
IESE

Your notes:

- we have different TLM use cases :-

① we want to do s/w Development on H/w that is not there already so we want to do this on Model  
Check the rest of TLM use cases in up.

TLM have "Coding style", it's always nice to stick to one :

## ① Loosely Timed

"single Phase, blocking API"

- It's always Blocking, and we have just function call here.

## ② Approximately Timed

We have more than 1 Function call and more than event, and it's

not blocking API.

The best thing that you can mix between both styles

- we have different TLM mechanisms to support this different coding

styles:

Check it in diagram up.



Your notes:

---



---



---



---



---



---



---



---



---



---



---



---



---



---



---



---



---



---



---



---



---

## Coding Styles in TLM

### ■ Loosely-Timed (LT):

*We want to simulate as fast as possible*



- Sufficient timing detail to boot OS and run multicore systems and to develop SW or drivers
- Processes can run ahead of simulation time (temporal decoupling)
- Each transaction completes in one blocking function call
- Usage of Direct Memory Interface (DMI) e.g. for boot process

© Fraunhofer IESE

~~For ex, here we say  
a memory access takes  
40 nsec~~

Your notes:

### ■ Approximately-Timed (AT):

*We want to model accurately enough for performance modelling*



- Sufficient for architectural HW design space exploration

All

Processes run in lockstep with simulation time *All of them are synchronized*

- Each transaction has usually 4 timing points i.e. 4 function calls (extensible if required, also less possible); non-blocking behavior
- More detailed than LT and therefore also slower than LT

~~Fraunhofer~~

IESE

~~but here we simulate  
the memory bus, we simulate  
Queuing delays this way is better~~



Your notes:

---



---



---



---



---



---



---



---



---



---



---



---

## Initiators, Targets and Interconnect



- TLM components are divided into Initiators, Targets and Interconnect components:
  - A initiator initiates (construct and send) new transactions
  - A target is a module that acts as the end point for a transaction. It executes requests from an initiator and send responses
  - An interconnect component forwards and routes transaction objects between initiators and targets
- Transactions are sent through initiator sockets (■) and received through target sockets (□)
- References to the object are passed along the forward and backward paths:
  - LT uses Forward and Return Path
  - AT uses Forward, Backward and Return Path

23

© Fraunhofer IESE

Your notes: we have different Paths of the Information Flow.

Ex:

when italk about the ForwardPath here, The initiator calls the Fn of the Target socket and the return Path if this Fn returns -

- That means we can Transfer Information on the return Path, we Transfer forex The req is successful or not successful.

- We have Backward Path but we will discuss it in Approximate time style because I am only discuss the loosely timed

- today we have discussed about the Initiator and Target.
- ④ Initiator : Always initiate that means it always construct and send new Transaction.
  - ⑤ Target : it's a module that acts as an end point of Transaction and then it executes the request from the initiator and then it sends a response either by returning or backward Path.

# SystemC Provides Interfaces for TLM - The Interface defines which Functions For Forward Path, return path, Backward Path



© Fraunhofer IESE

fns of the Interface  
Without Implementation

Your notes:

---

---

---

---

---

---

---

---

---



Your notes:

---



---



---



---



---



---



---



---



---



---



---



---



---

## Generic Payload

- The generic payload is designed to include typical attributes of memory mapped busses (e.g. AXI, AHB, etc.)
- It can support
  - READ and WRITE transactions
  - Byte enables
  - Single word transfer
  - Burst transfer
  - Streaming transfer
- Extensions can be used to carry further metadata or to model more complex bus and NoC protocols and maintain 100% compatibility because they are ignorable
- Very efficient implementation for simulation speed

Generic payload object



© Fraunhofer IESE

Your notes:

---

---

---

---

---

---

---

---

---

---

↳ Forex for

RTL

↳ If you add another component there is not  
look at the slave extension  
it can still work

## What are the Generic Payload Attributes

| Attribute           | Type                    | Modifiable?       |
|---------------------|-------------------------|-------------------|
| Command             | tlm_command             | No                |
| Address             | uint64_t                | Interconnect Only |
| Data Pointer        | unsigned char *         | No (array yes)    |
| Data Length         | unsigned int            | No                |
| Byte Enable Pointer | unsigned char *         | No                |
| Byte Enable Length  | unsigned int            | No                |
| Streaming Width     | unsigned int            | No                |
| DMI Hint            | bool                    | Yes               |
| Response Status     | tlm_response_status     | Target Only       |
| Extensions          | (tlm_extension_base*)[] | Yes               |

The initiator should initialize attributes before sending the transaction object

Also Set-Methods like set\_address() etc. are provided

Assignment  
Cursece  
Fraunhofer IESE

Can set  
to target  
status  
if communication

The majority of the attributes must not be changed by Interconnects & Targets

But there is some exception

Address & Response Status

Fraunhofer  
IESE

Your notes: in TLM they decided to make all communication byte based.

- They consider data in array so we define the Data length

- 1



Your notes:

---



---



---



---



---



---



---



---



---



---



---



---



---

Note

\*This<sup>o</sup> is always the pointer that Points  
to the object of this class

itself so, it's own pointer.

↳ loosely typed

## Building a Blocking (LT) Initiator

SW

```
class Initiator: sc_module, tlm::tlm_bw_transport_if> {
public:
    tlm::tlm_initiator_socket<> iSocket;
    SC_CTOR(Initiator) : iSocket("iSocket") {
        iSocket.bind(*this);
        SC_THREAD(process);
    }

    void process() {
        for (int i = 0; i < 4; i++) {           Write to memory
            tlm::tlm_generic_payload trans;
            unsigned char data = rand();
            trans.set_address(i);
            trans.set_data_length(1);
            trans.set_command(tlm::TLM_WRITE_COMMAND);
            trans.set_data_ptr(&data);
            sc_time delay = sc_time(0, SC_NS);
            iSocket->b_transport(trans, delay);
            wait(delay);
        }
        for (int i = 0; i < 4; i++) {           Read from memory
            tlm::tlm_generic_payload trans;
            unsigned char data;
            trans.set_address(i);
            trans.set_data_length(1);
            trans.set_command(tlm::TLM_READ_COMMAND);
            trans.set_data_ptr(&data);
            sc_time delay = sc_time(0, SC_NS);
            iSocket->b_transport(trans, delay);
            wait(delay);
        }
    }
}
```

```
// Dummy method:
void invalidate_direct_mem_ptr(
    sc_dt::uint64 start_range,
    sc_dt::uint64 end_range)
{}
```

Must be implemented

```
// Dummy method:
tlm::tlm_sync_enum nb_transport_bw(
    tlm::tlm_generic_payload& trans,
    tlm::tlm_phase& phase,
    sc_time& delay)
{
    return tlm::TLM_ACCEPTED;
}
```



Try code on github:

[https://github.com/TUK-SCVP/SCVP.artifacts/tree/master/tlm\\_lt\\_initiator\\_target](https://github.com/TUK-SCVP/SCVP.artifacts/tree/master/tlm_lt_initiator_target)

31

Fraunhofer  
IESE

Your notes:

when we use `(this)` it gets the connection between methods that we have to implement here and later on the connection of this socket

# Building a Blocking (LT) Target

```

class Target:sc_module, tlm::tlm_fw_transport_if<> {
    private:
        unsigned char mem[1024];

    public:
        tlm::tlm_target_socket<> tSocket;

        SC_CTOR(Target) : tSocket("tSocket") {
            tSocket.bind(*this);
        }

        void b_transport(tlm::tlm_generic_payload &trans,
                         sc_time &delay)
        {
            if(trans.get_address() >= 1024){
                SC_REPORT_FATAL(this->name(), "Out of Range");
            }

            if(trans.get_command() == tlm::TLM_WRITE_COMMAND)
            {
                memcpy(mem+trans.get_address(), // destination
                       trans.get_data_ptr(), // source
                       trans.get_data_length()); // size
            } else {
                memcpy(trans.get_data_ptr(), // destination
                       mem+trans.get_address(), // source
                       trans.get_data_length()); // size
            }
            delay = delay + sc_time(40, SC_NS);
        }
        ...
    };

```

```

// Dummy method
virtual tlm::tlm_sync_enum nb_transport_fw(
    tlm::tlm_generic_payload& trans,
    tlm::tlm_phase& phase,
    sc_time& delay )
{
    return tlm::TLM_ACCEPTED;
}

// Dummy method
bool get_direct_mem_ptr(
    tlm::tlm_generic_payload& trans,
    tlm::tlm_dmi& dmi_data)
{
    return false;
}

// Dummy method
unsigned int transport_dbg(
    tlm::tlm_generic_payload& trans)
{
    return 0;
}

```

Must be implemented

32

© Fraunhofer IESE

 **Fraunhofer**  
IESE

Your notes:

---



---



---



---



---



---



---



---



---



---



---



---

# Binding Target and Initiator

SW

```
int sc_main (...)  
{  
    Initiator * cpu = new Initiator("cpu");  
    Target * memory = new Target("memory");  
  
    cpu->iSocket.bind(memory->tSocket);  
  
    sc_start();  
    return 0;  
}
```

Summary:

- Initiator and target ports are derived from `sc_port` and `sc_export`
- `b_transport` uses call by reference to transfer the transaction object (from `tlm_generic_payload` class) *and the Delay*
- The key idea of timing annotation is that the recipient is obliged to behave as if it had received the transaction at time `sc_time_stamp() + delay`
- All virtual methods must be implemented
- Transaction objects should be reused to avoid always new allocations

Try code on github:  
[https://github.com/TUK-SCVP/SCVP.artifacts/tree/master/tlm\\_It\\_initiator\\_target](https://github.com/TUK-SCVP/SCVP.artifacts/tree/master/tlm_It_initiator_target)

33

© Fraunhofer IESE

 **Fraunhofer**  
IESE

Your notes:

---

---

---

---

---

---

---

---

---

---

---

## Error Handling for b\_transport

*Initiator should set it by default*

| enum tlm_response_status       | Meaning                                           |
|--------------------------------|---------------------------------------------------|
| TLM_OK_RESPONSE                | Successful transmission                           |
| TLM_INCOMPLETE_RESPONSE        | Transaction not delivered to the target (default) |
| TLM_ADDRESS_ERROR_RESPONSE     | Unable to work with given address                 |
| TLM_COMMAND_ERROR_RESPONSE     | Unable to execute command (e.g write to ROM)      |
| TLM_BURST_ERROR_RESPONSE       | Unable to work with given datalength              |
| TLM_BYTE_ENABLE_ERROR_RESPONSE | Unabel to work with byte enable                   |
| TLM_GENERIC_ERROR_RESPONSE     | Any other error                                   |

- Initiator should set response status to TLM\_INCOMPLETE\_RESPONSE (default)
- Targets modify the response status
- Initiator checks status of transaction when it is completed (e.g. after b\_transport)

34

© Fraunhofer IESE

Your notes:

---

---

---

---

---

---

---

---

---

---

# Error Handling for b\_transport

```

class exampleInitiator: sc_module,
tlm::tlim_bw_transport_if<>
{
    ...
private:
void process()
{
    ...
    iSocket->b_transport(trans, delay);
    if(trans.is_response_error())
    {
        SC_REPORT_FATAL(name(), "Error")
    }
}
...     Detailed check can be done with
};     trans.get_response_status()

```

```

class exampleTarget : sc_module,
tlm::tlim_fw_transport_if<>
{
    ...
unsigned char mem[512];

public:
tlm::tlim_target_socket<> tSocket;

SC_CTOR(exampleTarget) : tSocket("tSocket")
{
    tSocket.bind(*this);
}

```

```

void b_transport(... &trans, ... &delay)
{
    if (trans.get_address() >= 512) {
        trans.set_response_status(
            tlm::TLM_ADDRESS_ERROR_RESPONSE );
        return;
    }
    if (trans.get_data_length() != 4) {
        trans.set_response_status(
            tlm::TLM_BURST_ERROR_RESPONSE );
        return;
    }
    if (byt) {
        trans.set_response_status(
            tlm::TLM_BYTE_ENABLE_ERROR_RESPONSE );
        return;
    }

    if(trans.get_command() == tlm::TLM_WRITE_COMMAND){
        memcpy(...);
    }
    else {
        memcpy(...);
    }

    delay = delay + sc_time(40, SC_NS);

    trans.set_response_status( tlm::TLM_OK_RESPONSE );
}

```

35

© Fraunhofer IESE

Your notes:

---



---



---



---



---



---



---



---



---



---



---



---



---

# Sequence Chart

## Blocking Transport (LT)

SW



Calling `wait()` results in a context switch ! → bad for simulation speed

36

© Fraunhofer IESE

*That's the reason we should not use wait fn's what's bad for simulation time*

Fraunhofer  
IESE

An initiator, and only an initiator, may call `b_transport`. The target can return immediately, or can suspend (by calling `wait`) and return later. However, calling `wait` statements in targets should be avoided because a wait statement will result in a context switch of the simulator, which decreases the simulation speed.

The same transaction object could be reused from one call to the next, but the two calls would count as separate transactions.

Your notes:

---



---



---



---



---



---

So we should do something different as shown below's



## Blocking Transport (LT)



Initiator should use a local time variable and should call `wait()`! → Less context switches

37

© Fraunhofer IESE

**Fraunhofer**  
IESE

Your notes:

---

---

---

---

---

---

---

---

---

---

We can chain the b\_Transport calls.

## Chaining b\_transport Calls That's the



38

© Fraunhofer IESE

**Fraunhofer**  
IESE

Your notes:

---

---

---

---

---

---

---

---

---

---

# Let's Build Interconnect Components

## Building an Interconnect Component

```

class Interconnect : sc_module,
    tlm::tlm_bw_transport_if<>,
    tlm::tlm_fw_transport_if<>
{
public:
    tlm::tlm_initiator_socket<> iSocket[2];
    tlm::tlm_target_socket<> tSocket;
    SC_CTOR(exampleInterconnect)
    {
        tSocket.bind(*this);
        iSocket[0].bind(*this);
        iSocket[1].bind(*this);
    }

    void b_transport(
        tlm::generic_payload &trans,
        sc_time &delay)
    {
        delay = delay + sc_time(40, SC_NS);

        if(trans.get_address() < 512) {
            iSocket[0]->b_transport(trans, delay);
        }
        else {
            trans.set_address(trans.get_address() - 512);
            iSocket[1]->b_transport(trans, delay);
        }
    }
    ... // Dummy Methods
};

```

**Annotations:**

- Port Array: tlm::tlm\_target\_socket<> tSocket;
- Initiator Sockets: tlm::tlm\_initiator\_socket<> iSocket[2];
- Annotate Bus Time: Annotate Bus Time
- Modify address: Modify address

```

int sc_main (int __attribute__((unused)) sc_argc,
            char __attribute__((unused)) *sc_argv[])
{
    Initiator * cpu      = new Initiator("cpu");
    Target * memory1   = new Target("memory1");
    Target * memory2   = new Target("memory2");
    Interconnect * bus = new Interconnect("bus");

    cpu->iSocket.bind(bus->tSocket);
    bus->iSocket[0].bind(memory1->tSocket);
    bus->iSocket[1].bind(memory2->tSocket);

    sc_start();
    return 0;
}

```



Try code on github:  
[https://github.com/TUK-SCVP/SCVP.artifacts/tree/master/tlm\\_It\\_initiator\\_interconnect\\_target](https://github.com/TUK-SCVP/SCVP.artifacts/tree/master/tlm_It_initiator_interconnect_target)

39

© Fraunhofer IESE

Fraunhofer  
IESE

Your notes:

Before we just had 1 CPU with one Target, now we want to have a bus model in between.

This technique is secure or slow to get everything first.

We are using raw pointers and locations. Misaligned.

Address Space  
0-511  
512-1023

Note  
Addresses  
Counts by  
Counter  
in initiator address  
we update  
the address  
from the global  
map to local address  
allow the  
target



Your notes:

---



---



---



---



---



---



---



---



---



---



---



---

... about another technique that

let's talk about how we can make simulation faster.



© Fraunhofer IESE



ideal example you have 3 CPUs under every core is calculating something But they don't communicate with each other?

Your notes:

we have now CPU, GPU, H/W A, one initiator is doing memory access then it did another memory access, then it did a wait statement. Then another initiator wants to do something so we go first to the kernel then we switch to the other initiator, but every time we switch we have an overhead.

- The idea of avoiding this context switching is the Temporally Decoupled Simulation, where we give every initiator to run all transactions during this simulation time and this

Time is called Quantum, once Quantum reach we switch to other initiator.

## Blocking Interface (LT) with Quantum

SW

lets assume

Quantum = 1us

Initiator

Simulation time = 1us

Target

Our local time simulated  
delay=950ns  
delay=970ns

Time more than  
Quantum

delay=990ns

delay=1010ns

Something  
bad happened  
we run out  
of the Q  
we are the Q

Quantum Keeper: +10ns

call b\_transport(trans,delay)

return

call b\_transport(gp,delay)

delay+=20ns

? If we reach  
(gp,1010ns)

return

If we reach Quantum time  
we wait for 1us  
Then the next Quantum  
will start  
Quantum  
with offset

wait(1us)

Simulation time = 2us

call b\_transport(gp,10ns) we start

43

© Fraunhofer IESE

Fraunhofer  
IESE

Your notes:

So System-C will perform this  
P Transport and the next Quantum  
will start with an offset, that's how  
System-C is solving this problem  
when you run out of your Quantum  
time.

For that reason there exist this Quantum  
keeper that stores the remainder of time  
of previous Quantum.

## Quantum Keeper

```

class Initiator: sc_module, tlm::tlm_bw_transport_if<>
{
    private: inherit interface Quantum
    tlm_utils::tlm_quantumkeeper quantumKeeper;

    public:
        tlm::tlm_initiator_socket<> iSocket;
        SC_CTOR(exampleInitiator) : iSocket("iSocket")
    {
        iSocket.bind(*this);
        SC_THREAD(process);
        quantumKeeper.set_global_quantum(
            sc_time(1,SC_US)
        );
        quantumKeeper.reset();
    } // initializing it

    void process()
    { // generate write
        // Write to memory:
        for (int i = 0; i < 1024; i++) {
            tlm::tlm_generic_payload trans;
            unsigned char data = rand();
            trans.set_address(i);
            trans.set_data_length(1);
            trans.set_command(tlm::TLM_WRITE_COMMAND);
            trans.set_data_ptr(&data);
            sc_time delay = quantumKeeper.get_local_time();
    }
}

```

*in initiator everything is handled for Quantum keeper*

*we set 1 byte*

*we initialize our local time.*

*Static Method*

*all classes that have Quantum keeper will have same global things here*

Then we perform P-Transport

```

iSocket->b_transport(trans, delay);
// Anotate the time of the target
quantumKeeper.set(delay);
quantumKeeper.set(trans.get_time());
quantumKeeper.set(trans.get_time() + delay);
// Consume internal computation time
quantumKeeper.inc(sc_time(10,SC_NS));
if(quantumKeeper.need_sync())
{
    quantumKeeper.sync();
}
...
// Dummy methods ...

```

*if you want to consume interfs consuming time*

*↳ Quantum keeper knows we're now 10ns further*

*Calls wait() internally*

*We check the Quantum keeper if it's not exceeding 50 ns*

*(not we go to process once we reach the Quantum keeper)*



Try code on github:

[https://github.com/TUK-SCVP/SCVP.artifacts/tree/master/tlm\\_quantum\\_keeper](https://github.com/TUK-SCVP/SCVP.artifacts/tree/master/tlm_quantum_keeper)

44

Fraunhofer  
IESE

Your notes:

---



---



---



---



---



---



---



---



---



---



---



---

# Summarization OF STEPS

## Quantum Keeper Variables and Methods



45

© Fraunhofer IESE

 **Fraunhofer**  
IESE

- There are three absolute time variables, namely the current simulation time as returned by `sc_time_stamp()`, the current time decoupled from simulation time, and the time of the next synchronization point.
- The global quantum is the time between two sync points.
- The local time is the time offset of the current time (as used by a temporally decoupled initiator) from the SystemC simulation time.
- The current time is the absolute time value as used by a temporally decoupled initiator.

Your notes:

---

---

---

---

---

---

## Quantum vs. Accuracy



- Quantum is user configurable
- Trade-off between simulation speed and accuracy
- The smaller the quantum, the more accurate the simulation
- If target uses `wait()` internally, it should set the `delay = SC_ZERO_TIME`



© Fraunhofer IESE

Our Artifacts Example (i.7, MacOS):



as we increase

Quantum size we get higher

Speed but until

we reach a certain

Point the Speed of

Simulation get

Saturation Point

no more Benefits

From increasing

Quantum.

46

Your notes:  
in general

The Smaller the Quantum

The more Accurate the Simulation

The greater the Quantum

The less accurate the Simulation

Typical system but you see  
like this

## A Closer Look on Functional Simulation Errors



### Temporal Coupled (e.g. Cycle Accurate)

- I/O Device makes an *Interrupt Request* (IRQ)
- The *Interrupt Service Routine* (ISR) will be called a few cycles later

### Temporal Decoupled Simulation

- I/O Device makes an IRQ (Interrupt request)
- The ISR will be called in the next Quantum

Temporal Coupled  
(e.g. Cycle Accurate)

When we have  
Temporal Decoupled



The CPU is running  
when of simulation  
so we get an error

© Fraunhofer IESE

 **Fraunhofer**  
IESE

47

## Checkpointing

- 15 The ability to save the state of a simulation and later pick up at the exact same point in time.



- Gläser et al. [4] presented in 2015 the idea of using checkpointing in order to rollback in simulation time and force an earlier synchronization to correct the occurred errors.

48

Source: Jacob Engblom, Intel, SystemC Evolution Day, 2017

[4] Temporal decoupling with error-bounded predictive quantum control. Georg Glaeser, Gregor Nitsche, Eckhard Hennig.  
Published in Forum on Specification and Design Languages (FDL) 2015

## The Good Old `fork()`

We make a copy of all the simulation with all the context

- `fork()` is a system call that allows a process in the OS to create a one-to-one copy of itself, called *child*.
- Supported by all OS: Linux, FreeBSD, macOS, ...
- Modern OS do not duplicate the complete memory space of a process
- Instead they use the Copy-on-Write semantic:
  - The copy operation is deferred to the first write to a memory page
  - In other words: a memory page is only copied in the moment of the change

Can `fork()` be used as an efficient way for checkpointing in order to get an error free temporal decoupled simulation?



it's Possible but it dePeas -

## Speculative Temporal Decoupling using fork()

The idea

Idea:  
we want to use

- Use `fork()` to backup the simulation state

Then we  
Execute the next quantum **speculatively**

- In case of an error rollback in simulation time
- Correct the timing error e.g. by temporary decreasing the quantum size

Two approaches investigated:

- Naïve Approach** (Forking at each quantum)
- Lockstep Approach** (Forking only in case of an error)

Synchronization of *parent* and *child* is done with pipe (acts like a barrier)

Details of the implementation are in the paper



## Approach



© Fraunhofer IESE

 **Fraunhofer**  
IESE

## Results for the Approach



- Example System with one CPU and I/O Device
- Interrupt rates between 1s - 1ms
- Synchronization with `wait()`
- Synchronization with different quanta (0s, 1us, 10us)
- Errors = Number of missed IRQ events



© Fraunhofer IESE



Your notes:

---



---



---



---



---



---



---



---



---



---



---



---

There are another Technique For making Simulation Fast

## DMI (Direct Memory Interface)



- Bypasses the Interconnects (e.g. Bus or Cache) and all socket & transport calls!
- Gives an initiator a pointer to memory region in the target
- Target can give a hint to initiator that DMI is available → *it tells initiator you can connect with me through DMI.*
- Uses also generic payload, can use also extensions
- Target can invalidate DMI regions
- Gives higher simulation speed, e.g. for booting an OS ...

© Fraunhofer IESE

*By that you reach high Simulation Speed*



The main reason for DMI is simulation speed. DMI allows the transport interface to be bypassed by giving an initiator a direct pointer to an area of memory in a target. Instead of calling transport for each read and write operation, the initiator can simply access the memory directly. For operations other than read or write, it is possible to add extensions to the DMI transaction using the standard generic payload extension mechanism. It is the responsibility of the initiator to honor any invalidation calls received from the target.

Your notes:

*The idea is as the following when you have a CPU and you have Target memory and you have some data in memory and you have the interConnect lets assume that this a very detailed interConnect model and it's very slow because you may have an RTL inside and now you want to boot Linux, it would take a week to boot Linux because*

~~This~~ interconnect is so damn slow -

What we can do for this as follows

The idea is with DMA we bypass  
this interconnect.

1

You remember we had some functions in our interface

## DMI (Direct Memory Interface)

SW

```
bool get_direct_mem_ptr(tlm_generic_payload trans, tlm_dmi dmiData);
```



```
void invalidate_direct_mem_ptr(uint64 start, uint64 end);
```

- Same routing as e.g. b\_transport
- Class `tlm_dmi`:
  - `unsigned char* dmi_ptr`
  - `uint64 start_address`
  - `uint64 end_address`
  - `dmi_access_e granted_access`
  - `sc_time read_latency`
  - `sc_time write_latency`

56

© Fraunhofer IESE

 **Fraunhofer**  
IESE

Your notes:

---

---

---

---

---

---

---

---

---

---

# DMI Initiator

```

class Initiator: sc_module, tlm::tlm_bw_transport_if<> {
    bool dmi // set to false in the constructor;
    tlm::tlm_dmi dmiData;
    ...
    void process() {
        for (int i = 0; i < 16; i++) {
            tlm::tlm_generic_payload trans;
            unsigned char data = rand();
            trans.set_address(i);
            trans.set_data_length(1);
            trans.set_command(tlm::TLM_WRITE_COMMAND);
            trans.set_data_ptr(&data);
            sc_time delay = sc_time(0, SC_NS);
            DMI start here
            if (dmi == true
                && i >= dmiData.get_start_address()
                && i <= dmiData.get_end_address())
            {
                if( trans.get_command() == tlm::TLM_READ_COMMAND
                    && dmiData.is_read_allowed())
                {
                    memcpy(&data,
                           dmiData.get_dmi_ptr() + i
                           - dmiData.get_start_address(),
                           trans.get_data_length());
                    delay += dmiData.get_read_latency();
                }
                else if( trans.get_command()== tlm::TLM_WRITE_COMMAND
                    && dmiData.is_write_allowed())
                {
                    memcpy(dmiData.get_dmi_ptr() + i
                           - dmiData.get_start_address(),
                           &data,
                           trans.get_data_length());
                    delay += dmiData.get_write_latency();
                }
            }
        }
    }
}

```

*We setup Transitions as usual*

```

} else {
    iSocket->b_transport(trans, delay);
    IF Yes
    if(trans.is_dmi_allowed() == true) {
        dmiData.init(); // Reset DMI descriptor
        dmi = iSocket->get_direct_mem_ptr(trans, dmiData);
    }
    wait(delay);
} // end for
sc_stop();
} // end process
void invalidate_direct_mem_ptr(sc_dt::uint64 start_range,
                               sc_dt::uint64 end_range)
{
    dmi = false;
}

// Dummy methods
...
};


```

*Normal b\_transport*

*Get DMI Hint!*

*We do initialization for DMI and we try to get the pointer to the location in Target*

*Then go back to check DMI*

Try code on github:  
[https://github.com/TUK-SCVP/SCVP.artifacts/tree/master/tlm\\_lt\\_dmi](https://github.com/TUK-SCVP/SCVP.artifacts/tree/master/tlm_lt_dmi)

57

 **Fraunhofer**  
IESE

Your notes:

*We Pass all the interconnect*

# DMI Interconnect

```

class exampleInterconnect : sc_module, tlm::tlm_bw_transport_if<>, tlm::tlm_fw_transport_if<>
{
public:
    tlm::tlm_initiator_socket<> iSocket;
    tlm::tlm_target_socket<> tSocket;

    SC_CTOR(exampleInterconnect) {
        tSocket.bind(*this);
        iSocket.bind(*this);
    }

    void b_transport(tlm::tlm_generic_payload &trans, sc_time &delay) {
        delay = delay + sc_time(40, SC_NS);
        iSocket->b_transport(trans, delay);
    }

    bool get_direct_mem_ptr(tlm::tlm_generic_payload& trans, tlm::tlm_dmi& dmi_data) {
        bool dmi = iSocket->get_direct_mem_ptr(trans, dmi_data);

        dmi_data.set_read_latency( dmi_data.get_read_latency() + sc_time(40, SC_NS));
        dmi_data.set_write_latency( dmi_data.get_write_latency() + sc_time(40, SC_NS));

        return dmi;
    }

    void invalidate_direct_mem_ptr(sc_dt::uint64 start_range, sc_dt::uint64 end_range)
    {
        tSocket->invalidate_direct_mem_ptr(start_range, end_range);
    }

    // Dummy methods ...
};

```

Forwarding DMI  
request on  
forward and  
backward path

58

© Fraunhofer IESE


**Fraunhofer**  
IESE

Your notes:

---



---



---



---



---



---



---



---



---



---



---



---



---

## DMI Target

```

class exampleTarget : sc_module, tlm::tlm_fw_transport_if<> {
    unsigned char mem[512];
public:
    tlm::tlm_target_socket<> tSocket;
    SC_CTOR(exampleTarget) : tSocket("tSocket") {
        tSocket.bind(*this);
        SC_THREAD(invalidateProcess);
    }
    void invalidateProcess() {
        while(true) {
            wait(500, SC_NS);
            tSocket->invalidate_direct_mem_ptr(0,511);
        }
    }
    When you get P_Transport you say hey DMA is enabled
    void b_transport(tlm::tlm_generic_payload &trans, sc_time &delay) {
        ...
        trans.set_dmi_allowed( true );
    }
    bool get_direct_mem_ptr(tlm::tlm_generic_payload& trans, tlm::tlm_dmi& dmi_data) {
        std::cout << "get_direct_mem_ptr called" << std::endl;
        dmi_data.set_dmi_ptr(mem);
        dmi_data.set_start_address(0);
        dmi_data.set_end_address(511); You Put here a pointer
        to memory
        dmi_data.set_read_latency(sc_time(40, SC_NS)); Structure
        dmi_data.set_write_latency(sc_time(40, SC_NS)); and Put
        dmi_data.allow_read_write(); addresses
        return true;
    }
    // Dummy methods ...
};


```

ranges

and you can also note

latency here for this processes when you go back to initiator

you can add up your local delay variable with this latencies

In this example the DMI access is invalidated every 500 ns, which is just an artificial example

Give Initiator a hint that DMI is possible

Configure DMI object with all relevant information

© Fraunhofer IESE

59

Your notes:

Just have an overview

it's not in P\_out yet.



Your notes:

There is another thing that helps you a lot to set up, I built a lot of virtual platforms already and the booting is really annoying, we talk about booting of Linux that we can fix it using DMI, but there is another problem which is the bootloader itself and logically you don't want to simulate bootloader but we have something really cool which called Transport Debug, which is nothing else than a blocking transport without any time.

## Debug Transport transport\_dbg

```
unsigned int transport_dbg(tlm_generic_payload trans);
```



- Gives an initiator debug access to memory in a target
- Similar to b\_transport
  - Different: delay free, no waits, no event notifications
  - Uses generic payload
  - Same routing as b\_transport
- Used for initialization, e.g. for bootloading

62

© Fraunhofer IESE

 **Fraunhofer**  
IESE

Your notes:

it has 2 Application:  
① boot loader  
you want to put bootloader in your memory and this should not simulated. You want to have everything in your memory and at once do a start simulation on everything in memory. This can be done either by giving the memory the possibility of an interface to load the stuff in local memory or if the CPU can do this job using the transport\_dbg fn.

② IF you want to check specific variable in your target for debugging purposes, maybe your target is close source thing so you can't change it anymore because it's compiled library and you want to access this register to check something, you can use transport-db

① Bootloading

② Debugging

## Debug Transport transport\_dbg

```
class Initiator : sc_module, tlm::tlm_bw_transport_if<> {
    ...
    void process() { Normal System Process
        ... // End of simulation:
        dumpMemory();
    }

    void dumpMemory()
    {
        unsigned char buffer[64];

        tlm::tlm_generic_payload trans;
        trans.set_address(0);
        trans.set_read();
        trans.set_data_length(64);
        trans.set_data_ptr(buffer);

        unsigned int n = iSocket->transport_dbg(trans);

        for(unsigned int i = 0; i < n; i++) {
            std::cout << std::hex
                << std::setfill('0')
                << std::setw(2)
                << (unsigned int)buffer[i]; we have a null for the transport
            if((i+1)%8 == 0) {
                std::cout << std::endl;
            }
        }
    };
};
```

*we set transaction to readout from GUI*

*we iterate each byte as a bit and here every 8 we have a break*

```
class Target : sc_module, tlm::tlm_fw_transport_if<>
{
    ...
    void b_transport(... &trans, ... &delay)
    {
        ...

        unsigned int transport_dbg(&trans)
        {
            if (trans.get_address() >= 1024) { you do address check
                return 0;
            }

            if(trans.get_command() == tlm::TLM_WRITE_COMMAND)
            {
                memcpy(&mem[trans.get_address()],
                    trans.get_data_ptr(),
                    trans.get_data_length());
            } else /* tlm::TLM_READ_COMMAND */
            else /* tlm::TLM_READ_COMMAND */
            {
                memcpy(trans.get_data_ptr(),
                    &mem[trans.get_address()],
                    trans.get_data_length());
            }
            return trans.get_data_length();
        };
    };
};
```

*you can write with specified length*

*↳ you return n bytes you have read.*

Try code on github:

[https://github.com/TUK-SCVP/SCVP\\_artifacts/tree/master/tlm\\_lt\\_debug\\_transport](https://github.com/TUK-SCVP/SCVP_artifacts/tree/master/tlm_lt_debug_transport)

63

© Fraunhofer IESE



Your notes:

---

---

---

---

---

---

---

---

---

---