
Latency with MappedMemory #581

Open
slosar opened this issue Jan 19, 2024 · 5 comments

slosar commented Jan 19, 2024

Description

We are simulating a system that has some internal on-fabric memory and a larger external DDR memory. The external memory is "slower", i.e. there is latency associated with reads (writes might be pipelined). Can this be at least approximately simulated by inserting wait states into MappedMemory?

Usage example

ddr: Memory.MappedMemory @ sysbus 0x90000000
    size: 0x4000000
    read_latency: 2
    write_latency: 0

where latencies are expressed in clock cycles.

Additional information

I'm surprised this is not implemented yet, and a look at the code seems to confirm that it indeed is not, but perhaps I should be using a different peripheral for this.

Do you plan to address this issue and file a PR?

Perhaps, if the issue becomes pressing enough.

mateusz-holenko (Member) commented:

Hi @slosar, thanks for asking the question.

Renode simulations operate on a fully controlled virtual time flow, which allows us to obtain reproducible results and recreate scenarios where the timing of events matters from the perspective of the simulated application.

Having said that, at the CPU level we operate on a simplified model where we assume that, on average, each instruction takes the same amount of time to execute. The performance of the CPU itself (the number of instructions executed per virtual second) can be controlled with the PerformanceInMIPS parameter.
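For example, assuming the CPU is registered as sysbus.cpu (the name may differ in your platform), the value can be set from the Monitor or a .resc script:

sysbus.cpu PerformanceInMIPS 168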

For simulation performance reasons, accesses to MappedMemory are optimized and executed as pointer operations directly at the CPU level, avoiding a round trip through the C# layer.

To assess the performance of your system while taking memory accesses into account, you might use post-mortem analysis of execution traces. Please take a look at https://antmicro.com/blog/2023/07/risc-v-co-design-using-trace-based-simulation-with-renode-and-tbm/ and https://antmicro.com/blog/2022/09/execution-tracing-in-renode/.
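For instance (a sketch based on the tracing interface described in the second post; the exact arguments may differ between Renode versions), an execution trace can be enabled from the Monitor along these lines:

sysbus.cpu CreateExecutionTracing "tracer" @trace.log PCAndOpcode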

Of course, extending the current CPU model with a notion of long-running instructions is technically possible, but it would require some design and implementation work.


jzee commented Mar 5, 2024

@mateusz-holenko - that's interesting. In an attempt to understand the assumption behind the average execution time, I simulated an STM32F4 and ran 1M loops of 100 asm("nop") with PerformanceInMIPS = 168. Then I compared this to the runtime on an actual STM32F4 Discovery board, which is clocked at 168 MHz.

It turns out that the HW board performs only marginally better than the Renode simulation (613.1 ms on HW vs 625 ms on Renode).

This seems to indicate that the average execution time Renode assumes for a Cortex-M4 is only very slightly (~2%) longer than one cycle per instruction. Is that an expected result? Can I tweak that number somewhere?


slosar commented Mar 5, 2024

A somewhat related question: would it be possible to implement something like that as a Python peripheral? Can a Python peripheral force "wait" cycles on the CPU? In practice, no modern system takes constant time to fetch the contents of a memory location, but the effects of caches etc. should still be quite straightforward to simulate deterministically.


slosar commented Mar 5, 2024

> It turns out that the HW board performs only marginally better than the Renode simulation (613.1 ms on HW vs 625 ms on Renode).

I think any NOP takes exactly one cycle to execute. What you are seeing is almost certainly time "quantization". Have you tried running 10M loops vs 1M and seeing whether the discrepancy changes?
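As a back-of-envelope check, the measured numbers are consistent with one cycle per instruction (a sketch; the ~3 instructions of loop overhead per iteration are an assumption, not a measurement):

MIPS = 168e6                  # configured PerformanceInMIPS
iters = 1_000_000
instr_per_iter = 100 + 3      # 100 NOPs plus assumed loop overhead (decrement, compare, branch)
print(iters * instr_per_iter / MIPS)  # ~0.613 s, close to the 613.1 ms measured on HW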

mateusz-holenko (Member) commented:

> A somewhat related question: would it be possible to implement something like that as a Python peripheral? Can a Python peripheral force "wait" cycles on the CPU? In practice, no modern system takes constant time to fetch the contents of a memory location, but the effects of caches etc. should still be quite straightforward to simulate deterministically.

There are means of pushing virtual time forward without executing any instructions in Renode, and we sometimes use them to improve the performance of simulating busy-waiting sleep implementations.
For reference, you can take a look at the following fragment of a .resc script (this one is specific to the RISC-V platform, but the general idea is the same for all types of CPUs):

set hook
"""
# A0 (x10) holds the function's first argument: the requested delay
delay_time = cpu.GetRegisterUnsafe(10)
cpu.SkipTime(Antmicro.Renode.Time.TimeInterval.FromMilliseconds(delay_time))

# jump straight back to the caller (RA is x1 on RISC-V)
return_addr = cpu.GetRegisterUnsafe(1).RawValue
cpu.PC = Antmicro.Renode.Peripherals.CPU.RegisterValue.Create(return_addr, 64)
"""

e51 AddHook `sysbus GetSymbolAddress "z_impl_k_busy_wait"` $hook

This adds a Python hook executed each time the z_impl_k_busy_wait symbol is about to be executed. The hook decodes the function's parameter (taken from the A0 register), calls cpu.SkipTime() (that's the most interesting part), and immediately returns from the function.

You can apply the same SkipTime approach in a Python or C#-based peripheral.
To obtain the current CPU and skip some time in a Python peripheral, you can use the following snippet:

cpu = self.GetMachine().GetSystemBus(self).GetCurrentCPU()
cpu.SkipTime(Antmicro.Renode.Time.TimeInterval.FromMilliseconds(100))
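Putting these pieces together, a rough sketch of a PythonPeripheral script body that charges a fixed, deterministic latency on every read could look as follows (the 1 ms delay and the returned value are arbitrary placeholders; the request object is the standard one exposed to PythonPeripheral scripts):

if request.isRead:
    # push virtual time forward to model a slow read
    cpu = self.GetMachine().GetSystemBus(self).GetCurrentCPU()
    cpu.SkipTime(Antmicro.Renode.Time.TimeInterval.FromMilliseconds(1))
    request.value = 0  # placeholder data returned to the bus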

Another way of delaying events happening in a peripheral is to use the ScheduleAction API - see https://github.com/renode/renode-infrastructure/blob/111ceafe347c61aab9504c21980699aacbf41910/src/Emulator/Peripherals/Peripherals/UART/STM32F7_USART.cs#L128.
This does not move time forward per se, but it allows you to delay actions resulting from a bus access (e.g., setting an interrupt), thus simulating the time required to "process" the requested action.

Again, please note that this is currently not directly applicable to executable memory accesses. Support for this scenario is also technically possible, but it would require engineering work on the CPU model side.
