-
Notifications
You must be signed in to change notification settings - Fork 639
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NVIDIA Jetson Orin support #1135
base: master
Are you sure you want to change the base?
Conversation
af96ce6
to
b8d3a52
Compare
@@ -32,6 +32,11 @@ elseif(KernelArmCortexA72) | |||
# (https://developer.arm.com/documentation/100095/0001/memory-management-unit/about-the-mmu) | |||
set(KernelArmPASizeBits44 ON) | |||
math(EXPR KernelPaddrUserTop "(1 << 44)") | |||
elseif(KernelArmCortexA78AE) | |||
# Even though CortexA78AE supports 48-bits | |||
# 44 bits PA is used |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems Orin specific then? So the comment should clearly say this, or we could even have a platform-specific setting then for Orin besides the A78AE's default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that should be more clear, while the A78AE supports 48 bits, I haven't figured out a way to enabke this without causing a chain reaction of assertion failures within the kernel.
My understanding is that the ARM64 kernel only has 47 useable bits of physical memory unless hypervisor mode is enabled, if anyone knows the context to this please share :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering the Jetson Orinmodules don't support more than 64GB, let alone 1TB, why not treat is as a Cortex-A78?
Or are you also adding lockstep support to seL4 later on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
30a27f0
to
0ce5fd2
Compare
So how much does Cortex-A78AE differ from Cortex-A78? That is, are the AE differences visible to seL4? If not, perhaps just call it Cortex-A78? Or leave this to be figured out by whoever needs Cortex-A78 support? |
They're pretty similar, the seL4-visible difference would be that the A78 is only 40-bit PA while the A78AE is 48-bit PA: https://developer.arm.com/documentation/102826/latest/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All in all code seems fine.
0ce5fd2
to
2a812c7
Compare
So, what's the plan for this? It does not seem like a good idea to merge a port that needs a lot of tests disabled.. |
Indeed, on the TODO list before this is ready are:
|
29a0a27
to
9ec6c1a
Compare
Note: the serial getchar is currently left blank as only one region in the uart device is mapped into the kernel (TX). The RX region is 0x3c00000. We need a way to be able to map in both regions, without breaking the UART_PPTR abstraction. Signed-off-by: Andy Bui <andy.bui@nio.io>
Signed-off-by: Andy Bui <andy.bui@nio.io>
Signed-off-by: Andy Bui <andy.bui@nio.io>
While porting the Orin, we found that ipi_send_target made assumptions about how the clusters were identified by aff1. For the Orin SoC, the cluster is actually identified by aff2. This commit allows us to ipi_send_target on any platform that uses the gic_v3. Signed-off-by: Andy Bui <andy.bui@nio.io>
9ec6c1a
to
0a5ca78
Compare
https://github.com/seL4/seL4/actions/runs/7823562148/job/21344707601?pr=1135 Is the preprocess failure suggesting changes to me? Not sure I understand what it's trying to say. Requested re-review for new commit. |
It's showing the code changes that are visible to verification and that might break the proofs. Will have a look. |
This will indeed break verification, because it changes non-platform code. I'll see if I can comment on the code. |
/* Inner-shareable if SMP enabled, otherwise unshared (ignored for devices) */ | ||
word_t shareable = cacheable ? SMP_TERNARY(SMP_SHARE, 0) : 0; | ||
|
||
if (page_size == ARMSmallPage) { | ||
return pte_pte_4k_page_new(nonexecutable, paddr, nG, 1 /* access flag */, | ||
shareable, APFromVMRights(vm_rights), attridx); | ||
INNER_SHAREABLE, APFromVMRights(vm_rights), attridx); | ||
} else { | ||
return pte_pte_page_new(nonexecutable, paddr, nG, 1 /* access flag */, | ||
shareable, APFromVMRights(vm_rights), attridx); | ||
INNER_SHAREABLE, APFromVMRights(vm_rights), attridx); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the part that breaks verification currently. It would be relatively easy to update, but I'd like to understand better what we are really doing here. Why are we usually not setting inner shareable
, and could setting it unconditionally have any adverse effect on older platforms?
Is there any potential interaction with device memory?
The explanation in the commit message was useful. I think it would be good to have more of it as a comment in the code (maybe where INNER_SHARABLE is defined).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any potential interaction with device memory?
None, ARM states in RPYFVQ (DDI 0487J.a) that the value doesn't matter for device memory, which it will just treat as Outer Shareable.
Why are we usually not setting inner shareable, and could setting it unconditionally have any adverse effect on older platforms?
I think the current reasoning is that it's not really needed for UP systems. Having a more permissive shareability such as inner shareable
might mean wasted cycles trying to be coherent with the other PEs when it's not really needed. However, I believe that even while running seL4 in UP mode, another agent in the system may still need to be coherent with the PE/node. I have not had a deeper look as to which part of the Orin SoC might be requiring this.
I'll try and copy the explanations into the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example: We've had some existing problems with the Orin's System Cache. It is not an architectural cache and is shared by the CPU/GPU, making it effectively a CPU L4 cache. It may be the case that this is not within the non-shareable domain.
I use this example because we added a cache line invalidate by VA on a range of memory, which seemed to half-fix some of the instabilities we were encountering. Setting the shareability to IS fully fixed this issue and made the fix redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, that makes sense. It would be good to see if there is any measurable performance impact then, even if we're not expecting any. I guess we could just kick off a sel4bench run and see what happens on the existing platforms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I believe shareability and cacheability attribute values for user frame mapping is policy that needs to be delegated to user level.
I think the current reasoning is that it's not really needed for UP systems. Having a more permissive shareability such as inner shareable might mean wasted cycles trying to be coherent with the other PEs when it's not really needed.
It can also mean that the system has to fall back to uncached memory if it doesn't provide hardware cache-coherency. So increasing the shareability from none, to inner to outer can degrade performance if it's not needed.
Example: We've had some existing problems with the Orin's System Cache. It is not an architectural cache and is shared by the CPU/GPU, making it effectively a CPU L4 cache. It may be the case that this is not within the non-shareable domain.
Wouldn't this be an issue with inner vs outer cacheability attributes rather than shareability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I believe that even while running seL4 in UP mode, another agent in the system may still need to be coherent with the PE/node.
Being cache coherent with other cores should be the default behaviour for seL4, to avoid very hard to debug surprises. So I am strongly in favour of this change.
It can also mean that the system has to fall back to uncached memory if it doesn't provide hardware cache-coherency. So increasing the shareability from none, to inner to outer can degrade performance if it's not needed.
Those pages would be marked cacheable
and this change doesn't change anything for them on SMP.
These SoCs are made for coherency between cores, I don't think downgrading non-SMP performance to the same level as with SMP for some obscure corner case is a problem.
And the proper fix for that would be to extend the API and provide more fine grained cacheability and shareability attribute values to user space, like you propose.
Wouldn't this be an issue with inner vs outer cacheability attributes rather than shareability?
It does sound like that, so I'm also a bit surprised this change fixed that. Maybe the GPU is in the same cluster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this commit be split out into a separate PR and discussed separately. It's scope is more than the Jetson Orin support.
0a5ca78
to
66ab815
Compare
to Inner Shareable. We noticed a lot of instability with the Jetson Orin port when running in UP + HYP mode. Setting the shareability to Inner Shareable (IS) unconditionally is the fix, however, we're unable to find the exact line in Arm documentation that explains why this is the case. Potential theory: there are other agents on the Jetson Orin that require coherency and live in the IS domain. Being more permissive with memory resolves the prefetching and translation faults we were getting. For reference, Arm states "Arm expects operating systems to mark the majority of DRAM memory as Normal Write-back cacheable, Inner shareable" (102376_0200_01_en version 2). Signed-off-by: Andy Bui <andy.bui@nio.io>
66ab815
to
67c4803
Compare
Cursory comparison of the TX1 benchmark run with the 2024-02-09 results here https://sel4.systems/About/Performance/home.pml: Figures in cycles (std dev) Mainline (Non-shareable mappings): This PR (Inner-Shareable mappings): |
/* Assume 12 cores | ||
* NOTE: this is hardcoded to the same value for the GICR region in hardware.yml | ||
*/ | ||
#define GICR_SIZE (0x180000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps a good idea to add a compile_assert(CONFIG_MAX_NUM_NODES * GICR_PER_CORE_SIZE <= GICR_SIZE);
to reduce future frustration.
Did you manage to fix the timer problems in sel4test? |
Unfortunately no. We don't use libplatsupport so I'll need to find the bandwidth to take a deeper look. |
How do you know the problem is in libplatsupport? Do other interrupts do work in your port, or is it just the sel4test timer not getting any? Or is the problem that the interrupt triggers immediately? My concern is that there is a deeper underlying problem that causes this issue, so I really want to get to the bottom of this before merging. |
Summary: initial support for the NVIDIA Jetson Orin
I'm currently tidying up the elfloader and libplatsupport changes, but the kernel port is mostly ready to be reviewed.
sel4test results
no-MCS UP Test suite passed. 124 tests passed. 47 tests disabled.
no-MCS SMP Test suite passed. 121 tests passed. 50 tests disabled.
MCS UP Test suite passed. 150 tests passed. 40 tests disabled.
MCS SMP Test suite passed. 153 tests passed. 37 tests disabled.
HYP no-MCS UP Test suite passed. 120 tests passed. 51 tests disabled.
HYP MCS UP Test suite passed. 146 tests passed. 44 tests disabled.
HYP SMP Test suite passed. 121 tests passed. 50 tests disabled.
HYP MCS SMP Test suite passed. 149 tests passed. 41 tests disabled.
A significant amount of tests are disabled as I couldn't get the ARM generic ltimer implementation to play nice, any test that required the timer to sleep would not sleep at all, causing a lot of instability, for example in
MULTICORE0001
this check fails:https://github.com/seL4/sel4test/blob/ab9189188ff3eff46d703e36081350f0f637b647/apps/sel4test-tests/src/tests/multicore.c#L33
Where both counter and old_counter remain at 0, even though the counter should've incremented. I get similar failures on
PREEMPT_REVOKE
andSCHED0008
, as well as a few others I have forgotten.Caveats:
The Orin has a couple 8250 serial devices that could use the
tegra_omap3_dwapb.c
driver in the kernel, but as I have only tested with thetegra194-tcu
device, I'll leave that out for now.