Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NVIDIA Jetson Orin support #1135

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

Conversation

andybui01
Copy link
Contributor

@andybui01 andybui01 commented Nov 17, 2023

Summary: initial support for the NVIDIA Jetson Orin

I'm currently tidying up the elfloader and libplatsupport changes, but the kernel port is mostly ready to be reviewed.

sel4test results
no-MCS UP Test suite passed. 124 tests passed. 47 tests disabled.
no-MCS SMP Test suite passed. 121 tests passed. 50 tests disabled.
MCS UP Test suite passed. 150 tests passed. 40 tests disabled.
MCS SMP Test suite passed. 153 tests passed. 37 tests disabled.

HYP no-MCS UP Test suite passed. 120 tests passed. 51 tests disabled.
HYP MCS UP Test suite passed. 146 tests passed. 44 tests disabled.
HYP SMP Test suite passed. 121 tests passed. 50 tests disabled.
HYP MCS SMP Test suite passed. 149 tests passed. 41 tests disabled.

A significant amount of tests are disabled as I couldn't get the ARM generic ltimer implementation to play nice, any test that required the timer to sleep would not sleep at all, causing a lot of instability, for example in MULTICORE0001 this check fails:

https://github.com/seL4/sel4test/blob/ab9189188ff3eff46d703e36081350f0f637b647/apps/sel4test-tests/src/tests/multicore.c#L33

Where both counter and old_counter remain at 0, even though the counter should've incremented. I get similar failures on PREEMPT_REVOKE and SCHED0008, as well as a few others I have forgotten.

Caveats:
The Orin has a couple 8250 serial devices that could use the tegra_omap3_dwapb.c driver in the kernel, but as I have only tested with the tegra194-tcu device, I'll leave that out for now.

@@ -32,6 +32,11 @@ elseif(KernelArmCortexA72)
# (https://developer.arm.com/documentation/100095/0001/memory-management-unit/about-the-mmu)
set(KernelArmPASizeBits44 ON)
math(EXPR KernelPaddrUserTop "(1 << 44)")
elseif(KernelArmCortexA78AE)
# Even though CortexA78AE supports 48-bits
# 44 bits PA is used
Copy link
Member

@axel-h axel-h Nov 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems Orin specific then? So the comment should clearly say this, or we could even have a platform-specific setting then for Orin besides the A78AE's default

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that should be more clear, while the A78AE supports 48 bits, I haven't figured out a way to enabke this without causing a chain reaction of assertion failures within the kernel.

My understanding is that the ARM64 kernel only has 47 useable bits of physical memory unless hypervisor mode is enabled, if anyone knows the context to this please share :).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering the Jetson Orinmodules don't support more than 64GB, let alone 1TB, why not treat is as a Cortex-A78?

Or are you also adding lockstep support to seL4 later on?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some discussion on 48 bit physical address space in #1175 and #1157, but the result is basically that the full 48 bit are currently not supported and would require further changes.

@andybui01 andybui01 force-pushed the andyb/orin-port branch 2 times, most recently from 30a27f0 to 0ce5fd2 Compare November 18, 2023 13:41
@Indanz
Copy link
Contributor

Indanz commented Nov 18, 2023

So how much does Cortex-A78AE differ from Cortex-A78? That is, are the AE differences visible to seL4? If not, perhaps just call it Cortex-A78? Or leave this to be figured out by whoever needs Cortex-A78 support?

@andybui01
Copy link
Contributor Author

So how much does Cortex-A78AE differ from Cortex-A78? That is, are the AE differences visible to seL4? If not, perhaps just call it Cortex-A78? Or leave this to be figured out by whoever needs Cortex-A78 support?

They're pretty similar, the seL4-visible difference would be that the A78 is only 40-bit PA while the A78AE is 48-bit PA: https://developer.arm.com/documentation/102826/latest/

Copy link
Contributor

@Indanz Indanz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All in all code seems fine.

src/drivers/serial/tegra194-tcu.c Show resolved Hide resolved
tools/hardware.yml Show resolved Hide resolved
@axel-h axel-h mentioned this pull request Nov 30, 2023
@lsf37
Copy link
Member

lsf37 commented Nov 30, 2023

A significant amount of tests are disabled as I couldn't get the ARM generic ltimer implementation to play nice, any test that required the timer to sleep would not sleep at all, causing a lot of instability, [..]

So, what's the plan for this? It does not seem like a good idea to merge a port that needs a lot of tests disabled..

@andybui01
Copy link
Contributor Author

So, what's the plan for this? It does not seem like a good idea to merge a port that needs a lot of tests disabled..

Indeed, on the TODO list before this is ready are:

  1. Prepare the elfloader changes (ongoing)
  2. Find the sticking point in libplatsupport. This is a bit harder as we don't use that library so this effort is new, and I'm not sure what could be the cause yet.

@lsf37 lsf37 added the new-platform platform ports label Feb 6, 2024
@andybui01 andybui01 force-pushed the andyb/orin-port branch 2 times, most recently from 29a0a27 to 9ec6c1a Compare February 8, 2024 01:16
Andy Bui and others added 4 commits February 8, 2024 12:25
Note: the serial getchar is currently left blank as only one region in
the uart device is mapped into the kernel (TX). The RX region is
0x3c00000. We need a way to be able to map in both regions, without
breaking the UART_PPTR abstraction.

Signed-off-by: Andy Bui <andy.bui@nio.io>
Signed-off-by: Andy Bui <andy.bui@nio.io>
Signed-off-by: Andy Bui <andy.bui@nio.io>
While porting the Orin, we found that ipi_send_target made assumptions
about how the clusters were identified by aff1. For the Orin SoC, the
cluster is actually identified by aff2.

This commit allows us to ipi_send_target on any platform that uses the
gic_v3.

Signed-off-by: Andy Bui <andy.bui@nio.io>
@andybui01
Copy link
Contributor Author

andybui01 commented Feb 8, 2024

https://github.com/seL4/seL4/actions/runs/7823562148/job/21344707601?pr=1135

Is the preprocess failure suggesting changes to me? Not sure I understand what it's trying to say.

Requested re-review for new commit.

@lsf37
Copy link
Member

lsf37 commented Feb 8, 2024

Is the preprocess failure suggesting changes to me? Not sure I understand what it's trying to say.

It's showing the code changes that are visible to verification and that might break the proofs. Will have a look.

@lsf37 lsf37 added the verification Needs formal verification input/change, or is motivated by verification label Feb 8, 2024
@lsf37
Copy link
Member

lsf37 commented Feb 8, 2024

Is the preprocess failure suggesting changes to me? Not sure I understand what it's trying to say.

It's showing the code changes that are visible to verification and that might break the proofs. Will have a look.

This will indeed break verification, because it changes non-platform code. I'll see if I can comment on the code.

Comment on lines -668 to 682
/* Inner-shareable if SMP enabled, otherwise unshared (ignored for devices) */
word_t shareable = cacheable ? SMP_TERNARY(SMP_SHARE, 0) : 0;

if (page_size == ARMSmallPage) {
return pte_pte_4k_page_new(nonexecutable, paddr, nG, 1 /* access flag */,
shareable, APFromVMRights(vm_rights), attridx);
INNER_SHAREABLE, APFromVMRights(vm_rights), attridx);
} else {
return pte_pte_page_new(nonexecutable, paddr, nG, 1 /* access flag */,
shareable, APFromVMRights(vm_rights), attridx);
INNER_SHAREABLE, APFromVMRights(vm_rights), attridx);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the part that breaks verification currently. It would be relatively easy to update, but I'd like to understand better what we are really doing here. Why are we usually not setting inner shareable, and could setting it unconditionally have any adverse effect on older platforms?

Is there any potential interaction with device memory?

The explanation in the commit message was useful. I think it would be good to have more of it as a comment in the code (maybe where INNER_SHARABLE is defined).

Copy link
Contributor Author

@andybui01 andybui01 Feb 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any potential interaction with device memory?

None, ARM states in RPYFVQ (DDI 0487J.a) that the value doesn't matter for device memory, which it will just treat as Outer Shareable.

Why are we usually not setting inner shareable, and could setting it unconditionally have any adverse effect on older platforms?

I think the current reasoning is that it's not really needed for UP systems. Having a more permissive shareability such as inner shareable might mean wasted cycles trying to be coherent with the other PEs when it's not really needed. However, I believe that even while running seL4 in UP mode, another agent in the system may still need to be coherent with the PE/node. I have not had a deeper look as to which part of the Orin SoC might be requiring this.

I'll try and copy the explanations into the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example: We've had some existing problems with the Orin's System Cache. It is not an architectural cache and is shared by the CPU/GPU, making it effectively a CPU L4 cache. It may be the case that this is not within the non-shareable domain.

I use this example because we added a cache line invalidate by VA on a range of memory, which seemed to half-fix some of the instabilities we were encountering. Setting the shareability to IS fully fixed this issue and made the fix redundant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, that makes sense. It would be good to see if there is any measurable performance impact then, even if we're not expecting any. I guess we could just kick off a sel4bench run and see what happens on the existing platforms.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I believe shareability and cacheability attribute values for user frame mapping is policy that needs to be delegated to user level.

I think the current reasoning is that it's not really needed for UP systems. Having a more permissive shareability such as inner shareable might mean wasted cycles trying to be coherent with the other PEs when it's not really needed.

It can also mean that the system has to fall back to uncached memory if it doesn't provide hardware cache-coherency. So increasing the shareability from none, to inner to outer can degrade performance if it's not needed.

Example: We've had some existing problems with the Orin's System Cache. It is not an architectural cache and is shared by the CPU/GPU, making it effectively a CPU L4 cache. It may be the case that this is not within the non-shareable domain.

Wouldn't this be an issue with inner vs outer cacheability attributes rather than shareability?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I believe that even while running seL4 in UP mode, another agent in the system may still need to be coherent with the PE/node.

Being cache coherent with other cores should be the default behaviour for seL4, to avoid very hard to debug surprises. So I am strongly in favour of this change.

It can also mean that the system has to fall back to uncached memory if it doesn't provide hardware cache-coherency. So increasing the shareability from none, to inner to outer can degrade performance if it's not needed.

Those pages would be marked cacheable and this change doesn't change anything for them on SMP.

These SoCs are made for coherency between cores, I don't think downgrading non-SMP performance to the same level as with SMP for some obscure corner case is a problem.

And the proper fix for that would be to extend the API and provide more fine grained cacheability and shareability attribute values to user space, like you propose.

Wouldn't this be an issue with inner vs outer cacheability attributes rather than shareability?

It does sound like that, so I'm also a bit surprised this change fixed that. Maybe the GPU is in the same cluster?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this commit be split out into a separate PR and discussed separately. It's scope is more than the Jetson Orin support.

to Inner Shareable.

We noticed a lot of instability with the Jetson Orin port when running
in UP + HYP mode. Setting the shareability to Inner Shareable (IS)
unconditionally is the fix, however, we're unable to find the exact line
in Arm documentation that explains why this is the case.

Potential theory: there are other agents on the Jetson Orin that require
coherency and live in the IS domain. Being more permissive with
memory resolves the prefetching and translation faults we were getting.

For reference, Arm states "Arm expects operating systems to mark the
majority of DRAM memory as Normal Write-back cacheable, Inner
shareable" (102376_0200_01_en version 2).

Signed-off-by: Andy Bui <andy.bui@nio.io>
@lsf37 lsf37 added the hw-bench run sel4bench on this PR label Feb 8, 2024
@andybui01
Copy link
Contributor Author

Cursory comparison of the TX1 benchmark run with the 2024-02-09 results here https://sel4.systems/About/Performance/home.pml:

Figures in cycles (std dev)

Mainline (Non-shareable mappings):
Call: 403 (5)
Reply: 415 (0)
MCS Call: 433 (9)
MCS Reply: 453 (12)

This PR (Inner-Shareable mappings):
Call: 406 (12)
Reply: 416 (1)
MCS Call: 432 (8)
MCS Reply: 444 (14)

/* Assume 12 cores
* NOTE: this is hardcoded to the same value for the GICR region in hardware.yml
*/
#define GICR_SIZE (0x180000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a good idea to add a compile_assert(CONFIG_MAX_NUM_NODES * GICR_PER_CORE_SIZE <= GICR_SIZE); to reduce future frustration.

@Indanz
Copy link
Contributor

Indanz commented Feb 9, 2024

Did you manage to fix the timer problems in sel4test?

@andybui01
Copy link
Contributor Author

Did you manage to fix the timer problems in sel4test?

Unfortunately no. We don't use libplatsupport so I'll need to find the bandwidth to take a deeper look.

@Indanz
Copy link
Contributor

Indanz commented Feb 9, 2024

Did you manage to fix the timer problems in sel4test?

Unfortunately no. We don't use libplatsupport so I'll need to find the bandwidth to take a deeper look.

How do you know the problem is in libplatsupport? Do other interrupts do work in your port, or is it just the sel4test timer not getting any? Or is the problem that the interrupt triggers immediately? My concern is that there is a deeper underlying problem that causes this issue, so I really want to get to the bottom of this before merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hw-bench run sel4bench on this PR new-platform platform ports verification Needs formal verification input/change, or is motivated by verification
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants