Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VT-d: add remappable MSI and IOAPIC support #1098

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

abrandnewusername
Copy link
Contributor

@abrandnewusername abrandnewusername commented Aug 30, 2023

This pull request does:

  • add interrupt remapping table features for VT-d, and support remappable MSIs.
  • minor fixes

There are potential future works including:

  • convert ioapic into the remappable format
  • convert serial and pit interrupts from compatibility format to remappable format
  • implement the queued invalidation feature and replaces the legacy invalidation methods that are used by iotlb and context tables
  • implement posted interrupts

Update: now it supports remappable IOAPICs

@Indanz Indanz added the hw-test sel4test hardware builds + runs for this PR label Aug 30, 2023
@Indanz
Copy link
Contributor

Indanz commented Aug 30, 2023

I assume we have at least one IOMMU configuration in CI. I don't think it's easy to add tests to sel4test for this, but this should at least check that nothing broke for non-VM code.

@wom-bat
Copy link
Member

wom-bat commented Sep 6, 2023

Can you pleae check that CVE-2013-3495 and CVE-2011-1898 are addressed by this patch (those CVEs are for a parallel situation with Xen)

Copy link
Member

@wom-bat wom-bat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few minor changes; plus a check against the CVEs for IRQ remapping to check against.

src/arch/x86/object/interrupt.c Show resolved Hide resolved
src/arch/x86/object/interrupt.c Show resolved Hide resolved
src/plat/pc99/machine/intel-vtd.c Outdated Show resolved Hide resolved
src/plat/pc99/machine/intel-vtd.c Outdated Show resolved Hide resolved
src/plat/pc99/machine/intel-vtd.c Show resolved Hide resolved
src/plat/pc99/machine/intel-vtd.c Show resolved Hide resolved
@abrandnewusername
Copy link
Contributor Author

abrandnewusername commented Sep 20, 2023

Can you pleae check that CVE-2013-3495 and CVE-2011-1898 are addressed by this patch (those CVEs are for a parallel situation with Xen)

For CVE-2013-3495, in our implementation, a malformed MSI does cause an Interrupt Remapping Fault but does not panic the kernel (see vtd_handle_fault). I can also disable the fault if that's desirable.

For CVE-2011-1898, we try to prevent it by adding the remappable MSI, however, since serial and pit are using compatibility format ioapic, I am not able to disable compatibility format interrupts, which means that the users are still able to craft compatibility format MSIs that bypass the remapping table.

Copy link
Member

@wom-bat wom-bat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indan's comment on the need for #ifdeffing a couple of functions needs to be addressed.

src/arch/x86/object/interrupt.c Show resolved Hide resolved
src/plat/pc99/machine/intel-vtd.c Outdated Show resolved Hide resolved
Copy link
Member

@wom-bat wom-bat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. But I'm not a great expert in IOMMU/VT-D ... I'd like someone who is to review.

@Indanz
Copy link
Contributor

Indanz commented Sep 20, 2023

For CVE-2013-3495, in our implementation, a malformed MSI does cause an Interrupt Remapping Fault but does not panic the kernel (see vtd_handle_fault). I can also disable the fault if that's desirable.

The problem seems to be that if System Error Reporting (SERR) is enabled, such malformed transactions raise a PCI System Error NMI instead of a normal vtd fault. SERR may not make it to seL4, but might halt the whole system, depending on host configuration. See https://www.openwall.com/lists/oss-security/2013/08/20/8

@Indanz
Copy link
Contributor

Indanz commented Sep 20, 2023

For CVE-2011-1898, we try to prevent it by adding the remappable MSI, however, since serial and pit are using compatibility format ioapic, I am not able to disable compatibility format interrupts, which means that the users are still able to craft compatibility format MSIs that bypass the remapping table.

If it's an all or nothing setting then a work-around is needed for serial and timers, because hardware pass-through without remapping isn't secure. What is Xen's solution in this regard? Looking around it seems possible to remap IOAPIC interrupts, why can't that be done for serial and PIT?

If compatibility format can be enabled only for PIT and serial, that should be relatively safe: They are not DMA capable bus masters and are not PCI devices, so they can't be used for the attacks mentioned in the paper. That said, compatibility format is only allowed if x2APIC is disabled. So switching to x2APIC should make it possible to remap serial and timer interrupts too.

EDIT: The above points are not blockers for this patch and I see you listed them as future TODOs already.

@wom-bat
Copy link
Member

wom-bat commented Sep 27, 2023

For CVE-2013-3495, in our implementation, a malformed MSI does cause an Interrupt Remapping Fault but does not panic the kernel (see vtd_handle_fault). I can also disable the fault if that's desirable.

The problem seems to be that if System Error Reporting (SERR) is enabled, such malformed transactions raise a PCI System Error NMI instead of a normal vtd fault. SERR may not make it to seL4, but might halt the whole system, depending on host configuration. See https://www.openwall.com/lists/oss-security/2013/08/20/8

I think that fixing that is out of scope for this patch. To do the job properly would mean teaching seL4 about SERR NMIs --- and given that most of the time we're targetting embedded systems (even for X86) I reckon just making sure that this is noted somewhere obvious would be enough. Malformed MSIs are only going to be generated by buggy code (which will be picked up during debugging), and malicious code (which is an issue, but one that's not made worse by this.)

Copy link
Contributor

@Indanz Indanz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming this doesn't break x86 on CPUs not supporting VT-d, and everyone is happy with the code, I think it's ready to be merged?

src/arch/x86/object/interrupt.c Outdated Show resolved Hide resolved
src/plat/pc99/machine/intel-vtd.c Show resolved Hide resolved
@abrandnewusername abrandnewusername changed the title VT-d: add remappable MSI support VT-d: add remappable MSI and IOAPIC support Sep 29, 2023
@abrandnewusername
Copy link
Contributor Author

Updates:

  • refactor my MSI code to produce meaningful error messages
  • add remappable IOAPIC support. Ideally I should make a new pull request for remappable IOAPIC, but I think it makes more sense to put the commit here since the implementation of remappable IOAPIC is similar to remappable MSI

include/plat/pc99/plat/32/plat_mode/machine/hardware.bf Outdated Show resolved Hide resolved
include/plat/pc99/plat/64/plat_mode/machine/hardware.bf Outdated Show resolved Hide resolved
include/plat/pc99/plat/machine/ioapic.h Outdated Show resolved Hide resolved
src/plat/pc99/machine/intel-vtd.c Outdated Show resolved Hide resolved
src/plat/pc99/machine/intel-vtd.c Outdated Show resolved Hide resolved
src/plat/pc99/machine/intel-vtd.c Outdated Show resolved Hide resolved
Comment on lines 177 to 181
status = vtd_inspect_irte_msi(handle, vector);
if (status != EXCEPTION_NONE) {
return status;
}
vtd_create_irte_msi(handle, vector, pci_bus, pci_dev, pci_func);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not look like code that is legal to do in a decode function. At least vtd_create_irte_msi sounds like it changes state, and if vtd_inspect_irte_msi is just decoding and not changing state, it should also be called decode.. or check...

All state-changing code must go into Arch_invokeIRQControl.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Is this convention documented? Where can I find the documentation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://docs.sel4.systems/processes/style-guide.html in the Kernel code/Error handling section has that requirement.

@lsf37
Copy link
Member

lsf37 commented Oct 25, 2023

So, there is a higher-level discussion we need to have on this one. @seL4/tsc -- strictly speaking this fix/feature needs an RFC, because it changes the API. It is also a breaking change, because it reserves IRQ handler 0, which user-level systems so far will assume to be available (see sel4test test failing for x86).

It is relatively obvious that we do want this feature and it is security relevant, so we may want it for this release (13.0). The details on how to handle it are less obvious to me. Any opinions on whether this should be on hold for the release and go through the RFC process instead?

@lsf37
Copy link
Member

lsf37 commented Nov 9, 2023

So, there is a higher-level discussion we need to have on this one. @seL4/tsc -- strictly speaking this fix/feature needs an RFC, because it changes the API. It is also a breaking change, because it reserves IRQ handler 0, which user-level systems so far will assume to be available (see sel4test test failing for x86).

It is relatively obvious that we do want this feature and it is security relevant, so we may want it for this release (13.0). The details on how to handle it are less obvious to me. Any opinions on whether this should be on hold for the release and go through the RFC process instead?

Not exactly a quick discussion so far, but I think we can already say that we at least need a user-level guide description: what do the changes mean for the user, which new invocations are added, what do they do. You mentioned that you made a design decision with using IRQ 0. Can you also describe with other options you considered and what the implications would be?

Edit: you could add these here or post a separate thread on the forum, maybe as a pre-RFC

Copy link
Contributor

@Indanz Indanz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The remaining issue stopping this from being merged, as far as I can tell, are:

  • That all state-changing code must go into Arch_invokeIRQControl.
  • Fix IOAPIC IRQs if that's accidentally broken now.
  • Fix seL4test if necessary if it does something it's not supposed to.

assert(irt_pptr != paddr_to_pptr(0));

word_t entry;
for (entry = 0; entry < MAX_IOAPIC && vtd_irte_ptr_get_present(&irt_pptr[entry]); entry++);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this guaranteed to find an empty slot? If so, maybe add a comment why.

return EXCEPTION_NONE;
}

if (entry < MAX_IOAPIC || entry >= N_VTD_IRTES_4K) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that there is remappable IOAPIC support, is the check entry < MAX_IOAPIC still relevant? If so, what needs to be done to allow user space to get IRQ handlers for IOAPIC IRQs? sel4test fails because it seems to try to get IRQ handlers for IOAPIC IRQs.

@Indanz
Copy link
Contributor

Indanz commented Nov 9, 2023

strictly speaking this fix/feature needs an RFC, because it changes the API. It is also a breaking change, because it reserves IRQ handler 0, which user-level systems so far will assume to be available (see sel4test test failing for x86).

I don't see the API change, the only change I see is an extra error code. That's not API breakage. Maybe I'm missing something or misunderstanding things here, I only looked at the code changes.

I think the RFC procedure is to decide whether something is wanted and to decide on the API for new features. The first is already decided, the second doesn't apply as far as I can see.

Not exactly a quick discussion so far,

I thought I commented on this already, but apparently not. :-/

but I think we can already say that we at least need a user-level guide description: what do the changes mean for the user, which new invocations are added, what do they do. You mentioned that you made a design decision with using IRQ 0. Can you also describe with other options you considered and what the implications would be?

Where is that IRQ 0 thing mentioned? I don't see it. I also don't see new invocations.

Comment on lines 164 to 168
word_t entry;
if (!vtd_create_irte_ioapic(vector, level, ioapic_target_cpu, &entry)) {
printf("IOAPIC: failed to create IRTE for vector %lu.", vector);
return;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The &entry takes the address of a local variable. This is not allowed in the kernel.

@lsf37
Copy link
Member

lsf37 commented Nov 9, 2023

strictly speaking this fix/feature needs an RFC, because it changes the API. It is also a breaking change, because it reserves IRQ handler 0, which user-level systems so far will assume to be available (see sel4test test failing for x86).

I don't see the API change, the only change I see is an extra error code. That's not API breakage. Maybe I'm missing something or misunderstanding things here, I only looked at the code changes.

It's a bit of a grey area, that's why I was calling for discussion. A new error condition that is not just something that is missing from before, but adds new behaviour, is strictly speaking API breaking, and vtd_create_irte_msi adds new behaviour, i.e. we're now supporting remapped interrupts, but there is no mention of them in the manual or the API, so it's not clear to me what it actually is that is now supported.

I think the RFC procedure is to decide whether something is wanted and to decide on the API for new features. The first is already decided, the second doesn't apply as far as I can see.

That's why it's not a clear-cut case, it's a minor API change that on the high level is clear we want (hence could skip RFC), but the user-level explanation is missing.

Not exactly a quick discussion so far,

I thought I commented on this already, but apparently not. :-/

That's alright, there are also 12 more people who I had hoped might comment :-)

but I think we can already say that we at least need a user-level guide description: what do the changes mean for the user, which new invocations are added, what do they do. You mentioned that you made a design decision with using IRQ 0. Can you also describe with other options you considered and what the implications would be?

Where is that IRQ 0 thing mentioned? I don't see it. I also don't see new invocations.

All PC99 tests are failing (e.g. here) because IRQ handler 0 is now reserved, which is also a breaking change. I did not actually see that in the code anywhere either, though. @abrandnewusername can you point us to where that happens?

Copy link
Member

@lsf37 lsf37 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The verification code requirements and the high-level API question have to be addressed before we can merge.

Signed-off-by: Jingyao Zhou <jingyao.zhou@unsw.edu.au>
Signed-off-by: Jingyao Zhou <jingyao.zhou@unsw.edu.au>
Add interrupt remapping table features for the VT-d driver

Signed-off-by: Jingyao Zhou <jingyao.zhou@unsw.edu.au>
Add support for remappable IOAPICs.

Signed-off-by: Jingyao Zhou <jingyao.zhou@unsw.edu.au>
Add support for remappable MSIs.

Signed-off-by: Jingyao Zhou <jingyao.zhou@unsw.edu.au>
@abrandnewusername
Copy link
Contributor Author

Update:

  • the code no longer breaks the kernel C coding conventions
  • more comments added

@lsf37 @Indanz Currently the kernel IOMMU driver supports two kinds of interrupts, MSI and IOAPIC. We don't have an MSI driver in the kernel, so the handle (which will be the entry of this MSI in the interrupt remapping table (IRT)) of an MSI is passed as a parameter of this MSI cap to the kernel by the user. We do have a kernel IOAPIC driver, so we don't rely on the user to provide the handle for IOAPICs.

It leads to two issues:

  • The IOMMU driver creates IRT entries for MSIs and IOAPICs during the cap invocation, and it wouldn't know which one comes first (at least that's what I think happens). To avoid entry number clashes, I decided to reserve entry 0 to #MAX_IOAPIC - 1 (the maximum number of IOAPICs this board can have) for IOAPICs, which means that MSI can only use entry #MAX_IOAPIC to #N_VTD_IRTES_4K - 1 (the number of entries the IRT has). This is the reason why seL4test dislikes me.
  • The IRT is fix-sized. The number of IRTEs in a 4K interrupt remapping table is 256. The reason for doing that is because I use it_alloc_paging() to allocate memory for the IRT, which gives me exactly one page, and I couldn't find other tools in the kernel that allocate N contiguous pages for me. This means that if #MAX_IOAPIC is somehow too big, we leave no space for MSIs. I could make the size of the IRT configurable, but I guess it requires some more changes in the kernel.

@Indanz
Copy link
Contributor

Indanz commented Nov 11, 2023

If user space is managing the entries for MSI, but the kernel does it for IOAPIC, and it's the same range, it makes much more sense to reserve the last MAX_IOAPIC entries for IOAPIC instead of the beginning. Any reason that can't be done?

Then the only user space visible change is a lower max handle for MSI IRQs, right? That should cause much less breakage. If there is a define that user space can use to check the max handle, that should be updated too.

Edit: Alternatively, if that's not possible, add and substract MAX_IOAPIC to all handle values provided by user space to hide the change.

@Indanz
Copy link
Contributor

Indanz commented Nov 11, 2023

A limit of 256 seems quite low. You could use a static array instead of dynamically allocating the page, just align it properly. Or use alloc_rootserver_obj() instead. Not sure if there is a clean solution for this, but it seems weird to use root server memory for IRT, but I guess that's the only possible way to allocated memory dynamically in kernel code.

@lsf37
Copy link
Member

lsf37 commented Nov 11, 2023

A limit of 256 seems quite low. You could use a static array instead of dynamically allocating the page, just align it properly. Or use alloc_rootserver_obj() instead. Not sure if there is a clean solution for this, but it seems weird to use root server memory for IRT, but I guess that's the only possible way to allocated memory dynamically in kernel code.

If a static array is possible, then that would be preferable. I don't think there is precedent for the kernel to use root server resources to allocate kernel-level tables (unless it is a proper object with a retype API). We should not start to introduce some kind of hybrid -- that would definitely need an RFC discussion and TSC decision.

@Indanz
Copy link
Contributor

Indanz commented Nov 12, 2023

If a static array is possible, then that would be preferable. I don't think there is precedent for the kernel to use root server resources to allocate kernel-level tables (unless it is a proper object with a retype API). We should not start to introduce some kind of hybrid -- that would definitely need an RFC discussion and TSC decision.

The current vtd code is using it_alloc_paging(), which allocates from the root server memory.

@lsf37
Copy link
Member

lsf37 commented Nov 12, 2023

The current vtd code is using it_alloc_paging(), which allocates from the root server memory.

You're right it does, I hadn't seen that. It'd be good if someone from the systems side with experience of the x86 port could comment on this. To me, this is against the design and we should ideally fix it, not do more of it.

For instance, if we had accepted one of the schemes that reclaim root server memory, this would immediately have lead to crashes. Using special ways to give memory to the kernel sounds extremely dangerous to me.

@Indanz
Copy link
Contributor

Indanz commented Nov 12, 2023

It'd be good if someone from the systems side with experience of the x86 port could comment on this. To me, this is against the design and we should ideally fix it, not do more of it.

The amount of memory needed is hardware dependent and can be queried at startup, so I think the intention was to allocate as much as necessary instead of the theoretical maximum.

@kent-mcleod
Copy link
Member

For instance, if we had accepted one of the schemes that reclaim root server memory, this would immediately have lead to crashes. Using special ways to give memory to the kernel sounds extremely dangerous to me.

Some allocation decisions need to be computed at boot time because the number of objects the kernel init code needs to create is not known at compile time.

But all allocations still need to be declared up front with calculate_rootserver_size(). So it_alloc_paging() should be called exactly as many times as what arch_get_n_paging(it_v_reg) returns. So all allocations the kernel does can be kept in one block that is allocated once during boot. But most of the objects created inside the rootserver allocation block correspond to caps in the root thread's cnode. Recovering root memory involves reusing the resources described by caps in the root's CNode, I can't see how other ways would work.

@lsf37
Copy link
Member

lsf37 commented Nov 13, 2023

For instance, if we had accepted one of the schemes that reclaim root server memory, this would immediately have lead to crashes. Using special ways to give memory to the kernel sounds extremely dangerous to me.

Some allocation decisions need to be computed at boot time because the number of objects the kernel init code needs to create is not known at compile time.

But all allocations still need to be declared up front with calculate_rootserver_size(). So it_alloc_paging() should be called exactly as many times as what arch_get_n_paging(it_v_reg) returns. So all allocations the kernel does can be kept in one block that is allocated once during boot. But most of the objects created inside the rootserver allocation block correspond to caps in the root thread's cnode. Recovering root memory involves reusing the resources described by caps in the root's CNode, I can't see how other ways would work.

Ok, thanks for explaining it. So this is an old approach and while I would still call it flawed, we don't need to fix it now for this PR. The main problem I have with the approach is that user-level functions and kernel-level functions are mixed -- it_* functions are supposed to operate on the initial thread, not set up the kernel.

For the table needed specifically here, the approach seems to be limiting, because it can use only one page. Would it make sense to have a compile-time configured fixed-size table that can be as large as needed, and then only use as much of it as the system dynamically supports at boot time? (with and enforced max at the size of the table). This probably won't work for everything, but it sounds like it might work here, and it gives the user some control over the memory trade-off.

@Indanz
Copy link
Contributor

Indanz commented Nov 13, 2023

For the table needed specifically here, the approach seems to be limiting, because it can use only one page. Would it make sense to have a compile-time configured fixed-size table that can be as large as needed, and then only use as much of it as the system dynamically supports at boot time?

Agreed that for a fixed-size table like in this patch using it_alloc_paging() makes no sense, but solving that can go either way: Either do make it dynamic in size so it's always big enough for the hardware it runs on, following what the other IOMMU code does, or make it a static array. If doing the latter, I wouldn't bother with a compile-time option though.

Long-term, as x86's memory range is figured out dynamically anyway, memory like this should be reserved before it is given to the rest of the seL4 kernel to use, instead of stealing it from the root server.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hw-test sel4test hardware builds + runs for this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants