New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VT-d: add remappable MSI and IOAPIC support #1098
base: master
Are you sure you want to change the base?
Conversation
5d71420
to
8847047
Compare
I assume we have at least one IOMMU configuration in CI. I don't think it's easy to add tests to sel4test for this, but this should at least check that nothing broke for non-VM code. |
5868320
to
4757c55
Compare
Can you pleae check that CVE-2013-3495 and CVE-2011-1898 are addressed by this patch (those CVEs are for a parallel situation with Xen) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few minor changes; plus a check against the CVEs for IRQ remapping to check against.
aaf058d
to
ce082dd
Compare
For CVE-2013-3495, in our implementation, a malformed MSI does cause an Interrupt Remapping Fault but does not panic the kernel (see For CVE-2011-1898, we try to prevent it by adding the remappable MSI, however, since serial and pit are using compatibility format ioapic, I am not able to disable compatibility format interrupts, which means that the users are still able to craft compatibility format MSIs that bypass the remapping table. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indan's comment on the need for #ifdeffing a couple of functions needs to be addressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. But I'm not a great expert in IOMMU/VT-D ... I'd like someone who is to review.
The problem seems to be that if System Error Reporting (SERR) is enabled, such malformed transactions raise a PCI System Error NMI instead of a normal vtd fault. SERR may not make it to seL4, but might halt the whole system, depending on host configuration. See https://www.openwall.com/lists/oss-security/2013/08/20/8 |
If it's an all or nothing setting then a work-around is needed for serial and timers, because hardware pass-through without remapping isn't secure. What is Xen's solution in this regard? Looking around it seems possible to remap IOAPIC interrupts, why can't that be done for serial and PIT? If compatibility format can be enabled only for PIT and serial, that should be relatively safe: They are not DMA capable bus masters and are not PCI devices, so they can't be used for the attacks mentioned in the paper. That said, compatibility format is only allowed if x2APIC is disabled. So switching to x2APIC should make it possible to remap serial and timer interrupts too. EDIT: The above points are not blockers for this patch and I see you listed them as future TODOs already. |
I think that fixing that is out of scope for this patch. To do the job properly would mean teaching seL4 about SERR NMIs --- and given that most of the time we're targetting embedded systems (even for X86) I reckon just making sure that this is noted somewhere obvious would be enough. Malformed MSIs are only going to be generated by buggy code (which will be picked up during debugging), and malicious code (which is an issue, but one that's not made worse by this.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming this doesn't break x86 on CPUs not supporting VT-d, and everyone is happy with the code, I think it's ready to be merged?
ce082dd
to
4117c0d
Compare
4117c0d
to
2fc9024
Compare
Updates:
|
aeeeddd
to
05ebeef
Compare
src/arch/x86/object/interrupt.c
Outdated
status = vtd_inspect_irte_msi(handle, vector); | ||
if (status != EXCEPTION_NONE) { | ||
return status; | ||
} | ||
vtd_create_irte_msi(handle, vector, pci_bus, pci_dev, pci_func); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not look like code that is legal to do in a decode function. At least vtd_create_irte_msi
sounds like it changes state, and if vtd_inspect_irte_msi
is just decoding and not changing state, it should also be called decode..
or check..
.
All state-changing code must go into Arch_invokeIRQControl
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Is this convention documented? Where can I find the documentation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://docs.sel4.systems/processes/style-guide.html in the Kernel code/Error handling section has that requirement.
So, there is a higher-level discussion we need to have on this one. @seL4/tsc -- strictly speaking this fix/feature needs an RFC, because it changes the API. It is also a breaking change, because it reserves IRQ handler 0, which user-level systems so far will assume to be available (see sel4test test failing for x86). It is relatively obvious that we do want this feature and it is security relevant, so we may want it for this release (13.0). The details on how to handle it are less obvious to me. Any opinions on whether this should be on hold for the release and go through the RFC process instead? |
Not exactly a quick discussion so far, but I think we can already say that we at least need a user-level guide description: what do the changes mean for the user, which new invocations are added, what do they do. You mentioned that you made a design decision with using IRQ 0. Can you also describe with other options you considered and what the implications would be? Edit: you could add these here or post a separate thread on the forum, maybe as a pre-RFC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The remaining issue stopping this from being merged, as far as I can tell, are:
- That all state-changing code must go into
Arch_invokeIRQControl
. - Fix IOAPIC IRQs if that's accidentally broken now.
- Fix seL4test if necessary if it does something it's not supposed to.
src/plat/pc99/machine/intel-vtd.c
Outdated
assert(irt_pptr != paddr_to_pptr(0)); | ||
|
||
word_t entry; | ||
for (entry = 0; entry < MAX_IOAPIC && vtd_irte_ptr_get_present(&irt_pptr[entry]); entry++); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this guaranteed to find an empty slot? If so, maybe add a comment why.
return EXCEPTION_NONE; | ||
} | ||
|
||
if (entry < MAX_IOAPIC || entry >= N_VTD_IRTES_4K) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that there is remappable IOAPIC support, is the check entry < MAX_IOAPIC
still relevant? If so, what needs to be done to allow user space to get IRQ handlers for IOAPIC IRQs? sel4test fails because it seems to try to get IRQ handlers for IOAPIC IRQs.
I don't see the API change, the only change I see is an extra error code. That's not API breakage. Maybe I'm missing something or misunderstanding things here, I only looked at the code changes. I think the RFC procedure is to decide whether something is wanted and to decide on the API for new features. The first is already decided, the second doesn't apply as far as I can see.
I thought I commented on this already, but apparently not. :-/
Where is that IRQ 0 thing mentioned? I don't see it. I also don't see new invocations. |
src/plat/pc99/machine/ioapic.c
Outdated
word_t entry; | ||
if (!vtd_create_irte_ioapic(vector, level, ioapic_target_cpu, &entry)) { | ||
printf("IOAPIC: failed to create IRTE for vector %lu.", vector); | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The &entry
takes the address of a local variable. This is not allowed in the kernel.
It's a bit of a grey area, that's why I was calling for discussion. A new error condition that is not just something that is missing from before, but adds new behaviour, is strictly speaking API breaking, and
That's why it's not a clear-cut case, it's a minor API change that on the high level is clear we want (hence could skip RFC), but the user-level explanation is missing.
That's alright, there are also 12 more people who I had hoped might comment :-)
All PC99 tests are failing (e.g. here) because IRQ handler 0 is now reserved, which is also a breaking change. I did not actually see that in the code anywhere either, though. @abrandnewusername can you point us to where that happens? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The verification code requirements and the high-level API question have to be addressed before we can merge.
Signed-off-by: Jingyao Zhou <jingyao.zhou@unsw.edu.au>
Signed-off-by: Jingyao Zhou <jingyao.zhou@unsw.edu.au>
Add interrupt remapping table features for the VT-d driver Signed-off-by: Jingyao Zhou <jingyao.zhou@unsw.edu.au>
Add support for remappable IOAPICs. Signed-off-by: Jingyao Zhou <jingyao.zhou@unsw.edu.au>
Add support for remappable MSIs. Signed-off-by: Jingyao Zhou <jingyao.zhou@unsw.edu.au>
05ebeef
to
41c7799
Compare
Update:
@lsf37 @Indanz Currently the kernel IOMMU driver supports two kinds of interrupts, MSI and IOAPIC. We don't have an MSI driver in the kernel, so the handle (which will be the entry of this MSI in the interrupt remapping table (IRT)) of an MSI is passed as a parameter of this MSI cap to the kernel by the user. We do have a kernel IOAPIC driver, so we don't rely on the user to provide the handle for IOAPICs. It leads to two issues:
|
If user space is managing the entries for MSI, but the kernel does it for IOAPIC, and it's the same range, it makes much more sense to reserve the last Then the only user space visible change is a lower max handle for MSI IRQs, right? That should cause much less breakage. If there is a define that user space can use to check the max handle, that should be updated too. Edit: Alternatively, if that's not possible, add and substract |
A limit of 256 seems quite low. You could use a static array instead of dynamically allocating the page, just align it properly. Or use |
If a static array is possible, then that would be preferable. I don't think there is precedent for the kernel to use root server resources to allocate kernel-level tables (unless it is a proper object with a retype API). We should not start to introduce some kind of hybrid -- that would definitely need an RFC discussion and TSC decision. |
The current vtd code is using |
You're right it does, I hadn't seen that. It'd be good if someone from the systems side with experience of the x86 port could comment on this. To me, this is against the design and we should ideally fix it, not do more of it. For instance, if we had accepted one of the schemes that reclaim root server memory, this would immediately have lead to crashes. Using special ways to give memory to the kernel sounds extremely dangerous to me. |
The amount of memory needed is hardware dependent and can be queried at startup, so I think the intention was to allocate as much as necessary instead of the theoretical maximum. |
Some allocation decisions need to be computed at boot time because the number of objects the kernel init code needs to create is not known at compile time. But all allocations still need to be declared up front with calculate_rootserver_size(). So it_alloc_paging() should be called exactly as many times as what |
Ok, thanks for explaining it. So this is an old approach and while I would still call it flawed, we don't need to fix it now for this PR. The main problem I have with the approach is that user-level functions and kernel-level functions are mixed -- For the table needed specifically here, the approach seems to be limiting, because it can use only one page. Would it make sense to have a compile-time configured fixed-size table that can be as large as needed, and then only use as much of it as the system dynamically supports at boot time? (with and enforced max at the size of the table). This probably won't work for everything, but it sounds like it might work here, and it gives the user some control over the memory trade-off. |
Agreed that for a fixed-size table like in this patch using Long-term, as x86's memory range is figured out dynamically anyway, memory like this should be reserved before it is given to the rest of the seL4 kernel to use, instead of stealing it from the root server. |
This pull request does:
There are potential future works including:
convert ioapic into the remappable formatconvert serial and pit interrupts from compatibility format to remappable formatUpdate: now it supports remappable IOAPICs