vcpu maintenance IRQ improvement #1046
base: master
Conversation
    field idx 6
    field idxValid 1
    padding 25
    field eisr 64
Fields in the bitfield specs must be smaller than the word width of the machine, so this 64-bit field is giving you compile errors (for aarch32).
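For illustration, a minimal C sketch of the same constraint (the names are hypothetical, not the generated bitfield code): on aarch32 the generated accessors operate on 32-bit words, so a 64-bit logical field has to be split into two word-sized halves, which is what a later revision of this PR does with eisr0/eisr1.

```c
#include <stdint.h>

/* Hypothetical mirror of the split layout: two 32-bit fields
 * (as in the later eisr0/eisr1 revision) instead of one 64-bit
 * field, so every field fits within the aarch32 word width. */
typedef struct {
    uint32_t eisr0; /* EOI status for LRs 0..31  */
    uint32_t eisr1; /* EOI status for LRs 32..63 */
} vgic_eisr_t;

/* Recombine the halves when a full 64-bit view is needed. */
static inline uint64_t vgic_eisr_get(vgic_eisr_t e)
{
    return ((uint64_t)e.eisr1 << 32) | e.eisr0;
}
```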
Thanks for the help, I still have a compilation issue:
CMakeFiles/kernel.elf.dir/src/arch/arm/32/traps.S.obj CMakeFiles/kernel.elf.dir/src/arch/arm/32/hyp_traps.S.obj CMakeFiles/kernel.elf.dir/kernel_all.c.obj -o kernel.elf && :
/usr/lib/gcc-cross/arm-linux-gnueabi/10/../../../../arm-linux-gnueabi/bin/ld: CMakeFiles/kernel.elf.dir/kernel_all.c.obj: in function `VGICMaintenance.part.0':
kernel_all.c:(.text+0x965c): undefined reference to `__ctzdi2'
I will check if I can solve this soon
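For context, `__ctzdi2` is the libgcc helper that GCC emits for count-trailing-zeros on a 64-bit operand when the target (here 32-bit ARM) has no native 64-bit ctz; a freestanding kernel that doesn't link libgcc then fails at link time. A minimal sketch of one way around it, assuming the EISR value is held in a `uint64_t` (names are illustrative, not the kernel's actual code):

```c
#include <stdint.h>

/* Count trailing zeros of a 64-bit value without pulling in
 * libgcc's __ctzdi2: scan the two 32-bit halves with the native
 * __builtin_ctz. Like __builtin_ctzll, undefined for x == 0. */
static inline unsigned ctz64(uint64_t x)
{
    uint32_t lo = (uint32_t)x;
    if (lo != 0) {
        return (unsigned)__builtin_ctz(lo);
    }
    return 32u + (unsigned)__builtin_ctz((uint32_t)(x >> 32));
}
```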
Force-pushed from ab56023 to 956a049
- Active and/or pending LRs cannot trigger a maintenance IRQ with the EOI bit set.
- The EOIed LR is set to 0.
Signed-off-by: JorgeMVP <jorgepereira89@gmail.com>
Force-pushed from 003e9ee to a536f6f
- Only one bit set in the EISR is propagated; the others are masked.
- In case of multiple EOIed LRs, the maintenance IRQ is regenerated, since it is level-sensitive.
Signed-off-by: JorgeMVP <jorgepereira89@gmail.com>
- Support handling multiple EOIed LRs at once.
- EISR is fully propagated to user-space.
Signed-off-by: JorgeMVP <jorgepereira89@gmail.com>
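A minimal sketch of the masking described in the first of these commit messages, i.e. propagating only the lowest set EISR bit; `eisr` here stands for an already-read EISR value, not an actual kernel variable:

```c
#include <stdint.h>

/* Isolate the least significant set bit: x & -x clears all other
 * bits (a two's-complement identity). If further bits remain set,
 * the level-sensitive maintenance IRQ simply fires again for the
 * remaining EOIed LRs. */
static inline uint32_t lowest_eisr_bit(uint32_t eisr)
{
    return eisr & (uint32_t)(-eisr);
}
```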
    field address 32
    field inReceivePhase 1
    padding 27
    field seL4_FaultType 4
}

block UnknownSyscall {
    padding 32
    field syscallNumber 32
    padding 28
    field seL4_FaultType 4
@lsf37, I thought that the bf language could support variable size blocks for tagged enums?
I guess that doesn't matter when the storage isn't allocated based on the enum value.
> @lsf37, I thought that the bf language could support variable size blocks for tagged enums?
No, all blocks in a union must have the same size currently. There is special code so that the type field can change in size, but the block size overall must be the same.
block UnknownSyscall {
    padding 32
There's a comment on the struct tcb definition that claims seL4_FaultType is 2 words, which may need updating.
block VGICMaintenance {
    field idx 6
    field idxValid 1
    padding 25
    field eisr1 32
    field eisr0 32
    padding 28
    field seL4_FaultType 4
}
Increasing the fault storage size on all architectures may be a bigger change than we want, especially because it's unrealistic for more than 32 list registers to change within the same maintenance interrupt, isn't it?
@lsf37, any ideas about what the impact of this change would be on verification, and any ideas for alternative approaches? (Such as storing additional fault info with the faulting VCPU instead of in the faulting TCB.)
The verification impact of this is pretty high compared to the size of the code change, because it contains a loop, affects multiple architectures, changes type sizes etc. I'd definitely welcome exploring alternatives and also some data on actual performance difference before we merge it.
It looks to me like this actually is an API change, so this would require a (small) RFC.
To clarify: I think the verification cost of this change is larger than what we can do for free, so this would need funding.
Additional considerations: what is the WCET impact, does current user space need to change, and how much effort is that going to be for people?
@lsf37: seL4/seL4_projects_libs#105 would be the change in user-space.
When simple things become complex:
- (i) increased verification effort for the loop
- (ii) breaking interfaces with existing VMMs
- (iii) increased IPC message size.
The current implementation of the maintenance IRQ has ineffective code:
- reading an LR when we already know that the LR is EOIed by reading EISR
- e.g. cleaning up pending and active IRQs, which should never happen.
Understandably, nearly all parts of this MR can be done later, with the exception of the first commit, if you guys are happy with it. @kent-mcleod @lsf37 @axel-h @Indanz, what about the first commit for now, and the RFC later?
> The current implementation of the maintenance IRQ has ineffective code:
> - reading an LR when we already know that the LR is EOIed by reading EISR
> - e.g. cleaning up pending and active IRQs, which should never happen.
> Understandably, nearly all parts of this MR can be done later, with the exception of the first commit, if you guys are happy with it.
I agree that the first commit improves the code quality, but I don't think it's fixing any actual errors in the existing implementation. Sometimes the C code ends up less optimal in order to make the verification work easier: it can be very hard to show in the proof context that certain code paths are ineffective, so it's easier to add them to the C code anyway, which may have been the case here (but it may also just be bad code ;) ).
For the code in the first commit, accepting the change will require additional verification work to update the proofs, and it would depend on the verification team's capacity when this would get completed:
- https://github.com/seL4/l4v/blob/master/spec/haskell/src/SEL4/Object/VCPU/ARM.lhs#L528-L533 is the equivalent part in the executable spec,
- https://github.com/seL4/l4v/blob/master/proof/crefine/ARM_HYP/Syscall_C.thy#L1797C1-L2072 should be the proof that the C code corresponds to the executable spec.
Another example of ineffective code whose removal requires verification effort is a condition that will always return true at the end of VGICMaintenance: `if (isRunnable(NODE_STATE(ksCurThread)))`. Removing the check is blocked on verification work to establish in the proofs that it is in fact safe to remove.
> e.g. cleaning up pending and active IRQs, which should never happen.
I don't understand:
- (i) how can we test a code path that never runs?
- (ii) how can we verify a code path that never runs?
- (iii) what is the worst case that can happen if we run it?
The current seL4 VGIC model doesn't remove active or pending vIRQs from the LRs, or even disable their EOI bit. If that happens, there is an issue in the guest.
How does that help in the verification path? Runnable code is understandable, but this code never runs.
Dead code is not good for security. Even worse is dead code that does something wrong :)
Thanks for this change @JorgeMVP! If I understand correctly, this is processing and forwarding all updated LRs on each VGIC maintenance interrupt instead of one at a time. Are you able to provide some example situations where this change would lead to a performance improvement?
I think the fault size increase is unnecessary. I also think the old API should be retained so that not all VMMs in the wild need to be updated to the new ABI immediately. If you break the ABI, there are probably much better improvements to be made, like not going through user space at all to pass interrupts to the VM.
16 bits are good enough for GICv3 because that matches exactly the maximum number of LRs supported in GICv3. Typical implementations only support 4 LRs, so in this case 4 bits would also be good enough. The idea of making this 64 bits is to be generic, so it could support both GICv2 and GICv3, and both aarch32 and aarch64. Again, GICv3 can support up to 16 LRs while GICv2 can support up to 64 LRs. That's why a 64-bit wide field is the generic choice.
I think for this specific feature going to user-space is a must, because trapping via the EOI bit and the maintenance IRQ is a typical technique used for virtual devices that require emulation of a software vIRQ with level-sensitive semantics (the VMM has the opportunity to re-inject the vIRQ if the virtual device's IRQ condition is still met). Please refer to other hypervisor implementations like Xen and KVM.
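A rough sketch of the VMM-side pattern described above, with entirely hypothetical names (this is not the seL4_projects_libs API): on an EOI maintenance event, the VMM checks the emulated device and re-injects the vIRQ if its level-sensitive condition still holds.

```c
/* Hypothetical types and helpers, for illustration only. */
typedef struct vm vm_t;
extern void device_eoi(vm_t *vm, int virq);           /* device-model EOI hook   */
extern int  device_line_asserted(vm_t *vm, int virq); /* is the line still high? */
extern void vgic_inject_irq(vm_t *vm, int virq);      /* queue vIRQ into an LR   */

/* Called when the kernel reports that the guest EOIed `virq`. */
static void handle_eoi_maintenance(vm_t *vm, int virq)
{
    device_eoi(vm, virq);
    /* Level-sensitive semantics: if the virtual device still
     * asserts its interrupt line, re-inject the vIRQ. */
    if (device_line_asserted(vm, virq)) {
        vgic_inject_irq(vm, virq);
    }
}
```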
Yes, I was thinking about that. The first use-case is when the guest is using EOImode (which is typical; the Linux kernel does this). Multiple pCPUs can EOI multiple vSPIs in parallel; this is common when there is threaded IRQ support. Instead of generating multiple maintenance IRQs, it would generate just one. It also avoids several transitions from the kernel to the VMM.

The second use-case: the guest uses nested interrupt handling, and the deactivation of multiple IRQs happens one after the other, with few clock cycles spent between instructions. The maintenance IRQ is an asynchronous event; it doesn't happen immediately after an IRQ EOI, so the CPU might consume some more instructions before the maintenance IRQ is generated.

Anyway, there might be other use-cases, the first one stated being important and the second one considered rare. It is always good to know what use-cases will use this kind of feature. But more than that, we should be compatible with the ARMv8 spec. As you said, it might be a bit unrealistic to have 32/64 IRQs being EOIed at the same time, but if handling that costs us nothing, we should handle it, because the spec allows it. Thanks for the help.
What is the actual number of LRs supported by GICv2 hardware currently supported by seL4? I'm pretty sure it's nowhere near 64. This is just a performance improvement; I don't think anyone who cares about VM performance is going to use GICv2 hardware. I think getting multiple IRQs at the same time is rare enough not to optimise for, until proven otherwise with performance numbers.
This is implementation-defined. As I said, up to 64 LRs can be supported.
Having assumptions coming from nowhere is a problem; let's not go that way. My approach here is to add value to the current seL4 for virtualization. Also, please check this one: https://arxiv.org/pdf/2303.11186.pdf
I am asking about implementations, not the theoretical maximum.
They don't come from nowhere: in the past, when I optimised software to handle multiple interrupts, the performance gain was disappointing. Even with a fairly high interrupt load, the chance of multiple interrupts triggering at the same time is sadly low.
And that is much appreciated, but keep in mind that most changes add a huge verification burden, and that breaking existing software for a performance optimisation with no actual performance improvement numbers is not realistic. seL4 is not like normal software where you can just improve the code without worrying about verification. I understand the urge to clean up and improve the code, I have it all the time, but because of verification we have to restrain ourselves, and perhaps do things less perfectly, to be more achievable for verification. And the verification people are so busy that you can't expect any moderate or major changes to happen without funding.

I don't know what the current status of virtualisation verification is, though. If virtualisation isn't verified yet for ARM, then you can improve it fairly easily before that starts. But try to keep the changes contained to VM code only.

I am trying to be practical; you seem very focussed on theoretical issues, like supporting 64 LRs while no hardware has that many. Limiting the number of LRs supported by seL4 seemed like an easy way to make it more likely that your change gets merged. An alternative would be to store the information in the vCPU structure instead of the fault data itself, but that's a bigger change.

I did read that paper after you posted it last time. I hope you realise that CAmkES has huge performance issues which distort the numbers a lot. You have to know where the bottlenecks are before you can optimise things effectively, in this case the split between seL4 slowness and CAmkES slowness. Of course there is a lot of room for improvement in the seL4 kernel, and more with bigger changes to how virtualisation is done and making better use of modern hardware features. But we have to be practical, which means keeping verification, seL4's limited manpower, and avoiding user-space breakage in mind.

One of the things currently missing is virtualisation tests in sel4test and sel4bench. If we had those, it would be a bit easier to make changes like these. Now you can break the user-space ABI and still pass all the tests.
I'm not sure I follow this use case, as it seems to me that this wouldn't trigger more than one LR update for each vCPU, whereas the code being changed only iterates LRs for a single vCPU at a time:
Agree,
Agree,
Partially agreed :). If the guest uses EOIMode set to 1, then the instruction that deactivates the vIRQ, a write to ICV_DIR_EL1, can be executed by any vCPU/pCPU. If 3 LRs are injected into vCPU 0's LRs, running on top of pCPU 0, then they can be deactivated at the same time by multiple vCPUs, either running on pCPU 0 or on some other pCPU. Of course the maintenance IRQ will only be generated by pCPU 0 (it is the host of vCPU 0), because the maintenance IRQ is private to the pCPU.
Please, let's focus on the technical side, so we avoid wasting our time. I'm okay if the feature is not accepted; we all know there are tradeoffs, especially with verification, which is okay. The only thing I ask is to help me improve the MR.
This could also be an option, which is assumption-free. Supporting only 4 LRs could be possible.
Question: is there any other VMM that has publicly available results?
Ok, I can follow that the spec requires that the priority drop must happen on the same CPU that acknowledges the interrupt, but allows deactivation of SPIs on any CPU, as the deactivation happens at the distributor level. However, I can't see how a hypervisor is meant to efficiently handle a scenario where the guest is using EOIMode set to 1 and the interrupt is deactivated on a different core: how would the hypervisor find out the IRQ number being deactivated without directly trapping the write to ICV_DIR_EL1? My interpretation of the spec is that the status of ICH_EISR_EL2 is changed by the interrupt deactivation that happens with the write to ICV_DIR_EL1 when EOIMode is set to 1, and that this can only happen on the core holding the list registers, not on the core where the write to ICV_DIR_EL1 happens, so at best you'd just get a maintenance interrupt for an LR-not-present?
@kent-mcleod, very good point: even if we get this LR-not-present maintenance interrupt, it is not possible to know which vIRQ caused it. One way is to trap ICV_DIR_EL1, but there are further problems with that in the kernel. The first use-case is indeed another problem, and the solution for it is still unclear. Please have a look at the following experiments:

Thanks
I'm now also unclear as to how a host is supposed to handle a guest that is using SplitEOI mode and has more active IRQs than there are available LRs without also needing to trap
Needs trapping, otherwise there is no way. Fortunately, moving active vIRQs out of the LRs is far away from us :)
This patch requires: seL4/seL4_projects_libs#105
Signed-off-by: JorgeMVP jorgepereira89@gmail.com