
vcpu maintenance IRQ improvement #1046

Open
wants to merge 7 commits into base: master from maintenance-imp

Conversation

JorgeMVP
Contributor

@JorgeMVP JorgeMVP commented May 28, 2023

  • Support handling multiple EOIed LRs per IRQ
  • Clean up for simplicity and readability on the kernel side

This patch requires: seL4/seL4_projects_libs#105

Signed-off-by: JorgeMVP jorgepereira89@gmail.com

@lsf37 lsf37 added the verification (needs formal verification input/change, or is motivated by verification) and virtualisation (relevant to virtualisation) labels May 28, 2023
field idx 6
field idxValid 1
padding 25
field eisr 64
Member
@lsf37 lsf37 May 28, 2023

Fields in the bitfield specs must not be wider than the word width of the machine, so this field here is giving you compile errors (for aarch32).

Contributor Author

Thanks for the help, I still have a compilation issue:

CMakeFiles/kernel.elf.dir/src/arch/arm/32/traps.S.obj CMakeFiles/kernel.elf.dir/src/arch/arm/32/hyp_traps.S.obj CMakeFiles/kernel.elf.dir/kernel_all.c.obj -o kernel.elf   && :
  /usr/lib/gcc-cross/arm-linux-gnueabi/10/../../../../arm-linux-gnueabi/bin/ld: CMakeFiles/kernel.elf.dir/kernel_all.c.obj: in function `VGICMaintenance.part.0':
  kernel_all.c:(.text+0x965c): undefined reference to `__ctzdi2'

I will check if I can solve this soon
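For reference, one common way around this (a minimal sketch under my own assumptions, not necessarily how the PR will solve it): the undefined __ctzdi2 comes from counting trailing zeros of a 64-bit value on aarch32, which GCC lowers to a libgcc helper. Scanning the 64-bit EISR value as two 32-bit halves only ever emits the 32-bit __builtin_ctz, so no libgcc helper is needed. The eisr_ctz64 name is purely illustrative:

#include <stdint.h>

/* Illustrative helper (not from the PR): index of the lowest EOIed LR in a
 * 64-bit EISR mask, using only 32-bit ctz so aarch32 needs no libgcc call. */
static inline int eisr_ctz64(uint64_t eisr)
{
    uint32_t lo = (uint32_t)eisr;
    uint32_t hi = (uint32_t)(eisr >> 32);

    if (lo != 0) {
        return __builtin_ctz(lo);        /* lowest EOIed LR in bits 0..31 */
    }
    if (hi != 0) {
        return 32 + __builtin_ctz(hi);   /* lowest EOIed LR in bits 32..63 */
    }
    return -1;                           /* no EOIed LR */
}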

@JorgeMVP JorgeMVP force-pushed the maintenance-imp branch 7 times, most recently from ab56023 to 956a049 Compare May 29, 2023 18:57
src/arch/arm/object/vcpu.c (outdated review thread, resolved)
- Active and/or Pending LRs cannot trigger a maintenance IRQ with the EOI bit set.
- The EOIed LR is set to 0.
Signed-off-by: JorgeMVP <jorgepereira89@gmail.com>
@JorgeMVP JorgeMVP force-pushed the maintenance-imp branch 2 times, most recently from 003e9ee to a536f6f Compare June 3, 2023 19:46
- Only one bit set in the EISR is propagated, the others are masked.
- In case of multiple EOIed LRs the maintenance IRQ is regenerated since it is level-sensitive.
Signed-off-by: JorgeMVP <jorgepereira89@gmail.com>
- Support handling multiple EOIed LRs at once.
- EISR is fully propagated to user-space.
Signed-off-by: JorgeMVP <jorgepereira89@gmail.com>
field address 32
field inReceivePhase 1
padding 27
field seL4_FaultType 4
}

block UnknownSyscall {
padding 32
field syscallNumber 32
padding 28
field seL4_FaultType 4
Member

@lsf37, I thought that the bf language could support variable size blocks for tagged enums?

Member

I guess that doesn't matter when the storage isn't allocated based on the enum value.

Member

@lsf37, I thought that the bf language could support variable size blocks for tagged enums?

No, all blocks in a union must have the same size currently. There is special code so that the type field can change in size, but the block size overall must be the same.

field address 32
field inReceivePhase 1
padding 27
field seL4_FaultType 4
}

block UnknownSyscall {
padding 32
Member

There's a comment on the struct tcb definition that claims seL4_FaultType is 2 words which may need updating.

Comment on lines 202 to 207
block VGICMaintenance {
field idx 6
field idxValid 1
padding 25
field eisr1 32
field eisr0 32
padding 28
field seL4_FaultType 4
}
Member

Increasing the fault storage size on all architectures may be a bigger change than we want, especially because it's unrealistic for more than 32 list registers to change within the same maintenance interrupt, isn't it?

@lsf37, any ideas about what the impact of this change would be on verification, and any ideas for alternative approaches? (Such as storing additional fault info with the faulting vcpu instead of in the faulting TCB?)

Member
@lsf37 lsf37 Jun 5, 2023

The verification impact of this is pretty high compared to the size of the code change, because it contains a loop, affects multiple architectures, changes type sizes etc. I'd definitely welcome exploring alternatives and also some data on actual performance difference before we merge it.

It looks to me like this actually is an API change, so this would require a (small) RFC.

To clarify: I think the verification cost of this change is larger than what we can do for free, so this would need funding.

Member

Additional considerations: what is the WCET impact, does current user space need to change, and how much effort is that going to be for people?

Contributor Author

@lsf37 seL4/seL4_projects_libs#105 would be the corresponding change in user space.

Contributor Author

When simple things become complex:

  • (i) Increased verification effort for the loop
  • (ii) Breaking interfaces with existing VMMs
  • (iii) Increased IPC message size.

The current implementation of the maintenance IRQ has ineffective code:

  • Reading an LR even when we already know from the EISR that it has been EOIed
  • e.g. cleaning up pending and active IRQs, which should never happen.
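For illustration only, a minimal sketch of the intended shape of the handler (my own stand-in names, not the PR's actual code): iterate the EISR bits and clear just the EOIed LRs, instead of reading every LR to find them. The fake_eisr/fake_lr variables stand in for the GIC virtual-interface state (ICH_EISR_EL2/GICH_EISR and the list registers):

#include <stdint.h>

#define NUM_LRS 4u                  /* implementation defined; often 4 */

static uint64_t fake_eisr;          /* stand-in for ICH_EISR_EL2 / GICH_EISR */
static uint32_t fake_lr[NUM_LRS];   /* stand-ins for the list registers */

void handle_maintenance(void)
{
    /* Read the EOI status bits once; only implemented LRs can be set. */
    uint64_t eisr = fake_eisr & ((1ULL << NUM_LRS) - 1);

    /* Note: on aarch32, __builtin_ctzll pulls in libgcc's __ctzdi2;
     * scanning the mask as two 32-bit halves (earlier sketch) avoids that. */
    while (eisr != 0) {
        unsigned idx = (unsigned)__builtin_ctzll(eisr);  /* lowest EOIed LR */
        eisr &= eisr - 1;                                /* clear that bit */

        /* An EOIed LR is invalid by definition, so it can simply be cleared
         * and the EOI forwarded; there is no need to read the LR first to
         * check its pending/active state. */
        fake_lr[idx] = 0;
        /* ...record idx (or the whole EISR value) in the fault message... */
    }
}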

Understandably, nearly all parts of this MR can be done later, with the exception of the first commit, if you are happy with it. @kent-mcleod @lsf37 @axel-h @Indanz What about merging the first commit for now, and doing the RFC later?

Member

The current implementation of the maintenance IRQ has ineffective code:

  • Reading an LR even when we already know from the EISR that it has been EOIed
  • e.g. cleaning up pending and active IRQs, which should never happen.

Understandably, nearly all parts of this MR can be done later, with the exception of the first commit, if you are happy with it.

I agree that the first commit improves the code quality, but I don't think it's fixing any actual errors in the existing implementation. Sometimes the C code ends up less optimal in order to make the verification work easier: it can be very hard to show in the proof context that certain code paths are ineffective, so it's easier to add them to the C code anyway. That may have been the case here (but it may also just be bad code ;) ).

For the code in the first commit, accepting the change will require additional verification work to update the proofs, and when that gets completed would depend on the verification team's capacity.

Another example of ineffective code whose removal requires verification effort is the condition at the end of VGICMaintenance that will always return true: if (isRunnable(NODE_STATE(ksCurThread))). Removing the check is blocked on verification work to establish in the proofs that it is in fact safe to remove.

Contributor Author
@JorgeMVP JorgeMVP Jun 12, 2023

e.g. cleaning up pending and active IRQs, which should never happen.

I don't understand:
(i) how can we test a code path that never runs?
(ii) how can we verify a code path that never runs?
(iii) what is the worst case that can happen if we run it?

The current seL4 vgic model doesn't remove active or pending vIRQs from the LRs, or even disable their EOI bit. If that happens, there is an issue in the guest.

How does that help the verification? Code that actually runs is understandable, but this code never runs.

Dead code is not good for security.

Even worse is dead code that does something wrong :)

@kent-mcleod
Member

Thanks for this change @JorgeMVP! If I understand correctly, this is processing and forwarding all updated LRs on each VGIC maintenance interrupt instead of one at a time. Are you able to provide some example situations where this change would lead to a performance improvement?

@Indanz
Contributor

Indanz commented Jun 6, 2023

I think the fault size increase is unnecessary. ICH_EISR_EL2 is only defined for 16 bits currently, while there are 60 bits available in the structure.

I also think the old API should be retained so that not all VMMs in the wild need to be updated to the new ABI immediately. If you break the ABI there are probably much better improvements to be made, like not going through user space at all to pass interrupts to the VM.

@JorgeMVP
Contributor Author

JorgeMVP commented Jun 6, 2023

I think the fault size increase is unnecessary. ICH_EISR_EL2 is only defined for 16 bits currently, while there are 60 bits available in the structure.

16 bits are good enough for GICv3 because that matches exactly the maximum number of LRs supported by GICv3. Typical implementations only support 4 LRs, so in this case 4 bits would also be good enough.

The idea of making this 64 bits is to be generic, so it could support both GICv2 and GICv3, and both aarch32 and aarch64.

Again, GICv3 can support up to 16 LRs while GICv2 can support up to 64 LRs. That's why the field is 64 bits wide, to be generic.

I also think the old API should be retained so that not all VMMs in the wild need to be updated to the new ABI immediately. If you break the ABI there are probably much better improvements to be made, like not going through user space at all to pass interrupts to the VM.

I think for this specific feature going to user space is a must, because trapping via the EOI bit and the maintenance IRQ is a typical technique for virtual devices that require emulation of a software vIRQ with level-sensitive semantics (the VMM has the opportunity to re-inject the vIRQ if the virtual device's IRQ condition is still met). Please refer to other hypervisor implementations like Xen and KVM.
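To illustrate the VMM-side pattern described above (a minimal sketch with made-up names, not the seL4_projects_libs API): on a maintenance event reporting that a level-sensitive virtual device IRQ was EOIed, the VMM checks whether the emulated device still asserts its line and re-injects the vIRQ if so:

#include <stdbool.h>

struct virt_device {
    int virq;                        /* guest-visible IRQ number */
    bool (*line_asserted)(void);     /* device model: is the level still high? */
};

/* Placeholder for the real injection call into the kernel/vgic driver. */
static void inject_virq(int virq)
{
    (void)virq;
}

/* Called by the VMM when a maintenance fault reports that dev's vIRQ was
 * EOIed by the guest. */
void on_maintenance_eoi(struct virt_device *dev)
{
    /* With level-sensitive semantics, the interrupt must fire again as long
     * as the source condition is still met. */
    if (dev->line_asserted()) {
        inject_virq(dev->virq);
    }
}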

@JorgeMVP
Contributor Author

JorgeMVP commented Jun 6, 2023

Thanks for this change @JorgeMVP! If I understand correctly, this is processing and forwarding all updated LRs on each VGIC maintenance interrupt instead of one at a time. Are you able to provide some example situations where this change would lead to a performance improvement?

Yes, I was thinking about that:

The first use case is when the guest is using EOImode (which is typical, e.g. the Linux kernel). Multiple pCPUs can EOI multiple vSPIs in parallel. This is a common thing when there is threaded IRQ support.

Instead of generating multiple maintenance IRQs, it would generate just one. It also avoids several transitions from the kernel to the VMM.

The second use case: the guest uses nested interrupt handling and the deactivation of multiple IRQs happens one after the other, with only a few clock cycles spent in between.

The maintenance IRQ is an asynchronous event; it doesn't happen immediately after the IRQ EOI, so the CPU may execute a few more instructions before the maintenance IRQ is generated.

Anyway, there might be other use cases; the first one stated is the important one and the second can be considered rare. It is always good to know what use cases will exercise this kind of feature. But more than that, we should be compatible with the Armv8 spec.

As you said, it might be a bit unrealistic to have 32/64 IRQs being EOIed at the same time, but if it costs us nothing we should handle it, because the spec allows it.

Thanks for the help

@Indanz
Contributor

Indanz commented Jun 6, 2023

Again, GICv3 can support up to 16 LRs while GICv2 can support up to 64 LRs. That's why the field is 64 bits wide, to be generic.

What is the actual hardware number of LRs supported by GICv2 hardware currently supported by seL4? I'm pretty sure it's nowhere near 64.

This is just a performance improvement; I don't think anyone who cares about VM performance is going to use GICv2 hardware. I think getting multiple IRQs at the same time is rare enough not to optimise for this case, until proven otherwise with performance numbers.

@JorgeMVP
Contributor Author

JorgeMVP commented Jun 6, 2023

What is the actual hardware number of LRs supported by GICv2 hardware currently supported by seL4? I'm pretty sure it's nowhere near 64.

This is implementation defined. As I said, up to 64 LRs can be supported.

This is just a performance improvement; I don't think anyone who cares about VM performance is going to use GICv2 hardware. I think getting multiple IRQs at the same time is rare enough not to optimise for this case, until proven otherwise with performance numbers.

Having assumptions come from nowhere is a problem. Let's not go down that path. My approach here is to add value to current seL4 for virtualization. Also, please check this one: https://arxiv.org/pdf/2303.11186.pdf

@Indanz
Contributor

Indanz commented Jun 7, 2023

What is the actual hardware number of LRs supported by GICv2 hardware currently supported by seL4? I'm pretty sure it's nowhere near 64.

This is implementation defined. As I said, up to 64 LRs can be supported.

I am asking about implementations, not the theoretical maximum.

This is just a performance improvement; I don't think anyone who cares about VM performance is going to use GICv2 hardware. I think getting multiple IRQs at the same time is rare enough not to optimise for this case, until proven otherwise with performance numbers.

Having assumptions come from nowhere is a problem. Let's not go down that path.

They don't come from nowhere: in the past, when I optimised software to handle multiple interrupts, the performance gain was disappointing. Even with a fairly high interrupt load, the chance of multiple interrupts triggering at the same time is sadly low.

My approach here is to add value to current seL4 for virtualization. Also, please check this one: https://arxiv.org/pdf/2303.11186.pdf

And that is much appreciated, but keep in mind that most changes add a huge verification burden and that breaking existing software for a performance optimisation with no actual performance improvement numbers is not realistic.

seL4 is not like normal software where you can just improve the code without worrying about verification. I understand the urge to clean up and improve the code (I have it all the time), but because of verification we have to restrain ourselves, and perhaps do things less perfectly, to keep them achievable for verification. And the verification people are so busy that you can't expect any moderate or major changes to happen without funding.

I don't know what the current status of virtualisation verification is though. If virtualisation isn't verified yet for ARM then you can improve it fairly easily before that starts. But try to keep the changes contained to VM code only.

I am trying to be practical; you seem very focussed on theoretical issues, like supporting 64 LRs while no hardware has that many. Limiting the number of LRs supported by seL4 seemed like an easy way to make it more likely that your change gets merged. An alternative would be to store the information in the vCPU structure instead of the fault data itself, but that's a bigger change.

I did read that paper after you posted it last time. I hope you realise that Camkes has huge performance issues which distort the numbers a lot. You have to know where the bottlenecks are before you can optimise things effectively, in this case, the split between seL4 slowness and Camkes slowness.

Of course there is a lot of room for improvement in the seL4 kernel, and more with bigger changes to how virtualisation is done and making better use of modern hardware features. But we have to be practical, which means keeping verification and seL4's limited manpower in mind and avoiding user space breakage.

One of the things currently missing is virtualisation tests in sel4test and sel4bench. If we had those, it would be a bit easier to make changes like these. Right now you can break the user space ABI and still pass all the tests.

@kent-mcleod
Member

The first use case is when the guest is using EOImode (which is typical, e.g. the Linux kernel). Multiple pCPUs can EOI multiple vSPIs in parallel. This is a common thing when there is threaded IRQ support.

I'm not sure I follow this use case, as it seems to me that this wouldn't trigger more than one LR updated for each vCPU, whereas the code being changed only iterates LRs for a single vCPU at a time:

  • vgic maintenance interrupts are PPIs,
  • the set of list registers is core-local for each physical and virtual CPU,
  • so if multiple pCPUs EOI multiple vSPIs in parallel, then there would be a separate maintenance interrupt for each core, and the interrupt handler on each core would process the updated LRs for that core, where only a single IRQ had been EOI'd?

@JorgeMVP
Contributor Author

JorgeMVP commented Jun 7, 2023

I'm not sure I follow this use case, as it seems to me that this wouldn't trigger more than one LR updated for each vCPU, whereas the code being changed only iterates LRs for a single vCPU at a time:

  • vgic maintenance interrupts are PPIs,

Agree,

  • the set of list registers is core-local for each physical and virtual CPU,

Agree,

  • so if multiple pCPUs EOI multiple vSPIs in parallel, then there would be a separate maintenance interrupt for each core, and the interrupt handler on each core would process the updated LRs for that core, where only a single IRQ had been EOI'd?

Partially agreed :) If the guest uses EOIMode set to 1, then the instruction that deactivates the vIRQ (a write to ICV_DIR_EL1) can be executed by any vCPU/pCPU.

If 3 vIRQs are injected into the LRs of vCPU 0 running on top of pCPU 0, then they can be deactivated at the same time by multiple vCPUs, running either on pCPU 0 or on some other pCPU.

Of course the maintenance IRQ will only be generated on pCPU 0 (the host of vCPU 0), because the maintenance IRQ is private to the pCPU.

@JorgeMVP
Contributor Author

JorgeMVP commented Jun 7, 2023

Please, let's focus on the technical side so we avoid wasting our time. I'm okay if the feature is not accepted; we all know there are tradeoffs, especially with verification, which is okay. The only thing I ask is for help to improve the MR.

Limiting the number of LRs supported by seL4 seemed like an easy way to make it more likely that your change gets merged. An alternative would be to store the information in the vCPU structure instead of the fault data itself, but that's a bigger change.

This could also be an option, which is assumption-free. Supporting only 4 LRs could be possible.

I did read that paper after you posted it last time. I hope you realise that Camkes has huge performance issues which distort the numbers a lot. You have to know where the bottlenecks are before you can optimise things effectively, in this case, the split between seL4 slowness and Camkes slowness.

Question: is there any other VMM that has publicly available results?

@kent-mcleod
Member

Partially agreed :) If the guest uses EOIMode set to 1, then the instruction that deactivates the vIRQ (a write to ICV_DIR_EL1) can be executed by any vCPU/pCPU.

If 3 vIRQs are injected into the LRs of vCPU 0 running on top of pCPU 0, then they can be deactivated at the same time by multiple vCPUs, running either on pCPU 0 or on some other pCPU.

Of course the maintenance IRQ will only be generated on pCPU 0 (the host of vCPU 0), because the maintenance IRQ is private to the pCPU.

Ok, I can follow that the spec requires that priority drop must happen on the same CPU that acknowledges the interrupt, but allows for deactivation of SPIs on any CPU as the deactivation happens at the distributor level.

However, I can't see how a hypervisor is meant to efficiently handle a scenario where the guest is using EOIMode set to 1 and the interrupt is deactivated on a different core: how would the hypervisor find out the IRQ number being deactivated without directly trapping the write to ICV_DIR_EL1? My interpretation of the spec is that the status of ICH_EISR_EL2 is changed by the interrupt deactivation that happens with the write to ICV_DIR_EL1 when EOIMode is set to 1, and that this can only happen on the core that holds the list registers, not on the core that the write to ICV_DIR_EL1 happens on, so at best you'd just get a maintenance interrupt for an LR-not-present?

@JorgeMVP
Contributor Author

Partially agreed :) If the guest uses EOIMode set to 1, then the instruction that deactivates the vIRQ (a write to ICV_DIR_EL1) can be executed by any vCPU/pCPU.
If 3 vIRQs are injected into the LRs of vCPU 0 running on top of pCPU 0, then they can be deactivated at the same time by multiple vCPUs, running either on pCPU 0 or on some other pCPU.
Of course the maintenance IRQ will only be generated on pCPU 0 (the host of vCPU 0), because the maintenance IRQ is private to the pCPU.

Ok, I can follow that the spec requires that priority drop must happen on the same CPU that acknowledges the interrupt, but allows for deactivation of SPIs on any CPU as the deactivation happens at the distributor level.

However, I can't see how a hypervisor is meant to efficiently handle a scenario where the guest is using EOIMode set to 1 and the interrupt is deactivated on a different core: how would the hypervisor find out the IRQ number being deactivated without directly trapping the write to ICV_DIR_EL1? My interpretation of the spec is that the status of ICH_EISR_EL2 is changed by the interrupt deactivation that happens with the write to ICV_DIR_EL1 when EOIMode is set to 1, and that this can only happen on the core that holds the list registers, not on the core that the write to ICV_DIR_EL1 happens on, so at best you'd just get a maintenance interrupt for an LR-not-present?

@kent-mcleod very good point; even if we get this LR-not-present maintenance interrupt, it is not possible to know which vIRQ caused it. One way is to trap ICV_DIR_EL1, but there are further problems with that in the kernel.

The first use case is indeed another problem, and the solution for it is still unclear.

Please have a look at the following experiments:
https://community.arm.com/support-forums/f/architectures-and-processors-forum/54679/ich_eisr_el2-use-cases

Thanks

@kent-mcleod
Member

@kent-mcleod very good point; even if we get this LR-not-present maintenance interrupt, it is not possible to know which vIRQ caused it. One way is to trap ICV_DIR_EL1, but there are further problems with that in the kernel.

The first use case is indeed another problem, and the solution for it is still unclear.

Please have a look at the following experiments: https://community.arm.com/support-forums/f/architectures-and-processors-forum/54679/ich_eisr_el2-use-cases

Thanks

I'm now also unclear as to how a host is supposed to handle a guest that is using SplitEOI mode and has more active IRQs than there are available LRs without also needing to trap ICV_DIR_EL1.

@JorgeMVP
Contributor Author

I'm now also unclear as to how a host is supposed to handle a guest that is using SplitEOI mode and has more active IRQs than there are available LRs without also needing to trap ICV_DIR_EL1.

It needs trapping, otherwise there is no way. Fortunately, moving active vIRQs out of the LRs is far away from us :)
