AVX KVM Xsave issue #958

Open
abmerop opened this issue Mar 23, 2024 · 3 comments
abmerop commented Mar 23, 2024

Describe the bug
The AVX / YMM register state is not saved or restored by the X86KvmCPU in gem5. When AVX is enabled in CPUID, this leads to crashes after restoring from a checkpoint.

Affects version
develop @ 0c684f2d331e47570f47e980307977284666582e

gem5 Modifications
No modification

To Reproduce
This is easiest to reproduce using a full system GPU configuration as it enables AVX by default and supports checkpoint/restore. This requires the VEGA_X86 build. The application doesn't really matter here, so for the application one can simply use a blank shell script.

  1. scons build/VEGA_X86/gem5.opt -j$(nproc)
  2. touch hello.sh
  3. build/VEGA_X86/gem5.opt util/obtain-resource.py x86-gpu-fs-img -p x86-gpu-fs-img
  4. build/VEGA_X86/gem5.opt util/obtain-resource.py x86-linux-kernel-5.4.0-105-generic -p vmlinux-5.4.0-105-generic
  5. build/VEGA_X86/gem5.opt configs/example/gpufs/vega10_kvm.py --disk-image ./x86-gpu-fs-img --kernel ./vmlinux-5.4.0-105-generic --gpu-mmio-trace gem5-resources/src/gpu-fs/vega_mmio.log --app hello.sh --checkpoint-dir hello_cpt
  6. build/VEGA_X86/gem5.opt configs/example/gpufs/vega10_kvm.py --disk-image ./x86-gpu-fs-img --kernel ./vmlinux-5.4.0-105-generic --gpu-mmio-trace gem5-resources/src/gpu-fs/vega_mmio.log --app hello.sh --restore-dir hello_cpt

Note: There is currently a workaround: the --disable-avx command line option disables AVX and avoids this issue. With that option added to the commands above, the error should not appear.

Terminal Output

[    8.736510] ------------[ cut here ]------------
[    8.736510] Bad FPU state detected at switch_fpu_return+0x7d/0x120, reinitializing FPU registers.
[    8.736510] WARNING: CPU: 0 PID: 461 at /build/linux-hwe-5.4-utjlqf/linux-hwe-5.4-5.4.0/arch/x86/mm/extable.c:114 ex_handler_fprestore+0x65/0x70
[    8.736510] Modules linked in: ib_uverbs ib_core amdgpu(OE) amd_iommu_v2 amd_sched(OE) amdttm(OE) amdkcl(OE) drm_kms_helper drm i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt pata_acpi input_leds mac_hid edac_mce_amd serio_raw sch_fq_codel ip_tables x_tables autofs4
[    8.736510] CPU: 0 PID: 461 Comm: check-new-relea Tainted: G           OE     5.4.0-105-generic #119~18.04.1-Ubuntu
[    8.736510] Hardware name:  , BIOS  06/08/2008
[    8.736510] RIP: 0010:ex_handler_fprestore+0x65/0x70
[    8.736510] Code: 00 00 00 5d c3 48 0f ae 0d 78 30 bc 01 b8 01 00 00 00 5d c3 48 89 c6 48 c7 c7 20 87 34 82 c6 05 80 97 b8 01 01 e8 1b ea 01 00 <0f> 0b eb b9 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 e5 e8 e2
[    8.736510] RSP: 0018:ffffc9000042fde0 EFLAGS: 00010086
[    8.736510] RAX: 0000000000000000 RBX: ffffc9000042fe48 RCX: 0000000000000000
[    8.736510] RDX: 0000000000000005 RSI: ffffffff82f965f5 RDI: 0000000000000046
[    8.736510] RBP: ffffc9000042fde0 R08: ffffffff82f965a0 R09: 0000000000000055
[    8.736510] R10: 0000000000000000 R11: 00000000000001cd R12: 000000000000000d
[    8.736510] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    8.736510] FS:  00007f4cdc05b740(0000) GS:ffff8880bca00000(0000) knlGS:0000000000000000
[    8.736510] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.736510] CR2: 00000000004e5ef0 CR3: 00000000b8f9a000 CR4: 00000000000406f0
[    8.736510] Call Trace:
[    8.736510]  fixup_exception+0x4a/0x60
[    8.736510]  do_general_protection+0x4e/0x150
[    8.736510]  general_protection+0x28/0x30
[    8.736510] RIP: 0010:switch_fpu_return+0x7d/0x120
[    8.736510] Code: 74 67 49 8d bc 24 00 14 00 00 48 89 7d d0 66 66 90 66 90 db e2 0f 77 db 45 d0 66 66 90 66 90 b8 ff ff ff ff 89 c2 48 0f c7 1f <65> 4c 89 2d 0b d7 fd 7e 66 66 66 66 90 45 89 b4 24 c0 13 00 00 65
[    8.736510] RSP: 0018:ffffc9000042fef8 EFLAGS: 00010086
[    8.736510] RAX: 00000000ffffffff RBX: ffff8880b8190000 RCX: 00000000000004dd
[    8.736510] RDX: 00000000ffffffff RSI: 7133cdb5d72e6598 RDI: ffff8880b8191400
[    8.736510] RBP: ffffc9000042ff28 R08: 0000000000000068 R09: 0000000000000001
[    8.736510] R10: 0000000000000068 R11: 000000000000ba5a R12: ffff8880b8190000
[    8.736510] R13: ffff8880b81913c0 R14: 0000000000000000 R15: 0000000000000000
[    8.736510]  ? schedule+0x33/0xa0
[    8.736510]  prepare_exit_to_usermode+0x98/0xa0
[    8.736510]  retint_user+0x8/0x8
[    8.736510] RIP: 0033:0x7f4cd9d7cbf5
[    8.736510] Code: 49 8b b7 80 00 00 00 4c 89 7c 24 48 c7 44 24 50 00 00 00 00 48 8d 04 80 48 8d 04 86 48 85 c0 48 89 44 24 40 0f 84 91 00 00 00 <8b> 10 48 c1 e2 04 49 03 97 88 00 00 00 48 39 c6 48 89 54 24 58 74
[    8.736510] RSP: 002b:00007fffb0954620 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
[    8.736510] RAX: 00007f4cd3f2980c RBX: 00007fffb0954640 RCX: 00007f4cd2fe8000
[    8.736510] RDX: 0000000000001281 RSI: 00007f4cd2fe8000 RDI: 00000000001b7f83
[    8.736510] RBP: 00007f4cd1ca850e R08: 00007f4cd30762fd R09: 00007f4cd9d62e00
[    8.736510] R10: 000000000234bcf0 R11: 00007f4cdbc0eb20 R12: 00007fffb0954660
[    8.736510] R13: 000000000260c4b0 R14: 00007f4cd3f1b4a0 R15: 000000000234bcf0
[    8.736510] ---[ end trace b5790e806846cb11 ]---
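
A small sketch for reading the trace above: the first register dump reports CR4 as 0x406f0, and decoding it shows that the guest kernel has OSXSAVE set, i.e. it is using the XSAVE family of instructions to switch FPU state. The bit names are the architectural ones; the helper name `decode_cr4` is made up for this note. Separately, the bytes `48 0f c7 1f` just before the marked byte in the `switch_fpu_return` code dump appear to decode as XRSTORS with a memory operand at (%rdi), which would match the "Bad FPU state" message, though that reading is by hand and not verified with a disassembler.

```python
# Architectural CR4 bit positions (subset; names as in the Intel SDM / AMD APM).
CR4_BITS = {
    0: "VME", 1: "PVI", 2: "TSD", 3: "DE", 4: "PSE", 5: "PAE",
    6: "MCE", 7: "PGE", 8: "PCE", 9: "OSFXSR", 10: "OSXMMEXCPT",
    11: "UMIP", 13: "VMXE", 14: "SMXE", 16: "FSGSBASE", 17: "PCIDE",
    18: "OSXSAVE", 20: "SMEP", 21: "SMAP", 22: "PKE",
}


def decode_cr4(value):
    """Return the names of the CR4 flags set in `value`, lowest bit first."""
    return [name for bit, name in sorted(CR4_BITS.items())
            if value & (1 << bit)]


if __name__ == "__main__":
    # CR4 value copied from the kernel trace above.
    print(decode_cr4(0x406F0))
```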

Expected behavior
There should be no kernel backtrace dumps.

Host Operating System
Ubuntu 20.04

Host ISA
amd64

Compiler used
gcc 9.4.0

Additional information

Manual "backtrace" from Linux KVM call:
https://elixir.bootlin.com/linux/v5.4/source/arch/x86/kvm/x86.c#L3442
https://elixir.bootlin.com/linux/v5.4/source/arch/x86/kernel/fpu/core.c#L338
https://elixir.bootlin.com/linux/v5.4/source/arch/x86/include/asm/fpu/internal.h#L534
https://elixir.bootlin.com/linux/v5.4/source/arch/x86/include/asm/fpu/internal.h#L457
https://elixir.bootlin.com/linux/v5.4/source/arch/x86/include/asm/fpu/internal.h#L445
https://elixir.bootlin.com/linux/v5.4/source/arch/x86/include/asm/fpu/internal.h#L338
https://elixir.bootlin.com/linux/v5.4/source/arch/x86/include/asm/fpu/internal.h#L260
https://elixir.bootlin.com/linux/v5.4/source/arch/x86/include/asm/asm.h#L153
https://elixir.bootlin.com/linux/v5.4/source/arch/x86/mm/extable.c#L106

@abmerop abmerop added the bug label Mar 23, 2024
@abmerop abmerop self-assigned this Mar 23, 2024

abmerop commented Mar 23, 2024

@mattsinc @v-ramadas These are the notes I took when diagnosing this issue. I thought they might be helpful.

@mattsinc

@mattsinc @v-ramadas These are the notes I took when diagnosing this issue. I thought they might be helpful

Thanks!


abmerop commented Apr 8, 2024

FYI: @nmosier pointed out to me via email that the XCR registers aren't being checkpointed. I attempted a hacky workaround to set XCR0 to the pre-checkpoint value but unfortunately the error persists, so while that is something that needs to be implemented, there is more work to be done still.
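
To illustrate why a stale XCR0 matters, here is a minimal sketch of two architectural rules, not of gem5 or KVM code. XCR0 comes out of reset as 1 (x87 only), XSETBV rejects values with bit 0 clear or with AVX set but SSE clear, and the XRSTOR family raises #GP when the XSTATE_BV field of the save area names a component that is not enabled (for XRSTORS the enabled set is XCR0 | IA32_XSS). The function names are invented for this note. This is only one of several #GP conditions, and as noted above, restoring XCR0 by hand did not make the error go away on its own.

```python
XFEATURE_X87 = 1 << 0
XFEATURE_SSE = 1 << 1
XFEATURE_AVX = 1 << 2

XCR0_RESET_VALUE = XFEATURE_X87  # architectural value of XCR0 after reset


def xcr0_is_valid(xcr0):
    """Check the XSETBV rules for the x87/SSE/AVX bits only."""
    if not xcr0 & XFEATURE_X87:
        return False  # bit 0 must always be set
    if xcr0 & XFEATURE_AVX and not xcr0 & XFEATURE_SSE:
        return False  # AVX state requires SSE state
    return True


def xrstor_would_fault(xstate_bv, xcr0, xss=0):
    """True if XSTATE_BV names a component outside the enabled set."""
    return bool(xstate_bv & ~(xcr0 | xss))


if __name__ == "__main__":
    saved_by_guest = XFEATURE_X87 | XFEATURE_SSE | XFEATURE_AVX
    # Before the checkpoint the guest had enabled AVX: no fault.
    print(xrstor_would_fault(saved_by_guest, xcr0=0b111))
    # After a restore that resets XCR0 to its default: fault.
    print(xrstor_would_fault(saved_by_guest, xcr0=XCR0_RESET_VALUE))
```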

abmerop added a commit to abmerop/gem5 that referenced this issue Apr 19, 2024
The extended control registers were not being updated in the KVM thread
context nor updated in the KVM state. This was causing issues when
checkpointing since the XCR0 value was reverting to the default value
rather than what it was before the checkpoint. This was
causing multiple applications to crash due to executing instructions
which are now illegal due to XCR0 being incorrect.

This commit adds the XCR0 as a misc register similar to the existing x86
control registers and adds all of the helper functions to access and set
the register value. It also adds support for updating the KVM CPU's
state with the register value and updating the thread context's misc reg
value so that it is checkpointed along with the other misc regs.

Note that this does *not* add support for XSAVE of the AVX state (i.e.,
the upper 128 bits of YMM registers). It does however fix the immediate
problem in issue gem5#958 .

Change-Id: I97456c8b57cbc7b381bd4be94944ce6567a43c76
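
For reference on what "the upper 128 bits of YMM registers" means in terms of saved state, the sketch below reassembles one YMM register from a standard-format (non-compacted) XSAVE area. It assumes the usual layout: XMM registers in the legacy region starting at byte 160, the XSAVE header (XSTATE_BV) at byte 512, and the AVX component at byte 576. Strictly, the AVX offset is enumerated by CPUID leaf 0xD sub-leaf 2, so 576 is an assumption here, as is the helper name `read_ymm`. It is not gem5 code and does not show how a fix would be wired into the X86KvmCPU.

```python
import struct

LEGACY_XMM_OFFSET = 160     # XMM0 in the FXSAVE-compatible legacy region
XSAVE_HEADER_OFFSET = 512   # XSTATE_BV is the first 8 bytes of the header
AVX_HI128_OFFSET = 576      # assumed; really reported by CPUID.(EAX=0DH,ECX=2)

XFEATURE_SSE = 1 << 1
XFEATURE_AVX = 1 << 2


def read_ymm(xsave, index):
    """Return the 32 bytes of YMM<index> from a standard-format XSAVE area.

    A component whose XSTATE_BV bit is clear is in its initial state,
    which for XMM and YMM_Hi128 is all zeros.
    """
    if not 0 <= index < 16:
        raise IndexError("YMM index out of range")
    (xstate_bv,) = struct.unpack_from("<Q", xsave, XSAVE_HEADER_OFFSET)

    low = bytes(16)
    if xstate_bv & XFEATURE_SSE:
        start = LEGACY_XMM_OFFSET + 16 * index
        low = bytes(xsave[start:start + 16])

    high = bytes(16)
    if xstate_bv & XFEATURE_AVX:
        start = AVX_HI128_OFFSET + 16 * index
        high = bytes(xsave[start:start + 16])

    return low + high
```

If the upper halves are never copied out of (or back into) this region, the low 128 bits survive a checkpoint while the high 128 bits do not, which is the gap the commit message describes.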
abmerop added a commit to abmerop/gem5 that referenced this issue Apr 20, 2024
abmerop added a commit to abmerop/gem5 that referenced this issue Apr 25, 2024
The extended control registers were not being updated in the KVM thread
context nor updated in the KVM state. This was causing issues when
checkpointing since the XCR0 value was reverting to the default value
rather than what it was before the checkpoint. This was
causing multiple applications to crash due to executing instructions
which are now illegal due to XCR0 being incorrect.

This commit adds the XCR0 as a misc register similar to the existing x86
control registers and adds all of the helper functions to access and set
the register value. It also adds support for updating the KVM CPU's
state with the register value and updating the thread context's misc reg
value so that it is checkpointed along with the other misc regs.

Note that this does *not* add support for XSAVE of the AVX state (i.e.,
the upper 128 bits of YMM registers). It does however fix the immediate
problem in issue gem5#958 .

A checkpoint upgrader is also provided to add the default value of XCR0
if the checkpoint tag is missing.

Change-Id: I97456c8b57cbc7b381bd4be94944ce6567a43c76
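
A rough sketch of what a checkpoint upgrader of this kind might look like. gem5 checkpoints are INI-style files and upgraders operate on a parsed config object, but everything specific below is an assumption made for illustration: the `.isa` section-name pattern, the `xcr0` key name, the default value of 1 (the architectural reset value of XCR0), and the function name. It is not the upgrader from the commit.

```python
import configparser
import re


def add_default_xcr0(cpt, default="1"):
    """Add a default XCR0 entry to every ISA section that lacks one.

    `cpt` is a configparser.ConfigParser holding the checkpoint.
    The section pattern and the "xcr0" key are hypothetical.
    Returns the list of sections that were modified.
    """
    changed = []
    for sec in cpt.sections():
        if re.search(r"\.isa\d*$", sec) and not cpt.has_option(sec, "xcr0"):
            cpt.set(sec, "xcr0", default)
            changed.append(sec)
    return changed


if __name__ == "__main__":
    cpt = configparser.ConfigParser()
    cpt.read_string("[system.cpu.isa]\nregVal=0 1 2\n")
    print(add_default_xcr0(cpt))
```

Running the function a second time on the same checkpoint changes nothing, which keeps the upgrade step safe to repeat.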
abmerop added a commit that referenced this issue May 6, 2024
BobbyRBruce pushed a commit to BobbyRBruce/gem5 that referenced this issue May 25, 2024