After live migration vcpu of vm is up 100% on aarch64 #6001

Open
grass-lu opened this issue Dec 4, 2023 · 6 comments
Labels: AArch64 (Affects AArch64 only)

Comments

@grass-lu commented Dec 4, 2023

Describe the bug
After live migration, the vCPU usage of the VM on the destination host is pinned at 100% on aarch64, while the VM on the source host exits normally.
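
As a quick way to see the symptom, a hedged sketch (it assumes the destination VMM was started with --api-socket=/tmp/api2, as in step 3 below):

# Per-thread CPU view of the destination VMM; the affected vCPU thread sits at ~100%.
top -H -p "$(pgrep -f 'cloud-hypervisor --api-socket=/tmp/api2')"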

To Reproduce
Steps to reproduce the behaviour:
1. Use an NFS server to share storage between the two hosts; the VM's disk lives on the shared storage.
2. Launch vm1:
./cloud-hypervisor --kernel ./hypervisor-fw --disk path=focal-server-cloudimg-arm64.raw --cpus boot=4,max=16 --memory size=1024M,hotplug_size=8192M,shared=on --net tap=,mac=,ip=192.168.10.2,mask=255.255.255.0 --cmdline "console=ttyAMA0 root=/dev/vda1 rw" --api-socket /tmp/ch-socket --serial tty --console off
3. Launch vm2 on the destination host:
./cloud-hypervisor --api-socket=/tmp/api2
4. Get ready to receive the migration for vm1 on the destination host:
sudo ./ch-remote --api-socket /tmp/api2 receive-migration unix:/tmp/sock2
sudo socat TCP-LISTEN:6000,reuseaddr UNIX-CLIENT:/tmp/sock2
5. Start sending the migration for vm1 from the source host (the relay topology is sketched below):
sudo socat UNIX-LISTEN:/tmp/sock1,reuseaddr TCP:10.253.102.207:6000
sudo ./ch-remote --api-socket=/tmp/api1 send-migration unix:/tmp/sock1
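
For clarity, a comment-only sketch of the migration channel these commands set up (topology inferred from the steps; 10.253.102.207 is the destination host):

# source VMM -> unix:/tmp/sock1 -> socat -> TCP 10.253.102.207:6000 -> socat -> unix:/tmp/sock2 -> destination VMM
# Ordering matters: receive-migration and the TCP-listening socat (steps 3-4) must be
# running on the destination before socat and send-migration are issued on the source.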

Version

Output of cloud-hypervisor --version:
./cloud-hypervisor --version
cloud-hypervisor v35.0-84-g4cbfccc1


VM configuration

What command line did you run (or JSON config data): see step 2 of the reproduction steps above.

Guest OS version details:
Ubuntu 20.04.4 LTS ubuntu ttyAMA0

Host OS version details:
5.15.67-12.el9.aarch64

Logs

perf and top output for cloud-hypervisor on the destination host:

[two screenshots attached in the original issue]
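
To capture comparable data as text, a hedged sketch using standard perf (the PID lookup assumes the /tmp/api2 socket from the reproduction steps):

# record 10 seconds of call graphs from the destination VMM, then inspect the hot paths
pid=$(pgrep -f 'cloud-hypervisor --api-socket=/tmp/api2')
sudo perf record -g -p "$pid" -- sleep 10
sudo perf report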


@grass-lu (Author) commented Dec 4, 2023

The physical machine is a Phytium FT-2000+.
[root]# lscpu
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: Phytium
BIOS Vendor ID: Phytium
Model name: FT-2000+
BIOS Model name: FT-2000+/64
Model: 2
Thread(s) per core: 1
Core(s) per cluster: 8
Socket(s): 1
Cluster(s): 8
Stepping: 0x1
BogoMIPS: 100.00
Flags: fp asimd evtstrm crc32 cpuid
NUMA:
NUMA node(s): 8
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-31
NUMA node4 CPU(s): 32-39
NUMA node5 CPU(s): 40-47
NUMA node6 CPU(s): 48-55
NUMA node7 CPU(s): 56-63
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Mitigation; PTI
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Vulnerable
Spectre v1: Mitigation; __user pointer sanitization
Spectre v2: Vulnerable
Srbds: Not affected
Tsx async abort: Not affected

@grass-lu changed the title from "After live migration, one vcpu of vm is up 100% on aarch64" to "After live migration, vcpu of vm is up 100% on aarch64" on Dec 4, 2023
@grass-lu (Author) commented Dec 4, 2023

Updated cloud-hypervisor version:
[root]# ./cloud-hypervisor --version
cloud-hypervisor v36.0-74-g5f89461a-dirty

Guest OS crash on the destination host:
[root@ceasphere-node-2 lzhp]# ./cloud-hypervisor --api-socket=/tmp/api2
[ 84.099179] Unable to handle kernel paging request at virtual address dead000000000108
[ 84.102723] Mem abort info:
[ 84.103914] ESR = 0x96000044
[ 84.105217] EC = 0x25: DABT (current EL), IL = 32 bits
[ 84.107500] SET = 0, FnV = 0
[ 84.108784] EA = 0, S1PTW = 0
[ 84.110113] Data abort info:
[ 84.111331] ISV = 0, ISS = 0x00000044
[ 84.112937] CM = 0, WnR = 1
[ 84.114191] [dead000000000108] address between user and kernel address ranges
[ 84.117413] Internal error: Oops: 96000044 [#1] SMP
[ 84.119527] Modules linked in: nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel ipmi_devintf ipmi_msghandler drm virtio_rng ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear virtio_net net_failover crct10dif_ce failover virtio_blk aes_neon_bs aes_neon_blk crypto_simd cryptd
[ 84.135629] CPU: 2 PID: 892 Comm: top Tainted: G W 5.4.0-105-generic #119-Ubuntu
[ 84.139262] pstate: 40400085 (nZcv daIf +PAN -UAO)
[ 84.141331] pc : get_partial_node.isra.0.part.0+0x18c/0x2c0
[ 84.143677] lr : ___slab_alloc+0x390/0x5c0
[ 84.145368] sp : ffff800012bb3980
[ 84.149894] x29: ffff800012bb3980 x28: ffff000035080e00
[ 84.152183] x27: dead0000000000f8 x26: 0000000000000000
[ 84.154420] x25: ffff0000361ae900 x24: 0000000000000000
[ 84.156687] x23: dead000000000100 x22: dead000000000122
[ 84.159050] x21: ffff0000361ae910 x20: 0000000000210d00
[ 84.161293] x19: fffffe0000c2af40 x18: 0000000000000000
[ 84.163538] x17: 0000000046f8a000 x16: 0000000000000000
[ 84.165929] x15: 0000000000000060 x14: ffff0000385cad00
[ 84.168212] x13: ffff0000385cad00 x12: 0000000000000000
[ 84.170498] x11: 0000000000000010 x10: 0000000046f8b000
[ 84.173102] x9 : fefefefefefefeff x8 : ffff0000385cad00
[ 84.175725] x7 : 0000000000000000 x6 : 0000000000000000
[ 84.178091] x5 : 0000000000100010 x4 : fffffe0000c2af60
[ 84.180407] x3 : 0000000080100010 x2 : 0000000000000000
[ 84.182701] x1 : dead000000000100 x0 : dead000000000122
[ 84.185031] Call trace:
[ 84.186130] get_partial_node.isra.0.part.0+0x18c/0x2c0
[ 84.188348] ___slab_alloc+0x390/0x5c0
[ 84.189934] __slab_alloc+0x58/0x80
[ 84.191411] kmem_cache_alloc+0x238/0x260
[ 84.193149] __alloc_file+0x34/0x100
[ 84.194649] alloc_empty_file+0x68/0x100
[ 84.196346] path_openat+0x50/0x260
[ 84.197924] do_filp_open+0x88/0x110
[ 84.199413] do_sys_open+0x188/0x2b8
[ 84.200913] __arm64_sys_openat+0x30/0x40
[ 84.202570] el0_svc_common.constprop.0+0xf4/0x200
[ 84.204560] el0_svc_handler+0x38/0xa8
[ 84.206196] el0_svc+0x10/0x180
[ 84.207537] Code: a9020e62 927ff800 c89ffe60 a9408261 (f9000420)
[ 84.210502] ---[ end trace 0a7ecee8f5ca003c ]---

@peng6662001 (Contributor) commented

Hi @grass-lu, I re-tested your case and it worked; the vCPU usage of the VM stayed near 0%.
These are the commands I used:

target/aarch64-unknown-linux-gnu/release/cloud-hypervisor \
    --console tty --serial tty --kernel /root/workloads/CLOUDHV_EFI.fd \
    --disk path=/root/workloads/osdisk.img path=/home/dom/images/cloudinit \
    --cpus boot=4,max=16 \
    --memory size=1024M,hotplug_size=8192M,shared=on \
    --net tap=,mac=,ip=192.168.10.2,mask=255.255.255.0 \
    --cmdline "console=ttyAMA0 root=/dev/vda1 rw" \
    --api-socket /tmp/ch-socket 
target/aarch64-unknown-linux-gnu/release/cloud-hypervisor --api-socket=/tmp/api2
./target/aarch64-unknown-linux-gnu/release/ch-remote --api-socket /tmp/api2 receive-migration unix:/tmp/sock2
socat UNIX-LISTEN:/tmp/sock1,reuseaddr TCP:6000
./target/aarch64-unknown-linux-gnu/release/ch-remote --api-socket=/tmp/ch-socket send-migration unix:/tmp/sock2

I can't start a VM with hypervisor-fw (see issue #5987), which is why the command above uses CLOUDHV_EFI.fd instead.

@grass-lu (Author) commented Dec 8, 2023

The problem still exists; please try the migration between two physical machines, @peng6662001.

@grass-lu (Author) commented Dec 8, 2023

[root@ceasphere-node-3 luzhipeng]# ./cloud-hypervisor --version
cloud-hypervisor v36.0-83-g77f4e35b-dirty

Guest OS crash:
[root@ceasphere-node-2 luzhipeng]# ./cloud-hypervisor --api-socket=/tmp/api2
[ 187.365404] Unable to handle kernel read from unreadable memory at virtual address 0000000000000000
[ 187.369759] Mem abort info:
[ 187.371000] ESR = 0x96000004
[ 187.372390] EC = 0x25: DABT (current EL), IL = 32 bits
[ 187.374791] SET = 0, FnV = 0
[ 187.376168] EA = 0, S1PTW = 0
[ 187.377620] Data abort info:
[ 187.379008] ISV = 0, ISS = 0x00000004
[ 187.380719] CM = 0, WnR = 0
[ 187.382042] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000755f7000
[ 187.384827] [0000000000000000] pgd=0000000000000000
[ 187.387098] Internal error: Oops: 96000004 [#1] SMP
[ 187.389419] Modules linked in: nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel ipmi_devintf ipmi_msghandler drm virtio_rng ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_ce virtio_blk virtio_net net_failover failover aes_neon_bs aes_neon_blk crypto_simd cryptd
[ 187.406771] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-105-generic #119-Ubuntu
[ 187.410266] pstate: 60400085 (nZCv daIf +PAN -UAO)
[ 187.412414] pc : rb_erase+0xfc/0x3b8
[ 187.413970] lr : timerqueue_del+0x38/0x70
[ 187.415730] sp : ffff800010003db0
[ 187.417159] x29: ffff800010003db0 x28: ffff3994ffd79940
[ 187.419493] x27: ffffc93a05701ff8 x26: 0000000000000001
[ 187.421758] x25: ffffc93a04fbe028 x24: 0000000000000080
[ 187.424009] x23: ffff3994ffd79900 x22: 0000000000000000
[ 187.426262] x21: 0000000000000000 x20: ffff3994ffd79960
[ 187.428565] x19: ffffc93a05701ff8 x18: 0000000000000000
[ 187.430855] x17: 0000000000000000 x16: 0000000000000000
[ 187.433093] x15: 0000000000000000 x14: 0000000000000000
[ 187.435483] x13: 003d090000000000 x12: 00003d0900000000
[ 187.437818] x11: 0000000000000000 x10: 0000000000000040
[ 187.440137] x9 : ffffc93a054ef2e0 x8 : ffffc93a054ef2d8
[ 187.442377] x7 : ffff3994f9000280 x6 : 0000000000000000
[ 187.444613] x5 : 0000000000000000 x4 : ffff80001037b950
[ 187.446866] x3 : 0000000000000000 x2 : 0000000000000000
[ 187.449149] x1 : ffff3994ffd79960 x0 : 0000000000000000
[ 187.451404] Call trace:
[ 187.452457] rb_erase+0xfc/0x3b8
[ 187.453869] __remove_hrtimer+0x60/0xa0
[ 187.455521] __hrtimer_run_queues+0xf8/0x370
[ 187.457346] hrtimer_interrupt+0x120/0x2f8
[ 187.459137] arch_timer_handler_virt+0x40/0x50
[ 187.461020] handle_percpu_devid_irq+0x94/0x240
[ 187.462933] generic_handle_irq+0x38/0x50
[ 187.464640] __handle_domain_irq+0x70/0xc8
[ 187.466389] gic_handle_irq+0x10c/0x2cc
[ 187.468100] el1_irq+0x104/0x1c0
[ 187.469499] arch_cpu_idle+0x3c/0x1c8
[ 187.471106] default_idle_call+0x24/0x60
[ 187.472796] do_idle+0x214/0x298
[ 187.474193] cpu_startup_entry+0x2c/0xb8
[ 187.475881] rest_init+0xc0/0xcc
[ 187.477350] arch_call_rest_init+0x18/0x20
[ 187.479136] start_kernel+0x4cc/0x500
[ 187.480828] Code: f9400480 eb02001f 54fffd81 f9400880 (a9400803)
[ 187.483495] ---[ end trace f067bb2dffb6093c ]---
[ 187.485608] Kernel panic - not syncing: Fatal exception in interrupt
[ 187.488403] SMP: stopping secondary CPUs
[ 187.490264] Kernel Offset: 0x4939f3460000 from 0xffff800010000000
[ 187.492884] PHYS_OFFSET: 0xffffc66b80000000
[ 187.494695] CPU features: 0x00002,20800008
[ 187.496497] Memory Limit: none
[ 187.497851] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

@rbradford added the AArch64 (Affects AArch64 only) label on Dec 8, 2023
@rbradford changed the title from "After live migration, vcpu of vm is up 100% on aarch64" to "After live migration vcpu of vm is up 100% on aarch64" on Dec 8, 2023
@rbradford (Member) commented
Perhaps your host kernel is too old - can you try with a newer kernel?
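
For reference, a minimal host-side check (the report above lists kernel 5.15.67-12.el9.aarch64):

# running host kernel version
uname -r
# confirm KVM is exposed to the VMM
ls -l /dev/kvm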
