ZFS Kernel Panic #16187

Open · chriexpe opened this issue May 10, 2024 · 2 comments
Labels: Type: Defect (Incorrect behavior, e.g. crash, hang)

chriexpe commented May 10, 2024

System information

Type                  Version/Name
Distribution Name     Unraid
Distribution Version  6.12.8
Kernel Version        6.1.74-Unraid
Architecture          x86_64
OpenZFS Version       2.1.14-1

These last few weeks I've been getting this kernel panic, almost daily, from a ZFS pool I created a year ago. Honestly, I don't know what else to do aside from destroying it and starting fresh (and losing a few TBs of data).

The pool in question is my server's main pool, built from 3x 8TB Seagate Exos 7E8 HDDs connected to a RAID card in passthrough mode. The pool is constantly being written to by an NVR, and the crashes are apparently random (or they coincidentally happen after I write or read a file once the system has been running for a while). This is the error:

May 10 17:26:43 Tower kernel: kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
May 10 17:26:43 Tower kernel: BUG: unable to handle page fault for address: ffff88941de53c80
May 10 17:26:43 Tower kernel: #PF: supervisor instruction fetch in kernel mode
May 10 17:26:43 Tower kernel: #PF: error_code(0x0011) - permissions violation
May 10 17:26:43 Tower kernel: PGD 4c01067 P4D 4c01067 PUD 7b2111063 PMD 800000141de001e3 
May 10 17:26:43 Tower kernel: Oops: 0011 [#1] PREEMPT SMP NOPTI
May 10 17:26:43 Tower kernel: CPU: 5 PID: 9080 Comm: dp_sync_taskq Tainted: P     U     O       6.1.74-Unraid #1
May 10 17:26:43 Tower kernel: Hardware name: ASRock Z690 Phantom Gaming 4/D5/Z690 Phantom Gaming 4/D5, BIOS 15.01 01/04/2024
May 10 17:26:43 Tower kernel: RIP: 0010:0xffff88941de53c80
May 10 17:26:43 Tower kernel: Code: ff ff 98 3d e5 1d 94 88 ff ff 01 00 00 f0 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 37 81 3e 66 00 00 00 00 <c0> 3c e5 1d 94 88 ff ff 60 9a ee a0 ff ff ff ff d1 4b bc 0f 3e 99
May 10 17:26:43 Tower kernel: RSP: 0018:ffffc900221afd80 EFLAGS: 00010282
May 10 17:26:43 Tower kernel: RAX: ffff88941de53c80 RBX: ffff8891050d2000 RCX: 0000000000000003
May 10 17:26:43 Tower kernel: RDX: 0000000000000001 RSI: ffffffff8214ded8 RDI: ffff8891050d2000
May 10 17:26:43 Tower kernel: RBP: ffffffffa0987a45 R08: ffff8885614352c0 R09: 0000000080190018
May 10 17:26:43 Tower kernel: R10: ffff8885614352c0 R11: 0000000000000010 R12: ffff88814f502000
May 10 17:26:43 Tower kernel: R13: ffff88810662eb90 R14: ffff88815d74f000 R15: ffff889593ce7c00
May 10 17:26:43 Tower kernel: FS:  0000000000000000(0000) GS:ffff88a00f540000(0000) knlGS:0000000000000000
May 10 17:26:43 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 10 17:26:43 Tower kernel: CR2: ffff88941de53c80 CR3: 000000086122e000 CR4: 0000000000752ee0
May 10 17:26:43 Tower kernel: PKRU: 55555554
May 10 17:26:43 Tower kernel: Call Trace:
May 10 17:26:43 Tower kernel: <TASK>
May 10 17:26:43 Tower kernel: ? __die_body+0x1a/0x5c
May 10 17:26:43 Tower kernel: ? page_fault_oops+0x329/0x376
May 10 17:26:43 Tower kernel: ? exc_page_fault+0xf4/0x11d
May 10 17:26:43 Tower kernel: ? asm_exc_page_fault+0x22/0x30
May 10 17:26:43 Tower kernel: ? dnode_destroy+0x1e6/0x1e6 [zfs]
May 10 17:26:43 Tower kernel: ? dbuf_evict_user+0x34/0x60 [zfs]
May 10 17:26:43 Tower kernel: ? dbuf_clear_data+0xf/0x3e [zfs]
May 10 17:26:43 Tower kernel: ? dbuf_destroy+0x9b/0x3b8 [zfs]
May 10 17:26:43 Tower kernel: ? dnode_rele_task+0x4c/0x69 [zfs]
May 10 17:26:43 Tower kernel: ? taskq_thread+0x266/0x38a [spl]
May 10 17:26:43 Tower kernel: ? wake_up_q+0x44/0x44
May 10 17:26:43 Tower kernel: ? taskq_dispatch_delay+0x106/0x106 [spl]
May 10 17:26:43 Tower kernel: ? kthread+0xe4/0xef
May 10 17:26:43 Tower kernel: ? kthread_complete_and_exit+0x1b/0x1b
May 10 17:26:43 Tower kernel: ? ret_from_fork+0x1f/0x30
May 10 17:26:43 Tower kernel: </TASK>
May 10 17:26:43 Tower kernel: Modules linked in: dm_mod ipvlan xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb tap veth xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter nfsd auth_rpcgss oid_registry lockd grace sunrpc md_mod i915(O) drm_buddy i2c_algo_bit ttm drm_display_helper drm_kms_helper drm intel_gtt agpgart syscopyarea sysfillrect sysimgblt fb_sys_fops nct6775 nct6775_core hwmon_vid ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs bridge stp llc bonding tls zfs(PO) intel_rapl_msr zunicode(PO) intel_rapl_common iosf_mbi x86_pkg_temp_thermal intel_powerclamp zzstd(O) coretemp kvm_intel zlua(O) kvm zavl(PO) icp(PO) crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel zcommon(PO) crypto_simd znvpair(PO) cryptd rapl intel_cstate
May 10 17:26:43 Tower kernel: spl(O) mei_hdcp mei_pxp wmi_bmof intel_uncore tpm_crb mpt3sas i2c_i801 mei_me nvme tpm_tis cp210x i2c_smbus video ahci raid_class sr_mod tpm_tis_core e1000e mei i2c_core nvme_core scsi_transport_sas libahci input_leds cdrom joydev usbserial led_class wmi tpm intel_pmc_core backlight acpi_pad acpi_tad button unix
May 10 17:26:43 Tower kernel: CR2: ffff88941de53c80
May 10 17:26:43 Tower kernel: ---[ end trace 0000000000000000 ]---
May 10 17:26:43 Tower kernel: RIP: 0010:0xffff88941de53c80
May 10 17:26:43 Tower kernel: Code: ff ff 98 3d e5 1d 94 88 ff ff 01 00 00 f0 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 37 81 3e 66 00 00 00 00 <c0> 3c e5 1d 94 88 ff ff 60 9a ee a0 ff ff ff ff d1 4b bc 0f 3e 99
May 10 17:26:43 Tower kernel: RSP: 0018:ffffc900221afd80 EFLAGS: 00010282
May 10 17:26:43 Tower kernel: RAX: ffff88941de53c80 RBX: ffff8891050d2000 RCX: 0000000000000003
May 10 17:26:43 Tower kernel: RDX: 0000000000000001 RSI: ffffffff8214ded8 RDI: ffff8891050d2000
May 10 17:26:43 Tower kernel: RBP: ffffffffa0987a45 R08: ffff8885614352c0 R09: 0000000080190018
May 10 17:26:43 Tower kernel: R10: ffff8885614352c0 R11: 0000000000000010 R12: ffff88814f502000
May 10 17:26:43 Tower kernel: R13: ffff88810662eb90 R14: ffff88815d74f000 R15: ffff889593ce7c00
May 10 17:26:43 Tower kernel: FS:  0000000000000000(0000) GS:ffff88a00f540000(0000) knlGS:0000000000000000
May 10 17:26:43 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 10 17:26:43 Tower kernel: CR2: ffff88941de53c80 CR3: 000000086122e000 CR4: 0000000000752ee0
May 10 17:26:43 Tower kernel: PKRU: 55555554
May 10 17:26:43 Tower kernel: note: dp_sync_taskq[9080] exited with irqs disabled
May 10 17:27:59 Tower kernel: kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
May 10 17:27:59 Tower kernel: BUG: unable to handle page fault for address: ffff8898d597c2c0
May 10 17:27:59 Tower kernel: #PF: supervisor instruction fetch in kernel mode
May 10 17:27:59 Tower kernel: #PF: error_code(0x0011) - permissions violation
May 10 17:27:59 Tower kernel: PGD 4c01067 P4D 4c01067 PUD 80000018c00001e3 
May 10 17:27:59 Tower kernel: Oops: 0011 [#2] PREEMPT SMP NOPTI

And it keeps repeating this same error.
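
For reference, here is a minimal set of follow-up diagnostics I can run after a panic like this, using the standard zpool CLI (nasa is my pool name; the dbgmsg file is only populated when the zfs_dbgmsg_enable module parameter is set):

root@Tower:~# zpool events -v | tail -n 50      # most recent ZFS events, verbose
root@Tower:~# zpool status -v nasa              # pool state plus any per-file errors
root@Tower:~# dmesg | grep -iE 'zfs|spl'        # kernel messages from the zfs/spl modules
root@Tower:~# cat /proc/spl/kstat/zfs/dbgmsg    # internal ZFS debug log, if enabled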

If I check the pools with zpool status after the kernel panic, everything looks normal:

root@Tower:~# zpool status
  pool: disk1
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        disk1       ONLINE       0     0     0
          md1p1     ONLINE       0     0     0

errors: No known data errors

  pool: nasa
 state: ONLINE
  scan: scrub repaired 0B in 11:05:53 with 0 errors on Tue Apr 30 11:33:04 2024
config:

        NAME        STATE     READ WRITE CKSUM
        nasa        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sde1    ONLINE       0     0     0
            sdd1    ONLINE       0     0     0
            sdc1    ONLINE       0     0     0

errors: No known data errors

Note that I scrubbed it and there was no error.
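
For completeness, the scrub itself is just the standard commands, using my pool name from above:

root@Tower:~# zpool scrub nasa     # start a full scrub of the pool
root@Tower:~# zpool status nasa    # the 'scan:' line reports scrub progress and result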

I don't remember exactly when it started, but it was probably after upgrading to Unraid 6.12.9; I rolled back to 6.12.8, but it kept crashing.

I ran memtest too, but it didn't report any errors.

[screenshots: memtest results, Unraid server]

chriexpe added the Type: Defect label on May 10, 2024
bignay2000 commented Jun 6, 2024

I have a similar issue: an all-ZFS Proxmox server that crashes while backing up a VM, at least once a week (backups are scheduled daily). I also ran a memtest, which did not report any issues.

journalctl logs from right before the hard crash:

root@gtr7pro:~# journalctl --since "2024-06-04 15:55" --until "2024-06-04 16:11"
Jun 04 15:58:54 gtr7pro pmxcfs[1295]: [status] notice: received log
Jun 04 15:58:54 gtr7pro pvedaemon[1438]: <hiveadmin@pam> starting task UPID:gtr7pro:00004D14:000AE00F:665F71FE:vzdump::hiveadmin@pam:
Jun 04 15:58:54 gtr7pro pvedaemon[19732]: INFO: starting new backup job: vzdump --mailnotification failure --mailto systems@example.com --compress zstd --prune-backups 'keep-last=3' --notes-template '{{guestname}}' --storage local --mode snapshot --all 1 --fleecing 0 --node gtr7pro
Jun 04 15:58:54 gtr7pro pvedaemon[19732]: INFO: Starting Backup of VM 102 (qemu)
Jun 04 15:58:57 gtr7pro pvedaemon[19732]: VM 102 qmp command failed - VM 102 qmp command 'guest-ping' failed - got timeout
Jun 04 15:59:00 gtr7pro kernel: hrtimer: interrupt took 5490 ns
Jun 04 15:59:08 gtr7pro kernel: BUG: unable to handle page fault for address: 0000040000000430
Jun 04 15:59:08 gtr7pro kernel: #PF: supervisor read access in kernel mode
Jun 04 15:59:08 gtr7pro kernel: #PF: error_code(0x0000) - not-present page
root@gtr7pro:~# 

zpool status reported errors for a file backing one of the VM's disks. I was able to reboot, keep the VM powered off, and have ZFS fix the pool (roughly the sequence sketched below).
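
A minimal sketch of that recovery sequence, assuming Proxmox's default pool name rpool (both the pool name and the exact steps here are illustrative, not a transcript of my session):

root@gtr7pro:~# zpool status -v rpool    # list any files with permanent errors
root@gtr7pro:~# zpool scrub rpool        # re-verify the pool and repair from redundancy
root@gtr7pro:~# zpool clear rpool        # reset the error counters once the file is handled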

root@gtr7pro:~# zfs version
zfs-2.2.3-pve2
zfs-kmod-2.2.3-pve2
root@gtr7pro:~# uname -a
Linux gtr7pro 6.8.4-3-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.4-3 (2024-05-02T11:55Z) x86_64 GNU/Linux
root@gtr7pro:~# pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.4-3-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
pve-kernel-6.2: 8.0.5
proxmox-kernel-6.8: 6.8.4-3
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.5: 6.5.13-5
proxmox-kernel-6.2.16-20-pve: 6.2.16-20
proxmox-kernel-6.2: 6.2.16-20
pve-kernel-6.2.16-3-pve: 6.2.16-3
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.2
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.3-1
proxmox-backup-file-restore: 3.2.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.6
pve-container: 5.1.10
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.7
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2

bignay2000 commented

@ryao Any ideas on this ticket? Feels like this may be related to #10255 and #14636.
