[BUG] mfsmaster core dumps on ARM system, Alignment Trap #488

Open
Slyphic opened this issue Aug 20, 2022 · 2 comments

Slyphic commented Aug 20, 2022

System information

v3.0.116 installed from ArchLinux|ARM repo, running kernel 4.14.180 on ODROID-HC2 (Samsung Exynos5422 ARM Cortex-A15/Cortex-A7)

1 master and 1 metadatalogger as above, and 3-4 chunk nodes running the same mfs and OS on Espressobins (Marvell Armada 3700LP (88F37200) ARM/Cortex A53) each with 3-4 2TB disks

  • All fs objects: 96233
  • Total space: 18 TiB
  • Free space: 7.0 TiB
  • RAM used: 281 MiB

I know, it's a weird installation. It's my home test bed, an attempt at an ultra low power (and low cost) moosefs installation. It's been running largely without problems for almost 5 years now.

Describe the problem you observed:

After upgrading from 3.0.112 to 3.0.116, mfsmaster crashes and dumps core during normal operation. I can't find a specific trigger: it takes longer to crash when left idle, but only minutes when files are being read or written. The longest I've kept it running was about 6 hours, idle with 0 client mounts last night, before it dumped core.

The cluster was previously running 3.0.112 without error, and downgrading back to that version has, so far, fixed the problem. Something introduced between these versions appears not to play well with the ARM architecture's strict restrictions on unaligned data access.
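For readers unfamiliar with this class of bug: on 32-bit ARM, some load/store instructions (VFP loads and stores among them) require a word-aligned address and fault otherwise. C code that casts a byte pointer to a wider type and dereferences it can end up compiled into such an instruction. The sketch below contrasts that pattern with a memcpy-based version that makes no alignment assumption. It is only an illustration: `put_double` and `get_double` are hypothetical helpers, not taken from the MooseFS source, and nothing here identifies the code path that actually crashes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not MooseFS code): store a double at an arbitrary
 * byte offset. memcpy makes no alignment assumption, so the compiler has
 * to emit stores that are valid for any address. */
static void put_double(uint8_t *buf, size_t off, double v) {
    memcpy(buf + off, &v, sizeof v);
}

/* Hypothetical helper (not MooseFS code): read the double back. */
static double get_double(const uint8_t *buf, size_t off) {
    double v;
    memcpy(&v, buf + off, sizeof v);
    return v;
}

/* The risky pattern, shown only as a comment for contrast. It is undefined
 * behaviour in C when the address is not suitably aligned for double, and
 * on 32-bit ARM it may be compiled into a VFP store that traps:
 *
 *     *(double *)(buf + off) = v;
 */
```

With these helpers, writing at an odd offset such as `put_double(buf, 3, 1.5)` is well defined on any architecture, at the cost of a copy that compilers usually optimise well.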

Aug 20 01:18:29 imsal kernel: Alignment trap: not handling instruction edc40b0a at [<004f495c>]
Aug 20 01:18:29 imsal kernel: Unhandled fault: alignment exception (0x811) at 0xae16a81b
Aug 20 01:18:29 imsal kernel: pgd = e06d4000
Aug 20 01:18:29 imsal kernel: [ae16a81b] *pgd=b5570835
Aug 20 01:18:29 imsal kernel: audit: type=1701 audit(1660976309.486:67): auid=4294967295 uid=979 gid=979 ses=4294967295 pid=1165 comm="mfsmaster" exe="/usr/bin/mfsmaster" sig=7 res=1
Aug 20 01:18:29 imsal kernel: audit: type=1130 audit(1660976309.526:68): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@4-6109-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=succe>
Aug 20 01:18:29 imsal audit[1165]: ANOM_ABEND auid=4294967295 uid=979 gid=979 ses=4294967295 pid=1165 comm="mfsmaster" exe="/usr/bin/mfsmaster" sig=7 res=1
Aug 20 01:18:29 imsal audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@4-6109-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 20 01:18:29 imsal systemd[1]: Started Process Core Dump (PID 6109/UID 0).
-- Subject: A start job for unit systemd-coredump@4-6109-0.service has finished successfully
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- A start job for unit systemd-coredump@4-6109-0.service has finished successfully.
--
-- The job identifier is 875.
Aug 20 01:19:04 imsal systemd-coredump[6113]: [LNK] Process 1165 (mfsmaster) of user 979 dumped core.

                                              Module linux-vdso.so.1 with build-id 88dc578fb02d73083639658edaf67047678a0f1d
                                              Module libpthread.so.0 with build-id 0b0422739722054f65f9f78c4ac441ebc21cd01e
                                              Module libdl.so.2 with build-id f585c7c9f6babaf9ccb90a5feeb3f6902fd810c8
                                              Module libffi.so.8 with build-id 218e198d7c786e474efd2d9b745615880c5120df
                                              Module libp11-kit.so.0 with build-id 8166838a069281c28c7f9434827c3f73d45d5451
                                              Module libcrypto.so.1.1 with build-id 5261d99d530924d3ff0ffdc8b67d1caece137c63
                                              Module libcrypt.so.2 with build-id 421adaa3adb6e116dabffb7b515b0b38285d1ec8
                                              Module libm.so.6 with build-id 03e814c990762eeb9da12de241a4f42322248e45
                                              Module libnss_systemd.so.2 with build-id 551575f58085900ad16f5f6fc91c3e6e32358f02
                                              Module ld-linux-armhf.so.3 with build-id 072bb4cd73afd5d62040c7f3f482dbe17719bfea
                                              Module libc.so.6 with build-id ad84e29cae6a8880108cc3a95754d84ca22799e8
                                              Module libgcc_s.so.1 with build-id 5dfba9be74e9275dc2b88197d5e4a7eb31caa30b
                                              Module libz.so.1 with build-id f5e8b23636191e87948dc2c6f3c5fc2f243d9b08
                                              Module mfsmaster with build-id e065a335744ea27b193ef18932c039c11c17aef9
                                              Stack trace of thread 1165:
                                              #0  0x00000000004f4960 n/a (mfsmaster + 0x26960)
                                              ELF object binary architecture: ARM
-- Subject: Process 1165 (mfsmaster) dumped core
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- Documentation: man:core(5)
--
-- Process 1165 (mfsmaster) crashed and dumped core.
--
-- This usually indicates a programming error in the crashing program and
-- should be reported to its vendor as a bug.
Aug 20 01:19:05 imsal systemd[1]: systemd-coredump@4-6109-0.service: Deactivated successfully.
-- Subject: Unit succeeded
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- The unit systemd-coredump@4-6109-0.service has successfully entered the 'dead' state.
Aug 20 01:19:05 imsal kernel: audit: type=1131 audit(1660976345.738:69): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@4-6109-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=succe>
Aug 20 01:19:05 imsal kernel: dwmmc_exynos 12220000.mmc: Unexpected interrupt latency
Aug 20 01:19:05 imsal kernel: audit: type=1131 audit(1660976345.770:70): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=moosefs-master comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Aug 20 01:19:05 imsal audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-coredump@4-6109-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Aug 20 01:19:05 imsal audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=moosefs-master comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Aug 20 01:19:05 imsal mfsmaster[1166]: background data writer - HUP/ERR detected on data pipe: EACCES (Permission denied)
Aug 20 01:19:05 imsal systemd[1]: moosefs-master.service: Main process exited, code=dumped, status=7/BUS
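
The first kernel line above can be read by hand. Assuming the process was executing in ARM (A32) state rather than Thumb, the word edc40b0a matches the encoding of a 64-bit VFP store, which would disassemble as `vstr d16, [r4, #40]`, and the fault address 0xae16a81b is not word-aligned. The small decoder below is a sketch of that reading; the struct and function names are made up for illustration, and it only covers this one encoding family.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded fields of an A32 VLDR/VSTR instruction (illustrative struct). */
struct vfp_ldst {
    int is_vfp_ldst;  /* non-zero if the word matches the VLDR/VSTR pattern */
    int is_load;      /* 1 = VLDR, 0 = VSTR */
    int is_double;    /* 1 = 64-bit D register, 0 = 32-bit S register */
    unsigned rn;      /* base register number */
    unsigned vd;      /* D or S register number */
    int offset;       /* signed byte offset from the base register */
};

/* Hypothetical decoder: assumes the word is an A32 instruction. */
static struct vfp_ldst decode_vfp_ldst(uint32_t insn) {
    struct vfp_ldst r = {0};
    unsigned d_bit = (insn >> 22) & 1;
    unsigned vd = (insn >> 12) & 0xF;
    int imm = (int)(insn & 0xFF) * 4;          /* imm8 is scaled by 4 */

    /* Pattern: bits 27..24 = 1101, bit 21 = 0, bits 11..9 = 101. */
    r.is_vfp_ldst = ((insn >> 24) & 0xF) == 0xD &&
                    ((insn >> 21) & 1) == 0 &&
                    ((insn >> 9) & 0x7) == 0x5;
    r.is_load = (insn >> 20) & 1;
    r.is_double = (insn >> 8) & 1;
    r.rn = (insn >> 16) & 0xF;
    r.vd = r.is_double ? (d_bit << 4) | vd : (vd << 1) | d_bit;
    r.offset = ((insn >> 23) & 1) ? imm : -imm; /* U bit selects add/subtract */
    return r;
}
```

If that reading is right, the crash is an 8-byte store through a pointer that is not 4-byte aligned, which narrows the search to code that writes 64-bit values (doubles or 64-bit integers moved through VFP registers) into packed or byte-addressed buffers.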

chogata (Member) commented Aug 24, 2022

We build those packages on a Raspberry Pi and have never had a problem. The first step would be to compile MooseFS from source (with symbols) on the exact machine it will run on and see whether the problem persists. If it does, we can investigate what exactly the issue is. If it doesn't, the packages are simply not compatible with ODROIDs.
Are you able to compile MooseFS from source?
BTW, the difference between .112 and .116 might simply be that .112 was compiled on older OS and kernel versions.

unixorn commented Dec 30, 2022

Tagging in so I can see updates on this issue.
