[ocaml5-issue] Segfault in STM Domain.DLS test sequential on 32-bit trunk #446

Open
jmid opened this issue Mar 26, 2024 · 8 comments
Labels
ocaml5-issue A potential issue in the OCaml5 compiler/runtime

Comments

@jmid
Collaborator

jmid commented Mar 26, 2024

In the CI-run for #445 on 32-bit trunk the STM Domain.DLS test sequential triggered a segfault
https://github.com/ocaml-multicore/multicoretests/actions/runs/8436771284/job/23104952265?pr=445

random seed: 107236932
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
File "src/domain/dune", line 31, characters 7-20:
31 |  (name stm_tests_dls)
            ^^^^^^^^^^^^^
(cd _build/default/src/domain && ./stm_tests_dls.exe --verbose)
Command got signal SEGV.
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)

This may be another case of a 32-bit/bytecode issue showing up in a couple of different tests:

Surprisingly, however, this case triggered in a sequential (single-domain) test! 😮

@gasche

gasche commented Mar 30, 2024

I wonder if it is related to #12889, the only recent change to Domain.DLS that I can think of. (I hope not!)

@jmid
Collaborator Author

jmid commented May 21, 2024

This just triggered again on 32-bit 5.3.0+trunk by the merge to main of #460:
https://github.com/ocaml-multicore/multicoretests/actions/runs/9169655398/job/25210472949

random seed: 103830913
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
File "src/domain/dune", line 31, characters 7-20:
31 |  (name stm_tests_dls)
            ^^^^^^^^^^^^^
(cd _build/default/src/domain && ./stm_tests_dls.exe --verbose)
Command got signal SEGV.
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)

@gasche

gasche commented May 21, 2024

Is there more information that we can use to try to investigate this? "There is a segfault somewhere in Domain.DLS on 32bit" is not that much.

@jmid
Collaborator Author

jmid commented May 21, 2024

First off, this is a collection of failures we observe.
Once we have fleshed out reproducible steps, these are reported upstream.
Help is very welcome, snarky remarks less so.

> "There is a segfault somewhere in Domain.DLS on 32bit" is not that much.

Come on, there are QCheck seeds that caused the failures, GA workflows listing the steps taken, and links to 2 CI run logs, with full information about versions.

Run opam exec -- ocamlc -config
  opam exec -- ocamlc -config
  opam config list
  opam exec -- dune printenv
  opam list --columns=name,installed-version,repository,synopsis-or-target
  opam clean --all-switches --unused-repositories --logs --download-cache --repo-cache
  shell: /usr/bin/bash -e {0}
  env:
    QCHECK_MSG_INTERVAL: 60
    DUNE_PROFILE: dev
    OCAMLRUNPARAM: 
    DUNE_CI_ALIAS: runtest
    COMPILER: ocaml-variants.5.3.0+trunk,ocaml-option-32bit
    OCAML_COMPILER_GIT_REF: refs/heads/trunk
    CUSTOM_COMPILER_VERSION: 
    CUSTOM_COMPILER_SRC: 
    CUSTOM_OCAML_PKG_VERSION: 
    OPAMCLI: 2.0
    OPAMCOLOR: always
    OPAMERRLOGLEN: 0
    OPAMJOBS: 4
    OPAMPRECISETRACKING: 1
    OPAMSOLVERTIMEOUT: 1000
    OPAMYES: 1
    DUNE_CACHE: enabled
    DUNE_CACHE_TRANSPORT: direct
    DUNE_CACHE_STORAGE_MODE: copy
    CLICOLOR_FORCE: 1
version: 5.3.0+dev0-2023-12-22
standard_library_default: /home/runner/work/multicoretests/multicoretests/_opam/lib/ocaml
standard_library: /home/runner/work/multicoretests/multicoretests/_opam/lib/ocaml
ccomp_type: cc
c_compiler: gcc -m32
ocamlc_cflags:  -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread
ocamlc_cppflags:  -D_FILE_OFFSET_BITS=64 
ocamlopt_cflags:  -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread
ocamlopt_cppflags:  -D_FILE_OFFSET_BITS=64 
bytecomp_c_compiler: gcc -m32  -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread  -D_FILE_OFFSET_BITS=64 
native_c_compiler: gcc -m32  -O2 -fno-strict-aliasing -fwrapv -pthread -fPIC  -pthread  -D_FILE_OFFSET_BITS=64 
bytecomp_c_libraries: -lzstd  -latomic -lm  -lpthread
native_c_libraries:  -latomic -lm  -lpthread
native_ldflags: 
native_pack_linker: ld -r -o 
native_compiler: false
architecture: i386
model: default
int_size: 31
word_size: 32
system: linux
asm: i386-linux-as
asm_cfi_supported: false
with_frame_pointers: false
ext_exe: 
ext_obj: .o
ext_asm: .s
ext_lib: .a
ext_dll: .so
os_type: Unix
default_executable_name: a.out
systhread_supported: true
host: i386-pc-linux-gnu
target: i386-pc-linux-gnu
flambda: false
safe_string: true
default_safe_string: true
flat_float_array: true
function_sections: false
afl_instrument: false
tsan: false
windows_unicode: false
supports_shared_libraries: true
native_dynlink: false
naked_pointers: false
exec_magic_number: Caml1999X035
cmi_magic_number: Caml1999I035
cmo_magic_number: Caml1999O035
cma_magic_number: Caml1999A035
cmx_magic_number: Caml1999Y035
cmxa_magic_number: Caml1999Z035
ast_impl_magic_number: Caml1999M035
ast_intf_magic_number: Caml1999N035
cmxs_magic_number: Caml1999D035
cmt_magic_number: Caml1999T035
linear_magic_number: Caml1999L035

<><> Global opam variables ><><><><><><><><><><><><><><><><><><><><><><><><><><>
arch              x86_64                                          # Inferred from system
exe                                                               # Suffix needed for executable filenames (Windows)
jobs              4                                               # The number of parallel jobs set up in opam configuration
make              make                                            # The 'make' command to use
opam-version      2.1.6                                           # The currently running opam version
os                linux                                           # Inferred from system
os-distribution   ubuntu                                          # Inferred from system
os-family         debian                                          # Inferred from system
os-version        22.04                                           # Inferred from system
root              /home/runner/.opam                              # The current opam root directory
switch            /home/runner/work/multicoretests/multicoretests # The identifier of the current switch
sys-ocaml-arch                                                    # Target architecture of the OCaml compiler present on your system
sys-ocaml-cc                                                      # Host C Compiler type of the OCaml compiler present on your system
sys-ocaml-libc                                                    # Host C Runtime Library type of the OCaml compiler present on your system
sys-ocaml-version                                                 # OCaml version present on your system independently of opam, if any

<><> Configuration variables from the current switch ><><><><><><><><><><><><><>
prefix   /home/runner/work/multicoretests/multicoretests/_opam
lib      /home/runner/work/multicoretests/multicoretests/_opam/lib
bin      /home/runner/work/multicoretests/multicoretests/_opam/bin
sbin     /home/runner/work/multicoretests/multicoretests/_opam/sbin
share    /home/runner/work/multicoretests/multicoretests/_opam/share
doc      /home/runner/work/multicoretests/multicoretests/_opam/doc
etc      /home/runner/work/multicoretests/multicoretests/_opam/etc
man      /home/runner/work/multicoretests/multicoretests/_opam/man
toplevel /home/runner/work/multicoretests/multicoretests/_opam/lib/toplevel
stublibs /home/runner/work/multicoretests/multicoretests/_opam/lib/stublibs
user     runner
group    docker

<><> Package variables ('opam var --package PKG' to show) <><><><><><><><><><><>
PKG:name       # Name of the package
PKG:version    # Version of the package
PKG:depends    # Resolved direct dependencies of the package
PKG:installed  # Whether the package is installed
PKG:enable     # Takes the value "enable" or "disable" depending on whether the package is installed
PKG:pinned     # Whether the package is pinned
PKG:bin        # Binary directory for this package
PKG:sbin       # System binary directory for this package
PKG:lib        # Library directory for this package
PKG:man        # Man directory for this package
PKG:doc        # Doc directory for this package
PKG:share      # Share directory for this package
PKG:etc        # Etc directory for this package
PKG:build      # Directory where the package was built
PKG:hash       # Hash of the package archive
PKG:dev        # True if this is a development package
PKG:build-id   # A hash identifying the precise package version with all its dependencies
PKG:opamfile   # Path of the curent opam file
(flags
 (-w
  @1..3@5..28@30..39@43@46..47@49..57@61..62-40
  -strict-sequence
  -strict-formats
  -short-paths
  -keep-locs))
(ocamlc_flags (-g))
(ocamlopt_flags (-g))
(melange.compile_flags (-g))
(c_flags
 (-m32
  -O2
  -fno-strict-aliasing
  -fwrapv
  -pthread
  -fPIC
  -pthread
  -m32
  -D_FILE_OFFSET_BITS=64
  -fdiagnostics-color=always))
(cxx_flags
 (-x
  c++
  -m32
  -O2
  -fno-strict-aliasing
  -fwrapv
  -pthread
  -fPIC
  -pthread
  -fdiagnostics-color=always))
(link_flags ())
(menhir_flags ())
(menhir_explain ())
(coq_flags (-q))
(coqdoc_flags (--toc))
(js_of_ocaml_flags
 (--pretty --source-map-inline))
(js_of_ocaml_build_runtime_flags
 (--pretty --source-map-inline))
(js_of_ocaml_link_flags (--source-map-inline))
# Packages matching: installed
# Name                     # Installed # Repository # Synopsis
base-bigarray              base        default
base-domains               base        default
base-nnp                   base        default      Naked pointers prohibited in the OCaml heap
base-threads               base        default
base-unix                  base        default
dune                       3.15.2      default      Fast, portable, and opinionated build system
ocaml                      5.3.0       default      The OCaml compiler (virtual package)
ocaml-config               3           default      OCaml Switch Configuration
ocaml-option-32bit         1           default      Set OCaml to be compiled in 32-bit mode for 64-bit Linux and OS X hosts
ocaml-option-bytecode-only 1           default      Compile OCaml without the native-code compiler
ocaml-variants             5.3.0+trunk default      Current trunk
qcheck-core                0.21.3      default      Core qcheck library

@gasche

gasche commented May 21, 2024

No snark intended, I genuinely wonder how you work with these failures. For example I'm not sure if it is reasonably easy to extract a backtrace, and/or to observe the same failure within the debug runtime. (Is this segfault due to a memory corruption, or an assert failure?)

If you prefer to work on this without upstream looking over your shoulder for now, I am happy to let you work your magic and wait for easier reproduction instructions.

@jmid
Collaborator Author

jmid commented May 21, 2024

OK, fair enough.
Some of the remaining ones are just hard to reproduce - I suspect because they are timing- or signal-related.

I've been trying today for this one: https://github.com/ocaml-multicore/multicoretests/actions?query=branch%3Adomain-dls-32-bit-focus

  • For one, we are using focused tests on the CI, which let us iterate on a single test rather than a full test suite run
  • I failed to reproduce the sequential failure, then started to suspect that the crash could instead come from the parallel test, if the tests happen to crash before either one of them prints anything
  • Rerunning both tests didn't help reproduce it
  • I then tried hard-coding the last known problematic seed
  • Lastly, I've tried adding a stress-test mode for STM, similar to #443 (Add Lin_domain.stress_test), so far without luck
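
The seed-pinning and rerunning steps listed above can be sketched as a small bash loop. This is only a minimal sketch, not the workflow actually used on the CI: the helper name `run_until_crash` is hypothetical, and the `./focusedtest.exe -v -s SEED` invocation in the usage comment is copied from the CI log in this thread. It assumes bash, which reports a process killed by signal N as exit status 128+N, so SIGSEGV (11) shows up as 139.

```shell
#!/usr/bin/env bash
# run_until_crash MAX CMD [ARGS...]
# Rerun CMD up to MAX times and stop at the first run killed by SIGSEGV.
# Prints the index of the crashing run on stdout and returns 0,
# or returns 1 if none of the runs crashed.
# (Hypothetical helper, for illustration only.)
run_until_crash() {
  local max=$1; shift
  local i status
  for ((i = 1; i <= max; i++)); do
    echo "Starting ${i}-th run" >&2
    status=0
    # '|| status=$?' keeps the loop alive even under 'bash -e'.
    "$@" >/dev/null 2>&1 || status=$?
    if [ "$status" -eq 139 ]; then   # 128 + 11 (SIGSEGV)
      echo "$i"
      return 0
    fi
  done
  return 1
}

# Usage (executable and flags as they appear in the CI log):
#   run_until_crash 500 ./focusedtest.exe -v -s 103830913
```

Progress messages go to stderr so that the run index on stdout can be captured by a calling script.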

@jmid
Collaborator Author

jmid commented May 21, 2024

I finally managed to reproduce this one - on 5.2.0 - and only once for now. It is indeed a sequential fault! 😮
https://github.com/ocaml-multicore/multicoretests/actions/runs/9180414541/job/25244781126#step:18:762

Starting 74-th run

random seed: 103830913
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential
/usr/bin/bash: line 1: 197189 Segmentation fault      (core dumped) ./focusedtest.exe -v -s 103830913
[ ]    0    0    0    0 / 1000     0.0s STM Domain.DLS test sequential (generating)

@jmid
Collaborator Author

jmid commented May 21, 2024

Switching hard-coded seed to 107236932 (the first one) works much better!
Across 500 repetitions this triggered 56 segfaults on 5.2.0
https://github.com/ocaml-multicore/multicoretests/actions/runs/9181424522/job/25248130724
and 50 segfaults on 5.3.0+trunk
https://github.com/ocaml-multicore/multicoretests/actions/runs/9181424534/job/25248130821
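
For comparing compiler versions, the repetitions can be reduced to a crash count and a rate. The following is a rough sketch under stated assumptions: `count_segfaults` and `crash_rate` are hypothetical helper names (not part of multicoretests), the test invocation in the usage comment is modelled on the `./focusedtest.exe -v -s SEED` command from the earlier log, and bash's convention of exit status 139 for a SIGSEGV death is assumed.

```shell
#!/usr/bin/env bash
# count_segfaults RUNS CMD [ARGS...]
# Run CMD RUNS times and print how many runs were killed by SIGSEGV.
count_segfaults() {
  local runs=$1; shift
  local crashes=0 status i
  for ((i = 1; i <= runs; i++)); do
    status=0
    "$@" >/dev/null 2>&1 || status=$?
    # bash reports death by signal N as exit status 128 + N; SIGSEGV is 11.
    if [ "$status" -eq 139 ]; then
      crashes=$((crashes + 1))
    fi
  done
  echo "$crashes"
}

# crash_rate CRASHES RUNS
# Print the crash rate as a percentage with one decimal (truncated),
# using integer arithmetic only.
crash_rate() {
  local tenths=$(( $1 * 1000 / $2 ))
  printf '%d.%d%%\n' $((tenths / 10)) $((tenths % 10))
}

# Usage with the seed reported above:
#   n=$(count_segfaults 500 ./focusedtest.exe -v -s 107236932)
#   crash_rate "$n" 500
# With the counts from this thread:
#   crash_rate 56 500   # prints 11.2%  (5.2.0)
#   crash_rate 50 500   # prints 10.0%  (5.3.0+trunk)
```

At roughly one crash in ten runs on both versions, this seed gives a far better signal than the single crash observed with seed 103830913.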
