SIGILL in debugger on ARM Macs #815

Open
nilsvu opened this issue Sep 2, 2023 · 4 comments

@nilsvu
nilsvu commented Sep 2, 2023

When I run a program that links libxsmm in a debugger on ARM Macs, I first get an EXC_BAD_INSTRUCTION, and when I skip those I get SIGILL. Once I skip all of that, the program runs fine.

I have tried to set the LIBXSMM_TARGET=arm64 env variable, but that didn't help. Does anyone have a good solution here?

Environment:

  • macOS 13, M2 chip
  • AppleClang 14
  • libxsmm on current main branch, compiled with a simple make
  • LLDB debugger
@alheinecke
Collaborator

alheinecke commented Sep 3, 2023

Do you have a reproducer? Can you check whether you see the same issue on an M1? (I only have access to an M1.)

@hfp
Collaborator

hfp commented Sep 5, 2023

Background
By design, our CPUID implementation on ARM needs to handle SIGILL, since the ARM µArch has no easy way to determine CPU features. LIBXSMM is sophisticated in the sense that it can still determine CPU features across a variety of operating systems and systems (a subject of our discovery and exploration). If code is executed under a debugger, however, signals are caught by the debugger first (perhaps regardless of whether the code wants to handle that signal). The default in gdb (and perhaps in lldb as well) is "catch all". This means the debugger must be instructed to "get past the signal". For GDB, that is handle all pass (see https://sourceware.org/gdb/onlinedocs/gdb/Signals.html), or the noprint keyword ("GDB should not mention the occurrence of the signal at all."). For Apple's lldb (assuming you do not want to use gdb, or to code-sign gdb on macOS), this means you are on your own (Apple).

Btw, "CPUID" means LIBXSMM attempts to select the best possible implementation on every supported platform. This is reasonable since code is generated at runtime anyway (independent of the compiler flags denoting the target). Also, using compiler flags to make the code more specific to a target is not expected to impact performance.

Desired behavior
If LIBXSMM is compiled using DBG=1 (effectively DBG != 0), we want to avoid relying on SIGILL handling, so that no additional changes are needed for the usual debuggers. This behavior was implemented, but it has "holes", since it is a question of when to disable our CPUID mechanism. It would be useful (as Alex pointed out) to learn about your case, such as the compilation flags and the debugger. How did you build LIBXSMM under macOS, and which debugger/command did you use?

When running a program that links libxsmm in a debugger on ARM Macs I first get an EXC_BAD_INSTRUCTION, and when I skip those I get SIGILL. When skipping all of that I can run the program fine.

We want you "to run fine" if you build LIBXSMM with DBG=1. The objective is to anticipate debugging when debug symbols are present, and to avoid the sophisticated CPUID code while debugging (at the expense of executing different code paths than "release" code).
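One way to picture the intended DBG behavior is a compile-time switch that bypasses the signal-based probe entirely. This is only a sketch under assumptions: EXAMPLE_DEBUG_BUILD, example_detect_features, and example_probe_with_sigill are hypothetical names rather than LIBXSMM symbols, and the probe is a stub.

```c
#include <stdio.h>

typedef enum {
  EXAMPLE_FEATURES_BASELINE = 0, /* conservative feature set, no probing */
  EXAMPLE_FEATURES_PROBED = 1    /* feature set found by probing */
} example_feature_source;

/* Stub for a SIGILL-based probe; a real one would execute candidate instructions. */
static int example_probe_with_sigill(void) { return 1; }

static example_feature_source example_detect_features(void) {
#if defined(EXAMPLE_DEBUG_BUILD)
  /* Debug build (hypothetical macro): never raise SIGILL, accept the baseline. */
  return EXAMPLE_FEATURES_BASELINE;
#else
  /* Release build: probe and fall back to the baseline if the probe fails. */
  return example_probe_with_sigill() ? EXAMPLE_FEATURES_PROBED
                                     : EXAMPLE_FEATURES_BASELINE;
#endif
}
```

The trade-off is the one noted above: a debug build then exercises a different code path than a release build.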

@hfp hfp self-assigned this Sep 5, 2023
@nilsvu
Author

nilsvu commented Sep 5, 2023

Thanks for your replies!

We want you "to run fine" if you build LIBXSMM with DBG=1.

This means I have to re-link my program against a debug build of libxsmm whenever I want to run it through a debugger, even if I'm not trying to debug anything related to libxsmm. That would be rather inconvenient. May I suggest instead using an environment variable to specify the architecture, and skipping any signal-raising code if that variable is defined? From a brief inspection of the code, that is what I thought LIBXSMM_TARGET did, but I didn't get it to work (with the variable set, I still got the signals in the debugger).
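To make the suggestion concrete, here is a minimal sketch of an environment override that is consulted before any probing happens. Everything in it is hypothetical: EXAMPLE_TARGET, the value strings, and the example_* functions are made up for illustration and do not describe how LIBXSMM_TARGET is actually handled.

```c
#include <stdlib.h>
#include <string.h>

typedef enum {
  EXAMPLE_ARCH_BASELINE = 0,
  EXAMPLE_ARCH_ADVANCED = 1
} example_arch;

/* Counts probe invocations so that callers can see whether probing was skipped. */
static int example_probe_calls = 0;

/* Stub for signal-based detection (the part that would raise SIGILL). */
static example_arch example_probe_arch(void) {
  ++example_probe_calls;
  return EXAMPLE_ARCH_ADVANCED;
}

/* Honor EXAMPLE_TARGET (hypothetical variable) first; probe only as a fallback. */
static example_arch example_select_arch(void) {
  const char *const env = getenv("EXAMPLE_TARGET");
  if (NULL != env && '\0' != *env) {
    if (0 == strcmp(env, "baseline")) return EXAMPLE_ARCH_BASELINE;
    if (0 == strcmp(env, "advanced")) return EXAMPLE_ARCH_ADVANCED;
    /* Unrecognized value: fall through to probing. */
  }
  return example_probe_arch();
}
```

With EXAMPLE_TARGET=baseline set in the debugger's environment, example_select_arch returns without ever calling the probe, so no signal would be raised.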

How did you build LIBXSMM under macOS, and which debugger/command did you use?

  1. Cloned the main branch and ran make.

  2. Linked the resulting libxsmm.dylib into my program.

  3. Loaded my program with lldb-1403.0.17.67 (/usr/bin/lldb on macOS 13.5.1) using the file command, followed by run.

  4. LLDB stops with this message:

    Process 83848 stopped
    * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_INSTRUCTION (code=1, subcode=0xd5380637)
        frame #0: 0x00000001013280c4 libxsmm.1.dylib`libxsmm_cpuid_arm + 156
    libxsmm.1.dylib`libxsmm_cpuid_arm:
    ->  0x1013280c4 <+156>: mrs    x23, ID_AA64ISAR1_EL1
        0x1013280c8 <+160>: ldr    w8, [x22, #0x9f8]
        0x1013280cc <+164>: cmp    w8, #0x7d1
        0x1013280d0 <+168>: b.gt   0x101328100               ; <+216>
    
  5. Disabled the EXC_BAD_INSTRUCTION like this:

    settings set platform.plugin.darwin.ignored-exceptions EXC_BAD_INSTRUCTION
    
  6. Rerunning produces this:

    Process 83869 stopped
    * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGILL
        frame #0: 0x00000001013280c4 libxsmm.1.dylib`libxsmm_cpuid_arm + 156
    libxsmm.1.dylib`libxsmm_cpuid_arm:
    ->  0x1013280c4 <+156>: mrs    x23, ID_AA64ISAR1_EL1
        0x1013280c8 <+160>: ldr    w8, [x22, #0x9f8]
        0x1013280cc <+164>: cmp    w8, #0x7d1
        0x1013280d0 <+168>: b.gt   0x101328100               ; <+216>
    
  7. Disabled SIGILL like this:

    process handle SIGILL -n false -p true -s false
    
  8. Now I can debug my program.

@alheinecke
Collaborator

@hfp: any updates on this?
