Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

macOS: Undefined symbols for architecture arm64: [...] "___to_global" #1288

Open
JayFoxRox opened this issue Aug 28, 2023 · 8 comments · May be fixed by #1289
Open

macOS: Undefined symbols for architecture arm64: [...] "___to_global" #1288

JayFoxRox opened this issue Aug 28, 2023 · 8 comments · May be fixed by #1289

Comments

@JayFoxRox
Copy link

JayFoxRox commented Aug 28, 2023

Hi, I'm running pocl 4.0 on macOS (with SPIR-V) with minimal changes to https://github.com/Homebrew/homebrew-core/blob/bf6ac0be7aaf7ef66d89e8e7683f926ae981fcfa/Formula/p/pocl.rb (just bumping version and adding LLVM_SPIRV).

However, my actual test setup is complicated. I'm unable to reproduce the following erors in OpenCL programs, but I get them when using HIP (which I ported to macOS myself).

For now, I still think this is an issue in pocl, because I think the issue is in the .bc to .so linking step.

(Edit: I reproduced it in pure OpenCL now; so it's almost certainly a pocl bug)


I get this problem when trying to run HIP code (I added empty line around the important bit):

[...]
[2023-08-28 20:25:05.853340000]POCL: in fn int pocl_llvm_generate_workgroup_function_nowrite(unsigned int, cl_device_id, cl_kernel, _cl_command_node *, void **, int) at line 746:
  |    TIMING |       >>>        85.553  ms    API: llvm_workgroup_ir_func_gen
[2023-08-28 20:25:05.853365000]POCL: in fn llvm_codegen at line 150:
  |      LLVM |  Writing parallel.bc to /Users/fox/.cache/pocl/kcache/AH/NNOFPFPLGGDCMPEJBBOJLINIMAMNJDADFJDKA/_Z9floatMathPfS_S_/1-1-1-goffs0-smallgrid/parallel.bc.
[2023-08-28 20:25:05.862744000]POCL: in fn int pocl_llvm_codegen(cl_device_id, cl_program, void *, char **, uint64_t *) at line 906:
  |      LLVM |  Generating an object file directly.
[2023-08-28 20:25:05.907843000]POCL: in fn llvm_codegen at line 199:
  |      LLVM |  written /Users/fox/.cache/pocl/kcache/tempfile_ZQOwwV.so.o size 8072
[2023-08-28 20:25:05.907895000]POCL: in fn llvm_codegen at line 213:
  |      LLVM |  Temporary kernel.so file for kernel _Z9floatMathPfS_S_ : /Users/fox/.cache/pocl/kcache/tempfile_lfLBza.so
[2023-08-28 20:25:05.907899000]POCL: in fn llvm_codegen at line 215:
  |   GENERAL |  Linking final module


Undefined symbols for architecture arm64:
  "___to_global", referenced from:
      __pocl_kernel__Z9floatMathPfS_S_ in tempfile_ZQOwwV.so.o
  "___to_local", referenced from:
      __pocl_kernel__Z9floatMathPfS_S_ in tempfile_ZQOwwV.so.o
ld: symbol(s) not found for architecture arm64
error: linker command failed with exit code 1 (use -v to see invocation)


[2023-08-28 20:25:05.950245000]POCL: in fn llvm_codegen at line 239:
  |      LLVM |  Renaming temporary kernel.so to final ('/Users/fox/.cache/pocl/kcache/AH/NNOFPFPLGGDCMPEJBBOJLINIMAMNJDADFJDKA/_Z9floatMathPfS_S_/1-1-1-goffs0-smallgrid/_Z9floatMathPfS_S_.so') has failed.
[2023-08-28 20:25:05.950444000]POCL: in fn llvm_codegen at line 265:
  |    TIMING |       >>>       184.558  ms    API: llvm_codegen
Final linking of kernel _Z9floatMathPfS_S_ failed.
[...]

The names in the .bc are still this (2 underscores; is this correct / expected for the bc?):

% llvm-dis /Users/fox/.cache/pocl/kcache/AH/NNOFPFPLGGDCMPEJBBOJLINIMAMNJDADFJDKA/program.bc -o - | grep _to_global
  %42 = call ptr addrspace(1) @__to_global(ptr addrspace(4) %41) #0
  %63 = call ptr addrspace(1) @__to_global(ptr addrspace(4) %62) #0
  %84 = call ptr addrspace(1) @__to_global(ptr addrspace(4) %83) #0
  %__to_global.as_cast = addrspacecast ptr addrspace(4) %104 to ptr addrspace(1)
  %105 = ptrtoint ptr addrspace(1) %__to_global.as_cast to i64
  %109 = bitcast ptr addrspace(1) %__to_global.as_cast to ptr addrspace(1)
declare ptr addrspace(1) @__to_global(ptr addrspace(4)) #0

In the .so these are now tripple prefixed (3 underscores; which is probably good for macOS, but bad for pocl?):

% objdump -D /Users/fox/.cache/pocl/kcache/tempfile_s2EEeX.so.o | grep __to_global
    14e8:       94000000        bl      0 <___to_global>
    1518:       94000000        bl      0 <___to_global>
    1548:       94000000        bl      0 <___to_global>

I believe this is supposed to be handled by code which was modified in 4500774 (so this might be broken in 4.0; although I didn't try 3.x because I don't have LLVM 15.x anymore).
The brew formula for pocl still points at 3.x so I speculate that I'm just the first to hit this issue.

Before that patch, the code seems to have depended on the mangling by the compiler to generate the names. Now it seems to hardcode a specific mangling.
This assumption doesn't respect that macOS / OSX has a _ prefix for C functions (as part of the MachO ABI), so these are actually ___to_global (= 3 underscores) etc.
I've tried to find out which tool does the .bc → .so translation (llvm-codegen I guess?) and I also tried to find the part where the _ is added to the function name, but I wasn't able to.

Maybe someone more familiar can take a look?
Also, if someone can show some OpenCL code which generates these symbols, that'd be nice.

@JayFoxRox JayFoxRox changed the title macOS: Undefined symbols for architecture arm64: [...] "___to_global", referenced from: macOS: Undefined symbols for architecture arm64: [...] "___to_global" Aug 28, 2023
@JayFoxRox
Copy link
Author

JayFoxRox commented Aug 28, 2023

I've reproduced it in pure OpenCL now.

It happens if I have more than one to_global, so commenting out either gi1 or gi2 will suddenly make it work:

// poclbug.cl
kernel void poclbug(__global int* address) {
    volatile global int *gi1 = to_global(address);
    volatile global int *gi2 = to_global(address);
}

Trying to build with poclcc:

% poclcc ./poclbug.cl      
Undefined symbols for architecture arm64:
  "___to_global", referenced from:
      __pocl_kernel_poclbug in tempfile_pndNUM.so.o
ld: symbol(s) not found for architecture arm64
error: linker command failed with exit code 1 (use -v to see invocation)
Final linking of kernel poclbug failed.
zsh: abort      poclcc ./poclbug.cl

@pjaaskel
Copy link
Member

This is where we convert the AS builtins to AS casts: https://github.com/pocl/pocl/blob/main/lib/llvmopencl/linker.cpp#L393 that code is quite recent, maybe I have some (iterator?) bug there so it only replaces the first occurence. Just a quick guess.

@pjaaskel
Copy link
Member

...it could be the too common iterator invalidation bug: It erases instructions from the parent F which might invalidate the "users iterators"? We should collect the stale instructions and delete them after traversing. Maybe...

@JayFoxRox
Copy link
Author

I have a working debug build of pocl 4.x now and I think I figured it out; see my fix in #1289 ("works for me").

The bug happens much earlier than I thought.
It appears to be caused by kernel_compiler_passes[...].run, so the code I (speculatively) blamed isn't even running at that point.

The fix is trivial but the bug probably deserves investigation: Why this didn't affect any other users?
Also curious why this only breaks if the to_global function is used twice (or more).

@pjaaskel
Copy link
Member

pjaaskel commented Sep 7, 2023

@JayFoxRox bear in mind that MacOS hasn't had a maintainer for a long time nor active contributors, which is why this hasn't popped up. It works well for x86. I suggest try fixing the check in the loop I mentioned above to also catch the triple underscore name and if it doesn't fix it, check if there is something being done that invalidates the iterators. @isuruf might also help you through it as he is the MacOS maintainer. (I don't use Macs personally).

@isuruf
Copy link
Member

isuruf commented Sep 7, 2023

I cannot reproduce this bug with pocl 4.0

@pjaaskel
Copy link
Member

pjaaskel commented Sep 7, 2023

@isuruf I believe you need to run the chipStar or the conformance test suite (with SPIR-V mode) to produce these conversion operations through SPIR-V.

@isuruf
Copy link
Member

isuruf commented Sep 7, 2023

I tried only #1288 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants