Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[nvim] all RPC requests appear to be cancelled #1194

Open
zbaylin opened this issue Sep 28, 2023 · 24 comments
Open

[nvim] all RPC requests appear to be cancelled #1194

zbaylin opened this issue Sep 28, 2023 · 24 comments

Comments

@zbaylin
Copy link

zbaylin commented Sep 28, 2023

  • Editor: neovim v0.9.2
  • OS: macOS 14.0 (Sonoma)
  • :LspInfo:
 Language client log: /Users/zbaylin/.local/state/nvim/lsp.log
 Detected filetype:   ocaml
 
 1 client(s) attached to this buffer: 
 
 Client: ocamllsp (id: 1, bufnr: [1])
 	filetypes:       ocaml, ocaml.menhir, ocaml.interface, ocaml.ocamllex, reason, dune
 	autostart:       true
 	root directory:  [redacted]
 	cmd:             [redacted]

Yesterday (after upgrading to macOS 14), the LSP appeared to repeatedly freeze after some small period of time editing. Looking at the LSP logs from Neovim, it appears that every LSP request is responded to with a JSON RPC cancellation (-32800).

I saw #1182, but pinning merlin-lib doesn't seem to fix this issue unfortunately.

The logs above don't seem to provide a ton of information, so let me know if there's somewhere else I can look for more verbose logs.

@zbaylin
Copy link
Author

zbaylin commented Sep 28, 2023

Update: I didn't realize when I made this issue that version constraints in my local switch forced me to use 1.15.1-4.14. Upgrading to 1.16.2 seems to fix the issue, but not 100% sure.

@zbaylin
Copy link
Author

zbaylin commented Sep 29, 2023

Can confirm that this issue is still present on 1.16.2 -- it also looks like CPU usage for the process shoots up to about 100% when the cancellations start to occur.

@voodoos
Copy link
Collaborator

voodoos commented Oct 2, 2023

Thanks for you report @zbaylin

Is your project public ? Is there any way for us to reproduce the issue ?

@zbaylin
Copy link
Author

zbaylin commented Oct 2, 2023

Unfortunately not, let me see if I can make a repro repo.

@zbaylin
Copy link
Author

zbaylin commented Oct 3, 2023

@voodoos unfortunately I wasn't able to come up with an easily reproducible example. However, I did maybe get some more information? I attached lldb to a running "frozen" ocamllsp process. As soon as I did that, the following breakpoint was triggered (even though I hadn't set any):

(lldb) process attach --pid 38276
Process 38276 stopped
* thread #1, stop reason = EXC_BREAKPOINT (code=1, subcode=0x188124220)
    frame #0: 0x0000000188124220 libsystem_platform.dylib`_os_unfair_lock_corruption_abort + 52
libsystem_platform.dylib`:
->  0x188124220 <+52>: brk    #0x1
    0x188124224 <+56>: stp    x20, x21, [sp, #-0x10]!
    0x188124228 <+60>: adrp   x20, 0
    0x18812422c <+64>: add    x20, x20, #0x95d          ; "BUG IN CLIENT OF LIBPLATFORM: os_unfair_lock is corrupt"
Target 0: (ocamllsp) stopped.
Executable module set to "[...]/ocamllsp".
Architecture set to: arm64-apple-macosx-.

Continuing the process immediately hits the same breakpoint. Here's the backtrace from lldb:

* thread #1, stop reason = EXC_BREAKPOINT (code=1, subcode=0x188124220)
  * frame #0: 0x0000000188124220 libsystem_platform.dylib`_os_unfair_lock_corruption_abort + 52
    frame #1: 0x000000018811f788 libsystem_platform.dylib`_os_unfair_lock_lock_slow + 332
    frame #2: 0x00000001880ee3f0 libsystem_pthread.dylib`pthread_mutex_destroy + 64
    frame #3: 0x0000000104585ccc ocamllsp`caml_thread_reinitialize [inlined] st_mutex_destroy(m=0x000060000199c040) at st_posix.h:228:8 [opt]
    frame #4: 0x0000000104585cc4 ocamllsp`caml_thread_reinitialize at st_stubs.c:425:7 [opt]
    frame #5: 0x00000001880f6f7c libsystem_pthread.dylib`_pthread_atfork_child_handlers + 76
    frame #6: 0x0000000187faeb90 libsystem_c.dylib`fork + 112
    frame #7: 0x0000000104584a0c ocamllsp`spawn_unix(v_env=<unavailable>, v_cwd=<unavailable>, v_prog=<unavailable>, v_argv=<unavailable>, v_stdin=<unavailable>, v_stdout=<unavailable>, v_stderr=<unavailable>, v_use_vfork=<unavailable>, v_setpgid=1) at spawn_stubs.c:439:43 [opt]
    frame #8: 0x00000001045bb790 ocamllsp`caml_c_call + 28
    frame #9: 0x00000001043da178 ocamllsp`camlSpawn__spawn_inner_997 + 152
    frame #10: 0x0000000104133d24 ocamllsp`camlOcaml_lsp_server__run_in_directory_5864 + 500
    frame #11: 0x00000001041c6ed0 ocamllsp`camlMerlin_kernel__Pparse__apply_rewriter_786 + 296
    frame #12: 0x00000001041c73d0 ocamllsp`camlMerlin_kernel__Pparse__rewrite_957 + 144
    frame #13: 0x00000001041c744c ocamllsp`camlMerlin_kernel__Pparse__apply_rewriters_str_inner_1399 + 68
    frame #14: 0x00000001041c751c ocamllsp`camlMerlin_kernel__Pparse__apply_rewriters_1089 + 52
    frame #15: 0x00000001041dcd0c ocamllsp`camlMerlin_kernel__Mppx__fun_1203 + 60
    frame #16: 0x00000001041dcbd0 ocamllsp`camlMerlin_kernel__Mppx__code_begin + 136
    frame #17: 0x00000001041c7fc4 ocamllsp`camlMerlin_kernel__Phase_cache__apply_inner_528 + 852
    frame #18: 0x00000001041de448 ocamllsp`camlMerlin_kernel__Mpipeline__fun_1778 + 152
    frame #19: 0x00000001043c4578 ocamllsp`camlMerlin_utils__Misc__try_finally_inner_3715 + 48
    frame #20: 0x000000010452b10c ocamllsp`camlCamlinternalLazy__force_lazy_block_362 + 140
    frame #21: 0x00000001041dcfd4 ocamllsp`camlMerlin_kernel__Mpipeline__fun_1613 + 172
    frame #22: 0x000000010452b10c ocamllsp`camlCamlinternalLazy__force_lazy_block_362 + 140
    frame #23: 0x00000001041de4ec ocamllsp`camlMerlin_kernel__Mpipeline__fun_1791 + 60
    frame #24: 0x000000010452b10c ocamllsp`camlCamlinternalLazy__force_lazy_block_362 + 140
    frame #25: 0x00000001041dcfd4 ocamllsp`camlMerlin_kernel__Mpipeline__fun_1613 + 172
    frame #26: 0x000000010452b10c ocamllsp`camlCamlinternalLazy__force_lazy_block_362 + 140
    frame #27: 0x00000001041a3470 ocamllsp`camlQuery_commands__dispatch_2992 + 7320
    frame #28: 0x00000001043bf4cc ocamllsp`camlMerlin_utils__Std__let_ref_3289 + 60
    frame #29: 0x00000001043c4578 ocamllsp`camlMerlin_utils__Misc__try_finally_inner_3715 + 48
    frame #30: 0x0000000104569818 ocamllsp`camlStdlib__Fun__protect_317 + 96
    frame #31: 0x00000001041d3480 ocamllsp`camlMerlin_kernel__Mocaml__with_state_438 + 80
    frame #32: 0x00000001040fa9b4 ocamllsp`camlOcaml_lsp_server__Document__fun_4120 + 116
    frame #33: 0x0000000104513508 ocamllsp`camlStdune__Exn_with_backtrace__try_with_422 + 40
    frame #34: 0x00000001043dc6e0 ocamllsp`camlLev_fiber__do_no_raise_1549 + 24
    frame #35: 0x00000001043e7820 ocamllsp`camlLev_fiber_util__Worker__run_322 + 72
    frame #36: 0x00000001043e8de4 ocamllsp`camlThread__fun_850 + 44
    frame #37: 0x00000001045bb800 ocamllsp`caml_start_program + 104
    frame #38: 0x00000001045b1770 ocamllsp`caml_callback_exn(closure=<unavailable>, arg=1) at callback.c:111:10 [opt]
    frame #39: 0x0000000104585f38 ocamllsp`caml_thread_start(arg=<unavailable>) at st_stubs.c:549:5 [opt]
    frame #40: 0x00000001880f3034 libsystem_pthread.dylib`_pthread_start + 136

(Sorry for the spam, but I hope that helps) It's unclear to me whether this is a bug in merlin or ocaml-lsp from the above, but it does seem like a deadlock is happening somewhere

@voodoos
Copy link
Collaborator

voodoos commented Oct 3, 2023

I am wondering whether this could be linked to the OS update. Could you try to reproduce on a freshly built switch ?

@zbaylin
Copy link
Author

zbaylin commented Oct 3, 2023

I did build a brand new switch after first encountering this issue but sadly it's still present

@zbaylin
Copy link
Author

zbaylin commented Oct 5, 2023

@voodoos I was able to somewhat fix this (I think) by removing this line:

Merlin_utils.Lib_config.System.set_run_in_directory (run_in_directory ());

which does square with @gasche's suggestion in the above linked PR that this might be an issue with the spawn library. Is there a reason the merlin version is replaced by a custom implementation? I would imagine spawn would be quicker than using Sys.command but have no evidence to back that up.

@voodoos
Copy link
Collaborator

voodoos commented Oct 6, 2023

Interesting, so it might be an issue with the Spawn library, or the way we use it. Still you didn't notice anything wrong before the OS update ?

The hang happens when running ppxes. Could you check that this isn't specific to the ppx your using ?
I guess buffers without ppxes shouldn't hang and buffers with any ppxes should ?

Also @rgrinberg may have more hindsight here ?

@gasche
Copy link
Member

gasche commented Oct 6, 2023

To summarize my reasoning on the upstream issue ocaml/ocaml#12636 :

  • I believe that something goes wrong on new OSX versions when combining OCaml threads and fork().
  • I don't know whether the OCaml runtime is at fault (maybe this is a bug in OSX). (I don't believe anymore that this is a bug in janestreet/spawn)
  • ocaml-lsp could avoid the issue by not using fork() anymore, that is, by using another approach than janestreet/spawn to run what is basically a Sys.command call in
    Spawn.spawn ~cwd ~prog ~argv ~stdin ~stdout ?stderr ()

(Note: the issue may be fixed in OCaml 5 (rather than OCaml 4.14), because the implementation of threads is different and may interact better with forking in this case.)

@zbaylin
Copy link
Author

zbaylin commented Oct 6, 2023

Thanks @gasche for the summary! In the linked PR it seems Xavier Leroy (avoiding tagging) has confirmed that this is a bug in the 4.14 runtime

The mutexes that protect I/O channels are being destroyed in the child process after fork() even though they can be locked, which is an error according to the POSIX standard and seems to be turned into a fatal error by the pthreads library.

Having not used posix_spawn before I'm hesitant to suggest using it would be a temporary fix of the issue here, but from what I have read it does seem possible:

Since glibc 2.24, the posix_spawn() function commences by calling clone(2) with CLONE_VM and CLONE_VFORK flags. 1

If CLONE_VM is set, the calling process and the child process run in the same memory space. 2


I looked into adding a shim Spawn module last night that has the same (or at least very close) interface, but uses Unix.create_process_env under the hood. The only problem I faced is that Spawn.spawn takes a cwd argument that calls chdir before execing. Unix.create_process* doesn't support this, but maybe just using Unix.chdir before and after would work.

@zbaylin
Copy link
Author

zbaylin commented Oct 6, 2023

Coincidentally, looking at the dune 3.11 release notes, it looks like @rgrinberg patched dune's vendored version of spawn to use posix_spawn on macOS in ocaml/dune#8090 -- maybe it's possible that same vendored version could be used here/an ocaml-community fork of spawn could be made?

@gasche
Copy link
Member

gasche commented Oct 6, 2023

Regarding an ocaml-lsp workaround: I think that it would be easier to look at the usage context for spawn inside ocaml-lsp. The one instance I found (hopefully the only one?) is in fact calling /bin/sh -c 'cmd'. If we are assuming that bash is available, it should be easy to turn this into a Sys.command call that basically does /bin/sh -c (cd dir && cmd). (In general doing chdir in a portable way is tricky, but here the code is already assuming sh is available.)

@gasche
Copy link
Member

gasche commented Oct 6, 2023

(But yes it would be even easier to reuse other people's work to fix similar issues. It seems unfortunate that dune's improvement to the spawn library are currently only available to dune, rather than other users of the library.)

@zbaylin
Copy link
Author

zbaylin commented Oct 6, 2023

Unfortunately it looks like there are numerous uses of Spawn.spawn in the LSP server

That's why I thought it would be a good idea to stub out a compatibility module for Unix.create_process_env, but when I tried the cd trick I got another bug that I couldn't seem to root out so I was trying to think of alternatives.

If I understand correctly, Sys.command blocks the thread it runs under (since it returns the exit code rather than the PID), so it's not a drop-in replacement for the spawn function. Granted, many invocations (like the one you mentioned) immediately call waitpid after but there are some that don't (see the ocamlformat.ml example).

@zbaylin
Copy link
Author

zbaylin commented Oct 6, 2023

I created a branch (https://github.com/skolemlabs/ocaml-lsp/tree/use-dune-vendored-spawn) that uses the vendored version of spawn from dune as a temporary fix. Obviously it's not ideal to have two separate forks of the library but in case anyone else comes across this issue in the interim, running

opam pin add ocaml-lsp-server git@github.com:skolemlabs/ocaml-lsp.git#use-dune-vendored-spawn

seems to fix the issue for now.

@voodoos
Copy link
Collaborator

voodoos commented Oct 9, 2023

It's great to have a workaround, thanks @zbaylin ! It might be worth it to release a version of ocaml-lsp with that workaround since the issue probably impact all users that upgrade to Sonoma ?

I am still somewhat surprised that we don't have more report of this issue, I will upgrade to Sonoma and try to reproduce.

Do you have an Intel of M1/2 mac ?

@voodoos
Copy link
Collaborator

voodoos commented Oct 9, 2023

Just finished the upgrade, I wasn't able to reproduce so far.

@zbaylin
Copy link
Author

zbaylin commented Oct 10, 2023

It might be worth it to release a version of ocaml-lsp with that workaround since the issue probably impact all users that upgrade to Sonoma ?

Sure, that sounds like a good idea to me. I'm not a huge fan of vendoring the exact same forked library twice, so I would suggest again that (even if long term) it would be a good idea to fork spawn into ocaml-community or somewhere. That obviously presents some issues with OPAM releases, etc. but it would mean not being beholden to Jane Street's release schedule.

Do you have an Intel of M1/2 mac ?

I have 2021 MacBook Pro with an M1 Pro. Here's a screenshot of About this Mac with the serial number redacted.
Screenshot 2023-10-10 at 10 05 27 AM

@voodoos
Copy link
Collaborator

voodoos commented Oct 10, 2023

I also have a 14-inch 2021. I upgraded yesterday to Sonoma, and have no issue so far with ocaml-lsp 🤔.

@frejsoya
Copy link

I created a branch (https://github.com/skolemlabs/ocaml-lsp/tree/use-dune-vendored-spawn) that uses the vendored version of spawn from dune as a temporary fix. Obviously it's not ideal to have two separate forks of the library but in case anyone else comes across this issue in the interim, running

opam pin add ocaml-lsp-server git@github.com:skolemlabs/ocaml-lsp.git#use-dune-vendored-spawn

seems to fix the issue for now.

This fixed it for me as well.
Had same issues with LSP, Neovim. Ocaml 5.1. osx Sonoma.

@voodoos
Copy link
Collaborator

voodoos commented Oct 16, 2023

@frejsoya is the project your're working on open source ? Is there a way for us to reproduce ?

@frejsoya
Copy link

@voodoos

No unfortunately not right know, i'll share something if it happens again. I hoped having one more hint at os x + Sonoma would help. Sorry for the "me too" :).

I'm still on @zbaylin changes. I'll see if i can get a stack trace using the current release during the week. That is if i can reproduce :)

@voodoos
Copy link
Collaborator

voodoos commented Oct 17, 2023

@frejsoya, @zbaylin, PR ocaml/ocaml#12646 backports some 5.0 changes to 4.14 that may solve the issue.

Would you try using this branch of OCaml and see if you are still able to reproduce ?
The branch is here: https://github.com/xavierleroy/ocaml/tree/thread-atfork

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants