New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix!: Perform a clean exit again #2463
Conversation
info!("Quit with code {}: {}", code, reason); | ||
} | ||
|
||
pub fn is_running(&self) -> bool { | ||
self.running.load(Ordering::Relaxed) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The running tracker only tracks the exit code now, so maybe it could be renamed or replaced by variable. But I left that for future refactoring.
Err(join_error) => error!("Error joining IO loop: '{}'", join_error), | ||
Ok(Err(error)) => { | ||
if !error.is_channel_closed() { | ||
error!("Error: '{}'", error); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add this message back. I thought it wasn't important, but it's actually, the message printed in this bug for example:
src/bridge/ui_commands.rs
Outdated
@@ -276,15 +278,15 @@ impl AsRef<str> for UiCommand { | |||
} | |||
|
|||
struct UIChannels { | |||
sender: LoggingSender<UiCommand>, | |||
sender: Mutex<Option<LoggingSender<UiCommand>>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These Mutex Options over concurrent capable types kill me a little, but after thinking about it for a bit nothing cleaner came to mind so I think its reasonable to leave it.
One thing we might consider is to use a parkinglot RwLock instead since the base case is just reading and we only write when the app closes. But tbh thats a micro optimization to take it or leave it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I simplified this, using OnceLock
src/bridge/mod.rs
Outdated
let neovim_exited = select! { | ||
_ = &mut session.io_handle => false, | ||
_ = process.wait() => { | ||
log::info!("The Neovim process quit before the IO stream, waiting two seconds"); | ||
true | ||
} | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this approach as one of my first attempts when trying to fix this problem, but wait()
doesn't detect that the process is gone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Linux this workaround is needed for Neovim 0.9.5 without my fix neovim/neovim#28157. There it correctly detects both the stdio stream close and process exit.
src/bridge/mod.rs
Outdated
let sleep = sleep(Duration::from_millis(2000)); | ||
tokio::pin!(sleep); | ||
select! { | ||
_ = session.io_handle => {} | ||
_ = &mut sleep => { | ||
log::info!("The IO stream was never closed, forcing Neovide to exit"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't tokio::time::timeout()
be simpler here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, but it looks like the timeout consumes the future, and I want to re-use it after that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to a Tokio timeout
src/bridge/mod.rs
Outdated
self.runtime.take().unwrap().shutdown_background(); | ||
log::info!("Starting neovim runtime shutdown"); | ||
let start = Instant::now(); | ||
shutdown_ui(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shutting down the UI explicitly from the runtime drop doesn't feel very clean. IMHO the runtime should be concerned only about itself, and shutting down other things should be done elsewhere.
As an aside, I'm not sure why the runtime is created manually instead of just using tokio::main
. I did some refactoring earlier and eliminated the entire NeovimRuntime
and just used tokio::main
, and it was a lot cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function is probably a bit poorly named, it shuts down the UI event handlers, not the whole UI system. And we do have to wait for them here, otherwise they start to print stuff to stderr instead of the logging system, as the rest of the runtime shuts down while they continue running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I simplified the code and removed this. The UI command tasks are now cancelled with the runtime shutdown.
@sid-6581, We really need to find the root cause of not being able to detect either the process exit or the closing of the io streams. That's something that I know for sure works in Windows, tons of programs are absolutely dependent on that, and I have written a few myself and not something we should work around and cause more bugs anymore. And those two are the only two signals we can rely on, since we need to keep the Neovide UI active, in order to be able to respond to errors that can happen after Unfortunately, I can't help much with that, since I'm unable to repeat it on any of my systems. It does seem to affect a few others, but for most it looks like the problem was with the corrupt shada files, which is now fixed so that the error message is shown and we can send the inputs. So, I think the problem is very rare. |
@fredizzimo I should have some more free time now to look at this again. I'll try to debug this again, I have a couple of things I'd like to try. At least I can still easily reproduce it, which is good. |
Great thanks, I would start by checking if it's stuck here: That's where I think they might violate one or more of the conditions required https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/ntifs/nf-ntifs-ntreadfile
Furthermore, this document says that asynchronous IO on pipes can only be performed using overlapped operations https://learn.microsoft.com/en-us/windows/win32/ipc/synchronous-and-overlapped-input-and-output#enabling-asynchronous-operation |
@fredizzimo I'm in the debugger now, and that's exactly where it's stuck. I need to dig further, but it looks like it's stuck waiting for a handle that the neovide process no longer has open. Here you see it's waiting for handle 0x384: But it doesn't appear that neovide still has that handle open: I can't quite track it back to anything specific to see what it's actually trying to wait on, since there are no clues in the current call stack, but I'll keep digging. |
I forgot to add That leads me to https://github.com/rust-lang/rust/blob/385fa9d845dd326c6bbfd58c22244215e431948a/library/std/src/sys/pal/windows/pipe.rs#L83 I'll backtrack from there. |
That's most likely one of the pipes created here with Line 84 in a973abb
You could maybe also try process monitor to log all the file io calls https://learn.microsoft.com/en-us/sysinternals/downloads/procmon |
Yep, that's what I'll try next. It's interesting that the pipe remains open when the nvim process is gone. |
I will put this as draft, so that it does not accidentally get merged until the Window issue is resolved or at least identified. |
I'm still digging, but from what I can tell: Sometimes both named pipes stay open after nvim exits, sometimes stdin is closed but stdout remains open. The named pipes are create in overlapped mode (only for neovide, not the ones that are handed to neovim), so
ProcMon slowed down things enough that I couldn't recreate the hang while it was running. I'll be traveling this weekend, but I'll get back to it when I get back. |
Actually, the more I think about, I don't understand why the read from the overlapped named pipe is a blocking operation using |
9e0699e
to
35582e8
Compare
I'm re-opening this. I still think it's better to hang at the exit on a few systems than to leak Neovim processes. |
* Don't request exit multiple times * Wait until the Neovim exits before exiting the event loop * Fix exit crash on Wayland * Disable detach when the window is closed for remote connections * Wait for all tasks to exits, so that logs don't go to stderr * Force quit after some timeout, when the neovim process quits * Add some logging * Clean up the io stream wait code * Simplify the code, let the runtime shutdown the UICommand tasks
What kind of change does this PR introduce?
The main loop now waits for Neovim to quit before exiting. This allows error messages to be shown and dealt with, instead of leaking nvim processes, as neovide will block here https://github.com/neovim/neovim/blob/15a2dd9e963784123273ec830e82e24ecea4ad0b/src/nvim/main.c#L802-L807
It primarily waits for the stdio stream to finish, but due to bugs, like neovim/neovim#26743, that does not always happen, so it additionally waits for the process to exit, and forces a quite if the io stream has not finished within 2 seconds after that. If you use Neovim nightly with neovim/neovim#28157, it should not happen
An exit crash that happens on some Wayland systems has also been fixed, by properly shutting down the rendering system before the event loop finishes.
Did this PR introduce a breaking change?
The special handling of just detaching from remote connections when closing the Window, which was added here #687 has been removed. It would uneccessarily complicate this change, and I found that re-attaching causes a lot of bugs, like invisible text, the cursor showing the wrong position, the screen not updating and so on, so it's not very usable anyway.
NOTE: That it's still possible to detach manually by the following command:
:call chanclose(g:neovide_channel_id)
Some hangs that @sid-6581 has reported on Windows might also start to appear again. But it will not hang compltely, just for 2 seconds. If that still happens, the actual cause should be determined and fixed. The same applies to other platforms as well of course.