
Fix hanging after exit is sent #4152 #4153

Draft: wants to merge 68 commits into master
Conversation

@soulomoon (Collaborator) commented Mar 25, 2024

Trying to fix #4152

The reactor thread offers a db connection to hieDb; if the thread exits, the db connection exits with it. Operations that need the db connection should therefore only run while the reactor thread is still running.

Otherwise a race condition can happen: the shutdown handler stops the reactor thread and cancels the shake session. If an operation is still trying to reach the db (in this case, the config update callback calling setSomethingModified, which needs the db connection), it hangs HLS forever.

This PR ships the config update callback to the reactor thread to fix the deadlock. An alternative would be to make the shutdown handler take away the ideStateVar.
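As a sketch of the "ship to the reactor" idea (names like ReactorQueue and shipToReactor are hypothetical illustrations, not the actual HLS code): work that needs the db connection is enqueued onto the reactor's queue, so it can only run while the reactor thread is alive; once the reactor is cancelled, pending work is simply dropped instead of touching a dead connection.

```haskell
import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.STM
import Control.Monad (forever, join)
import System.Timeout (timeout)

-- Hypothetical sketch: the reactor owns the db connection, so any action
-- that needs it is shipped to the reactor's queue instead of running on
-- the caller's thread.
newtype ReactorQueue = ReactorQueue (TQueue (IO ()))

-- Enqueue an action to run on the reactor thread.
shipToReactor :: ReactorQueue -> IO () -> IO ()
shipToReactor (ReactorQueue q) act = atomically (writeTQueue q act)

-- The reactor loop runs queued actions while it is alive. When the
-- shutdown handler kills this thread, queued actions are dropped rather
-- than executed against a dead db connection.
reactorLoop :: ReactorQueue -> IO ()
reactorLoop (ReactorQueue q) = forever (join (atomically (readTQueue q)))

main :: IO ()
main = do
    rq <- ReactorQueue <$> newTQueueIO
    _ <- forkIO (reactorLoop rq)
    done <- newEmptyMVar
    -- The "config update callback" is shipped to the reactor rather
    -- than run on the caller's thread.
    shipToReactor rq (putMVar done ())
    r <- timeout 1000000 (takeMVar done)
    print r
```

The point of the pattern is that liveness of the reactor thread and validity of the db connection are coupled by construction.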

@soulomoon soulomoon changed the title surface the problem WIP to fix #4152 Mar 25, 2024
@soulomoon soulomoon added the WIP label Mar 25, 2024
@soulomoon (Collaborator, Author) commented Mar 26, 2024

We need to send the config update to the reactor thread to avoid the race condition.
More precisely, setSomethingModified should run on the reactor thread, since the WithDb and IndexQueue offered by runWithDb are disconnected once the reactor thread shuts down.

@soulomoon soulomoon changed the title WIP to fix #4152 WIP to fix #4152, Hanging after exit is sent Mar 26, 2024
@soulomoon soulomoon changed the title WIP to fix #4152, Hanging after exit is sent WIP to fix hanging after exit is sent #4152 Mar 26, 2024
@soulomoon soulomoon added status: needs review This PR is ready for review and removed WIP labels Mar 26, 2024
@soulomoon soulomoon marked this pull request as ready for review March 26, 2024 11:42
@soulomoon soulomoon changed the title WIP to fix hanging after exit is sent #4152 Fix hanging after exit is sent #4152 Mar 26, 2024
@jhrcek (Collaborator) commented Mar 26, 2024

I don't understand how the fix works (I'll leave the review to people who understand it better), but it's great that you managed to resolve it.
Do you have any estimate of how much time this saves across one test-suite run?

@soulomoon (Collaborator, Author) commented Mar 26, 2024

> Do you have any estimate how much time this saves across one test suite run?

I have not measured it, but I don't think it saves much: only about 3 seconds each time a server tries to close while a config update is in flight. Most tests don't update the config, and there may be enough time before exit for the initial config to load anyway. So rather than fulfilling the initial motivation of speeding up the test suite, this PR is a bug fix that prevents a potential deadlock during the shutdown and exit of HLS.

@michaelpj (Collaborator):

Hmm, this is tricky. I'm not sure this is quite the right fix. Can we make it so that we throw in this case instead of hanging?

@soulomoon (Collaborator, Author) commented Mar 27, 2024

Alternatively, we could make the shutdown handler take away the ideStateVar to prevent the race condition. This would not change the thread the config update callback runs on, since we have only one user of this ideStateVar.
Do you think the alternative would be a better approach? @michaelpj
soulomoon#5

Either way, I think we should fix this race condition.

@michaelpj (Collaborator):

I think that's a better approach. But I think maybe we can work at a lower level of granularity? Basically, I think the pattern we want is something like this:

  • During shutdown, various resources or capabilities may become unavailable
  • If we try to access those resources or capabilities then we should fail instead of getting stuck

In this case I think you've narrowed it down to the hiedb connection, so I think what we want is something like:

  • During shutdown, the database connection will become unavailable
  • If we try to access the database connection after this, we should fail

What you're suggesting with the ideVar is one possible way of doing this: we remove the handle that our code uses to find the resources. I see two problems with it:

  1. I'm sure in many other places we use readMVar. Now these will hang if they get triggered during shutdown. If they would all throw then I would be okay with it.
  2. This prevents things from working that could work. Maybe that doesn't matter but I could imagine it causing other issues (e.g. what if we had code to flush the values of persistent rules at shutdown and now it can't access the rule system?)

So I think it would be best to try and localise the guard more. Ideally we would just make the thing that currently hangs fail instead. I don't know exactly where that is, but maybe you do?
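A minimal sketch of the "fail instead of hang" guard described above, assuming we wrap the connection in a handle that is explicitly closed during shutdown (Guarded, withGuarded, and ResourceClosed are hypothetical names, not the actual hiedb API):

```haskell
import Control.Concurrent.STM
import Control.Exception (Exception, throwIO, try)

-- Hypothetical "fail fast" resource handle: after close, any access
-- throws instead of blocking on a resource that will never come back.
data ResourceClosed = ResourceClosed deriving Show
instance Exception ResourceClosed

newtype Guarded a = Guarded (TVar (Maybe a))

newGuarded :: a -> IO (Guarded a)
newGuarded x = Guarded <$> newTVarIO (Just x)

-- Shutdown: mark the resource unavailable.
closeGuarded :: Guarded a -> IO ()
closeGuarded (Guarded v) = atomically (writeTVar v Nothing)

-- Access: throw ResourceClosed rather than hanging on a dead resource.
withGuarded :: Guarded a -> (a -> IO b) -> IO b
withGuarded (Guarded v) k = do
    m <- readTVarIO v
    case m of
        Nothing -> throwIO ResourceClosed
        Just x  -> k x

main :: IO ()
main = do
    g <- newGuarded "db-conn"
    closeGuarded g
    -- Accessing the handle after shutdown fails loudly and immediately.
    r <- try (withGuarded g putStrLn) :: IO (Either ResourceClosed ())
    print r
```

Because withGuarded throws synchronously, a caller such as the config handler fails with a visible exception instead of blocking, which is exactly the localised guard behaviour being proposed.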

@soulomoon (Collaborator, Author) commented Apr 2, 2024

>   • During shutdown, the database connection will become unavailable
>   • If we try to access the database connection after this, we should fail.

Yes, I agree with this.

In summary, we actually have two problems to solve here:

  • First, access to the database connection should fail if the connection is closed, instead of hanging. (This is rooted in runWithDb and could be fixed there? Unfortunately I am not familiar with it; maybe @wz1000 or @jhrcek are.)
  • Second, the config update callback should not be invoked after the shutdown handler has been called (it not only tries to access the db connection, it also tries to restart the shake session).

> I'm sure in many other places we use readMVar. Now these will hang if they get triggered during shutdown. If they would all throw then I would be okay with it.
> This prevents things from working that could work. Maybe that doesn't matter but I could imagine it causing other issues (e.g. what if we had code to flush the values of persistent rules at shutdown and now it can't access the rule system?)

I have updated this branch with the guard fix. IMO this resolves your concerns, @michaelpj: after we put the ideState value into it, ideStateVar has only one user, the config update callback.

  • Currently, ideStateVar guards against the config update callback running before initialization (and it provides access to ideState).

After the guard fix there would be a second user, the shutdown handler, so ideStateVar would gain a second function:

  • Guard against the config update callback running after shutdown.

Having written the above, I believe ideStateVar is an ideal guard for solving the second problem.

@michaelpj (Collaborator):

> Access to database connection should fail if it is closed instead of hang.

Did you manage to figure out exactly which function is hanging?

> The config update callback should not be invoke after the shutdown handler is called (it is not only trying to access the db connection, it would also try to restart the shake session).

Again, I'm not sure this is necessarily a problem. What happens if we restart the shake session after it is closed? It should fail! So I think it's fine if we allow "higher-level" things to run, so long as the "lower-level" things that can't work fail appropriately.

> So there would be a second functionality for the ideStateVar.

... and now this is an invariant we have to remember to use whenever we work with ideStateVar! In that way it makes our lives more complicated, I feel.

@soulomoon (Collaborator, Author) commented Apr 2, 2024

> Again, I'm not sure this is necessarily a problem. What happens if we restart the shake session after it is closed? It should fail! So I think it's fine if we allow "higher-level" things to run, so long as the "lower-level" things that can't work fail appropriately.

🤔 If we only make the "lower-level" fail instead of hanging, the "higher-level" (the main LSP thread handling the callback) would still fail. We would be left with an error like this:

```
Test suite hls-refactor-plugin-tests: RUNNING...
refactor
  initialize response capabilities
       code action:     FAIL
      Exception: onConfigChange: db already shutdown
```

> Did you manage to figure out exactly which function is hanging?

The exact chain that hangs is onConfigChange -> setSomethingModified -> writeTQueue. runWithDb offers two values, WithHieDbShield withHieDb and hieChan, to the outside world; if the thread running runWithDb is done, writing to hieChan hangs (so, to be precise, it is not exactly access to the database connection). I don't understand what is going on underneath the writeTQueue.

Err... the truth is even wilder than I imagined: the reactor thread does not even have time to ship out putMVar dbMVar (WithHieDbShield withHieDb', hieChan') before it is cancelled by the shutdown handler.

And restarting the shake session hangs too, just not as commonly.

@soulomoon (Collaborator, Author) commented Apr 3, 2024

I created a minimal example that reproduces the hanging:

```haskell
{-# LANGUAGE LambdaCase #-}

import Control.Concurrent.STM
import Control.Monad (forever)
import Data.Functor (void)
import Control.Monad.IO.Unlift (MonadUnliftIO)
import UnliftIO.Concurrent
import UnliftIO (async, waitAnyCancel)
import GHC.IO (unsafeInterleaveIO)
import System.Timeout (timeout)

type Que = TQueue String

runWithDb :: (Que -> IO ()) -> IO ()
runWithDb k = do
    q <- newTQueueIO
    k q

main :: IO ()
main = do
    queVar <- newEmptyMVar
    lifetime <- newEmptyMVar
    die <- newEmptyMVar
    mainQue <- newTQueueIO
    -- The takeMVar is deferred by unsafeInterleaveIO: it only runs when
    -- que is first forced, far away from this call site.
    ~que <- unsafeInterleaveIO $ takeMVar queVar
    -- "Reactor" thread: fills queVar only after a delay.
    _ <- flip forkFinally (const $ pure ()) $ do
        untilMVar lifetime $ runWithDb $ \q -> do
            threadDelay 1000000
            putMVar queVar q
            forever $ do
                putStrLn "Waiting for message"
                msg <- atomically $ readTQueue mainQue
                putStrLn $ "Received: " ++ msg
    -- "Shutdown": kills the reactor before it can fill queVar.
    putMVar lifetime ()
    _ <- forkIO $ do
        -- Forcing que here runs the deferred takeMVar, which now hangs
        -- forever because queVar will never be filled.
        x <- atomically $ readTQueue que
        putStrLn $ "Received: " ++ x
        putMVar die ()
    timeout 3000000 (takeMVar die) >>= \case
        Just _ -> putStrLn "Exiting"
        Nothing -> putStrLn "Timeout in 3 s"
    return ()

untilMVar :: MonadUnliftIO m => MVar () -> m () -> m ()
untilMVar mvar io = void $
    waitAnyCancel =<< traverse async [io, readMVar mvar]
```

@michaelpj (Collaborator):

> We would be left with an error like this:

Sure, but then we can work from there, e.g. we can ask: should we be catching exceptions during shutdown, or in the config handler, or whatever. At least the problem is not masked!

> if the thread running runWithDb is done, the writing to hieChan would hang (not exactly the access to the database connection, for correction).

Huh, that is weird. AFAIK writing to a TChan shouldn't hang, ever. I'll take a look at your reproduction code...

I also note that there's some stuff in there about retryIfSqlitBusy that looks like it might retry for up to 10 seconds, which might be a problem?

@soulomoon (Collaborator, Author):

> AFAIK writing to a TChan shouldn't hang, ever. I'll take a look at your reproduction code...

You are right that writing to the TChan won't hang once the TChan is actually computed.

We have a very interesting situation here: the problematic part is ~que <- unsafeInterleaveIO $ takeMVar queVar, which delays the computation of que. x <- atomically $ readTQueue que hangs when it tries to actually compute que.
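The behaviour can be distilled even further. In this sketch (all names hypothetical) the unsafeInterleaveIO defers the takeMVar until x is forced, and the thread that would fill the MVar never gets there in time, standing in for the cancelled reactor thread:

```haskell
import Control.Concurrent (MVar, forkIO, newEmptyMVar, putMVar, takeMVar, threadDelay)
import Control.Exception (evaluate)
import Data.Maybe (isNothing)
import System.IO.Unsafe (unsafeInterleaveIO)
import System.Timeout (timeout)

main :: IO ()
main = do
    v <- newEmptyMVar :: IO (MVar Int)
    -- Deferred: the takeMVar does NOT run here, only when x is forced.
    x <- unsafeInterleaveIO (takeMVar v)
    putStrLn "bound x without blocking"
    -- The "reactor": in the real bug it is cancelled before filling v;
    -- here it simply never gets there in time.
    _ <- forkIO (threadDelay 10000000 >> putMVar v 1)
    -- Forcing x runs the deferred takeMVar, which blocks far from the
    -- original call site.
    r <- timeout 100000 (evaluate x)
    putStrLn (if isNothing r then "forcing x hung" else "x was available")
```

The surprising part is exactly what the thread describes: the block happens not where que is bound, but where it is first used.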

@michaelpj (Collaborator):

> when it tries to actually compute the que and write to it.

I also am not sure how that can happen! Surely creating the queue and then writing to it should be fine?

@soulomoon (Collaborator, Author) commented Apr 3, 2024

Notice that the computation of que here is takeMVar queVar, and queVar might never be filled with the queue value if the thread that is supposed to fill it exits early.

I am not sure why the unsafeInterleaveIO exists in the first place; this delay is causing a lot of confusion. 🤔

AFAIK it should be fine if we remove the unsafeInterleaveIO, as in 61b4808. Then during initialization the main thread actually waits for the reactor thread to create and deliver the queue.
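A sketch of the eager variant (the shape of the idea, not the actual commit 61b4808): take the MVar during initialization, optionally under a timeout, so a dead reactor thread surfaces as an immediate, visible failure instead of a hidden thunk that hangs later.

```haskell
import Control.Concurrent (MVar, forkIO, newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.STM
import System.Timeout (timeout)

main :: IO ()
main = do
    queVar <- newEmptyMVar :: IO (MVar (TQueue String))
    -- Reactor thread: creates the queue and hands it over immediately.
    _ <- forkIO (newTQueueIO >>= putMVar queVar)
    -- Eager takeMVar during initialisation: either we get the queue now,
    -- or the failure is visible right here, not deferred to first use.
    mq <- timeout 1000000 (takeMVar queVar)
    case mq of
        Nothing -> putStrLn "reactor died before delivering the queue"
        Just q  -> do
            atomically (writeTQueue q "setSomethingModified")
            msg <- atomically (readTQueue q)
            putStrLn ("round-tripped: " ++ msg)
```

The trade-off, as discussed below in the thread, is that initialization now genuinely blocks until the reactor has started, which is presumably the invariant the original unsafeInterleaveIO was trying to avoid.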

@michaelpj (Collaborator):

Yeah, that's super weird and I don't know why we do it like that. If it works the other way that looks reasonable to me, what do you think @wz1000 ?

In general it seems like anywhere we have a takeMVar we have a potential hang.

The other question of course is why the hanging thread doesn't get killed. I think it's probably because we process notifications synchronously, so if we get stuck processing a notification we will never see the exit notification that comes in. Tricky!

I think we could also make this simpler on the lsp side. The spec says that after shutdown we are allowed to ignore notifications and answer requests with errors, so lsp could keep track of whether we have been shut down and, if so, block all further requests and notifications. That would stop us getting into this situation.
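A sketch of that lsp-side idea, with hypothetical names (ServerState, dispatch; this is not the actual lsp library API): record shutdown in a TVar and drop anything that arrives afterwards.

```haskell
import Control.Concurrent.STM
import Data.IORef

-- Hypothetical server state: a single flag recording whether the
-- shutdown request has been received.
newtype ServerState = ServerState { shuttingDown :: TVar Bool }

newServerState :: IO ServerState
newServerState = ServerState <$> newTVarIO False

-- The shutdown handler flips the flag.
markShutdown :: ServerState -> IO ()
markShutdown st = atomically (writeTVar (shuttingDown st) True)

-- Run a handler only if we have not been shut down; otherwise drop it,
-- as the LSP spec permits for post-shutdown messages other than exit.
dispatch :: ServerState -> IO () -> IO ()
dispatch st handler = do
    down <- readTVarIO (shuttingDown st)
    if down then pure () else handler

main :: IO ()
main = do
    st <- newServerState
    n <- newIORef (0 :: Int)
    dispatch st (modifyIORef n (+ 1))   -- runs
    markShutdown st
    dispatch st (modifyIORef n (+ 1))   -- dropped
    readIORef n >>= print               -- prints 1
```

With this guard in front of the handler dispatch, a late config-change notification never reaches the code that needs the db connection in the first place.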

@soulomoon (Collaborator, Author) commented Apr 7, 2024

@michaelpj I see this was discussed in the biweekly meeting; is there anything new you have come up with?

@michaelpj (Collaborator):

That was just about the generic shutdown stuff. I would like to understand why there is a hang here. But it seems like at least one thing we should try is your commit that gets rid of the weirdness. Does that pass the tests? And does it solve the problem?

(I still think a comment from @wz1000 would be helpful here.)

@soulomoon (Collaborator, Author) commented Apr 8, 2024

Yes, it passes the tests and solves the problem.

Simply put, it hangs because we try to takeMVar the queVar (which should contain the db's write chan), but queVar is empty (the reactor thread is shut down before it puts the write chan into queVar). What's worse, this takeMVar happens inside the update callback rather than at the call site of takeMVar, because we use unsafeInterleaveIO to delay the takeMVar IO operation. That is why it is so confusing.

So two things have been done here so far:

  1. The update callback no longer runs after shutdown.
  2. Removal of unsafeInterleaveIO: during initialization we now wait until the reactor thread (the one that creates the db connection) puts the db's write chan into queVar.

The main concern is that the removal of unsafeInterleaveIO might break some invariant that we do not see. I agree a comment from @wz1000 would be helpful here.

I would also like to know whether there could be unwanted side effects from shipping the update callback to run on the reactor thread.

Successfully merging this pull request may close these issues.

Tests that are blocked for 3 second waitting for the server to exit