
Fix hanging after exit is sent #4152 #4153

Draft: wants to merge 68 commits into master
Conversation

@soulomoon (Collaborator) commented Mar 25, 2024

Trying to fix #4152

The reactor thread offers a db connection to hieDb; if the thread exits, the db connection exits with it. Operations that need the db connection should therefore only run while the reactor thread is still running.

Otherwise a race condition can happen: the shutdown handler stops the reactor thread and cancels the shake session. If an operation is still trying to reach the db (in this case, the config update callback calling setSomethingModified, which needs the db connection), it hangs HLS forever.

This PR ships the config update callback to the reactor thread to fix the deadlock. An alternative would be to make the shutdown handler take away the ideStateVar.
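As a sketch of the "ship to the reactor" idea (names like ReactorQueue and shipToReactor are hypothetical illustrations, not the actual HLS code): work that needs the db connection is enqueued onto the reactor's queue, so it can only run while the reactor thread is alive; once the reactor is cancelled, pending work is simply dropped instead of touching a dead connection.

```haskell
import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.STM
import Control.Monad (forever, join)
import System.Timeout (timeout)

-- Hypothetical sketch: the reactor owns the db connection, so any action
-- that needs it is shipped to the reactor's queue instead of running on
-- the caller's thread.
newtype ReactorQueue = ReactorQueue (TQueue (IO ()))

-- Enqueue an action to run on the reactor thread.
shipToReactor :: ReactorQueue -> IO () -> IO ()
shipToReactor (ReactorQueue q) act = atomically (writeTQueue q act)

-- The reactor loop runs queued actions while it is alive. When the
-- shutdown handler kills this thread, queued actions are dropped rather
-- than executed against a dead db connection.
reactorLoop :: ReactorQueue -> IO ()
reactorLoop (ReactorQueue q) = forever (join (atomically (readTQueue q)))

main :: IO ()
main = do
    rq <- ReactorQueue <$> newTQueueIO
    _ <- forkIO (reactorLoop rq)
    done <- newEmptyMVar
    -- The "config update callback" is shipped to the reactor rather
    -- than run on the caller's thread.
    shipToReactor rq (putMVar done ())
    r <- timeout 1000000 (takeMVar done)
    print r
```

The point of the pattern is that liveness of the reactor thread and validity of the db connection are coupled by construction.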

@soulomoon soulomoon changed the title surface the problem WIP to fix #4152 Mar 25, 2024
@soulomoon soulomoon added the WIP label Mar 25, 2024
@soulomoon (Collaborator, Author) commented Mar 26, 2024

We need to send the config update to the reactor thread to avoid the race condition.
More precisely, setSomethingModified should run on the reactor thread, since the WithDb and IndexQueue offered by runWithDb are disconnected once the reactor thread shuts down.

@soulomoon soulomoon changed the title WIP to fix #4152 WIP to fix #4152, Hanging after exit is sent Mar 26, 2024
@soulomoon soulomoon changed the title WIP to fix #4152, Hanging after exit is sent WIP to fix hanging after exit is sent #4152 Mar 26, 2024
@soulomoon soulomoon added status: needs review This PR is ready for review and removed WIP labels Mar 26, 2024
@soulomoon soulomoon marked this pull request as ready for review March 26, 2024 11:42
@soulomoon soulomoon changed the title WIP to fix hanging after exit is sent #4152 Fix hanging after exit is sent #4152 Mar 26, 2024
@jhrcek (Collaborator) commented Mar 26, 2024

I don't understand how the fix works (I'll leave the review to people who understand it better), but it's great that you managed to resolve it.
Do you have any estimate of how much time this saves across one test-suite run?

@soulomoon (Collaborator, Author) commented Mar 26, 2024

> Do you have any estimate how much time this saves across one test suite run?

I have not measured it, but I don't think it saves much: only about 3 seconds each time a server tries to close while a config update is in flight. Most tests don't update the config, and there may be enough time before exit for the initial config to load anyway. So rather than fulfilling the initial motivation of speeding up the test suite, this PR is a bug fix that prevents a potential deadlock during the shutdown and exit of HLS.

@michaelpj (Collaborator):

Hmm, this is tricky. I'm not sure this is quite the right fix. Can we make it so that we throw in this case instead of hanging?

@soulomoon (Collaborator, Author) commented Mar 27, 2024

Alternatively, we could make the shutdown handler take away the ideStateVar to prevent the race condition. This would not change the thread the config update callback runs on, since we have only one user of this ideStateVar.
Do you think the alternative would be a better approach? @michaelpj
soulomoon#5

Either way, I think we should fix this race condition.

@michaelpj (Collaborator):

I think that's a better approach. But I think maybe we can work at a lower level of granularity? Basically, I think the pattern we want is something like this:

  • During shutdown, various resources or capabilities may become unavailable
  • If we try to access those resources or capabilities then we should fail instead of getting stuck

In this case I think you've narrowed it down to the hiedb connection, so I think what we want is something like:

  • During shutdown, the database connection will become unavailable
  • If we try to access the database connection after this, we should fail

What you're suggesting with the ideVar is one possible way of doing this: we remove the handle that our code uses to find the resources. I see two problems with it:

  1. I'm sure in many other places we use readMVar. Now these will hang if they get triggered during shutdown. If they would all throw then I would be okay with it.
  2. This prevents things from working that could work. Maybe that doesn't matter but I could imagine it causing other issues (e.g. what if we had code to flush the values of persistent rules at shutdown and now it can't access the rule system?)

So I think it would be best to try and localise the guard more. Ideally we would just make the thing that currently hangs fail instead. I don't know exactly where that is, but maybe you do?
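A minimal sketch of the "fail instead of hang" guard described above, assuming we wrap the connection in a handle that is explicitly closed during shutdown (Guarded, withGuarded, and ResourceClosed are hypothetical names, not the actual hiedb API):

```haskell
import Control.Concurrent.STM
import Control.Exception (Exception, throwIO, try)

-- Hypothetical "fail fast" resource handle: after close, any access
-- throws instead of blocking on a resource that will never come back.
data ResourceClosed = ResourceClosed deriving Show
instance Exception ResourceClosed

newtype Guarded a = Guarded (TVar (Maybe a))

newGuarded :: a -> IO (Guarded a)
newGuarded x = Guarded <$> newTVarIO (Just x)

-- Shutdown: mark the resource unavailable.
closeGuarded :: Guarded a -> IO ()
closeGuarded (Guarded v) = atomically (writeTVar v Nothing)

-- Access: throw ResourceClosed rather than hanging on a dead resource.
withGuarded :: Guarded a -> (a -> IO b) -> IO b
withGuarded (Guarded v) k = do
    m <- readTVarIO v
    case m of
        Nothing -> throwIO ResourceClosed
        Just x  -> k x

main :: IO ()
main = do
    g <- newGuarded "db-conn"
    closeGuarded g
    -- Accessing the handle after shutdown fails loudly and immediately.
    r <- try (withGuarded g putStrLn) :: IO (Either ResourceClosed ())
    print r
```

Because withGuarded throws synchronously, a caller such as the config handler fails with a visible exception instead of blocking, which is exactly the localised guard behaviour being proposed.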

@soulomoon (Collaborator, Author) commented Apr 2, 2024

>   • During shutdown, the database connection will become unavailable
>   • If we try to access the database connection after this, we should fail.

Yes, I agree with this.

In summary, we actually have two problems to solve here:

  • First, access to the database connection should fail if the connection is closed, instead of hanging. (This is rooted in runWithDb and could be fixed there? Unfortunately I am not familiar with it; maybe @wz1000 or @jhrcek are.)
  • Second, the config update callback should not be invoked after the shutdown handler has been called (it not only tries to access the db connection, it also tries to restart the shake session).

> I'm sure in many other places we use readMVar. Now these will hang if they get triggered during shutdown. If they would all throw then I would be okay with it.
> This prevents things from working that could work. Maybe that doesn't matter but I could imagine it causing other issues (e.g. what if we had code to flush the values of persistent rules at shutdown and now it can't access the rule system?)

I have updated this branch with the guard fix. IMO this resolves your concerns, @michaelpj: after we put the ideState value into it, ideStateVar has only one user, the config update callback.

  • Currently, ideStateVar guards against the config update callback running before initialization (and it provides access to ideState).

After the guard fix there would be a second user, the shutdown handler, so ideStateVar would gain a second function:

  • Guard against the config update callback running after shutdown.

Having written the above, I believe ideStateVar is an ideal guard for solving the second problem.

@michaelpj (Collaborator):

> Access to database connection should fail if it is closed instead of hang.

Did you manage to figure out exactly which function is hanging?

> The config update callback should not be invoke after the shutdown handler is called (it is not only trying to access the db connection, it would also try to restart the shake session).

Again, I'm not sure this is necessarily a problem. What happens if we restart the shake session after it is closed? It should fail! So I think it's fine if we allow "higher-level" things to run, so long as the "lower-level" things that can't work fail appropriately.

> So there would be a second functionality for the ideStateVar.

... and now this is an invariant we have to remember to use whenever we work with ideStateVar! In that way it makes our lives more complicated, I feel.

@soulomoon (Collaborator, Author) commented Apr 2, 2024

> Again, I'm not sure this is necessarily a problem. What happens if we restart the shake session after it is closed? It should fail! So I think it's fine if we allow "higher-level" things to run, so long as the "lower-level" things that can't work fail appropriately.

🤔 If we only make the "lower-level" fail instead of hanging, the "higher-level" (the main LSP thread handling the callback) would still fail. We would be left with an error like this:

```
Test suite hls-refactor-plugin-tests: RUNNING...
refactor
  initialize response capabilities
       code action:     FAIL
      Exception: onConfigChange: db already shutdown
```

> Did you manage to figure out exactly which function is hanging?

The exact chain that hangs is onConfigChange -> setSomethingModified -> writeTQueue. runWithDb offers two values, WithHieDbShield withHieDb and hieChan, to the outside world; if the thread running runWithDb is done, writing to hieChan hangs (so, to be precise, it is not exactly access to the database connection). I don't understand what is going on underneath the writeTQueue.

Err... the truth is even wilder than I imagined: the reactor thread does not even have time to ship out putMVar dbMVar (WithHieDbShield withHieDb', hieChan') before it is cancelled by the shutdown handler.

And restarting the shake session hangs too, just not as commonly.

@soulomoon (Collaborator, Author) commented Apr 3, 2024

I created a minimal example that reproduces the hanging:

```haskell
{-# LANGUAGE LambdaCase #-}

import Control.Concurrent.STM
import Control.Monad (forever)
import Data.Functor (void)
import Control.Monad.IO.Unlift (MonadUnliftIO)
import UnliftIO.Concurrent
import UnliftIO (async, waitAnyCancel)
import GHC.IO (unsafeInterleaveIO)
import System.Timeout (timeout)

type Que = TQueue String

runWithDb :: (Que -> IO ()) -> IO ()
runWithDb k = do
    q <- newTQueueIO
    k q

main :: IO ()
main = do
    queVar <- newEmptyMVar
    lifetime <- newEmptyMVar
    die <- newEmptyMVar
    mainQue <- newTQueueIO
    -- The takeMVar is deferred by unsafeInterleaveIO: it only runs when
    -- que is first forced, far away from this call site.
    ~que <- unsafeInterleaveIO $ takeMVar queVar
    -- "Reactor" thread: fills queVar only after a delay.
    _ <- flip forkFinally (const $ pure ()) $ do
        untilMVar lifetime $ runWithDb $ \q -> do
            threadDelay 1000000
            putMVar queVar q
            forever $ do
                putStrLn "Waiting for message"
                msg <- atomically $ readTQueue mainQue
                putStrLn $ "Received: " ++ msg
    -- "Shutdown": kills the reactor before it can fill queVar.
    putMVar lifetime ()
    _ <- forkIO $ do
        -- Forcing que here runs the deferred takeMVar, which now hangs
        -- forever because queVar will never be filled.
        x <- atomically $ readTQueue que
        putStrLn $ "Received: " ++ x
        putMVar die ()
    timeout 3000000 (takeMVar die) >>= \case
        Just _ -> putStrLn "Exiting"
        Nothing -> putStrLn "Timeout in 3 s"
    return ()

untilMVar :: MonadUnliftIO m => MVar () -> m () -> m ()
untilMVar mvar io = void $
    waitAnyCancel =<< traverse async [io, readMVar mvar]
```

@michaelpj (Collaborator):

> We would be left with an error like this:

Sure, but then we can work from there, e.g. we can ask: should we be catching exceptions during shutdown, or in the config handler, or whatever. At least the problem is not masked!

> if the thread running runWithDb is done, the writing to hieChan would hang (not exactly the access to the database connection, for correction).

Huh, that is weird. AFAIK writing to a TChan shouldn't hang, ever. I'll take a look at your reproduction code...

I also note that there's some stuff in there about retryIfSqlitBusy that looks like it might retry for up to 10 seconds, which might be a problem?

@soulomoon (Collaborator, Author):

> AFAIK writing to a TChan shouldn't hang, ever. I'll take a look at your reproduction code...

You are right that writing to the TChan won't hang once the TChan is actually computed.

We have a very interesting situation here: the problematic part is ~que <- unsafeInterleaveIO $ takeMVar queVar, which delays the computation of que. x <- atomically $ readTQueue que hangs when it tries to actually compute que.
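The behaviour can be distilled even further. In this sketch (all names hypothetical) the unsafeInterleaveIO defers the takeMVar until x is forced, and the thread that would fill the MVar never gets there in time, standing in for the cancelled reactor thread:

```haskell
import Control.Concurrent (MVar, forkIO, newEmptyMVar, putMVar, takeMVar, threadDelay)
import Control.Exception (evaluate)
import Data.Maybe (isNothing)
import System.IO.Unsafe (unsafeInterleaveIO)
import System.Timeout (timeout)

main :: IO ()
main = do
    v <- newEmptyMVar :: IO (MVar Int)
    -- Deferred: the takeMVar does NOT run here, only when x is forced.
    x <- unsafeInterleaveIO (takeMVar v)
    putStrLn "bound x without blocking"
    -- The "reactor": in the real bug it is cancelled before filling v;
    -- here it simply never gets there in time.
    _ <- forkIO (threadDelay 10000000 >> putMVar v 1)
    -- Forcing x runs the deferred takeMVar, which blocks far from the
    -- original call site.
    r <- timeout 100000 (evaluate x)
    putStrLn (if isNothing r then "forcing x hung" else "x was available")
```

The surprising part is exactly what the thread describes: the block happens not where que is bound, but where it is first used.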

@michaelpj (Collaborator):

> when it tries to actually compute the que and write to it.

I also am not sure how that can happen! Surely creating the queue and then writing to it should be fine?

@soulomoon (Collaborator, Author) commented Apr 3, 2024

Notice that the computation of que here is takeMVar queVar, and queVar might never be filled with the queue value if the thread that is supposed to fill it exits early.

I am not sure why the unsafeInterleaveIO exists in the first place; this delay is causing a lot of confusion. 🤔

AFAIK it should be fine if we remove the unsafeInterleaveIO, as in 61b4808. Then during initialization the main thread actually waits for the reactor thread to create and deliver the queue.
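A sketch of the eager variant (the shape of the idea, not the actual commit 61b4808): take the MVar during initialization, optionally under a timeout, so a dead reactor thread surfaces as an immediate, visible failure instead of a hidden thunk that hangs later.

```haskell
import Control.Concurrent (MVar, forkIO, newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.STM
import System.Timeout (timeout)

main :: IO ()
main = do
    queVar <- newEmptyMVar :: IO (MVar (TQueue String))
    -- Reactor thread: creates the queue and hands it over immediately.
    _ <- forkIO (newTQueueIO >>= putMVar queVar)
    -- Eager takeMVar during initialisation: either we get the queue now,
    -- or the failure is visible right here, not deferred to first use.
    mq <- timeout 1000000 (takeMVar queVar)
    case mq of
        Nothing -> putStrLn "reactor died before delivering the queue"
        Just q  -> do
            atomically (writeTQueue q "setSomethingModified")
            msg <- atomically (readTQueue q)
            putStrLn ("round-tripped: " ++ msg)
```

The trade-off, as discussed below in the thread, is that initialization now genuinely blocks until the reactor has started, which is presumably the invariant the original unsafeInterleaveIO was trying to avoid.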

@michaelpj (Collaborator):

Yeah, that's super weird and I don't know why we do it like that. If it works the other way that looks reasonable to me, what do you think @wz1000 ?

In general it seems like anywhere we have a takeMVar we have a potential hang.

The other question of course is why the hanging thread doesn't get killed. I think it's probably because we process notifications synchronously, so if we get stuck processing a notification we will never see the exit notification that comes in. Tricky!

I think we could also make this simpler on the lsp side. The spec says that after shutdown we are allowed to ignore notifications and answer requests with errors, so lsp could keep track of whether we have been shut down and, if so, block all further requests and notifications. That would stop us getting into this situation.
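A sketch of that lsp-side idea, with hypothetical names (ServerState, dispatch; this is not the actual lsp library API): record shutdown in a TVar and drop anything that arrives afterwards.

```haskell
import Control.Concurrent.STM
import Data.IORef

-- Hypothetical server state: a single flag recording whether the
-- shutdown request has been received.
newtype ServerState = ServerState { shuttingDown :: TVar Bool }

newServerState :: IO ServerState
newServerState = ServerState <$> newTVarIO False

-- The shutdown handler flips the flag.
markShutdown :: ServerState -> IO ()
markShutdown st = atomically (writeTVar (shuttingDown st) True)

-- Run a handler only if we have not been shut down; otherwise drop it,
-- as the LSP spec permits for post-shutdown messages other than exit.
dispatch :: ServerState -> IO () -> IO ()
dispatch st handler = do
    down <- readTVarIO (shuttingDown st)
    if down then pure () else handler

main :: IO ()
main = do
    st <- newServerState
    n <- newIORef (0 :: Int)
    dispatch st (modifyIORef n (+ 1))   -- runs
    markShutdown st
    dispatch st (modifyIORef n (+ 1))   -- dropped
    readIORef n >>= print               -- prints 1
```

With this guard in front of the handler dispatch, a late config-change notification never reaches the code that needs the db connection in the first place.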

@soulomoon (Collaborator, Author) commented Apr 7, 2024

@michaelpj I see this was discussed in the biweekly meeting; is there anything new you have come up with?

@michaelpj (Collaborator):

That was just about the generic shutdown stuff. I would like to understand why there is a hang here. But it seems like at least one thing we should try is your commit that gets rid of the weirdness. Does that pass the tests? And does it solve the problem?

(I still think a comment from @wz1000 would be helpful here.)

@soulomoon (Collaborator, Author) commented Apr 8, 2024

Yes, it passes the tests and solves the problem.

Simply put, it hangs because we try to takeMVar the queVar (which should contain the db's write chan), but queVar is empty (the reactor thread is shut down before it puts the write chan into queVar). What's worse, this takeMVar happens inside the update callback rather than at the call site of takeMVar, because we use unsafeInterleaveIO to delay the takeMVar IO operation. That is why it is so confusing.

So two things have been done here so far:

  1. The update callback no longer runs after shutdown.
  2. Removal of unsafeInterleaveIO: during initialization we now wait until the reactor thread (the one that creates the db connection) puts the db's write chan into queVar.

The main concern is that the removal of unsafeInterleaveIO might break some invariant that we do not see. I agree a comment from @wz1000 would be helpful here.

I would also like to know whether there could be unwanted side effects from shipping the update callback to run on the reactor thread.

Successfully merging this pull request may close these issues.

Tests that are blocked for 3 second waitting for the server to exit