
[CRASH] Clients running "reconnect" in the console after a WAD-switching related disconnect from the server can crash #949

Open
Acts19quiz opened this issue Apr 6, 2024 · 2 comments

@Acts19quiz

Acts19quiz commented Apr 6, 2024

Describe the bug
After bug #792 occurs, some (but not always all) clients get disconnected from the server that just switched WADs. After this disconnect, an affected client goes straight to the console. Immediately running reconnect in the console to rejoin the game can cause the client to crash without an error message (i.e., the Odamex window just disappears).

Build that the bug occurred in
10.4.0 G46E0E1-8728

To Reproduce

  1. Play an online game until you experience bug #792, "[BUG] WAD changing-related crash" (being 'disconnected-to-console' from a netgame after a server switches WADs)
  2. Immediately in the console, run reconnect (as a client)
  3. Reproduction might not be consistent or occur (see Additional context below)

Expected behavior
To rejoin the server you just disconnected from.

Screenshots, NetDemos, & Crash Dumps
The first crash from Thursday (crash dump with an associated netdemo):
odamex_g46e0e1_29968_20240405T005957.dmp
Odamex_HORDE_20240404_195957_ODAHORDE_230421.WAD_MAP09.zip
The second crash from Friday:
odamex_g46e0e1_5196_20240405T222342.dmp
Odamex_HORDE_20240405_172341_HORDAMEX.WAD_MAP13.zip

For context, I have my Odamex client set up to automatically start recording a netdemo as soon as a netgame starts. The short netdemo is the result of the very brief game before the crash.

The netdemos require DOOM2.WAD, ODAHORDE_230421.WAD, and HORDAMEX.WAD.

Additional context
https://www.twitch.tv/videos/2111767444?t=113m30s Here is a video of the second crash from the POV of another player (Hekksy), who also got disconnected. At the 1:53:40 mark, you can see me (in the upper-left corner) briefly reconnect, then immediately crash, using the same command shown at 1:53:30. As you can see, reproduction of the crash isn't consistent: under the same sequence of events, you might or might not crash when running the reconnect command after a disconnect-to-console event upon WAD switching.

@loopfz
Contributor

loopfz commented Apr 6, 2024

I'll attempt an explanation for the crash that sometimes happens in Horde mode, specifically when switching from a Hordamex map to an Odahorde map.

It happened on my koholint Horde servers and I've spent some time tracking it down.

Stack trace captured in gdb:

Breakpoint 1, 0x00007ffff7cd1968 in std::__throw_out_of_range_fmt(char const*, ...) () from /lib/x86_64-linux-gnu/libstdc++.so.6
(gdb) bt
#0  0x00007ffff7cd1968 in std::__throw_out_of_range_fmt(char const*, ...) ()
   from /lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00005555557be823 in std::vector<hordeDefine_t, std::allocator<hordeDefine_t> >::_M_range_check (this=<optimized out>, __n=<optimized out>)
    at /usr/include/c++/12/bits/stl_vector.h:1153
#2  std::vector<hordeDefine_t, std::allocator<hordeDefine_t> >::at (
    __n=<optimized out>, this=<optimized out>)
    at /usr/include/c++/12/bits/stl_vector.h:1175
#3  G_HordeDefine (id=<optimized out>)
    at /home/doom/dev/odamex/common/g_horde.cpp:515
#4  HordeState::serialize (this=<optimized out>)
    at /home/doom/dev/odamex/common/p_horde.cpp:404
#5  P_HordeInfo () at /home/doom/dev/odamex/common/p_horde.cpp:728
#6  SV_UpdateGametype (pl=...)
    at /home/doom/dev/odamex/server/src/sv_main.cpp:2887
#7  0x00005555557bef30 in SV_WriteCommands ()
    at /home/doom/dev/odamex/server/src/sv_main.cpp:3189
#8  0x00005555557c096f in SV_StepTics (count=0)
    at /home/doom/dev/odamex/server/src/sv_main.cpp:4200
#9  0x00005555557ca258 in SV_RunTics ()
    at /home/doom/dev/odamex/server/src/sv_main.cpp:4259
#10 0x00005555556ce208 in CappedTaskScheduler::run (this=0x5555560614f0)
    at /home/doom/dev/odamex/common/d_main.cpp:1012

What's happening: an out-of-bounds exception when serializing the Horde state to update clients.
SV_UpdateGametype -> P_HordeInfo -> serialize -> and the critical line is

                info.legacyID = G_HordeDefine(m_defineID).legacyID;

Here, m_defineID can be too large, causing the OOB when accessing the vector WAVE_DEFINES.
At first I suspected a bug in the code that initializes and handles m_defineID, but it turns out that's not the case: the issue is a race condition.

In the meantime, the intermission is ending and the new wad gets loaded.
We go through G_LoadWad -> D_DoomWadReboot -> D_Init -> G_ParseHordeDefs
This clears WAVE_DEFINES, but m_defineID stays the same for the time being.
In the case of a switch from Hordamex to Odahorde, we go from many wave defines to... fewer wave defines! So if we had reached a high wave number before the WAD switch (and our m_defineID was therefore near the top of the range), chances are high it is now OOB given the much lower define count of Odahorde.

m_defineID only gets reassigned in P_RunHordeTics() on the first tic. So there's a small window for the crash to happen.
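
To make the failure mode concrete, here is a minimal standalone sketch (not Odamex code) of an index that goes stale when a vector is cleared and refilled with fewer entries. WaveDefine, g_waveDefines, parseDefines, lookupDefine and staleIndexThrows are hypothetical stand-ins for hordeDefine_t, WAVE_DEFINES, G_ParseHordeDefs, G_HordeDefine and the serialize path; the define counts passed in are arbitrary.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for hordeDefine_t; only the field used here.
struct WaveDefine {
    int legacyID;
};

// Hypothetical stand-in for WAVE_DEFINES.
static std::vector<WaveDefine> g_waveDefines;

// Stand-in for G_ParseHordeDefs: clears the list and refills it.
void parseDefines(std::size_t count) {
    g_waveDefines.clear();
    for (std::size_t i = 0; i < count; ++i)
        g_waveDefines.push_back(WaveDefine{static_cast<int>(i)});
}

// Stand-in for G_HordeDefine: bounds-checked lookup via vector::at,
// which throws std::out_of_range on a bad index.
const WaveDefine& lookupDefine(std::size_t id) {
    return g_waveDefines.at(id);
}

// Returns true if a lookup with the old index throws after the reload.
bool staleIndexThrows(std::size_t oldCount, std::size_t newCount) {
    if (oldCount == 0)
        return false; // no valid index to keep around

    parseDefines(oldCount);
    std::size_t defineID = oldCount - 1; // a high wave was reached
    (void)lookupDefine(defineID);        // valid before the WAD switch

    parseDefines(newCount); // WAD switch: list rebuilt, index untouched

    try {
        (void)lookupDefine(defineID); // what serialize does with m_defineID
    } catch (const std::out_of_range&) {
        return true;
    }
    return false;
}
```

With these stand-ins, going from a larger list to a smaller one (for example staleIndexThrows(40, 10)) throws, while going the other way does not, which matches the crash showing up only in one switch direction.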

To fix it, I'd suggest either synchronizing both events and calling HordeDirector.reset() earlier, and/or shielding the serialize code from the crash itself by skipping the code in SV_UpdateGametype entirely if gamestate == GS_INTERMISSION, and optionally catching the OOB around the serialize call specifically.
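
A rough sketch of what those two suggestions could look like, again as standalone code rather than a patch against Odamex. HordeStateSketch, GameState, reloadDefines and trySerializeLegacyID are hypothetical names. The resetIndex flag exists only so the sketch can show both the current ordering (index left alone) and the "reset earlier" ordering, and an explicit bounds check stands in for catching the exception.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for hordeDefine_t.
struct WaveDefine {
    int legacyID;
};

// Hypothetical stand-in for the gamestate values relevant here.
enum class GameState { Level, Intermission };

class HordeStateSketch {
public:
    // Option 1: reset the index in the same step that replaces the defines,
    // so the two can never disagree. resetIndex=false mimics the reported
    // ordering, where the index is only reassigned on a later tic.
    void reloadDefines(std::vector<WaveDefine> defines, bool resetIndex) {
        m_defines = std::move(defines);
        if (resetIndex)
            m_defineID = 0;
    }

    void setDefineID(std::size_t id) { m_defineID = id; }

    // Option 2: shield the serialize path. Skip during intermission and
    // refuse an out-of-range index instead of letting at() throw.
    bool trySerializeLegacyID(GameState state, int& out) const {
        if (state == GameState::Intermission)
            return false;
        if (m_defineID >= m_defines.size())
            return false;
        out = m_defines[m_defineID].legacyID;
        return true;
    }

private:
    std::vector<WaveDefine> m_defines;
    std::size_t m_defineID = 0;
};
```

The guard alone only hides the stale index for a tic or two, while resetting alongside the reload removes the window entirely, so the two are complementary rather than alternatives.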

@AlexMax AlexMax self-assigned this Apr 20, 2024
@AlexMax
Contributor

AlexMax commented Apr 20, 2024

Good eye! I think your first approach makes more sense, but we'll see if there are any straggler issues.
