Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Qt 6.7: Weird IPC timeouts on Windows #8191

Open
The-Compiler opened this issue May 6, 2024 · 8 comments
Open

Qt 6.7: Weird IPC timeouts on Windows #8191

The-Compiler opened this issue May 6, 2024 · 8 comments
Milestone

Comments

@The-Compiler
Copy link
Member

Regarding the failure on windows

Had another crack at this today but didn't get anywhere:

  • I don't think it's user impacting, it seems there is some situation where we are starting the read timer but not stopping (or extending?) it, not sure where though
  • in the message above I said I saw those two log messages with the same socket ID in them close together. But I think that was a false lead because I'm seeing socket IDs re-used across multiple connections
  • the issue reproduces easy enough with just one test module (haven't tried narrowing it down to individual tests), which at least make debugging it in CI a bit faster: https://github.com/toofar/qutebrowser/tree/windows-ipc-debugging
  • also I re-removed the _old_socket handling in there to rule that out, it's still happening
  • I've had a go at going through the IPC server logic and noting down anywhere the timer got started or stopped to try to figure out where a gap in the logic could be, haven't spotted anywhere yet (I do see some "got connection but not handling it because we already have one" message, which seems odd the test process should be sending them one at a time, but even that case should be handled fine)
  • I guess another thing to do is compare log runs from windows before and after and see if that shows anything interesting (I tried to compare ubuntu vs windows but there was a lot of "probably normal" differences)

Originally posted by @toofar in #8175 (comment)

@The-Compiler
Copy link
Member Author

I can somewhat reliably reproduce via running qutebrowser and this script:

import time
import json
import getpass
import random
from PyQt6.QtNetwork import QLocalSocket

socket = QLocalSocket()

while True:
    name = f"qutebrowser-{getpass.getuser()}"
    socket.connectToServer(name)
    ok = socket.waitForConnected(1000)
    assert ok

    cmd = random.choice([":version", ":tab-close"])
    json_data = {"args": [cmd], "target_arg": None, "version": "3.1.0", "protocol_version": 1}
    line = json.dumps(json_data) + '\n'
    data = line.encode("utf-8")
    socket.writeData(data)
    ok = socket.waitForBytesWritten(1000)
    if not ok:
        print("!", end="", flush=True)
        continue
    assert socket.error() == QLocalSocket.LocalSocketError.UnknownSocketError
    socket.disconnectFromServer()
    if socket.state() != QLocalSocket.LocalSocketState.UnconnectedState:
        ok = socket.waitForDisconnected(1000)
        assert ok
    print(".", end="", flush=True)
    time.sleep(random.uniform(0.01, 0.2))

It seems to be easier to reproduce when I do some clicking in the UI, e.g. switching tabs.

The logging I added in db3d15d might still be a red herring, because the ID is fetched when the timeout function runs, and not when the timeout is started...

@The-Compiler
Copy link
Member Author

My best bet at getting a Windows C++ stack trace by inserting a breakpoint() and letting DebugDiag do its thing:

readline/Python stuff:

ntdll!NtDeviceIoControlFile+14	  
KERNELBASE!ConsoleCallServerGeneric+e9	  
KERNELBASE!ReadConsoleInternal+18d	  
KERNELBASE!ReadConsoleW+1a	  
python310!PyOS_Readline+542	  
python310!PyOS_Readline+38d	  
python310!PyOS_Readline+132	  
python310!Py_gitversion+63ba2	  
python310!Py_SetStandardStreamEncoding+807	  
python310!PyEval_EvalFrameDefault+1aa2	  
python310!PyEval_EvalFrameDefault+f65	  
python310!PyEval_EvalFrameDefault+f65	  
python310!PyEval_EvalFrameDefault+f65	  
python310!PyEval_EvalFrameDefault+f65	  
python310!PyEval_EvalFrameDefault+f65	  
python310!PyFunction_Vectorcall+87	  
python310!PyObject_GC_Malloc+1133	  
python310!PySys_GetSizeOf+141	  
python310!PyUnicode_ToUppercase+3ef	  
python310!PyUnicode_ToUppercase+20c	  
python310!PyUnicode_ToUppercase+186	  
python310!PyUnicode_ToUppercase+325	  
python310!PyEval_EvalFrameDefault+9338	  
python310!PyFunction_Vectorcall+87	  
python310!PyObject_GC_Malloc+1133	  
python310!Py_HandleSystemExit+80c	  
python310!Py_HandleSystemExit+58f	  
python310!PyVectorcall_Call+5c	  
python310!PyObject_Call+4f	  
QtCore!PyInit_QtCore+eb2f5	  
QtCore!PyInit_QtCore+eb73a	  
QtCore!PyInit_QtCore+eb7f4	  
QtCore!PyInit_QtCore+ed7ab	  
Qt6Core!QMetaObject::metacall+41	  
Qt6Core!QObject::qt_static_metacall+1842	  
Qt6Core!QMetaObject::activate+84	 

The interesting part:

Qt6Core!QTimer::timerEvent+cf	  
QtCore+66e56	  
Qt6Core!QObject::event+64	  
QtCore+590f3	  
Qt6Widgets!QApplicationPrivate::notify_helper+10e	  
Qt6Widgets!QApplication::notify+18c1	  
QtWidgets!PyInit_QtWidgets+1c3dcf	  
Qt6Core!QCoreApplication::notifyInternal2+11c	  
Qt6Core!QEventDispatcherWin32Private::sendTimerEvent+1a6	  
Qt6Core!QEventDispatcherWin32::event+1fc	  
Qt6Widgets!QApplicationPrivate::notify_helper+10e	  
Qt6Widgets!QApplication::notify+18c1	  
QtWidgets!PyInit_QtWidgets+1c3dcf	  
Qt6Core!QCoreApplication::notifyInternal2+11c	  
Qt6Core!QCoreApplicationPrivate::sendPostedEvents+209	  
Qt6Gui!QWindowsGuiEventDispatcher::sendPostedEvents+f	  
Qt6Core!QEventDispatcherWin32::processEvents+90	  
Qt6Gui!QWindowsGuiEventDispatcher::processEvents+19	  
Qt6Core!QEventLoop::exec+19f	  
Qt6Core!QCoreApplication::exec+15d	  

Init stuff:

QtWidgets!PyInit_QtWidgets+7c3e5	  
python310!PyObject_IsTrue+101	  
python310!PyEval_EvalFrameDefault+3ffe	  
python310!PyFunction_Vectorcall+87	  
python310!PyEval_EvalFrameDefault+707	  
python310!PyFunction_Vectorcall+87	  
python310!PyEval_EvalFrameDefault+1b88	  
python310!PyFunction_Vectorcall+87	  
python310!PyEval_EvalFrameDefault+1b88	  
python310!PyObject_FastCallDictTstate+15d	  
python310!PyEval_EvalCode+82	  
python310!PyNumber_Negative+45f	  
python310!PyNumber_Negative+35f	  
python310!PyEval_EvalFrameDefault+1aa2	  
python310!PyFunction_Vectorcall+87	  
python310!PyEval_EvalFrameDefault+707	  
python310!PyFunction_Vectorcall+87	  
python310!PyVectorcall_Call+5c	  
python310!PyObject_Call+4f	  
python310!PyArg_Parse+1ba	  
python310!Py_RunMain+13d	  
python310!Py_RunMain+15	  
python310!Py_Main+25	  
python+1230	  
kernel32!BaseThreadInitThunk+10	  
ntdll!RtlUserThreadStart+2b

@The-Compiler
Copy link
Member Author

The-Compiler commented May 6, 2024

I tried to improve the logging in f9ae555 and 245d117... and everything points to Windows/Qt/PyQt just immediately calling the timeout function after starting the timer.

Why? How?? Why (hopefully!) only this timer? I have no clue! I was unable to find anything that sounded remotely related. No bugreports, no changes that look remotely relevant, nothing.

I think this is user-facing: It happens before the IPC connection ever gets read, and then we disconnect, so we'd (even if rarely) sometimes miss IPC commands.

The best I can come up with is:

@@ -168,6 +168,7 @@ def __init__(self, socketname, parent=None):
         self._timer = usertypes.Timer(self, 'ipc-timeout')
         self._timer.setInterval(READ_TIMEOUT)
         self._timer.timeout.connect(self.on_timeout)
+        self._timer_start_time = None
 
         if utils.is_windows:  # pragma: no cover
             self._atime_timer = None
@@ -261,6 +262,7 @@ def handle_connection(self):
         log.ipc.debug("Client connected (socket {}).".format(self._socket_id))
         self._socket = socket
         self._timer.start()
+        self._timer_start_time = time.monotonic()
         socket.readyRead.connect(self.on_ready_read)
         if socket.canReadLine():
             log.ipc.debug("We can read a line immediately.")
@@ -393,11 +395,23 @@ def on_ready_read(self):
 
         if self._socket is not None:
             self._timer.start()
+            self._timer_start_time = time.monotonic()
 
     @pyqtSlot()
     def on_timeout(self):
         """Cancel the current connection if it was idle for too long."""
         assert self._socket is not None
+        assert self._timer_start_time is not None
+        if (
+            time.monotonic() - self._timer_start_time < READ_TIMEOUT / 1000 / 10
+            and qtutils.version_check("6.7.0", exact=True, compiled=False)
+            and utils.is_windows
+        ):
+            # WORKAROUND for unknown Qt bug (?) where the timer triggers immediately
+            # https://github.com/qutebrowser/qutebrowser/issues/8191
+            log.ipc.debug("Ignoring early on_timeout call")
+            return
+
         log.ipc.error("IPC connection timed out "
                       "(socket {}).".format(self._socket_id))
         self._socket.disconnectFromServer()

I hate it, but I spent the better part of my day with this now, and I can't come up with anything better. I think any further spelunking would require building Qt from source and firing up a C++ debugger to see what happens in Qt... but I'm exhausted and would rather move on to something more interesting.

@toofar thoughts?

@toofar
Copy link
Member

toofar commented May 6, 2024

That's very strange. Had a look at recent commits to QTimer and QLocalSocket and nothing jumped out at me. I had a thought that maybe it could be due to calls getting interleaved, like if handle_connection() gets called when we are in the middle of reading data from the socket. But there's nothing I could find in search results warning of anything like that and the docs for QTimer and QLocalSocket don't call it out either.

I guess if identifying and ignoring the situation is the best we can do then we can try that and look into it further if it comes up again.

A few questions:

  1. should we be setting singleShot() on this timer? The default is false right? (I tested and it didn't fix the issue though.)
  2. when the timer fires early, what does timer.remainingTime() report?
  3. assuming it is 0, should we be restarting the time when it has fired early?

I also worry that this will affect other timers, since we don't know what causes it. But since we haven't seen this issue on any of the linux test runs and people are already using 6.7 without complaint, I guess we hope it remains this isolated ¯_(ツ)_/¯

@The-Compiler
Copy link
Member Author

That's very strange. Had a look at recent commits to QTimer and QLocalSocket and nothing jumped out at me.

Agreed on both parts. I also looked at changes in qeventdispatcher_win.cpp which has the Windows-specific implementation, and Qt changelogs, and PyQt changelogs, and even diffed PyQt... no dice. Nothing I could find in Qt issues either. I have no idea what's going on there, but it does seem pretty isolated.

I had a thought that maybe it could be due to calls getting interleaved, like if handle_connection() gets called when we are in the middle of reading data from the socket. But there's nothing I could find in search results warning of anything like that and the docs for QTimer and QLocalSocket don't call it out either.

In f9ae555 I tried making sure that every connection is a separate object with its own QTimer. I like the result of that refactoring, but for now I want to do something different than poking at IPC stuff. I'll probably open an issue to finish that later™. However, it didn't help either.

I guess if identifying and ignoring the situation is the best we can do then we can try that and look into it further if it comes up again.

Agreed. I want to wait for Qt 6.7.1 before pushing it. It's now planned for next week apparently... but once it's out, I hope we'll be able to easily try it out with PyQt, as it should be binary compatible. This will probably delay v3.2.0 even further, but eh... Qt 6.7.0 seems to have other annoying issues on Windows, so it's probably for the better.

A few questions:

  1. should we be setting singleShot() on this timer? The default is false right? (I tested and it didn't fix the issue though.)

Good point. It shouldn't matter much because in on_timeout we call self._socket.disconnectFromServer(), and that calls the on_disconnected slot, which stops the timer.

  1. when the timer fires early, what does timer.remainingTime() report?

0

  1. assuming it is 0, should we be restarting the time when it has fired early?

See above - in theory yes, but since it's currently a recurring timer, it does already restart anyways.

I also worry that this will affect other timers, since we don't know what causes it. But since we haven't seen this issue on any of the linux test runs and people are already using 6.7 without complaint, I guess we hope it remains this isolated ¯\(ツ)

With 01cc3df and 31ab343 I tried to add a generic check to all timers that logs a warning. The CI run passed. I'm somewhat tempted to commit it and see how it fares over time (hah!). Then again, in case this issue just magically vanishes with Qt 6.7.1 again, we might as well pretend it was never there.

@The-Compiler
Copy link
Member Author

Well, I finally found something! [QTBUG-124496] QObject::timerEvent may be called before the timer elapsed - Qt Bug Tracker

I wonder if we have a single-shot QTimer running somewhere in the IPC code, which Qt confuses with our timeout timer? Or maybe QLocalS(ocket|erver) has one internally?

That also explains why anecdotally, "clicking around" in the UI made it reproduce faster, because that probably leads to more timers being used.

The bug also says Qt 6.5 in addition to 6.7, but the linked bug says 6.5.6 (or .5 in the comments?). So chances are it was a change that landed in Qt 6.7 but also was backported to 6.5 LTS?

Assuming it's still there in Qt 6.7.1, I suppose it might be a good idea to leave the warning logging in.

@toofar
Copy link
Member

toofar commented May 8, 2024

Ohh, good spotting! So on Windows the timers are backed by some Windows API and somehow the Windows things aren't cleared when the Qt ones are and then a new Qt one will re-use a still active windows thing? Hmm, that doesn't sound right. Does stopping a timer fire a timer event and that's what's being picked up by the re-used one? Maybe we could make a new timer object before killing the old one and that would resolve this? We do stop and start the timer a couple of times in this class.

Anyway, it's reassuring to know it's for sure a Windows only problem. And that Thiago is assigned to the ticket.

I tried to add a generic check to all timers

Damn I saw that and though "wait, we are trying to get a release out, have you heard of scope creep!". Then I looked at the commits and remembered we already have a QTimer wrapper 🤣
Looks pretty reasonable to me.

@toofar
Copy link
Member

toofar commented May 11, 2024

I pushed some more stuff to your branch! So the current set of stuff on there is:

  • incrementing socket IDs in the IPC server
  • ignore early timer firing for read timeout timer in IPC server
  • detect and log early timer firing in usertypes.Timer
  • switch all QTimer objects to be usertypes.Timer objects

The changes I made were to

  • refactor that "is this an early timer" check to be only in usertypes.timer and have the IPC server use that one
  • remove the is_windows and Qt6.7.0 checks (the checks should be a no-op if there is no bug, so they are safe?)
  • fix tests and lint and add a couple of unit tests

In my opinion that's all good to go, although I'm not hung up on much of it if you think some of it is risky and want to drop it (and I haven't looked at the socket ID stuff at all). Also I probably left a bunch of typos in there as usual. Personally I would edit the commit history to have less incidental stuff in it though ;) (If only so we don't have to skip over commits when running git blame on the ci workflow file.)

I should have some time tomorrow to follow up on this or have a look at pdf.js, so let me know how you want to move forward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants