New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Qt 6.7: Weird IPC timeouts on Windows #8191
Comments
I can somewhat reliably reproduce via running qutebrowser and this script: import time
import json
import getpass
import random
from PyQt6.QtNetwork import QLocalSocket
socket = QLocalSocket()
while True:
name = f"qutebrowser-{getpass.getuser()}"
socket.connectToServer(name)
ok = socket.waitForConnected(1000)
assert ok
cmd = random.choice([":version", ":tab-close"])
json_data = {"args": [cmd], "target_arg": None, "version": "3.1.0", "protocol_version": 1}
line = json.dumps(json_data) + '\n'
data = line.encode("utf-8")
socket.writeData(data)
ok = socket.waitForBytesWritten(1000)
if not ok:
print("!", end="", flush=True)
continue
assert socket.error() == QLocalSocket.LocalSocketError.UnknownSocketError
socket.disconnectFromServer()
if socket.state() != QLocalSocket.LocalSocketState.UnconnectedState:
ok = socket.waitForDisconnected(1000)
assert ok
print(".", end="", flush=True)
time.sleep(random.uniform(0.01, 0.2)) It seems to be easier to reproduce when I do some clicking in the UI, e.g. switching tabs. The logging I added in db3d15d might still be a red herring, because the ID is fetched when the timeout function runs, and not when the timeout is started... |
My best bet at getting a Windows C++ stack trace by inserting a readline/Python stuff:
The interesting part:
Init stuff:
|
I tried to improve the logging in f9ae555 and 245d117... and everything points to Windows/Qt/PyQt just immediately calling the Why? How?? Why (hopefully!) only this timer? I have no clue! I was unable to find anything that sounded remotely related. No bugreports, no changes that look remotely relevant, nothing. I think this is user-facing: It happens before the IPC connection ever gets read, and then we disconnect, so we'd (even if rarely) sometimes miss IPC commands. The best I can come up with is: @@ -168,6 +168,7 @@ def __init__(self, socketname, parent=None):
self._timer = usertypes.Timer(self, 'ipc-timeout')
self._timer.setInterval(READ_TIMEOUT)
self._timer.timeout.connect(self.on_timeout)
+ self._timer_start_time = None
if utils.is_windows: # pragma: no cover
self._atime_timer = None
@@ -261,6 +262,7 @@ def handle_connection(self):
log.ipc.debug("Client connected (socket {}).".format(self._socket_id))
self._socket = socket
self._timer.start()
+ self._timer_start_time = time.monotonic()
socket.readyRead.connect(self.on_ready_read)
if socket.canReadLine():
log.ipc.debug("We can read a line immediately.")
@@ -393,11 +395,23 @@ def on_ready_read(self):
if self._socket is not None:
self._timer.start()
+ self._timer_start_time = time.monotonic()
@pyqtSlot()
def on_timeout(self):
"""Cancel the current connection if it was idle for too long."""
assert self._socket is not None
+ assert self._timer_start_time is not None
+ if (
+ time.monotonic() - self._timer_start_time < READ_TIMEOUT / 1000 / 10
+ and qtutils.version_check("6.7.0", exact=True, compiled=False)
+ and utils.is_windows
+ ):
+ # WORKAROUND for unknown Qt bug (?) where the timer triggers immediately
+ # https://github.com/qutebrowser/qutebrowser/issues/8191
+ log.ipc.debug("Ignoring early on_timeout call")
+ return
+
log.ipc.error("IPC connection timed out "
"(socket {}).".format(self._socket_id))
self._socket.disconnectFromServer()
I hate it, but I spent the better part of my day with this now, and I can't come up with anything better. I think any further spelunking would require building Qt from source and firing up a C++ debugger to see what happens in Qt... but I'm exhausted and would rather move on to something more interesting. @toofar thoughts? |
That's very strange. Had a look at recent commits to QTimer and QLocalSocket and nothing jumped out at me. I had a thought that maybe it could be due to calls getting interleaved, like if I guess if identifying and ignoring the situation is the best we can do then we can try that and look into it further if it comes up again. A few questions:
I also worry that this will affect other timers, since we don't know what causes it. But since we haven't seen this issue on any of the linux test runs and people are already using 6.7 without complaint, I guess we hope it remains this isolated ¯_(ツ)_/¯ |
Agreed on both parts. I also looked at changes in qeventdispatcher_win.cpp which has the Windows-specific implementation, and Qt changelogs, and PyQt changelogs, and even diffed PyQt... no dice. Nothing I could find in Qt issues either. I have no idea what's going on there, but it does seem pretty isolated.
In f9ae555 I tried making sure that every connection is a separate object with its own
Agreed. I want to wait for Qt 6.7.1 before pushing it. It's now planned for next week apparently... but once it's out, I hope we'll be able to easily try it out with PyQt, as it should be binary compatible. This will probably delay v3.2.0 even further, but eh... Qt 6.7.0 seems to have other annoying issues on Windows, so it's probably for the better.
Good point. It shouldn't matter much because in
0
See above - in theory yes, but since it's currently a recurring timer, it does already restart anyways.
With 01cc3df and 31ab343 I tried to add a generic check to all timers that logs a warning. The CI run passed. I'm somewhat tempted to commit it and see how it fares over time (hah!). Then again, in case this issue just magically vanishes with Qt 6.7.1 again, we might as well pretend it was never there. |
Well, I finally found something! [QTBUG-124496] QObject::timerEvent may be called before the timer elapsed - Qt Bug Tracker I wonder if we have a single-shot QTimer running somewhere in the IPC code, which Qt confuses with our timeout timer? Or maybe That also explains why anecdotally, "clicking around" in the UI made it reproduce faster, because that probably leads to more timers being used. The bug also says Qt 6.5 in addition to 6.7, but the linked bug says 6.5.6 (or .5 in the comments?). So chances are it was a change that landed in Qt 6.7 but also was backported to 6.5 LTS? Assuming it's still there in Qt 6.7.1, I suppose it might be a good idea to leave the warning logging in. |
Ohh, good spotting! So on Windows the timers are backed by some Windows API and somehow the Windows things aren't cleared when the Qt ones are and then a new Qt one will re-use a still active windows thing? Hmm, that doesn't sound right. Does stopping a timer fire a timer event and that's what's being picked up by the re-used one? Maybe we could make a new timer object before killing the old one and that would resolve this? We do stop and start the timer a couple of times in this class. Anyway, it's reassuring to know it's for sure a Windows only problem. And that Thiago is assigned to the ticket.
Damn I saw that and though "wait, we are trying to get a release out, have you heard of scope creep!". Then I looked at the commits and remembered we already have a QTimer wrapper 🤣 |
I pushed some more stuff to your branch! So the current set of stuff on there is:
The changes I made were to
In my opinion that's all good to go, although I'm not hung up on much of it if you think some of it is risky and want to drop it (and I haven't looked at the socket ID stuff at all). Also I probably left a bunch of typos in there as usual. Personally I would edit the commit history to have less incidental stuff in it though ;) (If only so we don't have to skip over commits when running git blame on the ci workflow file.) I should have some time tomorrow to follow up on this or have a look at pdf.js, so let me know how you want to move forward. |
Had another crack at this today but didn't get anywhere:
_old_socket
handling in there to rule that out, it's still happeningOriginally posted by @toofar in #8175 (comment)
The text was updated successfully, but these errors were encountered: