[7.13.3] Investigating reasons for CoreConnectTimeoutError #7956

kozlovsky · 2024-04-04T12:55:55Z

CoreConnectTimeoutError is our most frequent error. According to Sentry, there are many different reasons for this error. We have eliminated many of them, but there are still many remaining reasons. Eliminating, say, 90% of them should be a significant improvement. I want to list most of the possible reasons for this issue. Then, if necessary, we can add separate sub-issues for some of them.

We can split CoreConnectTimeoutError cases into two categories.

Core continues to work. It executes some slow coroutines but cannot communicate with GUI until the timeout, which is currently equal to 120 seconds.
Cory apparently completely "freezes" and, after some point, does not put any output into stdout.

We can remedy the problem in the first case by increasing the timeout time. While it is preferable to connect with the GUI as quickly as possible, it is better to connect with a delay than to have a CoreConnectTimeoutError. For that reason, I suggest increasing the timeout to a longer time, say, 240 seconds.

Cases when Core continues to work till the timeout

The most frequent case is the delay in Component.start(), a generic method all components use. Sometimes, it freezes for 90 seconds or more. It calls the component-specific run method. To see the exact place inside the run method where a specific component freezes, we should enable slow coroutine stack tracing by setting SLOW_CORO_STACK_TRACING to 1.
DownloadManager.start() sometimes takes more than 52 seconds for a single step at await self.load_checkpoints(). I believe it is already fixed in the main branch.
DownloadManager.start_handle(): up to 60 seconds. Apparently, it freezes at the line
```
ltsession = self.get_session(download.config.get_hops())
```
SSLProtocol.connection_made() sometimes freezes for 2.76 seconds
TorrentChecker.check_random_tracker() sometimes freezes for 30 seconds or more. It is possible that Fix slow database queries, by using partial index idx_torrentstate__last_check__partial #7955 fixes it, but other reasons can also cause it. To be sure, we should look at the stack trace after enabling slow coroutine stack tracing.
_ProactorDatagramTransport._loop_reading sometimes requires up to 5 seconds
<IPv8.start.<locals>.ticker() running at ipv8_service.py:148>: 16 seconds
GuiProcessWatcher:check_gui_process sometimes freezes for 9.74 seconds. It is surprising because the method is straightforward and only contains psutil.Process.is_running() and psutil.Process.status() calls. It looks like they can sometimes be pretty slow.

Cases when Core completely "freezes"

In Sentry, we have multiple events of the same case; the last stdout line is:

INFO - tribler.core.check_os(121) - Check and enable code tracing. Process name: "core". Log dir: "C:\Users\<officer>\AppData\Roaming\.Tribler\7.13\log"

This is the following code fragment. In it, the method check_and_enable_code_tracing actually does nothing except logging because it first checks some command line arguments and does not find them:

def run_tribler_core_session(...) -> int:
    ...
    trace_logger = check_and_enable_code_tracing('core', log_dir)

    loop = asyncio.get_event_loop()
    exception_handler = default_core_exception_handler
    loop.set_exception_handler(exception_handler.unhandled_error_observer)

    try:
        exit_code = loop.run_until_complete(core_session(config, components=list(components_gen(config))))
    finally:
        if trace_logger:
            trace_logger.close()

        # Flush the logs to the file before exiting
        for handler in logging.getLogger().handlers:
            handler.flush()

    return exit_code

Core successfully executed the check_and_enable_code_tracing call, but where can it freeze after that? list(components_gen(config)) should output other log records, so the freeze should happen before that.

In my opinion, as a minimum change, we should significantly increase the timeout value and enable slow coroutine stack tracing by default for the binary build. Enabling stack tracing should slow down Core code a bit, but it is not too visible for users (as UI works at the same speed as before) and is crucial to discovering the actual reason for freezes.

The text was updated successfully, but these errors were encountered:

kozlovsky added the type: bug label Apr 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[7.13.3] Investigating reasons for CoreConnectTimeoutError #7956

[7.13.3] Investigating reasons for CoreConnectTimeoutError #7956

kozlovsky commented Apr 4, 2024 •

edited

[7.13.3] Investigating reasons for CoreConnectTimeoutError #7956

[7.13.3] Investigating reasons for CoreConnectTimeoutError #7956

Comments

kozlovsky commented Apr 4, 2024 • edited

Cases when Core continues to work till the timeout

Cases when Core completely "freezes"

kozlovsky commented Apr 4, 2024 •

edited