Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[7.13.3] Investigating reasons for CoreConnectTimeoutError #7956

Open
kozlovsky opened this issue Apr 4, 2024 · 0 comments
Open

[7.13.3] Investigating reasons for CoreConnectTimeoutError #7956

kozlovsky opened this issue Apr 4, 2024 · 0 comments

Comments

@kozlovsky
Copy link
Collaborator

kozlovsky commented Apr 4, 2024

CoreConnectTimeoutError is our most frequent error. According to Sentry, there are many different reasons for this error. We have eliminated many of them, but there are still many remaining reasons. Eliminating, say, 90% of them should be a significant improvement. I want to list most of the possible reasons for this issue. Then, if necessary, we can add separate sub-issues for some of them.

We can split CoreConnectTimeoutError cases into two categories.

  • Core continues to work. It executes some slow coroutines but cannot communicate with GUI until the timeout, which is currently equal to 120 seconds.
  • Cory apparently completely "freezes" and, after some point, does not put any output into stdout.

We can remedy the problem in the first case by increasing the timeout time. While it is preferable to connect with the GUI as quickly as possible, it is better to connect with a delay than to have a CoreConnectTimeoutError. For that reason, I suggest increasing the timeout to a longer time, say, 240 seconds.

Cases when Core continues to work till the timeout

  • The most frequent case is the delay in Component.start(), a generic method all components use. Sometimes, it freezes for 90 seconds or more. It calls the component-specific run method. To see the exact place inside the run method where a specific component freezes, we should enable slow coroutine stack tracing by setting SLOW_CORO_STACK_TRACING to 1.

  • DownloadManager.start() sometimes takes more than 52 seconds for a single step at await self.load_checkpoints(). I believe it is already fixed in the main branch.

  • DownloadManager.start_handle(): up to 60 seconds. Apparently, it freezes at the line

    ltsession = self.get_session(download.config.get_hops())
  • SSLProtocol.connection_made() sometimes freezes for 2.76 seconds

  • TorrentChecker.check_random_tracker() sometimes freezes for 30 seconds or more. It is possible that Fix slow database queries, by using partial index idx_torrentstate__last_check__partial #7955 fixes it, but other reasons can also cause it. To be sure, we should look at the stack trace after enabling slow coroutine stack tracing.

  • _ProactorDatagramTransport._loop_reading sometimes requires up to 5 seconds

  • <IPv8.start.<locals>.ticker() running at ipv8_service.py:148>: 16 seconds

  • GuiProcessWatcher:check_gui_process sometimes freezes for 9.74 seconds. It is surprising because the method is straightforward and only contains psutil.Process.is_running() and psutil.Process.status() calls. It looks like they can sometimes be pretty slow.

Cases when Core completely "freezes"

  • In Sentry, we have multiple events of the same case; the last stdout line is:

    INFO - tribler.core.check_os(121) - Check and enable code tracing. Process name: "core". Log dir: "C:\Users\<officer>\AppData\Roaming\.Tribler\7.13\log"
    

    This is the following code fragment. In it, the method check_and_enable_code_tracing actually does nothing except logging because it first checks some command line arguments and does not find them:

    def run_tribler_core_session(...) -> int:
        ...
        trace_logger = check_and_enable_code_tracing('core', log_dir)
    
        loop = asyncio.get_event_loop()
        exception_handler = default_core_exception_handler
        loop.set_exception_handler(exception_handler.unhandled_error_observer)
    
        try:
            exit_code = loop.run_until_complete(core_session(config, components=list(components_gen(config))))
        finally:
            if trace_logger:
                trace_logger.close()
    
            # Flush the logs to the file before exiting
            for handler in logging.getLogger().handlers:
                handler.flush()
    
        return exit_code

    Core successfully executed the check_and_enable_code_tracing call, but where can it freeze after that? list(components_gen(config)) should output other log records, so the freeze should happen before that.


In my opinion, as a minimum change, we should significantly increase the timeout value and enable slow coroutine stack tracing by default for the binary build. Enabling stack tracing should slow down Core code a bit, but it is not too visible for users (as UI works at the same speed as before) and is crucial to discovering the actual reason for freezes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant