Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incident - High number of builds on GESIS server on 25 January #2905

Open
rgaiacs opened this issue Jan 25, 2024 · 2 comments
Open

Incident - High number of builds on GESIS server on 25 January #2905

rgaiacs opened this issue Jan 25, 2024 · 2 comments
Assignees

Comments

@rgaiacs
Copy link
Collaborator

rgaiacs commented Jan 25, 2024

Around 5:00am UTC+1 of 25 January 2025, GESIS server started to accumulate build pods.

Screenshot 2024-01-25 at 14-14-49 1  Overview - Dashboards - Grafana

The increase number of build pods continuers from 5:00am UTC+1 until 7:30am UTC+1 when it abruptly dropped.

At the same time, number of user pods on GESIS server decreased.

Screenshot 2024-01-25 at 14-12-21 1  Overview - Dashboards - Grafana

GESIS server was NOT cordoned during this period.

Screenshot 2024-01-25 at 14-24-05 Node Activity - Dashboards - Grafana

And CPU load on GESIS server was average during this period.

Screenshot 2024-01-25 at 14-24-37 Node Activity - Dashboards - Grafana

The popular repositories before the incident were

Screenshot 2024-01-25 at 14-28-16 View panel - Pod Activity - Dashboards - Grafana

GESIS server looks normal now. cc @arnim

@rgaiacs rgaiacs self-assigned this Jan 25, 2024
@arnim
Copy link
Contributor

arnim commented Jan 25, 2024

Thank you for reporting :)

@rgaiacs
Copy link
Collaborator Author

rgaiacs commented Jan 25, 2024

Binder Pod

Some builds failed with

[E 240125 04:00:29 build:669] log event not json: 'Exception ignored in: <function Application.__del__ at 0x7f95c6954ca0>\n'
[E 240125 04:00:29 build:669] log event not json: 'Exception ignored in: <function Application.__del__ at 0x7f95c6954ca0>\n'
[E 240125 04:00:29 build:669] log event not json: 'Traceback (most recent call last):\n'
[E 240125 04:00:29 build:669] log event not json: 'Traceback (most recent call last):\n'
[E 240125 04:00:29 build:669] log event not json: '  File "/usr/lib/python3.10/site-packages/traitlets/config/application.py", line 1043, in __del__\n'
[E 240125 04:00:29 build:669] log event not json: '  File "/usr/lib/python3.10/site-packages/traitlets/config/application.py", line 1043, in __del__\n'
[E 240125 04:00:29 build:669] log event not json: '  File "/usr/lib/python3.10/site-packages/traitlets/config/application.py", line 1032, in close_handlers\n'
[E 240125 04:00:29 build:669] log event not json: '  File "/usr/lib/python3.10/site-packages/traitlets/traitlets.py", line 719, in __get__\n'
[E 240125 04:00:29 build:669] log event not json: '  File "/usr/lib/python3.10/site-packages/traitlets/config/application.py", line 1032, in close_handlers\n'
[E 240125 04:00:29 build:669] log event not json: '  File "/usr/lib/python3.10/site-packages/traitlets/traitlets.py", line 719, in __get__\n'
[E 240125 04:00:29 build:669] log event not json: "TypeError: 'NoneType' object is not callable\n"
[E 240125 04:00:29 build:669] log event not json: "TypeError: 'NoneType' object is not callable\n"
[I 240125 04:00:30 build:679] Finished streaming logs of build-binder-2dexamples-2dr-974603-02c-0e
[I 240125 04:00:30 build:679] Finished streaming logs of build-binder-2dexamples-2dr-974603-02c-0e
[W 240125 04:00:31 builder:161] Stream closed while handling /binder/build/gh/binder-examples/r/HEAD?build_token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE3MDYxNTU0OTQsImF1ZCI6ImdoL2JpbmRlci1leGFtcGxlcy9yL0hFQUQiLCJvcmlnaW4iOiJteWJpbmRlci5vcmcifQ.apED-jm0GH25vqpofjwU9AEf7_zmHnBB1SiJ5h2Vy7E

but this happens all the time.

Some callback error like

[E 240125 04:14:52 ioloop:770] Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7f432af05310>>, <Task finished name='Task-1437
908' coro=<Launcher.launch.<locals>.handle_chunk() done, defined at /usr/local/lib/python3.11/site-packages/binderhub/launcher.py:279> exception=Finish()>)
    Traceback (most recent call last):
      File "/usr/local/lib/python3.11/site-packages/binderhub/builder.py", line 159, in emit
        await self.flush()
    tornado.iostream.StreamClosedError: Stream is closed
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/usr/local/lib/python3.11/site-packages/tornado/ioloop.py", line 750, in _run_callback
        ret = callback()
              ^^^^^^^^^^
      File "/usr/local/lib/python3.11/site-packages/tornado/ioloop.py", line 774, in _discard_future_result
        future.result()
      File "/usr/local/lib/python3.11/site-packages/binderhub/launcher.py", line 291, in handle_chunk
        await event_callback(event)
      File "/usr/local/lib/python3.11/site-packages/binderhub/builder.py", line 696, in handle_progress_event
        await self.emit(
      File "/usr/local/lib/python3.11/site-packages/binderhub/builder.py", line 163, in emit
        raise Finish()
    tornado.web.Finish

that also happens all the time.

What catched my attention was

[I 240125 04:16:00 build:640] Stopping watch of build-ruby-2dnumo-2dnumo-2dnarray-adfbde-8dd-4e
[W 240125 04:16:01 log:135] 405 HEAD /binder/ (anonymous@63.143.42.251) 2.68ms
[I 240125 04:16:03 health:152] Checking registry status
[I 240125 04:16:04 health:117] Health check check_docker_registry took 0.804s
[I 240125 04:16:10 build:748] Deleting Failed build build-binder-2dexamples-2dr-974603-02c-0e (repo=https://github.com/binder-examples/r)
[I 240125 04:16:10 build:748] Deleting Failed build build-hankcs-2dpyhanlp-d3e861-f09-8b (repo=https://github.com/hankcs/pyhanlp)
[I 240125 04:16:10 build:748] Deleting Failed build build-ruby-2dnumo-2dnumo-2dnarray-adfbde-8dd-4e (repo=https://github.com/ruby-numo/numo-narray)
[I 240125 04:16:10 build:782] Deleted 3/6 build pods
[I 240125 04:16:21 health:152] Checking registry status
[I 240125 04:16:22 health:117] Health check check_docker_registry took 0.923s
[I 240125 04:16:22 log:135] 200 GET /binder/ (anonymous@63.143.42.251) 3.71ms
[I 240125 04:16:38 health:152] Checking registry status
[I 240125 04:16:39 health:117] Health check check_docker_registry took 0.791s
[I 240125 04:16:54 health:152] Checking registry status
[I 240125 04:16:54 health:222] Getting pod statistics for gesis
[I 240125 04:16:55 health:117] Health check check_docker_registry took 0.808s
[I 240125 04:17:11 health:152] Checking registry status
[I 240125 04:17:12 health:117] Health check check_docker_registry took 0.839s
[I 240125 04:17:27 health:152] Checking registry status
[I 240125 04:17:28 health:117] Health check check_docker_registry took 0.901s
[I 240125 04:17:45 health:152] Checking registry status
[I 240125 04:17:46 health:117] Health check check_docker_registry took 0.803s
[I 240125 04:18:01 health:152] Checking registry status
[I 240125 04:18:01 health:222] Getting pod statistics for gesis
[I 240125 04:18:02 health:117] Health check check_docker_registry took 0.812s
[I 240125 04:18:18 health:152] Checking registry status
[I 240125 04:18:19 health:117] Health check check_docker_registry took 0.790s
[I 240125 04:18:36 health:152] Checking registry status
[I 240125 04:18:37 health:117] Health check check_docker_registry took 1.542s
[I 240125 04:18:37 log:135] 200 GET /binder/health (anonymous@34.71.121.177) 1543.64ms
[I 240125 04:18:37 log:135] 200 GET /binder/health (anonymous@34.71.121.177) 1527.70ms
[I 240125 04:18:37 log:135] 200 GET /binder/health (anonymous@35.239.35.132) 1449.17ms
[I 240125 04:18:53 health:152] Checking registry status
[I 240125 04:18:53 health:117] Health check check_docker_registry took 0.827s
[I 240125 04:19:10 health:152] Checking registry status
[I 240125 04:19:10 health:222] Getting pod statistics for gesis
[I 240125 04:19:11 health:117] Health check check_docker_registry took 0.838s
[I 240125 04:19:16 build:563] Started build build-gesiscss-2dorc2-2dtest-2dbuild-ced273-4a3-a6

For a couple of minutes, the only entry in the log is health check. And later the health check start failing

[I 240125 04:24:35 health:117] Health check check_jupyterhub_api took 3.011s
[E 240125 04:24:35 health:54] Error checking check_jupyterhub_api
    Traceback (most recent call last):
      File "/usr/local/lib/python3.11/site-packages/binderhub/health.py", line 52, in wrapper
        res = await f(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.11/site-packages/binderhub/health.py", line 27, in wrapper
        return await f(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.11/site-packages/binderhub/health.py", line 110, in wrapped
        return await f(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.11/site-packages/binderhub/health.py", line 143, in check_jupyterhub_api
        await AsyncHTTPClient().fetch(hub_url + "hub/api/health", request_timeout=3)
    tornado.curl_httpclient.CurlError: HTTP 599: Operation timed out after 3000 milliseconds with 0 bytes received
[W 240125 04:24:35 health:196] Unhealthy services: [{'service': 'JupyterHub API', 'ok': False}]
[E 240125 04:24:35 log:135] 503 GET /binder/health (anonymous@35.239.35.132) 3014.89ms
[W 240125 04:24:35 health:196] Unhealthy services: [{'service': 'JupyterHub API', 'ok': False}]
[E 240125 04:24:35 log:135] 503 GET /binder/health (anonymous@34.71.121.177) 2025.55ms
[W 240125 04:24:35 health:196] Unhealthy services: [{'service': 'JupyterHub API', 'ok': False}]
[E 240125 04:24:35 log:135] 503 GET /binder/health (anonymous@35.239.35.132) 1852.56ms
[W 240125 04:24:36 health:196] Unhealthy services: [{'service': 'JupyterHub API', 'ok': False}]
[E 240125 04:24:36 log:135] 503 GET /binder/health (anonymous@35.239.35.132) 2.84ms
[W 240125 04:24:36 health:196] Unhealthy services: [{'service': 'JupyterHub API', 'ok': False}]
[E 240125 04:24:36 log:135] 503 GET /binder/health (anonymous@35.239.35.132) 3.73ms
[W 240125 04:24:36 health:196] Unhealthy services: [{'service': 'JupyterHub API', 'ok': False}]
[E 240125 04:24:36 log:135] 503 GET /binder/health (anonymous@34.71.121.177) 2.63ms

But, 2 minutes later, the health check starts work again.

[I 240125 04:26:08 health:152] Checking registry status
[I 240125 04:26:08 health:222] Getting pod statistics for gesis
[I 240125 04:26:09 health:117] Health check check_docker_registry took 0.865s
[I 240125 04:26:26 health:152] Checking registry status
[I 240125 04:26:27 health:117] Health check check_docker_registry took 0.790s
[W 240125 04:26:41 log:135] 405 HEAD /binder/ (anonymous@63.143.42.251) 2.82ms

Later, Hub API stop working:

[I 240125 04:39:27 health:222] Getting pod statistics for gesis
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/hanlpbot-hanlp-binder-xiqbyxpi/server/progress): HTTP 599: HTTP/2 stream 0 was not closed cleanly
: INTERNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/scikit-learn-scikit-learn-2eekxsjd/server/progress): HTTP 599: HTTP/2 stream 0 was not closed cle
anly: INTERNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/ucb-ee290c-tape-chisel-bootcamp-ju49jgy7/server/progress): HTTP 599: HTTP/2 stream 0 was not clos
ed cleanly: INTERNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/geo-python-site-9ojs8hgc/server/progress): HTTP 599: HTTP/2 stream 0 was not closed cleanly: INTE
RNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/hanlpbot-hanlp-binder-mca1cur4/server/progress): HTTP 599: HTTP/2 stream 0 was not closed cleanly
: INTERNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/ucb-ee290c-tape-chisel-bootcamp-6bx9viqw/server/progress): HTTP 599: HTTP/2 stream 0 was not clos
ed cleanly: INTERNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/ipython-ipython-in-depth-ll82qv4v/server/progress): HTTP 599: HTTP/2 stream 0 was not closed clea
nly: INTERNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/ucb-ee290c-tape-chisel-bootcamp-i0bt3n4u/server/progress): HTTP 599: HTTP/2 stream 0 was not clos
ed cleanly: INTERNAL_ERROR (err 2)
[E 240125 04:39:32 launcher:122] Error accessing Hub API (using https://notebooks.gesis.org/binder/jupyter/hub/api/users/binder-examples-requirements-9qfd4sec/server/progress): HTTP 599: HTTP/2 stream 0 was not closed 
cleanly: INTERNAL_ERROR (err 2)
[I 240125 04:39:32 health:117] Health check check_jupyterhub_api took 302.220s

@arnim I believe that something on the GESIS network might have caused this. I'm excluding something on the Kubernetes network because we didn't change anything.

Build Pod

I didn't found anything relevant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants