Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: fix graceful shutdown, add SIGTERM handler #1951

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

frostmar
Copy link
Contributor

@frostmar frostmar commented Jan 9, 2024

An attempt to address issue #1946

馃憤 Adds signal handling for SIGTERM
馃憤 Allows graceful shutdown to complete (instead of HTTP requests in-flight being terminated suddenly)

馃憥 doesn't allow time for in-flight requests to finish. The context is cancelled for each in-flight request as a side-effect of the way servers shutdown and a parent context is cancelled. I can't see a way to get this to work as I'd like, so opening this draft PR for suggestions or discussion.

I believe the code flow is now:

  1. signal received
  2. signal handling runs down.Shutdown(tctx), causing each server to stop listening for new requests, and immediately return http.ErrServerClosed from their Serve() or ListenAndServe() calls
  3. so srvs.Wait() immediately completes for the server ErrorGroup, cancelling it's context srvctx
  4. srvctx is a parent of each http request's context, so each in-flight request begins context-cancellation processing
  5. process exits when all in-flight requests end (which is unnecessarily quickly, as they're all cancelled instead of getting a chance to complete)

I tried not having all request contexts children of srvctx, but got lots of context-related warnings logged from pkg/ctxlock/ctxlock.go during graceful shutdown


In practice this changes the behaviour seen when clair does a graceful shutdown -

From:

  • in-flight requests are terminated suddenly, with no content (or perhaps partial content if sending of the body had started)
  • clair graceful shutdown code doesn't properly run to completion

To:

  • in-flight requests have their context cancelled, processing of that has 10sec to complete. Probably a http error response and body is received
  • clair graceful shutdown code runs to completion (but causes in-flight requests to be context-cancelled rather than allowing a grace period to finish successfully)

Adds signal handling for SIGTERM
Allows graceful shutdown to complete (instead of HTTP requests in-flight being terminated)

WIP: doesn't allow time for in-flight requests to finish. The context is cancelled for each in-flight request as a side-effect of the way servers shutdown and a parent context is cancelled
Signed-off-by: Mark Frost <frostmar@uk.ibm.com>
@app-sre-bot
Copy link
Collaborator

Can one of the admins verify this patch?

@frostmar
Copy link
Contributor Author

frostmar commented Jan 9, 2024

Example running this code

  1. run a clair: go run ./cmd/clair/ -conf ./local-dev/clair/config-frostmar-local.yaml -mode indexer
  2. start a longer-running request: curl -H "Content-Type: application/json" http://localhost:6060/indexer/api/v1/index_report -d ...etc...
  3. CTRL+C (SIGINT) clair
  4. request response is a cancellation, (instead of terminated with no body)
    {"code":"internal-error","message":"failed to start scan: failed to fetch layers: encountered error while fetching a layer: context canceled"}
    

Clair logs:

2:59PM INF index request start component=libindex/Libindex.Index manifest=sha256:38d3748ab6e1fcb70906848c9d1bdf23af3d9402aeda0a408dde40fa59c87c3b request_id=ea8395ae99607f8b
2:59PM DBG locking attempt component=libindex/Libindex.Index manifest=sha256:38d3748ab6e1fcb70906848c9d1bdf23af3d9402aeda0a408dde40fa59c87c3b request_id=ea8395ae99607f8b
2:59PM DBG locking OK component=libindex/Libindex.Index manifest=sha256:38d3748ab6e1fcb70906848c9d1bdf23af3d9402aeda0a408dde40fa59c87c3b request_id=ea8395ae99607f8b
2:59PM INF starting scan component=indexer/controller/Controller.Index manifest=sha256:38d3748ab6e1fcb70906848c9d1bdf23af3d9402aeda0a408dde40fa59c87c3b request_id=ea8395ae99607f8b
2:59PM INF manifest to be scanned component=indexer/controller/Controller.Index manifest=sha256:38d3748ab6e1fcb70906848c9d1bdf23af3d9402aeda0a408dde40fa59c87c3b request_id=ea8395ae99607f8b state=CheckManifest
2:59PM INF layers fetch start component=indexer/controller/Controller.Index manifest=sha256:38d3748ab6e1fcb70906848c9d1bdf23af3d9402aeda0a408dde40fa59c87c3b request_id=ea8395ae99607f8b state=FetchLayers
2:59PM DBG fetching layers component=indexer/controller/Controller.Index count=1 manifest=sha256:38d3748ab6e1fcb70906848c9d1bdf23af3d9402aeda0a408dde40fa59c87c3b request_id=ea8395ae99607f8b state=FetchLayers
2:59PM DBG layer fetch start arena=/var/folders/fh/03vck1t10pb340wx73411k7w0000gn/T/ component=libindex/fetchArena.realizeLayer layer=sha256:60e3fbbba55b85dbc675e512ce2c3c8dd658550cb516ac64cdc74b75f90805cd manifest=sha256:38d3748ab6e1fcb70906848c9d1bdf23af3d9402aeda0a408dde40fa59c87c3b request_id=ea8395ae99607f8b state=FetchLayers uri=http://localhost:8087/blobs/sha256:60e3fbbba55b85dbc675e512ce2c3c8dd658550cb516ac64cdc74b75f90805cd

^C

2:59PM INF Received signal 'interrupt', gracefully shutting down component=main
2:59PM WRN layers fetch failure error="encountered error while fetching a layer: context canceled" component=indexer/controller/Controller.Index manifest=sha256:38d3748ab6e1fcb70906848c9d1bdf23af3d9402aeda0a408dde40fa59c87c3b request_id=ea8395ae99607f8b state=FetchLayers
2:59PM INF layers fetch done component=indexer/controller/Controller.Index manifest=sha256:38d3748ab6e1fcb70906848c9d1bdf23af3d9402aeda0a408dde40fa59c87c3b request_id=ea8395ae99607f8b state=FetchLayers
2:59PM INF index request done component=libindex/Libindex.Index manifest=sha256:38d3748ab6e1fcb70906848c9d1bdf23af3d9402aeda0a408dde40fa59c87c3b request_id=ea8395ae99607f8b
2:59PM DBG http error response error="failed to start scan: failed to fetch layers: encountered error while fetching a layer: context canceled" code=500 component=httptransport/New request_id=ea8395ae99607f8b
2:59PM INF handled HTTP request component=httptransport/New duration=3817.390542 method=POST remote_addr=[::1]:60510 request_id=ea8395ae99607f8b request_uri=/indexer/api/v1/index_report status=500
2:59PM INF successfully shut down servers component=main

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants