This repository has been archived by the owner on Mar 27, 2021. It is now read-only.

Fix "...Span <span name> is GC'ed without being ended." issue (caused by a BT timeout) #761

Open
sming opened this issue Feb 11, 2021 · 1 comment

Comments

@sming
Contributor

sming commented Feb 11, 2021

Hundreds of tracing spans are left un-ended after every query timeout

  • I am a prism goalie
  • Who wants to have a stable heroic
  • So that I can focus on features and not get woken up at night and have angry users

These un-ended spans are a real runtime risk to heroic. If ~700-1000 of them are left hanging around after each timed-out query, it's conceivable that the JVM will:

  • run out of memory altogether
  • experience much longer GC pauses / sweep times (because all the hanging spans need reaping)
  • hugely inflate the size of heroic's logs, costing us $$$ and obscuring "genuine" problems

Proposed Solution

  • find the correct location to catch the Bigtable (BT) timeout exception (not trivial)
  • catch it, end the span and rethrow it
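The "end the span, then rethrow" step could look roughly like the sketch below. It is only an illustration: `EndableSpan`, `RecordingSpan` and `SpanGuard` are hypothetical names, and `EndableSpan` is a stand-in for `io.opencensus.trace.Span` so the sketch compiles without OpenCensus on the classpath. It uses `try`/`finally` rather than an explicit `catch`, so the timeout exception propagates unchanged and the span is also ended on the success path.

```java
import java.util.concurrent.Callable;

/** Hypothetical stand-in for io.opencensus.trace.Span; only end() is modelled. */
interface EndableSpan {
    void end();
}

/** Test double that records whether end() was called. */
final class RecordingSpan implements EndableSpan {
    private boolean ended = false;

    @Override
    public void end() {
        ended = true;
    }

    boolean isEnded() {
        return ended;
    }
}

/** Hypothetical helper: guarantees the span is ended however the work finishes. */
final class SpanGuard {
    private SpanGuard() {}

    static <T> T callAndEndSpan(EndableSpan span, Callable<T> work) throws Exception {
        try {
            // If this throws (e.g. a timeout), the exception propagates unchanged...
            return work.call();
        } finally {
            // ...but the span is ended first, on both the success and failure paths.
            span.end();
        }
    }
}
```

In heroic itself the span would come from OpenCensus (`tracer.spanBuilder(...).startSpan()`) and `Span.end()` would be called directly. If the fetch is asynchronous, the `end()` call would presumably have to live in the future's completion or failure callback instead of a `finally` block, which may be why finding the right location is not trivial.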

Repro Steps

  • run heroic locally with GUC config and on branch feature/add-bigtable-timeout-settings-refactored
  • capture a lengthy query from Grafana using the Chrome DevTools network tab
  • alter the query to hit localhost and watch the logs; you'll see the "...is GC'ed without being ended." message

List of methods concerned from logs

  1. ERROR io.opencensus.trace.Tracer - Span localMetricsManager.fetchSeries is GC'ed without being ended.
  2. ERROR io.opencensus.trace.Tracer - Span bigtable.fetchBatch is GC'ed without being ended.
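
To see which spans are affected and how often, the log lines can be tallied by span name. The sketch below assumes the message format is exactly as in the two lines above; `UnendedSpanTally` and `countBySpanName` are hypothetical names, not part of heroic.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that counts "GC'ed without being ended" messages per span name. */
final class UnendedSpanTally {
    // Assumes the message format shown in the log lines above.
    private static final Pattern UNENDED =
        Pattern.compile("Span (\\S+) is GC'ed without being ended\\.");

    private UnendedSpanTally() {}

    static Map<String, Integer> countBySpanName(List<String> logLines) {
        // TreeMap keeps the span names sorted for stable output.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = UNENDED.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Running this over the log output of a single timed-out query would give a per-span count to compare against the ~700-1000 estimate.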
@sming sming created this issue from a note in Observability Kanban (To do) Feb 11, 2021
@sming sming changed the title from Fix "...Span localMetricsManager.fetchSeries is GC'ed without being ended." issue (caused by a BT timeout) to Fix "...Span <span name> is GC'ed without being ended." issue (caused by a BT timeout) Feb 12, 2021
@sming sming moved this from To do to Inbox in Observability Kanban Mar 24, 2021
@sming
Contributor Author

sming commented Mar 24, 2021

FYI @adsail, moving to Inbox as it's not something we'll need to tackle until more aggressive timeouts are deployed.

@lmuhlha lmuhlha removed this from Inbox in Observability Kanban Mar 29, 2021