This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

queries with high number of series trigger excessive pointslicepool usage #1912

Open
Dieterbe opened this issue Oct 1, 2020 · 1 comment

Dieterbe commented Oct 1, 2020

Yesterday I did a fun experiment.
I made a copy of an internal single-tenant monitoring cluster (well, with only a subset of the data, but anyway).
The number of pods doesn't really matter, but for completeness:

  • 16 query pods
  • 32 read pods (16x2 replication)
  • 16 write pods

More important is that active series (both in idx and tank) was about 75k for each of the read and write pods, or about 1.2M cluster-wide (3.5M taking replication into account).

I then ran this query:

time curl 'localhost/graphite/render?target=*&target=*.*&target=*.*.*&target=*.*.*.*&target=*.*.*.*.*&target=*.*.*.*.*.*&target=*.*.*.*.*.*.*&target=*.*.*.*.*.*.*.*&target=*.*.*.*.*.*.*.*.*&target=*.*.*.*.*.*.*.*.*.*&target=*.*.*.*.*.*.*.*.*.*.*' >/dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  587M    0  587M    0     0  7914k      0 --:--:--  0:01:16 --:--:--  156M
real	1m 16.05s
user	0m 0.18s
sys	0m 0.53s

This query hit 1.186M series (all series in the cluster) and returned 15M points (so 12 points per series, thanks to max-points-per-req-soft, the 2h rollup, and the default time window of 24h).
Here is what it did to memory usage:

The memory usage for read pods seemed a bit crazy.
I would expect memory usage to be something along the lines of the following (at least conceptually, ignoring overhead), per series:

for the initial find:

  • metric metadata/details (name, interval, etc): a few hundred B

for the /getdata:

  • 12 points of 12B each = 144B
  • slice header = 24B
  • metric metadata/details (name, interval, etc): a few hundred B

Let's call the /getdata cost 600B per series; at 75k series that's 45MB, not 1300MB. That's a lot more overhead than expected.
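
As a quick sanity check on that back-of-the-envelope estimate (the constants below just restate the figures above; the metadata size is a rough guess):

```go
package main

import "fmt"

func main() {
	const (
		pointsPerSeries = 12     // 24h default window / 2h rollup
		bytesPerPoint   = 12     // per-point size used in the estimate above
		sliceHeader     = 24     // Go slice header
		metadataBytes   = 420    // "a few hundred B" -- rough guess
		seriesPerPod    = 75_000 // active series per read pod
	)
	perSeries := pointsPerSeries*bytesPerPoint + sliceHeader + metadataBytes // ~600B
	fmt.Printf("per series: %dB, per read pod: ~%.0fMB\n",
		perSeries, float64(perSeries*seriesPerPod)/1e6)
	// prints: per series: 588B, per read pod: ~44MB
}
```

So the napkin math predicts roughly 45MB per read pod, versus the ~1300MB observed.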

I ran the same query again and took a bunch of profiles.
I could have done this a lot more scientifically and gotten profiles at exactly the right times (exact query start, just before the first response comes in on the query pod, right when the last response comes in on the query pod, etc.), but I didn't have a lot of time to set up a proper experiment.

https://snapshot.raintank.io/dashboard/snapshot/r5dzlOiBEDznDwxIRrbtoQn9s28arbEk?orgId=2
I actually ran it a 3rd time to be able to get the file query.new-after, a heap profile on the query pod right after the query completes. For all intents and purposes, I think we can pretend I just executed one query and collected profiles as follows:
heap
profiles.zip

The query pods first spend ~900MB encoding the request bodies for the /getdata calls, then the rest decoding the responses from the read pods and generating the final output. Nothing too unexpected here, though I hadn't considered the cost of the /getdata request bodies before.

The read pod profile is pretty clear:

(pprof) top30 -cum
Showing nodes accounting for 1323.76MB, 94.98% of 1393.78MB total
Dropped 112 nodes (cum <= 6.97MB)
Showing top 30 nodes out of 45
      flat  flat%   sum%        cum   cum%
       3MB  0.22%  0.22%     1245MB 89.33%  github.com/grafana/metrictank/api.(*Server).getTarget
         0     0%  0.22%     1245MB 89.33%  github.com/grafana/metrictank/api.(*Server).getTargetsLocal.func1
         0     0%  0.22%  1241.50MB 89.07%  github.com/grafana/metrictank/api.(*Server).getSeriesFixed
 1241.50MB 89.07% 89.29%  1241.50MB 89.07%  github.com/grafana/metrictank/pointslicepool.(*PointSlicePool).GetMin
         0     0% 89.29%  1239.95MB 88.96%  github.com/grafana/metrictank/api.Fix
(...)
ROUTINE ======================== github.com/grafana/metrictank/api.Fix in /go/src/github.com/grafana/metrictank/api/dataprocessor.go
         0     1.21GB (flat, cum) 88.96% of Total
         .          .     74:		// the requested range is too narrow for the requested interval
         .          .     75:		return []schema.Point{}
         .          .     76:	}
         .          .     77:
         .          .     78:	neededCap := int((last-first)/interval + 1)
         .     1.21GB     79:	out := pointSlicePool.GetMin(neededCap)
         .          .     80:	out = out[:neededCap]
         .          .     81:
         .          .     82:	// i iterates in. o iterates out. t is the ts we're looking to fill.
         .          .     83:	for t, i, o := first, 0, -1; t <= last; t += interval {
         .          .     84:		o += 1
(pprof) list GetMin
Total: 1.36GB
ROUTINE ======================== github.com/grafana/metrictank/pointslicepool.(*PointSlicePool).GetMin in /go/src/github.com/grafana/metrictank/pointslicepool/pointslicepool.go
    1.21GB     1.21GB (flat, cum) 89.07% of Total
         .          .     43:		return make([]schema.Point, 0, minCap)
         .          .     44:	}
         .          .     45:	// even if our caller needs a smaller cap now, we expect they will put it back in the pool
         .          .     46:	// so it can later be reused.
         .          .     47:	// may as well allocate a size now that we expect will be more useful down the road.
    1.21GB     1.21GB     48:	return make([]schema.Point, 0, p.defaultSize)
         .          .     49:}

While we only need a cap of 12, we allocate a cap of 2000! At 12B per point that's 24kB per series, which at 75k series per read pod comes to 1.8GB (pre-overhead)!

This is a pretty interesting case. We have the max-points-per-req settings, which let us control how much space we allocate for points across the entire request (regardless of how many series are involved), but this is a kind of loophole: once you query so many series that the number of points per series is small, we lose control over the amount of space allocated, because we always allocate slices with a capacity of at least 2000 points.
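
To make the loophole concrete, here is a minimal sketch of a cap-aware GetMin, i.e. the "don't hand out large slices when we only need small ones" idea. This is only my illustration, not the actual patch: the simplified Point and pool types and the defaultSize/2 threshold are assumptions.

```go
package pointslicepool

import "sync"

// Point stands in for schema.Point for the purpose of this sketch.
type Point struct {
	Val float64
	Ts  uint32
}

// PointSlicePool is a simplified stand-in for metrictank's pool;
// the field names here are assumptions, not the real implementation.
type PointSlicePool struct {
	defaultSize int
	pool        sync.Pool
}

// GetMin returns a slice with capacity >= minCap. Unlike the behaviour shown
// in the profile above, it only rounds a request up to defaultSize when the
// request is already in that ballpark, so asking for 12 points no longer
// costs a 2000-point allocation.
func (p *PointSlicePool) GetMin(minCap int) []Point {
	if minCap > p.defaultSize {
		// bigger than anything the pool holds: allocate exactly what was asked for
		return make([]Point, 0, minCap)
	}
	if s, ok := p.pool.Get().([]Point); ok && cap(s) >= minCap {
		return s[:0] // reuse a pooled slice when it is big enough
	}
	if minCap >= p.defaultSize/2 { // illustrative threshold, not the real one
		// close to the default: allocate defaultSize so the slice stays broadly reusable
		return make([]Point, 0, p.defaultSize)
	}
	// small request (e.g. 12 points per series): allocate only what is needed
	return make([]Point, 0, minCap)
}
```

The commit referenced below describes the actual change.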

metrictank --version
metrictank (version: v1.0-148-gb8bb526 - runtime: go1.15.2)
Dieterbe added the bug label Oct 1, 2020
Dieterbe added a commit that referenced this issue Oct 1, 2020
1) save memory by not using large slices when we only need small ones
  In particular, fix #1912
2) use a default size that corresponds to a capacity that append()
   actually uses (I believe a size of 800 gets rounded up by make)
Dieterbe commented Oct 5, 2020

(Side note: this issue can happen both when requesting very short time ranges, which obviously result in few points, and also with longer time ranges: either due to the time window being queried or thanks to max-points-per-req-soft, a rollup may be selected such that even a long window still results in few points.)
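
To illustrate with made-up numbers (these are not from the experiment above), both cases end up far below the pool's 2000-point default capacity:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// a short time range at a 10s raw interval, and a long time range that is
	// answered from a 2h rollup: both yield few points per series
	shortRange := int((30 * time.Minute) / (10 * time.Second))
	longRange := int((24 * time.Hour) / (2 * time.Hour))
	fmt.Println(shortRange, longRange) // prints: 180 12
}
```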

Dieterbe changed the title from "queries with high number of series trigger super duper excessive pointslicepool usage" to "queries with high number of series trigger excessive pointslicepool usage" Oct 6, 2020
Dieterbe added this to the sprint-18 milestone Nov 4, 2020
Dieterbe removed their assignment Sep 26, 2021