[BUG] CPU usage increased a lot since 1.17.0 #228

Open
mga-chka opened this issue Sep 13, 2022 · 4 comments

@mga-chka
Collaborator

Describe the bug
Cf. issue #222: CPU usage has increased since version 1.17.0.
With the new version we see noticeably higher CPU usage (around 60% per core on average, whereas before it was close to 0), but it's still acceptable.
This new behavior might be normal, since version 1.17.0 processes queries much faster.
The goal of this task is to investigate whether 1.17.0 introduced an unnecessary CPU bottleneck.

@matthax

matthax commented Aug 3, 2023

Hey, I'm wondering if anything came of this. We've been running some benchmarks, and I'm somewhat concerned because we're seeing a consistent 60%-70% utilization on a 16-core machine, processing about 1,300 qps at peak.

The queries themselves are relatively lightweight and don't require many resources; memory usage stays around 35 MiB. Is this the expected behavior?

[screenshot: CPU utilization]

Here's the memory utilization:
[screenshot: memory utilization]

The speed is good, but if we project this utilization out over our actual ClickHouse cluster, we'd end up allocating more resources to CHProxy than to the Python service connecting to it.

[screenshot]

The debug logs don't offer much insight here. I did try disabling all caching to see if cycles were being wasted purging cache directories, but I didn't see any change there. This seems worthy of a pprof - perhaps it's just a classic GC issue that could be easily resolved?
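
For reference, a minimal sketch of how a Go service can expose pprof endpoints (a standalone illustration, not chproxy's actual code; a CPU profile could then be pulled with `go tool pprof`):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// A 30-second CPU profile can then be captured with:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	log.Println(http.ListenAndServe("localhost:6060", nil))
}
```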

@matthax

matthax commented Aug 3, 2023

I was able to improve performance slightly by setting GOMAXPROCS. This is deployed on Kubernetes, and the node was reporting 32 CPUs available even though the pod is only allocated 30.

https://github.com/uber-go/automaxprocs might be a good option here.
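
For illustration, a minimal sketch of what wiring in automaxprocs looks like (a standalone example, not chproxy's actual code):

```go
package main

import (
	"fmt"
	"runtime"

	// The blank import adjusts GOMAXPROCS at startup based on the
	// container's cgroup CPU quota instead of the node's CPU count.
	_ "go.uber.org/automaxprocs"
)

func main() {
	fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
}
```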

Regardless, the same characteristic GC pauses (I think) still exist, and CPU utilization is very high:
[screenshot: CPU utilization]

At this point, I think I might actually need a bigger machine to test this further. It seems we're maxing out CPU utilization on the 30 available cores; I'm not sure how far it'll grow 👀

@mga-chka
Collaborator Author

mga-chka commented Aug 3, 2023

Can you try the same benchmark on a smaller machine? Since you're doing up to 1,300 QPS, it could be the garbage collector grabbing as many resources as it can.

FYI, in a previous performance issue (fixed since then), the reporter got a much better QPS/CPU ratio (2 CPU cores for an average QPS of 800-850):
#264
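
A quick way to confirm whether GC is really the culprit is to log the runtime's GC CPU fraction. Here is a minimal standalone sketch of the idea (illustrative only, not chproxy code):

```go
package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	// Periodically log how much of the program's available CPU time the GC
	// has consumed since startup (GCCPUFraction) and the number of GC cycles.
	for range time.Tick(10 * time.Second) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("GC CPU fraction: %.2f%%, GC cycles: %d", m.GCCPUFraction*100, m.NumGC)
	}
}
```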

@matthax

matthax commented Aug 4, 2023

> Can you try the same benchmark on a smaller machine? Since you're doing up to 1,300 QPS, it could be the garbage collector grabbing as many resources as it can.
>
> FYI, in a previous performance issue (fixed since then), the reporter got a much better QPS/CPU ratio (2 CPU cores for an average QPS of 800-850): #264

So, an interesting discovery here. Please note we scaled the ClickHouse cluster since that last round of benchmarks, so this set of results isn't directly comparable to yesterday's. I went ahead and re-ran the benchmarks with 30, 16, 8, 6, and 4 cores.

30 CPU allocated

| Type | Name | Request Count | Failure Count | Median Response Time | Average Response Time | Min Response Time | Max Response Time | Average Content Size | Requests/s | Failures/s | 50% | 66% | 75% | 80% | 90% | 95% | 98% | 99% | 99.9% | 99.99% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| grpc | query_metrics | 465473 | 0 | 45 | 51.2230113525825 | 21.590776996163200 | 298.78877400187800 | 2.9337491111192300 | 1536.1861360607900 | 0.0 | 45 | 54 | 61 | 66 | 82 | 110 | 130 | 150 | 200 | 240 | 300 |
|  | Aggregated | 465473 | 0 | 45 | 51.2230113525825 | 21.590776996163200 | 298.78877400187800 | 2.9337491111192300 | 1536.1861360607900 | 0.0 | 45 | 54 | 61 | 66 | 82 | 110 | 130 | 150 | 200 | 240 | 300 |

16 CPU allocated

| Type | Name | Request Count | Failure Count | Median Response Time | Average Response Time | Min Response Time | Max Response Time | Average Content Size | Requests/s | Failures/s | 50% | 66% | 75% | 80% | 90% | 95% | 98% | 99% | 99.9% | 99.99% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| grpc | query_metrics | 465740 | 0 | 45 | 51.18937875467220 | 21.675225005310500 | 296.39671899349200 | 2.94348134152102 | 1537.0889316798300 | 0.0 | 45 | 54 | 61 | 66 | 82 | 110 | 130 | 150 | 190 | 240 | 300 |
|  | Aggregated | 465740 | 0 | 45 | 51.18937875467220 | 21.675225005310500 | 296.39671899349200 | 2.94348134152102 | 1537.0889316798300 | 0.0 | 45 | 54 | 61 | 66 | 82 | 110 | 130 | 150 | 190 | 240 | 300 |

8 CPU allocated

| Type | Name | Request Count | Failure Count | Median Response Time | Average Response Time | Min Response Time | Max Response Time | Average Content Size | Requests/s | Failures/s | 50% | 66% | 75% | 80% | 90% | 95% | 98% | 99% | 99.9% | 99.99% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| grpc | query_metrics | 469484 | 0 | 44 | 50.778038869628400 | 20.85051400354130 | 326.78853499965000 | 2.967626159783930 | 1549.4528706400000 | 0.0 | 44 | 53 | 61 | 66 | 82 | 110 | 130 | 150 | 200 | 250 | 330 |
|  | Aggregated | 469484 | 0 | 44 | 50.778038869628400 | 20.85051400354130 | 326.78853499965000 | 2.967626159783930 | 1549.4528706400000 | 0.0 | 44 | 53 | 61 | 66 | 82 | 110 | 130 | 150 | 200 | 250 | 330 |

6 CPU allocated

| Type | Name | Request Count | Failure Count | Median Response Time | Average Response Time | Min Response Time | Max Response Time | Average Content Size | Requests/s | Failures/s | 50% | 66% | 75% | 80% | 90% | 95% | 98% | 99% | 99.9% | 99.99% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| grpc | query_metrics | 472992 | 0 | 44 | 50.40309555482100 | 20.119906999752900 | 317.3640710010660 | 2.9199627055003000 | 1576.63319874048 | 0.0 | 44 | 53 | 60 | 65 | 81 | 110 | 130 | 150 | 190 | 240 | 320 |
|  | Aggregated | 472992 | 0 | 44 | 50.40309555482100 | 20.119906999752900 | 317.3640710010660 | 2.9199627055003000 | 1576.63319874048 | 0.0 | 44 | 53 | 60 | 65 | 81 | 110 | 130 | 150 | 190 | 240 | 320 |

4 CPU allocated

| Type | Name | Request Count | Failure Count | Median Response Time | Average Response Time | Min Response Time | Max Response Time | Average Content Size | Requests/s | Failures/s | 50% | 66% | 75% | 80% | 90% | 95% | 98% | 99% | 99.9% | 99.99% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| grpc | query_metrics | 473686 | 0 | 44 | 50.3294800098238 | 20.29696999670710 | 382.6167549996170 | 2.919239749538720 | 1563.3277452262800 | 0.0 | 44 | 53 | 60 | 65 | 81 | 100 | 130 | 140 | 190 | 250 | 380 |
|  | Aggregated | 473686 | 0 | 44 | 50.3294800098238 | 20.29696999670710 | 382.6167549996170 | 2.919239749538720 | 1563.3277452262800 | 0.0 | 44 | 53 | 60 | 65 | 81 | 100 | 130 | 140 | 190 | 250 | 380 |

Please note, I set GOMAXPROCS in each of these runs to the number of free cores.

While I am somewhat amused by the performance characteristics here, I think it would be worthwhile to run this down. Would you agree, or does it not seem like a priority issue?

My concern is that we wouldn't want to scale our proxy only horizontally; that has negative implications for open connections and concurrency limits. For example, with 6 proxy nodes we need to limit concurrent queries on each node to 1/6 of the total (assuming we're still using the default ClickHouse concurrency of 100, that's roughly 16 per node). If we continue to scale out, we'd have to tighten that per-proxy concurrency limit further and might see less even query distribution than we get with a smaller set of proxy nodes.
