Telemetry shard can be overwhelmed by too many connections #502

Open
i1i1 opened this issue Sep 30, 2022 · 4 comments

Comments

@i1i1
Contributor

i1i1 commented Sep 30, 2022

We discovered that at some point, when there are too many nodes submitting data (more than 15,000 nodes), the shard starts to leak memory.

Our suspicion was that, due to Rust's async nature, the shard didn't have enough time to handle all messages from nodes and send them to the core (the node WebSocket connections just take all the CPU time). So something like subspace#42 could be implemented.

We also discovered that deploying more shards on a single machine helps.

@jsdw
Collaborator

jsdw commented Oct 5, 2022

It's not obvious to me at a glance what your code changes do in the linked PR; would you be able to summarize?

If a shard cannot process and send data fast enough, its internal unbounded channels will build up more and more messages, since it can't work through them as fast as it receives them (not technically a leak, though, I think). The same is ultimately true of the core; it can eventually get overwhelmed if it cannot process incoming data quickly enough.
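
For illustration, here is a minimal sketch (not the shard's actual code; the message type and timings are assumptions) of how an unbounded tokio mpsc channel builds up when the consumer is slower than the producer:

```rust
use std::time::Duration;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Unbounded channel: nothing applies backpressure to the sender.
    let (tx, mut rx) = mpsc::unbounded_channel::<Vec<u8>>();

    // Fast producer, standing in for many node websocket connections.
    tokio::spawn(async move {
        loop {
            // Queue ~1 KiB per message; the channel will happily buffer millions.
            let _ = tx.send(vec![0u8; 1024]);
            tokio::task::yield_now().await;
        }
    });

    // Slow consumer, standing in for the shard -> core forwarding loop.
    while let Some(msg) = rx.recv().await {
        tokio::time::sleep(Duration::from_millis(10)).await;
        drop(msg);
    }
}
```

Memory grows steadily even though every buffered message is still reachable, which matches the "not technically a leak" observation above.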

There are a bunch of solutions of different complexities. Bounded channels, for instance, are a nice idea in theory, but earlier versions did attempt this approach and it's very easy to run into deadlock situations. Limiting the number of connections allowed to a shard is a fairly simple approach (and a limit could be enforced via a proxy in front of the shard). Running more shards is of course the way to ultimately handle more connections (and things like k8s can spin up more shards as needed to cope with demand).

So I think the best solution for now is just to run more shards or limit connections to the shard using a proxy in front of it if it can't cope with demand :)
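
As a rough sketch of the connection-limiting idea, here is one way it could look at the application level with a semaphore rather than a proxy (the port, limit, and handler below are assumptions, not the shard's real code):

```rust
use std::sync::Arc;
use tokio::net::TcpListener;
use tokio::sync::Semaphore;

// Assumed cap; in practice this would be a CLI flag or config value.
const MAX_CONNECTIONS: usize = 10_000;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // Assumed listen address; substitute whatever the shard actually binds to.
    let listener = TcpListener::bind("0.0.0.0:8001").await?;
    let permits = Arc::new(Semaphore::new(MAX_CONNECTIONS));

    loop {
        let (socket, _addr) = listener.accept().await?;
        // If all permits are taken, drop the connection instead of queueing it.
        match permits.clone().try_acquire_owned() {
            Ok(permit) => {
                tokio::spawn(async move {
                    // handle_node_connection is hypothetical; it stands in for
                    // the usual per-connection websocket/telemetry handling.
                    handle_node_connection(socket).await;
                    drop(permit); // permit is released when the connection ends
                });
            }
            Err(_) => drop(socket),
        }
    }
}

async fn handle_node_connection(_socket: tokio::net::TcpStream) {
    // Placeholder for the real per-connection logic.
}
```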

@jsdw jsdw changed the title Telemetry shard: memory leak Telemetry shard can be overwhelmed by too many connections Oct 5, 2022
@nazar-pc

nazar-pc commented Oct 5, 2022

The solution in the linked PR is to have more connections to the core. It would be nice for the telemetry shard to automatically benefit from the multiple CPU cores available instead of being limited to just one.

@i1i1
Contributor Author

i1i1 commented Oct 5, 2022

> It's not obvious to me at a glance what your code changes do in the linked PR; would you be able to summarize?

As Nazar said, the PR introduces more connections to the core, which run as background tasks. The point is that sending data would be parallelized.
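
For illustration, a minimal sketch of that idea as described (the names and types are assumptions, not the PR's actual code): several aggregator tasks, each owning its own connection to the core, with each node assigned to one of them at random:

```rust
use rand::Rng;
use tokio::sync::mpsc;

// Hypothetical message type standing in for the telemetry payloads.
type NodeMessage = Vec<u8>;

// Must be called from within a tokio runtime, since it spawns tasks.
fn spawn_aggregators(n: usize) -> Vec<mpsc::UnboundedSender<NodeMessage>> {
    (0..n)
        .map(|i| {
            let (tx, mut rx) = mpsc::unbounded_channel::<NodeMessage>();
            tokio::spawn(async move {
                // Each task would open and own its own connection to the core.
                while let Some(msg) = rx.recv().await {
                    // forward_to_core is hypothetical: send `msg` over this
                    // task's dedicated connection to the core.
                    forward_to_core(i, msg).await;
                }
            });
            tx
        })
        .collect()
}

// Each node connection picks one aggregator at random and sticks with it.
fn assign_node(
    aggregators: &[mpsc::UnboundedSender<NodeMessage>],
) -> mpsc::UnboundedSender<NodeMessage> {
    let idx = rand::thread_rng().gen_range(0..aggregators.len());
    aggregators[idx].clone()
}

async fn forward_to_core(_aggregator_id: usize, _msg: NodeMessage) {
    // Placeholder for the real send over a websocket connection to the core.
}
```

Because each task drains its own channel independently, serialising and sending to the core can proceed in parallel across however many worker threads the runtime has.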

@jsdw
Collaborator

jsdw commented Oct 5, 2022

Ah yup ok I see now; you spawn multiple aggregators and each node that connects is assigned randomly to one of those. Clever :)

I'm not super against that approach (though, being so IO-bound, I wonder how much it actually helps?). That said, the general point of the shard+core architecture is that you would spawn multiple shards to distribute the connection load when the numbers get too great (and of course these can be on the same or different machines, and they fully parallelise the entire flow, not just the "aggregator" bit).

I made a similar (but more fiddly) tweak in the Core, allowing multiple aggregator loops to be spun up to spread the load of feed connections, but I would prefer to also remove that and make it possible to spawn multiple independent telemetry_core services too if we start to hit a bottleneck with feed subscriptions.

TL;DR I'd prolly rather we stick to spawning more instances and try to keep the code in each instance as simple as possible (eventually I'd like to remove similar code from the telemetry-core for the same reason).
