Telemetry shard can be overwhelmed by too many connections #502

Open
i1i1 opened this issue Sep 30, 2022 · 4 comments

Comments

@i1i1
Contributor

i1i1 commented Sep 30, 2022

We discovered that at some point, when there are too many nodes submitting data (more than 15,000 nodes), the shard starts to leak memory.

Our suspicion was that, due to Rust's async nature, the shard didn't have enough time to handle all messages from nodes and send them to the core (the node WebSocket connections just take all the CPU time). So something like subspace#42 could be implemented.

We also discovered that deploying more shards on a single machine helps.

@jsdw
Collaborator

jsdw commented Oct 5, 2022

It's not obvious to me at a glance what your code changes do in the linked PR; would you be able to summarize?

If a shard cannot process and send data fast enough, its internal unbounded channels will build up more and more messages, since it can't work through them as fast as it receives them (not technically a leak, though, I think). The same is ultimately true of the core; it can eventually get overwhelmed if it cannot process incoming data quickly enough.
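
For illustration, here is a minimal sketch (not the shard's actual code; the message type and timings are assumptions) of how an unbounded tokio mpsc channel builds up when the consumer is slower than the producer:

```rust
use std::time::Duration;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Unbounded channel: nothing applies backpressure to the sender.
    let (tx, mut rx) = mpsc::unbounded_channel::<Vec<u8>>();

    // Fast producer, standing in for many node websocket connections.
    tokio::spawn(async move {
        loop {
            // Queue ~1 KiB per message; the channel will happily buffer millions.
            let _ = tx.send(vec![0u8; 1024]);
            tokio::task::yield_now().await;
        }
    });

    // Slow consumer, standing in for the shard -> core forwarding loop.
    while let Some(msg) = rx.recv().await {
        tokio::time::sleep(Duration::from_millis(10)).await;
        drop(msg);
    }
}
```

Memory grows steadily even though every buffered message is still reachable, which matches the "not technically a leak" observation above.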

There are a bunch of solutions of different complexities. Bounded channels, for instance, are a nice idea in theory, but earlier versions did attempt this approach and it's very easy to run into deadlock situations. Limiting the number of connections allowed to a shard is a fairly simple approach (and a limit could be enforced via a proxy in front of the shard). Running more shards is of course the way to ultimately handle more connections (and things like k8s can spin up more shards as needed to cope with demand).

So I think the best solution for now is just to run more shards or limit connections to the shard using a proxy in front of it if it can't cope with demand :)
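
As a rough sketch of the connection-limiting idea, here is one way it could look at the application level with a semaphore rather than a proxy (the port, limit, and handler below are assumptions, not the shard's real code):

```rust
use std::sync::Arc;
use tokio::net::TcpListener;
use tokio::sync::Semaphore;

// Assumed cap; in practice this would be a CLI flag or config value.
const MAX_CONNECTIONS: usize = 10_000;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // Assumed listen address; substitute whatever the shard actually binds to.
    let listener = TcpListener::bind("0.0.0.0:8001").await?;
    let permits = Arc::new(Semaphore::new(MAX_CONNECTIONS));

    loop {
        let (socket, _addr) = listener.accept().await?;
        // If all permits are taken, drop the connection instead of queueing it.
        match permits.clone().try_acquire_owned() {
            Ok(permit) => {
                tokio::spawn(async move {
                    // handle_node_connection is hypothetical; it stands in for
                    // the usual per-connection websocket/telemetry handling.
                    handle_node_connection(socket).await;
                    drop(permit); // permit is released when the connection ends
                });
            }
            Err(_) => drop(socket),
        }
    }
}

async fn handle_node_connection(_socket: tokio::net::TcpStream) {
    // Placeholder for the real per-connection logic.
}
```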

@jsdw jsdw changed the title Telemetry shard: memory leak Telemetry shard can be overwhelmed by too many connections Oct 5, 2022
@nazar-pc

nazar-pc commented Oct 5, 2022

The solution in the linked PR is to have more connections to the core. It would be nice for the telemetry shard to automatically benefit from the multiple CPU cores available instead of being limited to just one.

@i1i1
Contributor Author

i1i1 commented Oct 5, 2022

> It's not obvious to me at a glance what your code changes do in the linked PR; would you be able to summarize?

As Nazar said, the PR introduces more connections to the core, which run as background tasks. The point is that sending data would be parallelized.
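
For illustration, a minimal sketch of that idea as described (the names and types are assumptions, not the PR's actual code): several aggregator tasks, each owning its own connection to the core, with each node assigned to one of them at random:

```rust
use rand::Rng;
use tokio::sync::mpsc;

// Hypothetical message type standing in for the telemetry payloads.
type NodeMessage = Vec<u8>;

// Must be called from within a tokio runtime, since it spawns tasks.
fn spawn_aggregators(n: usize) -> Vec<mpsc::UnboundedSender<NodeMessage>> {
    (0..n)
        .map(|i| {
            let (tx, mut rx) = mpsc::unbounded_channel::<NodeMessage>();
            tokio::spawn(async move {
                // Each task would open and own its own connection to the core.
                while let Some(msg) = rx.recv().await {
                    // forward_to_core is hypothetical: send `msg` over this
                    // task's dedicated connection to the core.
                    forward_to_core(i, msg).await;
                }
            });
            tx
        })
        .collect()
}

// Each node connection picks one aggregator at random and sticks with it.
fn assign_node(
    aggregators: &[mpsc::UnboundedSender<NodeMessage>],
) -> mpsc::UnboundedSender<NodeMessage> {
    let idx = rand::thread_rng().gen_range(0..aggregators.len());
    aggregators[idx].clone()
}

async fn forward_to_core(_aggregator_id: usize, _msg: NodeMessage) {
    // Placeholder for the real send over a websocket connection to the core.
}
```

Because each task drains its own channel independently, serialising and sending to the core can proceed in parallel across however many worker threads the runtime has.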

@jsdw
Collaborator

jsdw commented Oct 5, 2022

Ah yup ok I see now; you spawn multiple aggregators and each node that connects is assigned randomly to one of those. Clever :)

I'm not super against that approach (though, being so IO-bound, I wonder how much it actually helps?). That said, the general point of the shard+core architecture is that you would spawn multiple shards to distribute the connection load when the numbers get too great (and of course these can be on the same or different machines, and they fully parallelise the entire flow, not just the "aggregator" bit).

I made a similar (but more fiddly) tweak in the Core, allowing multiple aggregator loops to be spun up to spread the load of feed connections, but I would prefer to also remove that and make it possible to spawn multiple independent telemetry_core services too if we start to hit a bottleneck with feed subscriptions.

TL;DR I'd prolly rather we stick to spawning more instances and try to keep the code in each instance as simple as possible (eventually I'd like to remove similar code from the telemetry-core for the same reason).
