
omfwd: implement native load balancing - phase 1 #5369

Draft: wants to merge 4 commits into master from omfwd-loadbalance-s1
Conversation

rgerhards
Member

This patch implements a very simple round-robin load balancer for omfwd. Its intent is to provide a solid base on which more elaborate functionality can be built.

That said, the load balancing basics will be in place, including the handling of failing load balancing targets and improved and adapted pstats counters.
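To make the round-robin-with-failover idea concrete, here is a minimal sketch in C. The type and function names are illustrative only, not the actual omfwd code; the assumption is simply that each pool member carries a "suspended" flag and selection skips suspended members:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pool member -- not the real omfwd structure. */
typedef struct {
    const char *addr;
    bool suspended; /* set when the target failed and awaits resume */
} target_t;

/* Pick the next usable target, round-robin, skipping suspended ones.
 * Returns the pool index, or -1 if all targets are down. */
static int
select_target(const target_t *pool, size_t npool, size_t *next_idx)
{
    for (size_t tried = 0; tried < npool; ++tried) {
        size_t i = *next_idx;
        *next_idx = (*next_idx + 1) % npool;
        if (!pool[i].suspended)
            return (int)i;
    }
    return -1; /* all targets suspended */
}
```

A failed send would mark the chosen target suspended; a periodic timer (cf. pool.resumeinterval below) would clear the flag and retry it.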

The new functionality is fully backwards compatible with previous configuration settings.

New action() config params:

  • pool.resumeinterval
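A minimal configuration sketch. This is hedged: pool.resumeinterval is the parameter named above, but the array form of target= for specifying a pool is an assumption here, and the addresses and value are placeholders:

```
action(type="omfwd"
       target=["192.0.2.10", "192.0.2.11"]   # pool of targets (assumed syntax)
       port="514"
       protocol="tcp"
       pool.resumeinterval="30")             # seconds before retrying a failed target
```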

New/"changed" pstats counters
Each target receives its own set of pstats counters. Most importantly this is the case for byte counts. That counter retains the same naming, but there may now be multiple of these counters, one for each target ip, port tuple.

New pstats message count to target
Among others, this can be used to check that the load balancer works as intended. The byte count emitted so far does not clearly indicate how many messages the targets actually processed.

For obvious reasons, this message count makes most sense in advanced load balancing scenarios, but also provides additional insight into round-robin. Non-matches indicate that targets went offline, and we can now evaluate the impact this had on processing.
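As an illustration only (the exact counter names and output format are determined by the patch and are not reproduced here), per-target impstats output might look along these lines, with one counter set per (ip, port) tuple:

```
# purely hypothetical impstats-style lines, one set per target
TCP-192.0.2.10:514: bytes.sent=104857 messages.sent=1024
TCP-192.0.2.11:514: bytes.sent=103210 messages.sent=1019
```

A mismatch between the per-target message counts would indicate that a target went offline for part of the run.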

@rgerhards rgerhards added this to the v8.2406 milestone Apr 29, 2024
@rgerhards rgerhards self-assigned this Apr 29, 2024
@rgerhards rgerhards marked this pull request as draft April 29, 2024 10:16
@rgerhards rgerhards force-pushed the omfwd-loadbalance-s1 branch 9 times, most recently from fce6d35 to 60d1ada Compare May 2, 2024 10:44
@rgerhards rgerhards force-pushed the omfwd-loadbalance-s1 branch 10 times, most recently from 8fec56f to 0425a5d Compare May 6, 2024 15:07
@davidelang
Contributor

As a future enhancement, if a DNS lookup returns multiple IP addresses, use this to RR between them
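The suggested enhancement could be sketched with standard getaddrinfo(): collect every address the lookup returns, then round-robin across them. This is an illustrative sketch under stated assumptions, not omfwd code; buffer sizes and helper names are made up:

```c
#include <assert.h>
#include <netdb.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

#define MAX_ADDRS 16 /* arbitrary illustrative limit */

/* Resolve host and copy out up to max addresses; returns count (0 on error). */
static size_t
collect_addrs(const char *host, struct sockaddr_storage *out, size_t max)
{
    struct addrinfo hints, *res, *p;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;     /* IPv4 and IPv6 */
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, NULL, &hints, &res) != 0)
        return 0;
    size_t n = 0;
    for (p = res; p != NULL && n < max; p = p->ai_next) {
        memcpy(&out[n], p->ai_addr, p->ai_addrlen);
        ++n;
    }
    freeaddrinfo(res);
    return n;
}

/* Round-robin index over the collected addresses. */
static size_t
next_addr(size_t *rr, size_t n)
{
    size_t i = *rr % n;
    *rr = (*rr + 1) % n;
    return i;
}
```

Each send would then connect to addrs[next_addr(&rr, n)] instead of always using the first resolved address.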

@davidelang
Contributor

We still want to have rebindInterval for people who have other load balancers

@rgerhards
Member Author

We still want to have rebindInterval for people who have other load balancers

I guess you looked at the refactoring commit. That commit just moves the functionality out of a helper library into the main omfwd module. It does NOT disable the functionality; it is a code cleanup, as omfwd is the only module using it.

Side note: while doing the enhancement work, I noticed that rebindInterval causes some mild message duplication, and has for quite some time. This went unnoticed. To address it efficiently, rebindInterval will in the future be evaluated once per batch. That means up to (maxBatchSize - 1) messages more than rebindInterval specifies may be transmitted over a single connection. That is the cleanest mode of operation and should not make any difference for real deployments.
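The once-per-batch check described above can be sketched as follows. This is a minimal illustration, not the actual omfwd implementation; the structure and function names are invented:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-connection state -- illustrative names only. */
typedef struct {
    unsigned rebind_interval;  /* rebind after this many messages; 0 = off */
    unsigned msgs_since_bind;  /* messages sent on the current connection */
} conn_state_t;

/* Called once per batch, BEFORE submitting it. Returns true if the
 * connection should be rebound first. Because the check happens only
 * at batch boundaries, up to (maxBatchSize - 1) messages beyond the
 * configured interval may go out on one connection. */
static bool
need_rebind(conn_state_t *c)
{
    if (c->rebind_interval == 0)
        return false;
    if (c->msgs_since_bind >= c->rebind_interval) {
        c->msgs_since_bind = 0; /* fresh connection after rebind */
        return true;
    }
    return false;
}

/* Account for a batch that was fully written to the target. */
static void
batch_sent(conn_state_t *c, unsigned batch_size)
{
    c->msgs_since_bind += batch_size;
}
```

Checking only at batch boundaries is what guarantees a complete transaction is written before the connection is torn down, avoiding the duplication the message-level check could cause.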

@rgerhards
Member Author

As a future enhancement, if a DNS lookup returns multiple IP addresses, use this to RR between them

That's a good idea, but let's get the core functionality in place first. Most importantly, that request requires many more changes to helper objects. Please note that the load balancing, as well as other upcoming work, is a sponsored effort, so I need to concentrate on the sponsor's needs first. I am very glad we found someone to help make this important enhancement finally happen. Note that the code change looks small, but there is a lot to consider "in the background" to integrate load balancing into the engine in a useful way. There are lots of subtle issues; the current approach IMHO addresses them best.

- re-design rebind functionality

This now works at the transaction level and causes a rebind of all
pool members. The previous code has not worked 100% correctly for a
couple of years now (since output batching was integrated).

As a cleanup, rebindInterval support has been removed from tcpClt,
because omfwd is its only user. This permits a cleaner code path.

We also noticed a bug with rebindInterval: it caused some mild
message duplication, and had for quite some time. This went unnoticed.
To address that efficiently, rebindInterval will in the future be
considered once per batch. That means up to (maxBatchSize - 1)
messages more than rebindInterval specifies may be transmitted.
That is the cleanest mode of operation and should not make any
difference for real deployments.
@davidelang
Contributor

davidelang commented May 7, 2024 via email

@davidelang
Contributor

davidelang commented May 7, 2024 via email

@rgerhards
Member Author

lose the connection, reopen the connection, and start sending the next batch.

It happens at the message level, not the batch level. That is also where the potential duplication comes in. The new mode (found in omfwd.c) works on batches, and there we can ensure that a complete "transaction" is written to the remote target before a rebind is done.

The issue, I think, appeared some years ago when we introduced the new transactional interface alongside the old-style method. It's not a big issue, so it went unnoticed until my current work.

@rgerhards
Member Author

At some point I'd be interested to hear about the background considerations (especially the subtle issues), but not urgent

Just the most important point first: the choice was between a) rewriting large parts of the rsyslog core (well, not a complete rewrite, but a large set of changes) while working with the existing interfaces, or b) emulating, for a pool, the same behaviour as for a single target. After a lot of thinking, I went for b), which also seems to be the simpler and highly efficient method. So far, the results look very good.
