
omfwd: implement native load balancing - phase 1 #5369

Draft: wants to merge 4 commits into master from omfwd-loadbalance-s1
Conversation

rgerhards
Member

This patch implements a very simple round-robin load balancer for omfwd. Its intent is to provide a solid base on which more elaborate functionality can be built.

That said, the load balancing basics will be in place, including the handling of failing load balancing targets and improved and adapted pstats counters.
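To make the round-robin-with-failover idea concrete, here is a minimal sketch in C. The type and function names are illustrative only, not the actual omfwd code; the assumption is simply that each pool member carries a "suspended" flag and selection skips suspended members:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pool member -- not the real omfwd structure. */
typedef struct {
    const char *addr;
    bool suspended; /* set when the target failed and awaits resume */
} target_t;

/* Pick the next usable target, round-robin, skipping suspended ones.
 * Returns the pool index, or -1 if all targets are down. */
static int
select_target(const target_t *pool, size_t npool, size_t *next_idx)
{
    for (size_t tried = 0; tried < npool; ++tried) {
        size_t i = *next_idx;
        *next_idx = (*next_idx + 1) % npool;
        if (!pool[i].suspended)
            return (int)i;
    }
    return -1; /* all targets suspended */
}
```

A failed send would mark the chosen target suspended; a periodic timer (cf. pool.resumeinterval below) would clear the flag and retry it.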

The new functionality is fully backwards compatible with previous configuration settings.

New action() config params:

  • pool.resumeinterval
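A minimal configuration sketch. This is hedged: pool.resumeinterval is the parameter named above, but the array form of target= for specifying a pool is an assumption here, and the addresses and value are placeholders:

```
action(type="omfwd"
       target=["192.0.2.10", "192.0.2.11"]   # pool of targets (assumed syntax)
       port="514"
       protocol="tcp"
       pool.resumeinterval="30")             # seconds before retrying a failed target
```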

New/"changed" pstats counters
Each target receives its own set of pstats counters. Most importantly this is the case for byte counts. That counter retains the same naming, but there may now be multiple of these counters, one for each target ip, port tuple.

New pstats message count to target
Among others, this can be used to check that the load balancer works as intended. The byte count emitted so far does not clearly indicate how many messages the targets actually processed.

For obvious reasons, this message count makes most sense in advanced load balancing scenarios, but also provides additional insight into round-robin. Non-matches indicate that targets went offline, and we can now evaluate the impact this had on processing.
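As an illustration only (the exact counter names and output format are determined by the patch and are not reproduced here), per-target impstats output might look along these lines, with one counter set per (ip, port) tuple:

```
# purely hypothetical impstats-style lines, one set per target
TCP-192.0.2.10:514: bytes.sent=104857 messages.sent=1024
TCP-192.0.2.11:514: bytes.sent=103210 messages.sent=1019
```

A mismatch between the per-target message counts would indicate that a target went offline for part of the run.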

@rgerhards rgerhards added this to the v8.2406 milestone Apr 29, 2024
@rgerhards rgerhards self-assigned this Apr 29, 2024
@rgerhards rgerhards marked this pull request as draft April 29, 2024 10:16
@rgerhards rgerhards force-pushed the omfwd-loadbalance-s1 branch 9 times, most recently from fce6d35 to 60d1ada Compare May 2, 2024 10:44
@rgerhards rgerhards force-pushed the omfwd-loadbalance-s1 branch 10 times, most recently from 8fec56f to 0425a5d Compare May 6, 2024 15:07
@davidelang
Contributor

As a future enhancement, if a DNS lookup returns multiple IP addresses, use this to RR between them
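The suggested enhancement could be sketched with standard getaddrinfo(): collect every address the lookup returns, then round-robin across them. This is an illustrative sketch under stated assumptions, not omfwd code; buffer sizes and helper names are made up:

```c
#include <assert.h>
#include <netdb.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

#define MAX_ADDRS 16 /* arbitrary illustrative limit */

/* Resolve host and copy out up to max addresses; returns count (0 on error). */
static size_t
collect_addrs(const char *host, struct sockaddr_storage *out, size_t max)
{
    struct addrinfo hints, *res, *p;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;     /* IPv4 and IPv6 */
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, NULL, &hints, &res) != 0)
        return 0;
    size_t n = 0;
    for (p = res; p != NULL && n < max; p = p->ai_next) {
        memcpy(&out[n], p->ai_addr, p->ai_addrlen);
        ++n;
    }
    freeaddrinfo(res);
    return n;
}

/* Round-robin index over the collected addresses. */
static size_t
next_addr(size_t *rr, size_t n)
{
    size_t i = *rr % n;
    *rr = (*rr + 1) % n;
    return i;
}
```

Each send would then connect to addrs[next_addr(&rr, n)] instead of always using the first resolved address.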

@davidelang
Contributor

We still want to have rebindInterval for people who have other load balancers

@rgerhards
Member Author

We still want to have rebindInterval for people who have other load balancers

I guess you looked at the refactoring commit. That commit just moves the functionality out of a helper library into the main omfwd module. It does NOT disable the functionality; it is a code cleanup, as omfwd is the only module using it.

Side note: while doing the enhancement work, I noticed that rebindInterval causes some mild message duplication, and has for quite some time. This went unnoticed. To address it efficiently, rebindInterval will in the future be evaluated once per batch. That means up to (maxBatchSize - 1) messages more than rebindInterval specifies may be transmitted over a single connection. That is the cleanest mode of operation and should not make any difference for real deployments.
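The once-per-batch check described above can be sketched as follows. This is a minimal illustration, not the actual omfwd implementation; the structure and function names are invented:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-connection state -- illustrative names only. */
typedef struct {
    unsigned rebind_interval;  /* rebind after this many messages; 0 = off */
    unsigned msgs_since_bind;  /* messages sent on the current connection */
} conn_state_t;

/* Called once per batch, BEFORE submitting it. Returns true if the
 * connection should be rebound first. Because the check happens only
 * at batch boundaries, up to (maxBatchSize - 1) messages beyond the
 * configured interval may go out on one connection. */
static bool
need_rebind(conn_state_t *c)
{
    if (c->rebind_interval == 0)
        return false;
    if (c->msgs_since_bind >= c->rebind_interval) {
        c->msgs_since_bind = 0; /* fresh connection after rebind */
        return true;
    }
    return false;
}

/* Account for a batch that was fully written to the target. */
static void
batch_sent(conn_state_t *c, unsigned batch_size)
{
    c->msgs_since_bind += batch_size;
}
```

Checking only at batch boundaries is what guarantees a complete transaction is written before the connection is torn down, avoiding the duplication the message-level check could cause.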

@rgerhards
Member Author

As a future enhancement, if a DNS lookup returns multiple IP addresses, use this to RR between them

That's a good idea, but let's get the core functionality in place first. Most importantly, that request requires many more changes to helper objects. Please note that the load balancing, as well as other upcoming work, is a sponsored effort, so I need to concentrate on the sponsor's needs first. I am very glad we found someone to help make this important enhancement finally happen. Note that the code change looks small, but there is a lot to consider "in the background" to integrate load balancing into the engine in a useful way. There are lots of subtle issues; the current approach IMHO addresses them best.

- re-design rebind functionality

This now works at the transaction level and causes a rebind of all
pool members. The previous code has not worked 100% correctly for a
couple of years now (since output batching was integrated).

As a cleanup, rebindInterval support has been removed from tcpClt,
because omfwd is its only user. This permits a cleaner code path.

We also noticed a bug with rebindInterval: it caused some mild
message duplication, and had for quite some time. This went unnoticed.
To address that efficiently, rebindInterval will in the future be
considered once per batch. That means up to (maxBatchSize - 1)
messages more than rebindInterval specifies may be transmitted.
That is the cleanest mode of operation and should not make any
difference for real deployments.
@davidelang
Contributor

davidelang commented May 7, 2024 via email

@davidelang
Contributor

davidelang commented May 7, 2024 via email

@rgerhards
Member Author

lose the connection, reopen the connection, and start sending the next batch.

It happens at the message level, not the batch level. That is also where the potential duplication comes in. The new mode (found in omfwd.c) works on batches, and there we can ensure that a complete "transaction" is written to the remote target before a rebind is done.

The issue, I think, appeared some years ago when we introduced the new transactional interface alongside the old-style method. It's not a big issue, so it went unnoticed until my current work.

@rgerhards
Member Author

At some point I'd be interested to hear about the background considerations (especially the subtle issues), but not urgent

Just the most important point first: the choice was between a) rewriting large parts of the rsyslog core (well, not a complete rewrite, but a large set of changes) while working with the existing interfaces, or b) emulating, for a pool, the same behaviour as for a single target. After a lot of thinking, I went for b), which also seems to be the simpler and highly efficient method. So far, the results look very good.
