New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
threaded dests: add reopen()
with exponential backoff strategy
#4825
base: master
Are you sure you want to change the base?
threaded dests: add reopen()
with exponential backoff strategy
#4825
Conversation
This Pull Request introduces config grammar changessyslog-ng/12763388b6a838f8a03e769af139ad6f49664b6b -> alltilla/exponential-backoff-threaded-dest --- a/destination
+++ b/destination
amqp(
+ reopen(
+ <empty>
+ initial-seconds(<nonnegative-float>)
+ maximum-seconds(<nonnegative-float>)
+ multiplier(<positive-float>)
+ )
)
bigquery(
+ reopen(
+ <empty>
+ initial-seconds(<nonnegative-float>)
+ maximum-seconds(<nonnegative-float>)
+ multiplier(<positive-float>)
+ )
)
example-destination(
+ reopen(
+ <empty>
+ initial-seconds(<nonnegative-float>)
+ maximum-seconds(<nonnegative-float>)
+ multiplier(<positive-float>)
+ )
)
http(
+ reopen(
+ <empty>
+ initial-seconds(<nonnegative-float>)
+ maximum-seconds(<nonnegative-float>)
+ multiplier(<positive-float>)
+ )
)
java(
+ reopen(
+ <empty>
+ initial-seconds(<nonnegative-float>)
+ maximum-seconds(<nonnegative-float>)
+ multiplier(<positive-float>)
+ )
)
kafka-c(
+ reopen(
+ <empty>
+ initial-seconds(<nonnegative-float>)
+ maximum-seconds(<nonnegative-float>)
+ multiplier(<positive-float>)
+ )
)
loki(
+ reopen(
+ <empty>
+ initial-seconds(<nonnegative-float>)
+ maximum-seconds(<nonnegative-float>)
+ multiplier(<positive-float>)
+ )
)
mongodb(
+ reopen(
+ <empty>
+ initial-seconds(<nonnegative-float>)
+ maximum-seconds(<nonnegative-float>)
+ multiplier(<positive-float>)
+ )
)
mqtt(
+ reopen(
+ <empty>
+ initial-seconds(<nonnegative-float>)
+ maximum-seconds(<nonnegative-float>)
+ multiplier(<positive-float>)
+ )
)
opentelemetry(
+ reopen(
+ <empty>
+ initial-seconds(<nonnegative-float>)
+ maximum-seconds(<nonnegative-float>)
+ multiplier(<positive-float>)
+ )
)
python(
+ reopen(
+ <empty>
+ initial-seconds(<nonnegative-float>)
+ maximum-seconds(<nonnegative-float>)
+ multiplier(<positive-float>)
+ )
)
redis(
+ reopen(
+ <empty>
+ initial-seconds(<nonnegative-float>)
+ maximum-seconds(<nonnegative-float>)
+ multiplier(<positive-float>)
+ )
)
riemann(
+ reopen(
+ <empty>
+ initial-seconds(<nonnegative-float>)
+ maximum-seconds(<nonnegative-float>)
+ multiplier(<positive-float>)
+ )
)
smtp(
+ reopen(
+ <empty>
+ initial-seconds(<nonnegative-float>)
+ maximum-seconds(<nonnegative-float>)
+ multiplier(<positive-float>)
+ )
)
sql(
+ reopen(
+ <empty>
+ initial-seconds(<nonnegative-float>)
+ maximum-seconds(<nonnegative-float>)
+ multiplier(<positive-float>)
+ )
)
stomp(
+ reopen(
+ <empty>
+ initial-seconds(<nonnegative-float>)
+ maximum-seconds(<nonnegative-float>)
+ multiplier(<positive-float>)
+ )
)
syslog-ng-otlp(
+ reopen(
+ <empty>
+ initial-seconds(<nonnegative-float>)
+ maximum-seconds(<nonnegative-float>)
+ multiplier(<positive-float>)
+ )
)
|
Signed-off-by: Attila Szakacs <attila.szakacs@axoflow.com>
Signed-off-by: Attila Szakacs <attila.szakacs@axoflow.com>
Signed-off-by: Attila Szakacs <attila.szakacs@axoflow.com>
Signed-off-by: Attila Szakacs <attila.szakacs@axoflow.com>
1635fcf
to
fbf1ebc
Compare
Build FAILURE |
Signed-off-by: Attila Szakacs <attila.szakacs@axoflow.com>
Signed-off-by: Attila Szakacs <attila.szakacs@axoflow.com>
#include "syslog-ng.h" | ||
|
||
#define EXPONENTIAL_BACKOFF_OPTIONS_INITIAL_SECONDS_DEFAULT (0.1) | ||
#define EXPONENTIAL_BACKOFF_OPTIONS_MAXIMUM_SECONDS_DEFAULT (1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems very low in my opinion compared to the old default. I think we shouldn't go lower than 10 seconds by default. If we are planning to introduce this for traditional syslog/network drivers too, I would also consider going with different defaults there (or just keep the old 60 seconds as maximum?).
Note: connection attempts are logged internally on a high log level. For example, using the HTTP produces 2 error messages with each reopen attempt:
curl: error sending HTTP request; url='...', error='Couldn\'t connect to server', worker_index='0', driver='#anon-destination0#1', location='...'
Server disconnected while preparing messages for sending, trying again; driver='#anon-destination0#1', location='...', worker_index='0', reopen='0.00', batch_size='1'
TLDR: Do we really want 1
as maximum? If so, then we should do something about internal logging connection attempts as we are currently spamming 2-3 error messages every second per worker.
In my opinion, we should rather choose a higher default (60 or 10 would be my suggestion) and keep the logging as-is. The exponential retry mechanism will make sure that we reconnect instantly in case of momentary errors, and we even have 5 more attempts before reaching 1
second of wait time and 10 attempts before reaching 60
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assumed this was the "default" for max seconds, unless specified otherwise.
static gdouble | ||
_get_next_reopen_seconds(LogThreadedDestWorker *self) | ||
{ | ||
if (self->time_reopen != -1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This value is set explicitly by the following drivers, disabling the exponential backoff feature:
- Kafka
- Python
@@ -333,6 +359,7 @@ static void | |||
_process_result_success(LogThreadedDestWorker *self) | |||
{ | |||
_accept_batch(self); | |||
exponential_backoff_reset(self->exponential_backoff); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a double-reset here due to the function responsible for "explicit ack management".
It doesn't cause any problems, so I guess it's okay.
No description provided.