Very large memory-consumption in SplunkHttp #265

Open · UnitedMarsupials opened this issue Nov 22, 2022 · 7 comments

@UnitedMarsupials commented Nov 22, 2022

Depending on whether Splunk-logging is enabled in log4j2.xml:

    <SplunkHttp
        ...
        batch_size_count="2099"
        batch_interval="3"/>

our application's total heap usage rises by about 4 GB, as can be seen in the jconsole charts below. Granted, the application is verbose, with multiple events per second at times, but 4 GB still seems excessive...

The receiving HEC is the same one used across the entire enterprise -- I doubt that's the bottleneck. Please advise.

Edit: #249, which seems to describe the same or a similar problem, is closed -- perhaps prematurely. There is no good way to handle the Splunk server being down (or slow); perhaps there should be an option to simply drop log entries when that happens -- to preserve heap -- based on the events' age and/or severity.

The total count of such dropped messages could be kept -- and logged on its own when possible: "HEC latency necessitated dropping of %u events".

[jconsole heap charts: with Splunk-logging vs. without Splunk-logging]

@oliver-brm

Seeing the same issue. In our case, HTTP requests get stuck in the HTTP client's queue (okhttp3.Dispatcher.readyAsyncCalls). Same on your side?

@twaslowski

Experiencing the same issue. In our case this even caused our application, running on AWS Fargate, to crash: the G1 garbage collector would take up 100% of the available CPU cycles, which eventually led to ECS shutting the application down because it wasn't responding to health checks in a timely fashion.

Our temporary fix was to decrease the amount of logs we're sending to Splunk and to increase the heap size, but the long-term solution will likely be to log to CloudWatch only and export the logs with a dedicated Lambda. I'd be interested to hear if there are any other fixes for this issue, though.

@oliver-brm commented Feb 2, 2023 via email

@UnitedMarsupials (Author)

> As a workaround, batching log events helped in our case. That way, the number of requests to Splunk is reduced and they're less likely to pile up in the HTTP client.

Just curious, what batch size did you find useful? We raised the number from the original 17 to 2099 -- and that still leads to the above-charted heap usage... Our batch_interval here is 3, though -- maybe we ought to raise that too...

@oliver-brm

We started off with the recommended values (see here).

Looking at the settings you posted initially, it appears to me that you might be sending too fast. batch_interval is interpreted in milliseconds, so you might want to raise that to 3000.
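For illustration, a configuration along those lines might look roughly like the following -- the url/token/index values are placeholders and the numbers are examples, not recommendations:

    <SplunkHttp name="splunk-http"
                url="https://hec.example.com:8088"
                token="YOUR-HEC-TOKEN"
                index="main"
                batch_size_count="2099"
                batch_interval="3000">
        <!-- intent: flush when ~2099 events are buffered, or every 3000 ms, whichever comes first -->
        <PatternLayout pattern="%m%n"/>
    </SplunkHttp>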

@Tomboyo commented Feb 9, 2024

We encountered this recently, so for anyone trying to understand the problem: it looks like the default behavior of the appender is to dispatch every logging event as a discrete HTTP request to the HEC endpoint. Those requests are enqueued in the okhttp dispatcher, which by default (send_mode=serial) only uses one thread to send them to the API, so they pile up in the okhttp queue until the application hits an OOME.

You fix this by configuring batch_size_count or batch_size_bytes to a nonzero value, by also configuring batch_interval to a nonzero number of milliseconds (to guarantee that an incomplete batch is flushed even if no further events arrive to fill it), and by optionally setting send_mode to "parallel" so that more threads can step in to help with bursts of logging.

As I understand it, anyway.
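Roughly, that combination might look like this in log4j2.xml -- the appender name, endpoint, token, and thresholds below are all illustrative placeholders, not tuned values:

    <SplunkHttp name="splunk-http"
                url="https://hec.example.com:8088"
                token="YOUR-HEC-TOKEN"
                batch_size_count="500"
                batch_size_bytes="65536"
                batch_interval="1000"
                send_mode="parallel">
        <!-- batch by count or size, flush at least every 1000 ms,
             and let the dispatcher use more than one thread for bursts -->
        <PatternLayout pattern="%m%n"/>
    </SplunkHttp>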

@UnitedMarsupials (Author)

> You fix this by configuring batch_size_count or batch_size_bytes

None of these methods will fix the problem. They'll help reduce it, yes, but not eliminate it completely. A fix would involve dropping events when the JVM is getting close to an OOM, but no one dares to propose such data loss :(
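For what it's worth, one way to approximate dropping today, outside the Splunk appender itself, would be to wrap it in log4j2's AsyncAppender with a bounded queue and blocking="false"; that sheds events once the queue fills up rather than reacting to actual heap pressure, so it is only a sketch of the idea -- the "splunk-http" appender name here is hypothetical:

    <!-- Bounded, non-blocking queue in front of the (hypothetical) "splunk-http" appender:
         once bufferSize events are queued, further events go to the error appender
         (or are discarded if none is configured) instead of accumulating on the heap
         while the HEC is slow or down. -->
    <Async name="splunk-async" bufferSize="10000" blocking="false">
        <AppenderRef ref="splunk-http"/>
    </Async>

Loggers would then reference splunk-async instead of the Splunk appender directly. It still doesn't keep the count of dropped events suggested above, though.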
