large file sizes causing OOMKills and timeouts #155

Open
timcosta opened this issue Jun 5, 2023 · 4 comments

timcosta commented Jun 5, 2023

Hi all! I'm investigating using Matano for some log ingestion, and some of the ALB log files I'm looking at are extremely large: 100MB compressed, multiple GB decompressed. We're running into memory exhaustion, even after manually raising limits in the console to the maximum of 10240 MB of memory. This happens in multiple lambdas, most notably the transformer and the writer.

The specific issue we're seeing in the writer is that it logs "INFO lake_writer: Starting 25 downloads from S3" and then, roughly 20s later, it's killed by Lambda for exceeding 10240 MB of memory used. Can this count of 25 be tuned, or made to take file size into account?

We were able to resolve the transformer and databatcher issues by increasing the timeout and memory, which should be covered by #85 once it's included. I may be able to contribute that, depending on how our discovery goes, but I'm not sure how long it would be until that could happen.

From the investigation I've done into this problem for a custom processing solution, the best resolutions appear to be either processing the data as a stream rather than loading it all into memory at once, or having some sort of pre-processor that splits large files into smaller chunks before they reach the loader. A rough sketch of the streaming approach is below.
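
For reference, here's a minimal sketch of what I mean by the streaming approach (hypothetical code, boto3 plus Python's gzip module, not anything Matano-specific): decompress and iterate the object line by line so memory use is bounded by the read buffer rather than the decompressed file size.

```python
# Minimal streaming sketch (hypothetical): read a gzipped ALB log from S3
# and handle one record at a time instead of loading the whole file.
import gzip

import boto3

s3 = boto3.client("s3")

def process_alb_log(bucket: str, key: str) -> int:
    """Stream-decompress an ALB log object and process it record by record."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]  # StreamingBody
    records = 0
    # GzipFile pulls compressed bytes from the streaming body incrementally,
    # so neither the compressed nor the decompressed file sits fully in memory.
    with gzip.GzipFile(fileobj=body) as decompressed:
        for raw_line in decompressed:
            record = raw_line.decode("utf-8").rstrip("\n")
            # ... parse/transform the single ALB record here ...
            records += 1
    return records
```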

Do y'all have any thoughts on the best path forward here, or on whether Matano would ever consider handling situations like this, where the inputs/batches cannot be processed due to size?

Samrose-Ahmed commented Jun 5, 2023

Hi, thanks for the issue.

Generally we don't recommend ingesting such large files, but this should be possible; the lake writer logic just needs to be modified to be a bit more intelligent and file-size aware. Optimal Parquet file sizes are 100-500MB, so it shouldn't need to bring more than that into memory at a time.
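
Roughly what I mean by size-aware, as an illustrative sketch only (the real lake writer is Rust; the names and threshold here are made up): batch the pending files by cumulative size instead of a fixed count of 25, so each batch stays near the target Parquet size.

```python
# Illustrative only -- not the actual lake writer code. Group pending
# (key, size_bytes) pairs into batches capped by cumulative size so a batch
# never pulls more than roughly one Parquet file's worth of data into memory.
from typing import Iterable, List, Tuple

TARGET_BATCH_BYTES = 500 * 1024 * 1024  # upper end of the 100-500MB range

def batch_by_size(files: Iterable[Tuple[str, int]],
                  target_bytes: int = TARGET_BATCH_BYTES) -> List[List[str]]:
    batches: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for key, size in files:
        # Close the current batch if adding this file would exceed the target.
        # A single oversized file still gets its own batch; going smaller than
        # that would require streaming or splitting the file itself.
        if current and current_bytes + size > target_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(key)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```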

I'd also like to see why it's ending up with that much data in the lake writer rather than flushing earlier; let me do some testing and update.

timcosta commented Jun 5, 2023

Awesome, thanks! This is the managed AWS_ELB ingestion pipeline, using files written directly by the ALB. What would you recommend in a situation like this? A pre-processor we write that splits these files into smaller ones before putting them into a bucket Matano ingests from? (A hypothetical sketch of what I mean is below.)
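
Something like this is what I had in mind for a pre-processor, purely as a hypothetical sketch (bucket names, key layout, and the part size are placeholders): stream the large object, re-chunk it into smaller gzipped parts, and drop those into the bucket Matano ingests from.

```python
# Hypothetical pre-processor sketch: split one large gzipped ALB log into
# smaller gzipped parts before Matano ever sees it.
import gzip

import boto3

s3 = boto3.client("s3")
LINES_PER_PART = 500_000  # tune so each part stays comfortably small

def split_alb_log(src_bucket: str, src_key: str, dst_bucket: str) -> int:
    body = s3.get_object(Bucket=src_bucket, Key=src_key)["Body"]
    part_lines, parts_written = [], 0
    with gzip.GzipFile(fileobj=body) as decompressed:
        for line in decompressed:
            part_lines.append(line)
            if len(part_lines) >= LINES_PER_PART:
                _write_part(dst_bucket, src_key, parts_written, part_lines)
                part_lines = []
                parts_written += 1
    if part_lines:
        _write_part(dst_bucket, src_key, parts_written, part_lines)
        parts_written += 1
    return parts_written

def _write_part(bucket: str, src_key: str, index: int, lines: list) -> None:
    # Re-compress the chunk and upload it under a derived key.
    payload = gzip.compress(b"".join(lines))
    s3.put_object(Bucket=bucket, Key=f"{src_key}.part{index:04d}.gz", Body=payload)
```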

Samrose-Ahmed commented Jun 5, 2023

Splitting would work but we would probably want to support it out of the box in this case.

I will take a closer look at the code; you can watch this issue for updates.

@Samrose-Ahmed

Were you able to test this out? I tested 2GB uncompressed ALB logs in the linked PR.
