
Performance on large files - avoid spilling to disk #4

Open
d-cameron opened this issue May 21, 2019 · 4 comments

Labels
enhancement New feature or request

Comments

@d-cameron

Looking through the source code and the specification document, I've noticed that both compression and decompression spill to disk for large files. This is particularly problematic during decompression due to the high temporary disk usage.

Have you considered extending the file format to support multiple blocks? For example:

Header = format descriptor, format version, sequence type, flags, name separator, line length

DataBlock = Number of sequences, IDs, Comments, Lengths, Mask, Sequence, Quality

And the overall structure:

Header, Title, [DataBlock]+

Then you could stream NAF files with no disk usage and a fixed memory overhead. There is a slight compression penalty to having multiple data blocks, but it will be trivially low for large blocks. Both BAM and CRAM use variants of this blocked compression approach.
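The blocked layout proposed above can be illustrated with a minimal sketch. This is not the actual NAF format: the length-prefixed framing and the use of zlib are assumptions chosen only to demonstrate why independently compressed blocks allow streaming with fixed memory overhead.

```python
import io
import struct
import zlib

def write_blocks(out, records, block_size=2):
    """Write records in length-prefixed, independently compressed blocks."""
    for i in range(0, len(records), block_size):
        payload = b"\n".join(records[i:i + block_size])
        comp = zlib.compress(payload)
        out.write(struct.pack("<I", len(comp)))  # 4-byte little-endian block length
        out.write(comp)

def read_blocks(inp):
    """Stream blocks back one at a time: only a single block is ever held in memory."""
    while True:
        header = inp.read(4)
        if not header:
            return
        (length,) = struct.unpack("<I", header)
        yield zlib.decompress(inp.read(length)).split(b"\n")

buf = io.BytesIO()
write_blocks(buf, [b">seq1 ACGT", b">seq2 GGCC", b">seq3 TTAA"])
buf.seek(0)
records = [r for block in read_blocks(buf) for r in block]
```

Because each block decompresses independently, the reader never needs the whole file in memory or on disk, which is the property that makes BAM- and CRAM-style blocked formats streamable.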

@KirillKryukov
Owner

Decompression does not spill to disk. If it did, it would be problematic, as you say. Compression does use temporary disk storage, though, which may not be ideal.

Indeed, I am considering extending the format to support multiple blocks. My main reason is not to avoid disk usage during compression, which I see as acceptable. More importantly, a multi-block format will enable faster random access, i.e., partial decompression of just the specified sequences or a coordinate range.

This has to be carefully designed and harmonized with other planned features, so it may take me a while. But I'm very positive about extending the format in that direction.
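The random-access benefit mentioned above can be sketched with a small example. The index layout here (a list of per-block file offsets and lengths) is purely hypothetical and not part of NAF; it only shows how independently compressed blocks let a reader decompress one block without touching the rest of the file.

```python
import io
import zlib

def write_indexed(out, blocks):
    """Compress each block independently, recording (offset, length) in an index."""
    index = []
    for payload in blocks:
        comp = zlib.compress(payload)
        index.append((out.tell(), len(comp)))
        out.write(comp)
    return index

def read_block(inp, index, i):
    """Decompress only block i by seeking straight to its offset."""
    offset, length = index[i]
    inp.seek(offset)
    return zlib.decompress(inp.read(length))

buf = io.BytesIO()
idx = write_indexed(buf, [b"block-0 data", b"block-1 data", b"block-2 data"])
middle = read_block(buf, idx, 1)
```

With a single-block format, retrieving one sequence would require decompressing everything up to it; with an index over blocks, the cost drops to one seek plus one block's worth of decompression.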

@KirillKryukov KirillKryukov added the enhancement New feature or request label Oct 22, 2019
@yhoogstrate

You could set --temp-dir to /dev/shm/ to effectively write temporary files to RAM and avoid additional IO.

Interesting discussion on how to proceed with random access. I may need this for something else :)

@KirillKryukov
Owner

@yhoogstrate I somehow missed this comment. It's a good idea to use /dev/shm/ when possible. I added it to the manual ( https://github.com/KirillKryukov/naf/blob/develop/Compress.md#temporary-storage ). Thanks!

@yhoogstrate

> @yhoogstrate I somehow missed this comment. It's a good idea to use /dev/shm/ when possible. I added it to the manual ( https://github.com/KirillKryukov/naf/blob/develop/Compress.md#temporary-storage ). Thanks!

I will try to make a PR resolving this soon.
