Question: Is there an example of buffered read/seek in chunks? #73

Open
marcellmars opened this issue Feb 13, 2023 · 3 comments

Comments

@marcellmars

I was playing with bgzf archives, and it was fairly easy to wrap a bgzf.Reader in a bufio.Reader so that the archive could be read in chunks. In one pass I would build a useful index of offsets, so that later I could use the very large archive as if it were a memory-mapped file on disk.

I tried to find an example of using xflate in a similar way. All of the examples I could find read the whole compressed archive into memory.

So my question is: is there an example of buffered read/seek in chunks of a compressed "xflated" archive?

I found that a custom implementation of what I describe here, via io.ReadSeeker, is not trivial. So if there's already an example, I would appreciate it immensely :)

@dsnet
Owner

dsnet commented Feb 13, 2023

I don't quite understand how your use of bgzf works to begin with, so I'm unable to suggest an equivalent use with xflate. Do you have an example?

@dsnet
Owner

dsnet commented Feb 13, 2023

Also, keep in mind that XFLATE operates differently than BGZF. BGZF is effectively a linked-list of independently compressed segments, so you need to read through the whole file to determine the boundaries of each segment. In contrast, XFLATE contains an index that reports the location of each segment in O(1). Thus, you can seek to the middle of an XFLATE file without needing to ever read all the content before that point.

@marcellmars
Author

marcellmars commented Feb 15, 2023

ok. here's the use case.

i have a very large gzipped file with one json record per line.
with bgzf i do two passes.

in the first pass i use gzip.NewReader, wrapped in bufio.NewReader, and call .ReadBytes('\n') to read line by line. then i pass each line to bgzf.Writer, write it, and flush every million lines. that's how i end up with a bgzf archive.

in the second pass i pass the *os.File to bgzf.Reader and do the same as in the first pass: bufio.NewReader/.ReadBytes('\n') line by line, building an index where the id from the json is the key and the bgzf.Chunk is the value.

that works fine. it uses very little RAM and is fast enough at moving through the compressed file to find a particular json via its id.

meanwhile i played with rac and took a somewhat similar approach. there i didn't have to do two passes over the compressed archive, as rac accepts an index of offset/length values made against the uncompressed file. so i made that index against the uncompressed .jsonl file. for rac i pass the *os.File to rac.Reader, and given the offset and length (made against the uncompressed file), its .SeekRange prepares the rac.Reader to return that content via io.ReadAll.

so both of these experiments gave me a way to query a very large compressed archive of many lines of json records for the chunk where a particular json will be found. any particular query uses very little RAM and is fairly fast.

i am sure XFLATE could be used for this use case. i just couldn't figure out how to use a reference to the compressed archive (e.g. *os.File) and provide the offset/length so that XFLATE gives me the desired json. i already have an index made against the uncompressed file, so i wonder if i can use it for XFLATE.

i hope this explains it better.

just to mention: i managed to use xflate.NewWriter to write an XFLATE archive in chunks, and i can read from it with gzip.NewReader.


2 participants