[DRAFT] Cache control for re-using previously-downloaded headers #325

Draft · wants to merge 1 commit into master

Conversation

@jmarshall (Member) commented Jul 12, 2018

This proposal is a follow-up to #322. It will require rebasing etc as #322 develops, so I don't anticipate updating or polishing this until after the class proposal has landed in master.

However, if clients are to use that facility to re-use previously-downloaded headers, and this is to be done safely, then I think HTTP cache control is the natural way to make it safe, and extrapolating ETag etc. to the htsget ticket is a natural extension. So if enabling this safety is considered important, I think this follow-up will also need to be considered soon after class.

But this is somewhat moot in the absence of implementations, hence this separate PR.
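
A purely illustrative sketch of the client-side re-use this is meant to enable; the header URL, ETag handling and 304 behaviour below are assumptions about how a client might use standard HTTP conditional requests, not anything this PR specifies:

import urllib.error
import urllib.request

def fetch_header(url, cached_body=None, cached_etag=None):
    # Revalidate a previously downloaded header block with If-None-Match.
    req = urllib.request.Request(url)
    if cached_etag:
        req.add_header('If-None-Match', cached_etag)
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read(), resp.headers.get('ETag')
    except urllib.error.HTTPError as e:
        if e.code == 304 and cached_body is not None:
            return cached_body, cached_etag   # not modified: re-use the cached copy
        raise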

@jdidion commented Jul 17, 2018

Thanks for the additions here and in #322. Practical question: I am trying to implement a POC htsget server that splits up BAM files into header and body pieces. I'm at a loss for how to actually create a header BAM file and a body BAM file, such that the header BAM is valid by itself but can be concatenated with the body BAM to create the final BAM. In pseudo-python, the way I want my client to interact with the server is:

urlobjs = get_urls_from_server()
headers = None

with open('out.bam', 'wb') as outbam:
    for urlobj in urlobjs:
        url_content = fetch(urlobj['url'])
        if urlobj['class'] == 'header':
            headers = decode_bam(url_content).headers
            # ...do something with the headers...
        outbam.write(url_content)

# Now open the BAM for reading
with open_bam('out.bam', 'r') as bam_reader:
    # The headers in the BAM file should be the same as what I read from the header BAM above
    assert bam_reader.headers == headers
    for record in bam_reader:
        # ...do something with the record...

I realize that I can use samtools to split my BAM file into header-only and body-only by first converting to SAM, splitting into header-only SAM and body-only SAM, and then converting both of those back into BAM. But concatenating those two files does not create a valid BAM. I guess I could just write out the body BAM and then use samtools reheader to add the header, but that's quite slow for large BAM files. Any other suggestions?

@jmarshall (Member, Author) commented Jul 17, 2018

You need to find the boundary file offset between the header and the body, which requires understanding the format in a way that a general-purpose read-the-records API won't provide. So for BAM, you need to

  • figure out the length in uncompressed bytes of the BAM header, essentially by adding up
    l_text + sum(l_name[1…n_ref]) plus the constant-size fields;

  • figure out how many BGZF blocks at the start of the file are used for those headers, by adding up blocks' isize fields until the total equals the uncompressed header size.

At that point, you'll have the header-body boundary (in “compressed space”) that you're looking for.
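
A rough, untested sketch of those two steps for a BAM file on local disk; it assumes the BGZF 'BC' extra subfield is the first extra field in each block header (as in htslib-written files) and that the header ends on a BGZF block boundary, as noted below:

import gzip
import struct

def uncompressed_header_length(path):
    # l_text + sum(l_name[1..n_ref]) plus the constant-size fields.
    with gzip.open(path, 'rb') as f:        # gzip reads the concatenated BGZF members
        magic, l_text = struct.unpack('<4si', f.read(8))
        assert magic == b'BAM\x01'
        f.read(l_text)                      # SAM-header text
        n_ref, = struct.unpack('<i', f.read(4))
        total = 12 + l_text                 # magic + l_text field + text + n_ref field
        for _ in range(n_ref):
            l_name, = struct.unpack('<i', f.read(4))
            f.read(l_name + 4)              # reference name + l_ref
            total += 8 + l_name
        return total

def header_body_boundary(path):
    # Walk BGZF blocks, summing their isize fields, until the header is covered.
    target = uncompressed_header_length(path)
    covered, offset = 0, 0
    with open(path, 'rb') as f:
        while covered < target:
            head = f.read(18)               # gzip header through the 'BC' subfield
            bsize, = struct.unpack_from('<H', head, 16)   # total block size minus 1
            f.seek(offset + bsize + 1 - 4)
            isize, = struct.unpack('<I', f.read(4))       # uncompressed size of this block
            covered += isize
            offset += bsize + 1
    return offset                           # compressed-space offset of the first body block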

Note that this assumes that a new BGZF block is started for the first body data record (i.e. the header-body boundary is also at a BGZF block boundary) — this has never been stated in the SAM specification, but is something that the main implementations have done for BAM since 2010 (see #300). It seems to me that implementing htsget requires BAM files to have this property. (And similarly for BCF files, but AFAIK the main implementations don't do this for them!)

In practice, you'd more likely find this boundary by looking in a BAI/etc index for the virtual file offset of the first body data record — e.g. (presumably) by finding the smallest ioffset in any of the linear indices.
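
Again illustratively, the smallest non-zero ioffset can be pulled out of a .bai index with a straightforward scan; its top 48 bits are the compressed offset of the first alignment block (names below are placeholders):

import struct

def boundary_from_bai(bai_path):
    with open(bai_path, 'rb') as f:
        data = f.read()
    assert data[:4] == b'BAI\x01'
    pos, smallest = 4, None
    n_ref, = struct.unpack_from('<i', data, pos); pos += 4
    for _ in range(n_ref):
        n_bin, = struct.unpack_from('<i', data, pos); pos += 4
        for _ in range(n_bin):
            _bin, n_chunk = struct.unpack_from('<Ii', data, pos); pos += 8
            pos += 16 * n_chunk             # skip this bin's chunk list
        n_intv, = struct.unpack_from('<i', data, pos); pos += 4
        for voffset in struct.unpack_from('<%dQ' % n_intv, data, pos):
            if voffset and (smallest is None or voffset < smallest):
                smallest = voffset
        pos += 8 * n_intv
    return None if smallest is None else smallest >> 16   # virtual offset >> 16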

@jdidion commented Jul 17, 2018

@jmarshall that makes sense, thanks for the explanation. I have a Python library for parsing index files that I've been meaning to release for a while; it seems like it will be useful here. I'll work on putting together a library and command-line tool that can be used to split up BAM/BCF/etc files for serving by htsget.

@mlin mlin added this to Next in htsget Sep 16, 2018
@mlin mlin moved this from Next to Now in htsget Sep 16, 2018
@mlin mlin moved this from Now to Next in htsget May 12, 2021
@mlin mlin moved this from Next to Now in htsget Jun 8, 2021
@jmarshall jmarshall marked this pull request as draft June 8, 2021 21:28