Environment-dependent samtools view error (invalid BGZF header) with remote GCS BAM #1811
Comments
Thanks for the very comprehensive report. I see there are different versions of libcurl (and presumably its dependencies) present in the two environments. Assuming that you're using the system-installed libcurl in both cases, could you run |
Sure here they are: PASS
FAIL
I also was able to confirm this using PASS
FAIL
And the outputs look quite different for those full logs |
Yes, it's possible that some obscure bug got fixed between the two versions. I'd suspect libcurl, OpenSSL, or maybe nghttp2 if it's being used. Is the problem reliably reproducible, and do you know of any publicly available data that might show it? There might also be some clues in the debug output, but you'd probably want to redact any sensitive data before posting it here... |
It doesn't crash when using this public BAM: (it needs to be fetched in the chr20 contig because there are no alignments elsewhere) |
I should say, using my own private gs URI it always fails. Here are some potentially relevant differences I found between the verbose logs (on the same BAM, on different environments): FAIL
PASS
FAIL
PASS
But actually, this is the strangest thing I found, during the final request, which is the one that fails: FAIL
PASS
Somehow the byte ranges are different? Doesn't seem to make any sense to me why that would be the case, for the exact same command with the interval: |
The different range requests look odd. Could you be using an incorrect or out-of-date index in one of your environments? For remote files it might be worth checking for any local |
oh yes this definitely explains it. In my original script, the first bam would succeed but the second one would fail. They have different URIs, but the same filename (orphaned from the directory). For example:
I've renamed my remote BAMs/BAIs and it works now. Is there anything I can do to prevent this behavior if I want to sample from multiple BAMs with the same name (in parallel)? I am working with the outputs of cloud workflows, which often have generic filenames and unique folder structures. |
Could I suggest using a hash or some unique aspect of the remote bai to check whether the local bai is actually the right file? Or even just storing them in some temporary directory using a hash of the entire gs URI? |
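As a user-side workaround along the lines of the suggestion above, here is a minimal sketch that derives a separate working directory from a hash of the full gs URI, so that index files downloaded for same-named BAMs never land in the same place. It assumes the index is saved into the process's current working directory, which is what the collision described in this thread implies. The function name, bucket names, and cache location are hypothetical and are not part of samtools or HTSlib.

```python
import hashlib
import os
import tempfile


def workdir_for_uri(uri, base=None):
    """Return (and create) a directory unique to the full remote URI.

    The directory name is a truncated SHA-256 of the whole URI, so two
    files named output.bam in different remote folders get different
    local directories. 'bam_index_cache' is an arbitrary, hypothetical name.
    """
    if base is None:
        base = os.path.join(tempfile.gettempdir(), "bam_index_cache")
    digest = hashlib.sha256(uri.encode("utf-8")).hexdigest()[:16]
    path = os.path.join(base, digest)
    os.makedirs(path, exist_ok=True)
    return path


# Hypothetical URIs: same filename, different folders.
uri_a = "gs://example-bucket/run-a/output.bam"
uri_b = "gs://example-bucket/run-b/output.bam"

dir_a = workdir_for_uri(uri_a)
dir_b = workdir_for_uri(uri_b)

# Each samtools process would then be started inside its own directory,
# for example:
#   subprocess.run(["samtools", "view", uri_a, "chr20"], cwd=dir_a)
```

Because the directory depends only on the URI, repeated or parallel runs against the same BAM reuse the same directory, while different BAMs stay isolated from each other.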
Are you using the latest version of samtools and HTSlib? If not, please specify.
Yes, but this error is environment dependent, so I will list both environments:
PASS environment
FAIL environment
Please describe your environment.
Both environments
Linux 5.8.0-53-generic
x86_64
FAIL environment
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PASS environment
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Please specify the steps taken to generate the issue, the command you are running and the relevant output.
Depending on where I run the following command:
I either get the expected result (the full region as a SAM) or I get only the header, and then an error message:
The failing environment is my own local machine, and the passing environment is a fresh Ubuntu 22 Docker container that I am running interactively on that same machine. Both have the same version of samtools, built from source from the latest release.
To summarize the differences, I ran diff on the two samtools --version results: