Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Environment-dependent samtools view error (invalid BGZF header) with remote GCS BAM #1811

Open
rlorigro opened this issue Mar 16, 2023 · 8 comments
Assignees

Comments

@rlorigro
Copy link

rlorigro commented Mar 16, 2023

Are you using the latest version of samtools and HTSlib? If not, please specify.

Yes, but this error is error is environment dependent so I will list both environments:

PASS environment

samtools 1.17
Using htslib 1.17
Copyright (C) 2023 Genome Research Ltd.

Samtools compilation details:
    Features:       build=configure curses=yes 
    CC:             gcc
    CPPFLAGS:       
    CFLAGS:         -Wall -g -O2
    LDFLAGS:        
    HTSDIR:         htslib-1.17
    LIBS:           
    CURSES_LIB:     -lncursesw

HTSlib compilation details:
    Features:       build=configure libcurl=yes S3=no GCS=yes libdeflate=no lzma=yes bzip2=yes plugins=no htscodecs=1.4.0
    CC:             gcc
    CPPFLAGS:       
    CFLAGS:         -Wall -g -O2 -fvisibility=hidden
    LDFLAGS:        -fvisibility=hidden 

HTSlib URL scheme handlers present:
    built-in:	 preload, data, file
    Google Cloud Storage:	 gs+http, gs+https, gs
    libcurl:	 imaps, pop3, gophers, http, smb, gopher, sftp, ftps, imap, smtp, smtps, rtsp, scp, ftp, telnet, mqtt, rtmp, ldap, https, ldaps, smbs, tftp, pop3s, dict
    crypt4gh-needed:	 crypt4gh
    mem:	 mem

FAIL environment

samtools 1.17
Using htslib 1.17
Copyright (C) 2023 Genome Research Ltd.

Samtools compilation details:
    Features:       build=configure curses=yes
    CC:             gcc
    CPPFLAGS:
    CFLAGS:         -Wall -g -O2
    LDFLAGS:
    HTSDIR:         htslib-1.17
    LIBS:
    CURSES_LIB:     -lncursesw

HTSlib compilation details:
    Features:       build=configure libcurl=yes S3=yes GCS=yes libdeflate=no lzma=yes bzip2=yes plugins=no htscodecs=1.4.0
    CC:             gcc
    CPPFLAGS:
    CFLAGS:         -Wall -g -O2 -fvisibility=hidden
    LDFLAGS:        -fvisibility=hidden

HTSlib URL scheme handlers present:
    built-in:	 preload, data, file
    S3 Multipart Upload:	 s3w, s3w+https, s3w+http
    Amazon S3:	 s3+https, s3+http, s3
    Google Cloud Storage:	 gs+http, gs+https, gs
    libcurl:	 imaps, pop3, http, smb, gopher, sftp, ftps, imap, smtp, smtps, rtsp, scp, ftp, telnet, rtmp, ldap, https, ldaps, smbs, tftp, pop3s, dict
    crypt4gh-needed:	 crypt4gh
    mem:	 mem

Please describe your environment.

Both environments
Linux 5.8.0-53-generic
x86_64

FAIL environment
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

PASS environment
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0

Please specify the steps taken to generate the issue, the command you are running and the relevant output.

Depending on where I run the following command:

samtools view -h -F 4 gs://bucket/alignment.bam chr2:40000000-40000001 > a.txt

I either get the expected result (the full region as a SAM) or I get only the header, and then an error message:

[E::bgzf_read_block] Invalid BGZF header at offset 795492778
[E::bgzf_read] Read block operation failed with error 2 after 0 of 4 bytes
samtools view: retrieval of region #0 failed
samtools view: error closing "gs://bucket/alignment.bam": -1

The failing environment is my own local machine, and the passing environment is a fresh Ubuntu 22 docker instance I am running interactively on my local machine. Both have the same version of samtools built from source of the latest release.

To summarize the differences, I ran diff on the two samtools --version results:

--- fail.txt    2023-03-16 10:19:10.771830174 -0700
+++ pass.txt    2023-03-16 10:19:10.779830480 -0700
@@ -16 +16 @@
-    Features:       build=configure libcurl=yes S3=yes GCS=yes libdeflate=no lzma=yes bzip2=yes plugins=no htscodecs=1.4.0
+    Features:       build=configure libcurl=yes S3=no GCS=yes libdeflate=no lzma=yes bzip2=yes plugins=no htscodecs=1.4.0
@@ -24,2 +23,0 @@
-    S3 Multipart Upload:        s3w, s3w+https, s3w+http
-    Amazon S3:  s3+https, s3+http, s3
@@ -27 +25 @@
-    libcurl:    imaps, pop3, http, smb, gopher, sftp, ftps, imap, smtp, smtps, rtsp, scp, ftp, telnet, rtmp, ldap, https, ldaps, smbs, tftp, pop3s, dict
+    libcurl:    imaps, pop3, gophers, http, smb, gopher, sftp, ftps, imap, smtp, smtps, rtsp, scp, ftp, telnet, mqtt, rtmp, ldap, https, ldaps, smbs, tftp, pop3s, dict
@daviesrob
Copy link
Member

Thanks for the very comprehensive report. I see there are different versions of libcurl (and presumably its dependencies) present in the two environments. Assuming that you're using the system-installed libcurl in both cases, could you run curl --version in each environment and let us know what it says please?

@rlorigro
Copy link
Author

Sure here they are:

PASS

curl 7.81.0 (x86_64-pc-linux-gnu) libcurl/7.81.0 OpenSSL/3.0.2 zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.5.13
Release-Date: 2022-01-05
Protocols: dict file ftp ftps gopher gophers http https imap imaps ldap ldaps mqtt pop3 pop3s rtmp rtsp scp sftp smb smbs smtp smtps telnet tftp 
Features: alt-svc AsynchDNS brotli GSS-API HSTS HTTP2 HTTPS-proxy IDN IPv6 Kerberos Largefile libz NTLM NTLM_WB PSL SPNEGO SSL TLS-SRP UnixSockets zstd

FAIL

curl 7.68.0 (x86_64-pc-linux-gnu) libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3
Release-Date: 2020-01-08
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtmp rtsp scp sftp smb smbs smtp smtps telnet tftp 
Features: AsynchDNS brotli GSS-API HTTP2 HTTPS-proxy IDN IPv6 Kerberos Largefile libz NTLM NTLM_WB PSL SPNEGO SSL TLS-SRP UnixSockets

I also was able to confirm this using samtools view --verbosity=8:

PASS

user-agent: htslib/1.17 libcurl/7.81.0

FAIL

user-agent: htslib/1.17 libcurl/7.68.0

And the outputs look quite different for those full logs

@daviesrob
Copy link
Member

Yes, it's possible that some obscure bug got fixed between the two versions, I'd suspect in one of libcurl, openSSL or maybe nghttp2 if it's being used. Is the problem very reproducible, and do you know of any publicly-available data that might show it? There might also be some clues in the debug output, but you'd probably want to edit them for sensitive data before posting them here...

@rlorigro
Copy link
Author

rlorigro commented Mar 16, 2023

It doesn't crash when using this public BAM:
gs://pepper-deepvariant-public/T2T_case_study/chm13.draft_v0.9.hifi_20k.wm_1.11.pri.chr20.bam

(it needs to be fetched in the chr20 contig because there are no alignments elsewhere)

@rlorigro
Copy link
Author

rlorigro commented Mar 16, 2023

I should say, using my own private gs URI it always fails.

Here are some potentially relevant differences I found between the verbose logs (on the same BAM, on different environments):

FAIL

* Using HTTP2, server supports multi-use

PASS

* Using HTTP2, server supports multiplexing

FAIL

* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
< HTTP/2 200 

PASS

* old SSL session ID is stale, removing
< HTTP/2 200 

But actually, this is the strangest thing I found, during the final request, which is the one that fails:

FAIL

< content-range: bytes 7461327743-9166231634/9166231635
< content-length: 1704903892

PASS

< content-range: bytes 8508119962-9166231634/9166231635
< content-length: 658111673

Somehow the byte ranges are different? Doesn't seem to make any sense to me why that would be the case, for the exact same command with the interval: chr20:40000000-40000001

@daviesrob
Copy link
Member

The different range requests look odd. Could you be using an incorrect or out of date in index in one of your environments? For remote files samtools has a habit of caching the index (.bai or .csi) file locally, so it doesn't have to download it again if you search for lots of ranges. If your local copy matches the file name, but isn't the correct index file then it may lead to the results you're seeing.

It might be worth checking for any local .bai or .csi files, and move them elsewhere to force samtools to download a new copy. Then you could compare the new and old index files to see if they're the same.

@rlorigro
Copy link
Author

rlorigro commented Mar 17, 2023

oh yes this definitely explains it. In my original script, the first bam would succeed but the second one would fail. They have different URIs, but the same filename (orphaned from the directory).

For example:

gs://bucket/hash_code_1/alignment.bam
gs://bucket/hash_code_1/alignment.bam.bai
gs://bucket/hash_code_2/alignment.bam
gs://bucket/hash_code_2/alignment.bam.bai

I've renamed my remote BAMs/BAIs and it works now. Is there anything I can do to prevent this behavior if I want to sample from multiple bams of the same name (in parallel)? I am working with the outputs of cloud workflows which often have generic filenames and unique folder structures.

@rlorigro
Copy link
Author

Could I suggest using a hash or some unique aspect of the remote bai to check whether the local bai is actually the right file?

Or even just storing them in some temporary directory using a hash of the entire gs URI?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants