Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Number of sequences in BAM header mismatches reference #126

Open
tnnandi opened this issue Jul 25, 2023 · 5 comments
Open

Number of sequences in BAM header mismatches reference #126

tnnandi opened this issue Jul 25, 2023 · 5 comments

Comments

@tnnandi
Copy link

tnnandi commented Jul 25, 2023

Hi,

When I am using SvABA with the hg38 (no alt) reference file and a BAM file obtained by aligning a whole human genome to the same reference, I get the following error (only the first few lines are shown):

!!!!!!!!!!! WARNING !!!!!!!!!!!
!!!!!! Number of sequences in BAM header mismatches reference
!!!!!! BAM: 3366 -- Ref: 194
!!!!!! BAM sequence id 1: "chr2" -- Ref sequence id 1: "chr10"
!!!!!! BAM sequence id 2: "chr3" -- Ref sequence id 2: "chr11"
!!!!!! BAM sequence id 3: "chr4" -- Ref sequence id 3: "chr11_KI270721v1_random"
!!!!!! BAM sequence id 4: "chr5" -- Ref sequence id 4: "chr12"
!!!!!! BAM sequence id 5: "chr6" -- Ref sequence id 5: "chr13"
!!!!!! BAM sequence id 6: "chr7" -- Ref sequence id 6: "chr14"
!!!!!! BAM sequence id 7: "chr8" -- Ref sequence id 7: "chr14_GL000009v2_random"
!!!!!! BAM sequence id 8: "chr9" -- Ref sequence id 8: "chr14_GL000225v1_random"
!!!!!! BAM sequence id 9: "chr10" -- Ref sequence id 9: "chr14_KI270722v1_random"
!!!!!! BAM sequence id 10: "chr11" -- Ref sequence id 10: "chr14_GL000194v1_random"
!!!!!! BAM sequence id 11: "chr12" -- Ref sequence id 11: "chr14_KI270723v1_random"
.....

This error arises from the following condition in src/savaba/run_savaba.cpp:
if (b_header.NumSequences() != bwa_header.NumSequences()),
and when I do samtools view -H <bamfile> | wc -l I do indeed get a number close to 3366 but that includes lot of information in addition to that of just the sequences (e.g., steps for alignment).

So, I don't see any reason why both headers (for the bam file and the reference file) should be of the same length if we are not limiting our comparison to only those lines that contain the sequence information.

I have ensured that the BAM file has indeed been obtained by mapping to the same reference genome (I re-did it myself), so I'm not sure how to get around this issue.

Thanks,
Tarak

@walaj
Copy link
Owner

walaj commented Jul 25, 2023 via email

@tnnandi
Copy link
Author

tnnandi commented Jul 26, 2023

Hi @walaj , thank you very much for the prompt response!

Here's the link to a file that contains the header of the bam file + first few entries: https://drive.google.com/file/d/1i66328Q4bmjMAGhD_KF6Z4mQcWBVpjua/view?usp=drive_link

The first line of the reference file looks like this (it starts with chr1):

chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Also, this seems to be more than just a warning as the code fails if trigger_explain is True that is caused by a mismatch in the headers.

Thanks,
Tarak

@walaj
Copy link
Owner

walaj commented Jul 26, 2023 via email

@zhaowsong
Copy link

Hi @walaj , thank you very much for the prompt response!

Here's the link to a file that contains the header of the bam file + first few entries: https://drive.google.com/file/d/1i66328Q4bmjMAGhD_KF6Z4mQcWBVpjua/view?usp=drive_link

The first line of the reference file looks like this (it starts with chr1):

chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Also, this seems to be more than just a warning as the code fails if trigger_explain is True that is caused by a mismatch in the headers.

Thanks, Tarak

Hi,

When I am using SvABA with the hg38 (no alt) reference file and a BAM file obtained by aligning a whole human genome to the same reference, I get the following error (only the first few lines are shown):

!!!!!!!!!!! WARNING !!!!!!!!!!!
!!!!!! Number of sequences in BAM header mismatches reference
!!!!!! BAM: 3366 -- Ref: 194
!!!!!! BAM sequence id 1: "chr2" -- Ref sequence id 1: "chr10"
!!!!!! BAM sequence id 2: "chr3" -- Ref sequence id 2: "chr11"
!!!!!! BAM sequence id 3: "chr4" -- Ref sequence id 3: "chr11_KI270721v1_random"
!!!!!! BAM sequence id 4: "chr5" -- Ref sequence id 4: "chr12"
!!!!!! BAM sequence id 5: "chr6" -- Ref sequence id 5: "chr13"
!!!!!! BAM sequence id 6: "chr7" -- Ref sequence id 6: "chr14"
!!!!!! BAM sequence id 7: "chr8" -- Ref sequence id 7: "chr14_GL000009v2_random"
!!!!!! BAM sequence id 8: "chr9" -- Ref sequence id 8: "chr14_GL000225v1_random"
!!!!!! BAM sequence id 9: "chr10" -- Ref sequence id 9: "chr14_KI270722v1_random"
!!!!!! BAM sequence id 10: "chr11" -- Ref sequence id 10: "chr14_GL000194v1_random"
!!!!!! BAM sequence id 11: "chr12" -- Ref sequence id 11: "chr14_KI270723v1_random"
.....

This error arises from the following condition in src/savaba/run_savaba.cpp: if (b_header.NumSequences() != bwa_header.NumSequences()), and when I do samtools view -H <bamfile> | wc -l I do indeed get a number close to 3366 but that includes lot of information in addition to that of just the sequences (e.g., steps for alignment).

So, I don't see any reason why both headers (for the bam file and the reference file) should be of the same length if we are not limiting our comparison to only those lines that contain the sequence information.

I have ensured that the BAM file has indeed been obtained by mapping to the same reference genome (I re-did it myself), so I'm not sure how to get around this issue.

Thanks, Tarak

Hi tnnandi, have you solved your issue? I'm also facing the same problem

@walaj
Copy link
Owner

walaj commented Oct 13, 2023

I still think the reference and the header are different. @zhaowsong can you confirm with grep etc the number of unique reference contigs in both the reference and the bam header you are using?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants