Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bam: Explain the use-case of the reference information section #713

Open
zaeleus opened this issue Mar 26, 2023 · 2 comments
Open

bam: Explain the use-case of the reference information section #713

zaeleus opened this issue Mar 26, 2023 · 2 comments
Assignees
Labels

Comments

@zaeleus
Copy link

zaeleus commented Mar 26, 2023

§ 4.2 "The BAM format" (2022-08-22) describes the layout of a reference information section (n_ref...l_ref). However, there is no explanation on what this is used for or how it should be used, considering the reference sequence dictionary is typically parsed from the SAM header.

@jmarshall jmarshall added the sam label Mar 27, 2023
@jmarshall
Copy link
Member

SAM files and especially BAM files need not have @SQ headers in their textual headers.

For reading a SAM file, this is no real problem, as an implementation could just add a new reference sequence to its internal dictionary when a previously unseen one is encountered in a SAM record. (Whether any implementations really support this mode of operation is unclear, but I believe it is still allowed by the specification.) Or the reference sequence dictionary could be provided externally in conjunction with the SAM file; cf samtools view -T etc.

For a BAM file, this is more of a problem, because the BAM alignment records do not contain the actual names of the reference sequences. Hence the binary header (the n_ref/l_name/name/l_ref fields) must always be present and complete in a BAM file and provides the canonical way to interpret the refID and next_refID indexes in the file's alignment records. As a bonus it also provides the length of each reference sequence.

Hence:

  1. In a BAM file, the binary header provides the canonical ordering of the reference sequences. (If @SQ headers are present in the textual header, as they usually are, they usually appear in the same order as in the binary header. But it is not an error if they appear in a different order, and the order defined by the binary header wins.)

  2. l_text might be zero.

This lore is fairly common knowledge amongst those who were implementing SAM and BAM at the time they were being defined, but you are quite right that it needs to be discussed in the specification.

@jkbonfield
Copy link
Contributor

(Whether any implementations really support this mode of operation is unclear, but I believe it is still allowed by the specification.)

This is how gap5 worked. It just created and extended contigs as it went. Obviously nearly always the header data was available, so I'm not sure why I chose that route. Maybe it was just easier to implement the that way, as it had to cope with user editing, inserting, joining, etc so already needed tools that could dynamically grow contigs and create new binning indices as it went.

Agreed with the issue though - this all needs explicitly stating.

@jmarshall jmarshall self-assigned this Mar 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: To do (backlog)
Development

No branches or pull requests

3 participants