Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in __getMinFileOffset() #2319

Open
serosko opened this issue Jun 28, 2018 · 17 comments
Open

Bug in __getMinFileOffset() #2319

serosko opened this issue Jun 28, 2018 · 17 comments
Assignees
Labels

Comments

@serosko
Copy link

serosko commented Jun 28, 2018

Related to #2273.
(using commit 19f5561).
When using jumpToRegion() with BAI files created by Picard tools or samtools-0.1.7, __getMinFileOffset() return 0 for the file offset, because those indices will contain 0 for bins without any alignments (in contrast to indices created by newer versions of samtools, which contain the value of the next smaller bin instead of 0). The difference has been described/discussed in the SAM tools mailing list https://sourceforge.net/p/samtools/mailman/message/25923233/.
This behaviour leads to an error when subsequently trying to read the next record because we jumped to the beginning of the file.
A possible fix is scanning the linear index for the next lowest bin containing a value > 0 if a 0 would be returned and then, after jumping, calling read readRecord() until the desired refID has been reached or passed.

@marehr
Copy link
Member

marehr commented Jul 17, 2018

Hi @serosko,

thank you for your bug report. I think we have to discuss this in the team. @smehringer who extensively worked on this is unfortunately on a really long vacation. We might have to postpone this issue until she is back.

I cursorily read the discussion on the mailing list and it seems that the default behaviour is indeed to do a linear scan to find the next lowest bin. This value presets a lower position of the linear scan to speed up the search. In recent versions, it seems that this value is always pointing to the correct position and that this linear scan only takes O(1) time which makes it effectively a jump.

It seems that in version 0.1.8 the change happened and the same field now contains the precomputed lower position for the linear scan to the next bin. They argue that this does not change behaviour, which I agree if you assume that this is a linear scan.

If you look at the release date of samtools-0.1.7 (Nov 16, 2009), it is nearly 10years ago.

So is it really worthwhile to not assume that this is a jump?

@serosko Is there a more recent version of Picard which you could use which fixed the problem?
I think there was a specification document of the sam format, maybe we should look again if there was any mentioning of a linear scan.

@serosko
Copy link
Author

serosko commented Jul 17, 2018

Hi, thank you for your reply.
I have no problem with you postponing the issue until smehringer returns, since I could fix the problem internally by using the described method. I agree that the version which leads to the problem is quite old now and nobody should use it anymore. But unfortunately that data I got to work on dates back quite a while and some of the files have these old indices. Because there are so many of them, recomputing the indices would not be preferable. Speaking of the SAM format documentation (https://samtools.github.io/hts-specs/SAMv1.pdf), the linear index is very briefly described in section 5.1.3, and there it says

In the linear index, for each tiling 16384bp window on the reference, we record the smallest file offset of the alignments that overlap with the window.

Unfortunately it fails to specify what the bin should contain if no such alignments exist and accessing the linear index is not further elaborated.
Anyway, there is no case where __getMinFileOffset should return a 0, because that file offset always points to the beginning of the BAM-header where no alignment can exist.
If you decide not to change the behavior of the code, maybe a warning or note in the documentation of the function(s) would be good, to avoid that other users which might come across the same problem have to search for the reason for too long.

@marehr
Copy link
Member

marehr commented Jul 17, 2018

I looked at the PR again and if you look at

19f5561#diff-8639bf08377bf2a04ab67d87e0166199L243

The original code did a linear scan, so this is a regression and should be fixed.

Can you create a pull request (with a test case) if you fixed this already?

@serosko
Copy link
Author

serosko commented Jul 17, 2018

I can create a PR of my fix, but I am not sure about the test case. At my current work place I cannot send any files, except maybe my code. But any BAM file with a linear index containing 0's should work (e.g created with samtools-0.1.7.)

@smehringer
Copy link
Member

Hi @serosko ,

Thanks for spotting this.

Although this "only" fixes a bug for very old versions that hopefully not a lot of people use, I think it's worth to fix. It's been a while since I worked on this so I hope I understood the issue well enough. Your proposed fix makes it necessary to scan the index in worst case O(n) backwards (correct?), so I thought maybe it's better to correct for 0's directly on reading the index into memory. If a zero is encountered (which as I understand can never be correct for a header must be present in BAM files), it is treated as "no alignment in this interval" and the offset is set to the previous offset instead of storing the 0. This way the index is brought up to the current standard expected by __getMinFileOffset and no linear search is needed.
What do you think?

@marehr :

The original code did a linear scan, so this is a regression and should be fixed.

Do you mean that the old code was a regression or the new one?

@marehr
Copy link
Member

marehr commented Aug 1, 2018

Do you mean that the old code was a regression or the new one?

The new one, because the linear scan was removed.

@serosko
Copy link
Author

serosko commented Aug 1, 2018

Hi @smehringer ,
your suggestion of correcting the 0's while reading sounds good and, in my opinion, should solve the problem at the root without the need for a linear scan.

@smehringer
Copy link
Member

smehringer commented Aug 1, 2018

@marehr Actually, if you look at the code, you can see that it is completely wrong. It says in the documentation above: If there are no linear indices for this reference then we use the linear min offset of the next reference that has an linear index.. This makes absolutely no sense IMPOV because two reference sequences are completely independent of each other, as are their indices. If I want to jump to a region of ref2, say chr2, it does not help me to ask the index of ref1 (chr1) ... not in any way I can think of. It was never noticed though because the linear index, as far as I can tell, was never empty in the versions of samtools and picard.

The 0 values that @serosko referred to are stored WITHIN the linear index, so if(empty(index._linearIndices[refId])) would not catch since the linear index is non-empty.

So what I am saying is that this is not a regression but the issue was never caught so far.

PS: what do you think of my proposed fix?

@smehringer
Copy link
Member

smehringer commented Aug 2, 2018

@serosko
I noticed that your proposed solution, the search for an offset is continued in other references in case no offset can be found in the current linear index. Why? If no alignments can be found in the region of a specific reference why do I care for the region in another reference? For all we know it could be a completely different genome (e.g. when working with alignments on different e.coli genomes)?

@serosko
Copy link
Author

serosko commented Aug 2, 2018

@smehringer
My intention was to continue searching until I find some valid offset, which could also be on some other contig/chromosome/reference sequence. As far as I understood the intention and function of jumpToRegion, it is to set the position to the last record before the specified region. This can also be on another reference sequence. If for example we want to jump to chr4:1-10000, regardless of whether it exists or not, we expect jumpToRegion to set the position to the first entry before the region, which would e.g. be somewhere at the end of chr3, or - if the end of chr3 also doesn't exist - somewhere before that or even on an earlier reference sequence. When then calling readRecord function, the next entry is returned. If the region exists, everything is fine. If it doesn't, readRecord will return the first entry after the specified region, which should be the intended behaviour.

@smehringer
Copy link
Member

@serosko
Hm, I disagree. If I want to jump to a region on chr4, I do not expect to read records from chr3 instead since their sequences and alignments are totally unrelated. Especially since the order of chrosomes is sometimes 1, 11, 12, ..., 2, 3, 4, ... (lexicographically). If I jump to chr11:1-10000 with no records then I land at chr1?

@serosko
Copy link
Author

serosko commented Aug 3, 2018

@smehringer
that's what is written in the seqan tutorial (http://seqan.readthedocs.io/en/master/Tutorial/InputOutput/SamAndBamIO.html):

After jumping, the next record to be read is before the given region. Therefore, you have to skip records until you access the one you are looking for.

After jumping you can still check if the hasAlignments bool of the function to see if there are any alignments in the region. So yes, if you try to jump to chr11:1-10000 without any records in your example, you should land at the end of chr1 or whatever is the last entry before the region in your bam file with the bool hasAlignments set to false.

@smehringer
Copy link
Member

I see your point. It will ensure a valid seek call afterwards.

I see two issues though:

  1. Assuming this behaviour, and assuming I want to query the very first interval chr1:1-1000 that by chance has no reads (so Picard will set the offset in the very first bin of the very first index to 0), where do I jump to? Your proposed solution, in this case, will cause the jumpToRegion function to return false, indicating a failure due to "no alignments in this region", whereas in your previous case it returns true. This would be inconsistent.

  2. If I want to jump to a region but I discover that this region is empty, is it still desirable to read records? The sentence you cited from the tutorials makes sense to me if you jump to the beginning of a full region and then search for a specific read within. But with the knowledge of no read being in the region anyway what is the use case of landing somewhere else in the file?

In general, thank you for discussing this. I myself never used nor designed this functionality (although I messed with it when introducing viewRecords) so with your input we seem to catch many edge cases not thought of before.

@serosko
Copy link
Author

serosko commented Aug 3, 2018

You make a valid point with the first issue you bring up

I just looked at the code of the old jumpToRegion and have to correct a few assumptions of how it works. If the region was empty jumpToRegion did not set the position to the last entry before it, but to the first record after the region, as can be seen from this code fragment of the linear scan in jumpToRegion:

while (!atEnd(bamFile))
{
        auto seek_pos = position(bamFile); // store current pos to reset bamFile if necessary

        readRecord(record, bamFile);

        if (record.beginPos >= regionEnd)
            break;

        if (record.beginPos >= regionStart)
        {
            hasAlignments = true;
            setPosition(bamFile, seek_pos);
            break;
        }
}

My assumption for the file position to be set to the last entry before the region was only true, if the region was not empty. This makes me think that calling readRecord after trying to jump to an empty region is a bad idea, because we already skipped an aditional record after the region. So, if we for example have:

chr1 123 ...
chr3 1 ...
chr4 1 ...

in our bam file and want to jump chr2:10-20, the minFileOffset will first set the position in the bam file to chr1:123, and the readRecord call during above loop will return the entry at chr3:1, which is after the region. Therefore the position will not be reset to chr1:123 and jumpToRegion will end, setting hasAlignments to false. If we now call readRecord again, we get chr4:1.

So maybe it is not that important where we end up if the specified region is empty, as long as it is consistent and the next readRecord does not crash.

@rrahn
Copy link
Contributor

rrahn commented Aug 3, 2018

@serosko IMHO the function jumpToRegion is a real misconception. I also would not trust the documentation and try to implement a behaviour that is falsely documented. So what you actually want is to extract all reads with an alignment in the specified region (see: @smehringer answer with viewRecord). jumpToRegion should jump to the first offset, whose record overlaps with the specified region. This does not mean that all subsequent reads will overlap with the region, as for long reads the lengths are very heterogeneous. However, later some records might fall again into it. So you need to iterate through all records until you find the first one that begins after the region or the end of the file is reached. Note, that for short reads this might be ok (probably the data at hand when implementing the function). Anyway, this means that there should be two possible outcomes, when calling jumpToRegion. Either, there is at least one valid alignment that overlaps the region, then the bam file is set to the first record sorted by their begin position in the reference and the function returns true, or if no read could be found, then the file position should be set to the first record after the region or to the end of the file and should return false.

@smehringer didn't we fix this? Did we forget to update the documentation?

@serosko
Copy link
Author

serosko commented Aug 3, 2018

Those two outcomes sound sensible to me and would indeed improve the at the moment somewhat confusing behavior/description of the function.

@smehringer
Copy link
Member

@rrahn Yes, currently we jump to the first alignment within the region if any are present (hasAlignments=true), or the first alignment after the region(or end of file) if none is present (hasAlignments=false). The latter behaviour is not documented and should in any case be updated. The jumpToRegion function returns true in both cases.

This only works though, because in new versions the linear index gives the file offset of the first alignment in the bin spanning the region (if present) or the offset of the first alignment in a preceding bin. This way we can afterwards advance in the BAM file and check the region again. With 0's as file offsets this doesn't work. So the question is how to handle this case (Correcting the index while reading it in or going back until a valid index entry in another preceeding reference is found, both share the problem of what to do if all preceeding index entry's are invalid/0's).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants