-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug in __getMinFileOffset() #2319
Comments
Hi @serosko, thank you for your bug report. I think we have to discuss this in the team. @smehringer who extensively worked on this is unfortunately on a really long vacation. We might have to postpone this issue until she is back. I cursorily read the discussion on the mailing list and it seems that the default behaviour is indeed to do a linear scan to find the next lowest bin. This value presets a lower position of the linear scan to speed up the search. In recent versions, it seems that this value is always pointing to the correct position and that this linear scan only takes It seems that in version 0.1.8 the change happened and the same field now contains the precomputed lower position for the linear scan to the next bin. They argue that this does not change behaviour, which I agree if you assume that this is a linear scan. If you look at the release date of samtools-0.1.7 (Nov 16, 2009), it is nearly 10years ago. So is it really worthwhile to not assume that this is a jump? @serosko Is there a more recent version of Picard which you could use which fixed the problem? |
Hi, thank you for your reply.
Unfortunately it fails to specify what the bin should contain if no such alignments exist and accessing the linear index is not further elaborated. |
I looked at the PR again and if you look at 19f5561#diff-8639bf08377bf2a04ab67d87e0166199L243 The original code did a linear scan, so this is a regression and should be fixed. Can you create a pull request (with a test case) if you fixed this already? |
I can create a PR of my fix, but I am not sure about the test case. At my current work place I cannot send any files, except maybe my code. But any BAM file with a linear index containing 0's should work (e.g created with samtools-0.1.7.) |
Hi @serosko , Thanks for spotting this. Although this "only" fixes a bug for very old versions that hopefully not a lot of people use, I think it's worth to fix. It's been a while since I worked on this so I hope I understood the issue well enough. Your proposed fix makes it necessary to scan the index in worst case O(n) backwards (correct?), so I thought maybe it's better to correct for 0's directly on reading the index into memory. If a zero is encountered (which as I understand can never be correct for a header must be present in BAM files), it is treated as "no alignment in this interval" and the offset is set to the previous offset instead of storing the 0. This way the index is brought up to the current standard expected by @marehr :
Do you mean that the old code was a regression or the new one? |
The new one, because the linear scan was removed. |
Hi @smehringer , |
@marehr Actually, if you look at the code, you can see that it is completely wrong. It says in the documentation above: The 0 values that @serosko referred to are stored WITHIN the linear index, so So what I am saying is that this is not a regression but the issue was never caught so far. PS: what do you think of my proposed fix? |
@serosko |
@smehringer |
@serosko |
@smehringer
After jumping you can still check if the hasAlignments bool of the function to see if there are any alignments in the region. So yes, if you try to jump to chr11:1-10000 without any records in your example, you should land at the end of chr1 or whatever is the last entry before the region in your bam file with the bool hasAlignments set to false. |
I see your point. It will ensure a valid seek call afterwards. I see two issues though:
In general, thank you for discussing this. I myself never used nor designed this functionality (although I messed with it when introducing |
You make a valid point with the first issue you bring up I just looked at the code of the old jumpToRegion and have to correct a few assumptions of how it works. If the region was empty jumpToRegion did not set the position to the last entry before it, but to the first record after the region, as can be seen from this code fragment of the linear scan in jumpToRegion:
My assumption for the file position to be set to the last entry before the region was only true, if the region was not empty. This makes me think that calling readRecord after trying to jump to an empty region is a bad idea, because we already skipped an aditional record after the region. So, if we for example have: chr1 123 ... in our bam file and want to jump chr2:10-20, the minFileOffset will first set the position in the bam file to chr1:123, and the readRecord call during above loop will return the entry at chr3:1, which is after the region. Therefore the position will not be reset to chr1:123 and jumpToRegion will end, setting hasAlignments to false. If we now call readRecord again, we get chr4:1. So maybe it is not that important where we end up if the specified region is empty, as long as it is consistent and the next readRecord does not crash. |
@serosko IMHO the function @smehringer didn't we fix this? Did we forget to update the documentation? |
Those two outcomes sound sensible to me and would indeed improve the at the moment somewhat confusing behavior/description of the function. |
@rrahn Yes, currently we jump to the first alignment within the region if any are present ( This only works though, because in new versions the linear index gives the file offset of the first alignment in the bin spanning the region (if present) or the offset of the first alignment in a preceding bin. This way we can afterwards advance in the BAM file and check the region again. With 0's as file offsets this doesn't work. So the question is how to handle this case (Correcting the index while reading it in or going back until a valid index entry in another preceeding reference is found, both share the problem of what to do if all preceeding index entry's are invalid/0's). |
Related to #2273.
(using commit 19f5561).
When using jumpToRegion() with BAI files created by Picard tools or samtools-0.1.7, __getMinFileOffset() return 0 for the file offset, because those indices will contain 0 for bins without any alignments (in contrast to indices created by newer versions of samtools, which contain the value of the next smaller bin instead of 0). The difference has been described/discussed in the SAM tools mailing list https://sourceforge.net/p/samtools/mailman/message/25923233/.
This behaviour leads to an error when subsequently trying to read the next record because we jumped to the beginning of the file.
A possible fix is scanning the linear index for the next lowest bin containing a value > 0 if a 0 would be returned and then, after jumping, calling read readRecord() until the desired refID has been reached or passed.
The text was updated successfully, but these errors were encountered: