Samtools merge on .sam files? Inconsistent input-order-dependant behaviour #2054
Comments
Most (but not all) of the samtools tools will read and write SAM, BAM or CRAM files. I'm not sure how the order of input files affects the header size. Do any of the files look incorrect?
The files look fine as far as I can tell; samtools can read them, at least. Still, the differences are not minor: for a ~10GB alignment file they amount to several hundred MB of header lines.
Can you count the different header tags and see which tags have been added?
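One way to do the count asked for here, as a minimal sketch: it assumes an uncompressed, plain-text SAM file and tallies header lines by record type (@HD, @SQ, @RG, @PG, @CO). The function name count_header_records is hypothetical and not part of samtools.

```python
from collections import Counter


def count_header_records(sam_path):
    """Tally SAM header lines by record type (e.g. @SQ, @PG)."""
    counts = Counter()
    with open(sam_path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            # The header is a contiguous block at the top of the file,
            # so stop at the first line that does not start with '@'.
            if not line.startswith("@"):
                break
            # The record type is the first tab-separated field.
            record_type = line.rstrip("\n").split("\t", 1)[0]
            counts[record_type] += 1
    return counts
```

Running this on each merged output and comparing the resulting counters shows which record type, if any, accounts for a size difference.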
Sorry for the wait. I think I must have gotten a bit confused when writing this issue. For merged.sam, rev_merged.sam and wildcard_merged.sam, the counts of lines starting with @, lines starting with <sample_id>, and lines starting with read are all the same. I apologize for my previous oversight. When merging two .sam files of 6.8G and 1.1G respectively, the merged files are between 7.3G and 7.6G in size.
I merged 74G and 81G SAM files in both orders. The resulting files differed in size by only 6 bytes out of 155G. I am not sure why you are seeing such a big difference in size.
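To see where two merged outputs differ, the headers can be compared directly. The following is a rough sketch, assuming plain-text SAM files whose headers fit in memory; read_header and diff_headers are hypothetical helper names.

```python
from collections import Counter


def read_header(sam_path):
    """Return the header lines (those starting with '@') of a SAM file."""
    header = []
    with open(sam_path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if not line.startswith("@"):
                break  # the header ends at the first alignment record
            header.append(line.rstrip("\n"))
    return header


def diff_headers(path_a, path_b):
    """Return (only_in_a, only_in_b) as sorted lists of header lines.

    Lines are compared as multisets, so ordering differences between
    the two headers are ignored but duplicated lines are still counted.
    """
    a = Counter(read_header(path_a))
    b = Counter(read_header(path_b))
    only_in_a = sorted((a - b).elements())
    only_in_b = sorted((b - a).elements())
    return only_in_a, only_in_b
```

If both returned lists are empty, the headers contain the same lines, and any size difference must come from the alignment records.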
Right, thanks for checking! I'll have to do more digging later. I'll close the issue for now.
samtools 1.13
Using htslib 1.13+ds
Running on: ubuntu:22.04 container
I have two .sam files which I want to merge. They are name-sorted.

Q1: There is no documented behaviour for samtools merge on .sam files. The documentation only mentions .bam files. Is samtools merge supposed to be used for .sam files too?

Q2: Assuming samtools merge can be used on SAM files, I noticed that the following four commands all yield different files of different sizes:

Surprisingly, input order seems to play a role! Using wildcards again gives different results.
For all of these options, the sum of the sequence lines in the parent files is equal to the number of sequence lines in the merged file. The header lines, however, get reduced quite a bit.
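The line-count check described here can be scripted. This is a minimal sketch, assuming plain-text SAM input; count_lines and alignments_preserved are hypothetical names, and any file paths passed in are placeholders.

```python
def count_lines(sam_path):
    """Return (header_lines, alignment_lines) for a plain-text SAM file."""
    header = 0
    alignments = 0
    with open(sam_path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.startswith("@"):
                header += 1
            elif line.strip():  # skip blank lines
                alignments += 1
    return header, alignments


def alignments_preserved(input_paths, merged_path):
    """Check that the merged file holds as many alignment lines as its inputs combined."""
    expected = sum(count_lines(path)[1] for path in input_paths)
    return count_lines(merged_path)[1] == expected
```

Header counts are reported separately because the merged header is not expected to be the simple sum of the input headers.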
Can you explain this behavior and advise me on what to use?
Thanks!