Samtools merge on .sam files? Inconsistent input-order-dependant behaviour #2054

Closed
RayHackett opened this issue May 16, 2024 · 6 comments

@RayHackett

samtools 1.13
Using htslib 1.13+ds
running on: ubuntu:22.04 container

I have two .sam files which I want to merge. They are name-sorted.
Q1: There is no documented behaviour for samtools merge on .sam files. Documentation only mentions .bam files. Is samtools merge supposed to be used for .sam files too?

Q2: Assuming samtools merge can be used on .sam files, I noticed that the following four commands all yield files of different sizes:

  • `samtools merge -n -o merged.sam file1.sam file2.sam`
  • `samtools merge -n -o rev_merged.sam file2.sam file1.sam`
  • `samtools merge -n -o wildcard_merged.sam file*.sam`
  • `samtools view -Sh --no-PG file1.sam > viewmerged.sam;  samtools view -S file2.sam >> viewmerged.sam`

Surprisingly, input order seems to play a role! Using wildcards again gives different results.
For all of these options, the sum of the sequence lines of the parent files is equal to the number of sequence lines in the merged file. The header lines, however, are reduced quite a bit.
Can you explain this behavior and advise me on what to use?
Thanks!

@whitwham
Contributor

Most (but not all) of the samtools tools will read and write SAM, BAM or CRAM files. samtools merge will handle any of these file formats.

I'm not sure how the order of input files affects the header size. Do any of the files look incorrect?

@RayHackett
Author

The files look fine as far as I can tell, and samtools can read them. Still, the differences are not minor: for a ~10GB alignment file they amount to several hundred MB of header lines.

@whitwham
Contributor

Can you count the different header tags and see which tags have been added?
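
A minimal sketch of one way to do this count for a plain-text SAM file, assuming a POSIX shell with `awk` and `sort` available. The function name `count_header_tags` is hypothetical and not part of samtools; it tallies header lines by record type (`@HD`, `@SQ`, `@RG`, `@PG`, `@CO`) and stops at the first alignment record.

```shell
# Count SAM header lines per record type (@HD, @SQ, @RG, @PG, @CO).
# Reads the file named in $1, or standard input if no argument is given.
# Hypothetical helper for illustration; not a samtools command.
count_header_tags() {
    awk -F'\t' '
        /^@/ { n[$1]++; next }   # header line: tally its record type (first field)
             { exit }            # first alignment record: the header is over
        END  { for (t in n) print t, n[t] }
    ' "$@" | LC_ALL=C sort
}
```

Running it on each output (for example `count_header_tags merged.sam` and `count_header_tags viewmerged.sam`) and comparing the two listings should show which record types differ. For BAM or CRAM input, the header can be piped in with `samtools view -H file.bam | count_header_tags`.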

@RayHackett
Author

Sorry for the wait. I think I must have gotten a bit confused when writing this issue.
The line counts are all identical. Only the file built by concatenating the output of `samtools view` loses some @RG and @PG lines.

For merged.sam, rev_merged.sam and wildcard_merged.sam the line counts for lines starting with @, starting with <sample_id> and lines starting with read are all the same. I must apologize for my previous oversight.

When merging two .sam files of 6.8G and 1.1G respectively, the merged files are between 7.3G and 7.6G in size.
Is this kind of a difference reproducible for you with any two alignment files?

@whitwham
Contributor

I merged 74G and 81G sam files in both orders. The resulting files differed in size by only 6 bytes out of 155G. I am not sure why you are seeing such a big difference in size.
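
To see whether such a size difference sits in the header or in the alignment records, the two parts of each uncompressed SAM file can be measured separately. The following is a rough sketch, assuming `awk` is available; `sam_size_breakdown` is a hypothetical helper name, not a samtools command.

```shell
# Report line and byte counts for the header and the alignment records
# of an uncompressed SAM file given as $1.
# Hypothetical helper for illustration; not a samtools command.
sam_size_breakdown() {
    LC_ALL=C awk '
        /^@/ { hl++; hb += length($0) + 1; next }  # header line, +1 for the newline
             { rl++; rb += length($0) + 1 }        # alignment record, +1 for the newline
        END  {
            printf "header_lines=%.0f header_bytes=%.0f record_lines=%.0f record_bytes=%.0f\n",
                   hl, hb, rl, rb
        }
    ' "$1"
}
```

Comparing `sam_size_breakdown merged.sam` with `sam_size_breakdown rev_merged.sam` would show which part accounts for the gap: if the record counts and record bytes match, the difference is confined to the header.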

@RayHackett
Author

Right, thanks for checking! I'll have to do more digging later. I'll close the issue for now.
Again, I appreciate your responses.
