Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

option -v to filter out reads in given regions. #669

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

jyh1
Copy link

@jyh1 jyh1 commented Apr 23, 2017

I wanted to be able to remove reads appearing in the specified indexes. For
instance, we could exclude reads mapped to chrM with command `samtools view
-v a.bam chrM`, which is equivalent with `samtools view a.bam chr1 chr2 ...`.
For consitency, `samtools view -v a.bam` will print nothing.

A very ad-hoc solution. The right way is to support this in the sam.c of htslib?

#666

    I wanted to be able to remove reads appearing in the specified indexes. For
    instance, we could exclude reads mapped to chrM with command `samtools view
    -v a.bam chrM`, which is equivalent with `samtools view a.bam chr1 chr2 ...`.
    For consitency, `samtools view -v a.bam` will print nothing.

    A somewhat ad-hoc solution. The right way is to support this in the sam.c
    of htslib?
@mblue9
Copy link

mblue9 commented Jan 15, 2018

This would be SO useful imo! I would love to have this option right now for filtering out mitochondrial reads.

@colindaven
Copy link

Isn't this similar in scope to #423 ?

@daviesrob
Copy link
Member

@colindaven It's not quite the same. #423 inverts the filter test but leaves the region list unchanged - so flags and BED files (which act as a filter) get inverted. This attempts to invert the region list but leaves the filter unchanged. The interaction between the two may get a bit tricky, so it needs some thought.

@jyh1
Copy link
Author

jyh1 commented Feb 8, 2018

@daviesrob Exactly. In essence, the region list is just a kind of filter and I think it only exists for efficiency reason so we could take advantage that the reads are indexed. I believe that's where all the confusion comes from. I think another option could be allowing the user to inefficiently select the regions at the filter level, as oftentimes it is what most of us end up doing by adding an awk following the pipe.

@jkbonfield
Copy link
Contributor

How is the inversion happening with regards to overlaps? If we have a region Z:1000-2000 on a seq Z of length 3000, is this equivalent to reads that overlap Z:1-999 and Z:2001-3000 or is it equivalent to reads that do not overlap Z:1000-2000? I can see both potentially being useful.

@jyh1
Copy link
Author

jyh1 commented Feb 9, 2018

@jkbonfield Right, both will be useful in some circumstances. In this pull request it is equivalent to reads that do not overlap Z:1000-2000. I couldn't figure out a nice way to support both of them at the same time without over complicating the options. IMO it would be very useful to simply support filtering out an entire seq as it is the most common case (like removing chrM reads) and currently it is a pain in the ass to do this simple job, you have to either convert to sam, pipe to an awk, convert back or do some bash scripting exercise like here. The users could always resort to bedtools intersect if they really need precisely mask out certain regions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants