Add view -X flag to drop all aux tags #871

Open
wants to merge 1 commit into
base: develop

EvanTheB
Contributor

I do not know if you want this feature, and it is implemented in the wrong place.

I added a -X flag to drop all aux tags. I use this for compression when I just want to save the fastq-ish data.

I think I should have added the code to htslib. If you want it, I am happy to modify the PR so it works that way.

@jkbonfield
Contributor

I don't think this belongs in htslib really, but it could make use of the htslib feature to give hints to the decoder. On BAM there's little that can be done (except maybe not bothering to do validation?), but with CRAM it's possible to tell the decoder to ignore blocks in the file - don't bother decompressing them and no need to serialise all the tags together.

Eg:

    if (hts_set_opt(state->fp, CRAM_OPT_REQUIRED_FIELDS,
                    SAM_QNAME|SAM_FLAG|SAM_RNAME|SAM_POS|SAM_MAPQ|
                    SAM_CIGAR|SAM_RNEXT|SAM_TLEN|SAM_SEQ|SAM_QUAL) < 0) {
        /* error... */
    }

(Clearly we ought to add SAM_ALL so we can do SAM_ALL & ~SAM_AUX.)
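
As a rough illustration of the `SAM_ALL & ~SAM_AUX` idea, here is a minimal self-contained sketch. The `EX_` names are hypothetical and only mirror the bit values of the `SAM_*` field flags in htslib's `sam.h` as I understand them; `EX_SAM_ALL` and `ex_all_except` are assumptions for this example and are not part of htslib. Whether a real `SAM_ALL` should include the RG bit is also an open choice; this sketch includes it.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror htslib's SAM_* field bits; verify against sam.h. */
enum {
    EX_SAM_QNAME = 0x0001,
    EX_SAM_FLAG  = 0x0002,
    EX_SAM_RNAME = 0x0004,
    EX_SAM_POS   = 0x0008,
    EX_SAM_MAPQ  = 0x0010,
    EX_SAM_CIGAR = 0x0020,
    EX_SAM_RNEXT = 0x0040,
    EX_SAM_PNEXT = 0x0080,
    EX_SAM_TLEN  = 0x0100,
    EX_SAM_SEQ   = 0x0200,
    EX_SAM_QUAL  = 0x0400,
    EX_SAM_AUX   = 0x0800,
    EX_SAM_RGAUX = 0x1000
};

/* Hypothetical "everything" mask: the OR of every bit listed above. */
#define EX_SAM_ALL (EX_SAM_QNAME | EX_SAM_FLAG | EX_SAM_RNAME | EX_SAM_POS | \
                    EX_SAM_MAPQ | EX_SAM_CIGAR | EX_SAM_RNEXT | EX_SAM_PNEXT | \
                    EX_SAM_TLEN | EX_SAM_SEQ | EX_SAM_QUAL | EX_SAM_AUX | \
                    EX_SAM_RGAUX)

/* Return the mask of all fields except those named in drop. */
static uint32_t ex_all_except(uint32_t drop)
{
    /* drop must only name bits that exist in the full mask */
    assert((drop & ~(uint32_t)EX_SAM_ALL) == 0);
    return (uint32_t)EX_SAM_ALL & ~drop;
}
```

With these assumed values, `ex_all_except(EX_SAM_AUX)` keeps the RG bit, while `ex_all_except(EX_SAM_AUX | EX_SAM_RGAUX)` leaves only the eleven core columns. The resulting mask is what would be passed as the third argument of the `hts_set_opt` call above.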

@jkbonfield
Contributor

Maybe there's also an argument for adapting how the current --input-fmt-option required_fields=0x4ff implementation works (on CRAM that example does what you want, by the way; it is implemented using the hts_set_opt call above). If you wanted to go further and drop other fields like TLEN, RNEXT, ISIZE, and filter out non-primary read1/read2 records, you could use, say, 0x601 together with view -F 0xF00 to drop all the secondary, supplementary, etc. reads.
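
Since required_fields masks are hard to read as raw hex, a small decoder can help check which columns a given value actually names. The sketch below is not samtools code: `ex_field_names` and `ex_describe_fields` are hypothetical helpers, and the bit order is assumed to follow the `SAM_*` flags in htslib's `sam.h` (QNAME = 0x1 up to RGAUX = 0x1000).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Field names in assumed bit order: bit 0 is QNAME, bit 12 is RGAUX. */
static const char *const ex_field_names[] = {
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL", "AUX", "RGAUX"
};

/*
 * Write the comma-separated names of the bits set in mask into buf.
 * Returns 0 on success, or -1 if buf (of size len) is too small.
 */
static int ex_describe_fields(uint32_t mask, char *buf, size_t len)
{
    size_t used = 0;
    size_t n = sizeof(ex_field_names) / sizeof(ex_field_names[0]);

    assert(buf != NULL);
    if (len == 0)
        return -1;
    buf[0] = '\0';

    for (size_t i = 0; i < n; i++) {
        if (!(mask & (UINT32_C(1) << i)))
            continue;
        size_t name_len = strlen(ex_field_names[i]);
        size_t need = name_len + (used ? 1 : 0);   /* name plus separator */
        if (used + need + 1 > len)                 /* +1 for the NUL */
            return -1;
        if (used)
            buf[used++] = ',';
        strcpy(buf + used, ex_field_names[i]);
        used += name_len;
    }
    return 0;
}
```

Under these assumptions, 0x601 decodes to QNAME, SEQ and QUAL, and 0x4ff decodes to the first eight columns (QNAME through PNEXT) plus QUAL.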

However, input-fmt-option is only a hint. When reading a BAM record we'll have read all the data, so the fields are already there. When reading CRAM, if the data is necessary for decoding other fields (eg we must know POS and RNAME to decode SEQ) then it'll be in the structures, but otherwise it'll be given a placeholder value (*, 0, etc). Perhaps what we want, though, is a required_fields equivalent for output-fmt-option which goes beyond an optimisation hint to become a statement of what will be stored. At that point it's essentially a crude columnar filter. (Crude because it's all or nothing as far as tags go, barring RG.)

Thoughts anyone?

@EvanTheB
Contributor Author

Is there documentation somewhere for what the --input-* and --output-* args take? I keep finding random examples scattered around but no exhaustive doc.

@EvanTheB
Contributor Author

Does #516 with an empty whitelist do the same?

@jkbonfield
Contributor

jkbonfield commented Jun 20, 2018

Thanks @EvanTheB - I'd totally forgotten about that aging PR! We should discuss it and make a decision as it's plenty mature by now. :-)

As for the option arguments, they're in the samtools man page under "GLOBAL OPTIONS". Quite a lot are CRAM-only or only apply on input or output, but this is described in the text. We just added level (compression level) there too, so I'll update the man page.
