Filter out very long intron gene (max intron size) #9

baozg · 2022-09-20T13:00:38Z

There seem to be a lot of protein mappings with very long introns in the miniprot result. Can we use some parameter to filter these out? The intron sizes are smaller than the default -G 200k. Was miniprot confused by the context of different genes?

The Liftoff result was the evidence based on the existing annotation.

lh3 · 2022-09-20T14:29:42Z

Terminal exons are problematic as they are sometimes too short to be aligned accurately. It is not possible to get high sensitivity and high specificity at the same time based on a single protein. You may filter out a terminal exon with a low alignment score, but you will end up with an incomplete CDS.

For the purpose of gene prediction, you need to integrate signals from multiple proteins. You can choose the best alignment for each protein (at the cost of missing gene duplications). When there are multiple hits in a region, choose the hit with the better score or the higher identity.
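As a rough illustration of "the best alignment for each protein", here is a minimal Python sketch. It is not part of miniprot. It assumes the GFF3 output has `mRNA` lines whose 6th column is the alignment score and whose attribute column carries `ID`, `Identity`, and a `Target` tag starting with the protein name; check your own output before relying on that layout. The function name `best_hit_per_protein` is made up for this example.

```python
from typing import Dict, Iterable, Tuple


def parse_attrs(field: str) -> Dict[str, str]:
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def best_hit_per_protein(gff_lines: Iterable[str]) -> Dict[str, Tuple[float, float, str]]:
    """Map each protein to (score, identity, mRNA ID) of its best mRNA line.

    Hits are ranked by score first and identity second. Keeping one hit per
    protein drops alignments to duplicated genes, as noted above.
    """
    best: Dict[str, Tuple[float, float, str]] = {}
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "mRNA" or cols[5] == ".":
            continue
        attrs = parse_attrs(cols[8])
        # Assumption: Target looks like "<protein name> <start> <end>".
        protein = attrs.get("Target", "").split(" ")[0]
        if not protein:
            continue
        score = float(cols[5])
        identity = float(attrs.get("Identity", "0"))
        if protein not in best or (score, identity) > best[protein][:2]:
            best[protein] = (score, identity, attrs.get("ID", ""))
    return best
```

The returned mRNA IDs can then be used to keep only the matching `mRNA` lines and their child features in a second pass over the file.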

In general, it is not advised to take raw protein alignment as the final annotation, just as we have to run something like stringtie to annotate a genome from RNA-seq read alignment.

baozg · 2022-09-20T14:35:44Z

Thanks for the prompt reply and the good advice. I will add an extra filtering step on the miniprot GFF3. I totally agree with you about the annotation step. Cross-species protein alignment gives signals for isoforms expressed only under specific conditions, which is useful given the cost of comprehensive RNA-seq. I just need a protein alignment layer with a good tradeoff between specificity and sensitivity.
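One form such an extra filtering step could take is a cap on the largest intron per alignment, which is what the issue title asks for. The sketch below is only an illustration, not a miniprot feature. It assumes each alignment's `CDS` lines share a `Parent` attribute and treats every gap between consecutive CDS blocks as an intron. The function names and the threshold are placeholders to adapt to your genome.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def max_intron_by_parent(gff_lines: Iterable[str]) -> Dict[str, int]:
    """Return the largest gap between consecutive CDS blocks for each Parent ID."""
    cds: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        parent = None
        for item in cols[8].split(";"):
            if item.startswith("Parent="):
                parent = item[len("Parent="):]
        if parent is None:
            continue
        cds[parent].append((int(cols[3]), int(cols[4])))

    result: Dict[str, int] = {}
    for parent, blocks in cds.items():
        blocks.sort()
        # GFF coordinates are 1-based and inclusive, hence the "- 1".
        gaps = [nxt[0] - prev[1] - 1 for prev, nxt in zip(blocks, blocks[1:])]
        result[parent] = max(gaps, default=0)  # single-exon hits have no intron
    return result


def ids_within_intron_limit(gff_lines: Iterable[str], max_intron: int) -> Set[str]:
    """IDs of alignments whose largest intron does not exceed max_intron (bp)."""
    sizes = max_intron_by_parent(gff_lines)
    return {parent for parent, size in sizes.items() if size <= max_intron}
```

For example, `ids_within_intron_limit(open("miniprot.gff"), 10000)` would return the IDs to keep under a 10 kb cap; both the file name and the 10 kb value are arbitrary here.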

lh3 · 2022-09-20T14:45:18Z

I will keep this issue open. Probably many users will have a similar question. I do need to tune parameters more carefully for terminal exons in the future. I am also thinking of writing a tool for filtering, but that won't happen soon.

By the way, what query proteins were you using? How many proteins?

baozg · 2022-09-20T14:52:24Z

For reference, the genome was a hifiasm-based Arabidopsis thaliana assembly. The query was taken from a TE annotation tool, which uses it to filter out false TEs that are actually protein-coding genes (https://github.com/oushujun/EDTA/blob/master/database/alluniRefprexp082813). It consists of 102,447 proteins from different plants. Another protein dataset I typically use is the plant part of Swiss-Prot (~40,000 reviewed entries).

lh3 · 2022-09-21T01:20:11Z

I see. If there are multiple proteins mapped to the same locus, you may filter out the proteins with a lower alignment score (6th column in GFF) or lower identity (the Identity tag). Distant proteins are harder to map correctly.
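A minimal sketch of that per-locus filter, under my own assumptions rather than anything miniprot provides: a "locus" is taken to be a run of overlapping `mRNA` features on the same sequence and strand, and the winner is the hit with the highest score (6th column), with the `Identity` tag breaking ties. The `Hit` type and `best_hit_per_locus` are hypothetical names.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    seqid: str
    strand: str
    start: int
    end: int
    score: float
    identity: float
    hit_id: str


def best_hit_per_locus(gff_lines: Iterable[str]) -> List[str]:
    """Return the ID of the best-scoring mRNA in each cluster of overlapping hits."""
    hits: List[Hit] = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "mRNA" or cols[5] == ".":
            continue
        attrs = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
        hits.append(Hit(cols[0], cols[6], int(cols[3]), int(cols[4]),
                        float(cols[5]), float(attrs.get("Identity", "0")),
                        attrs.get("ID", "")))

    # Sweep hits in coordinate order, grouping those that overlap.
    hits.sort(key=lambda h: (h.seqid, h.strand, h.start))
    winners: List[Hit] = []
    cluster: List[Hit] = []
    cluster_end = -1
    for hit in hits:
        same_strand = bool(cluster) and (cluster[0].seqid, cluster[0].strand) == (hit.seqid, hit.strand)
        if same_strand and hit.start <= cluster_end:
            cluster.append(hit)
            cluster_end = max(cluster_end, hit.end)
        else:
            if cluster:
                winners.append(max(cluster, key=lambda h: (h.score, h.identity)))
            cluster = [hit]
            cluster_end = hit.end
    if cluster:
        winners.append(max(cluster, key=lambda h: (h.score, h.identity)))
    return [w.hit_id for w in winners]
```

Clustering by any overlap is a deliberately simple choice; neighbouring genes bridged by one long mis-spliced hit would be merged into a single locus, so it may be worth applying an intron-size cap before this step.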

baozg · 2022-09-21T08:32:23Z

Great!! I will filter by that tag. But another issue remains. Because annotation quality varies between the assemblies of different species, it's hard to make a tradeoff between the closest protein set and the best-quality one. So I prefer to use several protein datasets, or manually reviewed proteins, as evidence for annotation.

What level of divergence would you recommend as the limit miniprot can handle? Or could you add some presets, like minimap2 -x asm5/asm10/asm20, to change the threshold for mapping divergent proteins?

lh3 added the question label (further information is requested) Sep 20, 2022

baozg closed this as completed Sep 20, 2022

lh3 reopened this Sep 20, 2022
