Filter out very long intron gene (max intron size) #9

Open
baozg opened this issue Sep 20, 2022 · 6 comments
Labels
question Further information is requested

Comments


baozg commented Sep 20, 2022

Hi, @lh3

It seems that the miniprot result contains a lot of protein alignments with very long introns. Is there a parameter we can use to filter these out? The introns are shorter than the default -G 200k. Is miniprot being confused by the context of different genes?

[image]

The Liftoff result is the evidence based on the existing annotation.
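
A post-hoc filter on intron length is one way to approach the question above. The following is a minimal sketch, not a miniprot feature: it assumes the GFF3 output has one `mRNA` line per alignment with an `ID=` attribute and `CDS` lines with a matching `Parent=` attribute, and it treats the gap between consecutive CDS features as an intron. The function names and the 10,000 bp default are hypothetical placeholders; pick a threshold that suits your genome.

```python
from collections import defaultdict


def parse_attrs(field):
    """Parse a GFF3 attribute column such as 'ID=MP1;Identity=0.9' into a dict."""
    return dict(kv.split("=", 1) for kv in field.strip().split(";") if "=" in kv)


def max_intron_by_mrna(gff_lines):
    """Map each mRNA ID to the longest gap (bp) between its consecutive CDS features."""
    cds = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "CDS":
            continue
        parent = parse_attrs(f[8]).get("Parent")
        if parent is not None:
            cds[parent].append((int(f[3]), int(f[4])))
    longest = {}
    for parent, intervals in cds.items():
        intervals.sort()
        # GFF coordinates are 1-based and inclusive, hence the "- 1".
        gaps = [nxt[0] - cur[1] - 1 for cur, nxt in zip(intervals, intervals[1:])]
        longest[parent] = max(gaps, default=0)
    return longest


def drop_long_intron_models(gff_lines, max_intron=10000):
    """Return the input lines minus every mRNA (and its children) with an intron > max_intron.

    max_intron is a placeholder threshold (assumption), not a recommended value.
    Comment lines (starting with '#') are passed through unchanged.
    """
    lines = list(gff_lines)
    too_long = {m for m, gap in max_intron_by_mrna(lines).items() if gap > max_intron}
    kept = []
    for line in lines:
        if not line.startswith("#"):
            f = line.rstrip("\n").split("\t")
            if len(f) >= 9:
                attrs = parse_attrs(f[8])
                if attrs.get("ID") in too_long or attrs.get("Parent") in too_long:
                    continue
        kept.append(line)
    return kept
```

To use it, read the miniprot GFF3 into a list of lines, pass the list to `drop_long_intron_models`, and write the returned lines back out. As noted in the reply below, dropping whole models this way trades sensitivity for specificity.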

Owner

lh3 commented Sep 20, 2022

Terminal exons are problematic as they are sometimes too short to be aligned accurately. It is not possible to get high sensitivity and high specificity at the same time based on a single protein. You may filter out a terminal exon with a low alignment score, but you will end up with an incomplete CDS.

For the purpose of gene prediction, you need to integrate signals from multiple proteins. You can choose the best alignment for each protein (at the cost of missing gene duplications). When there are multiple hits in a region, choose the hit with the better score or the higher identity.

In general, it is not advised to take raw protein alignment as the final annotation, just as we have to run something like stringtie to annotate a genome from RNA-seq read alignment.
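
Choosing the best alignment for each protein, as suggested above, can be sketched in a few lines. This is an illustration under assumptions rather than part of miniprot: it assumes each `mRNA` line carries a numeric score in column 6, an `Identity=` attribute, and a `Target=` attribute whose first space-separated token is the protein name. The function name `best_hit_per_protein` is hypothetical.

```python
def parse_attrs(field):
    """Parse a GFF3 attribute column such as 'ID=MP1;Identity=0.9' into a dict."""
    return dict(kv.split("=", 1) for kv in field.strip().split(";") if "=" in kv)


def best_hit_per_protein(gff_lines):
    """Return {protein name: (score, identity, mRNA ID)} for each query's top hit.

    Hits are ranked by score first and identity second. Keeping a single hit
    per protein discards additional copies of duplicated genes.
    """
    best = {}
    for line in gff_lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "mRNA":
            continue
        attrs = parse_attrs(f[8])
        # Assumed layout: "Target=<protein> <start> <end>".
        protein = attrs.get("Target", "").split(" ")[0]
        if not protein:
            continue
        key = (float(f[5]), float(attrs.get("Identity", 0)))
        if protein not in best or key > best[protein][:2]:
            best[protein] = key + (attrs.get("ID"),)
    return best
```

The returned mRNA IDs can then be used to select the matching `mRNA` and `CDS` lines from the original file.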

lh3 added the question label Sep 20, 2022
Author

baozg commented Sep 20, 2022

Thanks for the prompt reply and the good advice. I will add an extra filtering step on the miniprot GFF3. I totally agree with you about the annotation step. Cross-species protein alignment gives signals for isoforms expressed only under specific conditions, which matters given the cost of comprehensive RNA-seq. I just need a protein alignment layer with a good tradeoff between specificity and sensitivity.

baozg closed this as completed Sep 20, 2022
Owner

lh3 commented Sep 20, 2022

I will keep this issue open. Probably many users will have a similar question. I do need to tune parameters more carefully for terminal exons in the future. I am also thinking of writing a tool for filtering, but that won't happen soon.

By the way, what query proteins were you using? How many proteins?

lh3 reopened this Sep 20, 2022
Author

baozg commented Sep 20, 2022

For reference, the target was a hifiasm-based Arabidopsis thaliana assembly. The query was taken from a TE annotation tool, which uses it to filter out false TEs that are really protein-coding genes (https://github.com/oushujun/EDTA/blob/master/database/alluniRefprexp082813). It consists of 102,447 proteins from different plants. Another protein dataset I typically use is the plant portion of Swiss-Prot (~40,000 reviewed proteins).

Owner

lh3 commented Sep 21, 2022

I see. If there are multiple proteins mapped to the same locus, you may filter out the proteins with a lower alignment score (6th column in GFF) or a lower identity (the Identity tag). Distant proteins are harder to map correctly.
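
A per-locus filter along these lines might look like the sketch below. It rests on assumptions that are not stated in this thread: a "locus" is taken to be a chain of overlapping `mRNA` features on the same contig and strand, the score is read from column 6, and identity from the `Identity=` attribute. The names `best_hit_per_locus` and `min_identity` are hypothetical.

```python
from collections import namedtuple

Hit = namedtuple("Hit", "contig strand start end score identity mrna_id")


def parse_attrs(field):
    """Parse a GFF3 attribute column such as 'ID=MP1;Identity=0.9' into a dict."""
    return dict(kv.split("=", 1) for kv in field.strip().split(";") if "=" in kv)


def best_hit_per_locus(gff_lines, min_identity=0.0):
    """Return the mRNA IDs of the best hit in each cluster of overlapping hits.

    Hits below min_identity are dropped first. The rest are clustered by
    overlap on the same contig and strand (an assumed definition of locus),
    and the hit with the highest (score, identity) wins each cluster.
    """
    hits = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "mRNA":
            continue
        attrs = parse_attrs(f[8])
        identity = float(attrs.get("Identity", 0))
        if identity >= min_identity:
            hits.append(Hit(f[0], f[6], int(f[3]), int(f[4]),
                            float(f[5]), identity, attrs.get("ID")))
    hits.sort(key=lambda h: (h.contig, h.strand, h.start))

    winners = []
    best = None          # best hit in the current cluster
    cluster_end = None   # rightmost coordinate of the current cluster
    for h in hits:
        same_cluster = (best is not None
                        and (h.contig, h.strand) == (best.contig, best.strand)
                        and h.start <= cluster_end)
        if same_cluster:
            cluster_end = max(cluster_end, h.end)
            if (h.score, h.identity) > (best.score, best.identity):
                best = h
        else:
            if best is not None:
                winners.append(best.mrna_id)
            best, cluster_end = h, h.end
    if best is not None:
        winners.append(best.mrna_id)
    return winners
```

One caveat with overlap-based clustering: a spurious alignment with a very long intron can span several genes and merge them into one cluster, so it may help to remove such alignments before clustering.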

Author

baozg commented Sep 21, 2022

Great! I will filter by that tag. But another issue remains. Because annotation quality varies across species assemblies, it is hard to trade off between the closest proteins and the best-quality proteins. So I prefer to use several protein datasets, or manually reviewed proteins, as evidence for annotation.

What level of divergence would you recommend as the limit miniprot can handle? Or could you add some presets, like minimap2 -x asm5/asm10/asm20, to change the threshold for mapping divergent proteins?
