The best probability and LDDT score to filter in easy-search #243

Open
Jigyasa3 opened this issue Feb 17, 2024 · 4 comments

Comments

@Jigyasa3

Jigyasa3 commented Feb 17, 2024

Hi @martin-steinegger ,

Thank you again for a great resource!
I am using the foldseek easy-search command to annotate some proteins of interest, selecting the hit with the highest prob and LDDT score for each protein. Is there a filter I can use to say with confidence what the putative annotation of a protein of interest is?
For example, several hits have a prob > 0.7 but an LDDT score < 0.3, while most of the proteins have a prob > 0.7 and an LDDT score > 0.5. What is the "best" cutoff for annotating proteins using Foldseek?

Also, where can I find the target protein description? If my target protein is MGYP001275795760, where can I find its full name?

Any suggestions?

@milot-mirdita
Member

The safest cut-off is neither prob nor LDDT/TM-score (in our opinion), since neither has a built-in multiple testing correction. When searching against potentially hundreds of millions of entries, the E-value will likely be the most reliable, and possibly the only reliable, indicator of homology for annotation. In your range, it's probably not possible to say for certain that any of the hits are reliable annotations. They probably all have high E-values? With high E-values and uncertain LDDT/TM-score/prob, we can only establish that there is some structural similarity to be found; stronger statements require additional evidence.
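To make the multiple testing point concrete, here is a minimal sketch of the generic arithmetic behind it: a per-comparison chance probability multiplied by the number of database entries gives the expected number of chance hits. This is not Foldseek's actual E-value calculation; the function name and the example numbers are hypothetical and chosen only for illustration.

```python
def expected_chance_hits(p_value, database_size):
    """Expected number of database entries that reach a score by chance alone.

    Generic illustration of multiple testing, NOT Foldseek's E-value formula.
    """
    return p_value * database_size


# Hypothetical numbers: a one-in-a-million chance per comparison,
# searched against 100 million entries.
print(expected_chance_hits(1e-6, 100_000_000))
```

With these hypothetical numbers, a score that looks very unlikely for a single comparison is still expected to turn up roughly 100 times by chance across the whole database, which is why an uncorrected per-hit score is a weak basis for annotation.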

The MGYP proteins come from MGnify. You can find the source assembly from the metadata on the MGnify download server: http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/current_release/

Specifically the [mgy_assemblies.tsv.gz](http://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/current_release/mgy_assemblies.tsv.gz) file. I don't think that the EBI offers a service yet to map MGYP accessions to their source.
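One possible way to look up an accession in the downloaded file is sketched below. The column layout of mgy_assemblies.tsv.gz is not described in this thread, so the sketch assumes only that the file is tab-separated and that the accession appears verbatim as one of the fields. The function name `find_accession_rows` is hypothetical.

```python
import gzip


def find_accession_rows(path, accession):
    """Stream a gzipped TSV file and return every row containing `accession`.

    Assumption: the accession appears verbatim as one tab-separated field.
    The file is read line by line, so it is never fully loaded into memory.
    """
    matches = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if accession in fields:
                matches.append(fields)
    return matches
```

Usage would look like `find_accession_rows("mgy_assemblies.tsv.gz", "MGYP001275795760")`. If the file stores accessions in a different form (for example without the MGYP prefix), the comparison would need to be adapted.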

@Jigyasa3
Author

Jigyasa3 commented Feb 19, 2024

Hi @milot-mirdita ,

Thank you for replying! I wanted to confirm another thing: while the E-values of the results are high, the alignment length of the matches varies a lot. Some proteins have an alignment length of less than 50 amino acids (but high probability, LDDT score, and E-value).
Can these proteins be considered remote homologs?
Or would you suggest a more stringent filtering criterion for defining remote homologs?

Regards,
Jigyasa

@milot-mirdita
Member

Just to clarify and make sure that there is no miscommunication or typo: a high E-value is bad. E-values should be as low and as close to 0 as possible. Hits with E-values < 10^-3 are normally very certain homologs. For higher values you'd need other evidence to establish homology.
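As a rough illustration of applying the 10^-3 threshold, the sketch below keeps, for each query, the hit with the lowest E-value at or below the cut-off. It assumes the results are in Foldseek's default 12-column tab-separated layout. If a custom `--format-output` was used (for example to include prob and lddt), a matching column list has to be passed instead. The function name and the `max_evalue` parameter are hypothetical.

```python
import csv

# Assumed column order: Foldseek's default tabular output (no --format-output).
DEFAULT_COLUMNS = ["query", "target", "fident", "alnlen", "mismatch", "gapopen",
                   "qstart", "qend", "tstart", "tend", "evalue", "bits"]


def best_hits_by_evalue(handle, columns=DEFAULT_COLUMNS, max_evalue=1e-3):
    """Return {query: row} with the lowest-E-value hit at or below max_evalue."""
    best = {}
    reader = csv.DictReader(handle, fieldnames=columns, delimiter="\t")
    for row in reader:
        evalue = float(row["evalue"])
        if evalue > max_evalue:
            continue  # a high E-value is not sufficient evidence of homology
        current = best.get(row["query"])
        if current is None or evalue < float(current["evalue"]):
            best[row["query"]] = row
    return best
```

A call such as `best_hits_by_evalue(open("result.m8"))` (the file name is a placeholder) returns only the queries that have at least one hit passing the threshold. Queries missing from the result are not necessarily unrelated to everything in the database; they simply lack a hit that is reliable on E-value alone.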

@Jigyasa3
Author

Jigyasa3 commented Feb 20, 2024

Hi @milot-mirdita ,
I am comparing the output from Foldseek with hh-suite to find remote homologs, and I observe that none of the hits have E-values less than 1e-3.
Link to the open issue. Is there a way to examine false negatives?
