Finding a suitable reference for a set of genomes #26

Open
MostafaYA opened this issue May 2, 2022 · 3 comments

@MostafaYA

Hello, thanks for this great tool.
Just a question:
How do I select an appropriate reference for a set of (diverse) genomes?
When I run ReferenceSeeker on such a set, it gives a different reference for each genome.

@oschwengers oschwengers self-assigned this May 2, 2022
@oschwengers oschwengers added the enhancement New feature or request label May 2, 2022
@oschwengers
Owner

Hi @MostafaYA,
thanks for this excellent question! This is indeed an interesting use case, and we have already started working on a solution. However, it will still take a while. Maybe we can provide one by the end of this year.

@pvanheus

@oschwengers any update on that work? I'm wondering what the best approach would be here? Two passes, the first that finds all candidates for all samples and the second that computes distance to each of these candidates and finds the one with the lowest average distance?
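
The two-pass idea above could be sketched roughly as follows. This is only an illustration, not anything ReferenceSeeker provides: the sample names, reference names, distances, and both helper functions (`pool_candidates`, `pick_shared_reference`) are made up, and the distances are assumed to have been computed beforehand by some external tool.

```python
from statistics import mean

# Pass 1 (hypothetical input): candidate references reported per sample.
per_sample_hits = {
    "s1": ["ref_A", "ref_C"],
    "s2": ["ref_B"],
    "s3": ["ref_C", "ref_B"],
}

# Pass 2 (hypothetical input): distance from every pooled candidate
# to every sample, lower is closer.
distances = {
    "ref_A": {"s1": 0.01, "s2": 0.05, "s3": 0.04},
    "ref_B": {"s1": 0.03, "s2": 0.03, "s3": 0.03},
    "ref_C": {"s1": 0.02, "s2": 0.08, "s3": 0.02},
}


def pool_candidates(hits):
    """Union of all per-sample candidates, sorted for reproducibility."""
    return sorted({ref for refs in hits.values() for ref in refs})


def pick_shared_reference(candidates, dist):
    """Return the candidate with the lowest average distance to all samples."""
    return min(candidates, key=lambda ref: mean(dist[ref].values()))


candidates = pool_candidates(per_sample_hits)
best = pick_shared_reference(candidates, distances)
print(best)
```

A plain average is the simplest possible criterion; a candidate that is very close to most samples but far from one can still lose to a uniformly mediocre one, which may or may not be what is wanted.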

@oschwengers
Owner

Thanks @pvanheus for bringing this up again. Actually, this just slipped down my priority list. But if there is still a need for and interest in that, I would try to work on this as a side-side project. Unfortunately, I cannot make any reliable commitments to this right now.

Regarding the workflow, it is as you described: first we calculate approximate genome distances (for instance with Mash) as a rough estimate to select reference candidates. Then we compute ANI between all query genomes and reference candidates, and finally rank and select the references. The main problem we tried to work on is how best to rank the reference genomes, since ANI values can of course differ a lot between a reference and the given query genomes. How should harsh outliers be handled, for example? As a simple approach, we played around with the classic arithmetic, geometric, and harmonic means...
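
To make the outlier question concrete, here is a minimal sketch of how the choice of mean can change the ranking. It is not the ReferenceSeeker implementation; the ANI values, the reference names, and the `rank_references` helper are invented for illustration, and only Python's standard `statistics` module is used.

```python
from statistics import fmean, geometric_mean, harmonic_mean

# Hypothetical ANI values (percent) of each candidate reference
# against three query genomes.
ani = {
    "ref_A": [99.0, 99.0, 80.0],  # excellent for two queries, one harsh outlier
    "ref_B": [92.5, 92.5, 92.5],  # uniformly moderate
}


def rank_references(ani_by_ref, average):
    """Sort candidate references by their averaged ANI, best first."""
    return sorted(ani_by_ref, key=lambda ref: average(ani_by_ref[ref]), reverse=True)


for name, average in [
    ("arithmetic", fmean),
    ("geometric", geometric_mean),
    ("harmonic", harmonic_mean),
]:
    print(name, rank_references(ani, average))
```

With these made-up numbers, the arithmetic mean ranks `ref_A` first, while the geometric and harmonic means rank `ref_B` first, because they penalize the single low value more strongly. Whether that penalty is desirable depends on how much a poorly matched query genome should count.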
