Selectable CAG generation method in v0.9 #64

Open
jgolob opened this issue Mar 2, 2021 · 3 comments

@jgolob
Collaborator

jgolob commented Mar 2, 2021

I suggest making the CAG generation method selectable via a command-line flag, with a choice between the original (ANN-based) approach and the new contig-informed approach.

There are many other improvements in v0.9. This change would both facilitate A/B testing of the new method and create a unified version that retains the prior core method alongside the other improvements.
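As a rough illustration of the suggestion above, a selectable method could be exposed as a single option with two allowed values. This is only a sketch: the flag name `--cag-method`, the values `ann` and `contig`, and the choice of default are all assumptions made for the example, not options that exist in the workflow.

```python
import argparse

# Hypothetical method names; the real workflow may use different identifiers.
CAG_METHODS = ("ann", "contig")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one hypothetical option for the CAG method."""
    parser = argparse.ArgumentParser(description="Illustrative CAG method selector")
    parser.add_argument(
        "--cag-method",
        choices=CAG_METHODS,
        default="contig",  # assumed default: the newer contig-informed approach
        help="'ann' = original ANN-based approach; 'contig' = contig-informed approach",
    )
    return parser


def select_cag_method(argv=None) -> str:
    """Return the CAG generation method named in argv (or sys.argv if None)."""
    args = build_parser().parse_args(argv)
    return args.cag_method
```

Calling `select_cag_method(["--cag-method", "ann"])` would pick the original method, while omitting the flag falls back to the assumed default. Restricting the option with `choices` means a typo in the method name fails at startup instead of partway through a run.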

@sminot
Collaborator

sminot commented Mar 2, 2021

I do like the idea of making different aspects of the workflow selectable, especially when there are such large changes as the updated clustering in v0.9. The big outstanding question is, how do we convey to the user what this selectable option entails?

If we just look at the principles behind the old (ANN-limited) and new (exhaustive) clustering methods, the major expectation is that the new method will find fewer CAGs because (a) it joins CAGs which are inappropriately split up by the ANN-limited method, and (b) it filters genes to those which co-assemble with at least one other gene. Of course, the degree of (b) can already be limited by options currently provided in v0.9.
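To make point (b) concrete, here is a minimal sketch of that kind of filter: keep only the genes that share a contig with at least one other gene. The input format (a dict of gene ID to contig IDs) and the function name are hypothetical, chosen for illustration, and this is not the code used in v0.9.

```python
from collections import defaultdict


def filter_coassembled_genes(gene_to_contigs):
    """Keep genes that share at least one contig with another gene.

    gene_to_contigs: dict mapping gene ID -> iterable of contig IDs
    (a hypothetical input format chosen for this sketch).
    """
    # Invert the mapping so each contig lists the genes found on it.
    contig_to_genes = defaultdict(set)
    for gene, contigs in gene_to_contigs.items():
        for contig in contigs:
            contig_to_genes[contig].add(gene)

    # A gene is kept if any of its contigs carries two or more genes.
    kept = set()
    for genes in contig_to_genes.values():
        if len(genes) >= 2:
            kept.update(genes)
    return kept
```

With `{"g1": ["c1"], "g2": ["c1", "c2"], "g3": ["c3"]}` as input, `g1` and `g2` are kept because they share `c1`, while `g3` is dropped because it sits alone on `c3`. Genes removed this way never reach clustering, which is one reason the CAG count could fall independently of any change in the clustering itself.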

So the question is, what is the benefit of the previous ANN-based method? One potential answer is that it does not require any assembly, so a user can provide a reference gene catalog and skip de novo assembly entirely. However, based on your experience with A/B testing, I suspect there may be other advantages of the old method. I think it is important to figure out what those advantages are in order to decide whether it is worth supporting the old method (and carrying that extra code forward in future versions). I am more than happy to dig into the A/B data you have generated to get to the bottom of the real differences, in particular the CAGs that were created previously and now seem to be absent in v0.9.

Looking forward to hearing your thoughts!

@jgolob
Collaborator Author

jgolob commented Mar 2, 2021

I think the use case you present (assembly-free CAG generation from a pre-existing library) is a good reason to keep the ANN method.

Likewise, I think ANN is worth supporting for now. It can be 'deprecated' in the v1.0 release. I worry that, with the other changes in v0.9, A/B testing may be complicated.

One thought is that ANN may be serving a useful filtering role here compared to the comprehensive clustering, which may be overfitting or creating spurious co-abundant groups based on limited replicates and a large number of features. It's too early to tell for certain, but it is clear at this point that we need to do more A/B testing.

For all of this I think it's worth the effort of maintaining both methods in one branch for now.

@sminot
Collaborator

sminot commented Mar 2, 2021

I agree, the A/B testing is going to be key here. I just did a quick review of the codebase, and the CAG-creation process was refactored significantly in v0.9, meaning that adding ANN back as a selectable option is going to take a bit of work. In the meantime, what if we just used the existing versioning system to compare the results of the previous release, v0.8.6 (using ANN), with the newest version, v0.9 (using exhaustive clustering)? That might meet our immediate need to understand the A/B differences clearly.
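One possible way to start that comparison, sketched under the assumption that each release's output can be reduced to a simple gene-to-CAG mapping (the dict format, function name, and summary keys below are hypothetical, not taken from either release):

```python
from collections import defaultdict


def compare_cag_assignments(old, new):
    """Summarize differences between two gene -> CAG assignments.

    old, new: dicts mapping gene ID -> CAG ID (hypothetical format),
    e.g. one built from a v0.8.6 run and one from a v0.9 run.
    """
    shared = set(old) & set(new)

    # For each new CAG, collect the old CAGs its shared genes came from.
    new_to_old = defaultdict(set)
    for gene in shared:
        new_to_old[new[gene]].add(old[gene])

    # New CAGs that draw genes from more than one old CAG.
    merged = {cag: sorted(olds) for cag, olds in new_to_old.items() if len(olds) > 1}

    return {
        "n_cags_old": len(set(old.values())),
        "n_cags_new": len(set(new.values())),
        "genes_only_in_old": sorted(set(old) - set(new)),
        "genes_only_in_new": sorted(set(new) - set(old)),
        "merged_in_new": merged,
    }
```

The `merged_in_new` entry would point at old CAGs joined by the exhaustive clustering, and `genes_only_in_old` at genes that may have been dropped by the co-assembly filter, so the two expected effects can be looked at separately.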
