Selectable CAG generation method in v0.9 #64

jgolob · 2021-03-02T12:14:02Z

I would suggest the specific CAG generation method be selectable by a command line flag, between the original (ANN-based) approach and the new contig-informed approach.

There are many other improvements in v0.9. This change would both facilitate A/B testing of the new method and create a unified version that retains the prior core method alongside the other improvements.
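As a rough illustration of what such a switch could look like, here is a minimal Python sketch using `argparse`. The flag name `--cag-method` and the values `ann` and `contig` are hypothetical, chosen only for this example; they are not existing options of the workflow, and the default shown is an assumption.

```python
import argparse


def build_parser():
    """Build a parser with a hypothetical flag for choosing the CAG method."""
    parser = argparse.ArgumentParser(
        description="Sketch of a selectable CAG generation method"
    )
    # Hypothetical flag and values, assumed for illustration only.
    parser.add_argument(
        "--cag-method",
        choices=["ann", "contig"],
        default="contig",  # assumed default: the new contig-informed approach
        help="CAG generation method: 'ann' (original) or 'contig' (contig-informed)",
    )
    return parser


def select_cag_method(argv):
    """Return the selected method name for a list of command line arguments."""
    args = build_parser().parse_args(argv)
    return args.cag_method
```

With a single flag like this, the same release could be run twice on the same inputs, once per method, which is what makes the A/B comparison straightforward.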

sminot · 2021-03-02T15:34:42Z

I do like the idea of making different aspects of the workflow selectable, especially when there are such large changes as the updated clustering in v0.9. The big outstanding question is, how do we convey to the user what this selectable option entails?

If we just look at the principles used for the old (ANN-limited) and new (exhaustive) clustering methods, the major expectation is that the new method will find fewer CAGs because (a) it joins CAGs that the ANN-limited method inappropriately splits up, and (b) it filters genes to those which co-assemble with at least one other gene. Of course, the degree of (b) can already be limited by options currently provided in v0.9.

So the question is, what is the benefit of the previous ANN-based method? One potential answer is that it does not require any assembly, and so a user can provide a reference gene catalog and skip de novo assembly entirely. However, I suspect that based on your experience with A/B testing there may be some other advantages of the old method. I do think it is rather important to figure out what those advantages might be in order to decide whether it is worth supporting the old method (and carrying that extra code forward in future versions). I am more than happy to dig into the A/B data that you have generated to get to the bottom of what the real differences are in the CAGs created previously which now seem to be absent in v0.9.

Looking forward to hearing your thoughts!

jgolob · 2021-03-02T15:50:06Z

I think the use case you present (to allow assembly-free CAG generation from a pre-existing library) is a good one to continue with the ANN method.

Likewise, I think ANN is worth supporting for now. It can be 'deprecated' in the v1.0 release. I worry that, with the other changes in v0.9, A/B testing may be complicated.

One thought is that ANN may be serving a useful filtering role here compared to the comprehensive clustering, which may be overfitting or creating spurious co-abundant groups based on limited replicates and a large number of features. It's too early to tell for certain, but it is clear at this point that we need to do more A/B testing.

For all of this I think it's worth the effort of maintaining both methods in one branch for now.

sminot · 2021-03-02T15:55:04Z

I agree, the A/B testing is going to be really key here. I just did a quick review of the codebase, and the CAG-creation process was refactored significantly in v0.9, meaning that adding back ANN as a selectable option is going to take a bit of work. In the meantime, what if we just used the existing versioning system to compare the results of the previous release, v0.8.6 (using ANN), with the newest version, v0.9 (using exhaustive clustering)? That seems like it might meet our immediate need to figure out the A/B differences clearly.
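One way such a version-to-version comparison could be summarized is sketched below in Python. It assumes each release's output has already been loaded into a plain mapping of gene ID to CAG ID; that input shape and the function name `summarize_ab` are assumptions for this example, not part of either release. It reports the two effects expected above: genes filtered out by the new method, and old CAGs that end up joined in a single new CAG.

```python
from collections import defaultdict


def summarize_ab(old_cags, new_cags):
    """Compare two gene -> CAG assignments (hypothetical input format).

    old_cags: dict of gene ID -> CAG ID from the earlier release
    new_cags: dict of gene ID -> CAG ID from the newer release
    """
    # Genes assigned to a CAG previously but absent from the new result.
    dropped = sorted(set(old_cags) - set(new_cags))

    # For each new CAG, collect the old CAGs that its genes came from.
    old_per_new = defaultdict(set)
    for gene, new_id in new_cags.items():
        if gene in old_cags:
            old_per_new[new_id].add(old_cags[gene])

    # New CAGs that contain genes from more than one old CAG.
    merged = {
        new_id: sorted(old_ids)
        for new_id, old_ids in old_per_new.items()
        if len(old_ids) > 1
    }

    return {
        "n_old_cags": len(set(old_cags.values())),
        "n_new_cags": len(set(new_cags.values())),
        "genes_dropped": dropped,
        "merged": merged,
    }
```

A toy call such as `summarize_ab({"g1": "A", "g2": "B"}, {"g1": "X", "g2": "X"})` would flag new CAG `X` as a merge of old CAGs `A` and `B`. Real outputs would first need to be converted into this gene-to-CAG form.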

jgolob assigned sminot Mar 2, 2021

sminot added the discussion label Mar 2, 2021
