Question about --collapse #94

Open
Antonia-Chalka opened this issue Jul 19, 2021 · 2 comments

@Antonia-Chalka

I have a very basic question about how the --collapse flag determines grouping. Does it collapse genotypes that have the exact same distribution across all the samples, or is some other type of correlation statistic used to determine that (and if so, what is it and what is the threshold)?

Both the README and the paper note the following:

> For each phenotype supplied via columns in the traits file, Scoary does the following: first, correlated genotype variants are collapsed. Plasmid genes, for example, are typically inherited together rather than as individual units and Scoary will collapse these genes into a single unit.

@Antonia-Chalka
Author

From a quick look at the code in the methods script, it seems the correlation has to be perfect, but there's also a mention of a 'softer' threshold, so I'm not 100% sure 😅

@AdmiralenOla
Owner

Thanks for your question, and sorry about the wait.

As you have already figured out, the genotypes need to be 100% correlated to be collapsed. You may also have seen from the code that I thought about using a softer threshold, but I have never gotten around to implementing that.
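
To make the "100% correlated" rule concrete, here is a minimal sketch that groups genes by identical presence/absence pattern. It is an illustration only, not Scoary's actual code: the function name `collapse_identical`, the dict-of-tuples input format, and the `--` name separator are all assumptions, and it only treats identical patterns as collapsible.

```python
from collections import defaultdict


def collapse_identical(genotypes):
    """Group genes whose presence/absence pattern is identical in every isolate.

    genotypes: dict mapping gene name -> tuple of 0/1 values, one per isolate.
    (Hypothetical input format, not Scoary's internal representation.)
    """
    groups = defaultdict(list)
    for gene, pattern in genotypes.items():
        groups[tuple(pattern)].append(gene)
    # Genes sharing a pattern become one collapsed unit with a joined name.
    return {"--".join(sorted(genes)): pattern for pattern, genes in groups.items()}


genotypes = {
    "geneA": (1, 0, 1, 1),
    "geneB": (1, 0, 1, 1),  # same distribution as geneA, so they collapse
    "geneC": (0, 1, 1, 0),  # differs in at least one isolate, stays separate
}
collapsed = collapse_identical(genotypes)
```

With this rule, a gene that differs from another in even a single isolate is kept as its own unit.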

I'm also a bit uncertain how the distribution of the collapsed variant should be counted, i.e. should it be present in all isolates with either of the original variants? I'm uncertain how that would impact other assumptions that are made.
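
A small sketch of why this matters: once two variants are allowed to differ, the collapsed unit's distribution depends on the merge rule. The function `merge_patterns` and the two rules shown ("union" and "intersection") are hypothetical options for illustration, not something Scoary implements.

```python
def merge_patterns(a, b, rule="union"):
    """Combine two 0/1 presence/absence patterns into one collapsed pattern.

    rule="union": present where either original variant is present.
    rule="intersection": present only where both original variants are present.
    (Both rules are illustrative assumptions, not Scoary behaviour.)
    """
    if len(a) != len(b):
        raise ValueError("patterns must cover the same isolates")
    if rule == "union":
        return tuple(x | y for x, y in zip(a, b))
    if rule == "intersection":
        return tuple(x & y for x, y in zip(a, b))
    raise ValueError(f"unknown rule: {rule}")


gene1 = (1, 1, 0, 1)
gene2 = (1, 0, 0, 1)  # disagrees with gene1 in the second isolate only

as_union = merge_patterns(gene1, gene2, "union")
as_intersection = merge_patterns(gene1, gene2, "intersection")
```

Here the two rules disagree about the second isolate, so any downstream presence/absence counts for the collapsed unit would differ depending on which one is chosen.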

Another thing I'm not sure about is whether the collapsed genes should then go through subsequent rounds of correlation -> collapse. That is, when we collapse two genes into one, the result will have a new distribution pattern, and there is a chance that this new pattern will fall within the correlation threshold for being collapsed with yet another gene.
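
The chaining effect can be shown with a toy example. Everything here is an assumption made for illustration: similarity is measured as the fraction of isolates in which two patterns agree, the threshold is 0.75, merged patterns use the union, and pairs are merged greedily in listing order. None of this is implemented in Scoary.

```python
def agreement(a, b):
    """Fraction of isolates in which two 0/1 patterns agree (assumed metric)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def collapse_round(units, threshold):
    """Merge the first pair at or above the threshold; report whether one was found."""
    names = list(units)
    for i, n1 in enumerate(names):
        for n2 in names[i + 1:]:
            if agreement(units[n1], units[n2]) >= threshold:
                merged = tuple(x | y for x, y in zip(units[n1], units[n2]))
                rest = {k: v for k, v in units.items() if k not in (n1, n2)}
                rest[f"{n1}--{n2}"] = merged
                return rest, True
    return units, False


def collapse_until_stable(units, threshold):
    """Repeat collapse rounds until no pair qualifies; count the merges."""
    rounds = 0
    while True:
        units, merged = collapse_round(units, threshold)
        if not merged:
            return units, rounds
        rounds += 1


units = {
    "geneA": (1, 1, 1, 0, 0, 0, 0, 0, 0, 0),
    "geneB": (1, 1, 0, 1, 0, 0, 0, 0, 0, 0),  # agrees with geneA in 8/10 isolates
    "geneC": (1, 0, 1, 1, 1, 0, 0, 0, 0, 0),  # agrees with each of them in only 7/10
}
final, rounds = collapse_until_stable(units, threshold=0.75)
```

In this example geneC is below the threshold against geneA and geneB individually, but agrees with their union in 8 of 10 isolates, so it is only pulled in during a second round. The outcome of such a scheme could also depend on the order in which pairs are merged.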
