To-do #1

Open
apcamargo opened this issue Jan 23, 2021 · 2 comments

@apcamargo

apcamargo commented Jan 23, 2021

  • Use Vamb's transformation to reduce the number of TNF dimensions (103, instead of 136)
  • Reduce memory footprint:
    • Use screed
    • Use hashes
    • Use Rust in taxopy
  • Implement a modular interface, so that users can choose between several combinations. For example:
    • Sequence composition
    • Sequence composition + coverage
    • Sequence composition + coverage + codon usage
    • Sequence composition + coverage + codon usage + taxonomy
    • Coverage + codon usage
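
For context on the first item, the 136 figure is the number of canonical tetranucleotides, i.e. the 256 possible 4-mers after merging each one with its reverse complement. A minimal sketch that counts them (the helper name `canonical` is only illustrative, not taken from any of the tools mentioned here):

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


# Collapse all 256 tetramers onto their canonical representatives.
canonical_tetramers = sorted(
    {canonical("".join(bases)) for bases in product("ACGT", repeat=4)}
)
print(len(canonical_tetramers))  # 136: 16 palindromes + 240 / 2 merged pairs
```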
@jakobnissen

If you need help implementing the transformation, let me know.
The idea is described in the paper I took it from: Kislyuk, A., Bhatnagar, S., Dushoff, J. et al. Unsupervised statistical clustering of environmental shotgun sequences. BMC Bioinformatics 10, 316 (2009). https://doi.org/10.1186/1471-2105-10-316

Practically speaking, I've found the best way is to compute a kernel beforehand using SciPy, save it to a file, and then just load the kernel at runtime. You can copy the file src/create_kernel.py directly from Vamb.
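
The precompute-then-load workflow could look roughly like the sketch below. It is not a copy of Vamb's `src/create_kernel.py`: the constraint set is an assumption based on the Kislyuk et al. idea (frequencies sum to a constant, each tetramer matches its reverse complement, and every trimer is counted equally often as a prefix and as a suffix), the function names and the `kernel.npz` file name are hypothetical, and the null space is taken with NumPy's SVD as a stand-in for `scipy.linalg.null_space`.

```python
import itertools
import os
import tempfile

import numpy as np

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def all_kmers(k):
    return ["".join(p) for p in itertools.product("ACGT", repeat=k)]


def reverse_complement(kmer):
    return kmer.translate(COMPLEMENT)[::-1]


def create_projection_kernel():
    """Build a basis for the space left over after the linear constraints."""
    tetramers = all_kmers(4)
    index = {kmer: i for i, kmer in enumerate(tetramers)}
    rows = [[1.0] * 256]  # assumed constraint 1: frequencies sum to a constant

    # Assumed constraint 2: f(kmer) == f(reverse complement), one row per pair.
    for kmer in tetramers:
        rc = reverse_complement(kmer)
        if kmer < rc:
            row = [0.0] * 256
            row[index[kmer]], row[index[rc]] = 1.0, -1.0
            rows.append(row)

    # Assumed constraint 3: sum over x of f(ABCx) == sum over x of f(xABC).
    for trimer in all_kmers(3):
        row = [0.0] * 256
        for base in "ACGT":
            row[index[trimer + base]] += 1.0
            row[index[base + trimer]] -= 1.0
        rows.append(row)

    constraints = np.array(rows)
    # Null space via SVD: right singular vectors beyond the numerical rank.
    _, s, vt = np.linalg.svd(constraints, full_matrices=True)
    tol = s.max() * max(constraints.shape) * np.finfo(float).eps
    rank = int((s > tol).sum())
    return vt[rank:].T.astype(np.float32)


def save_kernel(kernel, path):
    np.savez_compressed(path, kernel=kernel)


def load_kernel(path):
    return np.load(path)["kernel"]


def project(tnf, kernel):
    """Project an (n_sequences, 256) TNF matrix onto the kernel columns."""
    return tnf @ kernel


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "kernel.npz")
    save_kernel(create_projection_kernel(), path)  # done once, ahead of time
    kernel = load_kernel(path)  # done at runtime
    print(kernel.shape)
```

With these constraints the kernel should have one row per tetramer and 103 columns, matching the number in the to-do list. Only the load and the matrix product happen at runtime.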

@apcamargo

Thanks, @jakobnissen!

Now that I finished the first version of geNomad I might pick this up again. Your approach looks clean, I'll use it :)
