Add 16 NSPs #1081

Open
jameshadfield opened this issue Aug 21, 2023 · 0 comments

I looked into the feasibility of adding the 16 NSPs to the exported (Auspice) dataset. This will need Nextclade v3, since RdRp spans the slip site, so it's perhaps a good time to make some bigger changes too. (We've decided not to modify the ORF1a and ORF1b annotations; see the discussion on Slack.)
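
To illustrate the slip-site point, here is a minimal sketch of why a gene that crosses a -1 frameshift has to be described as two segments rather than one contiguous range. The helper name `extract_cds` and the toy sequence are assumptions made up for illustration; this is not real SARS-CoV-2 data and not how Nextclade implements it.

```python
def extract_cds(genome: str, segments: list[tuple[int, int]]) -> str:
    """Concatenate 1-based, inclusive (start, end) segments of a genome."""
    return "".join(genome[start - 1:end] for start, end in segments)


# Toy sequence for illustration only (hypothetical, not a real genome).
toy_genome = "ATGGCTAAACCG"

# Two segments that overlap by one nucleotide mimic a -1 frameshift:
# position 6 is read twice, once at the end of the first segment and
# once at the start of the second.
toy_segments = [(1, 6), (6, 11)]
cds = extract_cds(toy_genome, toy_segments)

# The single contiguous range 1..11 is 11 nt long and not a whole number
# of codons, whereas the segmented CDS is 12 nt long.
contiguous = extract_cds(toy_genome, [(1, 11)])
print(len(contiguous) % 3, len(cds) % 3)
```

A tool that only accepts one start/end pair per gene cannot express the second case, which is the limitation the move to Nextclade v3 is meant to address.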

  • Nextclade does the translations, so we need to update the genemap.gff for Nextclade's 'sars-cov-2' dataset.
  • Our ancestral reconstruction of the translations (rule translate) is what creates the annotations block in the JSON. This currently uses defaults/reference_seq.gb for the annotations, and nothing else uses this.
    • We can shift the reconstruction to augur ancestral, and either keep the script to generate the JSON annotations, or (preferred) just keep a JSON representation of the annotations block in the repo and use this. (We'll want to have more than just the coordinates in the JSON - we'll want to add some extra display names / colours / descriptions; the latter being important to explain why we use ORF1a + ORF1b!)
    • This will allow us to remove the GenBank file (defaults/reference_seq.gb).
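
As a rough sketch of the "keep a JSON representation of the annotations block in the repo" option, the snippet below builds such a block from a small Python structure and serialises it. Everything specific here is an assumption: the function name `build_annotations`, the gene names, the toy coordinates, and the extra keys (`display_name`, `description`, `color`) are placeholders, and the exact keys Auspice accepts should be checked against its schema.

```python
import json


def build_annotations(genome_length, genes):
    """Build an annotations dict keyed by gene name (hypothetical layout)."""
    annotations = {"nuc": {"start": 1, "end": genome_length, "strand": "+"}}
    for gene in genes:
        entry = {"strand": "+"}
        segments = gene["segments"]
        if len(segments) == 1:
            # A contiguous gene only needs a single start/end pair.
            entry["start"], entry["end"] = segments[0]
        else:
            # A gene crossing the slip site is stored as several segments.
            entry["segments"] = [{"start": s, "end": e} for s, e in segments]
        # Optional display metadata; these key names are assumptions.
        for key in ("display_name", "description", "color"):
            if key in gene:
                entry[key] = gene[key]
        annotations[gene["name"]] = entry
    return annotations


# Toy input: names and coordinates are placeholders, not real annotations.
toy_genes = [
    {"name": "geneA", "segments": [(1, 30)], "display_name": "Gene A"},
    {"name": "geneB", "segments": [(20, 40), (40, 60)],
     "description": "Toy gene spanning a slip site"},
]
annotations = build_annotations(100, toy_genes)
annotations_json = json.dumps(annotations, indent=2)
print(annotations_json)
```

Keeping the result as a checked-in JSON file would mean the display names and descriptions (for example, the explanation of why we use ORF1a + ORF1b) live next to the coordinates.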

Other things noticed / improvements we could make:

  • The workflow-config-file.rst has fallen out of date. This is seemingly inevitable with documentation, but this is a good chance to improve it.
  • We don't use any nextclade datasets other than 'sars-cov-2'. I assumed we'd use the 'sars-cov-2-21L' dataset for our 21L builds, and we have config settings to allow this, but I don't think we actually do.
  • rule align uses Nextalign, with a fasta + gff from the ncov repo. Why don't we replace the fasta+gff with the nextclade dataset we fetch later on in the process?
    • My understanding of nextclade v3 is that we'll replace nextalign with nextclade in this step anyway.
  • rule build_mutation_summary and rule mutation_summary seem unused. If these can be removed, we could then remove defaults/reference_seq.fasta (alignment_reference) and defaults/annotation.gff (annotation). If the rules are still in use, we may want to use the nextclade dataset files anyway.
    • The 2nd rule here is the only place we use the translations from rule align, so we may be able to avoid translating every genome.
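
One quick way to sanity-check the "seem unused" claim is to scan the workflow for rules that nothing refers to via `rules.<name>`. The sketch below does that on a toy Snakefile string; the function names and the toy rules are made up for illustration. It is only a first pass: rules can also be connected purely by file paths, and terminal rules (such as a target rule) will always show up as unreferenced.

```python
import re


def find_rule_names(snakefile_text):
    """Return the names of all `rule <name>:` declarations."""
    return re.findall(r"^rule\s+(\w+)\s*:", snakefile_text, flags=re.MULTILINE)


def unreferenced_rules(snakefile_text):
    """Return rules that are never mentioned as `rules.<name>` in the text."""
    return [
        name
        for name in find_rule_names(snakefile_text)
        if not re.search(rf"\brules\.{re.escape(name)}\b", snakefile_text)
    ]


# Toy workflow text (hypothetical, not the real ncov Snakefile).
toy_snakefile = """\
rule align:
    output: "results/aligned.fasta"

rule mutation_summary:
    input: rules.align.output
    output: "results/mutation_summary.tsv"
"""

candidates = unreferenced_rules(toy_snakefile)
print(candidates)
```

Any rule this flags still needs a manual check of whether its output files are requested elsewhere before it is deleted.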