Prepare for Changes in Clade Naming in Nextclade/Nextstrain #371

Open
1 of 2 tasks
emmahodcroft opened this issue Feb 1, 2023 · 2 comments
Labels
discussion Talk about anything scope: data Related to data scope: scripts

Comments

@emmahodcroft
Collaborator

emmahodcroft commented Feb 1, 2023

Nextclade now breaks down Nextstrain clades into year-letter and WHO, and only gives the "old" 'full' name in a new column, clade_legacy.

Example:
Old: clade_nextstrain == 22F (Omicron)

New:
clade_nextstrain == 22F
clade_who == Omicron
clade_legacy == 22F (Omicron)
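
To make the relationship between the three columns concrete, here is a minimal sketch that splits a "legacy" name into its year-letter and WHO parts. The function name `split_legacy_clade` and the regular expression are illustrative assumptions, not part of Nextclade or CoVariants; the pattern assumes names of the form two digits plus one capital letter, optionally followed by a label in parentheses.

```python
import re

# Assumed shape of a legacy name: "22F (Omicron)" or just "22F".
LEGACY_CLADE = re.compile(r"^(\d{2}[A-Z])(?:\s*\((.+)\))?$")


def split_legacy_clade(legacy_name):
    """Return (year_letter, who_label); who_label is None if absent."""
    match = LEGACY_CLADE.match(legacy_name.strip())
    if match is None:
        # Pango lineages or SNP-based names end up here.
        raise ValueError(f"not a year-letter clade name: {legacy_name!r}")
    year_letter, who_label = match.groups()
    return year_letter, who_label


print(split_legacy_clade("22F (Omicron)"))  # ('22F', 'Omicron')
```

Names that do not follow the year-letter pattern raise a `ValueError` rather than being guessed at, since those clusters are matched by other means.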

This doesn't directly impact CoVariants as we don't use the Nextclade file directly, but the metadata.tsv that comes after the ncov-ingest workflow. Currently this hasn't changed, but it may change either by just replacing Nextstrain_clade (which we use) with the shortened name, or by doing this and also adding a "legacy" column.

For clarity, we currently compare values in Nextstrain_clade with display_name from clusters.py (containing things like 22F (Omicron))

If a legacy column is added, switching is as simple as just using this new column, with the rest of the code remaining the same. If there isn't one, or we want to be more future-proof, we should ensure we can just use a different entry in clusters.py which has the year-letter name.
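
The two options in the paragraph above could be sketched as a small header check. This is only an illustration: the helper names are hypothetical, the legacy column in metadata.tsv is assumed to be called `clade_legacy` (the name Nextclade uses, which may not carry over), and the sketch assumes that when no legacy column exists, `Nextstrain_clade` already holds the shortened name.

```python
import csv
import io


def read_header(tsv_text):
    """Return the column names from the first line of a TSV string."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return next(reader)


def pick_match_fields(header, legacy_column="clade_legacy"):
    """Return (metadata column, clusters.py field) to compare.

    legacy_column is an assumed name; adjust it to whatever ncov-ingest emits.
    """
    if legacy_column in header:
        # Legacy column present: keep comparing against display_name.
        return legacy_column, "display_name"
    # No legacy column: compare the shortened clade against nextstrain_name.
    return "Nextstrain_clade", "nextstrain_name"


# Toy metadata, not real ncov-ingest output.
sample_with_legacy = "strain\tNextstrain_clade\tclade_legacy\nX\t22F\t22F (Omicron)\n"
sample_without_legacy = "strain\tNextstrain_clade\nX\t22F\n"

print(pick_match_fields(read_header(sample_with_legacy)))
print(pick_match_fields(read_header(sample_without_legacy)))
```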

We currently have an entry nextstrain_name, but this has been used inconsistently - sometimes with the 'full' name (21L (Omicron)) and sometimes just year-letter (22A).
To help us switch to that option more easily in future, I propose switching now so that all nextstrain_name entries are year-letter.
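
As a rough sketch of that clean-up, assuming clusters.py holds a dict of per-cluster dicts (the real structure may differ, and the entries below are toy examples), the full names could be trimmed to year-letter like this:

```python
import copy
import re

# Two digits plus one capital letter at the start of the name, e.g. "21L".
YEAR_LETTER_PREFIX = re.compile(r"^\d{2}[A-Z]\b")


def normalize_nextstrain_names(clusters):
    """Return a copy of clusters with nextstrain_name reduced to year-letter."""
    normalized = copy.deepcopy(clusters)
    for entry in normalized.values():
        name = entry.get("nextstrain_name")
        if name is None:
            continue  # clusters without an official Nextstrain clade
        match = YEAR_LETTER_PREFIX.match(name.strip())
        if match:
            entry["nextstrain_name"] = match.group(0)
    return normalized


# Hypothetical entries mirroring the inconsistency described above.
toy_clusters = {
    "example_full": {"display_name": "21L (Omicron)", "nextstrain_name": "21L (Omicron)"},
    "example_short": {"display_name": "22A (Omicron)", "nextstrain_name": "22A"},
    "example_unofficial": {"display_name": "Hypothetical variant"},
}

for key, entry in normalize_nextstrain_names(toy_clusters).items():
    print(key, entry.get("nextstrain_name"))
```

Entries that are already year-letter, or that have no `nextstrain_name` at all, pass through unchanged.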

This should mean that, in future, we would only need to switch from using display_name in cluster_analysis.py to using nextstrain_name. This shouldn't be too bad, but it will need checking, as it's a little more complex than I thought.

If this is the path we go, here's a small checklist:

  • Change all nextstrain_name to use year-letter
  • Adjust cluster_analysis.py to use nextstrain_name instead of display_name - and check it works.
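
One way to do the "check it works" step is to confirm that matching on nextstrain_name against shortened clade values assigns the same clusters as matching on display_name against the current full values. The sketch below is not the actual cluster_analysis.py logic; the helper names, cluster keys, and rows are all hypothetical.

```python
def build_lookup(clusters, name_field):
    """Map the chosen name field to its cluster key, skipping missing names."""
    return {
        entry[name_field]: key
        for key, entry in clusters.items()
        if entry.get(name_field)
    }


def assign(rows, column, lookup):
    """Return the cluster key for each metadata row, or None if unmatched."""
    return [lookup.get(row.get(column)) for row in rows]


# Toy stand-in for clusters.py after nextstrain_name has been normalized.
toy_clusters = {
    "example_22f": {"display_name": "22F (Omicron)", "nextstrain_name": "22F"},
    "example_21l": {"display_name": "21L (Omicron)", "nextstrain_name": "21L"},
}

# The same three sequences under the current and the shortened naming.
old_rows = [
    {"Nextstrain_clade": "22F (Omicron)"},
    {"Nextstrain_clade": "21L (Omicron)"},
    {"Nextstrain_clade": "19A"},
]
new_rows = [
    {"Nextstrain_clade": "22F"},
    {"Nextstrain_clade": "21L"},
    {"Nextstrain_clade": "19A"},
]

old_assignment = assign(old_rows, "Nextstrain_clade", build_lookup(toy_clusters, "display_name"))
new_assignment = assign(new_rows, "Nextstrain_clade", build_lookup(toy_clusters, "nextstrain_name"))

# Both strategies should agree, including on the untracked clade.
assert old_assignment == new_assignment
print(new_assignment)
```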

Clearly, all of the above is only relevant to clades we track that are official Nextstrain clades. For those that aren't official (mostly older ones), we use Pango or SNPs, so this is unchanged.

@ivan-aksamentov
Member

ivan-aksamentov commented Feb 3, 2023

@emmahodcroft Let's see if this also affects web. Theoretically, web *should* use only build_name in significant places, but there might be some funny effects in case I deviated from that. So please also watch out for strange things in web as you migrate.

I hope you don't need to change build names. If you do, then it will be a journey, because that's how the md files, URLs and other stuff is linked together.

@emmahodcroft
Collaborator Author

emmahodcroft commented Feb 3, 2023

I don't plan to change the build names, as they're used all over.

RE the nextstrain_name -- I'll keep an eye out - I had the same thought. The main reason I am fairly confident is that it turns out a while ago I accidentally got inconsistent about the naming (started using just year-letter) and as far as I can tell I've never noticed any impact of this. This is the main thing that made me confident that we must not be using it anywhere, or I'd have noticed whenever I first started messing it up (probably about a year ago now) or sometime in between.

But agree - can't be too careful!

I do not expect to change the build_name - totally agree.
The only other thing that might be worth exploring changing is display_name as perhaps we'd like to move to something a bit more flexible (perhaps including the pango in some cases, as Nextstrain is somewhat moving to do?). But I'd want to do a separate scope to check how much this is used.
