Machine-Readable Reference File #163

emmahodcroft · 2021-05-19T16:00:44Z

This is a draft PR outlining a JSON file format that would contain all information about the Variants & Mutations tracked on CoVariants, including their defining mutations. The aim is a 'lookup' that other apps and programs could link to automatically.

This is a Draft PR. I would love feedback.
I recognise that the file contains comments, which JSON does not allow - they are there to clarify the file structure. It also currently includes just 1 example of a Variant & Mutation, to settle on a good format. I will then write a script that generates this file from the existing files.

I'm not very familiar with the JSON format and have found it restrictive - some parts I didn't even convert, as I'm not sure if they're useful & I wasn't sure how to convert them concisely.

@nodrogluap and @chaoran-chen I'd really appreciate your thoughts on this for what you have in mind to do!

Information:

I imagine alignment_defining mutations will be most useful, as these can be used to try to identify sequences from the alignment only. However, this will miss sequences that have reversions, miscalls or missing coverage at these positions.
phylogenetic_defining mutations are what is used to put the 'labels' on Nextstrain trees - they mark the branch where all these mutations are present, and the label applies to all sequences below it (whether or not any particular sequence has these mutations - so it takes care of reversions and non-coverage).
build_name is what's used in file names & URLs as it's 'safe'. Sometimes it corresponds better to the display_name, sometimes not. If the discrepancy is big enough that it's problematic, then I could try to reconcile this within CoVariants.
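
To make the alignment_defining trade-off concrete, here is a minimal Python sketch of an alignment-only lookup. The toy sequences, the mutation strings and the helper names are all made up for illustration; only the field name alignment_defining comes from the draft file. It assumes mutations are written as short strings such as A3T, with 1-based positions.

```python
import re

# Hypothetical entry: real entries would carry real genome positions.
VARIANT = {
    "build_name": "example.variant",      # placeholder name
    "alignment_defining": ["A3T", "C7G"],  # placeholder mutations
}

MUTATION_RE = re.compile(r"^([ACGT])(\d+)([ACGT])$")


def parse_mutation(mutation):
    """Split a short string such as 'A3T' into (reference, position, mutant)."""
    match = MUTATION_RE.match(mutation)
    if match is None:
        raise ValueError(f"unrecognised mutation string: {mutation!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


def matches_alignment_defining(aligned_seq, defining):
    """Return True only if every defining position carries the mutant base."""
    for mutation in defining:
        _, pos, alt = parse_mutation(mutation)
        if aligned_seq[pos - 1].upper() != alt:
            return False
    return True


# Toy aligned sequences, 10 bases long.
carries_both = "GCTGTTGAAG"  # T at position 3, G at position 7
no_coverage = "GCTGTTNAAG"   # position 7 was not sequenced (N)
reverted = "GCTGTTCAAG"      # position 7 is back to the reference base

for name, seq in [("carries_both", carries_both),
                  ("no_coverage", no_coverage),
                  ("reverted", reverted)]:
    print(name, matches_alignment_defining(seq, VARIANT["alignment_defining"]))
```

Only the first sequence is reported as a match. The other two would be missed by an alignment-only check even if they sit inside the variant's clade on a tree, which is the gap that phylogenetic_defining labels cover.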

Questions:

Is phylogenetic_defining useful? or just leave it?
Is color useful?
Is pango_name useful? This may not match 1:1 with running Pango. It's just taken from the name table on CoVariants
I would like to include the amino-acid mutations that correspond to defining mutations where possible - see comments on lines 21-23. @ivan-aksamentov Would you have a good suggestion for how to incorporate this?
I have included the list of 'complete' mutations (see lines 41-76) - I haven't converted it to JSON format. This is what makes the 'side-sausage' on CoVariants pages. Is this useful? Or is it not so useful, and should I just get rid of it?

To do:

Finalize the format of the file & the most useful fields
Write script to generate file automatically

vercel · 2021-05-19T16:00:46Z

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/hodcroftlab/covariants/HtcC7A1zYMdjvnyCUeDqLxAKCCKU
✅ Preview: https://covariants-git-covariantsfile-hodcroftlab.vercel.app

chaoran-chen · 2021-05-24T22:47:06Z

Hi Emma! Thanks for creating this file, I believe that it will be very useful!

Is phylogenetic_defining useful? or just leave it?

I don't have a concrete use case for it right now, but it sounds useful.

Is color useful?

Not for me.

Is pango_name useful? This may not match 1:1 with running Pango. It's just taken from the name table on CoVariants

Yes, very much.

I have included the list of 'complete' mutations (see lines 41-76) - I haven't converted it to JSON format. This is what makes the 'side-sausage' on CoVariants pages. Is this useful? Or is it not so useful, and should I just get rid of it?

I find a complete list of mutations useful.

Two suggestions:

The notation for the mutations should be consistent. For example, if it is not possible to add the reference base in all cases, then it would be better to remove it everywhere. I think that the short string format is easier and that there is no need for {left: ..., pos: ..., right: ...}.
The nucleotide-level and amino acid-level mutations should be distinguished clearly. For all mutations, it could be like this:

"all_mutations": {
	"nonsynonymous": [
		{ "amino_acid": "S:N123Y", "nucleotide": ["A1234T"] },
		...
	],
	"synonymous_nucleotide": ["A2345C", "C3456G"]
}
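
As a rough illustration of both suggestions, the Python sketch below validates short-string mutations and arranges them in the layout above. The function names and regular expressions are assumptions made for this example, not part of CoVariants, and the patterns cover only simple substitutions. parse_nucleotide shows that a short string such as A1234T carries the same information as the {left, pos, right} form.

```python
import json
import re

# Short-string notations as used in the suggestion (e.g. "A1234T", "S:N123Y").
NUCLEOTIDE_RE = re.compile(r"^([ACGT])(\d+)([ACGT])$")
AMINO_ACID_RE = re.compile(r"^([A-Za-z0-9]+):([A-Z*])(\d+)([A-Z*-])$")


def parse_nucleotide(mutation):
    """Expand a short string into the {left, pos, right} form it replaces."""
    match = NUCLEOTIDE_RE.match(mutation)
    if match is None:
        raise ValueError(f"not a nucleotide mutation: {mutation!r}")
    left, pos, right = match.groups()
    return {"left": left, "pos": int(pos), "right": right}


def build_all_mutations(nonsynonymous, synonymous):
    """Check every string, then arrange the mutations in the proposed layout.

    nonsynonymous: list of (amino_acid, [nucleotide, ...]) pairs
    synonymous:    list of nucleotide mutation strings
    """
    for amino_acid, nucleotides in nonsynonymous:
        if AMINO_ACID_RE.match(amino_acid) is None:
            raise ValueError(f"not an amino-acid mutation: {amino_acid!r}")
        for nucleotide in nucleotides:
            parse_nucleotide(nucleotide)  # raises on inconsistent notation
    for nucleotide in synonymous:
        parse_nucleotide(nucleotide)
    return {
        "all_mutations": {
            "nonsynonymous": [
                {"amino_acid": aa, "nucleotide": list(nucs)}
                for aa, nucs in nonsynonymous
            ],
            "synonymous_nucleotide": list(synonymous),
        }
    }


# Placeholder mutations taken from the example above, not real variant data.
example = build_all_mutations(
    nonsynonymous=[("S:N123Y", ["A1234T"])],
    synonymous=["A2345C", "C3456G"],
)
print(json.dumps(example, indent=2))
```

Rejecting any string that does not match the pattern is one way to keep the notation consistent across the file; a mutation without a reference base, for instance, would raise here instead of being written out silently.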

chaoran-chen · 2021-08-23T06:42:22Z

@emmahodcroft I would like to change my previous answer: having the colors field could be useful for me/cov-spectrum. It might be a good idea to use the same colors as covariants whenever possible because some reports use screenshots from both sites.

first go at machine-readable file

83ac88b

emmahodcroft marked this pull request as draft May 19, 2021 16:01
