feat: mutations relative to arbitrary node #1454

ivan-aksamentov · 2024-05-14T04:44:24Z

This extends the concept of private mutations (mutations relative to the parent node on the reference tree) to the more general concept of mutations relative to an arbitrary node of interest.

The reference nodes of interest are described by the user in .meta.extensions.nextclade.reference_nodes of the input Auspice JSON. The description can also contain constraints: a node can be matched only to query samples belonging to a certain clade or lineage.

Private mutations functionality is unchanged. New functionality, inputs and outputs are added on top, though the implementation algorithm is largely reused.

Test

PR in data for testing: nextstrain/nextclade_data#198 (branch with the same name). Dataset nextstrain/sars-cov-2/wuhan-hu-1/proteins there has reference_nodes config added to tree.json. Can be used like this:

https://nextclade-git-feat-mutations-relative-to-node-nextstrain.vercel.app/?dataset-server=gh&dataset-name=nextstrain/sars-cov-2/wuhan-hu-1/proteins

Work items

read input config from Auspice JSON
calculate relative nuc mutations
calculate relative aa mutations
filter by clade and clade-like attributes
output to Nextclade JSON
output to Nextclade NDJSON
output to TSV and CSV
pass required data between js and wasm
display in web app

For consideration:

? use regex for matching clade-like values, rather than string equality
? filter by gene

Inputs

Example configuration object. Put it into .meta of Auspice JSON (such that it becomes .meta.extensions.nextclade.reference_nodes)

{
  "extensions": {
    "nextclade": {
      "reference_nodes": [
        {
          "name": "NODE_0000659",
          "displayName": "BA.2.86 (23I)",
          "description": "Ancestral BA.2.86 sequence"
        },
        {
          "name": "XBB.1.5",
          "displayName": "XBB.1.5 (23A)",
          "description": "Ancestral XBB.1.5 sequence. Vaccine strain 2023/2024",
          "include": {
            "clade": ["23A"]
          }
        },
        {
          "name": "NODE_0000862",
          "displayName": "BA.5 (22B)",
          "description": "Ancestral BA.5 sequence. Vaccine strain 2022/2023",
          "include": {
            "clade": ["22B"]
          }
        }
      ]
    }
  }
}

The name field should match the name field of one of the nodes on the tree.
The displayName and description fields are optional arbitrary strings used for display purposes.
The include field, if present, should be an object which contains:
- keys: names from .meta.extensions.nextclade.clade_node_attrs (for clade-like attributes) or the string "clade" (for the built-in clades).
- values: a list of values of the clade-like attribute or a list of built-in clades. Only query sequences which match these attributes are considered for the calculation of mutations relative to that node.
If the include field is not present, then no constraints are applied (all query sequences are considered).
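
To make the shape of one reference_nodes entry concrete, here is a minimal Rust sketch of the config described above, plus a check for the "name must match a tree node" rule. The names RefNodeConfig and missing_ref_nodes are hypothetical and used for illustration only. They are not taken from the Nextclade codebase, and the real types may differ.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical model of one entry of `.meta.extensions.nextclade.reference_nodes`.
#[derive(Debug, Clone, PartialEq)]
pub struct RefNodeConfig {
  /// Must equal the `name` of one of the nodes on the tree.
  pub name: String,
  /// Optional, used for display only.
  pub display_name: Option<String>,
  /// Optional, used for display only.
  pub description: Option<String>,
  /// Attribute key ("clade" or a clade-like attribute name) -> accepted values.
  /// An empty map stands for a missing `include` field: no constraints.
  pub include: BTreeMap<String, Vec<String>>,
}

/// Returns the names of configured reference nodes that are absent from the tree.
pub fn missing_ref_nodes<'a>(
  ref_nodes: &'a [RefNodeConfig],
  tree_node_names: &BTreeSet<String>,
) -> Vec<&'a str> {
  ref_nodes
    .iter()
    .map(|node| node.name.as_str())
    .filter(|name| !tree_node_names.contains(*name))
    .collect()
}
```

A check like this could run once when the dataset is loaded, so that a misspelled node name is reported before any query sequence is analyzed.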

Outputs

Output JSON and NDJSON

Example fragment of an output JSON entry (an entry in the .results[] array). Mutation lists are truncated for demonstration purposes.

{
  "relativeNucMutations": [
    {
      "refNode": {
        "name": "NODE_0000659",
        "displayName": "BA.2.86 (23I)",
        "description": "Ancestral BA.2.86 sequence"
      },
      "muts": {
        "privateSubstitutions": [
          {"pos": 404, "refNuc": "A", "qryNuc": "G"},
          {"pos": 896, "refNuc": "A", "qryNuc": "C"}
        ],
        "privateDeletions": [],
        "reversionSubstitutions": [
          {"pos": 896, "refNuc": "A", "qryNuc": "C"},
          {"pos": 3430, "refNuc": "T", "qryNuc": "G"}
        ],
        "labeledSubstitutions": [
          {
            "substitution": {"pos": 404, "refNuc": "A", "qryNuc": "G"},
            "labels": ["23A", "23D", "23F", "23B", "22F", "23E", "23H", "23G"]
          },
          {
            "substitution": {"pos": 2333, "refNuc": "C", "qryNuc": "T"},
            "labels": ["23F", "23H"]
          }
        ],
        "unlabeledSubstitutions": [
          {"pos": 4089, "refNuc": "C", "qryNuc": "T"},
          {"pos": 11344, "refNuc": "C", "qryNuc": "T"}
        ],
        "totalPrivateSubstitutions": 75,
        "totalPrivateDeletions": 0,
        "totalReversionSubstitutions": 37,
        "totalLabeledSubstitutions": 34,
        "totalUnlabeledSubstitutions": 4
      }
    }
  ],
  "relativeAaMutations": [
    {
      "refNode": {
        "name": "NODE_0000659",
        "displayName": "BA.2.86 (23I)",
        "description": "Ancestral BA.2.86 sequence"
      },
      "muts": {
        "E": {
          "privateSubstitutions": [
            {"cdsName": "E", "pos": 10, "refAa": "T", "qryAa": "A"}
          ],
          "privateDeletions": [],
          "reversionSubstitutions": [],
          "totalPrivateSubstitutions": 1,
          "totalPrivateDeletions": 0,
          "totalReversionSubstitutions": 0
        },
        "S": {
          "privateSubstitutions": [
            {"cdsName": "S", "pos": 20, "refAa": "T", "qryAa": "R"},
            {"cdsName": "S", "pos": 26, "refAa": "-", "qryAa": "S"}
          ],
          "privateDeletions": [
            {"cdsName": "S", "pos": 23, "refAa": "S"},
            {"cdsName": "S", "pos": 143, "refAa": "Y"}
          ],
          "reversionSubstitutions": [
            {"cdsName": "S", "pos": 20, "refAa": "T", "qryAa": "R"},
            {"cdsName": "S", "pos": 49, "refAa": "L", "qryAa": "S"}
          ],
          "totalPrivateSubstitutions": 39,
          "totalPrivateDeletions": 2,
          "totalReversionSubstitutions": 25
        }
      }
    }
  ]
}

Output TSV and CSV

TODO

Visualization in Nextclade Web

TODO

This extends the concept of private mutations (relative to the parent node on the reference tree) to mutations relative to an arbitrary node of interest. The reference nodes of interest are described by the user in `.meta.extensions.nextclade.reference_nodes` of the input Auspice JSON. The description can also contain constraints: a node can be matched only to query samples belonging to a certain clade or lineage. Private mutations functionality is unchanged; this is only an addition, though the implementation algorithm is largely reused. In this commit only nuc mutations are added.

vercel · 2024-05-14T04:44:28Z

Name	Status	Preview	Updated (UTC)
nextclade	❌ Failed (Inspect)		May 30, 2024 9:49pm

ivan-aksamentov · 2024-05-14T04:49:38Z

packages/nextclade/src/analyze/find_relative_mutations.rs

+  ref_nodes
+    .iter()
+    .map(|&ref_node| -> Result<_, Report> {
+      let node = graph
+        .iter_nodes()
+        .find(|node| node.payload().name == ref_node.name)
+        .ok_or_else(|| eyre!("Unable to find reference node on the tree: '{}'", &ref_node.name))?;
+
+      let muts = find_private_nuc_mutations(
+        node.payload(),
+        substitutions,
+        deletions,
+        missing,
+        alignment_range,
+        ref_seq,
+        non_acgtns,
+        virus_properties,
+      );
+
+      Ok(RelativeNucMutations {
+        ref_node: ref_node.to_owned(),
+        muts,
+      })
+    })


There is very little new logic, mostly bookkeeping. The find_private_*_mutations() functions for nucs and aa are reused as is. The only difference compared to private mutations is that the code now runs multiple times, once for each requested node.

This code fragment is for nucs. The sibling function for aa is just below that.
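
The bookkeeping pattern can be sketched in isolation as follows. This is a simplified illustration, not the code from this PR: TreeNode and find_relative_muts are hypothetical stand-ins, the error type is a plain String instead of Report, and the per-node computation is passed in as a closure in place of the find_private_*_mutations() functions.

```rust
/// Hypothetical, simplified stand-in for a node of the reference tree.
#[derive(Debug)]
pub struct TreeNode {
  pub name: String,
}

/// Runs the same mutation-finding function once per requested reference node.
/// Fails with an error message if a requested node is not on the tree.
pub fn find_relative_muts<M>(
  tree: &[TreeNode],
  ref_node_names: &[&str],
  find_muts: impl Fn(&TreeNode) -> M,
) -> Result<Vec<(String, M)>, String> {
  ref_node_names
    .iter()
    .map(|&name| -> Result<(String, M), String> {
      // Look up the requested node by name, as the PR does with `graph.iter_nodes()`.
      let node = tree
        .iter()
        .find(|node| node.name == name)
        .ok_or_else(|| format!("Unable to find reference node on the tree: '{}'", name))?;
      // Reuse the existing computation unchanged; only the node differs.
      Ok((name.to_owned(), find_muts(node)))
    })
    // Collecting into `Result<Vec<_>, _>` stops at the first missing node.
    .collect()
}
```

Collecting an iterator of Result values into a single Result means that one misconfigured node name aborts the whole calculation with a descriptive error, rather than silently dropping that node from the output.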

ivan-aksamentov · 2024-05-14T04:58:52Z

packages/nextclade/src/analyze/find_relative_mutations.rs

+    .reference_nodes
+    .iter()
+    .filter(|node| {
+      // For each attribute key in includes, check that the attribute value of this sample match
+      // at least one item in the include list
+      node.include.iter().all(|(key, includes)| {
+        let curr_value = if key == "clade" { clade } else { &clade_node_attrs[key] };
+        includes.iter().any(|include_value| include_value == curr_value) // TODO: consider regex match rather than equality
+      })
+    })


This is the logic for constraining the mutations calculation by clades and clade-like attributes. If the include field is present, we look up the constrained attribute on the query sample and only consider this node if the query attribute's value matches any of the values in the include list.

For example, if the config has a node of interest which is only relevant for clades 23A and 23B:

{ "...": "...", "include": { "clade": ["23A", "23B"] } }

then mutations relative to this node will be calculated only for query samples of clades 23A and 23B.

Same for pango lineages:

{ "...": "...", "include": { "Nextclade_pango": ["A.1.2.3", "A.1.2.3.4"] } }

It is up for discussion how multiple filters (multiple keys in the include object) should be combined: using boolean OR or boolean AND. The code fragment above combines them with AND (all() over the keys).
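
To compare the two options side by side, here is a self-contained sketch of the filter with a switchable combination mode. The names Include, Combine, query_value and is_node_relevant are hypothetical. The sketch also assumes that a clade-like attribute missing on the query sample counts as "no match", whereas the indexing in the fragment above would panic in that case.

```rust
use std::collections::BTreeMap;

/// Attribute key -> accepted values, mirroring the `include` object.
pub type Include = BTreeMap<String, Vec<String>>;

/// How constraints on several keys are combined (hypothetical option).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Combine {
  And,
  Or,
}

/// Value of the query sample for `key`: the built-in clade for "clade",
/// otherwise a clade-like attribute. `None` if the attribute is absent.
fn query_value<'a>(key: &str, clade: &'a str, attrs: &'a BTreeMap<String, String>) -> Option<&'a str> {
  if key == "clade" {
    Some(clade)
  } else {
    attrs.get(key).map(String::as_str)
  }
}

/// Decides whether mutations relative to a node should be calculated for a query sample.
pub fn is_node_relevant(
  include: &Include,
  clade: &str,
  attrs: &BTreeMap<String, String>,
  combine: Combine,
) -> bool {
  // No constraints: every query sample is considered.
  // Needed explicitly, because `any()` over an empty map is false.
  if include.is_empty() {
    return true;
  }
  // A key matches if the query value equals at least one accepted value.
  let key_matches = |(key, accepted): (&String, &Vec<String>)| {
    query_value(key, clade, attrs).map_or(false, |value| accepted.iter().any(|a| a.as_str() == value))
  };
  match combine {
    Combine::And => include.iter().all(key_matches),
    Combine::Or => include.iter().any(key_matches),
  }
}
```

With AND, adding a key to include can only narrow the set of matching samples. With OR, it can only widen it. A sample of clade 23A with a non-matching pango lineage therefore fails an AND filter on both keys but passes the OR variant.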

ivan-aksamentov added 3 commits May 14, 2024 05:06

feat: calculate aa muts relative to arbitrary node

466a5ad

Similarly to b537132, add relative amino acid mutations

fix(web): ensure correct json and ndjson export

6e7afc4

This just passes through from js to wasm the data that is now required to output relative mutations

ivan-aksamentov added a commit to nextstrain/nextclade_data that referenced this pull request May 14, 2024

add ref nodes to nextstrain/sars-cov-2/wuhan-hu-1/proteins dataset

6b84be3

For testing of nextstrain/nextclade#1454

ivan-aksamentov mentioned this pull request May 14, 2024

add ref nodes to nextstrain/sars-cov-2/wuhan-hu-1/proteins dataset nextstrain/nextclade_data#198

Draft

ivan-aksamentov added a commit to nextstrain/nextclade_data that referenced this pull request May 14, 2024

add ref nodes to nextstrain/sars-cov-2/wuhan-hu-1/proteins dataset

e547104

For testing of nextstrain/nextclade#1454

ivan-aksamentov added 2 commits May 14, 2024 12:54

feat(web): display relative substitutions in nuc sequence view

4045383

feat(web): add ref node selector

822af4e

vercel bot deployed to Preview May 14, 2024 11:02 View deployment

Merge remote-tracking branch 'origin/master' into feat/mutations-relative-to-node

5ddbfb2

vercel bot had a problem deploying to Preview May 30, 2024 21:49 Failure
