feat: mutations relative to arbitrary node #1454

Open. ivan-aksamentov wants to merge 6 commits into master.

ivan-aksamentov (Member) commented May 14, 2024

This extends the concept of private mutations (mutations relative to the parent node on the reference tree) to the more general concept of mutations relative to an arbitrary node of interest.

The reference nodes of interest are described by the user in .meta.extensions.nextclade.reference_nodes of the input Auspice JSON. Each description can also contain constraints: a node can be matched only to query samples belonging to a certain clade or lineage.

Private mutations functionality is unchanged. New functionality, inputs and outputs are added on top, and the implementation algorithm is largely reused.
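
To make the idea concrete, here is a minimal sketch of substitutions relative to a chosen node. It is not the code from this PR, which reuses find_private_*_mutations() and also deals with deletions, missing ranges and labels. The names Sub, RelativeSubs and subs_relative_to_node are hypothetical, and the sketch assumes that both the node and the query are described by position-to-nucleotide maps relative to the same reference sequence.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// A nucleotide substitution at `pos`: `from` (node state) is replaced by `to` (query state).
/// Hypothetical type, for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Sub {
    pub pos: usize,
    pub from: char,
    pub to: char,
}

/// Substitutions of a query relative to a chosen node (hypothetical, simplified).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct RelativeSubs {
    /// All substitutions relative to the node, reversions included.
    pub private: Vec<Sub>,
    /// Subset of `private` where the query is back at the reference state.
    pub reversions: Vec<Sub>,
}

/// `node_muts` and `query_muts` map position -> nucleotide, both relative to `ref_seq`.
/// A position absent from a map carries the reference nucleotide.
/// Assumption: every position in the maps is a valid index into `ref_seq`.
pub fn subs_relative_to_node(
    ref_seq: &[char],
    node_muts: &BTreeMap<usize, char>,
    query_muts: &BTreeMap<usize, char>,
) -> RelativeSubs {
    let mut out = RelativeSubs::default();

    // Only positions where the node or the query differ from the reference can matter.
    let positions: BTreeSet<usize> = node_muts.keys().chain(query_muts.keys()).copied().collect();

    for pos in positions {
        let ref_nuc = ref_seq[pos];
        let node_nuc = *node_muts.get(&pos).unwrap_or(&ref_nuc);
        let qry_nuc = *query_muts.get(&pos).unwrap_or(&ref_nuc);

        if node_nuc == qry_nuc {
            continue; // query agrees with the node here
        }

        let sub = Sub { pos, from: node_nuc, to: qry_nuc };
        if qry_nuc == ref_nuc {
            // The node is mutated at this position, but the query matches the reference.
            out.reversions.push(sub.clone());
        }
        out.private.push(sub);
    }

    out
}
```

Running this once per configured node, with that node's mutations as node_muts, gives one result per reference node, which is the shape of the new outputs described below.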

Test

Data PR for testing: nextstrain/nextclade_data#198 (branch with the same name). The dataset nextstrain/sars-cov-2/wuhan-hu-1/proteins there has the reference_nodes config added to its tree.json. It can be used like this:

https://nextclade-git-feat-mutations-relative-to-node-nextstrain.vercel.app/?dataset-server=gh&dataset-name=nextstrain/sars-cov-2/wuhan-hu-1/proteins

Work items

  • read input config from Auspice JSON
  • calculate relative nuc mutations
  • calculate relative aa mutations
  • filter by clade and clade-like attributes
  • output to Nextclade JSON
  • output to Nextclade NDJSON
  • output to TSV and CSV
  • pass required data between js and wasm
  • display in web app

For consideration:

  • ? use regex for matching clade-like values, rather than string equality
  • ? filter by gene

Inputs

Example configuration object. Put it into .meta of the Auspice JSON, so that it becomes .meta.extensions.nextclade.reference_nodes:

{
  "extensions": {
    "nextclade": {
      "reference_nodes": [
        {
          "name": "NODE_0000659",
          "displayName": "BA.2.86 (23I)",
          "description": "Ancestral BA.2.86 sequence"
        },
        {
          "name": "XBB.1.5",
          "displayName": "XBB.1.5 (23A)",
          "description": "Ancestral XBB.1.5 sequence. Vaccine strain 2023/2024",
          "include": {
            "clade": ["23A"]
          }
        },
        {
          "name": "NODE_0000862",
          "displayName": "BA.5 (22B)",
          "description": "Ancestral BA.5 sequence. Vaccine strain 2022/2023",
          "include": {
            "clade": ["22B"]
          }
        }
      ]
    }
  }
}
  • The name field should match the name field of one of the nodes on the tree.

  • The displayName and description are optional arbitrary strings used for display purposes.

  • The include field should be an object, which contains:

    • keys: names from the .meta.extensions.nextclade.clade_node_attrs (for clade-like attributes) or string "clade" (for the built-in clades).
    • values: a list of values of the clade-like attribute or a list of built-in clades. Only query sequences which match these attributes are considered for calculation of mutations relative to that node.

    If the include field is not present, then no constraints are applied (all query sequences are considered).
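
The field rules above can be sketched as a plain Rust type plus a check that every configured name exists on the tree. This is only an illustration: ReferenceNodeConfig and missing_reference_nodes are hypothetical names, not the types used in the PR, and the JSON parsing step is left out. The PR itself reports an error for the first node that cannot be found, while this sketch collects all missing names.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One entry of `reference_nodes` (hypothetical Rust mirror of the JSON config).
#[derive(Debug, Clone, Default)]
pub struct ReferenceNodeConfig {
    /// Must match the `name` of a node on the tree.
    pub name: String,
    /// Optional, display only.
    pub display_name: Option<String>,
    /// Optional, display only.
    pub description: Option<String>,
    /// Attribute key ("clade" or a clade-like attribute name) -> accepted values.
    /// An empty map stands for an absent `include` field: no constraints.
    pub include: BTreeMap<String, Vec<String>>,
}

/// Returns the names of configured reference nodes that are not present on the tree.
pub fn missing_reference_nodes<'a>(
    configs: &'a [ReferenceNodeConfig],
    tree_node_names: &BTreeSet<String>,
) -> Vec<&'a str> {
    configs
        .iter()
        .map(|config| config.name.as_str())
        .filter(|name| !tree_node_names.contains(*name))
        .collect()
}
```

An empty result means every configured node can be resolved before any mutations are calculated.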

Outputs

Output JSON and NDJSON

Example fragment of an output JSON entry (an entry in the .results[] array). Mutation lists are truncated for demonstration purposes.

{
  "relativeNucMutations": [
    {
      "refNode": {
        "name": "NODE_0000659",
        "displayName": "BA.2.86 (23I)",
        "description": "Ancestral BA.2.86 sequence"
      },
      "muts": {
        "privateSubstitutions": [
          {"pos": 404, "refNuc": "A", "qryNuc": "G"},
          {"pos": 896, "refNuc": "A", "qryNuc": "C"}
        ],
        "privateDeletions": [],
        "reversionSubstitutions": [
          {"pos": 896, "refNuc": "A", "qryNuc": "C"},
          {"pos": 3430, "refNuc": "T", "qryNuc": "G"}
        ],
        "labeledSubstitutions": [
          {
            "substitution": {"pos": 404, "refNuc": "A", "qryNuc": "G"},
            "labels": ["23A", "23D", "23F", "23B", "22F", "23E", "23H", "23G"]
          },
          {
            "substitution": {"pos": 2333, "refNuc": "C", "qryNuc": "T"},
            "labels": ["23F", "23H"]
          }
        ],
        "unlabeledSubstitutions": [
          {"pos": 4089, "refNuc": "C", "qryNuc": "T"},
          {"pos": 11344, "refNuc": "C", "qryNuc": "T"}
        ],
        "totalPrivateSubstitutions": 75,
        "totalPrivateDeletions": 0,
        "totalReversionSubstitutions": 37,
        "totalLabeledSubstitutions": 34,
        "totalUnlabeledSubstitutions": 4
      }
    }
  ],
  "relativeAaMutations": [
    {
      "refNode": {
        "name": "NODE_0000659",
        "displayName": "BA.2.86 (23I)",
        "description": "Ancestral BA.2.86 sequence"
      },
      "muts": {
        "E": {
          "privateSubstitutions": [
            {"cdsName": "E", "pos": 10, "refAa": "T", "qryAa": "A"}
          ],
          "privateDeletions": [],
          "reversionSubstitutions": [],
          "totalPrivateSubstitutions": 1,
          "totalPrivateDeletions": 0,
          "totalReversionSubstitutions": 0
        },
        "S": {
          "privateSubstitutions": [
            {"cdsName": "S", "pos": 20, "refAa": "T", "qryAa": "R"},
            {"cdsName": "S", "pos": 26, "refAa": "-", "qryAa": "S"}
          ],
          "privateDeletions": [
            {"cdsName": "S", "pos": 23, "refAa": "S"},
            {"cdsName": "S", "pos": 143, "refAa": "Y"}
          ],
          "reversionSubstitutions": [
            {"cdsName": "S", "pos": 20, "refAa": "T", "qryAa": "R"},
            {"cdsName": "S", "pos": 49, "refAa": "L", "qryAa": "S"}
          ],
          "totalPrivateSubstitutions": 39,
          "totalPrivateDeletions": 2,
          "totalReversionSubstitutions": 25
        }
      }
    }
  ]
}

Output TSV and CSV

TODO

Visualization in Nextclade Web

TODO

This extends the concept of private mutations (relative to the parent node on the reference tree) to mutations relative to an arbitrary node of interest.

The reference nodes of interest are described by the user in `.meta.extensions.nextclade.reference_nodes` of the input Auspice JSON. Each description can also contain constraints: a node can be matched only to query samples belonging to a certain clade or lineage.

Private mutations functionality is unchanged; this is only an addition. The implementation algorithm is largely reused.

In this commit only nuc mutations are added.
Similarly to b537132, add relative amino acid mutations.
This just passes the data that is now required to output relative mutations through from js to wasm.

vercel bot commented May 14, 2024

Name Status Preview Updated (UTC)
nextclade ❌ Failed (Inspect) May 30, 2024 9:49pm

Comment on lines +40 to +63
ref_nodes
  .iter()
  .map(|&ref_node| -> Result<_, Report> {
    let node = graph
      .iter_nodes()
      .find(|node| node.payload().name == ref_node.name)
      .ok_or_else(|| eyre!("Unable to find reference node on the tree: '{}'", &ref_node.name))?;

    let muts = find_private_nuc_mutations(
      node.payload(),
      substitutions,
      deletions,
      missing,
      alignment_range,
      ref_seq,
      non_acgtns,
      virus_properties,
    );

    Ok(RelativeNucMutations {
      ref_node: ref_node.to_owned(),
      muts,
    })
  })

ivan-aksamentov (Member, Author) commented:

There is very little new logic, mostly bookkeeping. The find_private_*_mutations() functions for nucleotides and amino acids are reused as is. The only difference from private mutations is that the code now runs multiple times, once for each requested node.

This code fragment is for nucleotides. The sibling function for amino acids is just below it.

Comment on lines +126 to +135
.reference_nodes
  .iter()
  .filter(|node| {
    // For each attribute key in includes, check that the attribute value of this sample matches
    // at least one item in the include list
    node.include.iter().all(|(key, includes)| {
      let curr_value = if key == "clade" { clade } else { &clade_node_attrs[key] };
      includes.iter().any(|include_value| include_value == curr_value) // TODO: consider regex match rather than equality
    })
  })

ivan-aksamentov (Member, Author) commented:

This is the logic for constraining the mutation calculation by clades and clade-like attributes. If the include field is present, we look up the constrained attribute on the query sample and only consider this node if the query attribute's value matches any of the values in the include list.

For example, if the config has a node of interest which is only relevant for clades 23A and 23B:

{
  "...": "...",
  "include": { "clade": ["23A", "23B"] }
}

then mutations relative to this node will be calculated only for query samples of clade 23A or 23B.

Same for pango lineages:

{
  "...": "...",
  "include": { "Nextclade_pango": ["A.1.2.3", "A.1.2.3.4"] }
}

It is up for discussion how multiple filters (multiple keys in the include object) should be combined: using boolean OR or boolean AND. The code above currently uses AND: all() over the keys, with any() over the values of each key.
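
To help the discussion, here is a small standalone sketch of both options. It is not the PR's code: Combine and sample_matches are hypothetical names, the built-in clade is treated as just another key in the sample's attribute map, and a missing attribute is treated as a non-match (an assumption).

```rust
use std::collections::BTreeMap;

/// How filters from several `include` keys are combined (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Combine {
    /// Boolean AND: every key must match.
    All,
    /// Boolean OR: at least one key must match.
    Any,
}

/// Decides whether a query sample satisfies the `include` constraints of a reference node.
/// `sample_attrs` maps attribute name (including "clade") to the sample's value.
pub fn sample_matches(
    include: &BTreeMap<String, Vec<String>>,
    sample_attrs: &BTreeMap<String, String>,
    combine: Combine,
) -> bool {
    // No constraints: every sample is accepted. The early return matters for `Any`,
    // because `any()` over an empty iterator is false.
    if include.is_empty() {
        return true;
    }

    // A key matches when the sample has this attribute and its value is in the accepted list.
    // Assumption: a missing attribute counts as "no match".
    let key_matches = |(key, accepted): (&String, &Vec<String>)| {
        sample_attrs.get(key).map_or(false, |value| accepted.contains(value))
    };

    match combine {
        Combine::All => include.iter().all(key_matches),
        Combine::Any => include.iter().any(key_matches),
    }
}
```

With include set to clade 23A or 23B plus Nextclade_pango A.1.2.3, a sample of clade 23A and lineage B.1 is rejected under Combine::All and accepted under Combine::Any. With a single key the two modes behave identically, so the choice only affects configs that constrain more than one attribute.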
