embed: Support multiple input files for alignments and distance matrices #10

Open
huddlej opened this issue Feb 8, 2024 · 1 comment · May be fixed by #19
Labels
enhancement New feature or request

Comments

@huddlej
Contributor

huddlej commented Feb 8, 2024

Context

To produce embeddings for multiple gene segments like HA and NA for influenza H3N2, we currently concatenate the alignments for each gene to create a single alignment file and then calculate the distance matrix from that concatenated alignment. This concatenation step is extra work for the user that the pathogen-embed command could easily perform itself.
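
For reference, the manual concatenation step might be sketched as follows. This is a minimal sketch, not code from pathogen-embed: `read_fasta` and `concatenate_alignments` are hypothetical helpers, sequences are matched by the first word of each FASTA header, and strains missing from any segment are dropped (an assumption; raising an error would be an equally reasonable choice).

```python
def read_fasta(path):
    """Return a dict mapping each sequence name to its sequence string."""
    records = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first word of the header as the sequence name.
                name = line[1:].split()[0]
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


def concatenate_alignments(paths):
    """Join per-segment alignments into one sequence per strain.

    Assumption: only strains present in every file are kept, in the
    order they appear in the first file.
    """
    alignments = [read_fasta(path) for path in paths]
    shared = [
        name for name in alignments[0]
        if all(name in alignment for alignment in alignments[1:])
    ]
    return {
        name: "".join(alignment[name] for alignment in alignments)
        for name in shared
    }
```

Calling `concatenate_alignments(["h3n2_ha_alignment.fasta", "h3n2_na_alignment.fasta"])` would return one HA+NA sequence per strain, which can then be written back out as a single FASTA file.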

Description

Ideally, users could provide multiple input files for both alignments and distance matrices to the pathogen-embed command. In this way, users could precalculate a distance matrix per gene segment and let the embed command sum the distance matrices internally. The interface might look like this:

```bash
# Create distance matrix for H3N2 HA alignment.
pathogen-distance \
  --alignment h3n2_ha_alignment.fasta \
  --output h3n2_ha_distances.csv

# Create distance matrix for H3N2 NA alignment.
pathogen-distance \
  --alignment h3n2_na_alignment.fasta \
  --output h3n2_na_distances.csv

# Run MDS on the HA and NA distances.
pathogen-embed \
  --distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
  --output-dataframe h3n2_ha_na_mds.csv \
  mds

# Run t-SNE on HA and NA alignments (for PCA initialization) and distances.
pathogen-embed \
  --alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
  --distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
  --output-dataframe h3n2_ha_na_t-sne.csv \
  t-sne

# Run t-SNE on HA and NA alignments for PCA initialization and to calculate the distance matrix on the fly.
pathogen-embed \
  --alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
  --output-dataframe h3n2_ha_na_t-sne.csv \
  t-sne

# Run t-SNE with an HA alignment for PCA initialization and HA/NA distance matrices for the embedding.
pathogen-embed \
  --alignment h3n2_ha_alignment.fasta \
  --distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
  --output-dataframe h3n2_ha_na_t-sne.csv \
  t-sne
```

This approach allows each distance matrix to be produced in parallel, for example in a Snakemake workflow, which will speed up a computationally expensive part of the analysis.
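
Outside of Snakemake, the same idea can be sketched in plain Python. This is only an illustration, not part of pathogen-embed: `distance_command` and `run_distances` are hypothetical helpers, the file names follow the H3N2 examples in this issue, and only the `--alignment` and `--output` flags shown here are used. Threads are enough because each `pathogen-distance` call runs as its own process.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def distance_command(segment):
    """Build the pathogen-distance command for one segment (hypothetical helper)."""
    return [
        "pathogen-distance",
        "--alignment", f"h3n2_{segment}_alignment.fasta",
        "--output", f"h3n2_{segment}_distances.csv",
    ]


def run_distances(segments, runner=subprocess.run):
    """Run one pathogen-distance command per segment concurrently.

    `runner` defaults to subprocess.run and can be swapped out for testing.
    """
    commands = [distance_command(segment) for segment in segments]
    with ThreadPoolExecutor() as pool:
        # check=True makes subprocess.run raise if a command fails;
        # list() forces every call to finish and surfaces any exception.
        list(pool.map(lambda command: runner(command, check=True), commands))
    return commands

# Example (requires pathogen-distance on the PATH):
# run_distances(["ha", "na"])
```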

Possible solution

To support this new functionality, the pathogen-embed command needs to:

  1. accept one or more arguments to --alignment and --distance-matrix
  2. load all given alignment files and, if more than one file is given, concatenate the alignments before running embeddings
  3. load all given distance matrix files and, if more than one file is given, sum the distances from all matrices into a single distance matrix before running embeddings
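
Step 1 could look roughly like this with Python's `argparse`, assuming the command line is built on it. This is a sketch rather than the actual pathogen-embed parser: `build_parser` is a hypothetical name, and only the options and subcommands mentioned in this issue are registered.

```python
import argparse


def build_parser():
    """Hypothetical parser accepting one or more files per input option."""
    parser = argparse.ArgumentParser(prog="pathogen-embed")
    # nargs="+" collects one or more values into a list.
    parser.add_argument("--alignment", nargs="+", help="one or more alignment FASTA files")
    parser.add_argument("--distance-matrix", nargs="+", help="one or more distance matrix CSV files")
    parser.add_argument("--output-dataframe", help="CSV file to write the embedding to")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for method in ("mds", "t-sne"):
        subparsers.add_parser(method)
    return parser


args = build_parser().parse_args([
    "--distance-matrix", "h3n2_ha_distances.csv", "h3n2_na_distances.csv",
    "--output-dataframe", "h3n2_ha_na_mds.csv",
    "mds",
])
print(args.distance_matrix)
```

Because `nargs="+"` consumes every following non-option argument, the subcommand should come after another option (here `--output-dataframe`) rather than directly after the file list, as it does in the examples in this issue.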

In the case where the user only provides alignments and the embedding requires a distance matrix, the command's current logic remains unchanged and operates on the concatenated alignment it produces from step 2 above.

It should be possible for the user to provide a single alignment file to use for PCA initialization of t-SNE, for example, and also provide multiple distance matrices to use for the embedding.
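
Summing the matrices does not depend on how many alignments were given, which is what makes this mixed case workable. A minimal sketch of the summation, assuming each CSV is a square matrix whose header row and first column hold strain names (the actual output format of pathogen-distance should be checked); `read_distance_matrix` and `sum_distance_matrices` are hypothetical helpers:

```python
import csv


def read_distance_matrix(path):
    """Read a labeled square matrix into a nested dict: distances[row][column]."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        columns = next(reader)[1:]  # skip the corner cell above the row labels
        return {
            row[0]: dict(zip(columns, map(float, row[1:])))
            for row in reader
            if row
        }


def sum_distance_matrices(paths):
    """Add matrices element-wise, matching entries by strain name.

    Raises ValueError if the files do not cover the same set of strains.
    """
    matrices = [read_distance_matrix(path) for path in paths]
    names = list(matrices[0])
    for path, matrix in zip(paths, matrices):
        if set(matrix) != set(names):
            raise ValueError(f"{path} does not contain the same strains as {paths[0]}")
    return {
        a: {b: sum(matrix[a][b] for matrix in matrices) for b in names}
        for a in names
    }
```

Looking values up by name rather than by position means the rows and columns of each file can be in a different order.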

@huddlej huddlej added the enhancement New feature or request label Feb 8, 2024
@nandsra21 nandsra21 self-assigned this Mar 29, 2024
@nandsra21
Collaborator

Working Implementation

```bash
# Create distance matrix for H3N2 HA alignment.
pathogen-distance \
  --alignment h3n2_ha_alignment.fasta \
  --output h3n2_ha_distances.csv

# Create distance matrix for H3N2 NA alignment.
pathogen-distance \
  --alignment h3n2_na_alignment.fasta \
  --output h3n2_na_distances.csv

# Run MDS on the HA and NA distances.
# Change: alignments must also be provided.
pathogen-embed \
  --alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
  --distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
  --output-dataframe h3n2_ha_na_mds.csv \
  mds

# Run t-SNE on HA and NA alignments (for PCA initialization) and distances.
pathogen-embed \
  --alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
  --distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
  --output-dataframe h3n2_ha_na_t-sne.csv \
  t-sne

# Run t-SNE on HA and NA alignments for PCA initialization and to calculate the distance matrix on the fly.
pathogen-embed \
  --alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
  --output-dataframe h3n2_ha_na_t-sne.csv \
  t-sne

# Run t-SNE with an HA alignment for PCA initialization and HA/NA distance matrices for the embedding.
# Change: the number of alignments must match the number of distance matrices.
pathogen-embed \
  --alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
  --distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
  --output-dataframe h3n2_ha_na_t-sne.csv \
  t-sne
```

@nandsra21 nandsra21 linked a pull request Apr 10, 2024 that will close this issue