Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix annotate metadata with numeric strains #952

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

huddlej
Copy link
Contributor

@huddlej huddlej commented May 25, 2022

Description of proposed changes

Fix annotate metadata with numeric strains by setting the dtype of the "strain" column in the sequence index to "string".

Related issue(s)

Fixes #948

Testing

  • Added functional test for issue, confirmed that test failed, and confirmed that fix in c49a5a6 makes the test pass
  • Tested by CI

scripts/annotate_metadata_with_index.py Outdated Show resolved Hide resolved
@huddlej
Copy link
Contributor Author

huddlej commented Aug 4, 2022

CI fails on the "push" because this PR branches from the workflow with the old Nextalign/Nextclade commands, but the CI runs with the latest Docker image using new Nextalign/Nextclade:

python3 scripts/sanitize_sequences.py             --sequences data/example_sequences.fasta.gz             --strip-prefixes hCoV-19/ SARS-CoV-2/             --output /dev/stdout 2> logs/sanitize_sequences_gisaid.txt             | nextalign             --jobs=2             --reference defaults/reference_seq.fasta             --genemap defaults/annotation.gff             --genes ORF1a,ORF1b,S,ORF3a,E,M,ORF6,ORF7a,ORF7b,ORF8,N,ORF9b             --sequences /dev/stdin             --output-dir results/translations             --output-basename seqs_gisaid             --output-fasta results/aligned_gisaid.fasta             --output-insertions results/insertions_gisaid.tsv > logs/align_gisaid.txt 2>&1;
        xz -2 -T 2 results/aligned_gisaid.fasta;
        xz -2 -T 2 results/translations/seqs_gisaid*.fasta

CI passes on the "pull" which runs the results we'd see after merging this PR into master where the alignment commands match the Docker environment.

python3 scripts/sanitize_sequences.py             --sequences data/example_sequences.fasta.gz             --strip-prefixes hCoV-19/ SARS-CoV-2/             --output /dev/stdout 2> logs/sanitize_sequences_gisaid.txt             | nextalign run             --jobs=2             --reference defaults/reference_seq.fasta             --genemap defaults/annotation.gff             --output-translations results/translations/seqs_gisaid.gene.{gene}.fasta             --output-fasta results/aligned_gisaid.fasta             --output-insertions results/insertions_gisaid.tsv > logs/align_gisaid.txt 2>&1;
        xz -2 -T 2 results/aligned_gisaid.fasta;
        xz -2 -T 2 results/translations/seqs_gisaid.gene.*.fasta

Given that the latter passes and is the check that matters, we should be safe to merge.

Adds a functional test to cover a use case described in #948.
Sets the dtype of the strain column in the sequence index to "string"
prior to annotating metadata with that index. This change prevents
pandas from inferring the dtype as numeric when strain names are all
numeric.

Fixes #948
Replaces "merge" between metadata and sequence index with a "join" that
uses the index of the input data frames to merge their content. This
change allows us to support metadata where strains are indexed by a "name"
column instead of a "strain" column. The sequence index will always have
a "strain" column (by design), so we load the sequence index and tell
pandas the name of its column to index by. This additional change allows
the "join" command to work as expected without ever needing to specify
the name of the strain column in the metadata.

This commit adds a functional test for this expected behavior.
@huddlej huddlej force-pushed the fix-annotate-metadata-with-numeric-strains branch from 17c75ff to bdc1775 Compare August 5, 2022 20:41
@huddlej huddlej self-assigned this Nov 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

annotate_metadata_with_index raises ValueError when dtypes don't match
3 participants