measurements export does not consistently allow the strain column to be used as a grouping column #1428

Open
huddlej opened this issue Mar 5, 2024 · 0 comments
Labels
bug Something isn't working

huddlej commented Mar 5, 2024

Current behavior and how to reproduce

The augur measurements export command only accepts the column given as --strain-column in the --grouping-column argument if the user knows that the strain column gets renamed internally to strain and does not also specify --include-columns. The following examples demonstrate the issue and how to reproduce it.

Reproducing this error requires building a minimal tree and measurements dataset. I ended up using the public data from the measurements panel paper as a starting point. The sequences and metadata are attached here: h3n2_data.zip

Clone the measurements panel repo and change into the directory:

git clone https://github.com/blab/measurements-panel.git
cd measurements-panel

Download sequences and metadata attached above into the data/ subdirectory.

Run the workflow to build the Auspice JSONs with the latest Nextstrain CLI and image:

nextstrain build . \
  auspice/measurements-panel_flu_seasonal_h3n2_ha.json \
  auspice/measurements-panel_flu_seasonal_h3n2_ha_measurements.json \
  --forceall

Confirm that the Auspice JSONs work as expected by dragging them onto https://auspice.us/.

Next, start a Nextstrain shell and rebuild the measurements panel manually with a command that overrides the grouping columns:

nextstrain shell .

augur measurements export \
  --collection results/h3n2/antigenic_distances_between_strains.tsv \
  --collection-config results/h3n2/measurements_collection_config.json \
  --grouping-column reference_strain \
  --strain-column test_strain \
  --value-column log2_titer \
  --key h3n2_ha_cell_hi \
  --title 'H3N2 cell-passaged HI measurements' \
  --x-axis-label 'normalized log2 titer distance from reference' \
  --threshold 0.0 2.0 \
  --filters reference_strain clade_reference source serum \
  --show-threshold \
  --hide-overall-mean \
  --minify-json \
  --output-json auspice/measurements-panel_flu_seasonal_h3n2_ha_measurements.json

Confirm that this panel JSON works in auspice.us with the original tree JSON. Then, try adding the strain column to the list of grouping columns, so that the y-axis can instead be grouped by test strain to show the distribution of measurements in that direction:

augur measurements export \
  --collection results/h3n2/antigenic_distances_between_strains.tsv \
  --collection-config results/h3n2/measurements_collection_config.json \
  --grouping-column reference_strain test_strain \
  --strain-column test_strain \
  --value-column log2_titer \
  --key h3n2_ha_cell_hi \
  --title 'H3N2 cell-passaged HI measurements' \
  --x-axis-label 'normalized log2 titer distance from reference' \
  --threshold 0.0 2.0 \
  --filters reference_strain clade_reference source serum \
  --show-threshold \
  --hide-overall-mean \
  --minify-json \
  --output-json auspice/measurements-panel_flu_seasonal_h3n2_ha_measurements.json

This command produces the following error:

ERROR: Provided grouping column 'test_strain' does not exist in collection TSV.

This (clearly incorrect) error suggests that the strain column test_strain gets renamed internally, so the user can no longer refer to it by its original name when the grouping columns are parsed. To test this, we can change the name of the grouping column from test_strain to strain as follows:

augur measurements export \
  --collection results/h3n2/antigenic_distances_between_strains.tsv \
  --collection-config results/h3n2/measurements_collection_config.json \
  --grouping-column reference_strain strain \
  --strain-column test_strain \
  --value-column log2_titer \
  --key h3n2_ha_cell_hi \
  --title 'H3N2 cell-passaged HI measurements' \
  --x-axis-label 'normalized log2 titer distance from reference' \
  --threshold 0.0 2.0 \
  --filters reference_strain clade_reference source serum \
  --show-threshold \
  --hide-overall-mean \
  --minify-json \
  --output-json auspice/measurements-panel_flu_seasonal_h3n2_ha_measurements.json

This works!

But, if I also add the --include-columns argument to limit which data appear in the panel, like so:

augur measurements export \
  --collection results/h3n2/antigenic_distances_between_strains.tsv \
  --collection-config results/h3n2/measurements_collection_config.json \
  --grouping-column reference_strain strain \
  --strain-column test_strain \
  --value-column log2_titer \
  --key h3n2_ha_cell_hi \
  --title 'H3N2 cell-passaged HI measurements' \
  --x-axis-label 'normalized log2 titer distance from reference' \
  --threshold 0.0 2.0 \
  --filters reference_strain clade_reference source serum \
  --show-threshold \
  --hide-overall-mean \
  --minify-json \
  --include-columns test_strain reference_strain log2_titer clade_reference source serum \
  --output-json auspice/measurements-panel_flu_seasonal_h3n2_ha_measurements.json

I get the following new error:

ERROR: Provided grouping column 'strain' was not in the list of columns to include: ['test_strain', 'reference_strain', 'log2_titer', 'clade_reference', 'source', 'serum'].

This error suggests that the grouping column validation associated with --include-columns occurs before test_strain gets renamed internally to strain. Then, when I try to change the grouping column to the value listed in the error message above (test_strain):

augur measurements export \
  --collection results/h3n2/antigenic_distances_between_strains.tsv \
  --collection-config results/h3n2/measurements_collection_config.json \
  --grouping-column reference_strain test_strain \
  --strain-column test_strain \
  --value-column log2_titer \
  --key h3n2_ha_cell_hi \
  --title 'H3N2 cell-passaged HI measurements' \
  --x-axis-label 'normalized log2 titer distance from reference' \
  --threshold 0.0 2.0 \
  --filters reference_strain clade_reference source serum \
  --show-threshold \
  --hide-overall-mean \
  --minify-json \
  --include-columns test_strain reference_strain log2_titer clade_reference source serum \
  --output-json auspice/measurements-panel_flu_seasonal_h3n2_ha_measurements.json

I get the following error again:

ERROR: Provided grouping column 'test_strain' does not exist in collection TSV.
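
The four outcomes above are consistent with a validation order like the one in the following minimal Python sketch. It is a model inferred only from the error messages in this issue, not augur's actual code: the function name check_grouping, the COLUMNS list (taken from the --include-columns values, so the real TSV may contain more columns), and the exact order of the steps are all assumptions.

```python
# Hypothetical model of the validation order suggested by the errors above.
# This is NOT augur's code; every name here is illustrative.

# Assumed TSV columns, taken from the --include-columns list in the issue.
COLUMNS = ["test_strain", "reference_strain", "log2_titer",
           "clade_reference", "source", "serum"]


def check_grouping(columns, strain_column, grouping, include=None):
    """Return "ok" or a short error string mimicking the observed behavior."""
    # Assumed step 1: grouping columns are checked against --include-columns
    # using the names exactly as the user typed them.
    if include is not None:
        for name in grouping:
            if name not in include:
                return f"'{name}' was not in the list of columns to include"

    # Assumed step 2: the strain column is renamed to "strain".
    renamed = ["strain" if c == strain_column else c for c in columns]

    # Assumed step 3: grouping columns are checked against the renamed columns.
    for name in grouping:
        if name not in renamed:
            return f"'{name}' does not exist in collection TSV"
    return "ok"


# The four commands from the issue, in order.
print(check_grouping(COLUMNS, "test_strain", ["reference_strain", "test_strain"]))
print(check_grouping(COLUMNS, "test_strain", ["reference_strain", "strain"]))
print(check_grouping(COLUMNS, "test_strain", ["reference_strain", "strain"], COLUMNS))
print(check_grouping(COLUMNS, "test_strain", ["reference_strain", "test_strain"], COLUMNS))
```

Under this model, no spelling of the strain column passes both checks once --include-columns is given, which matches the behavior reported above.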

Expected behavior

I would expect to be able to consistently refer to the strain column by its name in the input collection file for all arguments to the measurements export command. Although the use case here is unusual for our traditional serology data, other types of data benefit from this ability to group by the strain column on the y-axis.

Possible solution

One possible solution would be to keep the original strain column under its same name but add a new internal id column that copies the strain column values for use by internal logic. We could drop the original strain column from the internal data frame just before writing out the JSON version of the measurements.
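
A rough sketch of that idea in plain Python, using a list of dicts in place of the internal data frame. Everything here is hypothetical: the internal column name _internal_strain_id, the three helper functions, and the assumption that the output record carries the strain name under a strain field are illustrative choices, not augur's implementation.

```python
# Hypothetical sketch of the proposed fix; not augur's implementation.
INTERNAL_ID = "_internal_strain_id"  # hypothetical internal column name


def add_internal_id(rows, strain_column):
    """Copy the strain column into an internal id column, keeping the original."""
    return [{**row, INTERNAL_ID: row[strain_column]} for row in rows]


def validate_grouping(rows, grouping_columns):
    """Check grouping columns against the user's original column names."""
    available = (set(rows[0]) - {INTERNAL_ID}) if rows else set()
    missing = [c for c in grouping_columns if c not in available]
    if missing:
        raise ValueError(f"Grouping column(s) not in collection: {missing}")


def finalize(rows, strain_column):
    """Drop the original strain column just before writing the output."""
    out = []
    for row in rows:
        record = {k: v for k, v in row.items()
                  if k not in (strain_column, INTERNAL_ID)}
        # Assumption: the output record stores the strain name as "strain".
        record["strain"] = row[INTERNAL_ID]
        out.append(record)
    return out


rows = [{"test_strain": "A", "reference_strain": "B", "log2_titer": "1.0"}]
prepared = add_internal_id(rows, "test_strain")
# The user-facing name stays valid for every argument that refers to columns.
validate_grouping(prepared, ["reference_strain", "test_strain"])
print(finalize(prepared, "test_strain"))
```

Because validation only ever sees the original column names, the same name would work in --grouping-column and --include-columns. How a grouping on the strain column should be labeled in the output JSON is left open here.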

@huddlej huddlej added the bug Something isn't working label Mar 5, 2024
huddlej added a commit to blab/loes-2024-neutralization-nextstrain that referenced this issue Mar 6, 2024
Add "strain" to grouping columns and remove the "include columns"
option, so the grouping works as expected (see
nextstrain/augur#1428 for more details).