Use weighted sampling for Asia builds #1106

victorlin · 2024-05-03T23:09:45Z

Description of proposed changes

This replaces the Asia/China/India split with population-based weighted sampling (possible with nextstrain/augur#1454).

I ran trial builds (linked below) for both GISAID and open. I think a good comparison is gisaid/asia/all-time before and after weighted sampling. Some notes from that comparison:

With weighted sampling, more sequences are included for large countries such as Indonesia which did not get the same treatment as China and India prior to this PR.
With weighted sampling, there is a high chance that no sequences are included for small countries such as Maldives.
The sampling difference results in some subtle but maybe significant differences in the frequencies panel.

Caveat: weighted target sizes are calculated without taking into account the actual number of sequences available per group. This means under-sampled countries will still be under-sampled even if it has a high weight, and results in significantly fewer total sequences than requested by --subsample-max-sequences). This is already the case with the uniform sampling approach prior to this PR, but it may be more noticeable now with large countries that are under-sampled.

Questions for reviewers

What can be done to check that the sampling results are appropriate?
Is it a good idea to weight on population size? I was thinking per-country case counts over time might be better, but not sure where to get that data.

Checklist

Run trial builds for open, gisaid
- open/asia 1m, 2m, 6m, all-time
- gisaid/asia 1m, 2m, 6m, all-time
Sampling results look ok?
Drop 1afb6d7
Merge+release Implement weighted sampling augur#1454
Post-merge: consider this for other region builds?

Release checklist

If this pull request introduces new features, complete the following steps:

Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

To be used for weighted sampling in a future commit.

This replaces the Asia/China/India split with population-based weighted sampling (possible in Augur version X.X.X). This requires changing the geographical grouping resolution from division to country, but I assume it was only grouped by division in an attempt to have varying group sizes per country, and that population-based weighting is an acceptable replacement.

victorlin · 2024-05-08T17:42:40Z

Not ready to be merged but opening for initial review and feedback to shape implementation in nextstrain/augur#1454.

nextstrain_profiles/nextstrain-open/builds.yaml

joverlee521 · 2024-05-08T23:02:39Z

I was thinking per-country case counts over time might be better, but not sure where to get that data.

We ingest country case counts from OWID in forecasts-ncov and upload them to s3://nextstrain-data/files/workflows/forecasts-ncov/cases/global.tsv.gz.

victorlin · 2024-05-10T00:03:00Z

workflow/snakemake_rules/main_workflow.smk

+rule get_weights:
+    output: "data/country_week_weights.tsv"


Re: using country case counts from OWID

Trying this with 6b6cd2b. Trial runs: open, gisaid

There's some missing case count data, i.e. sequences are present despite country/week combinations having a case count of 0. Reasons and how these were addressed:

Things listed out here:

ncov/scripts/get_weights.py

Lines 33 to 44 in 5ca1afa

# Fill in weights for missing groups. Notes:

# 1. For missing data flanked by weeks with an available case count, this

# uses linear interpolation to "guess" the case count.

# 2. Countries with abrupt reporting start or cutoff will extrapolate the

# first/last available week for the missing weeks. It may be possible to

# interpolate from 0 on the very end instead of extrapolating the

# constant, but that seems a bit more difficult and not sure if it's any

# better.

# 3. It looks like some case counts don't represent cases within that week

# but rather cumulative since the last reported week. Unless that can be

# assumed for every week that follows a gap in data, I don't think

# there's anything that can be done here.

On top of (1), there are also sequences present for weeks and countries missing entirely from case count data. To address this, I added a patch in the Augur PR that simply drops these sequences rather than erroring out.

Quick tests on the 100k subsampled dataset make me skeptical about this approach. The high case counts attempt to capture more than the available amount of sequences, while at the same time making weights for weeks with small case counts very small and thus not sample from those at all (despite their potential importance in phylogenetic analysis). Hopefully the trial run results will make it clear whether to pursue this direction.

Trial runs are complete. Results seem significantly different – it'd be good to have a phylo person take a look. Here's quick links to gisaid/asia/all-time for:

Current build on master: uniform sampling by division+month

Original PR implementation: weighted sampling by country population, uniform sampling by month

New PR implementation: weighted sampling by country+week case counts

Note from builds.yaml that (2) and (3) only apply to the focal Asia samples. Context (non-Asia) samples still conform to (1), but that could be changed.

@victorlin First, one big picture comment related to this comment in the augur PR:

I want to make sure we have at least one solid use case for this feature.

I see example 2 above ("Original PR implementation") as exactly the use case for this feature where we want to minimize the kind of manual counting and hardcoded subsampling blocks that @trvrb had to implement to support proper weighting of India and China relative to the rest of Asia. If we can replace that subsampling YAML approach with a clearer, simpler table of weights by country or region, that's a big win.

Digging into the specific trees you linked to above, there are two main points that came to mind:

Weights by country population, uniform sampling by month looks remarkably like the original implementation, but with a simpler config in the ncov repo and slightly better representation of countries like India and China than even the original implementation. Below I've included example views of the map colored by clade per country for each of the three builds linked above where I've also filtered to "region=Asia".

The original view (sampling by division and month):

The weighted by population view (sampling by weighted country and uniform month):

The weighted by case counts view (sampling by weighted country and case counts):

The first two views show how similar the two builds are qualitatively, but the second view has more data for China, India, Indonesia, and Pakistan than the first. The second view has correspondingly fewer samples for Japan, Thailand, and Malaysia. This seems correct. For example, Indonesia and Malaysia have similar numbers of samples in the first view, but Indonesia has 8 times as many people as Malaysia. The second view more accurately represents that difference in population with 194 samples for Indonesia and 23 for Malaysia.

From 31e1099's commit message:

This requires changing the geographical grouping resolution from division to country, but I assume it was only grouped by division in an attempt to have varying group sizes per country, and that population-based weighting is an acceptable replacement.

I am curious whether @trvrb chose to group by division for this reason only (to get varying group sizes per country) or if he really preferred even sampling by division. Anecdotally (and perhaps I've misremembered the specifics here), I thought he and Marlin were concerned about overrepresentation of sequences from specific divisions in India as new variants were emerging. To explore this case, I made example map views from each of your builds above at the division resolution and filtering to "country=India".

Original view of India by division:

India by division with population weights by country:

India by division with weights by country and case counts:

The third view confirms what we already saw above, that sampling by case counts enriches for older clades, which I discuss more in my second point below.

The first view shows a similar representation of each state in India, but the second view (which only groups by country) shows an overrepresentation of samples from a smaller set of states. I can't say that either one of these views is "the correct" implementation, since that judgement depends on the question we're trying to answer with these trees. However, this difference in results does raise the question for me of whether would could have weighted subsampling by country and uniform sampling by division in the same group by logic. In other words, could we have group_by: "country division year month" and still use the weights per country? If so, we get the best of both worlds for this use case.

Weights by case counts do not seem as useful for this specific ncov use case. The tree from example 3 above overrepresents samples from older clades circulating around the start of 2023 (22B) at the expense of including more samples from recent clades (e.g., 22F, 23I). Your first comment in this thread highlights all of the issue with case counts which I think could be summarized as: case counts are noisy data. They are also less reliable data now that there is less funding for surveillance. While population counts per country are also inaccurate and will change over time, the time scale of those changes is decades and you still get useful relative comparisons between countries independent of their genomic surveillance budgets.

This comment above specifically made me worry about using case counts for subsampling in the ncov builds:

there are also sequences present for weeks and countries missing entirely from case count data. To address this, I added a patch in the Augur PR that simply drops these sequences rather than erroring out.

When case count data are missing but sequences are present, I would still want to have the sequences in the tree. Even if you reverted this behavior of Augur to not drop sequences when cases were missing, I probably wouldn't reach for case counts as weights in these builds. That said, Miguel might reach for exactly this when designing a subsampling scheme for SARS-CoV-2 in King County where the case counts during a specific time span are more reliable and the analysis is more sensitive to geographic sampling and case load during that time. I would show this example approach to weighting by case counts to him and get his feedback on how useful it would have been for him.

Thanks @huddlej, this is really valuable. You confirmed my suspicious - I'll revert 6b6cd2b.

could we have group_by: "country division year month" and still use the weights per country?

Trying this with 835481d (GISAID trial run). Will update here when the run is complete.

Here's gisaid/asia/all-time for weighted country + uniform division sampling filtered to india by division (view):

Looks like it worked! What's weird is that the number of sequences from India went from 760 in the weighted country run to 1083 in the weighted country + uniform division run. I don't think this can be explained by subsampling randomness – will need to look into it a bit deeper (this is where the --output-group-by-sizes file discussed in nextstrain/augur#1454 (comment) will be useful)

Nice! That looks pretty good to me at first glance. The increased number of sequences for India as a whole makes sense from the perspective that each division has to be equally represented by a fixed number of sequences per group within the country and there are a lot of divisions. Looking at the Malaysia/Indonesia representation, I see the same effect where Indonesia has more divisions than Malaysia and ends up with 12 times as many sequences instead of 8 times as many that we'd expect from country population weights alone. This seems like an expected behavior of the group by country and division, though, so I might not worry about this much...

victorlin added 3 commits May 3, 2024 15:43

Add script for global per-country population sizes

ba0d7ea

To be used for weighted sampling in a future commit.

🚧 [DROP] run asia builds only

1afb6d7

victorlin self-assigned this May 3, 2024

victorlin mentioned this pull request May 7, 2024

Implement weighted sampling nextstrain/augur#1454

Draft

4 tasks

victorlin marked this pull request as ready for review May 8, 2024 17:42

genehack approved these changes May 8, 2024

View reviewed changes

nextstrain_profiles/nextstrain-open/builds.yaml Show resolved Hide resolved

jameshadfield marked this pull request as draft May 8, 2024 22:08

victorlin commented May 10, 2024

View reviewed changes

victorlin force-pushed the victorlin/weighted-sampling branch from 5ca1afa to 5d2c691 Compare May 10, 2024 16:45

🚧 Use country/week case counts as weights

6b6cd2b

victorlin force-pushed the victorlin/weighted-sampling branch from 5d2c691 to 6b6cd2b Compare May 10, 2024 18:45

victorlin changed the title ~~Use population-based weighted sampling for Asia builds~~ Use weighted sampling for Asia builds May 10, 2024

victorlin mentioned this pull request May 23, 2024

Generate subsampling config with a script #1102

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use weighted sampling for Asia builds #1106

Use weighted sampling for Asia builds #1106

victorlin commented May 3, 2024 •

edited

victorlin commented May 8, 2024

joverlee521 commented May 8, 2024

victorlin May 10, 2024 •

edited

victorlin May 10, 2024

huddlej May 20, 2024

victorlin May 20, 2024

victorlin May 21, 2024

huddlej May 21, 2024

	# Fill in weights for missing groups. Notes:
	# 1. For missing data flanked by weeks with an available case count, this
	# uses linear interpolation to "guess" the case count.
	# 2. Countries with abrupt reporting start or cutoff will extrapolate the
	# first/last available week for the missing weeks. It may be possible to
	# interpolate from 0 on the very end instead of extrapolating the
	# constant, but that seems a bit more difficult and not sure if it's any
	# better.
	# 3. It looks like some case counts don't represent cases within that week
	# but rather cumulative since the last reported week. Unless that can be
	# assumed for every week that follows a gap in data, I don't think
	# there's anything that can be done here.

Use weighted sampling for Asia builds #1106

Are you sure you want to change the base?

Use weighted sampling for Asia builds #1106

Conversation

victorlin commented May 3, 2024 • edited

Description of proposed changes

Questions for reviewers

Checklist

Release checklist

victorlin commented May 8, 2024

joverlee521 commented May 8, 2024

victorlin May 10, 2024 • edited

Choose a reason for hiding this comment

victorlin May 10, 2024

Choose a reason for hiding this comment

huddlej May 20, 2024

Choose a reason for hiding this comment

victorlin May 20, 2024

Choose a reason for hiding this comment

victorlin May 21, 2024

Choose a reason for hiding this comment

huddlej May 21, 2024

Choose a reason for hiding this comment

victorlin commented May 3, 2024 •

edited

victorlin May 10, 2024 •

edited