This repository has been archived by the owner on Feb 7, 2023. It is now read-only.

Source for number of variants and samples on main page #129

Open
wm75 opened this issue Apr 27, 2020 · 4 comments

Comments

@wm75
Collaborator

wm75 commented Apr 27, 2020

Currently, the main page states this under Results for Genomics:

These lists are updated daily. There are 397 sites showing intra-host variation across 33 samples (with frequencies between 5% and 95%). Twenty nine samples have fixed differences at 39 sites from the published reference.

This leaves two questions:

  1. When were these numbers last updated (and based on which samples)?
  2. When is a variant considered a fixed difference?

When I analyzed https://covid19.galaxyproject.org/genomics/4-Variation/variant_list.tsv today, I got this for the filter condition 0.95 >= float(af) >= 0.05:
Samples with variants: 378
Total number of variants observed: 260
Number of sites observed to carry variants: 259

For the fixed differences, I tried float(af) == 1.0, which gave:
Samples with variants: 55
Total number of variants observed: 27
Number of sites observed to carry variants: 27

and float(af) > 0.95 resulting in:
Samples with variants: 378
Total number of variants observed: 260
Number of sites observed to carry variants: 259

@mvdbeek
Member

mvdbeek commented Apr 27, 2020

  1. When were these numbers last updated (and based on which samples)?

Commit 1d272c3, 27 days ago. We need to template this; the actual number is clearly far more than 397.
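
One way to template the sentence is to keep its wording fixed and substitute counts computed from the variant list at build time. The following is only a minimal sketch: `SUMMARY` and `render_summary` are hypothetical names, and the site's actual build tooling is not shown in this thread. The example values are the ones currently hard-coded on the main page.

```python
from string import Template

# Hypothetical template mirroring the sentence on the main page.
SUMMARY = Template(
    "There are $n_sites sites showing intra-host variation across "
    "$n_samples samples (with frequencies between 5% and 95%)."
)


def render_summary(n_sites, n_samples):
    """Fill the freshly computed counts into the fixed sentence."""
    return SUMMARY.substitute(n_sites=n_sites, n_samples=n_samples)


# Reproduces the currently hard-coded text; a daily job would pass new counts.
print(render_summary(397, 33))
```

With this in place, the daily update only needs to recompute the two counts, so the text cannot drift from the data.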

When I analyzed https://covid19.galaxyproject.org/genomics/4-Variation/variant_list.tsv today, I got this for the filter condition 0.95 >= float(af) >= 0.05:
Samples with variants: 378
Total number of variants observed: 260
Number of sites observed to carry variants: 259

How did you do this? I just downloaded the file, and 1410 unique positions meet this criterion:

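import pandas as pd

# Load the current variant list (tab-separated).
df = pd.read_csv('https://covid19.galaxyproject.org/genomics/4-Variation/variant_list.tsv', sep='\t')
# Keep intra-host variants: allele frequency strictly between 5% and 95%.
intrahost = df[(df['AF'] > 0.05) & (df['AF'] < 0.95)]
# Number of distinct positions carrying such a variant.
len(intrahost['POS'].unique())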

This returns 1410, and more if you count by variant rather than by position.

@wm75
Collaborator Author

wm75 commented Apr 27, 2020

Ah, I pasted the same result snippet twice above, sorry!
The first result should have been:
Samples with variants: 334
Total number of variants observed: 1466
Number of sites observed to carry variants: 1424

@wm75
Collaborator Author

wm75 commented Apr 27, 2020

... and if I go with the filter 0.95 > float(af) > 0.05 instead of my initial >=:
Samples with variants: 333
Total number of variants observed: 1452
Number of sites observed to carry variants: 1410

which seems to be what you obtained, right?
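
Assuming both runs used the same file, the two results can differ only in records whose AF is exactly 0.05 or 0.95. A small sketch with made-up AF values (not taken from variant_list.tsv) shows how the inclusive and strict filters diverge at the boundaries:

```python
# Made-up allele frequencies, for illustration only.
afs = [0.05, 0.10, 0.50, 0.95, 1.0]

# Inclusive bounds keep the boundary values 0.05 and 0.95.
inclusive = [af for af in afs if 0.95 >= af >= 0.05]
# Strict bounds drop them.
strict = [af for af in afs if 0.95 > af > 0.05]

print(len(inclusive))  # 4
print(len(strict))     # 2
```

Whichever convention is chosen, the main page text and the filter should state the same one.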

@wm75
Collaborator Author

wm75 commented Apr 27, 2020

I used plain Python for this:

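def calc_latest_stats(fun):
    """Print sample, variant and site counts for the records accepted by fun."""
    samples = set()
    variants = set()
    sites = set()
    with open('va_current.tsv') as tsv:
        next(tsv)  # skip the header line
        for line in tsv:
            # Only the first seven columns are used; af is the allele frequency.
            s, c, p, r, a, d, af, *others = line.split('\t')
            if fun(s, c, p, a, af):
                samples.add(s)
                variants.add((c, p, a))  # a variant is a site plus its alternate allele
                sites.add((c, p))        # a site is counted once, whatever the allele

    print('Samples with variants:', len(samples))
    print('Total number of variants observed:', len(variants))
    print('Number of sites observed to carry variants:', len(sites))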

with, for example:
fun = lambda s, c, p, a, af: 0.95 > float(af) > 0.05
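
For reuse or testing, the same counting can be written to accept any iterable of lines and to return the counts instead of printing them. This is a sketch under assumptions: `count_stats` is a hypothetical name, the header names and rows in `demo` are made up, and the column order is only inferred from the unpacking above.

```python
def count_stats(lines, keep):
    """Return (samples, variants, sites) counts for rows where keep(af) is true."""
    rows = iter(lines)
    next(rows)  # skip the header line
    samples, variants, sites = set(), set(), set()
    for line in rows:
        s, c, p, r, a, d, af, *others = line.rstrip('\n').split('\t')
        if keep(float(af)):
            samples.add(s)
            variants.add((c, p, a))
            sites.add((c, p))
    return len(samples), len(variants), len(sites)


# Made-up rows, for illustration only (header names are assumptions).
demo = [
    "Sample\tCHROM\tPOS\tREF\tALT\tDP\tAF\n",
    "S1\tNC_045512\t100\tA\tG\t50\t0.10\n",
    "S2\tNC_045512\t100\tA\tT\t60\t0.50\n",
    "S2\tNC_045512\t200\tC\tT\t70\t1.0\n",
]

print(count_stats(demo, lambda af: 0.95 > af > 0.05))  # (2, 2, 1)
```

In the made-up data, two samples carry two different alleles at the same position, so the variant count (2) exceeds the site count (1), which is the same effect that separates the variant and site totals reported above.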
