ignores missing entries in .Rtab files #158

julibeg · 2021-06-06T16:44:12Z

No description provided.

julibeg · 2021-06-06T16:46:41Z

Regarding unit testing: After the change, all tests passed. Should we / I write new tests or rather change the tests/presence_absence.Rtab.gz file to include some missing values and adapt the tests accordingly (or leave testing as is since the change won't break things that worked before anyway)?

mgalardini · 2021-06-07T07:08:25Z

Thanks for the PR; maybe making a new Rtab file with missing values and making sure those are handled correctly in an ad-hoc unit test? Happy to prepare it if you can prepare a simple example file.

julibeg · 2021-07-07T20:21:05Z

Sorry for the delay! I added a simple unit test now.

julibeg · 2021-07-07T20:23:31Z

When looking at the code I realised that missing values are always replaced by 0s regardless of file type (as far as I can tell). Is this intended? Otherwise, AF could be used or the majority allele.

mgalardini · 2021-07-08T07:39:10Z

Thanks, I'll review the changes shortly and merge.

I checked again and we do ignore the samples with missing values (see here:

pyseer/pyseer/input.py

Line 427 in 1359b61

    
           kstrains = sorted(set(d.keys()).intersection(all_strains)) # This will include missing

).

I think we may want to assume that an empty "cell" also means that the gene has missing information on its status, other than the . char. Will try to add that change too

mgalardini · 2021-07-08T07:43:16Z

Ah wait I think we might have had a misunderstanding (my fault really!). How we handle missing variants in VCF files is a bit convoluted because we have to handle corner cases due to burden testing. We do assign a missing value if a variant is missing from the VCF file (

pyseer/pyseer/input.py

Line 479 in 1359b61

d[sample] = np.nan

).

So I think we should not merge this, unless it is not working as intended as it is (i.e. the --max-missing option is working as expected with Rtab files).

Sorry for not thinking this through earlier!

julibeg · 2021-07-08T20:26:37Z

I see. I can't say that I understand all the edge cases and how they are handled in the code. However, there is also

pyseer/pyseer/input.py

Line 486 in 1359b61

del d[sample]

which removes samples from the dict if the second haplotype is also '.' (which should always be the case in bacterial VCFs). The change in cfed703 should do the same thing for .Rtab files, but I don't know how this relates to burden testing etc.

johnlees · 2021-07-09T11:15:49Z

Missing data is difficult, and in my opinion is something that's more of a user consideration. But it is expected in many cases. I think the desired pyseer behaviour would be:

Exclude samples with a missing genotype from the regression in question (easy enough with the linear/wg models, not sure with the LMM as that's done in blocks). Does setting nan work ok?
With VCF as input missing sites are frequent so dealing with them this way ideally. If not possible, e.g. in the LMM, setting them to the major allele is probably sensible.
With kmers/unitigs missing data = absent so we don't need to deal with this.
Agree that Rtabs should be consistent with VCFs, even though the data we have used up until this point hasn't had missing elements.

Although I'm open to suggestions!

Using del d[sample] sets to 0, which is the simplest thing we could do. I think we should confirm whether np.nan works as intended, and if so use that with both Rtab and VCF.

If not, we should do a simple fill. plink has a similar option for missing data e.g. use the reference, use the major allele, use the minor allele. In our case I think we just want use major allele or use reference (0). We could either just pick one of these, or we could add as a command line option.

ignores missing entries in .Rtab files

cfed703

added unit test for missing values in .Rtab file

1359b61

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ignores missing entries in .Rtab files #158

ignores missing entries in .Rtab files #158

julibeg commented Jun 6, 2021

julibeg commented Jun 6, 2021 •

edited

mgalardini commented Jun 7, 2021

julibeg commented Jul 7, 2021 •

edited

julibeg commented Jul 7, 2021

mgalardini commented Jul 8, 2021

mgalardini commented Jul 8, 2021

julibeg commented Jul 8, 2021

johnlees commented Jul 9, 2021

ignores missing entries in .Rtab files #158

Are you sure you want to change the base?

ignores missing entries in .Rtab files #158

Conversation

julibeg commented Jun 6, 2021

julibeg commented Jun 6, 2021 • edited

mgalardini commented Jun 7, 2021

julibeg commented Jul 7, 2021 • edited

julibeg commented Jul 7, 2021

mgalardini commented Jul 8, 2021

mgalardini commented Jul 8, 2021

julibeg commented Jul 8, 2021

johnlees commented Jul 9, 2021

julibeg commented Jun 6, 2021 •

edited

julibeg commented Jul 7, 2021 •

edited