Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ignores missing entries in .Rtab files #158

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

julibeg
Copy link
Contributor

@julibeg julibeg commented Jun 6, 2021

No description provided.

@julibeg
Copy link
Contributor Author

julibeg commented Jun 6, 2021

Regarding unit testing: After the change, all tests passed. Should we / I write new tests or rather change the tests/presence_absence.Rtab.gz file to include some missing values and adapt the tests accordingly (or leave testing as is since the change won't break things that worked before anyway)?

@mgalardini
Copy link
Owner

Thanks for the PR; maybe making a new Rtab file with missing values and making sure those are handled correctly in an ad-hoc unit test? Happy to prepare it if you can prepare a simple example file.

@julibeg
Copy link
Contributor Author

julibeg commented Jul 7, 2021

Sorry for the delay! I added a simple unit test now.

@julibeg
Copy link
Contributor Author

julibeg commented Jul 7, 2021

When looking at the code I realised that missing values are always replaced by 0s regardless of file type (as far as I can tell). Is this intended? Otherwise, AF could be used or the majority allele.

@mgalardini
Copy link
Owner

Thanks, I'll review the changes shortly and merge.

I checked again and we do ignore the samples with missing values (see here:

kstrains = sorted(set(d.keys()).intersection(all_strains)) # This will include missing
).

I think we may want to assume that an empty "cell" also means that the gene has missing information on its status, other than the . char. Will try to add that change too

@mgalardini
Copy link
Owner

Ah wait I think we might have had a misunderstanding (my fault really!). How we handle missing variants in VCF files is a bit convoluted because we have to handle corner cases due to burden testing. We do assign a missing value if a variant is missing from the VCF file (

d[sample] = np.nan
).

So I think we should not merge this, unless it is not working as intended as it is (i.e. the --max-missing option is working as expected with Rtab files).

Sorry for not thinking this through earlier!

@julibeg
Copy link
Contributor Author

julibeg commented Jul 8, 2021

I see. I can't say that I understand all the edge cases and how they are handled in the code. However, there is also

del d[sample]

which removes samples from the dict if the second haplotype is also '.' (which should always be the case in bacterial VCFs). The change in cfed703 should do the same thing for .Rtab files, but I don't know how this relates to burden testing etc.

@johnlees
Copy link
Collaborator

johnlees commented Jul 9, 2021

Missing data is difficult, and in my opinion is something that's more of a user consideration. But it is expected in many cases. I think the desired pyseer behaviour would be:

  • Exclude samples with a missing genotype from the regression in question (easy enough with the linear/wg models, not sure with the LMM as that's done in blocks). Does setting nan work ok?
  • With VCF as input missing sites are frequent so dealing with them this way ideally. If not possible, e.g. in the LMM, setting them to the major allele is probably sensible.
  • With kmers/unitigs missing data = absent so we don't need to deal with this.
  • Agree that Rtabs should be consistent with VCFs, even though the data we have used up until this point hasn't had missing elements.

Although I'm open to suggestions!

Using del d[sample] sets to 0, which is the simplest thing we could do. I think we should confirm whether np.nan works as intended, and if so use that with both Rtab and VCF.

If not, we should do a simple fill. plink has a similar option for missing data e.g. use the reference, use the major allele, use the minor allele. In our case I think we just want use major allele or use reference (0). We could either just pick one of these, or we could add as a command line option.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants