Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate supporting fixed-length MNP variants #51

Open
timothymillar opened this issue Sep 21, 2020 · 1 comment
Open

Investigate supporting fixed-length MNP variants #51

timothymillar opened this issue Sep 21, 2020 · 1 comment
Assignees
Labels
enhancement New feature or request low priority

Comments

@timothymillar
Copy link
Collaborator

Freebayes likes to output MNPs e.g. TGC TAC,TGG.
Currently the best approach with these is to 'atomize' them into SNPs before assembly (removing redundant bases).

There are two options to support these without any user pre-processing:

  • automatically atomize them
  • generalize the encoding of read distributions to include MNPs

The second option would technically be the most computationally efficient because it only requires fewer variants.
The first option assesses all possible combinations of the atomized SNPs, not just the alts of the MNP.

The main complexity of encoding an MNP is how to handle the case where a read does not extend the entire length of the MNP.

@timothymillar timothymillar added the enhancement New feature or request label Sep 21, 2020
@timothymillar timothymillar self-assigned this Sep 21, 2020
@timothymillar
Copy link
Collaborator Author

This may require allowing read allele "distributions" in which the allele probabilities at a position sum to > 1.
This is because these probabilities are interpreted as "the probability of drawing this read from a haplotype with these alleles assuming the read can only cover the bases that it does".
Currently this means that P(T,A,- | T,A,A) + P(T,A,- | T,A,T) > 1 because a gap - is ignored when calculating the probability.
Encoded as an MNP we should get the same result P(TA- | TAA) + P(TA- | TAT) > 1.

For example two SNPs that could be encoded as a single MNP with all possible combinations of SNP alleles:

>>> haps = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
>>> onehot = (haps[..., None] == np.arange(0,2)).astype(np.int)  # onehot encoding of haps

The probabilities for matching a full read (i.e. covers both SNPs) sum to 1:

>>> read_full = np.array([[0.9, 0.1], [0.9, 0.1]])
>>> np.nanprod((read_full * onehot).sum(axis=-1), axis=-1)  # calculate prob of IIS with each haplotype
array([0.81, 0.09, 0.09, 0.01])

But not in the case of a partial read:

>>> read_partial = np.array([[0.9, 0.1], [np.nan, np.nan]])
>>> np.nanprod((read_partial * onehot).sum(axis=-1), axis=-1)
array([0.9, 0.9, 0.1, 0.1])

These should be the probabilities used to in the equivalent MNP encoding as a single variant e.g.:

>>> read_full =np.array([[0.81, 0.09, 0.09, 0.01]])
>>> read_partial = np.array([[0.9, 0.9, 0.1, 0.1]])

However, if not all possible SNP encodings where present (i.e. the MNP is restricted to a subset of the possible haplotypes) then these probabilities would need to be normalized.
In which case the probabilities of the partial read should be 'normalized' to sum to the total found if all SNP combinations are present?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request low priority
Projects
None yet
Development

No branches or pull requests

1 participant