
[BUG] Potential computation in anvi-get-pn-ps-ratio is incorrect #2195

Open
apcamargo opened this issue Dec 26, 2023 · 2 comments

@apcamargo

Short description of the problem

When anvi'o computes the potential of a given codon, it does so by evaluating whether mutations in one position (that is, with a Hamming distance of 1) generate synonymous or non-synonymous codons. However, the Nei-Gojobori method takes into account the distance between pairs of codons when computing potentials. That is, if you have the ACC codon in the reference and an ATT variant, the potential of ACC is computed by evaluating all possible mutation paths between those two codons.

You can find an implementation of the Nei-Gojobori method here.
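As a minimal sketch of that path-averaged counting (not anvi'o code; the toy codon table covers only the codons used in the examples, where a full implementation would use the complete genetic code, e.g. anvi'o's `constants.codon_to_AA`):

```python
from itertools import permutations

# Toy codon table covering only the codons used in this example.
CODON_TO_AA = {
    'ACC': 'Thr', 'ATC': 'Ile', 'ACT': 'Thr', 'ATT': 'Ile',
    'TTT': 'Phe', 'GTT': 'Val', 'TTA': 'Leu', 'GTA': 'Val',
}

def count_differences(codon1, codon2, codon_to_aa=CODON_TO_AA):
    """Path-averaged synonymous/non-synonymous differences, Nei-Gojobori style.

    Enumerates every order in which the differing positions could have mutated,
    classifies each single-nucleotide step as synonymous or non-synonymous, and
    averages over all paths. (The published method also excludes paths passing
    through stop codons; that detail is omitted in this sketch.)
    """
    diff_positions = [i for i in range(3) if codon1[i] != codon2[i]]
    total_syn = total_nonsyn = n_paths = 0
    for order in permutations(diff_positions):
        current = list(codon1)
        for pos in order:
            before = ''.join(current)
            current[pos] = codon2[pos]
            after = ''.join(current)
            if codon_to_aa[before] == codon_to_aa[after]:
                total_syn += 1
            else:
                total_nonsyn += 1
        n_paths += 1
    return total_syn / n_paths, total_nonsyn / n_paths
```

For ACC vs. ATT, the two mutation paths average out to (s_d, n_d) = (1.0, 1.0); for TTT vs. GTA, the example worked in the paper, they give (0.5, 1.5), so s_d + n_d equals the number of differing nucleotides.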

anvi'o version

Anvi'o .......................................: marie (v8)
Python .......................................: 3.10.12

Profile database .............................: 38
Contigs database .............................: 21
Pan database .................................: 16
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2

System info

Linux 4.18.0-425.3.1.el8.x86_64 x86_64

anvi'o executed via Docker.

@ekiefl
Contributor

ekiefl commented Jan 4, 2024

Hi @apcamargo,

Thanks for the bug report. I agree that this might be an improvement relative to the current method.

Ramblings

I'm embarrassed to say that I'm having trouble understanding the code that I wrote so many years ago, but let's start from CodonsEngine.calc_pN_pS (from within anvio.variabilityops):

anvio/anvio/variabilityops.py

Lines 2647 to 2722 in 4eb5767

```python
def calc_pN_pS(self, contigs_db=None, grouping='site', comparison='reference', potentials=None, add_potentials=False, log_transform=False):
    """Calculate new columns in self.data corresponding to each site's contribution to a grouping

    First, this function calculates fN and fS for each SCV relative to a comparison codon (see
    `comparison`). fN and fS are respectively the fraction of non-synonymous and synonymous
    variation observed in the SAAV. fN and fS sum to 1. Then, each fN and fS value is divided
    by the number of non-synonymous sites and synonymous sites (aka the synonymity potentials),
    which will depend on the grouping (see `grouping`).

    Parameters
    ==========
    contigs_db : dbops.ContigsSuperclass
        If None, it is assumed that `self` has inherited `ContigsSuperclass` already.
    grouping : str, 'site'
        This keyword specifies the number of synonymous and non-synonymous sites that fN and fS
        should be divided by to yield pN and pS. E.g. if `grouping` is 'gene', then each SCV
        belonging to gene X should be divided by the number of synonymous and non-synonymous
        sites that are present in gene X's sequence, i.e. the output of
        `utils.get_synonymous_and_non_synonymous_potential`. As another example, if `grouping`
        is 'site' and the comparison codon at a site is CGA, then there are 1.33 synonymous
        sites and 1.66 non-synonymous sites. So each SCV's pN and pS should be calculated by
        taking fN and fS, and dividing them by 1.66 and 1.33 respectively. If `potentials` is
        None, the available `grouping` values are {'gene', 'site'}, otherwise `grouping` can be
        assigned to anything, and it will be used to name the columns (see Returns).
    comparison : str, 'reference'
        This specifies the comparison codon that should be used for determining synonymity.
        Options are {'reference', 'consensus', 'popular_consensus'}. reference means the codon
        in the reference sequence is the comparison codon, consensus means the most common
        codon in the SCV should be used as the reference, and popular_consensus means that the
        most frequently observed consensus codon across all samples should be used as the
        comparison codon.
    potentials : numpy.array, None
        Most people should not use this. It provides a way to create custom groupings. If
        passed, it should be a (len(self.data), 2) shaped numpy array. potentials[:,0] and
        potentials[:,1] specify the amount that each SCV's fS and fN values should be divided
        by to yield pS and pN.
    log_transform : bool, False
        Log transforms pN and pS values by log10(x + 1e-4).

    Returns
    =======
    output : None
        This function does not return anything. It creates 2 new columns in `self.data` with
        names pN_{grouping}_{comparison} and pS_{grouping}_{comparison}, unless `grouping` is
        'site', in which case the column names are pN_{comparison} and pS_{comparison}. If
        add_potentials is True, the number of synonymous and nonsynonymous sites will be
        added as the columns nS_{grouping}_{comparison} and nN_{grouping}_{comparison}.
    """

    if contigs_db is None:
        contigs_db = self

    frac_syns, frac_nonsyns = self.calc_synonymous_fraction(comparison=comparison)

    if potentials is None:
        if grouping == 'site':
            potentials = self._get_per_position_potential(comparison=comparison)
        elif grouping == 'gene':
            potentials = self._get_per_gene_potential(contigs_db=contigs_db, comparison=comparison)
        else:
            raise NotImplementedError(f"calc_pN_pS doesn't know the grouping '{grouping}'")

    group_tag = (grouping + '_') if grouping != 'site' else ''
    pN_name, pS_name = f"pN_{group_tag}{comparison}", f"pS_{group_tag}{comparison}"
    if log_transform:
        pN_name, pS_name = 'log_' + pN_name, 'log_' + pS_name

    self.data[pS_name] = frac_syns / potentials[:, 0]
    self.data[pN_name] = frac_nonsyns / potentials[:, 1]

    if log_transform:
        self.data[pS_name] = np.log10(self.data[pS_name] + 1e-4)
        self.data[pN_name] = np.log10(self.data[pN_name] + 1e-4)

    if add_potentials:
        nN_name, nS_name = f"nN_{group_tag}{comparison}", f"nS_{group_tag}{comparison}"
        self.data[nS_name] = potentials[:, 0]
        self.data[nN_name] = potentials[:, 1]
```
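To make the site-level arithmetic concrete, here is a worked example for one hypothetical SCV. The fractions are invented for illustration; the 4/3 and 5/3 potentials are the CGA values quoted in the docstring above:

```python
# Hypothetical SCV where 25% of the variant reads are synonymous relative to
# the comparison codon CGA (numbers invented for illustration).
frac_syn, frac_nonsyn = 0.25, 0.75

# Synonymity potentials for CGA: 4/3 synonymous sites and 5/3 non-synonymous
# sites -- the "1.33 and 1.66" quoted in the docstring.
n_syn_sites, n_nonsyn_sites = 4 / 3, 5 / 3

pS = frac_syn / n_syn_sites        # ~0.1875
pN = frac_nonsyn / n_nonsyn_sites  # ~0.45
ratio = pN / pS                    # pN/pS ~2.4
```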

If we implemented the Nei-Gojobori method, it seems like we would need to change the calculation of synonymous and non-synonymous fractions.

In the current implementation, every codon allele contributes either to the number of synonymous differences or to the number of non-synonymous differences. The amount it contributes is defined by the allele's frequency.

In contrast, the Nei-Gojobori method says that double- or triple-nucleotide differences contribute twice or three times as much to the number of synonymous and non-synonymous differences, and they have the capacity to contribute to both simultaneously. From the paper:

[screenshot of the relevant passage from the Nei-Gojobori paper]
#2196

Scope of changes

From what I can tell, implementing the Nei-Gojobori method requires changing how we calculate the number of s- and ns-differences, not how we calculate s- and ns-potentials. I found this worth mentioning because in your post you refer to potentials in a way that is different from how I define potentials in the codebase:

anvio/anvio/utils.py

Lines 1646 to 1657 in 4eb5767

```python
def get_synonymous_and_non_synonymous_potential(list_of_codons_in_gene, just_do_it=False):
    """
    When calculating pN/pS or dN/dS, the number of variants classified as synonymous or non
    synonymous need to be normalized by the sequence's potential for synonymous and
    non-synonymous changes. That is calculated by mutating each position to the other 3
    nucleotides and calculating whether the mutation is synonymous or non synonymous. Each
    mutation gets a score of 1/3, since there are 3 possible mutations for each site. If the
    sequence is of length L, the nonsynonymous and synonymous potentials sum to L.

    list_of_codons_in_gene is a list of the codons as they appear in the gene sequence, e.g.
    ['ATG', ..., 'TAG'], which can be generated from utils.get_list_of_codons_for_gene_call
    """
```

Do you agree with me on this point? If so, I believe that calc_synonymous_fraction and its delegate, _calculate_synonymous_fraction, are the only functions that would need to be refactored. Here is calc_synonymous_fraction:

anvio/anvio/variabilityops.py

Lines 2558 to 2606 in 4eb5767

```python
def calc_synonymous_fraction(self, comparison='reference'):
    """Compute the fraction of synonymous and non-synonymous substitutions relative to a reference"""

    self.progress.new('Calculating per-site synonymous fraction')
    self.progress.update('...')

    # Some lookups
    coding_codons = sorted(constants.coding_codons)
    codon_to_num = {coding_codons[i]: i for i in range(len(coding_codons))}
    codon_nums = np.arange(len(coding_codons))
    is_synonymous = np.zeros((len(coding_codons), len(coding_codons))).astype(bool)
    for i, codon1 in enumerate(coding_codons):
        for j, codon2 in enumerate(coding_codons):
            if constants.codon_to_AA[codon1] == constants.codon_to_AA[codon2]:
                is_synonymous[i, j] = True
            else:
                is_synonymous[i, j] = False

    # Populate necessary per-site info
    self.progress.update('...')
    if comparison == 'popular_consensus':
        self.progress.update('Finding popular consensus; You have time for a quick stretch')
        self.data['popular_consensus'] = self.data.\
            groupby('unique_pos_identifier')\
            ['consensus'].\
            transform(lambda x: x.value_counts().index[0])

    comparison_array = self.data\
        [comparison].\
        map(codon_to_num).\
        fillna(-1).\
        values.\
        astype(int)
    counts_array = self.data[constants.coding_codons].values.astype(int)
    coverage = self.data['coverage'].values.astype(int)

    self.progress.update("You're ungrateful if you think this is slow")
    frac_syns, frac_nonsyns = _calculate_synonymous_fraction(
        counts_array,
        comparison_array,
        coverage,
        codon_nums,
        is_synonymous,
    )
    self.progress.end()

    return np.array(frac_syns), np.array(frac_nonsyns)
```

calc_synonymous_fraction essentially synthesizes the required data from `self` into numerical arrays that are passed to _calculate_synonymous_fraction, a just-in-time compiled workhorse, which returns the synonymous and non-synonymous fractions.

Based on what I understood from reading the paper, we need to implement the following:

  • Each step in each mutational path contributes to either an s- or ns-difference (our current method only captures this if the allele differs by a single nucleotide)

  • The "contribution" of each allele should be twice as much for alleles with 2-nucleotide differences and thrice as much for alleles with 3-nucleotide differences. This is based on the fact that in their GTT → GTA example, s_d + n_d = 1, whereas in their TTT → GTA example, s_d + n_d = 2
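As a sketch of what that could look like for a single SCV (the allele frequencies are invented; the (s_d, n_d) pairs are the path-averaged Nei-Gojobori difference counts for each allele against a GTA reference, so TTT, which differs at two positions, contributes to both tallies and twice as much in total):

```python
# Hypothetical SCV: reference codon GTA with two variant alleles. GTT differs
# by one (synonymous) nucleotide, while TTT differs by two, so its
# s_d + n_d = 2, as in the paper's TTT/GTA example.
allele_freqs = {'GTT': 0.6, 'TTT': 0.4}                # invented frequencies
differences  = {'GTT': (1.0, 0.0), 'TTT': (0.5, 1.5)}  # (s_d, n_d) per allele

syn_diffs    = sum(f * differences[c][0] for c, f in allele_freqs.items())
nonsyn_diffs = sum(f * differences[c][1] for c, f in allele_freqs.items())
# syn_diffs ~0.8, nonsyn_diffs ~0.6; under the current either/or scheme,
# TTT's frequency would instead land entirely in a single tally.
```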

Unknowns

I don't understand this part, but I think it may play an important role in their method, and it is definitely missing from anvi'o's current implementation:

[screenshot of a formula from the paper]

@apcamargo
Author

If I understand it correctly, that formula is there to account for silent substitutions: it takes the observed distance between two sequences and computes the estimated distance assuming a uniform substitution rate.
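For what it's worth, that description matches the Jukes-Cantor correction, which Nei and Gojobori apply to both pS and pN to estimate dS and dN. A sketch:

```python
import math

def jukes_cantor_distance(p):
    """Estimated substitutions per site from an observed proportion of
    differences p, assuming a uniform substitution rate:

        d = -(3/4) * ln(1 - (4/3) * p)
    """
    return -0.75 * math.log(1 - (4 / 3) * p)
```

For small p the correction is negligible (d is close to p), but it grows as multiple substitutions at the same site become likely.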
