
[BUG] Potential computation in anvi-get-pn-ps-ratio is incorrect #2195

Open
apcamargo opened this issue Dec 26, 2023 · 2 comments

@apcamargo

Short description of the problem

When anvi'o computes the potential of a given codon, it does so by evaluating whether mutations in one position (that is, with a Hamming distance of 1) generate synonymous or non-synonymous codons. However, the Nei-Gojobori method takes into account the distance between pairs of codons when computing potentials. That is, if you have the ACC codon in the reference and an ATT variant, the potential of ACC is computed by evaluating all possible mutation paths between those two codons.

You can find an implementation of the Nei-Gojobori method here.
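As a minimal sketch of that path-averaged counting (not anvi'o code; the toy codon table covers only the codons used in the examples, where a full implementation would use the complete genetic code, e.g. anvi'o's `constants.codon_to_AA`):

```python
from itertools import permutations

# Toy codon table covering only the codons used in this example.
CODON_TO_AA = {
    'ACC': 'Thr', 'ATC': 'Ile', 'ACT': 'Thr', 'ATT': 'Ile',
    'TTT': 'Phe', 'GTT': 'Val', 'TTA': 'Leu', 'GTA': 'Val',
}

def count_differences(codon1, codon2, codon_to_aa=CODON_TO_AA):
    """Path-averaged synonymous/non-synonymous differences, Nei-Gojobori style.

    Enumerates every order in which the differing positions could have mutated,
    classifies each single-nucleotide step as synonymous or non-synonymous, and
    averages over all paths. (The published method also excludes paths passing
    through stop codons; that detail is omitted in this sketch.)
    """
    diff_positions = [i for i in range(3) if codon1[i] != codon2[i]]
    total_syn = total_nonsyn = n_paths = 0
    for order in permutations(diff_positions):
        current = list(codon1)
        for pos in order:
            before = ''.join(current)
            current[pos] = codon2[pos]
            after = ''.join(current)
            if codon_to_aa[before] == codon_to_aa[after]:
                total_syn += 1
            else:
                total_nonsyn += 1
        n_paths += 1
    return total_syn / n_paths, total_nonsyn / n_paths
```

For ACC vs. ATT, the two mutation paths average out to (s_d, n_d) = (1.0, 1.0); for TTT vs. GTA, the example worked in the paper, they give (0.5, 1.5), so s_d + n_d equals the number of differing nucleotides.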

anvi'o version

Anvi'o .......................................: marie (v8)
Python .......................................: 3.10.12

Profile database .............................: 38
Contigs database .............................: 21
Pan database .................................: 16
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2

System info

Linux 4.18.0-425.3.1.el8.x86_64 x86_64

anvi'o executed via Docker.

@ekiefl
Contributor

ekiefl commented Jan 4, 2024

Hi @apcamargo,

Thanks for the bug report. I agree that this might be an improvement relative to the current method.

Ramblings

I'm embarrassed to say that I'm having trouble understanding the code that I wrote so many years ago, but let's start from CodonsEngine.calc_pN_pS (from within anvio.variabilityops):

anvio/anvio/variabilityops.py

Lines 2647 to 2722 in 4eb5767

```python
def calc_pN_pS(self, contigs_db=None, grouping='site', comparison='reference', potentials=None, add_potentials=False, log_transform=False):
    """Calculate new columns in self.data corresponding to each site's contribution to a grouping

    First, this function calculates fN and fS for each SCV relative to a comparison codon (see
    `comparison`). fN and fS are respectively the fraction of non-synonymous and synonymous
    variation observed in the SAAV. fN and fS sum to 1. Then, each fN and fS value is divided
    by the number of non-synonymous sites and synonymous sites (aka the synonymity potentials),
    which will depend on the grouping (see `grouping`).

    Parameters
    ==========
    contigs_db : dbops.ContigsSuperclass
        If None, it is assumed that `self` has inherited `ContigsSuperclass` already.
    grouping : str, 'site'
        This keyword specifies the number of synonymous and non-synonymous sites that fN and fS
        should be divided by to yield pN and pS. E.g. if `grouping` is 'gene', then each SCV
        belonging to gene X should be divided by the number of synonymous and non-synonymous
        sites that are present in gene X's sequence, i.e. the output of
        `utils.get_synonymous_and_non_synonymous_potential`. As another example, if `grouping`
        is 'site' and the comparison codon at a site is CGA, then there are 1.33 synonymous
        sites and 1.66 non-synonymous sites. So each SCV's pN and pS should be calculated by
        taking fN and fS, and dividing them by 1.66 and 1.33 respectively. If `potentials` is
        None, the available `grouping` values are {'gene', 'site'}, otherwise `grouping` can be
        assigned to anything, and it will be used to name the columns (see Returns).
    comparison : str, 'reference'
        This specifies the comparison codon that should be used for determining synonymity.
        Options are {'reference', 'consensus', 'popular_consensus'}. reference means the codon
        in the reference sequence is the comparison codon, consensus means the most common
        codon in the SCV should be used as the reference, and popular_consensus means that the
        most frequently observed consensus codon across all samples should be used as the
        comparison codon.
    potentials : numpy.array, None
        Most people should not use this. It provides a way to create custom groupings. If
        passed, it should be a (len(self.data), 2) shaped numpy array. potentials[:,0] and
        potentials[:,1] specify the amount that each SCV's fS and fN values should be divided
        by to yield pS and pN.
    log_transform : bool, False
        Log transforms pN and pS values by log10(x + 1e-4).

    Returns
    =======
    output : None
        This function does not return anything. It creates 2 new columns in `self.data` with
        names pN_{grouping}_{comparison} and pS_{grouping}_{comparison}, unless `grouping` is
        'site', in which case the column names are pN_{comparison} and pS_{comparison}. If
        add_potentials is True, the number of synonymous and nonsynonymous sites will be
        added as the columns nS_{grouping}_{comparison} and nN_{grouping}_{comparison}.
    """

    if contigs_db is None:
        contigs_db = self

    frac_syns, frac_nonsyns = self.calc_synonymous_fraction(comparison=comparison)

    if potentials is None:
        if grouping == 'site':
            potentials = self._get_per_position_potential(comparison=comparison)
        elif grouping == 'gene':
            potentials = self._get_per_gene_potential(contigs_db=contigs_db, comparison=comparison)
        else:
            raise NotImplementedError(f"calc_pN_pS doesn't know the grouping '{grouping}'")

    group_tag = (grouping + '_') if grouping != 'site' else ''
    pN_name, pS_name = f"pN_{group_tag}{comparison}", f"pS_{group_tag}{comparison}"
    if log_transform:
        pN_name, pS_name = 'log_' + pN_name, 'log_' + pS_name

    self.data[pS_name] = frac_syns / potentials[:, 0]
    self.data[pN_name] = frac_nonsyns / potentials[:, 1]

    if log_transform:
        self.data[pS_name] = np.log10(self.data[pS_name] + 1e-4)
        self.data[pN_name] = np.log10(self.data[pN_name] + 1e-4)

    if add_potentials:
        nN_name, nS_name = f"nN_{group_tag}{comparison}", f"nS_{group_tag}{comparison}"
        self.data[nS_name] = potentials[:, 0]
        self.data[nN_name] = potentials[:, 1]
```
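To make the site-level arithmetic concrete, here is a worked example for one hypothetical SCV. The fractions are invented for illustration; the 4/3 and 5/3 potentials are the CGA values quoted in the docstring above:

```python
# Hypothetical SCV where 25% of the variant reads are synonymous relative to
# the comparison codon CGA (numbers invented for illustration).
frac_syn, frac_nonsyn = 0.25, 0.75

# Synonymity potentials for CGA: 4/3 synonymous sites and 5/3 non-synonymous
# sites -- the "1.33 and 1.66" quoted in the docstring.
n_syn_sites, n_nonsyn_sites = 4 / 3, 5 / 3

pS = frac_syn / n_syn_sites        # ~0.1875
pN = frac_nonsyn / n_nonsyn_sites  # ~0.45
ratio = pN / pS                    # pN/pS ~2.4
```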

If we implemented the Nei-Gojobori method, it seems like we would need to change the calculation of synonymous and non-synonymous fractions.

In the current implementation, every codon allele contributes either to the number of synonymous differences or to the number of non-synonymous differences. The amount it contributes is defined by the allele's frequency.

In contrast, the Nei-Gojobori method says that double- or triple-nucleotide differences contribute twice or three times as much to the number of synonymous and non-synonymous differences, and they have the capacity to contribute to both simultaneously. From the paper:

[screenshot of the relevant passage from the Nei-Gojobori paper]
#2196

Scope of changes

From what I can tell, implementing the Nei-Gojobori method requires changing how we calculate the number of s- and ns-differences, not how we calculate s- and ns-potentials. I found this worth mentioning because in your post you refer to potentials in a way that is different from how I define potentials in the codebase:

anvio/anvio/utils.py

Lines 1646 to 1657 in 4eb5767

```python
def get_synonymous_and_non_synonymous_potential(list_of_codons_in_gene, just_do_it=False):
    """
    When calculating pN/pS or dN/dS, the number of variants classified as synonymous or non
    synonymous need to be normalized by the sequence's potential for synonymous and
    non-synonymous changes. That is calculated by mutating each position to the other 3
    nucleotides and calculating whether the mutation is synonymous or non synonymous. Each
    mutation gets a score of 1/3, since there are 3 possible mutations for each site. If the
    sequence is of length L, the nonsynonymous and synonymous potentials sum to L.

    list_of_codons_in_gene is a list of the codons as they appear in the gene sequence, e.g.
    ['ATG', ..., 'TAG'], which can be generated from utils.get_list_of_codons_for_gene_call
    """
```

Do you agree with me on this point? If so, I believe that calc_synonymous_fraction and its delegate, _calculate_synonymous_fraction, are the only functions that would need to be refactored. Here is calc_synonymous_fraction:

anvio/anvio/variabilityops.py

Lines 2558 to 2606 in 4eb5767

```python
def calc_synonymous_fraction(self, comparison='reference'):
    """Compute the fraction of synonymous and non-synonymous substitutions relative to a reference"""

    self.progress.new('Calculating per-site synonymous fraction')
    self.progress.update('...')

    # Some lookups
    coding_codons = sorted(constants.coding_codons)
    codon_to_num = {coding_codons[i]: i for i in range(len(coding_codons))}
    codon_nums = np.arange(len(coding_codons))
    is_synonymous = np.zeros((len(coding_codons), len(coding_codons))).astype(bool)
    for i, codon1 in enumerate(coding_codons):
        for j, codon2 in enumerate(coding_codons):
            if constants.codon_to_AA[codon1] == constants.codon_to_AA[codon2]:
                is_synonymous[i, j] = True
            else:
                is_synonymous[i, j] = False

    # Populate necessary per-site info
    self.progress.update('...')
    if comparison == 'popular_consensus':
        self.progress.update('Finding popular consensus; You have time for a quick stretch')
        self.data['popular_consensus'] = self.data.\
            groupby('unique_pos_identifier')\
            ['consensus'].\
            transform(lambda x: x.value_counts().index[0])

    comparison_array = self.data\
        [comparison].\
        map(codon_to_num).\
        fillna(-1).\
        values.\
        astype(int)
    counts_array = self.data[constants.coding_codons].values.astype(int)
    coverage = self.data['coverage'].values.astype(int)

    self.progress.update("You're ungrateful if you think this is slow")
    frac_syns, frac_nonsyns = _calculate_synonymous_fraction(
        counts_array,
        comparison_array,
        coverage,
        codon_nums,
        is_synonymous,
    )
    self.progress.end()

    return np.array(frac_syns), np.array(frac_nonsyns)
```

calc_synonymous_fraction essentially synthesizes the required data from `self` into numerical arrays that are passed to _calculate_synonymous_fraction, a just-in-time compiled workhorse, which returns the synonymous and non-synonymous fractions.

Based on what I understood from reading the paper, we need to implement the following:

  • Each step in each mutational path contributes to either an s- or ns-difference (our current method only captures this if the allele differs by a single nucleotide)

  • The "contribution" of each allele should be twice as much for alleles with 2-nucleotide differences and thrice as much for alleles with 3-nucleotide differences. This is based on the fact that in their GTT → GTA example, s_d + n_d = 1, whereas in their TTT → GTA example, s_d + n_d = 2
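As a sketch of what that could look like for a single SCV (the allele frequencies are invented; the (s_d, n_d) pairs are the path-averaged Nei-Gojobori difference counts for each allele against a GTA reference, so TTT, which differs at two positions, contributes to both tallies and twice as much in total):

```python
# Hypothetical SCV: reference codon GTA with two variant alleles. GTT differs
# by one (synonymous) nucleotide, while TTT differs by two, so its
# s_d + n_d = 2, as in the paper's TTT/GTA example.
allele_freqs = {'GTT': 0.6, 'TTT': 0.4}                # invented frequencies
differences  = {'GTT': (1.0, 0.0), 'TTT': (0.5, 1.5)}  # (s_d, n_d) per allele

syn_diffs    = sum(f * differences[c][0] for c, f in allele_freqs.items())
nonsyn_diffs = sum(f * differences[c][1] for c, f in allele_freqs.items())
# syn_diffs ~0.8, nonsyn_diffs ~0.6; under the current either/or scheme,
# TTT's frequency would instead land entirely in a single tally.
```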

Unknowns

I don't understand this part, but I think it may play an important role in their method, and it is definitely missing from anvi'o's current implementation:

[screenshot of a formula from the paper]

@apcamargo
Author

If I understand it correctly, that formula is there to account for silent substitutions: it takes the observed distance between two sequences and computes the estimated distance assuming a uniform substitution rate.
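For what it's worth, that description matches the Jukes-Cantor correction, which Nei and Gojobori apply to both pS and pN to estimate dS and dN. A sketch:

```python
import math

def jukes_cantor_distance(p):
    """Estimated substitutions per site from an observed proportion of
    differences p, assuming a uniform substitution rate:

        d = -(3/4) * ln(1 - (4/3) * p)
    """
    return -0.75 * math.log(1 - (4 / 3) * p)
```

For small p the correction is negligible (d is close to p), but it grows as multiple substitutions at the same site become likely.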
