Keep cigartuples #108

Open · wants to merge 46 commits into master
Conversation

heidi-holappa (Collaborator)

Introduction

This pull request adds a new feature to IsoQuant for predicting and correcting errors in long reads.

Constants

The constants are collected at the start of the function correct_transcript_splice_sites, from which the code execution starts. This way they can be conveniently moved outside of the function or reconfigured in the future if needed. One possible use case would be to give the user the option to select a strategy, or to alter the constants with command-line arguments (see the sketch after the note below).

def correct_transcript_splice_sites(arguments):
  # ...
  
  ACCEPTED_DEL_CASES = [3, 4, 5, 6]
  SUPPORTED_STRANDS = ['+', '-']
  THRESHOLD_CASES_AT_LOCATION = 0.7
  MIN_N_OF_ALIGNED_READS = 5
  WINDOW_SIZE = 8
  MORE_CONSERVATIVE_STRATEGY = False
  
  # ...

Note

Constant "Threshold cases at location" is only used when "More conservative strategy" is True. See section "error prediction strategies" for additional information.

Extracting cases and computing deletions

The assigned_reads list contains ReadAssignment objects. From each read the start and end locations and the cigartuples are extracted, and deletions are counted for each splice site between the start and end location. First, the locations falling within the read's start and end are extracted from the exons list. It is important to note that a read may start and end in the middle of an exon.

for read_assignment in assigned_reads:
    read_start = read_assignment.corrected_exons[0][0]
    read_end = read_assignment.corrected_exons[-1][1]
    cigartuples = read_assignment.cigartuples
    if not cigartuples:
        continue
    count_deletions_for_splice_site_locations(arguments)
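
For reference, cigartuples here follow the pysam convention: a list of (operation, length) pairs, where operation 0 is an alignment match (M), 1 an insertion (I), 2 a deletion (D), 3 a skipped region such as an intron (N), 4 soft clipping (S), 7 a sequence match (=) and 8 a mismatch (X). For example:

# The CIGAR string 10M2D5M as pysam-style cigartuples
cigartuples = [(0, 10), (2, 2), (0, 5)]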

For each matching location, the location is first added to the splice_site_cases dictionary if it is missing. After this, the deletions are computed from the cigartuples.

Note

The key-value pair 'del_pos_distr' is only needed for the more conservative strategy.

def count_deletions_for_splice_site_locations(arguments):
    
    matching_locations = extract_splice_site_locations_within_aligned_read(read_start, read_end, exons)
    
    for location in matching_locations:
        if location not in splice_site_cases:
            splice_site_cases[location] = {
                'location_is_end': bool,  
                'deletions': {},
                'del_pos_distr': [0 for _ in range(WINDOW_SIZE)],
                'most_common_del': -1,
                'canonical_bases_found': False
            }
        
        aligned_location = extract_location_from_cigar_string(arguments)
        
        count_deletions_from_cigar_codes_in_given_window(arguments)

The data structure of information to be extracted is the following:

{
    'location': 
        {
            'location_is_end': bool,  
            'deletions': dict,
            'del_pos_distr': list,
            'most_common_del': int,
            'canonical_bases_found': bool
        }
}
  • location_is_end: A boolean indicating whether the location is the end of an exon
  • deletions: A dictionary with number of deletions as key and count of reads as value
  • del_pos_distr: A list of integers with length of the predefined window. Each index contains the count of deletions in the respective position in the window.
  • most_common_del: An integer representing the most common case of deletion. If no distinct case is found, the value is $-1$. The value also indicates direction: if the location is the start of an exon, the value is positive; otherwise it is negative.
  • canonical_bases_found: A boolean stating whether candidate bases for a canonical pair exist at the distance of most_common_del from the current location.
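
A hypothetical example of one populated entry (all values invented for illustration, with WINDOW_SIZE = 8):

splice_site_cases = {
    1000: {
        'location_is_end': True,
        'deletions': {0: 1, 4: 9},   # 1 read with no deletions, 9 reads with 4
        'del_pos_distr': [0, 0, 9, 9, 9, 9, 0, 0],
        'most_common_del': -4,       # negative: the location is an exon end
        'canonical_bases_found': True
    }
}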

The computation of deletions from cigartuples happens in two steps. First, the aligned location is extracted from the cigartuples:

def extract_location_from_cigar_string(arguments):
    # Position of the splice site relative to the read start
    relative_position = splice_site_location - read_start
    alignment_position = 0
    ref_position = 0

    for cigar_code in cigartuples:

        # Operations M (0), D (2), N (3), = (7) and X (8) consume the reference
        if cigar_code[0] in [0, 2, 3, 7, 8]:
            ref_position += cigar_code[1]
        if ref_position <= relative_position and not \
                read_start + ref_position == read_end:
            alignment_position += cigar_code[1]
        else:
            # The splice site falls within this cigar operation: return its
            # offset in alignment coordinates
            return alignment_position + (cigar_code[1] - (ref_position - relative_position))

    return -1
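
As a worked example with invented values, consider a read starting at reference position 100 with CIGAR 10M2I10M and a splice site at 115:

# relative_position = 115 - 100 = 15
# 10M: ref_position = 10 (<= 15) -> alignment_position = 10
# 2I:  ref_position stays at 10  -> alignment_position = 12
# 10M: ref_position = 20 (> 15)  -> return 12 + (10 - (20 - 15)) = 17
# The 2-base insertion shifts the splice site from offset 15 to 17
# in alignment coordinates.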

After this, the cigartuples are iterated again and, starting from the aligned location, cigar codes are extracted for a window of the predefined window_size.

Note

This part of the code could be optimized by performing these two operations at once.

def count_deletions_from_cigar_codes_in_given_window(arguments):
    count_of_deletions = 0

    cigar_code_list = []
    location = 0

    # For an exon end, the window precedes the splice site
    if location_is_end:
        aligned_location = aligned_location - window_size + 1

    # Collect one cigar code per base for the window starting at aligned_location
    for cigar_code in cigartuples:
        if window_size == len(cigar_code_list):
            break
        if location + cigar_code[1] > aligned_location:
            overlap = location + \
                cigar_code[1] - (aligned_location + len(cigar_code_list))
            cigar_code_list.extend(
                [cigar_code[0] for _ in range(min(window_size -
                                                len(cigar_code_list), overlap))])
        location += cigar_code[1]

    # Count deletions (op code 2) and record their positions in the window
    for i in range(window_size):
        if i >= len(cigar_code_list):
            break
        if cigar_code_list[i] == 2:
            count_of_deletions += 1
            splice_site_data["del_pos_distr"][i] += 1

    if count_of_deletions not in splice_site_data["deletions"]:
        splice_site_data["deletions"][count_of_deletions] = 0

    splice_site_data["deletions"][count_of_deletions] += 1
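
To illustrate the bookkeeping with invented values, here is a self-contained re-run of the counting step for a window in which positions 2 to 4 are deletions (illustration code, not part of the PR):

WINDOW_SIZE = 8
splice_site_data = {"deletions": {}, "del_pos_distr": [0] * WINDOW_SIZE}

cigar_code_list = [0, 0, 2, 2, 2, 0, 0, 0]  # one op code per window position
count_of_deletions = 0
for i, code in enumerate(cigar_code_list):
    if code == 2:
        count_of_deletions += 1
        splice_site_data["del_pos_distr"][i] += 1
splice_site_data["deletions"].setdefault(count_of_deletions, 0)
splice_site_data["deletions"][count_of_deletions] += 1

print(splice_site_data)
# {'deletions': {3: 1}, 'del_pos_distr': [0, 0, 1, 1, 1, 0, 0, 0]}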

Correcting errors

The main function iterates through all extracted cases. If the number of reads aligned at the given location is at least MIN_N_OF_ALIGNED_READS, the location is verified for errors. If MORE_CONSERVATIVE_STRATEGY is selected, two additional verifications are made.

def correct_splice_site_errors(arguments):
    locations_with_errors = []
    for case in splice_site_cases:
        
        reads = sum of reads at current location
        if reads < MIN_N_OF_ALIGNED_READS:
            continue
        
        compute_most_common_del_and_verify_nucleotides(arguments)
        
        if MORE_CONSERVATIVE_STRATEGY:
            if not sublist_largest_values_exists(arguments):
                continue
            if not threshold_for_del_cases_exceeded(arguments):
                continue

        if canonical pair is found:
            locations_with_errors.append(location of case)
    
    return locations_with_errors
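
For reference, the read total used above can be computed directly from the extracted data structure, since each aligned read contributes exactly one count to the deletions dictionary:

reads = sum(splice_site_cases[case]["deletions"].values())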

The most common deletion is stored in the dictionary, as it is used in error correction if an error is found. The stored value encodes both distance and direction: at an exon start location the value is positive, and at an exon end location it is negative. For this reason the absolute value is checked against ACCEPTED_DEL_CASES.

def compute_most_common_del_and_verify_nucleotides(
        arguments):
    
    # Compute most common case of deletions
    splice_site_data["most_common_del"] = compute_most_common_case_of_deletions(
        arguments)
    
    # Extract nucleotides from most common deletion location if it is an accepted case
    if abs(splice_site_data["most_common_del"]) in ACCEPTED_DEL_CASES:
        extract_nucleotides_from_most_common_del_location(
            arguments)
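
The helper compute_most_common_case_of_deletions is not shown in this excerpt. A minimal sketch consistent with the behaviour described above (positive at exon starts, negative at exon ends, $-1$ when no distinct maximum exists); the signature here is hypothetical:

def compute_most_common_case_of_deletions(deletions, location_is_end):
    # deletions: {number_of_deletions: number_of_reads}
    highest_count = max(deletions.values())
    candidates = [dels for dels, reads in deletions.items()
                  if reads == highest_count]
    if len(candidates) != 1:
        return -1  # no distinct most common case
    # Sign encodes direction: positive at exon starts, negative at exon ends
    return -candidates[0] if location_is_end else candidates[0]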

For locations with a suitable most common deletion case, candidate bases for a canonical pair are verified. The strand and the location of the case (start or end of an exon) are taken into consideration.

Warning

At the time of writing it remains an open question whether the index correction is correctly set for IsoQuant. This needs to be verified.

def extract_nucleotides_from_most_common_del_location(
        arguments):
    idx_correction = 0
    extraction_start = location + most_common_del + idx_correction
    extraction_end = location + most_common_del + 2 + idx_correction
    try:
        extracted_canonicals = chr_record[extraction_start:extraction_end]
    except KeyError:
        extracted_canonicals = 'XX'

    # On the '-' strand the pairs are the reverse complements of the
    # '+' strand pairs
    canonical_pairs = {
        '+': {
            'start': ['AG', 'AC'],
            'end': ['GT', 'GC', 'AT']
        },
        '-': {
            'start': ['AC', 'GC', 'AT'],
            'end': ['CT', 'GT']
        }
    }

    if location_is_end:
        possible_canonicals = canonical_pairs[strand]['end']
    else:
        possible_canonicals = canonical_pairs[strand]['start']
    if extracted_canonicals in possible_canonicals:
        splice_site_data["canonical_bases_found"] = True
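
As a worked example with invented coordinates: for an exon end at position 1000 on the '+' strand whose most_common_del is $-4$:

extraction_start = 1000 + (-4) + 0     # 996
extraction_end = 1000 + (-4) + 2 + 0   # 998
# chr_record[996:998] is compared against canonical_pairs['+']['end'],
# i.e. ['GT', 'GC', 'AT']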

Finally a list of corrected exons is created:

def generate_updated_exon_list(arguments):
    updated_exons = []
    for exon in exons:
        updated_exon = exon
        if exon[0] in locations_with_errors:
            corrected_location = exon[0] + splice_site_cases[exon[0]]["most_common_del"]
            updated_exon = (corrected_location, exon[1])
        if exon[1] in locations_with_errors:
            corrected_location = exon[1] + splice_site_cases[exon[1]]["most_common_del"]
            # Build on updated_exon so that a corrected start is not discarded
            # when both ends of the exon need correction
            updated_exon = (updated_exon[0], corrected_location)
        updated_exons.append(updated_exon)
    return updated_exons
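
A small invented example: if the start of the second exon was flagged as an error with a stored most_common_del of $4$:

exons = [(100, 200), (300, 400)]
locations_with_errors = [300]
splice_site_cases = {300: {"most_common_del": 4}}
# generate_updated_exon_list then returns [(100, 200), (304, 400)]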

In the more conservative strategy two additional validations are made. There have to be $n$ adjacent nucleotides with values larger than or equal to those at all other positions (see the explanation in the next section):

def sublist_largest_values_exists(lst, n):
    # The n largest values in the window
    largest_values = set(sorted(lst, reverse=True)[:n])
    count = 0

    # Check whether n adjacent elements all belong to the n largest values
    for num in lst:
        if num in largest_values:
            count += 1
            if count >= n:
                return True
        else:
            count = 0

    return False
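
For example, with a window of eight values and $n = 3$:

print(sublist_largest_values_exists([0, 1, 5, 6, 7, 2, 0, 0], 3))  # True: 5, 6, 7 are adjacent
print(sublist_largest_values_exists([5, 0, 6, 0, 7, 2, 0, 0], 3))  # False: the three largest values are scattered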

Additionally, there have to be $n$ nucleotides (not necessarily adjacent) for which a preset threshold is exceeded. Note that because of the first additional constraint, whenever the return value is True, all nucleotides in the sublist of largest values also exceed this threshold.

def threshold_for_del_cases_exceeded(arguments):
    total_cases = sum of deletions
    nucleotides_exceeding_threshold = 0
    for value in del_pos_distr:
        if value > total_cases * THRESHOLD_CASES_AT_LOCATION:
            nucleotides_exceeding_threshold += 1
    return nucleotides_exceeding_threshold >= abs(most_common_del)
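
To illustrate with invented numbers: suppose 10 reads cover the location, the distinct most common case is 3 deletions, and the deletions pile up at window positions 2 to 4:

del_pos_distr = [0, 0, 9, 9, 9, 1, 0, 0]  # deletion counts per window position
total_cases = 10                          # total reads at this location
THRESHOLD_CASES_AT_LOCATION = 0.7
most_common_del = 3

exceeding = sum(1 for value in del_pos_distr
                if value > total_cases * THRESHOLD_CASES_AT_LOCATION)
print(exceeding >= abs(most_common_del))  # True: positions 2-4 each exceed 7 cases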

Error prediction strategies

Two strategies for error prediction are available:

Conservative:

  1. There has to be a distinct most common case of deletions, and it is one of the accepted deletion cases (constant ACCEPTED_DEL_CASES).
  2. There has to be a canonical pair at the distance of the most common case of deletions from the splice site (constant WINDOW_SIZE).
  3. The number of aligned reads at the given location must exceed a preset threshold (constant MIN_N_OF_ALIGNED_READS).

Very conservative:

  1. There has to be a distinct most common case of deletions, and it is one of the accepted deletion cases (constant ACCEPTED_DEL_CASES).
  2. There has to be a canonical pair at the distance of the most common case of deletions from the splice site (constant WINDOW_SIZE).
  3. The number of aligned reads at the given location must exceed a preset threshold (constant MIN_N_OF_ALIGNED_READS).
  4. There have to be at least $n$ indices ($n$ is the distinct most common case of deletions) at which a threshold for deletions is exceeded (constant THRESHOLD_CASES_AT_LOCATION).
  5. There have to be $n$ adjacent nucleotides with values larger than or equal to those at all other positions (see the explanation below).

Elaboration for condition 5:

Let $S$ be the list of elements in the window and $A = \{k_1, \ldots, k_n\}$ be a set of $n$ adjacent indices of $S$. Let $B = \{h_1, \ldots, h_m\}$ be the set of the remaining (possibly non-adjacent) indices of $S$, so that $\forall h_i \in B\; h_i \notin A$, $\forall k_j \in A\; k_j \notin B$ and $|A| + |B| = |S|$.

Now for condition 5 to hold it must be that

$$\forall k_j \in A \;\; \nexists\, h_i \in B \;\; \text{s.t.} \;\; S[k_j] < S[h_i].$$

Note: as this is a list of elements, it may have multiple elements with equal value.
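
For example, with $S = [0, 1, 5, 6, 7, 2, 0, 0]$ and $n = 3$, the adjacent indices $A = \{2, 3, 4\}$ select the values $5, 6, 7$, and no value at the remaining indices exceeds any of them, so the condition holds.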

andrewprzh and others added 30 commits August 8, 2023 17:58
@andrewprzh andrewprzh self-requested a review August 31, 2023 15:06