Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix indexing error for reverse strand in calc_orf_gc #101

Open
wants to merge 1 commit into
base: GoogleImport
Choose a base branch
from

Conversation

althonos
Copy link
Contributor

Hi again!

Overview

I noticed while addressing #100 that the GC% computed for genes in the reverse strand was often wrong by a small margin, despite the correct gene sequence being extracted, pointing at an indexing error. After checking for out-of-bound reads, I noticed that in calc_orf_gc the loop would read past sequence end in the following part:

      for(j = last[fr]+3; j <= nod[i].ndx; j+=3)
        gc[fr] += is_gc(seq, j) + is_gc(seq, j+1) + is_gc(seq, j+2);

Indeed, on the reverse strand, last[fr] is set to the index of the STOP codon; because for reverse-strand codon this is always the index of the last nucleotide, not the first, the iteration should start 1 nucleotide later, not 3.

Fix

Start the iteration at the right coordinates 馃槃

Example

Taking the same contig CAKWEX010000332.1 as in #100, I ran Prodigal on both the contig and its reverse complement; the genes predicted in both cases matched in sequences, but the GC% didn't match; namely, the GC content was wrong when the genes were on the reverse strand (i changed the gc_cont precision so that the difference is easier to see):

               gene strand    true_gc    fwd_gc    bwd_gc
CAKWEX010000332.1_1     -1  63.592526  0.635925  0.635925
CAKWEX010000332.1_2     -1  58.043118  0.582090  0.580431
CAKWEX010000332.1_3     -1  60.073260  0.601954  0.600733

After applying the fix, the GC-content is consistent independently of whether the gene is on the direct or reverse strand:

               gene strand    true_gc    fwd_gc    bwd_gc
CAKWEX010000332.1_1     -1  63.592526  0.635925  0.635925
CAKWEX010000332.1_2     -1  58.043118  0.580431  0.580431
CAKWEX010000332.1_3     -1  60.073260  0.600733  0.600733

@althonos
Copy link
Contributor Author

althonos commented Sep 7, 2023

@hyattpd : Do you think you could merge #100 and this PR as well? Thanks 馃槂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant