Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix buffer initialization in SD motif detection code #100

Merged
merged 2 commits into from
Jan 27, 2023

Conversation

althonos
Copy link
Contributor

Hi again !

A very detailed report by @pchaumeil in althonos/pyrodigal#28 helped me discover a new bug in the Prodigal source. This is affecting the detection of Shine-Dalgarno motifs for start codons on the reverse strand, where the searched region may go out-of-bound.

Overview

In shine_dalgarno_mm and shine_dalgarno_exact, the match buffer stores a score for each position of the motif being searched. In the event the region overlaps with the sequence edge, only the sequence characters should be matched. However, an issue with indexing causes the initialization not to happen for ahead-of-start nucleotides, and the match buffer may contain random bytes. This can lead to the wrong RBS motif being identified, and occasionally some genes having wrong coordinates.

Fix

Fix initialization of the match buffer so that all positions are set to a default mismatch value (-10). In the code that compares the 6-base region to AGGAGG, positions before 0 are skipped, and so match[i] may no be initialized for all values of i:

  for(i = 0; i < limit; i++) {
    if(pos+i < 0) continue;
    if(i%3 == 0 && is_a(seq, pos+i) == 1) match[i] = 2.0;
    else if(i%3 != 0 && is_g(seq, pos+i) == 1) match[i] = 3.0;
    else match[i] = -10.0;
  }

Initializing the array to -10 makes the undefined behaviour disappear. In fact, since match is stack-allocated, if may contain remnants of a previous call, like 2.0 or 3.0, which made the error hard to debug in the first place.

Example

In the assembly GCA_934838455.1, there is a contig (CAKWEX010000332.1) which ends with the following nucleotide sequence:

...CATGGGGCCTATCC
   ^^^        ^^^
   (1)        (2)

It contains a start codon (1) and a RBS motif (2) on the reverse strand. However, after running Prodigal on this contig, the predicted gene is:

>CAKWEX010000332.1_3 # 2936 # 3754 # -1 # ID=1_3;partial=00;start_type=ATG;rbs_motif=AGGA;rbs_spacer=5-10bp;gc_cont=0.602
MSNHFEGLGKTWLTLLNDPEKEVPAVVMQVMKEGKTRDCWQRKDSKEETMVLAWPVETGF
RAGVTVHGNAGDQLRPVSTYPLLEGAPNDMTVNETYLWQNETEGEVSATCNEGANPLWFY
SPFLFRDRENLTPGVRHTFLIAGLAYGLRRALLDEMTITEGVEYERYVAEWLAQNPGKTR
LDVPQLTVDLRGARIVVPGDVASEYQIRVPVTSVEEMHIQNEKIYMLIVEFGLNTPNPLR
FPLYAPERVCKIVPQAGDEIDAIIWLQGRIID*

As you can see, the RBS motif is AGGA; however, there is no AGGA in the sequence upstream of the start codon, only GGA. The missing A is actually an artifact of the match buffer not being initialized. After applying the fix, the gene is predicted with a GGA motif, as expected:

>CAKWEX010000332.1_3 # 2936 # 3754 # -1 # ID=1_3;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.602
MSNHFEGLGKTWLTLLNDPEKEVPAVVMQVMKEGKTRDCWQRKDSKEETMVLAWPVETGF
RAGVTVHGNAGDQLRPVSTYPLLEGAPNDMTVNETYLWQNETEGEVSATCNEGANPLWFY
SPFLFRDRENLTPGVRHTFLIAGLAYGLRRALLDEMTITEGVEYERYVAEWLAQNPGKTR
LDVPQLTVDLRGARIVVPGDVASEYQIRVPVTSVEEMHIQNEKIYMLIVEFGLNTPNPLR
FPLYAPERVCKIVPQAGDEIDAIIWLQGRIID*

@althonos
Copy link
Contributor Author

althonos commented Feb 3, 2023

@hyattpd: Do you think you could have a look at #101 and #102 too? 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants