New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: detect nested ORFs #501
Conversation
start_pos: [Option<usize>; 3], | ||
start_pos: [Vec<usize>; 3], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there can now be 0 or more active start positions (instead of just 0 or 1), we move to Vec
instead of Option
codon: VecDeque<u8>, | ||
found: VecDeque<Orf>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If more than one ORF is found after .next()
is called, then they are added to a queue and popped on each subsequent call of .next()
before sequence scanning resumes
// return any orfs already found | ||
if !self.state.found.is_empty() { | ||
return self.state.found.pop_front(); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are already found ORFs than haven't been reported yet, then this queue should be exhausted before moving back to scanning
// check if entering orf | ||
if self.finder.start_codons.contains(&self.state.codon) { | ||
self.state.start_pos[offset].push(index); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer can be an else if
, since you can be in an ORF and find a new starting site now. I've moved this to the top here (before checking if we are currently in an ORF) just for clarity when reading
for start_pos in &self.state.start_pos[offset] { | ||
// check if length is sufficient | ||
if index + 1 - start_pos > self.finder.min_len { | ||
// build results | ||
self.state.found.push_back(Orf { | ||
start: start_pos - 2, | ||
end: index + 1, | ||
offset: offset as i8, | ||
}); | ||
// if the first orf is too short, so are the others | ||
} else { | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When a stop codon is found, loop through and queue all of the ORFs that can be constructed from the different start sites. Since these start sites are ordered from first to last (as they appear in the sequence), we can assume that if the ORF constructed from one start site is too small, then all of the subsequent start sites will yield an even smaller, also invalid ORF. To save work in these cases, we break early from the loop.
|
||
#[test] | ||
fn test_three_nested_and_offset_orfs() { | ||
let finder = basic_finder(); | ||
let sequence = b"ATGGGGATGGGGGGATGGAAAAATAAGTAG"; | ||
let expected = vec![ | ||
Orf { | ||
start: 14, | ||
end: 26, | ||
offset: 2, | ||
}, | ||
Orf { | ||
start: 0, | ||
end: 30, | ||
offset: 0, | ||
}, | ||
Orf { | ||
start: 6, | ||
end: 30, | ||
offset: 0, | ||
}, | ||
]; | ||
assert_eq!(expected, finder.find_all(sequence).collect::<Vec<Orf>>()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests that nesting is detected and doesn't break scanning in other reading frames
Once this has passed initial review I have a few more usability tweaks I'd like to make when it comes to function types and derived trait implementations :) |
1681e14
to
34399b3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much! Looks good to me!
Let's do the usability improvements you have in mind in a separate PR. |
When testing the Rust Bio ORF finder against this Rosalind problem, I discovered that the
MTPRLGLESLLE
ORF was being missed! This was because it's an ORF nested within theMGMTPRLGLESLLE
ORF in the same frame and the old algorithm did a single pass and only recorded one ORF starting position at a time.The new algorithm still performs only a single pass, but can build up a list of alternative start positions before a stop is encountered.
I've added some more comments in this thread to justify my changes and wrote a new test for this nested case. I've switched to using this fork for my syn-zeug project, but I'd love to see this merged and move back to using upstream!