Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: detect nested ORFs #501

Merged
merged 1 commit into from Sep 6, 2022
Merged

Conversation

TheLostLambda
Copy link
Contributor

When testing the Rust Bio ORF finder against this Rosalind problem, I discovered that the MTPRLGLESLLE ORF was being missed! This was because it's an ORF nested within the MGMTPRLGLESLLE ORF in the same frame and the old algorithm did a single pass and only recorded one ORF starting position at a time.

The new algorithm still performs only a single pass, but can build up a list of alternative start positions before a stop is encountered.

I've added some more comments in this thread to justify my changes and wrote a new test for this nested case. I've switched to using this fork for my syn-zeug project, but I'd love to see this merged and move back to using upstream!

start_pos: [Option<usize>; 3],
start_pos: [Vec<usize>; 3],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there can now be 0 or more active start positions (instead of just 0 or 1), we move to Vec instead of Option

codon: VecDeque<u8>,
found: VecDeque<Orf>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If more than one ORF is found after .next() is called, then they are added to a queue and popped on each subsequent call of .next() before sequence scanning resumes

Comment on lines +133 to +137
// return any orfs already found
if !self.state.found.is_empty() {
return self.state.found.pop_front();
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are already found ORFs than haven't been reported yet, then this queue should be exhausted before moving back to scanning

Comment on lines +146 to +149
// check if entering orf
if self.finder.start_codons.contains(&self.state.codon) {
self.state.start_pos[offset].push(index);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer can be an else if, since you can be in an ORF and find a new starting site now. I've moved this to the top here (before checking if we are currently in an ORF) just for clarity when reading

Comment on lines +154 to +166
for start_pos in &self.state.start_pos[offset] {
// check if length is sufficient
if index + 1 - start_pos > self.finder.min_len {
// build results
self.state.found.push_back(Orf {
start: start_pos - 2,
end: index + 1,
offset: offset as i8,
});
// if the first orf is too short, so are the others
} else {
break;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a stop codon is found, loop through and queue all of the ORFs that can be constructed from the different start sites. Since these start sites are ordered from first to last (as they appear in the sequence), we can assume that if the ORF constructed from one start site is too small, then all of the subsequent start sites will yield an even smaller, also invalid ORF. To save work in these cases, we break early from the loop.

Comment on lines +240 to +263

#[test]
fn test_three_nested_and_offset_orfs() {
let finder = basic_finder();
let sequence = b"ATGGGGATGGGGGGATGGAAAAATAAGTAG";
let expected = vec![
Orf {
start: 14,
end: 26,
offset: 2,
},
Orf {
start: 0,
end: 30,
offset: 0,
},
Orf {
start: 6,
end: 30,
offset: 0,
},
];
assert_eq!(expected, finder.find_all(sequence).collect::<Vec<Orf>>());
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests that nesting is detected and doesn't break scanning in other reading frames

@TheLostLambda
Copy link
Contributor Author

Once this has passed initial review I have a few more usability tweaks I'd like to make when it comes to function types and derived trait implementations :)

@coveralls
Copy link

coveralls commented Sep 3, 2022

Coverage Status

Coverage increased (+0.01%) to 88.405% when pulling 34399b3 on TheLostLambda:master into 8d7565b on rust-bio:master.

Copy link
Contributor

@johanneskoester johanneskoester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much! Looks good to me!

@johanneskoester
Copy link
Contributor

Let's do the usability improvements you have in mind in a separate PR.

@johanneskoester johanneskoester merged commit 19b6c36 into rust-bio:master Sep 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants