Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extension traits on Text/TextSlice for common biological sequence modifications #396

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

xyl012
Copy link
Contributor

@xyl012 xyl012 commented Nov 24, 2020

So after a little struggling, this is what I came up with so far. There's a new trait for String and Text, which returns a string or text. Although I wanted to originally make it take &str, I'm still getting a handle on 'a and wasn't sure if it would be really all that much different. (please correct me if I'm wrong!) In the case of an &str, the trait would basically create a string anyway and probably return a string, which would end up being the same as what we have here but with a .to_string() prior. There are a few other utilities I would like to create that &str might be a little more appropriate for which I'll be trying to use explicit lifetimes.

On an aside, if you guys think going by graphemes rather than characters would be beneficial I can also replace chars, but I didn't really see a use for biology.

Any input is appreciated! Cheers

@TianyiShi2001
Copy link
Member

TianyiShi2001 commented Nov 24, 2020

ご協力ありがとうございます!Thank you for your contribution!

Is there a specific reason for using String/&str instead of Vec<u8>/&[u8]? The latter is being used throughout rust-bio (aliased as Text/TextSlice) because strings used in bioinformatics rarely use non-ASCII characters, and Vec<u8>/&[u8] is faster than String/&str in Rust.

@xyl012
Copy link
Contributor Author

xyl012 commented Nov 24, 2020

Thanks for the review and input TianyiShi! String and (probably upstream) &str in this case were chosen mainly simply for ergonomics (and I'm not that good at rust). Although rust bio uses text and text slice, I would say that using string makes it adoptable to anyone using rust, even if it isn't strictly within the rust bio umbrella. As a middle ground, I have created the trait for text too.

Originally I had planned/started writing to intake/output u8 arrays and vecs but like anything in Rust, with lower level access we increase the complexity and hedge our bets that users are willing to go through learning about u8s vecs arrays and how this applies to strings. Most people I think just want a string, and there's so much space in these kinds of structures I started to get lost, but I'm studying and trying to get there. If you have the knowhow, feel free to edit!

In short, you're asking about future optimization (or maybe another trait completely?) which I just don't really know how to implement yet but as you noted, slow isn't a rust thing, and I am thinking about how to implement exactly what you said. As a minimum, I think I did a thing :)

I'm pretty new to Rust and this kind of programming so please correct or counter these points if it seems incorrect. This is my first trait! :D

@TianyiShi2001
Copy link
Member

using string makes it adoptable to anyone using rust

I think the ease-of-use for the end user depends on the interoperability between I/O tools (e.g. FASTA, FASTQ readers/writers) and algorithms (e.g. GC content, suffix array, sequence alignment). If they're all consistent on either on String or bytes, then it's easy to use (since you don't need to convert back and forth). However, as a developer who cares about performance, you'll definitely prefer bytes over Strings for representing biological sequences. It's much more easier to manipulate and has better performance (the .chars() method I saw in you code is particularly slow, because it needs to determine the next code point for each character).

Indeed, almost all algorithms in this crate use bytes (rust_bio::utils::text::Textslice), so does rust-htslib (another rust-bio project) and another independent FASTA/FASTQ parser/writer, seq-io (which is faster than rust_bio::io::{fasta, fastq}.

(for some reason the I/O tools in this crate uses Strings, and I've just submitted an issue #397 suggesting bytes should be used).

@xyl012
Copy link
Contributor Author

xyl012 commented Nov 25, 2020

I see.. seems like I need to do a little more research on possible inputs and outputs. I'm not too sure how to replace the chars function you mentioned to make it a little more performant, could you give me an example edit?

Copy link
Member

@TianyiShi2001 TianyiShi2001 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your idea of using a trait is brilliant! There are currently some utility functions in rust_bio::alphabets::{protein, dna} and rust_bio::seq_analysis, but they are free functions which are not ergonomics to use. It would be nice to have traits like Seq, ProteinSeq, DnaSeq so that users can directly call some_sequence.gc_content(), some_sequence.revcomp(), etc.

src/utils/replace.rs Outdated Show resolved Hide resolved
@TianyiShi2001
Copy link
Member

give me an example edit

Sure

@TianyiShi2001
Copy link
Member

TianyiShi2001 commented Nov 25, 2020

For example,

use rand::seq::SliceRandom;
use rand::Rng;
use std::string::String;

const STANDARD_DNA_NUCLEOTIDES: [u8; 4] = [b'A', b'C', b'G', b'T'];
const STANDARD_RNA_NUCLEOTIDES: [u8; 4] = [b'A', b'C', b'G', b'U'];

/// A trait for biological sequences. This is implemented by bytes (`Vec<u8>` and `&[u8]`)
///
/// The methods in this trait assumes that:
/// - The sequence contains only ASCII characters.
/// - Lowercase and uppercase characters are not mixed.
pub trait Seq: AsMut<[u8]> {
    // * in fact, these aren't needed. These are already provided by std. e.g. you can do `my_seq.make_ascii_uppercase()`
    // fn to_uppercase(&mut self);
    // fn to_lowercase(&mut self);
}

pub trait DnaSeq: Seq {
    /// Replace `T` with `U`
    fn to_rna(&mut self);
    /// Fill N with pseudorandom nucleotides ACTG and n with actg
    fn replace_n<R: Rng>(&mut self, rng: &mut R);
    // fn transcribe(&self) -> Vec<u8>;
}

pub trait RnaSeq: Seq {
    fn to_dna(&mut self);
}

pub trait ProteinSeq: Seq {
    fn replace_x<R: Rng>(&mut self, rng: &mut R);
}

impl<T> Seq for T where T: AsMut<[u8]> {}

impl<T> DnaSeq for T
where
    T: AsMut<[u8]>,
{
    fn to_rna(&mut self) {
        self.as_mut().iter_mut().for_each(|c| match c {
            b'T' => *c = b'U',
            b't' => *c = b'u',
            _ => {}
        })
    }
    fn replace_n<R: Rng>(&mut self, rng: &mut R) {
        self.as_mut().iter_mut().for_each(|c| match c {
            b'N' => *c = *STANDARD_DNA_NUCLEOTIDES.choose(rng).unwrap(),
            b'n' => *c = *STANDARD_DNA_NUCLEOTIDES.choose(rng).unwrap() + 32,
            _ => {}
        })
    }
}

#[cfg(test)]
mod tests {

    use super::*;

    #[test]
    fn test_seq_to_rna() {
        let mut seq = *b"ATGCT";
        seq.to_rna();
        assert_eq!(seq, *b"AUGCU");
    }
    #[test]
    fn test_seq_replace() {
        let mut rng = rand::thread_rng();
        let mut seq = *b"ATGCNNNN";
        seq.replace_n(&mut rng);
        for c in seq.iter_mut() {
            STANDARD_DNA_NUCLEOTIDES.contains(c);
        }
    }
}

Thing to notice:

  • use generics for random number generator so the user can use not only ThreadRng: https://rust-random.github.io/rand/rand/trait.Rng.html
  • use generics for sequence AsMut<[u8]> so that it works on mut Vec<u8> as well as &mut [u8]
  • from my experience, in-place operations are more appropriate in these cases. If a copy really is needed, a .clone() can be made easily.

@coveralls
Copy link

coveralls commented Nov 25, 2020

Coverage Status

Coverage decreased (-0.2%) to 86.669% when pulling 87f493a on otsukaresamadeshita:otsu_replace into 480872e on rust-bio:master.

@TianyiShi2001
Copy link
Member

I don't think replacing gaps with random nucleotides/amino acids has any biological meaning. Gaps are gaps. Gaps are created during alignment and this means deletion i.e. no nucleotide/amino acid. "Any base" and "any amino acid" are denoted by "N" and "X", respectively.

@xyl012
Copy link
Contributor Author

xyl012 commented Nov 25, 2020

I'm studying your code now, it's quite impressive! I wanted to let the user decide about gaps, caps, etc., information is information! The original reason for this page was because I downloaded some fastq files and found that I couldn't do much because there were characters that weren't actg. I just wanted to clean it so I had pure ACTG and ended up writing some utilities. I also got some corrupt seqs before so I thought might be able to do something about that also. TIL AsMut<[u8]> and +32 ! I must have missed it, but why is the code limited to unmixed lower and upper, it seems like it would work on a mixed seq?

@tedil
Copy link
Member

tedil commented Nov 25, 2020

For example,
[...]

This is exactly what I had in mind, thanks @TianyiShi2001, and sorry for not getting back to you earlier @otsukaresamadeshita!

@TianyiShi2001
Copy link
Member

why is the code limited to unmixed lower and upper, it seems like it would work on a mixed seq?

It is awkward to have mixed cases in a sequence. If someone know this happens they should unify the case first before doing any other analysis on the sequence, because most other algorithms in this crate, such as edit distance and pairwise alignment, requires uniform case.

@TianyiShi2001
Copy link
Member

I wanted to let the user decide about gaps, caps, etc., information is information! The original reason for this page was because I downloaded some fastq files and found that I couldn't do much because there were characters that weren't actg.

You didn't get my point. I meant it is OK to replace N with A/T/C/G, Y with C/T, because this is what they mean according to the IUPAC DNA nucleotide alphabet. However, gaps, specifically the - character, means missing nucleotide. Sometimes this means one nucleotide, sometime it means more than one: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp . It is conceptually wrong to replace them with A/T/C/G.

If you want to to something about gaps to make the sequence ATCG-only, the only sensible way is to delete them, there's a function called ungap in Biopython that does exactly this.

@TianyiShi2001
Copy link
Member

As for the period (.) character, I haven't come across them before. A google search suggested that it could be a nucleotide with low quality score: https://www.biostars.org/p/62347/

Some versions of the SOLiD sequencer used to put in dots into the colorspace sequence whenever the quality was too low and was unable to call a color. Used to break all kinds of tools.

Are there more reliable references on the meaning of this character?

@tedil tedil changed the title replace traits Extension traits on Text/TextSlice for common biological sequence modifications Nov 29, 2020
@xyl012
Copy link
Contributor Author

xyl012 commented Dec 2, 2020

Thanks for the code review, I learned a lot! Code committed!

Copy link
Member

@tedil tedil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's already there looks really good :)
As mentioned in one of the review comments, only uppercase sequences are supported atm (while the doc says it also handles lowercase).

As to the mixed case case @TianyiShi2001: I'm pretty sure I have (in the past) stumbled upon files where mixed upper and lower case nucleotides were used to encode strand, so I guess there may be valid use-cases somewhere.

(@otsukaresamadeshita If you're still working on the code, you can change this Pull Request to a Draft (and once you're ready, convert from Draft to PR again) )

Comment on lines +25 to +27
//! for c in seq.iter_mut() {
//! STANDARD_DNA_NUCLEOTIDES.contains(c);
//! }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! for c in seq.iter_mut() {
//! STANDARD_DNA_NUCLEOTIDES.contains(c);
//! }
assert!(seq.iter().all(|c| STANDARD_DNA_NUCLEOTIDES.contains(c)))

Comment on lines +105 to +107
for c in seq.iter_mut() {
STANDARD_DNA_NUCLEOTIDES.contains(c);
}
Copy link
Member

@tedil tedil Dec 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as in the doc example above, you forgot the assertion here ;)
Also I think there's no need to use iter_mut (since you're not mutating anything)

Suggested change
for c in seq.iter_mut() {
STANDARD_DNA_NUCLEOTIDES.contains(c);
}
assert!(seq.iter().all(|c| STANDARD_DNA_NUCLEOTIDES.contains(c)))

//! Traits to replace Ns with pseudorandom nucleotide bases and convert U to/from T.
//! The methods in this trait assumes that:
//! - The sequence contains only ASCII characters.
//! - Lowercase and uppercase characters are not mixed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually assumes that everything is uppercase, right? The trait implementations below only replace uppercase T/U/N with uppercase U/T/*

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants