Extension traits on Text/TextSlice for common biological sequence modifications #396

xyl012 · 2020-11-24T04:02:53Z

So after a little struggling, this is what I came up with so far. There's a new trait for String and Text, which returns a string or text. Although I wanted to originally make it take &str, I'm still getting a handle on 'a and wasn't sure if it would be really all that much different. (please correct me if I'm wrong!) In the case of an &str, the trait would basically create a string anyway and probably return a string, which would end up being the same as what we have here but with a .to_string() prior. There are a few other utilities I would like to create that &str might be a little more appropriate for which I'll be trying to use explicit lifetimes.

On an aside, if you guys think going by graphemes rather than characters would be beneficial I can also replace chars, but I didn't really see a use for biology.

Any input is appreciated! Cheers

TianyiShi2001 · 2020-11-24T21:10:05Z

ご協力ありがとうございます！Thank you for your contribution!

Is there a specific reason for using String/&str instead of Vec<u8>/&[u8]? The latter is being used throughout rust-bio (aliased as Text/TextSlice) because strings used in bioinformatics rarely use non-ASCII characters, and Vec<u8>/&[u8] is faster than String/&str in Rust.

xyl012 · 2020-11-24T21:52:54Z

Thanks for the review and input TianyiShi! String and (probably upstream) &str in this case were chosen mainly simply for ergonomics (and I'm not that good at rust). Although rust bio uses text and text slice, I would say that using string makes it adoptable to anyone using rust, even if it isn't strictly within the rust bio umbrella. As a middle ground, I have created the trait for text too.

Originally I had planned/started writing to intake/output u8 arrays and vecs but like anything in Rust, with lower level access we increase the complexity and hedge our bets that users are willing to go through learning about u8s vecs arrays and how this applies to strings. Most people I think just want a string, and there's so much space in these kinds of structures I started to get lost, but I'm studying and trying to get there. If you have the knowhow, feel free to edit!

In short, you're asking about future optimization (or maybe another trait completely?) which I just don't really know how to implement yet but as you noted, slow isn't a rust thing, and I am thinking about how to implement exactly what you said. As a minimum, I think I did a thing :)

I'm pretty new to Rust and this kind of programming so please correct or counter these points if it seems incorrect. This is my first trait! :D

TianyiShi2001 · 2020-11-25T02:05:02Z

using string makes it adoptable to anyone using rust

I think the ease-of-use for the end user depends on the interoperability between I/O tools (e.g. FASTA, FASTQ readers/writers) and algorithms (e.g. GC content, suffix array, sequence alignment). If they're all consistent on either on String or bytes, then it's easy to use (since you don't need to convert back and forth). However, as a developer who cares about performance, you'll definitely prefer bytes over Strings for representing biological sequences. It's much more easier to manipulate and has better performance (the .chars() method I saw in you code is particularly slow, because it needs to determine the next code point for each character).

Indeed, almost all algorithms in this crate use bytes (rust_bio::utils::text::Textslice), so does rust-htslib (another rust-bio project) and another independent FASTA/FASTQ parser/writer, seq-io (which is faster than rust_bio::io::{fasta, fastq}.

(for some reason the I/O tools in this crate uses Strings, and I've just submitted an issue #397 suggesting bytes should be used).

xyl012 · 2020-11-25T02:33:35Z

I see.. seems like I need to do a little more research on possible inputs and outputs. I'm not too sure how to replace the chars function you mentioned to make it a little more performant, could you give me an example edit?

TianyiShi2001

Your idea of using a trait is brilliant! There are currently some utility functions in rust_bio::alphabets::{protein, dna} and rust_bio::seq_analysis, but they are free functions which are not ergonomics to use. It would be nice to have traits like Seq, ProteinSeq, DnaSeq so that users can directly call some_sequence.gc_content(), some_sequence.revcomp(), etc.

src/utils/replace.rs

TianyiShi2001 · 2020-11-25T02:49:42Z

give me an example edit

Sure

TianyiShi2001 · 2020-11-25T04:12:56Z

For example,

use rand::seq::SliceRandom;
use rand::Rng;
use std::string::String;

const STANDARD_DNA_NUCLEOTIDES: [u8; 4] = [b'A', b'C', b'G', b'T'];
const STANDARD_RNA_NUCLEOTIDES: [u8; 4] = [b'A', b'C', b'G', b'U'];

/// A trait for biological sequences. This is implemented by bytes (`Vec<u8>` and `&[u8]`)
///
/// The methods in this trait assumes that:
/// - The sequence contains only ASCII characters.
/// - Lowercase and uppercase characters are not mixed.
pub trait Seq: AsMut<[u8]> {
    // * in fact, these aren't needed. These are already provided by std. e.g. you can do `my_seq.make_ascii_uppercase()`
    // fn to_uppercase(&mut self);
    // fn to_lowercase(&mut self);
}

pub trait DnaSeq: Seq {
    /// Replace `T` with `U`
    fn to_rna(&mut self);
    /// Fill N with pseudorandom nucleotides ACTG and n with actg
    fn replace_n<R: Rng>(&mut self, rng: &mut R);
    // fn transcribe(&self) -> Vec<u8>;
}

pub trait RnaSeq: Seq {
    fn to_dna(&mut self);
}

pub trait ProteinSeq: Seq {
    fn replace_x<R: Rng>(&mut self, rng: &mut R);
}

impl<T> Seq for T where T: AsMut<[u8]> {}

impl<T> DnaSeq for T
where
    T: AsMut<[u8]>,
{
    fn to_rna(&mut self) {
        self.as_mut().iter_mut().for_each(|c| match c {
            b'T' => *c = b'U',
            b't' => *c = b'u',
            _ => {}
        })
    }
    fn replace_n<R: Rng>(&mut self, rng: &mut R) {
        self.as_mut().iter_mut().for_each(|c| match c {
            b'N' => *c = *STANDARD_DNA_NUCLEOTIDES.choose(rng).unwrap(),
            b'n' => *c = *STANDARD_DNA_NUCLEOTIDES.choose(rng).unwrap() + 32,
            _ => {}
        })
    }
}

#[cfg(test)]
mod tests {

    use super::*;

    #[test]
    fn test_seq_to_rna() {
        let mut seq = *b"ATGCT";
        seq.to_rna();
        assert_eq!(seq, *b"AUGCU");
    }
    #[test]
    fn test_seq_replace() {
        let mut rng = rand::thread_rng();
        let mut seq = *b"ATGCNNNN";
        seq.replace_n(&mut rng);
        for c in seq.iter_mut() {
            STANDARD_DNA_NUCLEOTIDES.contains(c);
        }
    }
}

Thing to notice:

use generics for random number generator so the user can use not only ThreadRng: https://rust-random.github.io/rand/rand/trait.Rng.html
use generics for sequence AsMut<[u8]> so that it works on mut Vec<u8> as well as &mut [u8]
from my experience, in-place operations are more appropriate in these cases. If a copy really is needed, a .clone() can be made easily.

coveralls · 2020-11-25T04:16:01Z

Coverage decreased (-0.2%) to 86.669% when pulling 87f493a on otsukaresamadeshita:otsu_replace into 480872e on rust-bio:master.

TianyiShi2001 · 2020-11-25T04:25:00Z

I don't think replacing gaps with random nucleotides/amino acids has any biological meaning. Gaps are gaps. Gaps are created during alignment and this means deletion i.e. no nucleotide/amino acid. "Any base" and "any amino acid" are denoted by "N" and "X", respectively.

xyl012 · 2020-11-25T05:40:42Z

I'm studying your code now, it's quite impressive! I wanted to let the user decide about gaps, caps, etc., information is information! The original reason for this page was because I downloaded some fastq files and found that I couldn't do much because there were characters that weren't actg. I just wanted to clean it so I had pure ACTG and ended up writing some utilities. I also got some corrupt seqs before so I thought might be able to do something about that also. TIL AsMut<[u8]> and +32 ! I must have missed it, but why is the code limited to unmixed lower and upper, it seems like it would work on a mixed seq?

tedil · 2020-11-25T09:45:47Z

For example,
[...]

This is exactly what I had in mind, thanks @TianyiShi2001, and sorry for not getting back to you earlier @otsukaresamadeshita!

TianyiShi2001 · 2020-11-25T12:43:16Z

why is the code limited to unmixed lower and upper, it seems like it would work on a mixed seq?

It is awkward to have mixed cases in a sequence. If someone know this happens they should unify the case first before doing any other analysis on the sequence, because most other algorithms in this crate, such as edit distance and pairwise alignment, requires uniform case.

TianyiShi2001 · 2020-11-25T13:07:09Z

I wanted to let the user decide about gaps, caps, etc., information is information! The original reason for this page was because I downloaded some fastq files and found that I couldn't do much because there were characters that weren't actg.

You didn't get my point. I meant it is OK to replace N with A/T/C/G, Y with C/T, because this is what they mean according to the IUPAC DNA nucleotide alphabet. However, gaps, specifically the - character, means missing nucleotide. Sometimes this means one nucleotide, sometime it means more than one: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp . It is conceptually wrong to replace them with A/T/C/G.

If you want to to something about gaps to make the sequence ATCG-only, the only sensible way is to delete them, there's a function called ungap in Biopython that does exactly this.

TianyiShi2001 · 2020-11-25T13:10:03Z

As for the period (.) character, I haven't come across them before. A google search suggested that it could be a nucleotide with low quality score: https://www.biostars.org/p/62347/

Some versions of the SOLiD sequencer used to put in dots into the colorspace sequence whenever the quality was too low and was unable to call a color. Used to break all kinds of tools.

Are there more reliable references on the meaning of this character?

…ifications

xyl012 · 2020-12-02T02:04:00Z

Thanks for the code review, I learned a lot! Code committed!

tedil

What's already there looks really good :)
As mentioned in one of the review comments, only uppercase sequences are supported atm (while the doc says it also handles lowercase).

As to the mixed case case @TianyiShi2001: I'm pretty sure I have (in the past) stumbled upon files where mixed upper and lower case nucleotides were used to encode strand, so I guess there may be valid use-cases somewhere.

(@otsukaresamadeshita If you're still working on the code, you can change this Pull Request to a Draft (and once you're ready, convert from Draft to PR again) )

tedil · 2020-12-02T11:06:44Z

src/utils/replace.rs

+//! for c in seq.iter_mut() {
+//!     STANDARD_DNA_NUCLEOTIDES.contains(c);
+//! }


Suggested change

//! for c in seq.iter_mut() {

//! STANDARD_DNA_NUCLEOTIDES.contains(c);

//! }

assert!(seq.iter().all(|c| STANDARD_DNA_NUCLEOTIDES.contains(c)))

tedil · 2020-12-02T11:10:52Z

src/utils/replace.rs

+        for c in seq.iter_mut() {
+            STANDARD_DNA_NUCLEOTIDES.contains(c);
+        }


Same as in the doc example above, you forgot the assertion here ;)
Also I think there's no need to use iter_mut (since you're not mutating anything)

Suggested change

for c in seq.iter_mut() {

STANDARD_DNA_NUCLEOTIDES.contains(c);

}

assert!(seq.iter().all(|c| STANDARD_DNA_NUCLEOTIDES.contains(c)))

tedil · 2020-12-02T11:13:05Z

src/utils/replace.rs

+//! Traits to replace Ns with pseudorandom nucleotide bases and convert U to/from T.
+//! The methods in this trait assumes that:
+//! - The sequence contains only ASCII characters.
+//! - Lowercase and uppercase characters are not mixed.


It actually assumes that everything is uppercase, right? The trait implementations below only replace uppercase T/U/N with uppercase U/T/*

replace traits

a03d8b2

xyl012 requested review from TianyiShi2001 and tedil November 24, 2020 04:02

TianyiShi2001 reviewed Nov 25, 2020

View reviewed changes

src/utils/replace.rs Outdated Show resolved Hide resolved

Merge branch 'master' into otsu_replace

20677cd

tedil changed the title ~~replace traits~~ Extension traits on Text/TextSlice for common biological sequence modifications Nov 29, 2020

Extension traits on Text/TextSlice for common biological sequence mod…

87f493a

…ifications

tedil requested changes Dec 2, 2020

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extension traits on Text/TextSlice for common biological sequence modifications #396

Extension traits on Text/TextSlice for common biological sequence modifications #396

xyl012 commented Nov 24, 2020 •

edited

TianyiShi2001 commented Nov 24, 2020 •

edited

xyl012 commented Nov 24, 2020

TianyiShi2001 commented Nov 25, 2020

xyl012 commented Nov 25, 2020

TianyiShi2001 left a comment

TianyiShi2001 commented Nov 25, 2020

TianyiShi2001 commented Nov 25, 2020 •

edited

coveralls commented Nov 25, 2020 •

edited

TianyiShi2001 commented Nov 25, 2020

xyl012 commented Nov 25, 2020 •

edited

tedil commented Nov 25, 2020 •

edited

TianyiShi2001 commented Nov 25, 2020

TianyiShi2001 commented Nov 25, 2020

TianyiShi2001 commented Nov 25, 2020

xyl012 commented Dec 2, 2020

tedil left a comment

tedil Dec 2, 2020

tedil Dec 2, 2020 •

edited

tedil Dec 2, 2020

Extension traits on Text/TextSlice for common biological sequence modifications #396

Are you sure you want to change the base?

Extension traits on Text/TextSlice for common biological sequence modifications #396

Conversation

xyl012 commented Nov 24, 2020 • edited

TianyiShi2001 commented Nov 24, 2020 • edited

xyl012 commented Nov 24, 2020

TianyiShi2001 commented Nov 25, 2020

xyl012 commented Nov 25, 2020

TianyiShi2001 left a comment

Choose a reason for hiding this comment

TianyiShi2001 commented Nov 25, 2020

TianyiShi2001 commented Nov 25, 2020 • edited

coveralls commented Nov 25, 2020 • edited

TianyiShi2001 commented Nov 25, 2020

xyl012 commented Nov 25, 2020 • edited

tedil commented Nov 25, 2020 • edited

TianyiShi2001 commented Nov 25, 2020

TianyiShi2001 commented Nov 25, 2020

TianyiShi2001 commented Nov 25, 2020

xyl012 commented Dec 2, 2020

tedil left a comment

Choose a reason for hiding this comment

tedil Dec 2, 2020

Choose a reason for hiding this comment

tedil Dec 2, 2020 • edited

Choose a reason for hiding this comment

tedil Dec 2, 2020

Choose a reason for hiding this comment

xyl012 commented Nov 24, 2020 •

edited

TianyiShi2001 commented Nov 24, 2020 •

edited

TianyiShi2001 commented Nov 25, 2020 •

edited

coveralls commented Nov 25, 2020 •

edited

xyl012 commented Nov 25, 2020 •

edited

tedil commented Nov 25, 2020 •

edited

tedil Dec 2, 2020 •

edited