Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FASTA/Q sequence record check insufficient #472

Open
ebioman opened this issue Dec 25, 2021 · 1 comment
Open

FASTA/Q sequence record check insufficient #472

ebioman opened this issue Dec 25, 2021 · 1 comment

Comments

@ebioman
Copy link

ebioman commented Dec 25, 2021

Hi
I would put this clearly into the improvement section. I found that the FASTA/Q record function .check() verifies
essentially 2 things:

  • that the header is provided and valid
  • that the sequence is ascii

I would dare to say that the sequence check is not sufficient in that aspect.
It should rather be a standard IUB/IUPAC nucleotide and amino acid code check.
Not sure if there was maybe a reason for that choice, such as speed and/or dependencies.

But essentially I can pass a sequence successfully which contains e.g. a "@" sign which is not ideal.

    /// Check validity of Fasta record.
    pub fn check(&self) -> Result<(), &str> {
        if self.id().is_empty() {
            return Err("Expecting id for Fasta record.");
        }
        if !self.seq.is_ascii() {
            return Err("Non-ascii character found in sequence.");
        }

        Ok(())
    }
@ebioman
Copy link
Author

ebioman commented Dec 28, 2021

Okay I played a bit around and would have the following suggestion - which you might not like.

extern crate regex;
use regex::bytes::Regex;
extern crate bio;

pub fn check_fq(fq_record: &bio::io::fastq::Record) -> Result<(), &'static str> {
    let alphabet = Regex::new(r"(?-u)[A-z\*\-]").expect("ERROR: could not generate personal alphabet");
    if fq_record.id().is_empty() {
        return Err("Expecting id for FastQ record.");
    }
    if !fq_record.seq().is_ascii() {
        return Err("Non-ascii character found in sequence.");
    }
    if !fq_record.qual().is_ascii() {
        return Err("Non-ascii character found in qualities.");
    }
    if fq_record.seq().len() != fq_record.qual().len() {
        return Err("Unequal length of sequence an qualities.");
    }
    for letter in fq_record.seq(){
        if !alphabet.is_match(&[*letter]) {
            return Err("Non valid character in seq found!");
        }
    }

    Ok(())
}

which works for the following ones:

let record = Record::with_attrs("id_str", Some("desc"), b"ATGCGGG", b"QQQQQQQ");
assert!(check_fq(&record).is_ok());
record = Record::with_attrs("id_str", Some("desc"), b"ATGC@GGG", b"QQQQ0QQQ");
let actual = check_fq(&record).unwrap_err();
let expected = "Non valid character in seq found!";
assert_eq!(actual, expected);

I thought by matching directly again u8 instead of going through strings one could save time.
But obviously this would add an external crate as dependency.
So just maybe something to spark some ideas....

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant