Hi
I would clearly file this as an improvement request. I found that the FASTA/Q record function .check() verifies
essentially two things:
that the header is provided and valid
that the sequence is ascii
I would dare to say that the sequence check is not sufficient in that aspect.
It should rather be a standard IUB/IUPAC nucleotide and amino acid code check.
Not sure if there was maybe a reason for that choice, such as speed and/or dependencies.
But essentially, a sequence containing e.g. an "@" sign passes the check, which is not ideal.
/// Check validity of Fasta record.
pub fn check(&self) -> Result<(), &str> {
    if self.id().is_empty() {
        return Err("Expecting id for Fasta record.");
    }
    if !self.seq.is_ascii() {
        return Err("Non-ascii character found in sequence.");
    }
    Ok(())
}
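To make the request concrete, here is a minimal sketch of what an IUPAC nucleotide check could look like using only the standard library. The names `is_iupac_nucleotide` and `check_nucleotide_seq` are hypothetical and not part of the crate, and the accepted set (the IUPAC nucleotide codes plus `-` as a gap, case-insensitive) is an assumption; an amino acid variant would need its own symbol set.

```rust
/// Hypothetical helper: true if `b` is an IUPAC nucleotide code
/// (case-insensitive) or the gap symbol '-'.
pub fn is_iupac_nucleotide(b: u8) -> bool {
    matches!(
        b.to_ascii_uppercase(),
        b'A' | b'C' | b'G' | b'T' | b'U'
            | b'R' | b'Y' | b'S' | b'W' | b'K' | b'M'
            | b'B' | b'D' | b'H' | b'V' | b'N'
            | b'-'
    )
}

/// Hypothetical stricter sequence check: rejects any byte outside
/// the assumed IUPAC nucleotide alphabet.
pub fn check_nucleotide_seq(seq: &[u8]) -> Result<(), &'static str> {
    if seq.iter().all(|&b| is_iupac_nucleotide(b)) {
        Ok(())
    } else {
        Err("Non-IUPAC character found in sequence.")
    }
}
```

With this sketch, `check_nucleotide_seq(b"ATGC@GGG")` returns an error, while the current `is_ascii()` test accepts the same bytes.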
Okay, I played around a bit and have the following suggestion, which you might not like.
extern crate bio;
extern crate regex;

use regex::bytes::Regex;

pub fn check_fq(fq_record: &bio::io::fastq::Record) -> Result<(), &'static str> {
    // Accepted symbols: ASCII letters, '*' and '-'.
    // The class is `A-Za-z` rather than `A-z`, because `A-z` also admits
    // the six ASCII characters between 'Z' and 'a' (such as '[' and '_').
    let alphabet = Regex::new(r"(?-u)[A-Za-z\*\-]")
        .expect("ERROR: could not generate personal alphabet");
    if fq_record.id().is_empty() {
        return Err("Expecting id for FastQ record.");
    }
    if !fq_record.seq().is_ascii() {
        return Err("Non-ascii character found in sequence.");
    }
    if !fq_record.qual().is_ascii() {
        return Err("Non-ascii character found in qualities.");
    }
    if fq_record.seq().len() != fq_record.qual().len() {
        return Err("Unequal length of sequence and qualities.");
    }
    // Match each byte of the sequence against the alphabet.
    for letter in fq_record.seq() {
        if !alphabet.is_match(&[*letter]) {
            return Err("Non valid character in seq found!");
        }
    }
    Ok(())
}
which works for the following cases:
let record = Record::with_attrs("id_str", Some("desc"), b"ATGCGGG", b"QQQQQQQ");
assert!(check_fq(&record).is_ok());

// Shadow the first binding; it was declared without `mut`.
let record = Record::with_attrs("id_str", Some("desc"), b"ATGC@GGG", b"QQQQ0QQQ");
let actual = check_fq(&record).unwrap_err();
let expected = "Non valid character in seq found!";
assert_eq!(actual, expected);
I thought that by matching directly against u8 instead of going through strings, one could save time.
But obviously this would add an external crate as dependency.
So just maybe something to spark some ideas....
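If the extra dependency is the main concern, the same byte-level idea might also work without the regex crate. The following is only a sketch under that assumption: `build_table`, `ALLOWED` and `check_seq_bytes` are hypothetical names, and the accepted set (ASCII letters, `*` and `-`) simply mirrors the alphabet from the suggestion above. The table is built once at compile time, so each byte costs a single array lookup.

```rust
/// Build a 256-entry table at compile time; `true` marks an accepted byte.
const fn build_table() -> [bool; 256] {
    let mut table = [false; 256];
    let mut i = 0;
    while i < 256 {
        let b = i as u8;
        // Assumed alphabet: ASCII letters plus '*' and '-'.
        table[i] = b.is_ascii_alphabetic() || b == b'*' || b == b'-';
        i += 1;
    }
    table
}

/// Hypothetical lookup table indexed by byte value.
static ALLOWED: [bool; 256] = build_table();

/// Hypothetical dependency-free check: one table lookup per sequence byte.
pub fn check_seq_bytes(seq: &[u8]) -> Result<(), &'static str> {
    if seq.iter().all(|&b| ALLOWED[b as usize]) {
        Ok(())
    } else {
        Err("Non valid character in seq found!")
    }
}
```

Whether this is actually faster than the regex version would have to be benchmarked; the point is only that no external crate is required.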