Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: FASTA/FASTQ unification #433

Open
wants to merge 56 commits into
base: master
Choose a base branch
from

Conversation

morsecodist
Copy link
Contributor

@morsecodist morsecodist commented May 31, 2021

Description

Implements types and utilities for dealing with FASTA and FASTQ files interchangeably. Completely backwards compatible. There are a few elements to this:

Utilities

Basic utility functions for determining whether data is in the FASTA or FASTQ format are implemented. Also an emum for FASTA/FASTQ types is implemented.

Typing

Interfaces are defined for working with FASTA and FASTQ data interchangeably. These allow you to implement functions for fasta::Record and fastq::Record generically using type bounds. In this case the type is known at compile time and performance is equivalent to implementations on either fasta::Record or fastq::Record.

Example

The motivating example here is implementing functions for both FASTA and FASTQ data like this filtration based on length:

use bio::io::{fasta, fastq, fastx};
use std::io;

fn at_least_n_bases<T, E, I>(records: I, n: usize) -> impl Iterator<Item = Result<T, E>>
where
    T: fastx::Record,
    E: std::error::Error,
    I: fastx::Records<T, E>,
{
    records.filter(move |rr| match rr {
        Ok(r) => r.seq().len() > n,
        _ => true,
    })
}

let mut reader = fasta::Reader::new(io::stdin());
let mut writer = fasta::Writer::new(io::stdout());

for record in at_least_n_bases(reader.records(), 10) {
    writer.write_record(&record.unwrap());
}

Dynamically Dealing with Either Type

For a small performance cost (5-10)% you can also deal with these types interchangeably at runtime with EitherRecord and EitherRecords. Any program can be implemented using just the zero-cost approaches described above but for developers who care more for convenience than the absolute peak performance these may be of use.

Example

In this example we can parse a file of unknown FASTA/FASTQ type and print information about each record seamlessly.

use bio::io::fastx::{EitherRecords, Record};
use std::io;
use std::str;

let mut records = EitherRecords::from(io::stdin());
while let Some(Ok(record)) = records.next() {
    println!("id  : {}", record.id());
    println!("desc: {}", record.desc().unwrap_or("none"));
    println!("seq : {}", str::from_utf8(record.seq()).unwrap_or("error"));
    // Add a default quality in case we have a FASTA record
    let default_qual = vec![b'I'; record.seq().len()];
    let qual = record.qual().unwrap_or(&default_qual);
    println!("qual: {}", str::from_utf8(qual).unwrap_or("error"));
    println!("")
}

Quality Info

This PR contains extensive tests and increases overall test coverage. It also contains documented practical examples and documentation for top level functions. It adds a new benchmark to gauge the performance of it's new features:

test bench_fasta_count        ... bench:  17,344,780 ns/iter (+/- 1,956,447)     
test bench_fastx_fasta_count  ... bench:  17,261,147 ns/iter (+/- 1,747,287)
test bench_either_fasta_count ... bench:  19,138,472 ns/iter (+/- 1,926,619)

test bench_fasta_check        ... bench:  18,526,579 ns/iter (+/- 1,874,079)
test bench_fastx_fasta_check  ... bench:  18,511,587 ns/iter (+/- 1,814,413)
test bench_either_fasta_check ... bench:  19,756,205 ns/iter (+/- 869,123)

In this benchmark two tasks are performed: (check and count) for each approach: (normal fasta parsing fasta, the generic approach fastx_fasta, and the dynamic approach either_fasta). We can see there is no difference for the generic approach and a small slowdown for the dynamic.

src/io/fastx.rs Outdated Show resolved Hide resolved
src/io/fastx.rs Outdated Show resolved Hide resolved
src/io/fastx.rs Outdated Show resolved Hide resolved
@natir
Copy link
Contributor

natir commented May 31, 2021

Hi nice work,

I didn't read all stuff in detail but I made some comment on something I see.

@morsecodist
Copy link
Contributor Author

Added some more docs/examples and made some improvements. Going to add more tests and clean up the docs a touch but I think it is coming together. Any thoughts on some of the high level/value judgement pieces?

@natir
Copy link
Contributor

natir commented Jun 2, 2021

Some generic comment:

I updated the fastx error type to use io::Error which is a breaking change. In my opinion unifying the error types (fasta uses io::Error) is important and these all map decently well onto io errors. I am open to discuss this more.

I made a small check in other io module I think we need a discuss about error types and how we can made an unification. I think we need change similar as what you do but not in this PR.

I suspect the compiler will fix any performance overhead this may introduce but it still isn't the most standard thing in the world.

Maybe you can add a small benchmark to check if fastx layer have an impact on performance or not.

@coveralls
Copy link

coveralls commented Jun 8, 2021

Coverage Status

Coverage increased (+0.003%) to 88.677% when pulling 4382282 on morsecodist:fastx-three into 9e8a7ca on rust-bio:master.

@morsecodist morsecodist marked this pull request as ready for review June 9, 2021 13:25
@morsecodist morsecodist changed the title FASTA/FASTQ unification draft FASTA/FASTQ unification Jun 9, 2021
benches/fastx.rs Outdated
Comment on lines 25 to 37
let mut rng = thread_rng();
for _ in 0..FASTA_SIZE {
let id: String = thread_rng()
.sample_iter(&Alphanumeric)
.take(ID_LEN)
.map(char::from)
.collect();

let desc: String = thread_rng()
.sample_iter(&Alphanumeric)
.take(DESC_LEN)
.map(char::from)
.collect();
Copy link
Contributor

@natir natir Jun 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait you create a RNG generator but you didn't use it it to generate id and description why ?

I also recommend usage of a seedable RNG in benchmark to have more stable result between benchmark runs.

To get this you need change Cargo.toml:

rand = { version = "0.8", features = ["std_rng"] }

And change line 25 in:

let mut rng = rand::rngs::StdRng::seed_from_u64(42);

@morsecodist morsecodist requested a review from natir June 13, 2021 12:25
@ahcm
Copy link

ahcm commented Jun 16, 2021

I wonder why you do not have a FastxReader trait.
In general if you use Fastx trait, I would expect that a FastaRecord was enough.

@morsecodist
Copy link
Contributor Author

@ahcm thanks for the comments

I wonder why you do not have a FastxReader trait.

The reader requires that an empty fasta/fastq record be passed in to be read into. For the EitherRecord type this is difficult because there is no natural empty value as it needs to be either one or the other. Also from my understanding the reader is only there for performance reasons because it is less convenient to use in all ways than the iterator, with the EitherRecord you will already need to build a new struct depending on the type so this performance benefit is negated. For the traits the reader would be unusable. The trait bound on it's own only enforces an interface that is agnostic to the record type and you need to pass in a specific record to the read function but there would be no way to enforce that. In the other PR the reader for both records was implemented for the shared reader type but this brings up ambiguity in the case where you are reading into the wrong record. Do you ignore the quality? Do you error? Do you have an empty slice for the quality? In general I felt that the ambiguity and difficulty did not justify the amount of utility this would bring.

In general if you use Fastx trait, I would expect that a FastaRecord was enough.

I am confused by this comment. How would you implement a function generic over both record types without the trait? In my filtration example we are filtering based on factors that are only in the fasta but we want to output a file in the same format as the input so if we converted to fasta that information would be lost.

@ahcm
Copy link

ahcm commented Jun 16, 2021

To me Fastx means, take Fasta or Fastq and expose the common fields (FastaRecord without quality).

You could have an independent FastxReader, that skips the quality and exposes a FastaRecord.
I expect the code would not be more than what you have now.

You could also modify FastaReader and FastqReader to have one generic record.
In that case I'd leave the qualities as empty slice in the Fasta case, but that's just my taste.
Both could then have the FastxReader trait.

Or you could go with a type parameter FastxReader or similar.

@morsecodist
Copy link
Contributor Author

You could have an independent FastxReader, that skips the quality and exposes a FastaRecord.

With this approach you lose the ability to preserve the input data for the filtration use case which was what motivated me to add this in the first place and I am not sure what you gain in return.

You could also modify FastaReader and FastqReader to have one generic record.

You mean a generic subtype of fastx? This may introduce some issues with circular dependencies.

I could unify the reader by creating a struct for a fastx reader and implementing fasta and fastq read on it if you feel it would be useful. I personally never use the reader so this is an oversight on my part.

benches/fastx.rs Outdated
Comment on lines 22 to 46
let mut raw_writer = Vec::new();
{
let mut w = fasta::Writer::new(&mut raw_writer);
for _ in 0..FASTA_SIZE {
let id: String = StdRng::seed_from_u64(42)
.sample_iter(&Alphanumeric)
.take(ID_LEN)
.map(char::from)
.collect();

let desc: String = StdRng::seed_from_u64(4)
.sample_iter(&Alphanumeric)
.take(DESC_LEN)
.map(char::from)
.collect();

let seq: Vec<u8> = (0..SEQ_LEN)
.map(|_| {
let idx = StdRng::seed_from_u64(65).gen_range(0..BASES.len());
BASES[idx]
})
.collect();

w.write_record(&fasta::Record::with_attrs(&id, Some(&desc), &seq))?;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose an important simplification of this code:

    let mut rng = StdRng::seed_from_u64(42);
    let mut raw_writer = Vec::new();
    {
        let mut w = fasta::Writer::new(&mut raw_writer);
        for _ in 0..FASTA_SIZE {
            let id: String = (0..ID_LEN)
                .map(|_| char::from(rng.sample(&Alphanumeric)))
                .collect();

            let desc: String = (0..DESC_LEN)
                .map(|_| char::from(rng.sample(&Alphanumeric)))
                .collect();

            let seq: Vec<u8> = (0..SEQ_LEN)
                .map(|_| {
                    let idx = rng.gen_range(0..BASES.len());
                    BASES[idx]
                })
                .collect();

            w.write_record(&fasta::Record::with_attrs(&id, Some(&desc), &seq))?;
        }

Another advantage, of this code id, desc and sequence is different between each read, but file produce is same between run.

@natir
Copy link
Contributor

natir commented Jun 22, 2021

@ahcm remark is intresting, but I think we can merge this PR now, and add modification in future. It's already a nice change.

Except my small simplification in benchmark, I think we can merge this PR.

@morsecodist morsecodist requested a review from natir June 28, 2021 00:00
@morsecodist
Copy link
Contributor Author

@natir made your proposed change. I also trialed a new version that should have been able to get performance parity between EitherRecord and normal records and while doing it I realized that the benchmarks are unreliable and I haven't really been able to make them reliable. Even on a dedicated server and with lots of parameter tweaking I can't get consistent results run to run. Given how hard it is to replicate performance degradation it is even possible we do have performance parity but either way the difference is very small.

@ahcm
Copy link

ahcm commented Jun 30, 2021

So I looked around a little and found seq_io:
https://github.com/markschl/seq_io

Not only does it have a common interface but it is also very low overhead.
While a lot of performance will be lost the more you actually work with the sequence,
it will be useful for summary tools and also if you want to copy raw data (with possible newlines).

I also made a little library on my own in that direction:
https://github.com/ahcm/fastx

I'm still not very familiar with rust, especially on designing interfaces for libraries,
so this is also practice for me.

Maybe this gives you some ideas.
This approach also lets itself to SIMD reader techniques, I'm just trying to figure out how to use the NEON operations in sensible way.

I guess, at least with the SIMD based reader I would like you to be able to use or copy it
or at least use the ideas and parts of the code to improve performance.

Best
Andy

@tedil tedil mentioned this pull request Jul 22, 2021
@morsecodist morsecodist changed the title FASTA/FASTQ unification feat: FASTA/FASTQ unification Dec 29, 2021
@natir
Copy link
Contributor

natir commented Jan 10, 2022

Hi @morsecodist , sorry I take to much time to recheck this pull request, many thanks for your work.

From my point of view this code is functional, usable and performant.

The code is correctly written, except for some small issue reporte by clippy (version 0.1.57) but nothing too serious, the code is tested and benchmarked correctly (lacks a benchmark for fastq files). The documentation seems clear to me.

Then concerning the choices of implementation I don't really have an opinion. @ahcm seems say that the either interface is less efficient but the benchmark on my machine contradicts it. In fact on my machine the fastx code seems faster than the classic fasta parser.

parser check (ns/iter) count (ns/iter)
fasta 20,258 21,385
fastx_fasta 19,083 19,887
fastx_either 19,908 19,437

I think we can merge it without any problem. After this interface will we generate trouble in the future, required a lot of maintenance, I do not have the answer.

For me it can be merged but since it's a big change I'd like another @rust-bio/reviewers to approve it before merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants