feat: FASTA/FASTQ unification #433

morsecodist · 2021-05-31T04:07:58Z

Description

Implements types and utilities for dealing with FASTA and FASTQ files interchangeably. Completely backwards compatible. There are a few elements to this:

Utilities

Basic utility functions for determining whether data is in the FASTA or FASTQ format are implemented. Also an emum for FASTA/FASTQ types is implemented.

Typing

Interfaces are defined for working with FASTA and FASTQ data interchangeably. These allow you to implement functions for fasta::Record and fastq::Record generically using type bounds. In this case the type is known at compile time and performance is equivalent to implementations on either fasta::Record or fastq::Record.

Example

The motivating example here is implementing functions for both FASTA and FASTQ data like this filtration based on length:

use bio::io::{fasta, fastq, fastx};
use std::io;

fn at_least_n_bases<T, E, I>(records: I, n: usize) -> impl Iterator<Item = Result<T, E>>
where
    T: fastx::Record,
    E: std::error::Error,
    I: fastx::Records<T, E>,
{
    records.filter(move |rr| match rr {
        Ok(r) => r.seq().len() > n,
        _ => true,
    })
}

let mut reader = fasta::Reader::new(io::stdin());
let mut writer = fasta::Writer::new(io::stdout());

for record in at_least_n_bases(reader.records(), 10) {
    writer.write_record(&record.unwrap());
}

Dynamically Dealing with Either Type

For a small performance cost (5-10)% you can also deal with these types interchangeably at runtime with EitherRecord and EitherRecords. Any program can be implemented using just the zero-cost approaches described above but for developers who care more for convenience than the absolute peak performance these may be of use.

Example

In this example we can parse a file of unknown FASTA/FASTQ type and print information about each record seamlessly.

use bio::io::fastx::{EitherRecords, Record};
use std::io;
use std::str;

let mut records = EitherRecords::from(io::stdin());
while let Some(Ok(record)) = records.next() {
    println!("id  : {}", record.id());
    println!("desc: {}", record.desc().unwrap_or("none"));
    println!("seq : {}", str::from_utf8(record.seq()).unwrap_or("error"));
    // Add a default quality in case we have a FASTA record
    let default_qual = vec![b'I'; record.seq().len()];
    let qual = record.qual().unwrap_or(&default_qual);
    println!("qual: {}", str::from_utf8(qual).unwrap_or("error"));
    println!("")
}

Quality Info

This PR contains extensive tests and increases overall test coverage. It also contains documented practical examples and documentation for top level functions. It adds a new benchmark to gauge the performance of it's new features:

test bench_fasta_count        ... bench:  17,344,780 ns/iter (+/- 1,956,447)     
test bench_fastx_fasta_count  ... bench:  17,261,147 ns/iter (+/- 1,747,287)
test bench_either_fasta_count ... bench:  19,138,472 ns/iter (+/- 1,926,619)

test bench_fasta_check        ... bench:  18,526,579 ns/iter (+/- 1,874,079)
test bench_fastx_fasta_check  ... bench:  18,511,587 ns/iter (+/- 1,814,413)
test bench_either_fasta_check ... bench:  19,756,205 ns/iter (+/- 869,123)

In this benchmark two tasks are performed: (check and count) for each approach: (normal fasta parsing fasta, the generic approach fastx_fasta, and the dynamic approach either_fasta). We can see there is no difference for the generic approach and a small slowdown for the dynamic.

src/io/fastx.rs

natir · 2021-05-31T10:53:41Z

Hi nice work,

I didn't read all stuff in detail but I made some comment on something I see.

morsecodist · 2021-05-31T17:08:46Z

Added some more docs/examples and made some improvements. Going to add more tests and clean up the docs a touch but I think it is coming together. Any thoughts on some of the high level/value judgement pieces?

natir · 2021-06-02T09:38:09Z

Some generic comment:

I updated the fastx error type to use io::Error which is a breaking change. In my opinion unifying the error types (fasta uses io::Error) is important and these all map decently well onto io errors. I am open to discuss this more.

I made a small check in other io module I think we need a discuss about error types and how we can made an unification. I think we need change similar as what you do but not in this PR.

I suspect the compiler will fix any performance overhead this may introduce but it still isn't the most standard thing in the world.

Maybe you can add a small benchmark to check if fastx layer have an impact on performance or not.

coveralls · 2021-06-08T15:38:25Z

Coverage increased (+0.003%) to 88.677% when pulling 4382282 on morsecodist:fastx-three into 9e8a7ca on rust-bio:master.

natir · 2021-06-10T07:22:20Z

benches/fastx.rs

+    let mut rng = thread_rng();
+    for _ in 0..FASTA_SIZE {
+        let id: String = thread_rng()
+            .sample_iter(&Alphanumeric)
+            .take(ID_LEN)
+            .map(char::from)
+            .collect();
+
+        let desc: String = thread_rng()
+            .sample_iter(&Alphanumeric)
+            .take(DESC_LEN)
+            .map(char::from)
+            .collect();


Wait you create a RNG generator but you didn't use it it to generate id and description why ?

I also recommend usage of a seedable RNG in benchmark to have more stable result between benchmark runs.

To get this you need change Cargo.toml:

rand = { version = "0.8", features = ["std_rng"] }

And change line 25 in:

let mut rng = rand::rngs::StdRng::seed_from_u64(42);

…stx-three

ahcm · 2021-06-16T13:34:31Z

I wonder why you do not have a FastxReader trait.
In general if you use Fastx trait, I would expect that a FastaRecord was enough.

morsecodist · 2021-06-16T13:56:12Z

@ahcm thanks for the comments

I wonder why you do not have a FastxReader trait.

The reader requires that an empty fasta/fastq record be passed in to be read into. For the EitherRecord type this is difficult because there is no natural empty value as it needs to be either one or the other. Also from my understanding the reader is only there for performance reasons because it is less convenient to use in all ways than the iterator, with the EitherRecord you will already need to build a new struct depending on the type so this performance benefit is negated. For the traits the reader would be unusable. The trait bound on it's own only enforces an interface that is agnostic to the record type and you need to pass in a specific record to the read function but there would be no way to enforce that. In the other PR the reader for both records was implemented for the shared reader type but this brings up ambiguity in the case where you are reading into the wrong record. Do you ignore the quality? Do you error? Do you have an empty slice for the quality? In general I felt that the ambiguity and difficulty did not justify the amount of utility this would bring.

In general if you use Fastx trait, I would expect that a FastaRecord was enough.

I am confused by this comment. How would you implement a function generic over both record types without the trait? In my filtration example we are filtering based on factors that are only in the fasta but we want to output a file in the same format as the input so if we converted to fasta that information would be lost.

ahcm · 2021-06-16T14:22:13Z

To me Fastx means, take Fasta or Fastq and expose the common fields (FastaRecord without quality).

You could have an independent FastxReader, that skips the quality and exposes a FastaRecord.
I expect the code would not be more than what you have now.

You could also modify FastaReader and FastqReader to have one generic record.
In that case I'd leave the qualities as empty slice in the Fasta case, but that's just my taste.
Both could then have the FastxReader trait.

Or you could go with a type parameter FastxReader or similar.

morsecodist · 2021-06-16T14:31:28Z

You could have an independent FastxReader, that skips the quality and exposes a FastaRecord.

With this approach you lose the ability to preserve the input data for the filtration use case which was what motivated me to add this in the first place and I am not sure what you gain in return.

You could also modify FastaReader and FastqReader to have one generic record.

You mean a generic subtype of fastx? This may introduce some issues with circular dependencies.

I could unify the reader by creating a struct for a fastx reader and implementing fasta and fastq read on it if you feel it would be useful. I personally never use the reader so this is an oversight on my part.

natir · 2021-06-22T12:10:54Z

benches/fastx.rs

+    let mut raw_writer = Vec::new();
+    {
+        let mut w = fasta::Writer::new(&mut raw_writer);
+        for _ in 0..FASTA_SIZE {
+            let id: String = StdRng::seed_from_u64(42)
+                .sample_iter(&Alphanumeric)
+                .take(ID_LEN)
+                .map(char::from)
+                .collect();
+
+            let desc: String = StdRng::seed_from_u64(4)
+                .sample_iter(&Alphanumeric)
+                .take(DESC_LEN)
+                .map(char::from)
+                .collect();
+
+            let seq: Vec<u8> = (0..SEQ_LEN)
+                .map(|_| {
+                    let idx = StdRng::seed_from_u64(65).gen_range(0..BASES.len());
+                    BASES[idx]
+                })
+                .collect();
+
+            w.write_record(&fasta::Record::with_attrs(&id, Some(&desc), &seq))?;
+        }


I propose an important simplification of this code:

let mut rng = StdRng::seed_from_u64(42); let mut raw_writer = Vec::new(); { let mut w = fasta::Writer::new(&mut raw_writer); for _ in 0..FASTA_SIZE { let id: String = (0..ID_LEN) .map(|_| char::from(rng.sample(&Alphanumeric))) .collect(); let desc: String = (0..DESC_LEN) .map(|_| char::from(rng.sample(&Alphanumeric))) .collect(); let seq: Vec<u8> = (0..SEQ_LEN) .map(|_| { let idx = rng.gen_range(0..BASES.len()); BASES[idx] }) .collect(); w.write_record(&fasta::Record::with_attrs(&id, Some(&desc), &seq))?; }

Another advantage, of this code id, desc and sequence is different between each read, but file produce is same between run.

natir · 2021-06-22T12:13:46Z

@ahcm remark is intresting, but I think we can merge this PR now, and add modification in future. It's already a nice change.

Except my small simplification in benchmark, I think we can merge this PR.

morsecodist · 2021-06-28T00:01:49Z

@natir made your proposed change. I also trialed a new version that should have been able to get performance parity between EitherRecord and normal records and while doing it I realized that the benchmarks are unreliable and I haven't really been able to make them reliable. Even on a dedicated server and with lots of parameter tweaking I can't get consistent results run to run. Given how hard it is to replicate performance degradation it is even possible we do have performance parity but either way the difference is very small.

ahcm · 2021-06-30T11:59:53Z

So I looked around a little and found seq_io:
https://github.com/markschl/seq_io

Not only does it have a common interface but it is also very low overhead.
While a lot of performance will be lost the more you actually work with the sequence,
it will be useful for summary tools and also if you want to copy raw data (with possible newlines).

I also made a little library on my own in that direction:
https://github.com/ahcm/fastx

I'm still not very familiar with rust, especially on designing interfaces for libraries,
so this is also practice for me.

Maybe this gives you some ideas.
This approach also lets itself to SIMD reader techniques, I'm just trying to figure out how to use the NEON operations in sensible way.

I guess, at least with the SIMD based reader I would like you to be able to use or copy it
or at least use the ideas and parts of the code to improve performance.

Best
Andy

natir · 2022-01-10T16:52:29Z

Hi @morsecodist , sorry I take to much time to recheck this pull request, many thanks for your work.

From my point of view this code is functional, usable and performant.

The code is correctly written, except for some small issue reporte by clippy (version 0.1.57) but nothing too serious, the code is tested and benchmarked correctly (lacks a benchmark for fastq files). The documentation seems clear to me.

Then concerning the choices of implementation I don't really have an opinion. @ahcm seems say that the either interface is less efficient but the benchmark on my machine contradicts it. In fact on my machine the fastx code seems faster than the classic fasta parser.

parser	check (ns/iter)	count (ns/iter)
fasta	20,258	21,385
fastx_fasta	19,083	19,887
fastx_either	19,908	19,437

I think we can merge it without any problem. After this interface will we generate trouble in the future, required a lot of maintenance, I do not have the answer.

For me it can be merged but since it's a big change I'd like another @rust-bio/reviewers to approve it before merge.

morsecodist added 9 commits May 18, 2021 10:33

Support dealing with fasta and fastq files interchangably

3ca8ffb

format

9f7949e

WIP

d819268

remove qual

c9b0e40

good spot

03a823a

good spot

e8fdca1

good spot

25f551a

donish

8533c96

breaking one

6c8155a

natir reviewed May 31, 2021

View reviewed changes

src/io/fastx.rs Outdated Show resolved Hide resolved

natir reviewed May 31, 2021

View reviewed changes

src/io/fastx.rs Outdated Show resolved Hide resolved

natir reviewed May 31, 2021

View reviewed changes

src/io/fastx.rs Outdated Show resolved Hide resolved

morsecodist added 6 commits May 31, 2021 11:42

good spot

becb201

unify messages:

9017747

good spot

a56c0e9

more examples

a702c97

more docs

b3bfe7f

cleanup

a8e9693

morsecodist added 4 commits May 31, 2021 18:19

conversion

98cb702

rework examples

50f9111

more docs

e26bda9

more niceness

7601c51

morsecodist added 5 commits June 5, 2021 17:24

good spot

76085b4

separate error types

d1d4b06

good spot

05c7a24

swap names

e017aa2

renaming

bc52c6f

add note on perf

c39a8de

morsecodist marked this pull request as ready for review June 9, 2021 13:25

Merge branch 'master' into fastx-three

b5f0a9a

morsecodist changed the title ~~FASTA/FASTQ unification draft~~ FASTA/FASTQ unification Jun 9, 2021

natir reviewed Jun 10, 2021

View reviewed changes

morsecodist added 3 commits June 10, 2021 08:29

seeded rng

74c9b2a

Merge branch 'fastx-three' of github.com:morsecodist/rust-bio into fa…

19b4277

…stx-three

make benchmark more sensible

f6ed77e

morsecodist requested a review from natir June 13, 2021 12:25

natir reviewed Jun 22, 2021

View reviewed changes

comment address

2f2f9a4

morsecodist requested a review from natir June 28, 2021 00:00

fix bench

ee261fe

tedil mentioned this pull request Jul 22, 2021

feat: Generic record #442

Open

morsecodist added 2 commits December 28, 2021 20:25

merge

53d7bc1

resolve typing issues

9920557

morsecodist changed the title ~~FASTA/FASTQ unification~~ feat: FASTA/FASTQ unification Dec 29, 2021

morsecodist added 4 commits December 28, 2021 22:26

format

3763408

coverage

d118fd8

format

264095c

coverage

4382282

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: FASTA/FASTQ unification #433

feat: FASTA/FASTQ unification #433

morsecodist commented May 31, 2021 •

edited

natir commented May 31, 2021

morsecodist commented May 31, 2021

natir commented Jun 2, 2021

coveralls commented Jun 8, 2021 •

edited

natir Jun 10, 2021 •

edited

ahcm commented Jun 16, 2021

morsecodist commented Jun 16, 2021

ahcm commented Jun 16, 2021

morsecodist commented Jun 16, 2021

natir Jun 22, 2021

natir commented Jun 22, 2021

morsecodist commented Jun 28, 2021

ahcm commented Jun 30, 2021

natir commented Jan 10, 2022 •

edited

feat: FASTA/FASTQ unification #433

Are you sure you want to change the base?

feat: FASTA/FASTQ unification #433

Conversation

morsecodist commented May 31, 2021 • edited

Description

Utilities

Typing

Example

Dynamically Dealing with Either Type

Example

Quality Info

natir commented May 31, 2021

morsecodist commented May 31, 2021

natir commented Jun 2, 2021

coveralls commented Jun 8, 2021 • edited

natir Jun 10, 2021 • edited

Choose a reason for hiding this comment

ahcm commented Jun 16, 2021

morsecodist commented Jun 16, 2021

ahcm commented Jun 16, 2021

morsecodist commented Jun 16, 2021

natir Jun 22, 2021

Choose a reason for hiding this comment

natir commented Jun 22, 2021

morsecodist commented Jun 28, 2021

ahcm commented Jun 30, 2021

natir commented Jan 10, 2022 • edited

morsecodist commented May 31, 2021 •

edited

coveralls commented Jun 8, 2021 •

edited

natir Jun 10, 2021 •

edited

natir commented Jan 10, 2022 •

edited