New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: FASTA/FASTQ unification #433
base: master
Are you sure you want to change the base?
Conversation
Hi nice work, I didn't read all stuff in detail but I made some comment on something I see. |
Added some more docs/examples and made some improvements. Going to add more tests and clean up the docs a touch but I think it is coming together. Any thoughts on some of the high level/value judgement pieces? |
Some generic comment:
I made a small check in other
Maybe you can add a small benchmark to check if fastx layer have an impact on performance or not. |
benches/fastx.rs
Outdated
let mut rng = thread_rng(); | ||
for _ in 0..FASTA_SIZE { | ||
let id: String = thread_rng() | ||
.sample_iter(&Alphanumeric) | ||
.take(ID_LEN) | ||
.map(char::from) | ||
.collect(); | ||
|
||
let desc: String = thread_rng() | ||
.sample_iter(&Alphanumeric) | ||
.take(DESC_LEN) | ||
.map(char::from) | ||
.collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait you create a RNG generator but you didn't use it it to generate id
and description
why ?
I also recommend usage of a seedable RNG in benchmark to have more stable result between benchmark runs.
To get this you need change Cargo.toml:
rand = { version = "0.8", features = ["std_rng"] }
And change line 25 in:
let mut rng = rand::rngs::StdRng::seed_from_u64(42);
I wonder why you do not have a FastxReader trait. |
@ahcm thanks for the comments
The reader requires that an empty fasta/fastq record be passed in to be read into. For the
I am confused by this comment. How would you implement a function generic over both record types without the trait? In my filtration example we are filtering based on factors that are only in the fasta but we want to output a file in the same format as the input so if we converted to fasta that information would be lost. |
To me Fastx means, take Fasta or Fastq and expose the common fields (FastaRecord without quality). You could have an independent FastxReader, that skips the quality and exposes a FastaRecord. You could also modify FastaReader and FastqReader to have one generic record. Or you could go with a type parameter FastxReader or similar. |
With this approach you lose the ability to preserve the input data for the filtration use case which was what motivated me to add this in the first place and I am not sure what you gain in return.
You mean a generic subtype of fastx? This may introduce some issues with circular dependencies. I could unify the reader by creating a struct for a fastx reader and implementing fasta and fastq read on it if you feel it would be useful. I personally never use the reader so this is an oversight on my part. |
benches/fastx.rs
Outdated
let mut raw_writer = Vec::new(); | ||
{ | ||
let mut w = fasta::Writer::new(&mut raw_writer); | ||
for _ in 0..FASTA_SIZE { | ||
let id: String = StdRng::seed_from_u64(42) | ||
.sample_iter(&Alphanumeric) | ||
.take(ID_LEN) | ||
.map(char::from) | ||
.collect(); | ||
|
||
let desc: String = StdRng::seed_from_u64(4) | ||
.sample_iter(&Alphanumeric) | ||
.take(DESC_LEN) | ||
.map(char::from) | ||
.collect(); | ||
|
||
let seq: Vec<u8> = (0..SEQ_LEN) | ||
.map(|_| { | ||
let idx = StdRng::seed_from_u64(65).gen_range(0..BASES.len()); | ||
BASES[idx] | ||
}) | ||
.collect(); | ||
|
||
w.write_record(&fasta::Record::with_attrs(&id, Some(&desc), &seq))?; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose an important simplification of this code:
let mut rng = StdRng::seed_from_u64(42);
let mut raw_writer = Vec::new();
{
let mut w = fasta::Writer::new(&mut raw_writer);
for _ in 0..FASTA_SIZE {
let id: String = (0..ID_LEN)
.map(|_| char::from(rng.sample(&Alphanumeric)))
.collect();
let desc: String = (0..DESC_LEN)
.map(|_| char::from(rng.sample(&Alphanumeric)))
.collect();
let seq: Vec<u8> = (0..SEQ_LEN)
.map(|_| {
let idx = rng.gen_range(0..BASES.len());
BASES[idx]
})
.collect();
w.write_record(&fasta::Record::with_attrs(&id, Some(&desc), &seq))?;
}
Another advantage, of this code id, desc and sequence is different between each read, but file produce is same between run.
@ahcm remark is intresting, but I think we can merge this PR now, and add modification in future. It's already a nice change. Except my small simplification in benchmark, I think we can merge this PR. |
@natir made your proposed change. I also trialed a new version that should have been able to get performance parity between |
So I looked around a little and found seq_io: Not only does it have a common interface but it is also very low overhead. I also made a little library on my own in that direction: I'm still not very familiar with rust, especially on designing interfaces for libraries, Maybe this gives you some ideas. I guess, at least with the SIMD based reader I would like you to be able to use or copy it Best |
Hi @morsecodist , sorry I take to much time to recheck this pull request, many thanks for your work. From my point of view this code is functional, usable and performant. The code is correctly written, except for some small issue reporte by clippy (version 0.1.57) but nothing too serious, the code is tested and benchmarked correctly (lacks a benchmark for fastq files). The documentation seems clear to me. Then concerning the choices of implementation I don't really have an opinion. @ahcm seems say that the either interface is less efficient but the benchmark on my machine contradicts it. In fact on my machine the fastx code seems faster than the classic fasta parser.
I think we can merge it without any problem. After this interface will we generate trouble in the future, required a lot of maintenance, I do not have the answer. For me it can be merged but since it's a big change I'd like another @rust-bio/reviewers to approve it before merge. |
Description
Implements types and utilities for dealing with FASTA and FASTQ files interchangeably. Completely backwards compatible. There are a few elements to this:
Utilities
Basic utility functions for determining whether data is in the FASTA or FASTQ format are implemented. Also an emum for FASTA/FASTQ types is implemented.
Typing
Interfaces are defined for working with FASTA and FASTQ data interchangeably. These allow you to implement functions for
fasta::Record
andfastq::Record
generically using type bounds. In this case the type is known at compile time and performance is equivalent to implementations on eitherfasta::Record
orfastq::Record
.Example
The motivating example here is implementing functions for both FASTA and FASTQ data like this filtration based on length:
Dynamically Dealing with Either Type
For a small performance cost (5-10)% you can also deal with these types interchangeably at runtime with
EitherRecord
andEitherRecords
. Any program can be implemented using just the zero-cost approaches described above but for developers who care more for convenience than the absolute peak performance these may be of use.Example
In this example we can parse a file of unknown FASTA/FASTQ type and print information about each record seamlessly.
Quality Info
This PR contains extensive tests and increases overall test coverage. It also contains documented practical examples and documentation for top level functions. It adds a new benchmark to gauge the performance of it's new features:
In this benchmark two tasks are performed: (check and count) for each approach: (normal fasta parsing
fasta
, the generic approachfastx_fasta
, and the dynamic approacheither_fasta
). We can see there is no difference for the generic approach and a small slowdown for the dynamic.