
feat!: Improve bcf Record filter interface and improve docs #306

Merged
merged 10 commits into rust-bio:master on Jul 6, 2021

Conversation

mbhall88
Member

This is possibly a controversial change. I find having to interact with Id whenever trying to deal with bcf Record filters really counterintuitive and convoluted.

For example, if I want to check whether a record has some filter q10 I have to do something like

let id = record.header().name_to_id(b"q10").unwrap();
if record.has_filter(id) {
    // do something
}

and this gets worse if I want to push a collection of filters as I have to get their internal Id from the header for each. This seems to be exposing internal things that needn't be exposed IMO.

The changes I am suggesting here instead allow you to do things like

if record.has_filter(b"q10") {
    // do something
}
// or
record.set_filters(&[b"q10", b"frs90"])?;

which seems much more ergonomic?

I've also added thorough documentation for all touched methods - plus for the format method, which had no docs.

Happy to discuss and refine these proposed changes.

@coveralls

coveralls commented May 14, 2021

Pull Request Test Coverage Report for Build 1003750613

  • 109 of 109 (100.0%) changed or added relevant lines in 2 files are covered.
  • 4 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.1%) to 93.543%

Files with Coverage Reduction | New Missed Lines | %
src/bcf/header.rs | 2 | 84.18%
src/bcf/mod.rs | 2 | 89.02%

Totals
Change from base Build 1003749437: 0.1%
Covered Lines: 11503
Relevant Lines: 12297

💛 - Coveralls

@dlaehnemann
Member

Nice work.

And I really like the idea of providing the filter names, not the ids. Making this library more user-friendly is definitely important!

One possible downside to these changes:
If you repeatedly apply the same filter function on lots of records in a loop, would repeatedly calling name_to_id() result in much of a performance hit compared to looking the ids up once before starting the loop? Or do we expect cargo to clean this up for us during compilation? Maybe a small performance test could give an indication?

Comment on lines 481 to 483
/// record.remove_filter(b"bar", true).unwrap();
/// assert!(!record.has_filter(b"bar"));
/// assert!(record.has_filter(b"PASS"));

Maybe also add another example with pass_on_empty=false?
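A sketch of what that extra example could assert (assuming pass_on_empty=false leaves the FILTER field unset rather than falling back to PASS; the exact assertions would need checking against the implementation):

/// record.remove_filter(b"bar", false).unwrap();
/// assert!(!record.has_filter(b"bar"));
/// assert!(!record.has_filter(b"PASS"));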

/// #
/// # // Write uncompressed VCF to stdout with above header and get an empty record
/// # let mut vcf = Writer::from_stdout(&header, true, Format::VCF).unwrap();
/// # let mut record = vcf.empty_record();

Maybe make this line visible in the docs, so that it is clear that record is an empty_record() to start with?

Suggested change
/// # let mut record = vcf.empty_record();
/// let mut record = vcf.empty_record();

/// # Errors
/// **Attention:** the returned [`BufferBacked`] from [`integer()`](Format::integer)
/// (`read_depths`), which holds the data, has to be kept in scope as long as the data is
/// accessed. If parts of the data are accessed after the `BufferBacked` object is been

Typo:

Suggested change
/// accessed. If parts of the data are accessed after the `BufferBacked` object is been
/// accessed. If parts of the data are accessed after the `BufferBacked` object has been

///
/// # Errors
/// **Attention:** the returned [`BufferBacked`] from [`integer()`](Format::integer)
/// (`read_depths`), which holds the data, has to be kept in scope as long as the data is

What does "has to be kept in scope" mean for the user? Do I have to manually make sure that it is in scope? Or will the lifetime of the resulting buffer do this for me and have cargo complain if this is not ensured?
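To make the question concrete, I imagine usage like this (a hypothetical fragment, assuming a record read from a BCF with a FORMAT/DP field and that the BufferBacked can be indexed for the per-sample slices):

// `read_depths` owns the buffer returned by htslib; the per-sample slices
// obtained from it borrow that buffer.
let read_depths = record.format(b"DP").integer()?;
let sample1_depth = read_depths[0]; // borrows from `read_depths`
println!("sample 1 depth: {:?}", sample1_depth);
// `read_depths` must stay alive until after this last use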

assert!(!record.has_filter(b"foo"));
assert!(record.has_filter(b"bar"));
assert!(record.remove_filter(b"baz", true).is_err());
record.remove_filter(b"bar", true).unwrap();

As above, I'd also add the false case.

/// - `flt_ids` - The identifiers of the filter values to set.
pub fn set_filters(&mut self, flt_ids: &[Id]) {
let mut flt_ids: Vec<i32> = flt_ids.iter().map(|x| **x as i32).collect();
pub fn set_filters(&mut self, flt_ids: &[&[u8]]) -> Result<()> {

Maybe rename flt_ids to something like flt_names (as in name_to_id())? Although ID refers to the respective field where the filter name is given in the header, it also refers to the number assigned to that filter name internally. So maybe flt_names is less ambiguous?
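To make the rename concrete, it would just be (illustrative):

Suggested change
pub fn set_filters(&mut self, flt_ids: &[&[u8]]) -> Result<()> {
pub fn set_filters(&mut self, flt_names: &[&[u8]]) -> Result<()> {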

@mbhall88
Member Author

One possible downside to these changes:
If you repeatedly apply the same filter function on lots of records in a loop, would repeatedly calling name_to_id() result in much of a performance hit compared to looking the ids up once before starting the loop? Or do we expect cargo to clean this up for us during compilation? Maybe a small performance test could give an indication?

This is a very good point and I feel ashamed I didn't think of this.
To that end, here is a benchmark, using Criterion, for each of the changed methods.

Benchmark of the implementation in this PR

remove_filter foo       time:   [40.751 ns 40.850 ns 40.964 ns]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

push_filter foo bar     time:   [77.596 ns 77.743 ns 77.959 ns]
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

has_filter foo bar      time:   [80.197 ns 80.429 ns 80.651 ns]
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

set_filters foo bar     time:   [139.59 ns 140.16 ns 140.74 ns]
Found 32 outliers among 100 measurements (32.00%)
  17 (17.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  10 (10.00%) high severe

Benchmark with previous implementation

remove_filter foo       time:   [5.5482 ns 5.5598 ns 5.5720 ns]
                        change: [-86.433% -86.384% -86.336%] (p = 0.00 < 0.01)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

push_filter foo bar     time:   [9.3092 ns 9.3354 ns 9.3692 ns]
                        change: [-88.120% -88.038% -87.978%] (p = 0.00 < 0.01)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

has_filter foo bar      time:   [4.5153 ns 4.5327 ns 4.5515 ns]
                        change: [-94.370% -94.345% -94.320%] (p = 0.00 < 0.01)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

set_filters foo bar     time:   [22.839 ns 22.933 ns 23.032 ns]
                        change: [-83.712% -83.642% -83.568%] (p = 0.00 < 0.01)
                        Performance has improved.

So there is definitely a hit to performance for this convenience.
For some context, if we want to quantify this in "real" terms, this difference, over 1,000,000 VCF entries - that is, calling the relevant method on each of the 1 million records - amounts to:

  • remove_filter: 0.035 seconds difference
  • push_filter: 0.068 seconds difference
  • has_filter: 0.076 seconds difference
  • set_filters: 0.118 seconds difference
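For instance, the remove_filter figure follows directly from the mean timings above:

(40.850 ns - 5.560 ns) × 1,000,000 records ≈ 35.3 × 10^6 ns ≈ 0.035 s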

The benchmark code can be found at https://github.com/mbhall88/rust-htslib/blob/3be94eb3a55cbd800d055e7fd733e47d1974f95d/benches/bcf.rs

I guess this is not surprising, as we are adding an extra lookup to each call. htslib's name2id method basically just looks the name up in a hash table, so retrieval is pretty quick - hence the small drop in "real" performance. However, it is a drop nonetheless.

There are a few ways forward from here (that I see):

  1. The results are fine and we proceed with the changes as is.
  2. The results are not fine and we should just leave everything as is, albeit with better documentation for these methods - i.e. showing how to get the Id for a filter name before passing it to these methods.
  3. Change Id to a trait that can seamlessly handle i32 and &[u8], so that either can be used with all of these methods (see the sketch below).
  4. We look into some mechanism for caching the name2id conversion. This is done in the C code already, but we could look at moving it "closer". This does not sound like an easy/good idea IMO though.
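To make option 3 concrete, a minimal sketch of what such a trait could look like (illustrative only, reusing the existing HeaderView::name_to_id and the Id newtype; names and signatures are not final):

use rust_htslib::bcf::header::{HeaderView, Id};
use rust_htslib::errors::Result;

/// Anything that can be resolved to a numeric FILTER Id via the header.
pub trait FilterId {
    fn id_from_header(&self, header: &HeaderView) -> Result<Id>;
}

impl FilterId for [u8] {
    // A filter name needs a (fallible) lookup in the header dictionary.
    fn id_from_header(&self, header: &HeaderView) -> Result<Id> {
        header.name_to_id(self)
    }
}

impl FilterId for Id {
    // A numeric Id is already resolved, so this is just a copy.
    fn id_from_header(&self, _header: &HeaderView) -> Result<Id> {
        Ok(*self)
    }
}

The filter methods could then be generic over something like T: FilterId + ?Sized and accept either form.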

@dlaehnemann
Member

Thanks for so thoroughly setting up a benchmark!

On the one hand, the time lost on 1 million records is not much, and BCFs will usually be limited to millions of records (or maybe billions, but not more), simply by the size of most genomes. On the other hand, the relative increase in runtime seems to be about 8x the original runtime, and that does seem like a bigger hit if you handle lots of big BCFs.

I like your list of options and, with the above considerations, I would opt for 3, including doctests that explain the performance difference, reference each other across the two trait implementations, and recommend a one-time conversion with name_to_id before jumping into record loops. This way, we get the best of both worlds: easy usage if you just want to look at a few records or do the lookup rarely, and the possibility to still keep the original performance benefit. And option 3 sounds like the Rusty way of doing this... ;)
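A sketch of the loop pattern that recommendation boils down to (illustrative only, assuming the trait-based methods from option 3 plus a made-up file path and filter name):

use rust_htslib::bcf::{Read, Reader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut bcf = Reader::from_path("in.bcf")?; // illustrative path
    // Convert the filter name to its numeric Id once, outside the loop...
    let q10 = bcf.header().name_to_id(b"q10")?;
    for result in bcf.records() {
        let record = result?;
        // ...and reuse the Id for every record, avoiding a hash lookup per call.
        if record.has_filter(&q10) {
            // do something with the record
        }
    }
    Ok(())
}

Inside the loop only the trait's trivial Id pass-through remains, so the per-record cost stays close to the original implementation.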

@mbhall88
Member Author

mbhall88 commented May 18, 2021

Ok, so I implemented a trait, FilterId, to "seamlessly" handle either an Id or a byte slice.

Benchmark against original implementation

This benchmark used Ids, so it is a fair comparison with the original implementation. Essentially, it shows the overhead of wrapping Id in a trait.

remove_filter foo       time:   [11.376 ns 11.396 ns 11.419 ns]
                        change: [+104.18% +104.81% +105.39%] (p = 0.00 < 0.01)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

push_filter foo bar     time:   [20.211 ns 20.288 ns 20.419 ns]
                        change: [+116.50% +117.87% +120.22%] (p = 0.00 < 0.01)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

has_filter foo bar      time:   [22.907 ns 23.491 ns 24.191 ns]
                        change: [+399.75% +404.91% +410.82%] (p = 0.00 < 0.01)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe

set_filters foo bar     time:   [59.954 ns 60.110 ns 60.277 ns]
                        change: [+160.95% +161.99% +163.05%] (p = 0.00 < 0.01)
                        Performance has regressed.

So the trait has an overhead (on Id) of a 2-4x increase in runtime. In "real" terms, for our 1 million records example, this would be a difference of 0.019 seconds in the worst case (has_filter) and 0.0051 seconds in the best case (remove_filter).

Benchmark with FilterId trait using bytes against using Id (wrapped in trait)

This is benchmarking the current implementation, but using bytes instead of Id, and comparing it to the run above where we used Id wrapped in the FilterId trait.

remove_filter foo       time:   [41.550 ns 41.844 ns 42.211 ns]
                        change: [+263.27% +266.37% +270.57%] (p = 0.00 < 0.01)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

push_filter foo bar     time:   [88.090 ns 89.204 ns 90.115 ns]
                        change: [+312.45% +318.74% +324.95%] (p = 0.00 < 0.01)
                        Performance has regressed.

has_filter foo bar      time:   [97.352 ns 97.737 ns 98.123 ns]
                        change: [+366.49% +370.30% +374.07%] (p = 0.00 < 0.01)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

set_filters foo bar     time:   [70.452 ns 70.848 ns 71.286 ns]
                        change: [+17.932% +18.898% +20.262%] (p = 0.00 < 0.01)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

This performance is marginally slower (~2-9ns on average) than the original pure bytes implementation of this PR - i.e. the overhead of wrapping bytes in a trait.

If we want to go ahead with this trait implementation I will update the docs to discuss the performance.

@dlaehnemann
Member

I would have thought that the trait implementation shouldn't cause any overhead (e.g. see here). So maybe the devil is in the detail (of the implementation) somewhere?

@@ -79,6 +79,30 @@ impl NumericUtils for i32 {
}
}

/// A trait to allow for seamless use of bytes or integer identifiers for filters
pub trait FilterId {
fn id_from_header(&self, header: &HeaderView) -> Result<Id>;

As none of the function implementations can return an error, I don't think wrapping this in a Result is needed:

Suggested change
fn id_from_header(&self, header: &HeaderView) -> Result<Id>;
fn id_from_header(&self, header: &HeaderView) -> Id;

Maybe this is what is causing the overhead for Id? Because we only need the unpacking after using name_to_id(), so maybe move the unpacking into the implementation for [u8]?
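If I am reading that right, the variant would look roughly like this (hypothetical sketch; note that the [u8] impl would then have to unwrap internally, since name_to_id() itself returns a Result):

use rust_htslib::bcf::header::{HeaderView, Id};

pub trait FilterId {
    fn id_from_header(&self, header: &HeaderView) -> Id;
}

impl FilterId for [u8] {
    fn id_from_header(&self, header: &HeaderView) -> Id {
        // the unpacking moves in here; an unknown filter name would now
        // panic instead of surfacing as an error to the caller
        header.name_to_id(self).expect("unknown FILTER name")
    }
}

impl FilterId for Id {
    fn id_from_header(&self, _header: &HeaderView) -> Id {
        *self
    }
}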

Ok(*self)
}
fn is_pass(&self) -> bool {
*self == Id(0)

In the previous code, this was simply checking *self == 0 (*flt_id == 0) without the Id() wrapping. Is this definitely needed, or could this be contributing to the overhead as well?

@mbhall88
Member Author

I'm off on holidays for three weeks so won't be able to dig into this for a while. Please feel free to try out some things on this PR branch if you like.

@mbhall88
Member Author

Hmmm, I can't really find a way to remove the performance cost. I guess I'll just leave this PR here in case someone with better Rust skills than mine - or more time to dedicate to it - can find a way.

@johanneskoester
Contributor

To me the difference sounds quite marginal. Given the improved ergonomics, I think we can just keep it as it is and merge? Is there still anything to do here before merging?

@mbhall88
Member Author

mbhall88 commented Jul 5, 2021

To me the difference sounds quite marginal. Given the improved ergonomics, I think we can just keep it as it is and merge? Is there still anything to do here before merging?

I think it should be all fine. There are a couple of conflicts due to #309, I think, but they should only be minor. The changelog entry also needs a minor change: it says the filter functions take bytes instead of Id, which is no longer true, as they now accept either. I'll make that change once the conflicts are resolved.

It's all good to go

@johanneskoester changed the title from "Improve bcf Record filter interface and improve docs" to "feat!: Improve bcf Record filter interface and improve docs" on Jul 6, 2021
@johanneskoester merged commit f45e91d into rust-bio:master on Jul 6, 2021