
feat!: Improve bcf Record filter interface and improve docs #306

Merged
merged 10 commits into rust-bio:master on Jul 6, 2021

Conversation

mbhall88
Member

This is possibly a controversial change. I find having to interact with Id whenever trying to deal with bcf Record filters really counterintuitive and convoluted.

For example, if I want to check whether a record has some filter q10 I have to do something like

let id = record.header().name_to_id(b"q10").unwrap();
if record.has_filter(id) {
    // do something
}

and this gets worse if I want to push a collection of filters as I have to get their internal Id from the header for each. This seems to be exposing internal things that needn't be exposed IMO.

The changes I am suggesting here instead allow you to do things like

if record.has_filter(b"q10") {
    // do something
}
// or
record.set_filters(&[b"q10", b"frs90"])?;

which seems much more ergonomic?

I've also added thorough documentation for all touched methods - plus for the format method, which had no docs.

Happy to discuss and refine these proposed changes.

@coveralls

coveralls commented May 14, 2021

Pull Request Test Coverage Report for Build 1003750613

  • 109 of 109 (100.0%) changed or added relevant lines in 2 files are covered.
  • 4 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.1%) to 93.543%

Files with Coverage Reduction | New Missed Lines | %
src/bcf/header.rs | 2 | 84.18%
src/bcf/mod.rs | 2 | 89.02%

Totals
Change from base Build 1003749437: 0.1%
Covered Lines: 11503
Relevant Lines: 12297

💛 - Coveralls

@dlaehnemann
Member

Nice work.

And I really like the idea of providing the filter names, not the ids. Making this library more user-friendly is definitely important!

One possible downside to these changes:
If you repeatedly apply the same filter function on lots of records in a loop, would repeatedly calling name_to_id() result in much of a performance hit compared to looking the ids up once before starting the loop? Or do we expect cargo to clean this up for us during compilation? Maybe a small performance test could give an indication?

Comment on lines 481 to 483
/// record.remove_filter(b"bar", true).unwrap();
/// assert!(!record.has_filter(b"bar"));
/// assert!(record.has_filter(b"PASS"));

Maybe also add another example with pass_on_empty=false?
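A sketch of what that extra example could assert (assuming pass_on_empty=false leaves the FILTER field unset rather than falling back to PASS; the exact assertions would need checking against the implementation):

/// record.remove_filter(b"bar", false).unwrap();
/// assert!(!record.has_filter(b"bar"));
/// assert!(!record.has_filter(b"PASS"));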

/// #
/// # // Write uncompressed VCF to stdout with above header and get an empty record
/// # let mut vcf = Writer::from_stdout(&header, true, Format::VCF).unwrap();
/// # let mut record = vcf.empty_record();

Maybe make this line visible in the docs, so that it is clear that record is an empty_record() to start with?

Suggested change
/// # let mut record = vcf.empty_record();
/// let mut record = vcf.empty_record();

/// # Errors
/// **Attention:** the returned [`BufferBacked`] from [`integer()`](Format::integer)
/// (`read_depths`), which holds the data, has to be kept in scope as long as the data is
/// accessed. If parts of the data are accessed after the `BufferBacked` object is been

Typo:

Suggested change
/// accessed. If parts of the data are accessed after the `BufferBacked` object is been
/// accessed. If parts of the data are accessed after the `BufferBacked` object has been

///
/// # Errors
/// **Attention:** the returned [`BufferBacked`] from [`integer()`](Format::integer)
/// (`read_depths`), which holds the data, has to be kept in scope as long as the data is

What does "has to be kept in scope" mean for the user? Do I have to manually make sure that it is in scope? Or will the lifetime of the resulting buffer do this for me and have cargo complain if this is not ensured?
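To make the question concrete, I imagine usage like this (a hypothetical fragment, assuming a record read from a BCF with a FORMAT/DP field and that the BufferBacked can be indexed for the per-sample slices):

// `read_depths` owns the buffer returned by htslib; the per-sample slices
// obtained from it borrow that buffer.
let read_depths = record.format(b"DP").integer()?;
let sample1_depth = read_depths[0]; // borrows from `read_depths`
println!("sample 1 depth: {:?}", sample1_depth);
// `read_depths` must stay alive until after this last use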

assert!(!record.has_filter(b"foo"));
assert!(record.has_filter(b"bar"));
assert!(record.remove_filter(b"baz", true).is_err());
record.remove_filter(b"bar", true).unwrap();

As above, I'd also add the false case.

/// - `flt_ids` - The identifiers of the filter values to set.
pub fn set_filters(&mut self, flt_ids: &[Id]) {
let mut flt_ids: Vec<i32> = flt_ids.iter().map(|x| **x as i32).collect();
pub fn set_filters(&mut self, flt_ids: &[&[u8]]) -> Result<()> {

Maybe rename flt_ids to something like flt_names (as in name_to_id())? Although ID refers to the respective field where the filter name is given in the header, it also refers to the number assigned to that filter name internally. So maybe flt_names is less ambiguous?
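To make the rename concrete, it would just be (illustrative):

Suggested change
pub fn set_filters(&mut self, flt_ids: &[&[u8]]) -> Result<()> {
pub fn set_filters(&mut self, flt_names: &[&[u8]]) -> Result<()> {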

@mbhall88
Member Author

One possible downside to these changes:
If you repeatedly apply the same filter function on lots of records in a loop, would repeatedly calling name_to_id() result in much of a performance hit compared to looking the ids up once before starting the loop? Or do we expect cargo to clean this up for us during compilation? Maybe a small performance test could give an indication?

This is a very good point and I feel ashamed I didn't think of this.
To that end, here is a benchmark, using Criterion, for each of the changed methods.

Benchmark of the implementation in this PR

remove_filter foo       time:   [40.751 ns 40.850 ns 40.964 ns]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

push_filter foo bar     time:   [77.596 ns 77.743 ns 77.959 ns]
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

has_filter foo bar      time:   [80.197 ns 80.429 ns 80.651 ns]
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

set_filters foo bar     time:   [139.59 ns 140.16 ns 140.74 ns]
Found 32 outliers among 100 measurements (32.00%)
  17 (17.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  10 (10.00%) high severe

Benchmark with previous implementation

remove_filter foo       time:   [5.5482 ns 5.5598 ns 5.5720 ns]
                        change: [-86.433% -86.384% -86.336%] (p = 0.00 < 0.01)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

push_filter foo bar     time:   [9.3092 ns 9.3354 ns 9.3692 ns]
                        change: [-88.120% -88.038% -87.978%] (p = 0.00 < 0.01)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

has_filter foo bar      time:   [4.5153 ns 4.5327 ns 4.5515 ns]
                        change: [-94.370% -94.345% -94.320%] (p = 0.00 < 0.01)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

set_filters foo bar     time:   [22.839 ns 22.933 ns 23.032 ns]
                        change: [-83.712% -83.642% -83.568%] (p = 0.00 < 0.01)
                        Performance has improved.

So there is definitely a hit to performance for this convenience.
For some context, if we want to quantify this in "real" terms, this difference, over 1,000,000 VCF entries - that is, calling the relevant method on each of the 1 million records - amounts to:

  • remove_filter: 0.035 seconds difference
  • push_filter: 0.068 seconds difference
  • has_filter: 0.076 seconds difference
  • set_filters: 0.118 seconds difference
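For instance, the remove_filter figure follows directly from the mean timings above:

(40.850 ns - 5.560 ns) × 1,000,000 records ≈ 35.3 × 10^6 ns ≈ 0.035 s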

The benchmark code can be found at https://github.com/mbhall88/rust-htslib/blob/3be94eb3a55cbd800d055e7fd733e47d1974f95d/benches/bcf.rs

I guess this is not surprising, as we are adding an extra lookup to each call. htslib's name2id method basically just looks the name up in a hash table, so retrieval is pretty quick - hence the small drop in "real" performance. However, it is a drop nonetheless.

There are a few ways forward from here (that I see):

  1. The results are fine and we proceed with the changes as is.
  2. The results are not fine and we should just leave everything as is, albeit with better documentation for these methods - i.e. showing how to get the Id for a filter name before passing it to these methods.
  3. Change Id to a trait that can seamlessly handle i32 and &[u8], so that either can be used with all of these methods (see the sketch below).
  4. We look into some mechanism for caching the name2id conversion. This is done in the C code already, but we could look at moving it "closer". This does not sound like an easy/good idea IMO though.
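To make option 3 concrete, a minimal sketch of what such a trait could look like (illustrative only, reusing the existing HeaderView::name_to_id and the Id newtype; names and signatures are not final):

use rust_htslib::bcf::header::{HeaderView, Id};
use rust_htslib::errors::Result;

/// Anything that can be resolved to a numeric FILTER Id via the header.
pub trait FilterId {
    fn id_from_header(&self, header: &HeaderView) -> Result<Id>;
}

impl FilterId for [u8] {
    // A filter name needs a (fallible) lookup in the header dictionary.
    fn id_from_header(&self, header: &HeaderView) -> Result<Id> {
        header.name_to_id(self)
    }
}

impl FilterId for Id {
    // A numeric Id is already resolved, so this is just a copy.
    fn id_from_header(&self, _header: &HeaderView) -> Result<Id> {
        Ok(*self)
    }
}

The filter methods could then be generic over something like T: FilterId + ?Sized and accept either form.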

@dlaehnemann
Member

Thanks for so thoroughly setting up a benchmark!

On the one hand, the time lost on 1 million records is not much, and BCFs will usually be limited to millions of records (or maybe billions, but not more), simply by the size of most genomes. On the other hand, the relative increase in runtime seems to be about 8x the original runtime, and that does seem like a bigger hit if you handle lots of big BCFs.

I like your list of options and, with the above considerations, I would opt for 3, including doctests that explain the performance difference, reference each other across the two trait implementations, and recommend a one-time conversion with name_to_id before jumping into record loops. This way, we get the best of both worlds: easy usage if you just want to look at a few records or do the lookup rarely, and the possibility to still keep the original performance benefit. And option 3 sounds like the Rusty way of doing this... ;)
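A sketch of the loop pattern that recommendation boils down to (illustrative only, assuming the trait-based methods from option 3 plus a made-up file path and filter name):

use rust_htslib::bcf::{Read, Reader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut bcf = Reader::from_path("in.bcf")?; // illustrative path
    // Convert the filter name to its numeric Id once, outside the loop...
    let q10 = bcf.header().name_to_id(b"q10")?;
    for result in bcf.records() {
        let record = result?;
        // ...and reuse the Id for every record, avoiding a hash lookup per call.
        if record.has_filter(&q10) {
            // do something with the record
        }
    }
    Ok(())
}

Inside the loop only the trait's trivial Id pass-through remains, so the per-record cost stays close to the original implementation.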

@mbhall88
Member Author

mbhall88 commented May 18, 2021

Ok, so I implemented a trait, FilterId, to "seamlessly" handle either an Id or a byte slice.

Benchmark against original implementation

This benchmark used Ids, so it is a fair comparison with the original implementation. Essentially, it shows the overhead of wrapping Id in a trait.

remove_filter foo       time:   [11.376 ns 11.396 ns 11.419 ns]
                        change: [+104.18% +104.81% +105.39%] (p = 0.00 < 0.01)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

push_filter foo bar     time:   [20.211 ns 20.288 ns 20.419 ns]
                        change: [+116.50% +117.87% +120.22%] (p = 0.00 < 0.01)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

has_filter foo bar      time:   [22.907 ns 23.491 ns 24.191 ns]
                        change: [+399.75% +404.91% +410.82%] (p = 0.00 < 0.01)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe

set_filters foo bar     time:   [59.954 ns 60.110 ns 60.277 ns]
                        change: [+160.95% +161.99% +163.05%] (p = 0.00 < 0.01)
                        Performance has regressed.

So the trait has an overhead (on Id) of a 2-4x increase in runtime. In "real" terms, for our 1 million records example, this would be a difference of 0.019 seconds in the worst case (has_filter) and 0.0051 seconds in the best case (remove_filter).

Benchmark with FilterId trait using bytes against using Id (wrapped in trait)

This is benchmarking the current implementation, but using bytes instead of Id, and comparing it to the run above where we used Id wrapped in the FilterId trait.

remove_filter foo       time:   [41.550 ns 41.844 ns 42.211 ns]
                        change: [+263.27% +266.37% +270.57%] (p = 0.00 < 0.01)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

push_filter foo bar     time:   [88.090 ns 89.204 ns 90.115 ns]
                        change: [+312.45% +318.74% +324.95%] (p = 0.00 < 0.01)
                        Performance has regressed.

has_filter foo bar      time:   [97.352 ns 97.737 ns 98.123 ns]
                        change: [+366.49% +370.30% +374.07%] (p = 0.00 < 0.01)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

set_filters foo bar     time:   [70.452 ns 70.848 ns 71.286 ns]
                        change: [+17.932% +18.898% +20.262%] (p = 0.00 < 0.01)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

This performance is marginally slower (~2-9ns on average) than the original pure bytes implementation of this PR - i.e. the overhead of wrapping bytes in a trait.

If we want to go ahead with this trait implementation I will update the docs to discuss the performance.

@dlaehnemann
Member

I would have thought that the trait implementation shouldn't cause any overhead (e.g. see here). So maybe the devil is in the detail (of the implementation) somewhere?

@@ -79,6 +79,30 @@ impl NumericUtils for i32 {
}
}

/// A trait to allow for seamless use of bytes or integer identifiers for filters
pub trait FilterId {
fn id_from_header(&self, header: &HeaderView) -> Result<Id>;

As none of the function implementations can return an error, I don't think wrapping this in a Result is needed:

Suggested change
fn id_from_header(&self, header: &HeaderView) -> Result<Id>;
fn id_from_header(&self, header: &HeaderView) -> Id;

Maybe this is what is causing the overhead for Id? Because we only need the unpacking after using name_to_id(), so maybe move the unpacking into the implementation for [u8]?
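If I am reading that right, the variant would look roughly like this (hypothetical sketch; note that the [u8] impl would then have to unwrap internally, since name_to_id() itself returns a Result):

use rust_htslib::bcf::header::{HeaderView, Id};

pub trait FilterId {
    fn id_from_header(&self, header: &HeaderView) -> Id;
}

impl FilterId for [u8] {
    fn id_from_header(&self, header: &HeaderView) -> Id {
        // the unpacking moves in here; an unknown filter name would now
        // panic instead of surfacing as an error to the caller
        header.name_to_id(self).expect("unknown FILTER name")
    }
}

impl FilterId for Id {
    fn id_from_header(&self, _header: &HeaderView) -> Id {
        *self
    }
}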

Ok(*self)
}
fn is_pass(&self) -> bool {
*self == Id(0)

In the previous code, this was simply checking *self == 0 (*flt_id == 0) without the Id() wrapping. Is this definitely needed, or could this be contributing to the overhead as well?

@mbhall88
Member Author

I'm off on holidays for three weeks so won't be able to dig into this for a while. Please feel free to try out some things on this PR branch if you like.

@mbhall88
Member Author

Hmmm, I can't really find a way to remove the performance cost. I guess I'll just leave this PR here in case someone with better Rust skills than mine - or more time to dedicate to it - can find a way.

@johanneskoester
Contributor

To me the difference sounds quite marginal. Given the improved ergonomics, I think we can just keep it as it is and merge? Is there still anything to do here before merging?

@mbhall88
Member Author

mbhall88 commented Jul 5, 2021

To me the difference sounds quite marginal. Given the improved ergonomics, I think we can just keep it as it is and merge? Is there still anything to do here before merging?

I think it should be all fine. There are a couple of conflicts due to #309, I think, but they should only be minor. The changelog entry also needs a minor change: it says the filter functions take bytes instead of Id, which is no longer true, as they now accept either. I'll make that change once the conflicts are resolved.

It's all good to go

@johanneskoester changed the title from "Improve bcf Record filter interface and improve docs" to "feat!: Improve bcf Record filter interface and improve docs" on Jul 6, 2021
@johanneskoester merged commit f45e91d into rust-bio:master on Jul 6, 2021