Refactor non-standard evaluation internals #74

sa-lee · 2019-11-15T00:47:51Z

Currently, this is done in a kludgy way (and I find it hard to remember what is done) and at the time I wrote didn't really know much about NSE/environments. This an attempt of expressing all of the accumulated technical debt and propose some fixes.

There are a few kludges:

The main engine behind doing NSE is the overscope_ranges() generic, which essentially wraps S4Vectors::as.env() then calls a rlang::new_data_mask() for creating a "tidy-evaluation" environment with the bottom environment being the input ranges and top the environment being input metadata. On the other end for some, when a verb is evaluated each quosure has their environment updated so all the generics from BiocGenerics, IRanges and S4Vectors are bound to the top level of the quosures environment. This is super wasteful and the wrong place to put them.

Grouping in the mask. Currently, we apply a transform where we create Lists from our groups and then evaluate our quosures with respect to the List. This works fairly well since most generics dispatch on the lists and provides major speedups for basic ops like mean etc. However, in the cases where there is not a generic, an operation will throw an error (in the case of summarise) or will be wasteful by creating multiple masks (like mutate). There should be checks in place that attempts to dispatch on the list, if that fails then convert the quosure to a function(?) then use an appropriate *apply function to call it. This seems to be way tricky to implement, so maybe there's a better approach? cc @lawremi

Lazy binding: Currently, functions like mutate can assign symbols into the mask, so computed columns can be computed on. This should be done lazily via delayedAssign.

The n() function. This is confusing for people as there's a name clash with dplyr::n() which will fail and causes certain functions like dplyr::count() etc not work. The n() function is also wasteful on grouping, we can really quickly compute the group counts directly from the indices. The n() function doesn't need to be exported but could be bound to the mask.

Proposed Fixes:

Overscope ranges should be renamed to reflect what it actually does, creates a mask for the ranges.
The generics should be bound to the top environment of the mask not each quosure
A .data pronoun should be installed in the mask, to enable programming NSE interfaces on top of plyranges
Refactor n to be bound to the mask, it should look outside it's scope to quickly compute the result rather than via environment look ups.
Better fail-safe mechanisms for grouping transformations that don't work.

Other ideas:

There's a tension between thinking of a Ranges as columns consisting of start, end, width and as a single Vector like object. Sometimes, you want both and this can make the API clunky see #47 and also the count_overlaps functions. Could there be a lazy binding to the Ranges in the mask? This could facilitate doing range-wise arithmetic all within mutate (without needing to use anchor first).

The text was updated successfully, but these errors were encountered:

sa-lee mentioned this issue Nov 15, 2019

summarize doesn't work with weighted mean #38

Closed

sa-lee pushed a commit that referenced this issue Dec 4, 2019

beginning refactor of nse semantics for #74

d0d846f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor non-standard evaluation internals #74

Refactor non-standard evaluation internals #74

sa-lee commented Nov 15, 2019

Refactor non-standard evaluation internals #74

Refactor non-standard evaluation internals #74

Comments

sa-lee commented Nov 15, 2019