Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor non-standard evaluation internals #74

Open
sa-lee opened this issue Nov 15, 2019 · 0 comments
Open

Refactor non-standard evaluation internals #74

sa-lee opened this issue Nov 15, 2019 · 0 comments

Comments

@sa-lee
Copy link
Collaborator

sa-lee commented Nov 15, 2019

Currently, this is done in a kludgy way (and I find it hard to remember what is done) and at the time I wrote didn't really know much about NSE/environments. This an attempt of expressing all of the accumulated technical debt and propose some fixes.

There are a few kludges:

The main engine behind doing NSE is the overscope_ranges() generic, which essentially wraps S4Vectors::as.env() then calls a rlang::new_data_mask() for creating a "tidy-evaluation" environment with the bottom environment being the input ranges and top the environment being input metadata. On the other end for some, when a verb is evaluated each quosure has their environment updated so all the generics from BiocGenerics, IRanges and S4Vectors are bound to the top level of the quosures environment. This is super wasteful and the wrong place to put them.

Grouping in the mask. Currently, we apply a transform where we create Lists from our groups and then evaluate our quosures with respect to the List. This works fairly well since most generics dispatch on the lists and provides major speedups for basic ops like mean etc. However, in the cases where there is not a generic, an operation will throw an error (in the case of summarise) or will be wasteful by creating multiple masks (like mutate). There should be checks in place that attempts to dispatch on the list, if that fails then convert the quosure to a function(?) then use an appropriate *apply function to call it. This seems to be way tricky to implement, so maybe there's a better approach? cc @lawremi

Lazy binding: Currently, functions like mutate can assign symbols into the mask, so computed columns can be computed on. This should be done lazily via delayedAssign.

The n() function. This is confusing for people as there's a name clash with dplyr::n() which will fail and causes certain functions like dplyr::count() etc not work. The n() function is also wasteful on grouping, we can really quickly compute the group counts directly from the indices. The n() function doesn't need to be exported but could be bound to the mask.

Proposed Fixes:

  1. Overscope ranges should be renamed to reflect what it actually does, creates a mask for the ranges.
  2. The generics should be bound to the top environment of the mask not each quosure
  3. A .data pronoun should be installed in the mask, to enable programming NSE interfaces on top of plyranges
  4. Refactor n to be bound to the mask, it should look outside it's scope to quickly compute the result rather than via environment look ups.
  5. Better fail-safe mechanisms for grouping transformations that don't work.

Other ideas:

There's a tension between thinking of a Ranges as columns consisting of start, end, width and as a single Vector like object. Sometimes, you want both and this can make the API clunky see #47 and also the count_overlaps functions. Could there be a lazy binding to the Ranges in the mask? This could facilitate doing range-wise arithmetic all within mutate (without needing to use anchor first).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant