Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fcase / case_when function for data.table #3823

Closed
MichaelChirico opened this issue Sep 5, 2019 · 29 comments · Fixed by #4021
Closed

fcase / case_when function for data.table #3823

MichaelChirico opened this issue Sep 5, 2019 · 29 comments · Fixed by #4021

Comments

@MichaelChirico
Copy link
Member

MichaelChirico commented Sep 5, 2019

Related/follow-up to #3657

case_when is a common tool in SQL for e.g. building labels based on conditions of flags, e.g. cutting age groups:

select
  case when age < 18 then '0-18'
       when age < 35 then '18-35'
       when age < 65 then '35-65'
       else '65+'
  end as age_label

Our comrades at dplyr have implemented this as case_when with an interface like

dplyr::case_when(
  age < 18 ~ '0-18',
  age < 35 ~ '18-35',
  age < 65 ~ '35-65',
  TRUE ~ '65+'
)

Using & interpeting formulas seems pretty natural for R -- the only other thing I can thing of would be like on syntax (case_when('age<18' = '0-18', ...)).

As for the back end, dplyr is doing a two-pass for loop at the R level which e.g. requires evaluating age < 65 for all rows (whereas this is unnecessary for anything with labels '0-18' or '18-35').

I guess we can do much better with a C implementation. Will be interesting to contrast a proper lazy implementation with a parallel version (since IINM doing it in parallel will require first evaluating all the "LHS" values, e.g. as Jan met recently in frollapply).

Note that normally I'm doing case_when stuff with a look-up join, but it's not straightforward to implement a case_when as a join in general -- though I think the two are isomorphic, backing out what the corresponding join should be I think is a hard problem. e.g. in the example here, the order of truth values matters since implicit in the later conditions are (x & !any(y)) for each condition x that was preceded by conditions y. This case is straightforward to cast as a join on cut(age), perhaps just using roll, but things can be much more complicated when several variables are involved in the truth conditions. So I don't think this route is likely to be fruitful.

@jangorecki jangorecki changed the title FR: fcase / case_when function for data.table fcase / case_when function for data.table Sep 6, 2019
@shrektan
Copy link
Member

shrektan commented Sep 9, 2019

I actually prefer the synatx like fcase(test1, value1, test2, value2, ..., default). It's easier to read when the tests and values are quite complicated.

@MichaelChirico
Copy link
Member Author

MichaelChirico commented Sep 9, 2019

It definitely probably be easier to code from that too.

One way forward would be to S3 it and build a (test1, value1, ....) call to dispatch for fcase.formula (personally I find the formula to read a bit more naturally)

@shrektan
Copy link
Member

shrektan commented Sep 9, 2019

Yes, it's easier to read especially when test & value is short and inlined. But I'm not sure using the formula style will have "significant" overheads or not since it requires extra patching. If not, I'm ok with both styles...

@HughParsonage
Copy link
Member

HughParsonage commented Sep 9, 2019

Question, how does this improve on (or differ from) subassignment by reference? You can do an equivalent to a case when using the following:

DT <- data.table(age = 0:100)
DT[, age_label := "65+"]
DT[age < 65, age_label := "35-65"]
DT[age < 35, age_label := "18-35"]
DT[age < 18, age_label := "0-18"]

for which we already get auto-indexing by default.

@MichaelChirico
Copy link
Member Author

Hugh see last paragraph. I had a fuller explanation written out but it disappeared when I pressed Comment 😢

@shrektan
Copy link
Member

shrektan commented Sep 9, 2019

Your example is a special case. fcase is supposed to be more generic and not limited to be used within a datatable object.

@2005m
Copy link
Contributor

2005m commented Sep 12, 2019

I am thinking it would be more efficient if we don't have to evaluate all the "LHS" value, but I need to think how I would solve that problem. What do you think of something like that?
fcase(variable, default, test1, value1,...,testN, valeN)

@MichaelChirico
Copy link
Member Author

@2005m I don't quite follow the intuition of the function signature?

I have exploratory work on the fcase branch, R API is:

fcase = function(..., default=NA) .External(Cfcase, ..., default)

My thinking is to implement @shrektan's suggestion first as it'll be pretty straightforward (just need to figure out how to deal with DOTSXP in C 😅; note .External is necessary to accept ...).

Then later build the logic to interpret with formula which can map into the simpler implementation.

@shrektan
Copy link
Member

shrektan commented Sep 13, 2019

I think we can pass a list object to C. e.g.,

fcase = function(..., default = NA) {
  .Call(Cfcase, list(...), default = NA)
}
SEXP fcase(SEXP x, SEXP default) {
  ...
}

In addition, ?.External says the ... accepted is up to 65:

... arguments to be passed to the compiled code. Up to 65 for .Call.

@HughParsonage
Copy link
Member

HughParsonage commented Sep 13, 2019 via email

@2005m
Copy link
Contributor

2005m commented Sep 13, 2019

@MichaelChirico , I was thinking to use it like this:
fcase(age, '65+', '<18', '0-18', '<35', '18-35')

@shrektan
Copy link
Member

since it only depends on a single variable age, maybe it suits for fswitch() as a wrapper of fcase()?

@jangorecki
Copy link
Member

jangorecki commented Sep 13, 2019

the less checks on R level the better. I also prefer .Call rather than .External.

We should first agree on the API. As @shrektan said fswitch can be a wrapper to fcase, similarly fifelse, if we all agree on that we can then focus on fcase API.
Once API is established then maybe R prototype? this will be for 1.13.0 so there is plenty of time.

@jangorecki
Copy link
Member

jangorecki commented Sep 13, 2019

As of now we have

  1. standard evaluation interface
fcase(...)
fcase(..., default)
fcase(when1, value1, ..., default)
fcase(age < 18, '0-18', age < 35, '18-35', age < 65, '35-65', '65+')
  1. NSE formula interface
fcase(...)
fcase(..., default)
fcase(x, ..., default)
fcase(age < 18 ~ '0-18', age < 35 ~ '18-35', age < 65 ~ '35-65', TRUE ~ '65+')
  1. "vectorized"
fcase(when, value, default)
fcase(list(age < 18, age < 35, age < 65),
      list('0-18', '18-35', '35-65'),
      '65+')

any other suggestions?

@MichaelChirico
Copy link
Member Author

@jangorecki
Copy link
Member

Author of that post seems to asks for a lookup table. The problem we solve by update on join. Case when has somehow different goal, as explained in above comments.

@2005m
Copy link
Contributor

2005m commented Sep 27, 2019

Question: Can value1 be the same length as when1 or is it always equal to 1?
Also I think we need to decide which one of the above function interface you want before coding anything? So far I think there is different way of coding this. Some simple (incl. valuating each when) and some more complex (incl. Not evaluating each when)...

@MichaelChirico
Copy link
Member Author

in SQL it's usually the same length (value1 and when1 are both columns, e.g.), so I would expect that to work.

@TysonStanley
Copy link
Member

Maybe this is relevant. I just built a version that is just built on top of your very quick fifelse() with a few checks and structure created in R but otherwise allows fifelse() to run the show.

It can be found in the very developmental tidyfast package at https://github.com/TysonStanley/tidyfast. Would this very simple approach be useful?

@2005m
Copy link
Contributor

2005m commented Oct 25, 2019

Thank you @TysonStanley . I actually finished writting the function in C. I was hoping to do a pull request tonight. I just need to finish writting the tests. I'll have a look at your function.

@TysonStanley
Copy link
Member

@2005m That’s awesome! I’m excited to see it rolled out. Is the syntax similar to dplyr::case_when()?

And yes, take a look and let me know what you think. It’s a pretty simple approach since I could rely on fifelse().

@2005m
Copy link
Contributor

2005m commented Oct 25, 2019

Here is a sneak peek:

x = sample(1:100,3e7,replace = TRUE) # 114 Mb
data.table::setDTthreads(1L)

microbenchmark::microbenchmark(
dplyr::case_when(
    x < 10L ~ 0L,
    x < 20L ~ 10L,
    x < 30L ~ 20L,
    x < 40L ~ 30L,
    x < 50L ~ 40L,
    x < 60L ~ 50L,
    x > 60L ~ 60L
),
tidyfast::dt_case_when(
  x < 10L ~ 0L,
  x < 20L ~ 10L,
  x < 30L ~ 20L,
  x < 40L ~ 30L,
  x < 50L ~ 40L,
  x < 60L ~ 50L,
  x > 60L ~ 60L
),
data.table::fcase(
  x < 10L, 0L,
  x < 20L, 10L,
  x < 30L, 20L,
  x < 40L, 30L,
  x < 50L, 40L,
  x < 60L, 50L,
  x > 60L, 60L
),
times = 5L)
# Unit: seconds
#                    expr    min    lq  mean  median    uq    max neval
# dplyr::case_when         11.69 11.80 11.83   11.81  11.92 11.94     5
# tidyfast::dt_case_when    2.18  2.23  2.32    2.38   2.39  2.41     5
# data.table::fcase         1.87  1.91  2.02    2.02   2.05  2.26     5

The syntax is different to dplyr::case_when but we can probably change it.
Note my computer is very old. Code is compiled with gcc 4.9 flag -02.
The function will support interger64 and nanotime.
I still need to work on it,though. I think the PR won't be for tonight.

@TysonStanley
Copy link
Member

Thanks for the sneak peak. That performance is fun to see. And the syntax looks just as friendly either way. Personally I like the formulas but for most cases, it probably doesn’t matter much.

Are you planning on supporting other vector types in the future?

@jangorecki
Copy link
Member

does it evaluate every single case or only those that needs to be reached to provide answer?

@2005m
Copy link
Contributor

2005m commented Oct 26, 2019

@jangorecki , yes it evaluates all cases and that is why I am not happy with it. I am also not happy with the performance either. If it can't be improved, I think @TysonStanley 's approach if better because simpler and timings are similar.

@TysonStanley , I also prefer the formulas but it is probably subjective...Regarding other vector types, I don't know. It is up to the team.

@jangorecki
Copy link
Member

We are not in hurry. I think having lazyness should be crucial, otherwise it doesn't bring anything new (other than API) comparing to using lookup table. Don't try to resolve everything at first iteration. When you will feel ready submit PR to get feedback. Final state can take multiple iterations or a follow up PR(s).

@TysonStanley
Copy link
Member

Does fifelse evaluate lazily (ie only the cases it needs to)? I need to take a look at the code but thought I’d ask.

@jangorecki
Copy link
Member

yes it does

fifelse(TRUE, 1, stop("a"))
#Error in fifelse(TRUE, 1, stop("a")) : a

@2005m
Copy link
Contributor

2005m commented Oct 29, 2019

@jangorecki , I hope to be able to share my code this weekend. I need to write the Rd file and add more tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants