
improvement on programmatically substituting columns in expressions #2655

Status: Closed · arunsrinivasan opened this issue Mar 3, 2018 · 32 comments
Labels: enhancement · programming (parameterizing queries: get, mget, eval, env) · top request (one of our most-requested issues)

@arunsrinivasan (Member) commented Mar 3, 2018

require(data.table) # devel compiled on 3rd March 2018
dt <- data.table(x=1:5, y=6:10, z=11:15)
#    x  y  z
# 1: 1  6 11
# 2: 2  7 12
# 3: 3  8 13
# 4: 4  9 14
# 5: 5 10 15

Now we can subset cols using the .. notation conveniently as follows:

cols <- "z"
dt[, ..cols]
#     z
# 1: 11
# 2: 12
# 3: 13
# 4: 14
# 5: 15

Since ..var has a very special meaning, could we also allow this?

dt[, c(..cols, "x")]

Currently, it errors as follows:

# Error in eval(jsub, SDenv, parent.frame()) : object '..cols' not found

Haven't really given it a lot of thought; just came across it and wanted to file it as an issue so that it doesn't get lost.

@MichaelChirico (Member)

Of course dt[, c(cols, 'x'), with = FALSE] still works... which raises the question again of whether we're trying to phase out with = FALSE altogether.

@franknarf1 (Contributor)

I like it. Here's a similar idea from #633 by Synergist:

dt <- data.table(x = 1:10, y = rnorm(10), z = runif(10))
cols <- c('y', 'z')
dt[, !..cols] # should return col x

@mattdowle (Member)

Yes -- absolutely. The original news item in 1.10.2 (Jan 2017)

When j is a symbol prefixed with .. it will be looked up in calling scope and its value taken to be column names or numbers.

myCols = c("colA","colB")
DT[, myCols, with=FALSE]
DT[, ..myCols] # same
When you see the .. prefix think one-level-up like the directory .. in all operating systems meaning the parent directory. **In future the .. prefix could be made to work on all symbols appearing anywhere inside DT[...].** It is intended to be a convenient way to protect your code from accidentally picking up a column name. Similar to how x. and i. prefixes (analogous to SQL table aliases) can already be used to disambiguate the same column name present in both x and i. A symbol prefix rather than a ..() function will be easier for us to optimize internally and more convenient if you have many variables in calling scope that you wish to use in your expressions safely. This feature was first raised in 2012 and long wished for, #633. It is experimental.

I've bolded the relevant sentence. I wasn't sure about it at the time and hoped for feedback before continuing. Seems like it's a go, then. IIRC, without looking at the code yet, it's not that hard, as there's already a substitution of all appearances of get() and/or perhaps eval() to look in calling scope. At one time I thought of eval() as a good wrapper to convey "eval in calling scope" and I think [...] may already do that. Now I prefer the .. prefix, as it's simpler and more robust. Since .. is a prefix on a symbol, we know it must be a symbol (unlike the eval() function, which could potentially be passed paste(...) or something).
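A minimal sketch of the lookup rule that news item describes (assuming data.table >= 1.10.2; the names here are invented):

library(data.table)
DT = data.table(x = 1:3, myCols = 4:6)  # note: "myCols" is also a column name
myCols = "x"
DT[, ..myCols]   # .. forces lookup in calling scope, so this selects column "x"
DT[, myCols]     # without .., the symbol is the column "myCols" itself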

My recent tweet here seems to have yielded a positive response, too.
https://twitter.com/MattDowle/status/967290562725359617

@arunsrinivasan (Member, Author)

Nice! I like it too. It'd also, when implemented, take care of issues like:

dt[x > x]

which could be then done as:

dt[x > ..x]

@mattdowle added this to the v1.10.6 milestone Mar 5, 2018
@arunsrinivasan (Member, Author)

@MichaelChirico I assumed that's the case based on #2620

@mattdowle (Member) commented Mar 5, 2018

At one stage I think there was a suggestion to use a single . prefix to mean 'this scope for sure' and not in calling scope if it's not a column name (again, similar to the single . directory). However, one character is a bit easier to miss when reading code. It is easier and more robust to say .. must be used to get to calling scope and, without that prefix, the symbol must be a column name. It would be a change with potential breakage, but managed in the usual way; e.g. an option like datatable.strict.scope, FALSE to start with and then gradually changed to TRUE with notices and warnings over several years. The warnings could be quite advanced in this case, saying exactly which symbols should be prefixed with .., as it would detect at runtime which ones it was finding in calling scope. I think users would like it, as the move would be towards robustness and readability, and users can make the changes in their own time over the years.
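A hypothetical illustration of that transition; datatable.strict.scope is only the option name sketched above, never an implemented one:

# hypothetical: datatable.strict.scope is a proposed option, not a real one
options(datatable.strict.scope = TRUE)
DT = data.table(x = 1:3)
y = 10L
DT[, x + y]     # would warn: 'y' found in calling scope; prefix it as ..y
DT[, x + ..y]   # explicit: ..y must come from calling scope, x must be a column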

@franknarf1 (Contributor)

.. must be used to get to calling scope and without that prefix, the symbol must be a column name.

Looking at #697 (comment), I guess the exception for a single symbol in i will be kept (interpreted as a join and so looking in calling scope even without ..)?

@mattdowle (Member) commented Mar 5, 2018

Hm, yes. The difference there is that a single symbol in i doesn't make sense if it were to mean a column name, except if that column is type logical. Personally I prefer to write and read DT[someCol==TRUE] anyway, and leave it to optimization to do that efficiently.
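A small sketch of that exception, using standard keyed-join behavior (names made up):

library(data.table)
DT = data.table(id = c("a", "b", "c"), v = 1:3, key = "id")
w = data.table(id = c("a", "c"))
DT[w]        # a lone symbol in i is a join: w is found in calling scope, no .. needed
DT[v > 1L]   # an expression in i still refers to columns as usual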

@msummersgill commented Mar 5, 2018

Looking forward to this one! Seems like this has potential to replace a boatload of eval/parse/quote/get that litters most of the code I've written since going full data.table.

One pipe dream of mine might look like the following, where both expressions would give the same output. (As I started to type this out, I did start to question whether this is even realistic.)

set.seed(1234)
DT <- data.table(foo = rep(LETTERS[1:2],8),
                 bar = rep(letters[17:20],4),
                 month = rep(month.name[1:4],4),
                 day = seq_len(16),
                 yyy = rnorm(16,mean = 0.5,1.5))

## Expression 1: with hard coded columns
DT[yyy > 0,.(NewCol = paste(month, day),
             bar,
             yyy), by = foo]
## Expression 2: everything passed by reference
A = "yyy"
B = 0
C = "NewCol"
D = "month"
E = "day"
G = "foo"
H = c("bar","yyy")

DT[..A > ..B , .(..C = paste(..D, ..E),
                 ..H), by = ..G]
   foo      NewCol bar       yyy
1:   B  February 2   r 0.9161439
2:   B  February 6   r 1.2590838
3:   B February 14   r 0.5966882
4:   B    April 16   t 0.3345718
5:   A     March 3   s 2.1266618
6:   A   January 5   q 1.1436870
7:   A    March 15   s 1.9392411

However, even as I write this I'm starting to see some potential hang-ups, in particular with how the i part of the expression is evaluated.

In the case below, there does seem to be potential ambiguity over whether ..B would be treated as a column name if a column xyz existed in DT. Is my interpretation anywhere close to what you're currently planning to implement, and should I perhaps be using a single . notation in some places?

A = "foo" ## intended column name
B = "xyz" ## intended literal character string for comparison

DT[..A == ..B]

Whatever gets implemented, I'm looking forward to it!


There are several relevant Stack Overflow questions (with lots of responses from contributors here).

@mattdowle (Member) commented Mar 24, 2018

Ok, now in master. Just for symbols used in j= for now.
Please all test it out.
See the NEWS item and the new tests in PR #2704.

Looking at @msummersgill's comment more closely just now, there seem to be two different ideas:

  1. Fetching the value of a variable in calling scope to use in a j expression that uses column names too; e.g. beta=10L; DT[, sum(colB)*..beta]. This works fine already (without the .. prefix on beta) so long as there is never any chance of DT having a column called beta. The .. makes that safer.

  2. Using a column name in a j= expression, where the column name to use is defined by a variable in calling scope. With the .. prefix now implemented, this can be done reliably with get(..whichCol), where whichCol is in calling scope; it will now work even if DT ever contains a column called whichCol, too. However, using get() is a bit onerous and it's a bit harder to optimize internally (currently get() causes all columns to be placed in .SD, for example). Further, it's possible that whichCol in calling scope could contain a name that isn't actually a column name. In that case get() would itself look in calling scope and, if unlucky, could pick up an unintended value there. A syntax to get that column for sure, and error otherwise, would be nice. (Both cases are sketched below.)
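A short sketch contrasting the two cases, with invented names; case 2 uses get() as described, and the behavior is assumed per the then-current master:

library(data.table)
DT = data.table(colA = 1:5, colB = 6:10)

# Case 1: a calling-scope *value* used inside a j expression
beta = 10L
DT[, sum(colB) * ..beta]   # ..beta always comes from calling scope, never a column

# Case 2: a calling-scope variable naming which *column* to use
whichCol = "colA"
DT[, get(..whichCol)]      # ..whichCol -> "colA"; get() then fetches that column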

I had been thinking only of case 1 so far; that's what @arunsrinivasan raised this issue about, and @franknarf1 and @MichaelChirico agreed, with further examples. That's what the now-merged PR does.

But @msummersgill is suggesting something completely different (case 2) that I hadn't considered before. That's a very neat idea and I like it. Do we need a different prefix for case 2, then, and do that too?

If I've understood correctly so far, which case does !!var in tidyeval do?

Since DT[, ..cols] (i.e. where j= is a single symbol) was the only case where .. worked before, I see how the view that .. means get was formed. But it's just that the value of ..cols, in that context, happens to mean selecting that column, or more often a set of columns.

And now I see, @msummersgill already detailed these two cases at the end of his comment, but I didn't digest that before.

@mattdowle (Member) commented Mar 24, 2018

Looks like we can't have a _ or __ prefix, as that doesn't parse. We could have a _ or __ postfix, but I somehow really like prefix rather than postfix.
Then I thought, ok, maybe @whichCol. This would mean get(..whichCol), but where the value of whichCol must be a column name, otherwise error. But that doesn't parse either.
So it's pointing back towards the old EVAL = parse(text=paste(...)) approach, which I don't mind at all, but it can get a bit ugly, even when using the EVAL wrapper. Again, as @msummersgill already alluded to.

So, taking Matt S's example, how about :

DT[ " @A > @B , .(@C = paste(@D, @E),  @H), by = @G " ]

The rule would be (a little like the first argument of fread) that if i= is a string containing one or more @, then it switches to token replacement mode. In this mode, all other arguments (j=, by=, etc.) must be missing. The internals would replace the tokens and then re-parse. It might not be that hard to implement.

More examples :

col = "x"
DT[, ..col]               # select column "x" (even if "cols" is a column name too)
DT[, c(..col, "y")]       # select columns "x" and "y"
thresh = 6.5
DT[ x > ..thresh ]        # select rows where "x" column > 6.5
DT[ "@col > y" ]          #  same as DT [ x > y ]  i.e. comparison of two columns
anyString = "..thresh"
DT[ "@col > @anyString" ] # same as DT[ x > ..thresh ]

It's now actually an advantage that the @ prefix doesn't parse, because visually it is distinguishable from other regular R code. The restriction would be that you couldn't directly use S4's obj@slot when in token replacement mode; you'd have to move the "obj@slot" into a token's value. It doesn't have to be @. It could be anything, maybe a $ or $$ prefix, conveying $ for string.
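As a thought experiment, the proposal could even be mocked up in user space today; metaquery() below is entirely hypothetical, not a data.table API, just string substitution plus eval(parse()):

library(data.table)

# metaquery() is a hypothetical user-space mock of the proposed token replacement
metaquery = function(DT, q, env = parent.frame()) {
  tokens = unique(regmatches(q, gregexpr("@[[:alnum:]._]+", q))[[1]])
  for (tok in tokens) {
    # look the token up in calling scope and splice its value into the string
    val = get(sub("^@", "", tok), envir = env)
    q = gsub(tok, paste(val, collapse = ", "), q, fixed = TRUE)
  }
  message("Meta-query expanded to: DT[", q, "]")
  # re-parse the expanded query and run it against DT
  eval(parse(text = paste0("DT[", q, "]")), list(DT = DT), env)
}

DT = data.table(x = 1:5, y = 6:10)
col = "x"
metaquery(DT, " @col > 2, sum(y) ")  # expands to DT[ x > 2, sum(y) ] -> 27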

@jangorecki (Member) commented Mar 24, 2018

I don't like the idea of one string with @ replacements. A user can easily use bquote to substitute variables in an expression, and that stays closer to base R. No new custom-designed API to maintain.
From the programmatic point of view, the user would need to create one long string and pass it to DT[i=. It is not really that robust. It would be much more programming-friendly if it accepted a list, eventually something similar to #1579 (comment). If it is not high priority, why not just leave it for now.

@franknarf1 (Contributor)

I agree with Jan re preferring expressions over strings. One other idea: add a function like get/mget, but one that only looks among the data.table's columns?

DT = data.table(x = 1, y = 2)

cols = c("x", "y")
DT[, ..cols]                        # as now
DT[, g(..cols)]                     # same
DT[, g(cols)]                       # same
DT[, g(..cols), strict=TRUE]        # same
DT[, g(cols), strict=TRUE]          # error: symbol cols is neither a column name in DT
                                    #        nor an external variable highlighted with a .. prefix

# behaving like mget but checking for col names
cols = c("x", "y")
DT[, lapply(g(..cols), max)]        # same as current behavior of DT[, lapply(mget(cols), max)]
v = "z"
DT[, {z = 4; lapply(g(..v), max)}]  # error: selected column "z" not found
                                    # in contrast with DT[, {z = 4; lapply(mget(v), max)}]

# simplifying for a single col by default
vv = "x"
DT[, g(..vv)]                      # same as DT[, x] with simplification from list to vector
DT[, g(..vv, simplify = FALSE)]    # same as DT[, .(x)]

I don't know anything about feasibility, but something like this would be easy to understand and work with, I imagine.

Related to #1712 (comment) and #633

Just to clarify, here are translations of Matt's examples:

col = "x"
DT[, ..col]                     # DT[, .(x)]
DT[, c(..col, "y")]             # DT[, .(x, y)]

DT[ g(col) > y ]                # same as DT [ x > y ]
DT[ g(col) > y, strict=TRUE ]   # error: symbol col is not a column name nor highlighted with ..
DT[ g(..col) > y, strict=TRUE ] # same as DT [ x > y ]

thresh = 6.5
DT[ x > ..thresh ]              # same as DT[ x > 6.5 ]
DT[ x > thresh, strict=TRUE]    # error: symbol thresh is not a column name nor highlighted with ..
DT[ g(col) > ..thresh ]         # same as DT[ x > 6.5 ]
anyString = quote(..thresh)
my_expr = bquote(g(col) > .(anyString))
DT[ eval(my_expr) ]             # same as DT[ x > 6.5 ]
DT[ eval(my_expr), strict=TRUE] # error: symbol col is not a column name nor highlighted with ..

@msummersgill commented Mar 25, 2018

The idea of having a single string in i with @ to identify names would be highly flexible, but (subjectively) it does feel a little foreign compared to the normal data.table syntax. A couple of other minor drawbacks might be losing auto-complete of column names and auto-indenting of multi-line expressions when working in an R-aware code editor.

While those drawbacks certainly might be worth the trade-off, one more objective speed bump I see at the moment is that the use of a single symbol, @, potentially leaves the same opportunity for ambiguity over whether the target is an external object or a data.table column.

It seems like one of the underlying challenges here is that column names in a data.table can be used in the external environment with a completely different definition. The .. prefix leaves no ambiguity for the existing "list of column names" functionality, but it seems that to extend the concept to enable full-fledged, unambiguous, functional programming, at least one additional prefix might be necessary.

@mattdowle I've tried to address your questions by adding usage examples illustrating each of the possible cases outlined below. By "External" I mean that the intended target of the evaluated name is explicitly not a column within the data.table; let me know if there's anything else I can try to clarify! If you decide to continue down the path of DT["@A = ..."], I think enabling explicit differentiation between things that are supposed to evaluate as column names and things that aren't would still be high on my personal wish list, perhaps using ? to signify non-column names.

Thinking about manipulations on a single table only (joins introduce a whole new set of potential overlaps in names, not sure if that should be considered in this scope), here's a list of some of the potential ways I (and perhaps other users) might like to be able to define different potential targets explicitly to avoid ambiguity.

To me it would make intuitive sense to extend the usage of .. to a single column (case 3), and it seems like it might be possible to handle cases 4, 6, and 7 with the same prefix. For the sake of the examples below, I'll use ..ext. as a prefix, representing that the name should evaluate as a value in the external environment, and never as a column within DT.

| Case | Current Prefix | New Prefix | Used In | Target | Target Description |
|------|----------------|------------|---------|--------|--------------------|
| 1 | None | None | I, J, K | Column | Literal column name |
| 2 | None | None | I, J | External | Literal value |
| 3 | (none) | .. | I, J, K | Column | Column name defined externally |
| 4 | (none) | ..ext. | I, J | External | Value (or vector) defined externally |
| 5 | .. | .. | J, K | Columns | List of column names defined externally |
| 6 | (none) | ..ext. | J, K | External | Name of new column created in DT, defined externally |
| 7 | (none) | ..ext. | J | External | List of names for multiple new columns created in DT, defined externally |

Examples:

# Case --------------------------------
#   1      2
DT[foo == "A"] # Cases 1 & 2: Literal column name, and a literal value


col = "foo" # Case 3: Column Name defined externally
val = "A"   # Case 4: Value defined externally

# Case --------------------------------
#     3         4
DT[..col == ..ext.val] # DT[foo == "A"]


cols = c("foo","bar") # Case 5: List of column names defined externally
# Case --------------------------------
#       5
DT[, ..cols] # DT[, .(foo,bar)]


newcol = "nextday" # Case 6:Name of new column created in DT, defined externally 
# Case --------------------------------
#       6             1    2
DT[, ..ext.newcol := day + 1] # DT[, nextday := day + 1]


newcols = c("nextday", "nextday2") # Case 7:  List of names for multiple new columns created in DT, defined externally
# Case --------------------------------
#       7                1    2   1    2
DT[, ..ext.newcols := .(day + 1, day + 2)] # DT[, c("nextday", "nextday2") := .(day + 1, day + 2)]


## Putting it all together (in a totally nonsensical way for illustration only)
col = "foo"
col2 = "bar"
col3 = "day"
newgroup = "baz"
val = "A"
val2 = 1
newcols = c("nextday", "fooday")
printcols = c("baz","nextday","fooday")
# Case -------------------------------------------------------------------------------------------------------------
#    3         4            7                3          4                3      3                  6              3           5 
DT[..col == ..ext.val, ..ext.newcols := .(..col3 + ..ext.val2, paste0(..col, ..col3)), by = .(..ext.newgroup = ..col2)][, ..printcols]
#DT[foo == "A", c("nextday", "fooday") := .(day + 1, paste0(foo, day)), by = .(baz = bar)][,.(baz, nextday, fooday)]

From what I understand, the prefix symbol is limited to what will parse as a valid object name in base R? If so, some options to cover cases 4, 6, and 7 might be ._col, .._col, or ..ext.col, though none of those flow off the fingertips like .. does. While ... might work theoretically, it seems like trying to distinguish between .. and ... would lead to more migraines than anything else.

As alluded to by others, it does seem that a single prefix for all of cases 3-7 could potentially work if the column names (cases 3 and 6) were quoted when defined in the external environment, tidyeval style? Not too familiar with the framework; will do some more reading to see if some of the same design decisions would be useful here. Update: did some reading on rlang and tidyeval, now I'm just confused.

The idea of having some kind of a strict = TRUE option seems like it could be potentially useful in allowing a minority of advanced users to write functions that manipulate data.table objects more defensively without overly burdening beginners with the need to understand a bunch of new syntax and prefixes.

@mattdowle (Member)

Let's take the comments from the top, in order :

[Jan] I don't like the idea of one string with @ replacements. A user can easily use bquote to substitute variables in an expression, and that stays closer to base R. No new custom-designed API to maintain.

Can you please show an example how easy it is to use bquote with Matt S's example then?

[Jan] From the programmatic point of view, the user would need to create one long string and pass it to DT[i=.

In fact they won't be able to pass a string at all, because that would need a variable. A variable in i means something different. The string would have to appear within the [...]; i.e. literally DT[ " ... " ]. This is distinguishable in R because substitute(i) has type character in this case.
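A minimal illustration of that detection mechanism; f() here is a throwaway function, not data.table code:

f = function(i) {
  isub = substitute(i)
  if (is.character(isub)) "literal string: would trigger meta-query mode"
  else "not a literal string: regular i handling"
}
f("@col > 5")   # literal string: would trigger meta-query mode
q = "@col > 5"
f(q)            # not a literal string: q arrives as a symbol, not a literal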

[Jan] It is not really that robust.

Can you provide one example of fragility please, so I can understand.

[Jan] It would be much more programming-friendly if it accepted a list, eventually something similar to #1579 (comment).

I have done programmatic things like this in the past when I have been a user of data.table. But honestly, I never really liked it. The quote() or bquote() (which one, again?) have to be set up in advance, and does the whole parameter have to be created in one of those, or can the quote() or bquote() be used within part of a j= expression? It's not friendly to my eyes, at least. That macro idea didn't really strike me as simple or easy to explain. Personally, I would use what I've proposed (simple token replacement).

[Jan] If it is not high priority, why not just leave it for now.

Because I've just extended .. prefix to work on all symbols in j=. If this .. is a mistake, then we should establish that right now, before releasing and locking that down. It will be harder to undo later. If we do need another prefix, then we should decide now and have a complete plan we're happy with, since we could feasibly wish we had not decided on .. after all.

[Frank] I agree with Jan re preferring expressions over strings.

Let's revisit after Jan has responded to points above.

[Frank] add a function like get/mget, but that only looks among the data.table's columns

The trouble is, how to deal with it internally. Let's say j= is run by group and there are a lot of groups. We don't want to be running g() over and over wastefully, just to grab the same column each time (which was defined once before the query started). So the internals have to work out if g() could ever return a different column depending on the group, and if not, take that out up front and optimize so it's a constant. It involves replacing the g() call with a constant. All that is fiddly and error prone. Part of the reason mget invokes all columns to be present in .SD is due to things like this. It's doable, but hard.

I like the token replacement because it's an explicit syntax that allows you to say: ok, this query is programmatic, but up-front, programmatically defined once. After that, it runs exactly like a regular query. That's clean to reason about, in my view. There's a clear separation of logic: i) the first step that defines the query to run, then ii) that query runs. If we have g() and eval() and bquote() and things like that, then they run throughout the internals and make it hard. I remember writing once why I liked and recommended the EVAL = parse(text=paste(...)) approach: because it resulted in a query that could be optimized cleanly, just as if the regular simple query had been typed in the first place.

[Matt S] The idea of having a single string in i with @ to identify names would be highly flexible, but (subjectively) it does feel a little foreign from the normal data.table syntax.

True. After a few sleeps, I'm still liking it. 'Normal' in this case is using bquote and eval, etc, so the comparison should be to that since 'normal' data.table syntax doesn't have this ability.

[Matt S] A couple other minor drawbacks might be losing auto-complete of column names and auto-indenting of multi-line expressions when working with a R supporting code editor.

Those things aren't going to work anyway when writing the meta-query (because the column name used will be flexibly defined.) So maybe it's good that this is with a string literal and therefore colored differently in the editor so it stands out. Even looking at the colors here in GitHub above in my comment showing the examples, it looks quite good to my eyes already. The meta-query stands out in blue with quotes around, while the true variables in calling scope are highlighted in orange. It looks similar in my editor locally too.

I don't follow much of the rest of your comment. I need to see actual code examples. I can't translate that table to code because I'm not 100% clear what you mean by 'external', for example.

[Matt S] The idea of having some kind of a strict = TRUE option seems like it could be potentially useful in allowing a minority of advanced users to write functions that manipulate data.table objects more defensively without overly burdening beginners with the need to understand a bunch of new syntax and prefixes.

Yes. That idea is transitional: after several years, if it goes well, strict=TRUE would always be the case and not optional. It is not just for advanced users, though. In fact, the opposite: it's to save accidents by non-advanced users. Beginners will need to use the .. prefix to get to calling scope. I think they'd find a meta-query easier to grasp (it's just string replacement, after all) than anything involving new functions. They find quote/bquote/enquote all quite confusing, as I do too. The error messages can be very helpful with beginners in mind, e.g. "'coef' is not a column name, but it is in calling scope. Please add the prefix '..' to coef to make it clear, so it won't ever break in future if you add a column called coef."

@jangorecki (Member) commented Mar 27, 2018

Can you please show an example how easy it is to use bquote with Matt S's example then?

Not bquote actually, but a language solution; see below.

Can you provide one example of fragility please, so I can understand.

There is no fragility in this solution, but querying DT by a character expression is already well handled by eval-parse.

Language and string solutions: neither is really user-friendly, but both are powerful, base R, stable, with a well-defined and tested API.
I agree it makes sense to have some faster way to write this, for user convenience. Yet for production code I would still construct a call rather than operate on a string, as it adds an extra sanity check for the syntax.

library(data.table)
set.seed(1234)
DT <- data.table(foo = rep(LETTERS[1:2],8),
                 bar = rep(letters[17:20],4),
                 month = rep(month.name[1:4],4),
                 day = seq_len(16),
                 yyy = rnorm(16,mean = 0.5,1.5))

## Expression 1: with hard coded columns
DT[yyy > 0,.(NewCol = paste(month, day),
             bar,
             yyy), by = foo] -> ans
## Expression 2: everything passed by reference
A = "yyy"
B = 0
C = "NewCol"
D = "month"
E = "day"
G = "foo"
H = c("bar","yyy")

#DT[..A > ..B , .(..C = paste(..D, ..E),
#                 ..H), by = ..G]
#   foo      NewCol bar       yyy

"# ---- language ----"

qc = as.call(list(
    as.name("["),
    x = quote(DT),
    i = call(">", as.name(A), B),
    j = as.call(c(list(as.name(".")), setNames(list(call("paste", as.name(D), as.name(E))), C), lapply(H, as.name))),
    by = as.name(G)
))
eval(qc) -> ans2

"# ---- string ----"

ci = paste(A, B, sep=" > ")
cj = paste0(".(", C, " = paste(", D, ", ", E, "), ", paste(H,collapse=", "),")")
cby = paste0("list(", G, ")") # this could be simplified
DT[eval(parse(text=ci)), eval(parse(text=cj)), by = eval(parse(text=cby))] -> ans3

# another
cdt = paste0("DT[", paste(c(ci, cj, cby), collapse=","), "]")
eval(parse(text=cdt)) -> ans4

identical(ans, ans2) # T
identical(ans, ans3) # T
identical(ans, ans4) # T

What actually seems more reasonable to me is a helper function for string construction by substituting @vars, independent from the [ call. Then we limit the "new API" to a minimum: just eval-parse the constructed string. That makes it the corresponding feature to the macro proposal, which uses language objects instead of a string in this example.

query = helper("@A > @B, .(@C = paste(@D, @E), @H), @G")
DT[text = query]
# or alternative api
DT[~query]

Which might be tricky, because it is not obvious whether @H should be unlisted and merged into a single list with @C, or stay as c("bar","yyy")/.(bar, yyy).
I would stay away from handling even more types of calls in i; it looks nice and simple, but it adds new ifs to the documentation.

Because I've just extended .. prefix to work on all symbols in j=. If this .. is a mistake, then we should establish that right now, before releasing and locking that down. It will be harder to undo later. If we do need another prefix, then we should decide now and have a complete plan we're happy with, since we could feasibly wish we had not decided on .. after all.

agree

@franknarf1 (Contributor)

[Matt Dowle:] So the internals have to work out if g() could ever return a different column depending on the group, and if not, take that out up front and optimize so it's a constant. It involves replacing the g() call with a constant. All that is fiddly and error prone. Part of the reason mget invokes all columns to be present in .SD is due to things like this. It's doable, but hard.
[...]
There's a clear separation of logic: i) the first step that defines the query to run, then ii) that query runs. If we have g() and eval() and bquote() and things like that, then they run throughout the internals and make it hard.

Thanks for the response; that makes sense. I was imagining g() just being markup triggering substitution in your step (i), similar to ., which is substituted to become list; not a proper function. And only permitting as an argument a prefixed variable pointing to a character string one level up, like g(..v), which would become x or .(x = x, y = y) or similar in j, with a verbose message noting the substitution.

One objection to the @ prefix is that it seems to be designed only to cover the single-variable use case. What if we want to pass a vector of columns that is not a subset of .SDcols? (In the case of it being a subset, we could do .SD[, ..v] at least.)

My other objection is what Jan mentioned: "Yet for production code I would still construct call rather than operate on string as it adds extra sanity check for syntax." I like R's syntax checking that comes with writing code as code, as opposed to writing code inside a string. And for production code, I wouldn't really trust any code I'd written into a string, I guess.

Another variation on the same idea: extend .SDcols (or some other argument) to support a list of column-selecting expressions to substitute in:

library(data.table)
DT = data.table(mtcars)

z = "newcol"
DT[, .(..z = sum(..x + ..y), lapply(.SD, max)), by=am, 
  .SDcols = list(disp:drat, x = "wt", y = "qsec")]
# substitutes j before evaluation to become...
# list(newcol = sum(wt + qsec), disp = max(disp), hp = max(hp), drat = max(drat))

So it looks up ..x and ..y in this other argument to DT[...] first (before really looking one level up) and substitutes in the associated columns.

@mattdowle (Member) commented Mar 27, 2018

Jan seems to be almost there, now agreeing about @ token substitution, albeit as a helper. (Those quote(), call() and as.call() are all hard work for me, let alone users.) The @ tokens would be checked to be variables in calling scope, of type character and length 1 (error if not). One extra "if" in the documentation is pretty good, I'd say. It would explain what a meta-query is, and it's simple; much simpler than explaining the alternatives, for which there is established confusion and there are bug reports. The reason for not passing in a variable containing the meta-query is so that it is clear it is a meta-query right there. We don't want whether it is a meta-query or not to depend on a variable defined higher up in the code, or passed in. It should look like a meta-query.

Frank - ok I see your thinking on g() now. But if g() is to be substituted, may as well just use token substitution. The reader of such code would need to know what g() was and what it did. I think we need a more dramatic switch in look-and-feel to more clearly indicate meta-query mode.

One objection to the @ prefix is that it seems to be designed only to cover the single-variable use case.

It's designed to cover anything. Completely flexible. Straightforward token substitution to build any data.table query. Verbose mode would print the query that has been constructed. If that query doesn't parse, R will still catch that, but at run-time, which is what is required anyway. It is a dynamic query.

What if we want to pass a vector of columns that is not a subset of .SDcols? (In the case of it being a subset, we could do .SD[, ..v] at least.)

What's the full example, please, and let's explore that.

My other objection is what Jan mentioned: "Yet for production code I would still construct call rather than operate on string as it adds extra sanity check for syntax." I like R's syntax checking that comes with writing code as code, as opposed to writing code inside a string. And for production code, I wouldn't really trust any code I'd written into a string, I guess.

I would trust the token substitution method for production more than a muddle of quote, bquote, as.call, call etc. functions (such as Jan showed in his comment, which I view as unsuitable for production). Those are hard to write, hard to debug and hard to maintain. The token substitution method is simpler and therefore actually more robust for production purposes, in my view. It is more robust to log in production, for example, because the actual query that ran in production is written to the log file (which anybody can read, as it won't contain any eval(), bquote(), call() etc). I do have a lot of experience with production systems. I was responsible for bringing down Lehman Bros trading systems for half a day when I was in my early 20's :-D I have plenty of experience talking with on-site devops at 3am who are looking through my log files and have paged me. I have plenty of experience of dead feeds going unnoticed because of lax controls. The overriding goal for production is simplicity. The fewer cogs the better. The fewer dependencies the better. Simpler code always trumps more complex code. What all the calls to call(), as.call(), bquote() etc. are doing is constructing a dynamic query anyway. Yes, R is checking those constructions are valid R when you write them, but it's at run-time that they are combined and used together in the dynamic query (and that's when they might break). If a dynamic query does break in production, then I'd prefer to deal with a meta-query which tells me the query that was run, rather than trying to debug a muddle of quote, call, bquote calls.

@franknarf1 (Contributor)

[Me:] What if we want to pass a vector of columns that is not a subset of .SDcols? (In the case of it being a subset, we could do .SD[, ..v] at least.)

[Matt Dowle:] What's the full example please and lets explore that.

I am thinking of, for example, summarizing multiple groups of variables, programmatically selecting input columns and naming output columns:

library(data.table)
DT = data.table(mtcars)

min_em = c("disp", "hp")
min_em_prefix = "min"
mean_em = c("drat", "wt")
mean_em_prefix = "mean"
max_it = "qsec"
max_it_name = "mymax"
by_em = "am"

# goal query

DT[, .(
  min.disp = min(disp), min.hp = min(hp), 
  mean.drat = mean(drat), mean.wt = mean(wt), 
  mymax = max(qsec)
), by=am ]

# current syntax
# with four warnings, since ..min_em and ..mean_em are already up one level relative to .SD[...], I guess

DT[, c(
  {z <- lapply(.SD[, ..min_em], min); setNames(z, sprintf("%s.%s", ..min_em_prefix, names(z)))}, 
  {z <- lapply(.SD[, ..mean_em], mean); setNames(z, sprintf("%s.%s", ..mean_em_prefix, names(z)))}, 
  setNames(max(.SD[[..max_it]]), ..max_it_name)
), by=by_em]

# proposed syntax?

DT[, "c(
  ..min_em_prefix = lapply(@min_em, min), 
  ..mean_em_prefix = lapply(@mean_em, mean),
  ..max_it_name = max(@max_it)
)", by="..by_em"]

So, "@" here is doing two different types of substitutions:

  • for @min_em, giving a named list of columns
  • for @max_it giving just one column

Is that what you had in mind? I ask because I only noticed single-column examples so far and wasn't sure if it would also cover this use-case.

Re my version of "current syntax" above, I could write get/mget instead of subsetting .SD, but tend to try to avoid those. Also, I guess there is a way with current syntax to make sure GForce still kicks in in this example, but I guess it would be tedious.

Good point re simplicity trumping other concerns; thanks for explaining. I'm convinced that I could/would use this.

@mattdowle (Member) commented Mar 28, 2018

Great example. So, given :

min_em = c("disp", "hp")
min_em_prefix = "min"
mean_em = c("drat", "wt")
mean_em_prefix = "mean"
max_it = "qsec"
max_it_name = "mymax"
by_em = "am"

# goal query

DT[, .(
  min.disp = min(disp), min.hp = min(hp), 
  mean.drat = mean(drat), mean.wt = mean(wt), 
  mymax = max(qsec)
), by=am ]

This is a tough one! To do all of this is maybe stretching what I had in mind. But let's give it a go. First thought is to codify the requirement into one definition rather than lots of variable names. As a data.table itself, for fun.

> def = data.table( fun = c("min", "mean", "max"), 
                    prefix = c("min", "mean", "max"),
                    cols = list(c("disp","hp"), c("drat","wt"), c("qsec")) )
> def
      fun prefix    cols
   <char> <char>  <list>
1:    min    min disp,hp
2:   mean   mean drat,wt
3:    max    max    qsec

Now expand the definition. (There must be a better way of doing this step.)

> expanded = def[, paste0(prefix,".",cols[[1]],"=",fun,"(",cols[[1]],")"), by=1:3]
> expanded
       :                   V1
   <int>               <char>
1:     1   min.disp=min(disp)
2:     1       min.hp=min(hp)
3:     2 mean.drat=mean(drat)
4:     2     mean.wt=mean(wt)
5:     3   max.qsec=max(qsec)
>  J = paste(expanded$V1,collapse=", ")
> J
[1] "min.disp=min(disp), min.hp=min(hp), mean.drat=mean(drat), mean.wt=mean(wt), max.qsec=max(qsec)"

and then just use it :

> DT[ " , .(@J), by=@by_em " ]
...
Meta-query expanded to: DT[, .(min.disp=min(disp), min.hp=min(hp), mean.drat=mean(drat), mean.wt=mean(wt), max.qsec=max(qsec)), by=am]
...

Part of the difficulty here is that the requirement is not "these cols by all these funs using the fun as a prefix". Otherwise perhaps it would be easier. I realize it doesn't satisfy the mymax name requirement, either, but that could be built into the expansion helper function.

As one of the people who wrote the internal optimizations, I'm confident that the expanded meta-query in this case will be efficient and optimized, and I'm happy to work on doing better on queries like this. Whereas I am not at all happy or comfortable optimizing the version with lots of lapply, .SD, setNames() etc., all running by group. Especially if get/mget are present, the user has to think about and use .SDcols appropriately to avoid .SD receiving all columns. If there is no .SD in the expanded meta-query, then that problem goes away.

When we look later at this line in isolation, perhaps written by someone else :

DT[ " , .(@J), by=@by_em " ]

Yes we need to know it is a meta query. But as long as we know that, we can see that @J is being constructed and that by= is programmatic too. As far as dynamic queries go, which are more advanced by their nature, that's not bad. What is not there is any lapply, .SD, .SDcols, quote, bquote, as.call, etc. I suspect that beginners, if they get as far as regular data.table syntax, might go from there to meta-query quite easily, since it is just literal string replacement and they (and we) can trace and debug it relatively easily.

@jangorecki (Member)

One extra "if" to the documentation is pretty good, I'd say. It would explain what a meta-query is, and it's simple. Much simpler than explaining the alternatives, for which there is established confusion and bug reports.

Definitely; the alternatives I was showing are for experienced R programmers, not beginners.

One objection to the @ prefix is that it seems to be designed only to cover the single-variable use case.

It's designed to cover anything. Completely flexible.

Almost completely flexible. Completely flexible is R's eval-parse, or operating on language objects. The example query used above is not that obvious:

DT["@A > @B, .(@C = paste(@D, @E), @H), @G"]

Direct translation of that should map to:

DT[i = yyy > 0, j = .(NewCol = paste(month, day), list(bar, yyy)), by = foo]

and not to:

DT[i = yyy > 0, j = .(NewCol = paste(month, day), bar, yyy), by = foo]

Of course the second one is expected in this case, but it requires c/unlist logic behind the scenes to merge @H with @C into a single list. And that might not always be desired.

I would trust the token substitution method for production, more than a muddle of quote, bquote, as.call, call etc functions (such as Jan showed in his comment, which I view as unsuitable for production.) Those are hard to write, hard to debug and maintain. The token substitution method is simpler and therefore actually more robust for production purposes, in my view. It is more robust to log in production for example, because the actual query that ran in production is written to the log file (which anybody can read as it won't contain any eval(), bquote(), call() etc).

It is not a muddle; it is a powerful and admittedly complex R metaprogramming feature. Yes, it is hard to write unless you use R metaprogramming regularly. It is not harder to debug than a string: the same way you log a character string, you log a quoted call by print or deparse.

I have never had any issues from using language objects. I have used bquote in Rbitcoin for years already; no one has ever reported any issues with it. In other non-CRAN published packages I use language objects and data.table extensively; many times I make an ad-hoc wrapper customized for the case I need, just to group the metaprogramming logic in a single place. Never had an issue with it.

I fully agree on making a new, more user-friendly way for metaprogramming, but I would keep it as minimal as possible and re-use base R eval-parse or language objects as much as possible.

"Great example" using computing on the language R feature.

library(data.table)
DT = as.data.table(mtcars)

build.j = function(min_em, min_em_prefix, mean_em, mean_em_prefix, max_it, max_it_name)
  as.call(c(
    list(as.name(".")),
    lapply(setNames(min_em, paste(min_em_prefix, min_em, sep=".")), function(col) call("min", as.name(col))),
    lapply(setNames(mean_em, paste(mean_em_prefix, mean_em, sep=".")), function(col) call("mean", as.name(col))),
    setNames(list(call("max", as.name(max_it))), max_it_name)
  ))
   
qj = build.j(
  min_em = c("disp", "hp"),
  min_em_prefix = "min",
  mean_em = c("drat", "wt"),
  mean_em_prefix = "mean",
  max_it = "qsec",
  max_it_name = "mymax"
)
by_em = "am"

print(qj)
#.(min.disp = min(disp), min.hp = min(hp), mean.drat = mean(drat), mean.wt = mean(wt), mymax = max(qsec))
DT[, eval(qj), by_em] -> ans1

# as single call
qcall = as.call(list(as.name("["), quote(DT), substitute(), qj, as.name(by_em)))
print(qcall)
#DT[, .(min.disp = min(disp), min.hp = min(hp), mean.drat = mean(drat), mean.wt = mean(wt), mymax = max(qsec)), am]
eval(qcall) -> ans2

@jangorecki (Member) commented Mar 28, 2018

Another thing is security when doing eval-parse for user provided input. Computing on the language is much safer in this case:

library(data.table)
DT = as.data.table(mtcars)
col = "am"

# eval-parse way
eval(parse(text=paste0("DT[1L,.(",col,")]")))
#   am
#1:  1

# language way
eval(as.call(list(as.name("["), quote(DT), quote(1L), call(".", as.name(col)))))
#   am
#1:  1

# eval-parse abuse
writeLines("cat('harmful code here\n')", "somescript.R")
col = "{source(\"somescript.R\"); am}"
eval(parse(text=paste0("DT[1L,.(",col,")]")))
#harmful code here
#   V1
#1:  1

# try language abuse
eval(as.call(list(as.name("["), quote(DT), quote(1L), call(".", as.name(col)))))
#Error in eval(jsub, SDenv, parent.frame()) : 
#  object '{source("somescript.R"); am}' not found

This is not much of a problem for data.table users; more for developers who want to utilize data.table inside their apps, giving a UI to third parties.

@msummersgill commented Mar 28, 2018

[Matt Dowle] I don't follow much of the rest of your comment. I need to see actual code examples. I can't translate that table to code because I'm not 100% clear what you mean by 'external', for example.

I edited my previous post to include some examples and clarifications. Let me know if it's still unclear.

On the topic of ["@..."]

I think continued steps towards full non-standard evaluation within [...] (without ambiguity between column names vs objects in the parent environment) would still be extremely valuable. However, as an intermediate user who read through the data.table.R source for the very first time yesterday, I understand I may be asking for a pony here without understanding the challenges in making that possible.

If you implemented this today, I would probably use it for some cases.

[Jan] Another thing is security when doing eval-parse for user provided input.

I do use data.table in every shiny application I write, but I'm definitely on the fence about this one and curious to hear what others have to say. Since many users just eval(parse(text = "...")) currently, would this represent a regression in "security" or just a step sideways?

Testing current development branch

[Matt Dowle] Ok, now in master. Just for symbols used in j= for now. Please all test it out.

Vector vs data.table result

This may be a reasonable design decision, but this isn't the behavior I expected when ..var referenced a single column.

DT = data.table(x=1:3, y=2:4)
var = "x"


DT[, x]     ## Returns a vector
DT[, ..var] ## Returns a one column data.table with column x
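For comparison, base-style list extraction already returns a plain vector (standard [[ behavior, not part of the .. feature):

DT[[var]]   # returns the vector 1:3, like DT[, x]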

Unexpected evaluation behavior

DT = data.table(x=1:3, y=2:4)
var = "x"

DT[, paste(..var)] # ..var evaluates as column x
#    x
# 1: 1
# 2: 2
# 3: 3

DT[, paste(..var, x)] # ..var evaluates as string "x"
# [1] "x 1" "x 2" "x 3"


DT[, paste("blah",..var)] # ..var is concatenated by paste
# Error in `[.data.table`(DT, , paste("blah", ..var)) : 
#   column(s) not found: blah x

x = "foo"
var = "x"
dt[, paste("blah",get(..var))] # get(..var) "redirected" to global variable x
# Error in `[.data.table`(dt, , paste("blah", get(..var))) : 
#  column(s) not found: blah foo


dt[, .(paste("blah",..var))] # When wrapped in .() behavior is as expected
#        V1
# 1: blah x

dt[, .(paste("blah",get(..var)))] # When wrapped in .() behavior with get() is also as expected
#        V1
# 1: blah 1
# 2: blah 2
# 3: blah 3

Still need to use get()

[Matt Dowle] Frank - ok I see your thinking on g() now. But if g() is to be substituted, may as well just use token substitution. The reader of such code would need to know what g() was and what it did. I think we need a more dramatic switch in look-and-feel to more clearly indicate meta-query mode.

In a perfect world, if token substitution with a new prefix for this case could be done without putting everything into a single character string, that would seem ideal to me. I don't have a solid enough understanding of the optimization challenges to know how realistic that may or may not be, though.

Some Positives

Don't want to throw the baby out with the bath-water here; there may still be some value even if .. doesn't turn into the end-all, be-all solution for NSE in data.table.

As is, it would provide an answer to the Stack Overflow question "get(x) does not work in R data.table when x is also a column in the data table".
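A sketch of that Stack Overflow problem and how .. sidesteps it (names invented here):

DT = data.table(x = 1:3, y = 4:6)
x = "y"
DT[, get(x)]   # fails or grabs the wrong thing: get() sees the column x first
DT[, ..x]      # unambiguous: ..x is the external "y", so column y is selected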

Also, this example works as I would have expected it to.

DT = data.table(foo = LETTERS[1:5],
                 bar = seq_len(5),
                 baz = seq_len(5)/10,
                 zero = 0L)
x = "foo"
y = "bar"
z = "baz"
string1 = "blah"
num1 = 10

DT[, .(Paste = paste(..string1, get(..x), get(..y),sep = "-"),
       Mult = get(..y)*..num1,
       Add = get(..y) + get(..z),
       foo_literal = foo,
       foo_ref = get(..x))]

@franknarf1 (Contributor)

@msummersgill One advantage of a separate symbol is that we can be sure that we're getting a column:

library(data.table)
DT = data.table(foo = LETTERS[1:5],
                 bar = seq_len(5),
                 baz = seq_len(5)/10,
                 zero = 0L)

# current behavior
x = "foo"
DT[, {foo = "yeehaw"; get(..x)}]
# [1] "yeehaw"

# whereas i guess the plan is
DT[', {foo = "yeehaw"; @x}']
# [1] "A" "B" "C" "D" "E"

Similarly, regarding security, it seems like only letting @ select columns would be safe, so...

DT1 = data.table(a = 1, b = 2)
f1 = function(x, d = DT1) d[", code_vulnerable_if_x_is_bad(@x)"]
# should error for any input that isn't a column selection

# and even if the column name contains dangerous code
DT2 = data.table(`dangerous code here` = 1)
f2 = function(x, d = DT2) d[", f(@x)"]
# should select the named column, not evaluate its name

And developers shouldn't write functions like f = function(char_expr, d = DT) d[char_expr], where the user can inject arbitrary code as a string. That's my understanding of the proposal, anyway.

Now expand the definition. (There must be a better way of doing this step.)

Not trying to disagree here, but to clarify: the better way would be call and as.call ... but this expansion is to be done by data.table, not the end-user, right? So it's not like we would need to be familiar with these programming-on-the-language functions. In case Jan's code looks too bespoke for the example, there's...

library(data.table)
DT = data.table(mtcars)

# note: changing "prefix" to name to also cover mymax .. not a big extension
def = data.table(fun = c("min", "mean", "max"), 
                    name = c("min", "mean", "mymax"),
                    cols=list (c("disp","hp"), c("drat", "wt"), c("qsec") ))

def[, ncols := lengths(cols)]
longdef = def[, c(.SD[rep(.I, ncols)], .(col = unlist(cols))), .SDcols=!"cols"]
longdef[, out_name := sprintf("%s.%s", fun, col)][ncols == 1, out_name := name]

expr = with(longdef,
  as.call(c(as.name("list"), 
    setNames(Map(call, fun, lapply(col, as.name)), out_name)
  ))
)

DT[, eval(expr), by=am]

@mattdowle So your objection is to using call/as.call inside data.table to do expansions like this? If so, that seems like a separate question from what the end-user interface looks like (the @ and string-input question); and it seems like a "no" to making [.data.table more modular in the way Jan had in mind (#852)?

It's cheap talk since I've never contributed to the project, but I imagine that #852 would lower the barrier to potential contributors like me; I may be misunderstanding that issue or what you're saying here, though.

@mattdowle modified the milestones: v1.11.0, v1.11.2 Apr 29, 2018
@jangorecki (Member) commented Mar 8, 2020

I changed the title because the whole discussion deviated heavily from simple column selection (the .. prefix).


Another alternative could be providing an interface similar to substitute and its env argument.

DT[, .(out_col = fun(in_col)), by=by_col,
   env = .(out_col="mpg_max", fun="max", in_col="mpg", by_col="cyl")]

All elements of the env list would be converted to symbols and substituted in i, j, by, ...

A working example using base R, except for LHS names, which are a tricky thing:

x = list()
class(x) = "cl"
"[.cl" = function(x, i, j, env) {
  if (!missing(env) && length(env)) {
    stopifnot(is.list(env), sapply(env, is.character))
    env = lapply(env, as.symbol)
    isub = eval(substitute(substitute(i, env)))
    jsub = eval(substitute(substitute(j, env)))
  } else {
    isub = substitute(i)
    jsub = substitute(j)
  }
  list(isub=isub, jsub=jsub)
}

x[var1==5L & var2%in%c("a","b"),
  .(fvar1=fun(var1, na.rm=TRUE), charhead=head(var2, 1L))]
#$isub
#var1 == 5L & var2 %in% c("a", "b")
#$jsub
#.(fvar1 = fun(var1, na.rm = TRUE), charhead = head(var2, 1L))
x[var1==5L & var2%in%c("a","b"),
  .(fvar1=fun(var1, na.rm=TRUE), charhead=head(var2, 1L)),
  env = list(var1="myIntCol", var2="myCharCol", fun="sum")]
#$isub
#myIntCol == 5L & myCharCol %in% c("a", "b")
#$jsub
#.(fvar1 = sum(myIntCol, na.rm = TRUE), charhead = head(myCharCol, 1L))

Thanks to @ggrothendieck for assisting on how to handle LHS names:

x = list()
class(x) = "cl"
"[.cl" = function(x, i, j, env) {
  if (!missing(env) && length(env)) {
    stopifnot(is.list(env), sapply(env, is.character), sapply(env, nzchar))
    env = lapply(env, as.symbol)
    isub = if (!missing(i)) eval(substitute(substitute(i, env)))
    jsub = if (!missing(j)) eval(substitute(substitute(j, env)))
    # substitute names?
    if (any(w <- names(jsub) %in% names(env))) {
      names(jsub)[w] = unlist(lapply(env[names(jsub)[w]], as.character))
    }
  } else {
    isub = substitute(i)
    jsub = substitute(j)
  }
  list(isub=isub, jsub=jsub)
}
x[, .(fvar1=list(fvar2 = val1)),
  env = list(fvar1="outer", fvar2="inner", val1="val_column")]
#$isub
#NULL
#$jsub
#.(outer = list(fvar2 = val_column))

Unfortunately we need to recursively traverse the expression to look for nested substitutions of LHS names. RHS is well handled by substitute already.

@jangorecki changed the title from "Possible improvement to selecting cols the data.frame way" to "improvement on programmatically substituting columns in expressions" Mar 8, 2020
@jangorecki (Member)

Recursive traversal to substitute names in nested calls has been addressed in PR #4304. A substitute2 function has been introduced to substitute both values and variable/function names (handled by R's substitute) and call argument names (handled by the new internal substitute_call_arg_names R function).
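A brief hedged example of what substitute2 does, assuming a data.table version that ships it:

library(data.table)
substitute2(list(out = fun(col)),
            env = list(out = "mpg_max", fun = "max", col = "mpg"))
# list(mpg_max = max(mpg))  -- values, function names, and the LHS name all substituted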

@jangorecki (Member) commented Jun 10, 2021

Because this issue turned out to be more generic than just the ..var extension, I would advocate closing it, as it is resolved by the new env argument.
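For reference, a minimal sketch of the env interface being referred to, assuming a data.table version that ships it:

library(data.table)
DT = data.table(cyl = c(4L, 6L, 4L), mpg = c(30, 20, 34))
DT[, .(out_col = fun(in_col)), by = by_col,
   env = list(out_col = "mpg_max", fun = "max", in_col = "mpg", by_col = "cyl")]
#    cyl mpg_max
# 1:   4      34
# 2:   6      20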

@jsinnett commented Mar 23, 2023

Any chance y'all have a timeline for when this will be released to production?
I'm unable to use the dev version (1.14.9), and as of 1.14.8 I don't have access to the env argument or substitute2().

@jangorecki (Member)

There are some issues asking for a release date already; best to follow those.

@jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
@jangorecki (Member)

This issue is resolved by the new env argument (#4304), already in the master branch. If any of the examples in this issue are not covered by the new functionality, please let us know and we will re-open.
