deprecate multiple columns to be defined as scalar character #4357

jangorecki · 2020-04-05T18:20:43Z

For now I identified two cases, there are probably some more:

key argument in data.table should not accept key="A,B", which implies that setkey should not accept it as well
by argument in [.data.table should not accept by="A,B" (so keyby)

In both cases c("A","B") should be used instead.
I understand that it has been added as a nice feature that makes life (very little bit) simpler, but it breaks consistency, introduces new challenges (#4292), and it is also confusing to users (like does the column has a comma in its name?).
WDUT @mattdowle

The text was updated successfully, but these errors were encountered:

jangorecki · 2020-04-20T14:13:02Z

just found such a comment

data.table/R/data.table.R

Line 66 in b1b1832

# eg key="A,B"; a syntax only useful in key argument to data.table(), really.

MichaelChirico · 2024-04-03T20:59:32Z

Let's start with making the change to dev and see how much downstream impact it has.

tdhock · 2024-04-03T21:48:37Z

yes with revdep checker we should see impact the next day

MichaelChirico · 2024-04-04T01:59:38Z

Also need to deprecated for fread(key=):

data.table/R/fread.R

Lines 342 to 344 in 502c59e

    
           if (length(key) == 1L) { 
        
             key = strsplit(key, split = ",", fixed = TRUE)[[1L]] 
        
           }

MichaelChirico · 2024-04-04T02:03:27Z

Actually setDT() claims this is supported but it's actually not:

data.table/man/setDT.Rd

Line 16 in 502c59e

    
             \item{key}{Character vector of one or more column names which is passed to \code{\link{setkeyv}}. It may be a single comma separated string such as \code{key="x,y,z"}, or a vector of names such as \code{key=c("x","y","z")}. }

x = data.frame(a=1, b=2)
setDT(x, key="a,b")
# Error: some columns are not in the data.table: [a,b]

HughParsonage · 2024-04-05T12:12:14Z

I oppose this change as it will surely break existing code and I am not persuaded the benefits outweigh such a change. Even if ill-advised, such a feature was (and is!) documented as a supported way to specify columns; it was not merely a side-effect of some implementation. Given that there was never a single canonical way to specify by/keyby it seems a bit of an arbitrary choice to exclude this particular one.

A clearer error message would satisfy #4292 or even an automatic elimination of whitespace.

jangorecki · 2024-04-05T18:38:51Z

We could try to put it on 2.0.0 then...

MichaelChirico · 2024-04-05T18:43:21Z

@HughParsonage perhaps to clarify, the plan is not to release this change as a hard deprecation (i.e., as currently exists on master). master is in a transient state in order to smoke out potential downstream issues, which it now has:

#6049 (comment)

Actually, the downstream impact is relatively limited -- I was anticipating many more possible breakages.

Given that, I am still in favor of deprecation. Yes, there are several ways to specify by=, but this one creates a collision: by=character has complicated behavior because length(by) == 1L && grepl(",", by) behavior differs from length(by) > 1L || !grepl(",", by).

It also makes it more painful to deal with "publication tables" where columns might have , for any number of reasons, and we're forced to use another way to specify by= that we wouldn't otherwise, e.g. renaming the column or specifying it in backticks.

That leads to code inconsistency where we might have DT[, mean(cost), by="Revenue (%)"], then DT[, mean(cost), by = `Revenue ($1,000)`] in the next line (not a real example obviously but should convey the point). Or have to do col="Revenue ($1,000)" then by=col. Etc.

That said, the next steps would be to:

In data.table 1.16.0, leave the behavior as in 1.15.0, but toggled by an option, e.g. datatable.split.by.comma
In data.table 1.17.0, throw a warning for using the old behavior
In data.table 1.18.0, change the default
In data.table 1.19.0, throw an error for the old behavior
In data.table 1.20.0, remove the code for the old behavior

i.e. that behavior will remain available for at least two years.

With that in mind, do you still think the behavior is worth retaining? Would you support a scaled back version of the deprecation (e.g., it's only available in data.table() or only available in by=)?

jangorecki · 2024-04-07T13:22:36Z

Better option name datatable.by.split.comma otherwise it gives impression it is related to split method

MichaelChirico · 2024-04-13T00:03:54Z

OK, one week passed with no follow-up from Hugh, and all but one downstream has accepted a PR fixing their usage, so I will proceed with described gradual deprecation in #6049.

HughParsonage · 2024-04-13T07:41:10Z

Sorry Michael and team for my absence -- had some other things to attend to. Given my absence, I'm not seeking any consideration for this issue to be reopened.

My general point is that package code is a subset of wider data.table usage. And (I claim) the real-world usage of code is likely to be diverse and fraught than that in package code. In other words, the impact of backwards-compatible changes is likely to be more muted when looking at package examples, as package authors there are more likely to be responsive to changes and to use a more 'programmatic', 'formal' technique, whereas code in the bowels of organizations is likely to be a bit more slipshod. However, we should take care that our code does not unnecessarily damage the latter usage, not just for normal ethical reasons, but also because I think data.table's reputation as a long-term reliable package could be damaged.

One of the reasons I delayed writing the above was the good, plausible example Michael gave of column names with commas, and the perplexing nature of writing such code. My general point I think holds up here though: we should prefer supporting code that has been written over code that is about to be written, as it's easier to rewrite code that you have just written, than to rewrite code that's 10 years old that you barely understand. Avoiding the imposition of technical debt on inexperienced R users is something I feel quite strongly. However, given your scheduling of the problem, I think my objection could be resolved by ensuring that the error message at the conclusion of the cycle is sufficiently helpful to provide such users on the correct code to write.

I would recommend that the first version on this path to deprecation be the removal of this from the package documentation, rather than a package option.

MichaelChirico · 2024-04-13T15:37:45Z

Thank you for the feedback!

it's easier to rewrite code that you have just written, than to rewrite code that's 10 years old that you barely understand

in general yes, in this case there is a drop-in replacement that always works: replace with unlist(strsplit(x, ",")). we can even write that out in the error message, getting to your point about making that most helpful.

I would recommend that the first version on this path to deprecation be the removal of this from the package documentation, rather than a package option.

that seems like a bad approach to me, it's not how we've approached this problem in the past and it is extremely unlikely to catch notice in practice, users will be picking up this habit from all the invisible documentation as much as anything (StackOverflow, LLMs).

Check #6049, the next release will have no change for users in the next release -- only a bullet in the NEWS. it's hard to imagine being more gentle than that.

Lastly, I want to emphasize that kicking off a deprecation process by no means makes deprecation inevitable. We have stopped a deprecation several times in the past. But without actually doing the process, we have no way of knowing which features like this are deeply important/heavily used in user space, and thus worth maintaining despite quirky inconsistency, vs. us just spilling hot air. Putting this in the release starts the multi-year process of soliciting feedback from the large majority of users who don't follow the issue tracker.

MichaelChirico · 2024-05-20T17:50:03Z

I am more prone to agree with @HughParsonage and revert this change, after a few months' consideration. It makes some code somewhat inconsistent but is a nice convenience feature for interactive / exploratory work.

I think the logic to accommodate cases like #4292 is not that complicated to maintain inside a helper function (if (found missing columns under ',' split) && !(any columns with leading/trailing whitespace)) check(columns under '\\s*,\\s*' split)). Some inconsistency can be OK if it has a big benefit to user experience -- the major source of pain is when columns have "weird" characters in the names, which should be rare & comes with all sorts of caveats anyway. The most robust approach in such cases (use backticks), which I recommend & use myself, continues to work.

Any other thoughts here on the right way forward? Particularly from @jangorecki and those who 👍 the idea to deprecated this usage: @ColeMiller1 @renkun-ken @Henrik-P @tdhock @shrektan

jangorecki · 2024-05-21T05:23:55Z

I would still keep this change on the roadmap but for 2.0, where we can clean up all the inconsistencies that we found. Note that 2.0, if happen, won't happen anytime soon.

jangorecki added the consistency label Apr 5, 2020

jangorecki mentioned this issue Apr 5, 2020

should allow whitespaces in data.table(key=) #4293

Closed

MichaelChirico added the breaking-change issues whose solution would require breaking existing behavior label Apr 3, 2024

MichaelChirico mentioned this issue Apr 4, 2024

Deprecate key="a,b" and by="a,b" #6047

Merged

MichaelChirico closed this as completed in #6047 Apr 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

deprecate multiple columns to be defined as scalar character #4357

deprecate multiple columns to be defined as scalar character #4357

jangorecki commented Apr 5, 2020 •

edited by Henrik-P

jangorecki commented Apr 20, 2020

MichaelChirico commented Apr 3, 2024

tdhock commented Apr 3, 2024

MichaelChirico commented Apr 4, 2024

MichaelChirico commented Apr 4, 2024

HughParsonage commented Apr 5, 2024

jangorecki commented Apr 5, 2024

MichaelChirico commented Apr 5, 2024 •

edited

jangorecki commented Apr 7, 2024

MichaelChirico commented Apr 13, 2024

HughParsonage commented Apr 13, 2024

MichaelChirico commented Apr 13, 2024

MichaelChirico commented May 20, 2024

jangorecki commented May 21, 2024

deprecate multiple columns to be defined as scalar character #4357

deprecate multiple columns to be defined as scalar character #4357

Comments

jangorecki commented Apr 5, 2020 • edited by Henrik-P

jangorecki commented Apr 20, 2020

MichaelChirico commented Apr 3, 2024

tdhock commented Apr 3, 2024

MichaelChirico commented Apr 4, 2024

MichaelChirico commented Apr 4, 2024

HughParsonage commented Apr 5, 2024

jangorecki commented Apr 5, 2024

MichaelChirico commented Apr 5, 2024 • edited

jangorecki commented Apr 7, 2024

MichaelChirico commented Apr 13, 2024

HughParsonage commented Apr 13, 2024

MichaelChirico commented Apr 13, 2024

MichaelChirico commented May 20, 2024

jangorecki commented May 21, 2024

jangorecki commented Apr 5, 2020 •

edited by Henrik-P

MichaelChirico commented Apr 5, 2024 •

edited