Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deprecate multiple columns to be defined as scalar character #4357

Closed
jangorecki opened this issue Apr 5, 2020 · 14 comments · Fixed by #6047 · May be fixed by #6049
Closed

deprecate multiple columns to be defined as scalar character #4357

jangorecki opened this issue Apr 5, 2020 · 14 comments · Fixed by #6047 · May be fixed by #6049
Labels
breaking-change issues whose solution would require breaking existing behavior consistency

Comments

@jangorecki
Copy link
Member

jangorecki commented Apr 5, 2020

For now I identified two cases, there are probably some more:

  • key argument in data.table should not accept key="A,B", which implies that setkey should not accept it as well
  • by argument in [.data.table should not accept by="A,B" (so keyby)

In both cases c("A","B") should be used instead.
I understand that it has been added as a nice feature that makes life (very little bit) simpler, but it breaks consistency, introduces new challenges (#4292), and it is also confusing to users (like does the column has a comma in its name?).
WDUT @mattdowle

@jangorecki
Copy link
Member Author

just found such a comment

# eg key="A,B"; a syntax only useful in key argument to data.table(), really.

@MichaelChirico MichaelChirico added the breaking-change issues whose solution would require breaking existing behavior label Apr 3, 2024
@MichaelChirico
Copy link
Member

Let's start with making the change to dev and see how much downstream impact it has.

@tdhock
Copy link
Member

tdhock commented Apr 3, 2024

yes with revdep checker we should see impact the next day

@MichaelChirico
Copy link
Member

Also need to deprecated for fread(key=):

data.table/R/fread.R

Lines 342 to 344 in 502c59e

if (length(key) == 1L) {
key = strsplit(key, split = ",", fixed = TRUE)[[1L]]
}

@MichaelChirico
Copy link
Member

Actually setDT() claims this is supported but it's actually not:

\item{key}{Character vector of one or more column names which is passed to \code{\link{setkeyv}}. It may be a single comma separated string such as \code{key="x,y,z"}, or a vector of names such as \code{key=c("x","y","z")}. }

x = data.frame(a=1, b=2)
setDT(x, key="a,b")
# Error: some columns are not in the data.table: [a,b]

@HughParsonage
Copy link
Member

I oppose this change as it will surely break existing code and I am not persuaded the benefits outweigh such a change. Even if ill-advised, such a feature was (and is!) documented as a supported way to specify columns; it was not merely a side-effect of some implementation. Given that there was never a single canonical way to specify by/keyby it seems a bit of an arbitrary choice to exclude this particular one.

A clearer error message would satisfy #4292 or even an automatic elimination of whitespace.

@jangorecki
Copy link
Member Author

We could try to put it on 2.0.0 then...

@MichaelChirico
Copy link
Member

MichaelChirico commented Apr 5, 2024

@HughParsonage perhaps to clarify, the plan is not to release this change as a hard deprecation (i.e., as currently exists on master). master is in a transient state in order to smoke out potential downstream issues, which it now has:

#6049 (comment)

Actually, the downstream impact is relatively limited -- I was anticipating many more possible breakages.

Given that, I am still in favor of deprecation. Yes, there are several ways to specify by=, but this one creates a collision: by=character has complicated behavior because length(by) == 1L && grepl(",", by) behavior differs from length(by) > 1L || !grepl(",", by).

It also makes it more painful to deal with "publication tables" where columns might have , for any number of reasons, and we're forced to use another way to specify by= that we wouldn't otherwise, e.g. renaming the column or specifying it in backticks.

That leads to code inconsistency where we might have DT[, mean(cost), by="Revenue (%)"], then DT[, mean(cost), by = `Revenue ($1,000)`] in the next line (not a real example obviously but should convey the point). Or have to do col="Revenue ($1,000)" then by=col. Etc.

That said, the next steps would be to:

  1. In data.table 1.16.0, leave the behavior as in 1.15.0, but toggled by an option, e.g. datatable.split.by.comma
  2. In data.table 1.17.0, throw a warning for using the old behavior
  3. In data.table 1.18.0, change the default
  4. In data.table 1.19.0, throw an error for the old behavior
  5. In data.table 1.20.0, remove the code for the old behavior

i.e. that behavior will remain available for at least two years.

With that in mind, do you still think the behavior is worth retaining? Would you support a scaled back version of the deprecation (e.g., it's only available in data.table() or only available in by=)?

@jangorecki
Copy link
Member Author

Better option name datatable.by.split.comma otherwise it gives impression it is related to split method

@MichaelChirico
Copy link
Member

OK, one week passed with no follow-up from Hugh, and all but one downstream has accepted a PR fixing their usage, so I will proceed with described gradual deprecation in #6049.

@HughParsonage
Copy link
Member

Sorry Michael and team for my absence -- had some other things to attend to. Given my absence, I'm not seeking any consideration for this issue to be reopened.

My general point is that package code is a subset of wider data.table usage. And (I claim) the real-world usage of code is likely to be diverse and fraught than that in package code. In other words, the impact of backwards-compatible changes is likely to be more muted when looking at package examples, as package authors there are more likely to be responsive to changes and to use a more 'programmatic', 'formal' technique, whereas code in the bowels of organizations is likely to be a bit more slipshod. However, we should take care that our code does not unnecessarily damage the latter usage, not just for normal ethical reasons, but also because I think data.table's reputation as a long-term reliable package could be damaged.

One of the reasons I delayed writing the above was the good, plausible example Michael gave of column names with commas, and the perplexing nature of writing such code. My general point I think holds up here though: we should prefer supporting code that has been written over code that is about to be written, as it's easier to rewrite code that you have just written, than to rewrite code that's 10 years old that you barely understand. Avoiding the imposition of technical debt on inexperienced R users is something I feel quite strongly. However, given your scheduling of the problem, I think my objection could be resolved by ensuring that the error message at the conclusion of the cycle is sufficiently helpful to provide such users on the correct code to write.

I would recommend that the first version on this path to deprecation be the removal of this from the package documentation, rather than a package option.

@MichaelChirico
Copy link
Member

Thank you for the feedback!

it's easier to rewrite code that you have just written, than to rewrite code that's 10 years old that you barely understand

in general yes, in this case there is a drop-in replacement that always works: replace with unlist(strsplit(x, ",")). we can even write that out in the error message, getting to your point about making that most helpful.

I would recommend that the first version on this path to deprecation be the removal of this from the package documentation, rather than a package option.

that seems like a bad approach to me, it's not how we've approached this problem in the past and it is extremely unlikely to catch notice in practice, users will be picking up this habit from all the invisible documentation as much as anything (StackOverflow, LLMs).

Check #6049, the next release will have no change for users in the next release -- only a bullet in the NEWS. it's hard to imagine being more gentle than that.

Lastly, I want to emphasize that kicking off a deprecation process by no means makes deprecation inevitable. We have stopped a deprecation several times in the past. But without actually doing the process, we have no way of knowing which features like this are deeply important/heavily used in user space, and thus worth maintaining despite quirky inconsistency, vs. us just spilling hot air. Putting this in the release starts the multi-year process of soliciting feedback from the large majority of users who don't follow the issue tracker.

@MichaelChirico
Copy link
Member

I am more prone to agree with @HughParsonage and revert this change, after a few months' consideration. It makes some code somewhat inconsistent but is a nice convenience feature for interactive / exploratory work.

I think the logic to accommodate cases like #4292 is not that complicated to maintain inside a helper function (if (found missing columns under ',' split) && !(any columns with leading/trailing whitespace)) check(columns under '\\s*,\\s*' split)). Some inconsistency can be OK if it has a big benefit to user experience -- the major source of pain is when columns have "weird" characters in the names, which should be rare & comes with all sorts of caveats anyway. The most robust approach in such cases (use backticks), which I recommend & use myself, continues to work.

Any other thoughts here on the right way forward? Particularly from @jangorecki and those who 👍 the idea to deprecated this usage: @ColeMiller1 @renkun-ken @Henrik-P @tdhock @shrektan

@jangorecki
Copy link
Member Author

I would still keep this change on the roadmap but for 2.0, where we can clean up all the inconsistencies that we found. Note that 2.0, if happen, won't happen anytime soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change issues whose solution would require breaking existing behavior consistency
Projects
None yet
4 participants