New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
deprecate multiple columns to be defined as scalar character #4357
Comments
just found such a comment Line 66 in b1b1832
|
Let's start with making the change to dev and see how much downstream impact it has. |
yes with revdep checker we should see impact the next day |
Also need to deprecated for Lines 342 to 344 in 502c59e
|
Actually Line 16 in 502c59e
x = data.frame(a=1, b=2)
setDT(x, key="a,b")
# Error: some columns are not in the data.table: [a,b] |
I oppose this change as it will surely break existing code and I am not persuaded the benefits outweigh such a change. Even if ill-advised, such a feature was (and is!) documented as a supported way to specify columns; it was not merely a side-effect of some implementation. Given that there was never a single canonical way to specify A clearer error message would satisfy #4292 or even an automatic elimination of whitespace. |
We could try to put it on 2.0.0 then... |
@HughParsonage perhaps to clarify, the plan is not to release this change as a hard deprecation (i.e., as currently exists on Actually, the downstream impact is relatively limited -- I was anticipating many more possible breakages. Given that, I am still in favor of deprecation. Yes, there are several ways to specify It also makes it more painful to deal with "publication tables" where columns might have That leads to code inconsistency where we might have That said, the next steps would be to:
i.e. that behavior will remain available for at least two years. With that in mind, do you still think the behavior is worth retaining? Would you support a scaled back version of the deprecation (e.g., it's only available in |
Better option name datatable.by.split.comma otherwise it gives impression it is related to split method |
OK, one week passed with no follow-up from Hugh, and all but one downstream has accepted a PR fixing their usage, so I will proceed with described gradual deprecation in #6049. |
Sorry Michael and team for my absence -- had some other things to attend to. Given my absence, I'm not seeking any consideration for this issue to be reopened. My general point is that package code is a subset of wider data.table usage. And (I claim) the real-world usage of code is likely to be diverse and fraught than that in package code. In other words, the impact of backwards-compatible changes is likely to be more muted when looking at package examples, as package authors there are more likely to be responsive to changes and to use a more 'programmatic', 'formal' technique, whereas code in the bowels of organizations is likely to be a bit more slipshod. However, we should take care that our code does not unnecessarily damage the latter usage, not just for normal ethical reasons, but also because I think data.table's reputation as a long-term reliable package could be damaged. One of the reasons I delayed writing the above was the good, plausible example Michael gave of column names with commas, and the perplexing nature of writing such code. My general point I think holds up here though: we should prefer supporting code that has been written over code that is about to be written, as it's easier to rewrite code that you have just written, than to rewrite code that's 10 years old that you barely understand. Avoiding the imposition of technical debt on inexperienced R users is something I feel quite strongly. However, given your scheduling of the problem, I think my objection could be resolved by ensuring that the error message at the conclusion of the cycle is sufficiently helpful to provide such users on the correct code to write. I would recommend that the first version on this path to deprecation be the removal of this from the package documentation, rather than a package option. |
Thank you for the feedback!
in general yes, in this case there is a drop-in replacement that always works: replace with unlist(strsplit(x, ",")). we can even write that out in the error message, getting to your point about making that most helpful.
that seems like a bad approach to me, it's not how we've approached this problem in the past and it is extremely unlikely to catch notice in practice, users will be picking up this habit from all the invisible documentation as much as anything (StackOverflow, LLMs). Check #6049, the next release will have no change for users in the next release -- only a bullet in the NEWS. it's hard to imagine being more gentle than that. Lastly, I want to emphasize that kicking off a deprecation process by no means makes deprecation inevitable. We have stopped a deprecation several times in the past. But without actually doing the process, we have no way of knowing which features like this are deeply important/heavily used in user space, and thus worth maintaining despite quirky inconsistency, vs. us just spilling hot air. Putting this in the release starts the multi-year process of soliciting feedback from the large majority of users who don't follow the issue tracker. |
I am more prone to agree with @HughParsonage and revert this change, after a few months' consideration. It makes some code somewhat inconsistent but is a nice convenience feature for interactive / exploratory work. I think the logic to accommodate cases like #4292 is not that complicated to maintain inside a helper function ( Any other thoughts here on the right way forward? Particularly from @jangorecki and those who 👍 the idea to deprecated this usage: @ColeMiller1 @renkun-ken @Henrik-P @tdhock @shrektan |
I would still keep this change on the roadmap but for 2.0, where we can clean up all the inconsistencies that we found. Note that 2.0, if happen, won't happen anytime soon. |
For now I identified two cases, there are probably some more:
key
argument indata.table
should not acceptkey="A,B"
, which implies thatsetkey
should not accept it as wellby
argument in[.data.table
should not acceptby="A,B"
(sokeyby
)In both cases
c("A","B")
should be used instead.I understand that it has been added as a nice feature that makes life (very little bit) simpler, but it breaks consistency, introduces new challenges (#4292), and it is also confusing to users (like does the column has a comma in its name?).
WDUT @mattdowle
The text was updated successfully, but these errors were encountered: