A few comments #1

@hadley hadley commented Mar 22, 2022

Thanks for writing this article — I enjoyed reading it. I had a few comments/questions specifically on the tidyverse stuff so I thought this might be a reasonable way to initiate a discussion.

@ReeceGoding
Owner

Thanks for the reply. I'm very glad to hear that you enjoyed the article. It's great to know that my work hasn't gone to waste. Can I ask how you found this review? I wasn't planning to spread it around much until I had finished my final proofread. Seeing your pull request in my emails was the first hint that I seem to have gained 20,000 views overnight. Fun and terrifying!

I must also thank you for your hard work over the years. I'm sure that I've praised your books enough in this article, but special thanks has to go to chapter 13 of R for Data Science. That chapter did a lot to give me the confidence to apply for my current job. Thanks to that, I now live a much happier life.

Anyway, in terms of giving a general reply to what you've said, I'm surprised by how much we agree. When I first saw the pull request, I thought that I was about to get obliterated. It's a shame that, in the cases where we totally agree on a fault of the tidyverse, backwards compatibility has prevented you from fixing it. I suppose that you're damned if you do and damned if you don't. Too little backwards compatibility, and I'll complain because your API is unstable. Too much, and you're just as cursed as R's marriage to S. I respect the dilemma that you're in.

For the specifics, I'll go point-by-point wherever I have something to say:

HW: Obviously I disagree strongly with this :)

Yeah. I'd be deeply worried if you agreed. The Tidyverse certainly does a lot to improve R, but my belief is that it can never fix R. It's a small difference, but a key one. Continuing my crass (and, in present company, harsh) "polished turd" point, I believe that the final version of the Tidyverse will make R a remarkably well polished turd. The great shame is that, through no fault of yours, your outstanding work can't stop R being a turd.

HW: Not sure what you mean by the reference to rlang here? The end user of the tidyverse should be very rarely exposed to this package.

There's not much that I can say. I'm just sure that when I've messed up in the tidyverse, rlang has thrown the error.

HW: I'd love to dig into this more because I'd consider this specific deprecation a win: https://r4ds.had.co.nz/many-models.html?q=unnest#unnesting shows a single, actionable warning and the code continues to run as is, despite the deprecation occurring ~2.5 years ago. We are really trying to make sure deprecated code continues to work for many years.

It's great that the code still works, but that's not what I was getting at. My argument was essentially "tidy deprecates functions so often that its own book can't keep up". I've got nothing to say against the deprecation in and of itself. It's more that you would expect the book to keep on top of what is and isn't deprecated.

HW: I don't think this is a specific feature of piping. I agree that purrr has too many functions, and we're likely to substantially reduce that number if we take another major stab at this problem space. I don't see how you'd replicate the behaviour of dplyr::select(). See tidyr for a counter example of a tidyverse package whose functions have a lot of arguments.

More than once, I've come close to deleting the point that you're replying to. That's partly why I've hedged it with that "I'm willing to be proven wrong here" line. However, I just can't shake the feeling that I'm on to something, so I've always kept it in. At minimum, I think I'm right to say that we don't yet know if building around pipes is a good idea. But, I don't think that I know if the issue that I've identified is a true, useful, or key one. I lack the experience. If anyone's qualified to write a list of risks and pitfalls inherent to pipe-focused design, it's probably yourself. It'd be a remarkably good read.

I also can't think of a better way to replicate dplyr::select. I'll have to take a good look at that counter example.

HW: Hmmm, I don't see how the filter API could work with other data types because it's fundamentally about variables. Do you have a specific example of subsetting a character vector where you wanted to use filter instead?

I don't. You shouldn't try to fix this. I'll update that point.

HW: well technically there is, it's just that no one uses it: autoplot(). I'm not entirely sure why this never caught on given the popularity of the plot() interface.

There is? Thank you for introducing me to my new best friend. I'm going to play around with that for a long time. I'll update the point once the party's over.

HW: I don't think so?

I certainly can't prove that I'm right on this, nor can I think of any way to be proven wrong. However, I think that the expressiveness that was lost by adopting the purrr anonymous function syntax had to be regained somewhere. Without doing so, you'd get much harder to read code. I wish that I had stronger evidence, or frankly any evidence at all, but that's why I can't elevate that point any higher than merely saying "I suspect".

@hadley
Author

hadley commented Mar 23, 2022

I'm about to knock off for the day so I'll write more tomorrow, but I thought you'd like to know that the reason you got so many reads is that you made the front page of Hacker News. There are ~250 comments there: https://news.ycombinator.com/item?id=30764505

@hadley
Author

hadley commented Mar 31, 2022

I must also thank you for your hard work over the years. I'm sure that I've praised your books enough in this article, but special thanks has to go to chapter 13 of R for Data Science. That chapter did a lot to give me the confidence to apply for my current job. Thanks to that, I now live a much happier life.

Thanks, that means a lot :)

Anyway, in terms of giving a general reply to what you've said, I'm surprised by how much we agree. When I first saw the pull request, I thought that I was about to get obliterated. It's a shame that, in the cases where we totally agree on a fault of the tidyverse, backwards compatibility has prevented you from fixing it. I suppose that you're damned if you do and damned if you don't. Too little backwards compatibility, and I'll complain because your API is unstable. Too much, and you're just as cursed as R's marriage to S. I respect the dilemma that you're in.

Exactly :)

Yeah. I'd be deeply worried if you agreed. The Tidyverse certainly does a lot to improve R, but my belief is that it can never fix R. It's a small difference, but a key one. Continuing my crass (and, in present company, harsh) "polished turd" point, I believe that the final version of the Tidyverse will make R a remarkably well polished turd. The great shame is that, through no fault of yours, your outstanding work can't stop R being a turd.

I think the metaphor I'd prefer is that R is an uncut diamond: it has a bunch of great ideas, and just needs some cutting and polishing to really make it shine.

HW: Not sure what you mean by the reference to rlang here? The end user of the tidyverse should be very rarely exposed to this package.

There's not much that I can say. I'm just sure that when I've messed up in the tidyverse, rlang has thrown the error.

Ah ok, I think this might just be a difference in what we mean by "monolith". From my perspective, the use of common dependencies doesn't make the tidyverse a monolith because you can still use many of the pieces independently.

HW: I'd love to dig into this more because I'd consider this specific deprecation a win: https://r4ds.had.co.nz/many-models.html?q=unnest#unnesting shows a single, actionable warning and the code continues to run as is, despite the deprecation occurring ~2.5 years ago. We are really trying to make sure deprecated code continues to work for many years.

It's great that the code still works, but that's not what I was getting at. My argument was essentially "tidy deprecates functions so often that its own book can't keep up". I've got nothing to say against the deprecation in and of itself. It's more that you would expect the book to keep on top of what is and isn't deprecated.

One of the challenges of deprecations is making them obvious when you're working interactively, but not so obvious that they break automated workflows. So the design of our deprecation messages deliberately makes them hard to spot automatically, as in the book's CI/CD pipeline. And, while it's an obvious post-hoc justification, it's actually good that R4DS has some deprecation messages in it, because it normalises using deprecated functions :)
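As a rough illustration of that trade-off, here is a minimal base R sketch of a deprecation that warns rather than errors, and only once per session. It is not how the tidyverse's own deprecation machinery is implemented; `deprecate_once()`, `old_mean()`, and the once-per-session rule are hypothetical choices made for this example.

```r
# Hypothetical helper: remember which deprecations have already been reported.
.already_warned <- new.env()

deprecate_once <- function(what, with) {
  # Stay silent if this deprecation was already reported in this session.
  if (exists(what, envir = .already_warned, inherits = FALSE)) {
    return(invisible(FALSE))
  }
  assign(what, TRUE, envir = .already_warned)
  # A warning (not an error), so existing code keeps running.
  warning(
    sprintf("`%s` is deprecated; please use `%s` instead.", what, with),
    call. = FALSE
  )
  invisible(TRUE)
}

# Hypothetical deprecated function that still returns the right answer.
old_mean <- function(x) {
  deprecate_once("old_mean()", "mean()")
  mean(x)
}

# Call it twice and count how many warnings are raised along the way.
warnings_seen <- 0
results <- withCallingHandlers(
  c(old_mean(1:3), old_mean(4:6)),
  warning = function(w) {
    warnings_seen <<- warnings_seen + 1
    invokeRestart("muffleWarning")
  }
)
```

Both calls return their results, and only the first one raises a warning, which is the "visible but not breaking" behaviour described above.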

HW: I don't think this is a specific feature of piping. I agree that purrr has too many functions, and we're likely to substantially reduce that number if we take another major stab at this problem space. I don't see how you'd replicate the behaviour of dplyr::select(). See tidyr for a counter example of a tidyverse package whose functions have a lot of arguments.

More than once, I've come close to deleting the point that you're replying to. That's partly why I've hedged it with that "I'm willing to be proven wrong here" line. However, I just can't shake the feeling that I'm on to something, so I've always kept it in. At minimum, I think I'm right to say that we don't yet know if building around pipes is a good idea. But, I don't think that I know if the issue that I've identified is a true, useful, or key one. I lack the experience. If anyone's qualified to write a list of risks and pitfalls inherent to pipe-focused design, it's probably yourself. It'd be a remarkably good read.

I agree that it can take a long time to fully understand the implications of new syntax, and the introduction of piping into R is certainly new enough that there might be unintended consequences that we have yet to discover. But given that the pipeline paradigm is well established elsewhere (e.g. unix pipes) and that R Core (generally a very conservative group) liked it enough to add it to base R, I have few worries.

Additionally, I don't think that the introduction of piping changes that much in how you design functions - it mostly just forces you to think more about what the first argument should be (in a way that I'd argue is almost always uniformly positive). The tidyverse generally favours functions that are pipeable AND small/composable, but I don't think those two ideas are fundamentally interrelated.
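The first-argument point can be sketched with the base pipe, assuming R 4.1 or later. The two helper functions below are hypothetical and exist only to contrast a data-first signature with a data-last one.

```r
# Data-first: the piped value lands in `x` without any extra work.
scale_by <- function(x, multiplier) x * multiplier

# Data-last: same computation, but the data is the second argument.
multiplier_then_data <- function(multiplier, x) x * multiplier

# The pipe inserts its left-hand side as the first argument of the call.
piped  <- c(1, 2, 3) |> scale_by(10)
nested <- scale_by(c(1, 2, 3), 10)

# With the data-last signature, the pipe needs a wrapper function
# to route the value into the right slot.
awkward <- c(1, 2, 3) |> (\(v) multiplier_then_data(10, v))()
```

All three expressions compute the same vector; the only difference is how much ceremony the argument order imposes on the caller.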

HW: I don't think so?

I certainly can't prove that I'm right on this, nor can I think of any way to be proven wrong. However, I think that the expressiveness that was lost by adopting the purrr anonymous function syntax had to be regained somewhere. Without doing so, you'd get much harder to read code. I wish that I had stronger evidence, or frankly any evidence at all, but that's why I can't elevate that point any higher than merely saying "I suspect".

The other perspective that generally pushed me away from base style towards ~ is that it's much easier to teach (e.g. https://github.com/cwickham/purrr-tutorial/blob/master/slides.pdf). So easy, in fact, that you can teach it before you teach functions, which gives students a powerful approach early on. While I'm happy writing code like lapply(data, lm, formula = mpg ~ wt), because I understand the precise rules for argument matching, most people find it hard to understand what's going on.
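To make the argument-matching point concrete, here is a small sketch using the built-in `mtcars` data. Splitting by `cyl` is an assumption made for the example (the comment above only says `data`), and the purrr line runs only if purrr happens to be installed.

```r
# One data frame per cylinder count.
by_cyl <- split(mtcars, mtcars$cyl)

# Relies on argument matching: `formula` is supplied by name, so each
# data frame passed by lapply() fills the next unmatched argument of lm(),
# which is `data`.
fits_matched <- lapply(by_cyl, lm, formula = mpg ~ wt)

# The same fits with an explicit anonymous function.
fits_explicit <- lapply(by_cyl, function(df) lm(mpg ~ wt, data = df))

# Check that both approaches give the same coefficients for every group.
same_coefs <- mapply(
  function(a, b) isTRUE(all.equal(coef(a), coef(b))),
  fits_matched, fits_explicit
)

# The purrr formula shorthand, where `.x` stands for the current element.
if (requireNamespace("purrr", quietly = TRUE)) {
  fits_purrr <- purrr::map(by_cyl, ~ lm(mpg ~ wt, data = .x))
}
```

The first form is the shortest, but it only reads clearly once you know that naming `formula` frees up the first positional slot for `data`.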

That said, I used to be a very passionate advocate for purrr, and now, for whatever reason, it just doesn't excite me that much. Maybe it's because many of the common challenges that previously required purrr (e.g. rectangling or loading a directory of csv files) don't any more because we've built out the tooling elsewhere (e.g. tidyr::unnest_wider(), readr::read_csv(dir())). That means that purrr can be taught much later in an intro data science sequence, and it becomes more reasonable to assume that the user is familiar with functions.
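A hedged sketch of the "directory of csv files" case: the iterate-then-bind version is written in base R here (standing in for the purrr approach), and the single-call version assumes readr 2.0.0 or later, where `read_csv()` accepts a vector of paths. The file names and contents are made up for the example.

```r
# Write two small CSV files into a fresh temporary directory.
csv_dir <- tempfile("csv-demo-")
dir.create(csv_dir)
write.csv(data.frame(id = 1:2, value = c(10, 20)),
          file.path(csv_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c(30, 40)),
          file.path(csv_dir, "b.csv"), row.names = FALSE)

files <- sort(dir(csv_dir, pattern = "\\.csv$", full.names = TRUE))

# Iterate over the files, then bind the pieces together.
combined_base <- do.call(rbind, lapply(files, read.csv))

# Single call, no explicit iteration (only if a recent readr is available).
if (requireNamespace("readr", quietly = TRUE) &&
    utils::packageVersion("readr") >= "2.0.0") {
  combined_readr <- readr::read_csv(files, show_col_types = FALSE)
}
```

In the second form the iteration is hidden inside the reader, so there is no need to explain mapping over a list before someone can load a folder of files.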

@ReeceGoding
Owner

ReeceGoding commented Mar 31, 2022

I think the metaphor I'd prefer is that R is an uncut diamond: it has a bunch of great ideas, and just needs some cutting and polishing to really make it shine.

Well done. That's the best re-framing that I've seen all year. My objection is that base R is no diamond and that you'd struggle to get it to change. But, yes, we both agree that base R contains some great ideas. As for disagreements, I think we'd both go in circles on this point.

post-hoc justification, it's actually good that R4DS has some deprecation messages in it, because it normalises using deprecated functions :)

That's so smart that you should sneak it in to the book. I'm entirely serious: "Don't fear the deprecated" is a lesson worth learning in the Tidyverse.

I agree that it can take a long time to fully understand the implications of new syntax, and the introduction of piping into R is certainly new enough that there might be unintended consequences that we have yet to discover. But given that the pipeline paradigm is well established elsewhere (e.g. unix pipes) and that R Core (generally a very conservative group) liked it enough to add it to base R, I have few worries.

Additionally, I don't think that the introduction of piping changes that much in how you design functions - it mostly just forces you to think more about what the first argument should be (in a way that I'd argue is almost always uniformly positive). The tidyverse generally favours functions that are pipeable AND small/composable, but I don't think those two ideas are fundamentally interrelated.

I concede the point. You have far greater experience on this topic than I.

The other perspective that generally pushed me away from base style towards ~ is that it's much easier to teach (e.g. https://github.com/cwickham/purrr-tutorial/blob/master/slides.pdf). So easy, in fact, that you can teach it before you teach functions, which gives students a powerful approach early on. While I'm happy writing code like lapply(data, lm, formula = mpg ~ wt), because I understand the precise rules for argument matching, most people find it hard to understand what's going on.

Yeah, base R forces you to learn functional programming idioms. There's just no way around that. purrr's clever parsing sidesteps the issue. However, I think that this gets back to an earlier objection that I had to purrr's ways. Although it's pretty easy to show someone the basics of what ~ does in purrr (just as those slides do), there are an awful lot of tricks that can be done within purrr's arguments (I believe that pluck() is the function behind the magic?). In contrast, anonymous functions aren't too easy to learn, but they're almost the only lesson that you need to learn in base R's apply family. This is just like that "would I rather learn base R's handful of complex functions or purrr's truckload of simple ones?" point that I stole from Professor Matloff.

That said, I used to be a very passionate advocate for purrr, and now, for whatever reason, it just doesn't excite me that much. Maybe it's because many of the common challenges that previously required purrr (e.g. rectangling or loading a directory of csv files) don't any more because we've built out the tooling elsewhere (e.g. tidyr::unnest_wider(), readr::read_csv(dir())).

Those slides provoked a similar "wouldn't I just use some other part of the Tidyverse for this?" reaction from me. Our guts agree.
