Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add total.label argument for groupingsets, cube, rollup #5973

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

markseeto
Copy link
Contributor

Closes #5351

Added a total.label argument to the rollup.data.table, cube.data.table, and groupingsets.data.table functions.

A couple of comments:

  • It was sometimes not clear what should be an error, what should be a warning, and what should be neither. I made choices that seem reasonable to me, but there would be alternatives that are also reasonable.
  • I wondered whether all.label might be a better name than total.label, since the thing being calculated won't always be a total. But I left it as total.label because the documentation uses the terminology "(sub-)totals", and the word "total" is used in other software, so "total" seems to be widely accepted.

Copy link

codecov bot commented Mar 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.52%. Comparing base (b6d6100) to head (b6d1cf9).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5973      +/-   ##
==========================================
+ Coverage   97.51%   97.52%   +0.01%     
==========================================
  Files          80       80              
  Lines       14915    14987      +72     
==========================================
+ Hits        14544    14616      +72     
  Misses        371      371              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jangorecki
Copy link
Member

As for the arg name I would keep it as "label"

R/groupingsets.R Outdated
@@ -57,6 +57,16 @@ groupingsets.data.table = function(x, j, by, sets, .SDcols, id = FALSE, jj, ...)
stopf("Argument 'sets' must be a list of character vectors.")
if (!is.logical(id))
stopf("Argument 'id' must be a logical scalar.")
if (!(is.null(total.label) ||
(is.character(total.label) && length(total.label) == 1L) ||
(is.list(total.label) && all(vapply_1b(total.label, is.character)) &&
Copy link
Member

@jangorecki jangorecki Mar 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can rule out list because AFAIU char vector will be sufficient. Please correct me if I am wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jangorecki Thanks for reviewing. To make sure I understand you correctly, do you mean that total.label = list(Region = "National", Product = "Total") should not be allowed, and it should instead be total.label = c(Region = "National", Product = "Total")? I would have to try it to be sure, but I think that would be OK.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, as long as each element in the list is scalar then list is not necessary and character vector is sufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jangorecki Thanks, I'll try that.

Copy link
Member

@jangorecki jangorecki Mar 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I was wrong and labels must not necessarily be of character type, is that correct? Then we need a list type so mixed types can be provided.

Would be good to use such little bit more complex example so it comes up straight away.

Copy link
Contributor Author

@markseeto markseeto Mar 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jangorecki. Currently it only supports character and factor grouping variables. For other types, it's left as NA.

In #5351 you wrote

Then we of course want total.label="Total" to be recycled to list(State = "Total", Group = "Total"). Of course only if all columns are character.

I interpreted this as meaning the label argument should only apply to variables of type character, but now I realise that maybe I interpreted incorrectly and you might have meant that total.label = "Total" should only be recycled to character variables, but non-character labels for non-character variables are still allowed in a list.

A couple of options (and there might be others):

  1. Only apply the label argument to grouping variables of type character or factor, and require label = c(Region = "National", Product = "Total") instead of label = list(Region = "National", Product = "Total").
  2. Allow the label argument to apply to grouping variables that aren't character or factor, and allow total.label to be a list so that different types can be specified for different variables.

Which do you prefer? As I mentioned above, if I had a date or integer grouping variable, I would probably still want a character total label (e.g. "Total") rather than a date or integer total label. To do this, currently the user would have to convert the variable to character or factor before using groupingsets/cube/rollup.

Maybe for non-character grouping variables, we could allow a choice of specifying a label of the same type as the variable, or a character label, and if a character label is provided then the variable will be converted to character in the output. For example, if A is character and B is integer, then label = list(A = "LabelA", B = 999L) and label(A = "LabelA", B = "LabelB") would both be allowed, and the second would result in B being character in the output. If we do that, then I think I would favour having label = "Label" apply to all variables, not just character variables.

Copy link
Member

@jangorecki jangorecki Mar 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing column type is not an option, column type needs to be as original columns used in 'by' and not in labels (so no for option B). Option 2. We cannot apply it to all variables because we must match column types, we don't just print results but return a data.table so saying it is only for output won't be useful. Our output is data.table and column types cannot depend on labels arg. It is the labels arg has to match to column types in the output 'by'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By "output", I meant the data.table that is returned. Sorry for the confusion.

Thanks for clarifying that the column types have to stay the same. That makes it simpler.

Do you think the shorter form label = "Total" should be allowed, or should label always be a named list? If the shorter form is allowed, should it always have to be character, or would something like label = 999L be allowed and apply to all the integer grouping variables?

Copy link
Member

@jangorecki jangorecki Mar 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be allowed but for non-char column will result an NA, which anyway is likely to be expected. Similarly 999L can be also allowed but would result NA for non int columns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll give that a try.

@markseeto
Copy link
Contributor Author

As for the arg name I would keep it as "label"

@jangorecki Thanks for your comment. Do you mean change it from "total.label" to "label", or keep it as "total.label"?

@jangorecki
Copy link
Member

Yes, change to shorter "label" groupings sets are all about (sub-)totals so we don't need to be so explicitly in naming variable so verbosely. Possibly even better than "label" will he better to use "labels" so it is clear that multiple values are supported.

@jangorecki
Copy link
Member

Very good PR. What would be even more useful is minimal example usage in NEWS entry. Including different data types for labels.

@markseeto
Copy link
Contributor Author

Yes, change to shorter "label" groupings sets are all about (sub-)totals so we don't need to be so explicitly in naming variable so verbosely. Possibly even better than "label" will he better to use "labels" so it is clear that multiple values are supported.

@jangorecki I prefer "total.label" or "total.labels" because they make it easier to guess where the label is going to be used, but I don't feel strongly about this, so I'm happy to change it to "labels".

@markseeto
Copy link
Contributor Author

Very good PR. What would be even more useful is minimal example usage in NEWS entry. Including different data types for labels.

@jangorecki Thanks for the feedback. I can add an example to the NEWS entry.

@jangorecki
Copy link
Member

jangorecki commented Mar 4, 2024

Yes, change to shorter "label" groupings sets are all about (sub-)totals so we don't need to be so explicitly in naming variable so verbosely. Possibly even better than "label" will he better to use "labels" so it is clear that multiple values are supported.

@jangorecki I prefer "total.label" or "total.labels" because they make it easier to guess where the label is going to be used, but I don't feel strongly about this, so I'm happy to change it to "labels".

Another reason why I prefer single word argument name is it does not enforce any style, like total.label or totalLabel or total_label will do.

@markseeto
Copy link
Contributor Author

Yes, change to shorter "label" groupings sets are all about (sub-)totals so we don't need to be so explicitly in naming variable so verbosely. Possibly even better than "label" will he better to use "labels" so it is clear that multiple values are supported.

@jangorecki I prefer "total.label" or "total.labels" because they make it easier to guess where the label is going to be used, but I don't feel strongly about this, so I'm happy to change it to "labels".

Another reason why I prefer single word argument name is it does not enforce any style, like total.label or totalLabel or total_label will do.

OK, I can change it. Maybe "label" is better than "labels", since "labels" is a base function.

Add 'label' argument to the groupingsets.data.table(), cube.data.table(), and rollup.data.table() functions.
Add to the Usage, Arguments, Details, and Examples sections.
@markseeto markseeto reopened this Apr 4, 2024
@markseeto
Copy link
Contributor Author

markseeto commented Apr 4, 2024

@jangorecki I've updated the functions, tests, documentation, and news, based on our discussions. When you have time, please let me know what you think.

I wasn't sure about the correct way to update a PR. Apologies if I've done it incorrectly.

I don't understand the reason for the red cross next to the News commit. When I click on the cross, it says "R-CMD-check / windows-latest (devel) (pull_request)" was cancelled, but in the list of checks, it says that check was successful.

Thanks.

@ben-schwen

This comment was marked as resolved.

@markseeto

This comment was marked as resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"total" label in cube/rollup/groupingsets
3 participants