Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adorn_cumulative function for one-way tabyls #238

Open
sfirke opened this issue Sep 19, 2018 · 6 comments
Open

adorn_cumulative function for one-way tabyls #238

sfirke opened this issue Sep 19, 2018 · 6 comments
Labels
pull-request-welcome seeking comments Users and any interested parties should please weigh in - this is in a discussion phase!

Comments

@sfirke
Copy link
Owner

sfirke commented Sep 19, 2018

Suggested by @elinw in #231.

I am thinking this function takes a data.frame and a column name and appends a column that is the cumulative sum of that target column. It also takes an argument dir specifying whether the function should sum down from the top (default) or up from the bottom.

The default case would be to run on the result of a one-way tabyl so in my version below, when the column name is not specified it first looks for valid_percent and then percent. If both are present, should it generate two cumsum columns, or should it default to valid_percent?


#' @title Add a cumulative sum column to a data.frame.
#'
#' @description A tidyverse-style function to add a cumsum column.
#'
#' @param dat data.frame to add cumulative sum to
#' @param colname specify the unquoted name of the column to sum, or leave it blank in which case the function will default first to a column called "valid_percent" and then for "percent".  These defaults support running this function on the result of calls to \code{janitior::tabyl}.  If no colname is supplied and these default columns are absent in the data.frame, the function will error.
#' @param dir direction to sum; defaults to "down" but can sum from the bottom of a data.frame with "up".  In this case the resulting column name will be "cumulative_up".
#'
#' @return a data.frame.
#' @export
#'
#' @examples
#' library(janitor)
#' mtcars %>%
#'   adorn_cumulative(mpg)
#'
#' mtcars %>%
#'   tabyl(cyl) %>%
#'   adorn_cumulative()
#'
#' # Vector with an NA
#' x <- c(0, 1, 2, 3, 3, 3, NA)
#'
#' x %>%
#'   tabyl() %>%
#'   adorn_cumulative()
#'
#' x %>%
#'   tabyl() %>%
#'   adorn_cumulative(dir = "up")

adorn_cumulative <- function(dat, colname, dir = "down"){

  if(!missing(colname)){
    colname <- rlang::enquo(colname)
  } else if("valid_percent" %in% names(dat)) {
  colname <- rlang::sym("valid_percent")
  } else if("percent" %in% names(dat)){
    colname <- rlang::sym("percent")
  } else {
    stop("\"colname\" not specified and default columns valid_percent and percent are not present in data.frame dat")
  }

  target <- dplyr::pull(dat, !! colname)

  if(dir == "up"){
    target <- rev(target)
  }
  dat$cumulative <- cumsum(ifelse(is.na(target), 0, target)) + target*0 # an na.rm version of cumsum, from https://stackoverflow.com/a/25576972
  if(dir == "up"){
    dat$cumulative <- rev(dat$cumulative)
    names(dat)[names(dat) %in% "cumulative"] <- "cumulative_up"
  }
  dat
}

^^^ this works so try out the examples and give feedback in this issue!

@sfirke sfirke added the seeking comments Users and any interested parties should please weigh in - this is in a discussion phase! label Sep 19, 2018
@sbalci
Copy link

sbalci commented Feb 13, 2019

Hello. Thank you for the package.
The cumulative function works, but when used with Totals, it adds up the 100% and ends up to 200%.
I have attached the code and result:

ekran resmi 2019-02-13 21 26 08

@sfirke
Copy link
Owner Author

sfirke commented Feb 14, 2019

I guess adorn_cumulative should check for a totals row, and if so then put something else in the last value. NA is probably easiest. Then adorn_pct_formatting would need to turn NA into something like "-", not NA%. Edit: this is more of a problem when dir = "up", then the 100 from the totals would immediate throw things off.

@sfirke sfirke added this to the v1.2 milestone Mar 15, 2019
@sfirke
Copy link
Owner Author

sfirke commented Mar 15, 2019

I think this will be a useful addition to janitor, but it might have more complexity and require more care than I can devote for the 1.2 release. I'll try, though, and am marking this 1.2.

@sfirke
Copy link
Owner Author

sfirke commented Apr 20, 2019

This is still worthy but I can't get this done right in the next 24 hours, it will have to wait for some future release. Anyone's welcome to turn the above code into a PR with tests & addressing the issues with totals and formatting identified above.

@sfirke sfirke added this to the v2.2 milestone Nov 24, 2020
@sfirke sfirke modified the milestone: v2.2 Dec 25, 2020
@mattroumaya
Copy link
Contributor

mattroumaya commented Feb 5, 2021

Hi @sfirke,

I have been trying to add to the code you provided above, and wanted to share what I have in case it's useful. I think that this will work in any case where adorn_totals() is called and then adorn_cumulative(), but am struggling to find a good way to implement if someone were to round values before adorn_cumulative().

The only thing I can think of is to warn the user if the max column value falls within a threshold of the sum of the cumulative sum.

Example:

library(dplyr)
library(janitor)
set.seed(1)

data.frame(test = "test",
           a = runif(100)) %>% 
  adorn_totals() %>% 
  mutate(a = round(a,0)) %>% 
  adorn_cumulative(a) %>% 
  tail()

It appears you have a 'totals' column. Watch for rounding discrepancies.
  test  a cumulative
  test  0         NA
  test  1         48
  test  0         NA
  test  1         49
  test  1         50
 Total 50         NA

This is obviously more of an issue with large datasets with rounded values.

Anyway, hopefully this is useful and helps to implement adorn_cumulative(). I work with many analysts who are moving from SPSS to R, and this would really help to emulate a SPSS-style FREQUENCIES table.

Edit: Possibly add an ignore_row arg?

set.seed(1)

data.frame(test = "test",
           a = runif(100)) %>% 
  adorn_totals() %>% 
  mutate(a = round(a,0)) %>% 
  arrange(desc(a)) %>% 
  adorn_cumulative(a, ignore_row = "first") %>% 
  head()

 test  a cumulative
 Total 56         NA
  test  1          1
  test  1          2
  test  1          3
  test  1          4
  test  1          5
adorn_cumulative <- function(dat, colname, dir = "down", ignore_row = NULL){
  
  if(!missing(colname)){
    colname <- rlang::enquo(colname)
  } else if("valid_percent" %in% names(dat)) {
    colname <- rlang::sym("valid_percent")
  } else if("percent" %in% names(dat)){
    colname <- rlang::sym("percent")
  } else {
    stop("\"colname\" not specified and default columns valid_percent and percent are not present in data.frame dat")
  }
  
  if (!dir %in% c("up", "down")) {
    stop("'dir' must be one of 'up', 'down'")
  }
  
  check <- rlang::quo_name(colname)
  
  if(!inherits(dat[[check]], c("numeric"))) {
    stop("column must be of class numeric.")
  }
  
  if (is.null(ignore_row)){
    target <- dplyr::pull(dat, !! colname)
  } else  if(ignore_row == "last"){
    target <- dplyr::pull(dat, !! colname)
    target[length(target)] <- NA
  }else if (ignore_row == "first"){
    target <- dplyr::pull(dat, !! colname)
    target[1] <- NA
  }
  

  sum_is_1 <- sum(target,na.rm = T) - max(target,na.rm = T) == 1  
  
  cumsum_1 <- max(cumsum(target),na.rm = T) - max(target,na.rm = T) == 1 
  
  cumsum_2 <- max(cumsum(target),na.rm = T) - max(target,na.rm = T) == 2 
  
  cumsum_n <- sum(target, na.rm = T) / max(target, na.rm = T) == 2
  
  if ((dir == "up" & sum_is_1 == T) | 
      (dir == "up" & cumsum_1 == T) | 
      (dir== "up" & cumsum_2 == T)  |
      (dir == "up" & cumsum_n == T)) {
    target <- replace(target, target == max(target,na.rm=T), NA)
    message("It appears you have a 'totals' column. Watch for rounding discrepancies.")
    
  } else if (sum_is_1 == T | cumsum_1 == T | cumsum_2 == T | cumsum_n == T) {
    target <- replace(target, target == max(target,na.rm=T), 0)
    message("It appears you have a 'totals' column. Watch for rounding discrepancies.")
    
  } else if ((sum(target,na.rm = T) - max(target,na.rm = T) > 1.9 & sum(target,na.rm = T) - max(target,na.rm = T) < 2.1 ) |
             (max(cumsum(target),na.rm = T) - max(target,na.rm = T) > 1.9 & max(cumsum(target),na.rm = T) - max(target,na.rm = T) < 2.1) |
             (max(cumsum(target),na.rm = T) - max(target,na.rm = T) > 1.9 & max(cumsum(target),na.rm = T) - max(target,na.rm = T) < 2.1) |
             (sum(target, na.rm = T) / max(target, na.rm = T) > 1.9 & sum(target, na.rm = T) / max(target, na.rm = T) < 2.1)){
    warning("It appears you have a rounded 'totals' column. Watch for rounding discrepancies.")
  }
  
  
  
  if(dir == "up"){
    target <- rev(target)
  }
  
  dat$cumulative <- cumsum(ifelse(is.na(target), 0, target)) + target*0 # an na.rm version of cumsum, from https://stackoverflow.com/a/25576972
  dat$cumulative[duplicated(dat$cumulative)] <- NA
  
  if(dir == "up"){
    dat$cumulative <- rev(dat$cumulative)
    names(dat)[names(dat) %in% "cumulative"] <- "cumulative_up"
    
    # Creates NA for repeated values
    # Assumption that only repeated cumulative value would be a total
    dat$cumulative_up[duplicated(dat$cumulative_up)] <- NA  
  }
  dat
}

@truenomad
Copy link

Hi @sfirke,

I have been trying to add to the code you provided above, and wanted to share what I have in case it's useful. I think that this will work in any case where adorn_totals() is called and then adorn_cumulative(), but am struggling to find a good way to implement if someone were to round values before adorn_cumulative().

The only thing I can think of is to warn the user if the max column value falls within a threshold of the sum of the cumulative sum.

Example:

library(dplyr)
library(janitor)
set.seed(1)

data.frame(test = "test",
           a = runif(100)) %>% 
  adorn_totals() %>% 
  mutate(a = round(a,0)) %>% 
  adorn_cumulative(a) %>% 
  tail()

It appears you have a 'totals' column. Watch for rounding discrepancies.
  test  a cumulative
  test  0         NA
  test  1         48
  test  0         NA
  test  1         49
  test  1         50
 Total 50         NA

This is obviously more of an issue with large datasets with rounded values.

Anyway, hopefully this is useful and helps to implement adorn_cumulative(). I work with many analysts who are moving from SPSS to R, and this would really help to emulate a SPSS-style FREQUENCIES table.

Edit: Possibly add an ignore_row arg?

set.seed(1)

data.frame(test = "test",
           a = runif(100)) %>% 
  adorn_totals() %>% 
  mutate(a = round(a,0)) %>% 
  arrange(desc(a)) %>% 
  adorn_cumulative(a, ignore_row = "first") %>% 
  head()

 test  a cumulative
 Total 56         NA
  test  1          1
  test  1          2
  test  1          3
  test  1          4
  test  1          5
adorn_cumulative <- function(dat, colname, dir = "down", ignore_row = NULL){
 
 if(!missing(colname)){
   colname <- rlang::enquo(colname)
 } else if("valid_percent" %in% names(dat)) {
   colname <- rlang::sym("valid_percent")
 } else if("percent" %in% names(dat)){
   colname <- rlang::sym("percent")
 } else {
   stop("\"colname\" not specified and default columns valid_percent and percent are not present in data.frame dat")
 }
 
 if (!dir %in% c("up", "down")) {
   stop("'dir' must be one of 'up', 'down'")
 }
 
 check <- rlang::quo_name(colname)
 
 if(!inherits(dat[[check]], c("numeric"))) {
   stop("column must be of class numeric.")
 }
 
 if (is.null(ignore_row)){
   target <- dplyr::pull(dat, !! colname)
 } else  if(ignore_row == "last"){
   target <- dplyr::pull(dat, !! colname)
   target[length(target)] <- NA
 }else if (ignore_row == "first"){
   target <- dplyr::pull(dat, !! colname)
   target[1] <- NA
 }
 

 sum_is_1 <- sum(target,na.rm = T) - max(target,na.rm = T) == 1  
 
 cumsum_1 <- max(cumsum(target),na.rm = T) - max(target,na.rm = T) == 1 
 
 cumsum_2 <- max(cumsum(target),na.rm = T) - max(target,na.rm = T) == 2 
 
 cumsum_n <- sum(target, na.rm = T) / max(target, na.rm = T) == 2
 
 if ((dir == "up" & sum_is_1 == T) | 
     (dir == "up" & cumsum_1 == T) | 
     (dir== "up" & cumsum_2 == T)  |
     (dir == "up" & cumsum_n == T)) {
   target <- replace(target, target == max(target,na.rm=T), NA)
   message("It appears you have a 'totals' column. Watch for rounding discrepancies.")
   
 } else if (sum_is_1 == T | cumsum_1 == T | cumsum_2 == T | cumsum_n == T) {
   target <- replace(target, target == max(target,na.rm=T), 0)
   message("It appears you have a 'totals' column. Watch for rounding discrepancies.")
   
 } else if ((sum(target,na.rm = T) - max(target,na.rm = T) > 1.9 & sum(target,na.rm = T) - max(target,na.rm = T) < 2.1 ) |
            (max(cumsum(target),na.rm = T) - max(target,na.rm = T) > 1.9 & max(cumsum(target),na.rm = T) - max(target,na.rm = T) < 2.1) |
            (max(cumsum(target),na.rm = T) - max(target,na.rm = T) > 1.9 & max(cumsum(target),na.rm = T) - max(target,na.rm = T) < 2.1) |
            (sum(target, na.rm = T) / max(target, na.rm = T) > 1.9 & sum(target, na.rm = T) / max(target, na.rm = T) < 2.1)){
   warning("It appears you have a rounded 'totals' column. Watch for rounding discrepancies.")
 }
 
 
 
 if(dir == "up"){
   target <- rev(target)
 }
 
 dat$cumulative <- cumsum(ifelse(is.na(target), 0, target)) + target*0 # an na.rm version of cumsum, from https://stackoverflow.com/a/25576972
 dat$cumulative[duplicated(dat$cumulative)] <- NA
 
 if(dir == "up"){
   dat$cumulative <- rev(dat$cumulative)
   names(dat)[names(dat) %in% "cumulative"] <- "cumulative_up"
   
   # Creates NA for repeated values
   # Assumption that only repeated cumulative value would be a total
   dat$cumulative_up[duplicated(dat$cumulative_up)] <- NA  
 }
 dat
}

Any suggestions on how one could amend the above code for cumulative percentage?

@sfirke sfirke removed this from the v2.2 milestone Jan 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pull-request-welcome seeking comments Users and any interested parties should please weigh in - this is in a discussion phase!
Projects
None yet
Development

No branches or pull requests

4 participants