Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why aren't openxlsx2 formulas evaluated? #863

Open
JanMarvin opened this issue Nov 29, 2023 · 0 comments
Open

Why aren't openxlsx2 formulas evaluated? #863

JanMarvin opened this issue Nov 29, 2023 · 0 comments
Labels
documentation ✍️ Improvements or additions to documentation wontfix 🚫 This will not be worked on

Comments

@JanMarvin
Copy link
Owner

JanMarvin commented Nov 29, 2023

I have written about this issue already in various places of the openxlsx issue tracker and probably in other places and this issue is just to raise awareness and not something to solve or something I'm going to work on.

The reason why formulas are not evaluated is, because there is nothing to evaluate them.

  • "But Excel shows me number if I open a file with formulas?" "Yes."
  • "And for some formulas wb_to_df() also shows values?" "Yes again."

Why is this the case?

The XML file structure of a formula written by us looks like this:

<c ...>
   <f>SUM(A1:B1)</f>
</c>

Here we have a cell with a simple formula: SUM(A1:B1). If A1 is 1 and B1 is 1 the formula will evaluate 2. In your spreadsheet software you can see the value 2. If you save the file to disk and load it again, the XML structure of the cell has changed. Now it looks like this:

<c ...>
   <f>SUM(A1:B1)</f>
   <v>2</v>
</c>

It contains a value field now and this value is 2. Now if we print from openxlsx2 we also get a value 2. But only because this value is now written into the cell. It is somehow cached into the cell. And this might actually become an issue. Now image if you replace the cell A1 with a value of -1. Now your cell still prints 2 with wb_to_df(), but is not evaluated, because it should actually show a value of 0.

But why is the value not updated?

That is because, something - like your spreadsheet software - needs some way to evalute the formula in the cell with actual data in the spreadsheet. And this is somewhat simple with basic formulas A1 + A2 and SUM(A1:A2), but it will get tricky if you ever try to implement something like nested VLOOKUP() / CHOOSE() functions, functions where R and and your spreadsheet software produce different results and last but not least, anybody has to program all/many/most/at least a few of the functions and their quirks into something we can evaluate in R. And I do not want to do this and I do not want to be the one who does this and if you want me to be the one, lets talk about my salary ..., but good news!

You could be the one!

It is not really hard to begin with. First, most likely you do not want EVERY function (this could take a while, because there are a lot), but instead focus on only a few. Second, begin with something easy and not some overly complex 300 lines of nested INDEX() functions and maybe begin with something like this. The idea is get the sheet, reference and operators from the formula and evaluate everything in some R formula just like base R functions.

Draft code

Here I use spreadsheet formulas using + and SUM() and my code matches them with Rs + and sum(). Similar I could use MIN()/MAX() and other functions that have identical names. For AVERAGE() I might need something like average <- function(...) mean(...) and other more complex functions might require actual custom written functions:

library(openxlsx2)

dat <- data.frame(
  num1 = 1,
  num2 = 2,
  fml1 = "A2 + B2",
  fml2 = "SUM(A2, B2)",
  fml3 = "'Sheet 1'!A2 + B2",
  fml4 = "FST + B2"
)
class(dat$fml1) <- "formula"
class(dat$fml2) <- "formula"
class(dat$fml3) <- "formula"
class(dat$fml4) <- "formula"

wb <- wb_workbook()$add_worksheet()$add_data(x = dat)
wb$add_named_region(dims = "A2", name = "FST")
# wb$open()

dat <- data.frame(
  x = 0:10,
  y = -5:5
)
dat_fml <- data.frame(
  x = "SUM(A2:A12)",
  y = "SUM(B2:B12)"
)
class(dat_fml$x) <- "formula"
class(dat_fml$y) <- "formula"

wb$add_worksheet()$add_data(x = dat)$add_data(x = dat_fml, dims = "A14")


wb_eval_excel_fml <- function(wb, sheet, dims) {
  # example function that works with a tiny subset of formulas
  # if you use this function in production you are braver or
  # desparater than you look.
  
  wb <- wb$clone()
  
  openxlsx2:::assert_workbook(wb)
  sheetid <- wb$validate_sheet(sheet)
  
  fmls <- as.character(wb$to_df(dims = dims, sheet = sheetid, show_formula = TRUE, col_names = FALSE))
  
  message("Input formula is: ", fmls)
  tkns <- tidyxl::xlex(fmls)
  
  sel <- tkns$type == "ref"
  vars <- tkns$token[sel]
  rnd <- openxlsx2:::random_string(n = length(vars), pattern = "[a-z]")
  
  sheets <- wb$get_sheet_names()[sheetid]
  sheets <- rep(sheets, length(vars))
  
  if (any(sel2 <- tkns$type == "sheet")) {
    shts <- tkns$token[sel2]
    sel3 <- which(tkns$type == "ref") %in% ( which(tkns$type == "sheet") + 1)
    sheets[sel3] <- stringi::stri_extract_first_regex(shts, "([^']+)")
    
    # remove this from our formula
    tkns <- as.data.frame(tkns)
    tkns <- tkns[!sel2, ]
    
    sel <- tkns$type == "ref"
  }
  
  fml_env <- new.env()
  for (i in seq_along(vars)) {
    num <- wb_to_df(wb, dims = vars[i], sheet = sheets[i], col_names = FALSE)
    assign(rnd[i], num, fml_env)
    tkns$token[sel][i] <- rnd[i]
    message("Var ", vars[i], " is: ", num)
  }
  
  lwr_fmls <- parse(text = tolower(paste0(tkns$token, collapse = "")))
  res <- as.numeric(eval(lwr_fmls, envir = fml_env))
  message("Result: ", res)
  
  sel <- wb$worksheets[[sheetid]]$sheet_data$cc$r == dims
  wb$worksheets[[sheetid]]$sheet_data$cc$v[sel] <- res
 
  wb
}

# wb$open()

wb_to_df(wb)
#>   num1 num2 fml1 fml2 fml3 fml4
#> 2    1    2   NA   NA   NA   NA

wb <- wb_eval_excel_fml(wb, sheet = 1, dims = "C2")
#> Input formula is: A2 + B2
#> Var A2 is: 1
#> Var B2 is: 2
#> Result: 3

wb <- wb_eval_excel_fml(wb, sheet = 1, dims = "D2")
#> Input formula is: SUM(A2, B2)
#> Var A2 is: 1
#> Var B2 is: 2
#> Result: 3
wb <- wb_eval_excel_fml(wb, sheet = 1, dims = "E2")
#> Input formula is: 'Sheet 1'!A2 + B2
#> Var A2 is: 1
#> Var B2 is: 2
#> Result: 3

# # one of many cases that does not work
# wb <- wb_eval_excel_fml(wb, sheet = 1, dims = "F2")

wb <- wb_eval_excel_fml(wb, sheet = 2, dims = "A15")
#> Input formula is: SUM(A2:A12)
#> Var A2:A12 is: c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
#> Result: 55
wb <- wb_eval_excel_fml(wb, sheet = 2, dims = "B15")
#> Input formula is: SUM(B2:B12)
#> Var B2:B12 is: c(-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5)
#> Result: 0

wb_to_df(wb, show_formula = TRUE)
#>   num1 num2    fml1        fml2              fml3     fml4
#> 2    1    2 A2 + B2 SUM(A2, B2) 'Sheet 1'!A2 + B2 FST + B2
wb_to_df(wb)
#>   num1 num2 fml1 fml2 fml3 fml4
#> 2    1    2    3    3    3   NA


wb_to_df(wb, sheet = 2, show_formula = TRUE) |> tail()
#>              x           y
#> 10           8           3
#> 11           9           4
#> 12          10           5
#> 13        <NA>        <NA>
#> 14           x           y
#> 15 SUM(A2:A12) SUM(B2:B12)
wb_to_df(wb, sheet = 2) |> tail()
#>       x    y
#> 10    8    3
#> 11    9    4
#> 12   10    5
#> 13 <NA> <NA>
#> 14    x    y
#> 15   55    0
@JanMarvin JanMarvin added documentation ✍️ Improvements or additions to documentation wontfix 🚫 This will not be worked on labels Nov 29, 2023
@JanMarvin JanMarvin pinned this issue Jan 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation ✍️ Improvements or additions to documentation wontfix 🚫 This will not be worked on
Projects
None yet
Development

No branches or pull requests

1 participant