Can pdftools distinguish between radio and checkbox entries on a fillable form? #129

Open
ibecav opened this issue Mar 29, 2024 · 0 comments

ibecav commented Mar 29, 2024

The package has worked extremely well for processing "traditional" non-fillable forms -- thank you.

In my first attempts at using it with "fillable forms", I can't find a way to distinguish radio buttons or checkboxes that are selected from those that are not. I'm not sure whether I'm missing some nuance, making a mistake, or whether the functions simply don't support it.

An example "blank" original form is available at the URL used in the download.file() call in the reprex below. For the reprex I am focusing on a small segment of the form on page 1, which I have included as screenshots in its original state and after filling out and saving a few entries.

[Screenshot: example_filled_form_segment]
[Screenshot: original_blank_segment]

I would like to know if there is a way to distinguish the fact that "Long Term Care" is selected in the filled out form versus not selected in the original?

Thank you in advance. Below is what I hope is a reprex that will help. Since I could not find an easy, safe place to post the example filled-out form, I used dput() to put the resulting data in the reprex. Users can grab the original and save changes to their local filesystem if desired.

suppressPackageStartupMessages(library(dplyr))
library(arsenal)
## Not sure if poppler version matters?
library(pdftools)
#> Using poppler version 23.04.0
## Download and save the original form as original.pdf
download.file("https://www.cdc.gov/infectioncontrol/pdf/icar/IPC-demo-LTC-508.pdf", 
              "original.pdf")
## Let's use just the first page for the reprex
## Using pdf_data() for the convenience of having a tibble
## Same problem if I use pdf_text
original_pageone <- pdf_data("original.pdf")[[1]]

original_pageone_segment <-
  original_pageone %>% 
  filter(y >= 229, y <= 290)

# no obvious errors, but it is difficult to see the radio button
# "text" in the RStudio console

# original_pageone_segment %>% print(n = Inf)

# Fill in the form with some data.  It works and I can see
# traditional text such as "1234" and "5678" I entered on the form
# filled_pageone <- pdf_data("example_filled_form.pdf")[[1]]

# use dput to capture the resulting tibble for the reprex 
# filled_pageone %>% 
#   filter(y >= 229, y <= 290) %>% dput()

filled_pageone_segment <-
  structure(list(width = c(28L, 18L, 41L, 13L, 53L, 19L, 16L, 49L, 
                           8L, 13L, 18L, 8L, 32L, 7L, 22L, 17L, 31L, 3L, 26L, 25L, 31L, 
                           7L, 40L, 17L, 7L, 90L, 17L, 7L, 22L, 32L, 8L, 48L, 17L, 17L, 
                           28L, 8L, 8L, 48L, 17L), 
                 height = c(11L, 11L, 11L, 11L, 11L, 11L, 
                            11L, 11L, 11L, 11L, 11L, 11L, 11L, 9L, 11L, 11L, 11L, 11L, 11L, 
                            11L, 11L, 9L, 11L, 11L, 9L, 11L, 11L, 9L, 11L, 11L, 11L, 11L, 
                            7L, 11L, 11L, 11L, 11L, 11L, 7L), 
                 x = c(31L, 61L, 81L, 125L, 
                       140L, 195L, 217L, 31L, 82L, 92L, 108L, 128L, 138L, 37L, 49L, 
                       73L, 92L, 126L, 131L, 159L, 186L, 37L, 49L, 91L, 37L, 49L, 142L, 
                       37L, 49L, 73L, 395L, 406L, 459L, 275L, 294L, 325L, 335L, 346L, 
                       399L), 
                 y = c(229L, 229L, 229L, 229L, 229L, 229L, 229L, 240L, 
                       240L, 240L, 240L, 240L, 240L, 255L, 254L, 254L, 254L, 254L, 254L, 
                       254L, 254L, 267L, 266L, 266L, 278L, 278L, 278L, 290L, 290L, 290L, 
                       229L, 229L, 230L, 248L, 248L, 248L, 249L, 249L, 250L), 
                 space = c(TRUE, 
                           TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, 
                           TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, 
                           TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, 
                           TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE), 
                 text = c("Facility", 
                          "type", "(Complete", "the", "demographic", "form", "that", "corresponds", 
                          "to", "the", "type", "of", "facility):", "", "Acute", "Care", 
                          "Hospital", "/", "Critical", "Access", "Hospital", "", "Long-term", 
                          "Care", "", "Outpatient/Ambulatory", "Care", "", "Other", 
                          "(specify):", "(if", "applicable):", "1234", "CMS", "Facility", 
                          "ID", "(if", "applicable):", "5678")), 
            class = c("tbl_df", "tbl", 
                      "data.frame"), 
            row.names = c(NA, -39L))

## Use arsenal to compare tibbles in detail
summary(comparedf(original_pageone_segment, filled_pageone_segment, by = c("x", "y")))
#> 
#> 
#> Table: Summary of data.frames
#> 
#> version   arg                         ncol   nrow
#> --------  -------------------------  -----  -----
#> x         original_pageone_segment       6     37
#> y         filled_pageone_segment         6     39
#> 
#> 
#> 
#> Table: Summary of overall comparison
#> 
#> statistic                                                      value
#> ------------------------------------------------------------  ------
#> Number of by-variables                                             2
#> Number of non-by variables in common                               4
#> Number of variables compared                                       4
#> Number of variables in x but not y                                 0
#> Number of variables in y but not x                                 0
#> Number of variables compared with some values unequal              1
#> Number of variables compared with all values equal                 3
#> Number of observations in common                                  37
#> Number of observations in x but not y                              0
#> Number of observations in y but not x                              2
#> Number of observations with some compared variables unequal        2
#> Number of observations with all compared variables equal          35
#> Number of values unequal                                           2
#> 
#> 
#> 
#> Table: Variables not shared
#> 
#>                          
#>  ------------------------
#>  No variables not shared 
#>  ------------------------
#> 
#> 
#> 
#> Table: Other variables not compared
#> 
#>                                  
#>  --------------------------------
#>  No other variables not compared 
#>  --------------------------------
#> 
#> 
#> 
#> Table: Observations not shared
#> 
#> version      x     y   observation
#> --------  ----  ----  ------------
#> y          399   250            39
#> y          459   230            33
#> 
#> 
#> 
#> Table: Differences detected by variable
#> 
#> var.x    var.y      n   NAs
#> -------  -------  ---  ----
#> width    width      0     0
#> height   height     0     0
#> space    space      2     0
#> text     text       0     0
#> 
#> 
#> 
#> Table: Differences detected
#> 
#> var.x   var.y      x     y  values.x   values.y    row.x   row.y
#> ------  ------  ----  ----  ---------  ---------  ------  ------
#> space   space    346   249  FALSE      TRUE           37      38
#> space   space    406   229  FALSE      TRUE           32      32
#> 
#> 
#> 
#> Table: Non-identical attributes
#> 
#>                              
#>  ----------------------------
#>  No non-identical attributes 
#>  ----------------------------

sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Sonoma 14.2.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/New_York
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] pdftools_3.4.0 arsenal_3.6.3  dplyr_1.1.4   
#> 
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.6.5       cli_3.6.2         knitr_1.45        rlang_1.1.3      
#>  [5] xfun_0.42         purrr_1.0.2       styler_1.10.2     generics_0.1.3   
#>  [9] glue_1.7.0        askpass_1.2.0     qpdf_1.3.2        htmltools_0.5.7  
#> [13] fansi_1.0.6       rmarkdown_2.25    R.cache_0.16.0    tibble_3.2.1     
#> [17] evaluate_0.23     fastmap_1.1.1     yaml_2.3.8        lifecycle_1.0.4  
#> [21] compiler_4.3.2    fs_1.6.3          Rcpp_1.0.12       pkgconfig_2.0.3  
#> [25] rstudioapi_0.15.0 R.oo_1.26.0       R.utils_2.12.3    digest_0.6.34    
#> [29] R6_2.5.1          tidyselect_1.2.0  utf8_1.2.4        reprex_2.1.0     
#> [33] pillar_1.9.0      magrittr_2.0.3    R.methodsS3_1.8.2 tools_4.3.2      
#> [37] withr_3.0.0

Created on 2024-03-29 with reprex v2.1.0
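
To make the symptom concrete, here is a minimal sketch of what the comparison above boils down to. It uses a four-row subset of the filled segment; the `original` data frame is an assumed stand-in for the blank form (the same rows minus the typed entry), and the names `filled`, `original`, `only_in_filled`, and `same_glyphs` are illustrative only. It relies on nothing beyond dplyr.

```r
library(dplyr)

# Illustrative subset of the filled segment from the dput() output above.
# The checkbox/radio glyph is the row with text == "".
filled <- tibble(
  x    = c(37L, 49L, 91L, 459L),
  y    = c(267L, 266L, 266L, 230L),
  text = c("", "Long-term", "Care", "1234")
)

# ASSUMPTION: stand-in for the blank form, i.e. the same rows
# without the typed-in value.
original <- filter(filled, text != "1234")

# Rows that exist only in the filled form, matched on position.
only_in_filled <- anti_join(filled, original, by = c("x", "y"))
print(only_in_filled)

# The glyph rows for the checkbox are the same in both versions,
# so the selection state is not visible in this output.
same_glyphs <- isTRUE(all.equal(
  filter(filled, text == ""),
  filter(original, text == "")
))
print(same_glyphs)
```

In this sketch only the typed text shows up as a difference, which matches what comparedf() reports above: the two extra observations are "1234" and "5678", and nothing about the "Long-term Care" selection changes.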
