Suggestion to note changes to pdf_text() processing in poppler version 20.12.1 #92

Open
hrbrmstr opened this issue Apr 16, 2021 · 2 comments


I've been doing some back-and-forth testing between R 4.0.x and R 4.1.0 on macOS (both chipsets) of pretty much every package I use, and so far most things work perfectly.

The 4.1.0 CRAN macOS binary for {pdftools} reports "Using poppler version 20.12.1", whereas the 4.0.x CRAN macOS binary for {pdftools} reports "Using poppler version 0.73.0". Both are versioned pdftools_2.3.1.
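
For anyone who wants a script to notice which side of this difference it is running on, here is a minimal sketch that compares a poppler version string against the older binary's version. `is_newer_poppler()` is a hypothetical helper, and the baseline is just the older version reported above; it does not pinpoint the poppler release that actually changed the output. In a live session the version string should be available from `pdftools::poppler_config()$version`.

```r
# Versions reported in this issue for the two CRAN macOS binaries.
old_poppler <- numeric_version("0.73.0")
new_poppler <- numeric_version("20.12.1")

# Hypothetical helper: TRUE when `version` is newer than the baseline.
# The baseline is an assumption (the older binary's poppler), not the
# exact release that introduced the extra newlines.
is_newer_poppler <- function(version, baseline = old_poppler) {
  numeric_version(version) > baseline
}

# In a live session the string would come from pdftools, e.g.
#   is_newer_poppler(pdftools::poppler_config()$version)
is_newer_poppler("20.12.1")
is_newer_poppler("0.73.0")
```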

R 4.0.4 `sessionInfo()`
R version 4.0.4 (2021-02-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] stringi_1.4.6 pdftools_2.3.1

loaded via a namespace (and not attached):
[1] compiler_4.0.4 magrittr_1.5 ellipsis_0.3.1 tools_4.0.4
[5] pillar_1.4.6 tibble_3.0.3 crayon_1.3.4 Rcpp_1.0.5
[9] vctrs_0.3.4 qpdf_1.1 lifecycle_0.2.0 pkgconfig_2.0.3
[13] rlang_0.4.7 askpass_1.1

R 4.1.0 `sessionInfo()`
R Under development (unstable) (2021-03-29 r80130)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] stringi_1.5.3 pdftools_2.3.1

loaded via a namespace (and not attached):
[1] compiler_4.1.0 tools_4.1.0 Rcpp_1.0.6 qpdf_1.1 askpass_1.1

Example code run in both sessions:

tf <- tempfile(fileext = ".pdf")
download.file("https://rud.is/dl/unit42-ransomware-threat-report-2021.pdf", tf)

library(stringi)
library(pdftools)

l <- pdf_text(tf)

stri_split_lines(l[[7]])[[1]]

# see output in the two details blocks below

unlink(tf)
R 4.0.4 example output
stri_split_lines(l[[7]])[[1]]
 [1] "       100+   20–40     10–20     1–10"
 [2] "                 Number of victim organizations with data published on leak sites by country"
 [3] "United States            151    Belgium             4    Chile               1     Pakistan               1"
 [4] "Canada                   39     Sweden              4    Colombia            1     Peru                   1"
 [5] "Germany                  26     South Africa        3    Croatia             1     Poland                 1"
 [6] "United Kingdom           17     Spain               3    Greece              1     Portugal               1"
 [7] "France                   16     Japan               2    Hong Kong           1     Saudi Arabia           1"
 [8] "India                    11     Mexico              2    Jamaica             1     Singapore              1"
 [9] "Australia                7      New Zealand         2    Kenya               1     Sri Lanka              1"
[10] "Brazil                   5      South Korea         2    Luxembourg          1     Taiwan                 1"
[11] "Israel                   5      Switzerland         2    Malaysia            1     Thailand               1"
[12] "Italy                    5      Austria             1    Norway              1     United Arab Emirates   1"
[13] "                      Figure 3: Numbers of victim organizations with data"
[14] "                    published on leak sites by country, Jan. 2020 – Jan. 2021"
[15] "                    Pa l o A l to N et wo r ks | U n i t 4 2 | R a n s o mwa re T h re at R e p o r t, 2 02 1 7"
[16] ""
R 4.1.0 example output
stri_split_lines(l[[7]])[[1]]
 [1] "         100+   20–40       10–20     1–10"
 [2] ""
 [3] ""
 [4] ""
 [5] "                  Number of victim organizations with data published on leak sites by country"
 [6] ""
 [7] "United States               151     Belgium            4    Chile              1    Pakistan                 1"
 [8] ""
 [9] "Canada                      39      Sweden             4    Colombia           1    Peru                     1"
[10] ""
[11] "Germany                     26      South Africa       3    Croatia            1    Poland                   1"
[12] ""
[13] "United Kingdom              17      Spain              3    Greece             1    Portugal                 1"
[14] ""
[15] "France                      16      Japan              2    Hong Kong          1    Saudi Arabia             1"
[16] ""
[17] "India                       11      Mexico             2    Jamaica            1    Singapore                1"
[18] ""
[19] "Australia                   7       New Zealand        2    Kenya              1    Sri Lanka                1"
[20] ""
[21] "Brazil                      5       South Korea        2    Luxembourg         1    Taiwan                   1"
[22] ""
[23] "Israel                      5       Switzerland        2    Malaysia           1    Thailand                 1"
[24] ""
[25] "Italy                       5       Austria            1    Norway             1    United Arab Emirates     1"
[26] ""
[27] ""
[28] ""
[29] "                          Figure 3: Numbers of victim organizations with data"
[30] "                        published on leak sites by country, Jan. 2020 – Jan. 2021"
[31] ""
[32] ""
[33] ""
[34] ""
[35] "                        Pa l o A l to N et wo r ks | U n i t 4 2 | R a n s o mwa re T h re at R e p o r t, 2 02 1   7"
[36] ""
[37] ""

This is very likely a behavior change in the underlying poppler library, but it is definitely going to break at least some automation folks might have set up, so I'm posting the issue as more of a "heads up" and "may want to note this when 4.1.0 is live". I didn't see anything specific to these additional newlines in any of the poppler changelogs.
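
For automation that only needs the non-empty lines, one possible stopgap is to normalize each page before parsing it. This is a sketch in base R, not something {pdftools} provides: `page_lines()` and `squish()` are hypothetical helpers, and the two sample strings are shortened from the outputs above.

```r
# Hypothetical helper: split one page of pdf_text() output into lines and
# drop lines that are empty or contain only whitespace.
page_lines <- function(page) {
  lines <- strsplit(page, "\r?\n")[[1]]
  lines[nzchar(trimws(lines))]
}

# Hypothetical helper: trim each line and collapse runs of whitespace, so
# differences in column padding do not affect comparisons.
squish <- function(x) gsub("\\s+", " ", trimws(x))

# Shortened samples of the two layouts shown in the outputs above.
old_page <- paste0(
  "Canada                   39     Sweden              4\n",
  "Germany                  26     South Africa        3\n"
)
new_page <- paste0(
  "Canada                      39      Sweden             4\n\n",
  "Germany                     26      South Africa       3\n\n"
)

# After normalization both layouts yield the same character vector.
identical(squish(page_lines(old_page)), squish(page_lines(new_page)))
```

Note that this also drops blank lines that were present with the older poppler, and collapsing whitespace discards the column alignment, so it only suits code that does not rely on either.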

One thing you'll note if you run the example code is the generation of (IIRC) 19 `PDF error: Invalid Font Weight` messages, but I don't think that's causing this issue.

Member

jeroen commented May 3, 2021

Hmm, I am also seeing output changes on Windows with recent versions of poppler. This is very annoying :/

Member

jeroen commented May 4, 2021

I have bisected the issue and reported upstream: https://gitlab.freedesktop.org/poppler/poppler/-/issues/1076
