Parse number cannot recognize number #1507

Open
kjayhan opened this issue Aug 13, 2023 · 1 comment


kjayhan commented Aug 13, 2023

I extracted some data from a Chinese PDF file.

The numbers in the columns are extracted as follows (for example): －１２２, ２９４５８, ９.

I copy-pasted these from the output of some cells. They look like numbers, but the characters are not the same as the ASCII -122, 29458, 9, respectively.

Hence parse_number() produces NA in all of these cases.

Any suggestions regarding what I should do?

This is the pdf file in question: http://images.mofcom.gov.cn/fec/202211/20221118091910924.pdf

I extracted the data from page 49 (53rd page of the pdf file), using the following code:

library(tidyverse)
library(pdftools)

file <- tempfile()

url <- "http://images.mofcom.gov.cn/fec/202211/20221118091910924.pdf"

# mode = "wb" keeps the binary PDF intact on Windows
download.file(url, file, mode = "wb", headers = c("User-Agent" = "My Custom User Agent"))



pdf_data <- pdf_text(file)

replace_spaces_and_commas <- function(x) {
  str_replace_all(x, "[ ,]", "")
}


pdf <- pdf_data[53:71]

tab_pdf <- str_split(pdf, "\n")

for (i in 1:19) {
  assign(paste0("tab_pdf_", i), tab_pdf[[i]])
}

the_names <- c("country", "year_2013", "year_2014", "year_2015", "year_2016", "year_2017", "year_2018", "year_2019", "year_2020", "year_2021")

view(tab_pdf_1)

pdf_clean1 <- tab_pdf_1[14:60] %>%
  str_trim %>%
  str_replace_all(",", "") %>%
  str_split("\\s{2,}", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  setNames(the_names) %>% mutate_all(.funs = replace_spaces_and_commas) %>% filter(country != "") 

I tried both as.numeric(pdf_clean1$year_2013) and parse_number(pdf_clean1$year_2013).

Both produced NAs, because "９" == "9", "－１２２" == "-122", and "２９４５８" == "29458" all evaluate to FALSE.
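
One way to confirm what is going on is to look at the code points behind the strings. The sketch below uses hypothetical sample values (written with \u escapes so they survive copy-paste) rather than the actual cells of pdf_clean1, and assumes the extracted text uses fullwidth forms:

```r
# Hypothetical sample values standing in for extracted cells:
# "\uff19" is a fullwidth nine, "\uff0d\uff11\uff12\uff12" is a fullwidth "-122"
fullwidth_nine <- "\uff19"
fullwidth_neg  <- "\uff0d\uff11\uff12\uff12"

# The strings look like digits but are different code points
fullwidth_nine == "9"        # FALSE
utf8ToInt(fullwidth_nine)    # 65305 (U+FF19, FULLWIDTH DIGIT NINE)
utf8ToInt("9")               # 57    (U+0039, DIGIT NINE)

# Inspect every code point in a cell
code_points <- sprintf("U+%04X", utf8ToInt(fullwidth_neg))
code_points
# "U+FF0D" "U+FF11" "U+FF12" "U+FF12"
```

Code points in the U+FF10 to U+FF19 range are fullwidth digits, which as.numeric() and parse_number() do not treat as numbers.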

sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4.1

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0

attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base

other attached packages:
[1] countrycode_1.5.0 magrittr_2.0.3 pdftools_3.3.3
[4] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
[7] dplyr_1.1.2 purrr_1.0.1 readr_2.1.4
[10] tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.2
[13] tidyverse_2.0.0

loaded via a namespace (and not attached):
[1] gtable_0.3.3 compiler_4.3.1 qpdf_1.3.2
[4] tidyselect_1.2.0 Rcpp_1.0.11 scales_1.2.1
[7] R6_2.5.1 generics_0.1.3 knitr_1.42
[10] munsell_0.5.0 pillar_1.9.0 tzdb_0.4.0
[13] rlang_1.1.1 utf8_1.2.3 stringi_1.7.12
[16] xfun_0.39 timechange_0.2.0 cli_3.6.1
[19] withr_2.5.0 grid_4.3.1 rstudioapi_0.15.0
[22] hms_1.1.3 askpass_1.1 lifecycle_1.0.3
[25] vctrs_0.6.3 glue_1.6.2 fansi_1.0.4
[28] colorspace_2.1-0 tools_4.3.1 pkgconfig_2.0.3


kjayhan commented Aug 14, 2023

With the help of a Stack Overflow user and ChatGPT, I found a solution. Posting it here in case someone else has the same problem:

convert_fullwidth_to_numeric <- function(input_str) {
  # Code points of the input (expects a single string, not a vector)
  utf8_codes <- utf8ToInt(input_str)
  
  # Handle fullwidth minus sign (－, U+FF0D = 65293) separately: map it to ASCII "-" (45)
  utf8_codes <- ifelse(utf8_codes == 65293, 45, utf8_codes)
  
  # Shift fullwidth digits (65296 to 65305) down by 65248 to ASCII 0-9 (48 to 57)
  converted_utf8_codes <- ifelse(utf8_codes >= 65296 & utf8_codes <= 65305, utf8_codes - 65248, utf8_codes)
  converted_chars <- intToUtf8(converted_utf8_codes)
  converted_numeric <- as.numeric(converted_chars)
  return(converted_numeric)
}

# Apply the function to specified columns (columns 2 to 10)
columns_to_transform <- 2:10  # Adjust column indices as needed

# Replace each column as a whole: assigning cell by cell into a character
# column would coerce the converted numbers back to strings
for (col in columns_to_transform) {
  pdf_clean1[[col]] <- vapply(pdf_clean1[[col]], convert_fullwidth_to_numeric,
                              numeric(1), USE.NAMES = FALSE)
}

https://stackoverflow.com/questions/76895064/number-as-character-cannot-be-converted-to-numeric-in-r
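
For anyone who prefers a vectorized route, a possible alternative is Unicode NFKC normalization via stringi (already loaded as a dependency of stringr in the session above). This is only a sketch: fullwidth_to_numeric is a hypothetical helper name, the sample input is made up, and NFKC also rewrites other compatibility characters, so it may change more than the digits and minus sign.

```r
library(stringi)

# Hypothetical helper: normalize fullwidth forms to ASCII, then convert
fullwidth_to_numeric <- function(x) {
  ascii <- stri_trans_nfkc(x)          # fullwidth digits and minus become ASCII
  ascii <- gsub("[ ,]", "", ascii)     # drop spaces and thousands separators
  suppressWarnings(as.numeric(ascii))  # cells that are not numbers become NA
}

# Made-up sample: fullwidth "-122", "29458", "9"
converted <- fullwidth_to_numeric(c("\uff0d\uff11\uff12\uff12",
                                    "\uff12\uff19\uff14\uff15\uff18",
                                    "\uff19"))
converted
# -122 29458 9
```

Assuming the data frame from the first comment, it could be applied with pdf_clean1[2:10] <- lapply(pdf_clean1[2:10], fullwidth_to_numeric), which also leaves the columns as numeric rather than character.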
