
Performance geojson_json(sp) - transformation loop sp -> JSON -> R -> JSON #85

Closed
javrucebo opened this issue Jun 3, 2016 · 23 comments

@javrucebo

I am wondering why in geojson_json(sp) we have following transformations:

  • Transform sp object to geojson on disk through geojson_write
  • Transform to R object through jsonlite::fromJSON
  • Convert back to JSON through to_json
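The round trip described above can be sketched roughly like this (a minimal illustration using geojsonio and jsonlite function names from this thread; `sp_roundtrip` is a hypothetical helper, not part of the package):

```r
library(geojsonio)
library(jsonlite)

sp_roundtrip <- function(sp_obj) {
  f <- tempfile(fileext = ".geojson")
  geojson_write(sp_obj, file = f)              # 1. sp object -> GeoJSON on disk
  r <- fromJSON(f, simplifyDataFrame = FALSE,  # 2. GeoJSON -> R list
                simplifyMatrix = FALSE)
  toJSON(r, auto_unbox = TRUE)                 # 3. R list -> JSON string again
}
```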

I do not see any difference between the resulting JSON from step 1 and step 3 apart from the number of digits, though that might differ for more complex Spatial objects. For every map I tried from GADM, I observed no difference when using the JSON from geojson_write directly.

The current behaviour adds considerable overhead; see the benchmarking I described in ateucher/rmapshaper#28 (comment) (though the geojson transformations are not isolated there).

@sckott

sckott commented Jun 3, 2016

There are two issues:

  • to convert to geojson we use rgdal which right now has to write to disk, then we read back in, so that introduces some slowness - we are working on fixing this though, see Avoid round trip to disk in geojson_rw? #77
  • right, for geojson_json we don't need to do fromJSON then toJSON; we should probably just read the lines in and paste them together, possibly validating that it's proper JSON too
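The second point could look something like this rough sketch (a hypothetical helper, not the committed implementation): read the GeoJSON file written to disk straight back as a string and optionally validate it, skipping the fromJSON/toJSON round trip entirely:

```r
read_geojson_string <- function(path) {
  # Read the file back verbatim instead of parsing it into R objects
  json <- paste0(readLines(path), collapse = "\n")
  # Optionally confirm it is well-formed JSON before returning it
  stopifnot(jsonlite::validate(json))
  structure(json, class = "json")
}
```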

sckott added a commit that referenced this issue Jun 3, 2016
printing of json is now more cumbersome, but could rework which print statement is used to minify on print
@sckott

sckott commented Jun 3, 2016

can try geojson_json again after installing devtools::install_github("ropensci/geojsonio@json-io") - sped up the function

#77 will still take some time though

@sckott

sckott commented Jun 3, 2016

also, @ateucher - might want to try new geojson_json on that branch

@ateucher

ateucher commented Jun 3, 2016

Damn you work fast @sckott. I'll have a look for sure!

@javrucebo

@sckott: Great - while I was still benchmarking, you provided a ready-to-use fix.

Here is some high-level benchmarking of your version for a relatively small geometry:

input <- raster::getData('GADM', country='FRA', level=1)
devtools::install_github("ropensci/geojsonio")
system.time(geojsonio::geojson_json(input))
#   user  system elapsed 
# 34.161   0.034  34.263 
devtools::install_github("ropensci/geojsonio@json-io")
system.time(geojsonio::geojson_json(input))
#   user  system elapsed 
#  4.643   0.204   4.876 

Back to the current version of geojsonio: to illustrate where the time goes, I benchmarked the individual main steps, which shows that the bulk of the total time is spent in fromJSON/toJSON:

geojson_benchmark <- function(input) {
  dir <- tempfile()
  file <- tempfile()
  t1 <- system.time(rgdal::writeOGR(obj = input, dsn = dir, layer = "", driver = "GeoJSON"))
  t2a1 <- system.time(file.copy(dir, file, overwrite = TRUE))
  t2a2 <- system.time(js <- jsonlite::fromJSON(file, simplifyDataFrame = FALSE, simplifyMatrix = FALSE))
  t2a3 <- system.time(gj <- jsonlite::toJSON(js, digits = 22, auto_unbox = TRUE))
  t2 <- system.time(gj <- readChar(file, file.info(file)$size))
  list(writeOGR = t1[["elapsed"]],
       file.copy = t2a1[["elapsed"]],
       fromJSON = t2a2[["elapsed"]],
       toJSON = t2a3[["elapsed"]],
       readChar = t2[["elapsed"]])
}
geo_bench <- function(country,level=1,rep=5) {
  input <- raster::getData('GADM', country=country, level=level)
  replicate(rep,geojson_benchmark(input))
}

> geo_bench('BEL', 1)
#                     [,1]  [,2]  [,3]  [,4]  [,5] 
# writeOGR            0.093 0.108 0.103 0.111 0.106
# file.copy           0.001 0.001 0.001 0.001 0.001
# fromJSON            0.137 0.13  0.126 0.127 0.129
# toJSON              0.941 0.943 0.947 0.936 0.943
# readChar            0.026 0.014 0.027 0.014 0.027
> geo_bench('SWE', 1)
#                     [,1]   [,2]   [,3]   [,4]   [,5]  
# writeOGR            4.473  4.356  4.352  3.981  4.204 
# file.copy           0.037  0.036  0.038  0.034  0.035 
# fromJSON            7.924  8.314  7.861  7.742  7.86  
# toJSON              46.848 48.026 47.043 48.143 46.744
# readChar            1.322  1.249  1.175  1.3    1.28  
> geo_bench('CAN', 3)
#                     [,1]    [,2]    [,3]    [,4]    [,5]   
# writeOGR            66.758  78.971  80.513  84.47   99.44  
# file.copy           0.562   1.032   1.045   1.676   1.671  
# fromJSON            125.556 128.876 134.611 137.035 135.313
# toJSON              832.98  826.846 838.868 832.268 826.4  
# readChar            28.233  17.654  32.044  22.903  23.725 

Finally, benchmarking your readbylines function (which uses readLines) against readChar suggests the latter is a bit faster. Both produce the same output except for an additional trailing '\n' in the readLines version.

input <- raster::getData('GADM', country='CAN', level=1)
dsn <- tempfile()
rgdal::writeOGR(obj=input, dsn=dsn, layer="",driver="GeoJSON")

readbyline <- function(x, minify = TRUE, ...) {
  paste0(readLines(x, ...), collapse = "\n")
}

readbychar <- function(x) {
  readChar(x, file.info(x)$size)
}

microbenchmark::microbenchmark(byline=readbyline(dsn),
                               bychar=readbychar(dsn),
                               times=3)
# Unit: seconds
#    expr       min        lq      mean    median        uq      max neval cld
#  byline 13.640235 13.750737 13.807696 13.861238 13.891426 13.92161     3   b
#  bychar  7.474303  7.523513  7.638348  7.572722  7.720371  7.86802     3  a 

@sckott

sckott commented Jun 4, 2016

Thanks for the detailed benchmarks @javrucebo - I'll try implementing readChar instead of readLines

@ateucher

ateucher commented Jun 6, 2016

I see a similar ~10x speedup on a large SpatialPolygonsDataFrame (19 MB, 48 features) in the json-io branch: ~12 seconds vs ~130 seconds with geojson_json. There is a small difference between readLines and readChar, but I also tested readr::read_file and the difference was significant. Is it worth adding another dependency, @sckott?

input <- raster::getData('GADM', country='CAN', level=1)
dsn <- tempfile()
rgdal::writeOGR(obj=input, dsn=dsn, layer="",driver="GeoJSON")

readbyline <- function(x, minify = TRUE, ...) {
  paste0(readLines(x, ...), collapse = "\n")
}

readbychar <- function(x) {
  readChar(x, file.info(x)$size)
}

microbenchmark::microbenchmark(byline=readbyline(dsn),
                               bychar=readbychar(dsn),
                               readr::read_file(dsn),
                               times=3)
## Unit: seconds
##                   expr       min        lq     mean    median        uq
##                 byline 16.549878 16.658638 16.99846 16.767398 17.222752
##                 bychar 11.228138 11.406698 12.11129 11.585258 12.552862
##  readr::read_file(dsn)  1.529959  2.178388  2.83577  2.826818  3.488675
##        max neval cld
##  17.678107     3   c
##  13.520465     3  b 
##   4.150532     3 a

@ateucher

ateucher commented Jun 6, 2016

And tested in context:

library(geojsonio)
library(raster)
library(microbenchmark)

geojson_rw_test <- function(input, fun){
  tmp <- tempfile(fileext = ".geojson")
  suppressMessages(geojson_write(input, file = tmp))
  fun(tmp)
}

readbyline <- function(x) {
  paste0(readLines(x), collapse = "\n")
}

readbychar <- function(x) {
    readChar(x, file.info(x)$size)
}

input <- raster::getData('GADM', country='CAN', level=1)

microbenchmark(geojson_rw_test(input, fun = readbyline), 
               geojson_rw_test(input, fun = readbychar), 
               geojson_rw_test(input, fun = readr::read_file), 
               times = 3)
## Unit: seconds
##                                            expr      min       lq     mean
##        geojson_rw_test(input, fun = readbyline) 38.76973 39.98385 42.90223
##        geojson_rw_test(input, fun = readbychar) 34.11568 35.22848 36.35212
##  geojson_rw_test(input, fun = readr::read_file) 25.69207 26.11113 26.53101
##    median       uq      max neval cld
##  41.19797 44.96848 48.73900     3   b
##  36.34128 37.47035 38.59941     3   b
##  26.53019 26.95048 27.37077     3  a

@sckott

sckott commented Jun 6, 2016

@ateucher thanks for the benchmarks.

Is it worth adding another dependency

Possibly. We already have some heavy deps, but if it is that much faster, perhaps we should think about it.

sckott added a commit that referenced this issue Jun 6, 2016
@sckott

sckott commented Jun 6, 2016

@javrucebo @ateucher json-io branch now uses readr::read_file instead of readLines or readChar

let me know if you have any further thoughts

@ateucher

ateucher commented Jun 6, 2016

A thought - though possibly too convoluted. Could include readr in Suggests:, and have a function:

read_geojson_file <- function(x) {
  if (requireNamespace("readr", quietly = TRUE)) {
    return(readr::read_file(x))
  } else {
    return(readChar(x, nchars = file.info(x)$size))
  }
}

Then just include a note in the documentation that this function will be faster if readr is installed... no extra overhead for the user unless they need it.

@sckott

sckott commented Jun 6, 2016

@javrucebo does Andy's suggestion above sound good to you?

@javrucebo

Andy's suggestion will certainly work; nevertheless, I would favour a hard dependency. Many (if not most) R users I know personally would not read the "Suggests" section and then manually install an additional package for a performance gain unknown to them. But having dependencies installed automatically is something they are used to, and readr is a much lighter package than some of the others geojsonio already depends on.

But this is personal taste combined with knowing a not so experienced user base. Personally as I have the readr package installed I will benefit from both options ;-)

@ateucher

ateucher commented Jun 6, 2016

I agree with @javrucebo - it's probably worth having readr as a dependency... I just thought I'd throw it out there, but it's probably unnecessarily complicated.

@javrucebo

javrucebo commented Jun 6, 2016

When you use readr::read_file you should change

readr::read_file(x)

to

readr::read_file(x, locale = readr::locale())

otherwise read_file will use whatever locale is set in the readr.default_locale option.

As your geojson files will end up as UTF-8 (or ASCII), if a user sets the above option to an incompatible encoding you end up with garbled characters:

input <- raster::getData('GADM', country='AND', level=0)

# desired behaviour
js <- geojsonio::geojson_json(input)
stringi::stri_match_first_regex(js, '"NAME_ARABIC": "[^"]+"')
#      [,1]                          
# [1,] "\"NAME_ARABIC\": \"أندورا \""

# non desired behaviour
options("readr.default_locale"=readr::locale(encoding="ISO-8859-2"))
js <- geojsonio::geojson_json(input)
stringi::stri_match_first_regex(js, '"NAME_ARABIC": "[^"]+"')
#     [,1]                                          
# [1,] "\"NAME_ARABIC\": \"ŘŁŮ\u0086ŘŻŮ\u0088عا \""

As your geojson file will be written without regard for any locale set in R, you also do not want the user to influence the encoding of what is written. I did not test, though, whether writeOGR always outputs UTF-8 when the system locale is not UTF-8.

@ateucher

ateucher commented Jun 6, 2016

What about including an encoding argument in geojson_json, with the default being "UTF-8"?

geojson_json(input, lat = NULL, lon = NULL, group = NULL,
  geometry = "point", type = "FeatureCollection", encoding = "UTF-8", ...)

and then internally use:

readr::read_file(x, locale = readr::locale(encoding = encoding))

Will there ever be a need for a user to change it, or is that silly?

@sckott

sckott commented Jun 7, 2016

Good point about encoding. I imagine we do want to allow the user to set encoding.

@javrucebo

No, we do not want the user to choose the encoding of a JSON file; UTF-8 is required (and we might not want to go down the route of allowing UTF-16/UTF-32, which might create more headaches if you play it through).

https://tools.ietf.org/html/rfc7159#section-8.1

But encodings can create headaches already now. jsonlite::toJSON seems to do the job right:

x <- "fa\xE7ile"
Encoding(x) <- "latin1"
x
# [1] "façile"
Encoding(x)
# [1] "latin1"
js <- jsonlite::toJSON(x)
js
#  ["façile"] 
Encoding(js)
# [1] "UTF-8"

But if we let rgdal write a geojson file with latin1 encoded strings, it will create a latin1 encoded geojson file.

input <- SpatialPointsDataFrame(cbind(50,10), data=data.frame(ID=x, stringsAsFactors = FALSE))
dsn <- tempfile()
rgdal::writeOGR(obj=input, dsn=dsn, layer="",driver="GeoJSON")
system(paste("file",dsn))
# /tmp/RtmpPDcdFQ/file488e399430d1: ISO-8859 text

And with geojsonio

# version 0.1.8 as on CRAN
geojsonio::geojson_json(input)
# Error in feed_push_parser(readBin(con, raw(), n), reset = TRUE) : 
#   lexical error: invalid bytes in UTF8 string.
#           1, "properties": { "ID": "fa�ile" }, "geometry": { "type": "
#                      (right here) ------^


# version @json-io
js <- geojsonio::geojson_json(input)
js
# {
# "type": "FeatureCollection",
#                                                                                 
# "features": [
# { "type": "Feature", "id": 1, "properties": { "ID": "fa�ile" }, "geometry": { "type": "Point", "coordinates": [ 50.0, 10.0 ] } }
# ]
# }
Encoding(js)
# "UTF-8"  
# which is encoded in UTF-8 but was reading a wrong character

This goes beyond this thread though.

@javrucebo

One can give an encoding argument to writeOGR if you know how your sp object is encoded:

rgdal::writeOGR(obj=input, dsn=dsn, layer="",driver="GeoJSON", encoding="latin1")
system(paste("file",dsn))
# /tmp/Rtmp2h1fhE/file4a2b11df724b: UTF-8 Unicode text

But this does not really help, as you can have mixed encodings (in which case writeOGR fails).

In a nutshell, I would not bother with this in geojsonio. Assume writeOGR does everything correctly and writes valid UTF-8, then make sure to read the file back in with locale(), which defaults to UTF-8.
For the pathological cases above I might consider opening a bug report with rgdal.
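The plan in a nutshell, as a small sketch (`read_geojson_utf8` is an illustrative helper name, not package code): read the file written by writeOGR back with an explicit default locale, so a user's readr.default_locale option cannot change how it is decoded:

```r
read_geojson_utf8 <- function(path) {
  # readr::locale() defaults to UTF-8, overriding any
  # readr.default_locale option the user may have set
  readr::read_file(path, locale = readr::locale())
}
```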

@sckott

sckott commented Jun 7, 2016

okay, so readr::read_file(x, locale=locale()) is the plan

sckott added a commit that referenced this issue Jun 7, 2016
@sckott

sckott commented Jun 7, 2016

okay, pushed the change and closing this - reopen if you think it needs more work

@sckott sckott closed this as completed Jun 7, 2016
@sckott sckott added this to the v0.3 milestone Jun 7, 2016
@ateucher

ateucher commented Jun 9, 2016

Hey @javrucebo - just wanted to say thanks for catching this! It's made a really big difference when working with large objects 👍

@sckott sckott modified the milestones: v0.2, v0.3 Jun 17, 2016

@github-actions github-actions bot locked and limited conversation to collaborators Sep 14, 2022