Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid round trip to disk in geojson_rw? #77

Open
ateucher opened this issue Feb 16, 2016 · 47 comments
Open

Avoid round trip to disk in geojson_rw? #77

ateucher opened this issue Feb 16, 2016 · 47 comments

Comments

@ateucher
Copy link
Member

There may be a way to avoid the round trip to disk in geojson_rw for converting Spatial objects to geojson, but I haven't quite been able to make it work.

The key: you can write to stdout with rgdal::writeOGR:

library(rgdal)
cities <- readOGR(system.file("vectors", package = "rgdal")[1], "cities")
writeOGR(cities[1:10,], "/vsistdout/", "cities", driver="GeoJSON")

But I can't figure out a way to capture the output. I've tried various combinations of sink(), textConnection(), and capture.output() but it's eluded me so far...

@sckott
Copy link
Collaborator

sckott commented Feb 16, 2016

Good thinking. I'll have look too. Would be nice to avoid that extra time.

@ateucher
Copy link
Member Author

Yeah, it's a killer on really big spatial objects.

@sckott
Copy link
Collaborator

sckott commented Feb 16, 2016

tried a bit, still not found a way to do it. asking around

@sckott
Copy link
Collaborator

sckott commented Feb 17, 2016

@hrbrmstr any thoughts on this one? how to collect the output of writeOGR() into a character string in the R session, instead of having to collect it from disk

@hrbrmstr
Copy link

ooh. i think so. i'm pretty flat-out tomorrow (it's one of my 2x/wk commute to boston from Maine day) but lemme see how awake I am (just got done teaching my first college 2.5hr #rstats class and am fading fast :-) But I have an idea (and I've also started a side-write of spatula, a hadley-esque revamp of rgdal+regos). So I think I may have an interim solution now-ish and a replacement in a cpl month.

@sckott
Copy link
Collaborator

sckott commented Feb 18, 2016

thanks

hugh, interesting, curious how https://github.com/thk686/rgdal2 fits in the mix

@hrbrmstr
Copy link

oh wow. i had no idea he was doing that! I shall Use the Fork and see what
he's done.

On Thu, Feb 18, 2016 at 2:43 AM, Scott Chamberlain <notifications@github.com

wrote:

thanks

hugh, interesting, curious how https://github.com/thk686/rgdal2 fits in
the mix


Reply to this email directly or view it on GitHub
#77 (comment).

@hrbrmstr
Copy link

um…what I'll post in a cpl hrs is, um…er…ugly. I might need assistance in making it cross platform and for getting other ideas. Perhaps even bringing Dirk into the convo.

@sckott
Copy link
Collaborator

sckott commented Feb 18, 2016

spatula is a good name - and i would love love love a better rgdal

@hrbrmstr
Copy link

So, save the following block to captured_write_ogr:

#include <Rcpp.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

using namespace Rcpp;

//' An \code{writeOGR} shim to capture stdout
//'
//' @param obj the \code{Spatial} object to use
//' @param obj_size pls pass in \code{object.size(obj)}
//' @param layer spatial layer to use
//' @param writeOGR pls pass in \code{writeOGR} (no quotes)
//' @return character vector of GeoJSON if all goes well
//' @examples \dontrun{
//' capturedWriteOGR(cities[1:10,],
//'                  object.size(cities[1:10,]),
//'                  "cities",
//'                  writeOGR))
//' }
// [[Rcpp::export]]
CharacterVector capturedWriteOGR(SEXP obj,
                                 int obj_size,
                                 SEXP layer,
                                 Function writeOGR) {

  // we don't know how big the output is going to be.
  // this is the main problem with this approach.
  // there is no error checking yet
  // i have no idea if this works on Windows
  // this was a hack on a commuter train ;-)

  // this could cause us to run out of memory
  // we need checking for sizes, mem avail and errors

  // allocate a buffer to hold the output geojson
  // and initialize it so we're sure to have a null term'd string
  int MAX_LEN = obj_size * 10 ;
  char buffer[MAX_LEN+1];
  memset(buffer, 0, sizeof (char) * (MAX_LEN+1));

  // now we're going to mess with pipes. the way the VSI stuff
  // mangles stdout in gdal means that we can't use nice C++
  // redirects and have to resort to file descriptor hacking

  int out_pipe[2];
  int saved_stdout;

  saved_stdout = dup(STDOUT_FILENO);

  // ok, there's some error checking if we couldn't even
  // get the hack started

  if (pipe(out_pipe) != 0 ) { return(NA_STRING); }

  dup2(out_pipe[1], STDOUT_FILENO);
  close(out_pipe[1]);

  // we've setup the capture, so let writeOGR do it's thing
  // we are calling R from C since the rgdal folks have not
  // exposed anything we can use AFAI can tell

  writeOGR(obj, "/vsistdout/", layer, "GeoJSON");

  // ok, we let it do it's thing, now make sure we've
  // cleaned up after it

  fflush(stdout);

  // now, get that mess into something we can use

  read(out_pipe[0], buffer, MAX_LEN);

  // restore order to the universe

  dup2(saved_stdout, STDOUT_FILENO);

  // say a little prayer

  return(wrap(buffer));

}

and, then run:

Rcpp::sourceCpp("captured_write_ogr.cpp")

and then run:

capturedWriteOGR(cities[1:10,], object.size(cities[1:10,]), "cities", writeOGR)

and you should get:

[1] "{\n\"type\": \"FeatureCollection\",\n\"crs\": { \"type\": \"name\", \"properties\": { \"name\": \"urn:ogc:def:crs:OGC:1.3:CRS84\" } },\n\"features\": [\n{ \"type\": \"Feature\", \"id\": 1, \"properties\": { \"NAME\": \"Murmansk\", \"COUNTRY\": \"Russia\", \"POPULATION\": 468000, \"CAPITAL\": \"N\" }, \"geometry\": { \"type\": \"Point\", \"coordinates\": [ 33.086040496826172, 68.963546752929688 ] } },\n{ \"type\": \"Feature\", \"id\": 2, \"properties\": { \"NAME\": \"Arkhangelsk\", \"COUNTRY\": \"Russia\", \"POPULATION\": 416000, \"CAPITAL\": \"N\" }, \"geometry\": { \"type\": \"Point\", \"coordinates\": [ 40.646160125732422, 64.520668029785156 ] } },\n{ \"type\": \"Feature\", \"id\": 3, \"properties\": { \"NAME\": \"Saint Petersburg\", \"COUNTRY\": \"Russia\", \"POPULATION\": 5825000, \"CAPITAL\": \"N\" }, \"geometry\": { \"type\": \"Point\", \"coordinates\": [ 30.453327178955078, 59.951889038085938 ] } },\n{ \"type\": \"Feature\", \"id\": 4, \"properties\": { \"NAME\": \"Magadan\", \"COUNTRY\": \"Russia\", \"POPULATION\": 152000, \"CAPITAL\": \"N\" }, \"geometry\": { \"type\": \"Point\", \"coordinates\": [ 150.780014038085938, 59.570999145507812 ] } },\n{ \"type\": \"Feature\", \"id\": 5, \"properties\": { \"NAME\": \"Perm'\", \"COUNTRY\": \"Russia\", \"POPULATION\": 1160000, \"CAPITAL\": \"N\" }, \"geometry\": { \"type\": \"Point\", \"coordinates\": [ 56.232463836669922, 58.000236511230469 ] } },\n{ \"type\": \"Feature\", \"id\": 6, \"properties\": { \"NAME\": \"Yekaterinburg\", \"COUNTRY\": \"Russia\", \"POPULATION\": 1620000, \"CAPITAL\": \"N\" }, \"geometry\": { \"type\": \"Point\", \"coordinates\": [ 60.610130310058594, 56.846542358398438 ] } },\n{ \"type\": \"Feature\", \"id\": 7, \"properties\": { \"NAME\": \"Nizhniy Novgorod\", \"COUNTRY\": \"Russia\", \"POPULATION\": 2025000, \"CAPITAL\": \"N\" }, \"geometry\": { \"type\": \"Point\", \"coordinates\": [ 43.940670013427734, 56.289676666259766 ] } },\n{ \"type\": \"Feature\", \"id\": 8, \"properties\": { \"NAME\": \"Glasgow\", \"COUNTRY\": \"UK\", \"POPULATION\": 1800000, \"CAPITAL\": \"N\" }, \"geometry\": { \"type\": \"Point\", \"coordinates\": [ -4.269947528839111, 55.862808227539062 ] } },\n{ \"type\": \"Feature\", \"id\": 9, \"properties\": { \"NAME\": \"Kazan'\", \"COUNTRY\": \"Russia\", \"POPULATION\": 1140000, \"CAPITAL\": \"N\" }, \"geometry\": { \"type\": \"Point\", \"coordinates\": [ 49.145465850830078, 55.733005523681641 ] } },\n{ \"type\": \"Feature\", \"id\": 10, \"properties\": { \"NAME\": \"Chelyabinsk\", \"COUNTRY\": \"Russia\", \"POPULATION\": 1325000, \"CAPITAL\": \"N\" }, \"geometry\": { \"type\": \"Point\", \"coordinates\": [ 61.392612457275391, 55.145000457763672 ] } }\n]\n}\n"

I have many caveats in those comments and if @eddelbuettel cld take a peek, let me know how daft I am and offer some suggestions it'd be a Really Good Thing.

As I say (I think) in the comments, traditional C++ stdout wrangling won't work because of the cruddy way the "VSI*" stuff works in gdal/ogr (not rgdal). If there are better ways to dynamically capture stdout at this level (cross-platform) I'd love to learn (this is a working 'hack' but may not work cross platform and may not be ideal in practice).

@hrbrmstr
Copy link

microbenchmark(js1=capturedWriteOGR(cities[1:10,], object.size(cities[1:10,]), 
               "cities", writeOGR))

## Unit: milliseconds
##  expr      min       lq     mean   median       uq      max neval
##   js1 1.862113 1.927723 2.118487 2.052322 2.112831 5.625623   100

@hrbrmstr
Copy link

I'm really not convinced int MAX_LEN = obj_size * 10 is truly necessary. In the example I provided,

object.size(cities[1:10,])
## 51768 bytes

and

object.size(capturedWriteOGR(cities[1:10,], object.size(cities[1:10,]), "cities", writeOGR))
## 2464 bytes

so it's substantially less. I suspect that'll be the case for many, if not all, Spatial objects, but we'd need to run tests to make sure. If it is consistently smaller the the object.size() of a Spatial object, then we can just get that object.size() (from within C, I was just lazy on the train) and make that the max buffer size.

@sckott
Copy link
Collaborator

sckott commented Feb 18, 2016

works great! thanks so much. indeed, cross-platform is important.

In your object.size() egs above, they are the same calls, did you intend to use different code with one of them?

@hrbrmstr
Copy link

ugh. yeah (i'll fix)

On Thu, Feb 18, 2016 at 10:22 AM, Scott Chamberlain <
notifications@github.com> wrote:

works great! thanks so much. indeed, cross-platform is important.

In your object.size() egs above, they are the same calls, did you intend
to use different code with one of them?


Reply to this email directly or view it on GitHub
#77 (comment).

@ateucher
Copy link
Member Author

Thanks so much for your help on this @hrbrmstr! I am on Windows here at work, I've given it a try. When I try to Rcpp::sourceCpp("captured_write_ogr.cpp"), I get the following warnings/error:

g++ -m64 -I"C:/PROGRA~1/R/R-32~1.3/include" -DNDEBUG     -I"C:/R/win_library/Rcpp/include" -I"C:/_dev"  -I"d:/RCompile/r-compiling/local/local323/include"     -O2 -Wall  -mtune=core2 -c captured_write_ogr.cpp -o captured_write_ogr.o
captured_write_ogr.cpp: In function 'Rcpp::CharacterVector capturedWriteOGR(SEXP, int, SEXP, Rcpp::Function)':
captured_write_ogr.cpp:54:20: error: 'pipe' was not declared in this scope
make: *** [captured_write_ogr.o] Error 1
Warning message:
running command 'make -f "C:/PROGRA~1/R/R-32~1.3/etc/x64/Makeconf" -f "C:/PROGRA~1/R/R-32~1.3/share/make/winshlib.mk" SHLIB_LDFLAGS='$(SHLIB_CXXLDFLAGS)' SHLIB_LD='$(SHLIB_CXXLD)' SHLIB="sourceCpp_1.dll" WIN=64 TCLBIN=64 OBJECTS="captured_write_ogr.o"' had status 2 
Error in Rcpp::sourceCpp("captured_write_ogr.cpp") : 
  Error 1 occurred building shared library.

It clearly doesn't recognize the pipe function...

@ateucher
Copy link
Member Author

As for the the /vsistdout/, the gdal geojson driver also supports writing to /dev/stdout (http://www.gdal.org/drv_geojson.html). Would that make things easier?

@hrbrmstr
Copy link

thx, @ateucher. I've got a Windows VM back at the ranch (Tue & Thu are my commute days to Boston) I can debug it later (though I do import <unistd.h> so I'm not sure what's going on…I do have some Windows porting knowledge, I just try not to let on that I do :-)

(oh, wait…i think i need to deal with using win32 and making a define to map pipe to MS __pipe. shld be a quick fix later. I'll fork the repo and make the changes in pkg so it's easier to build/demo.)

I tried it with /dev/stdout and it still does evil things under the covers that prevent being able to work with it nicely, sadly.

@ateucher
Copy link
Member Author

Thanks @hrbrmstr! I'm afraid my C++ knowledge is zero, but I'll see if I can track this down on my side too.

@sckott
Copy link
Collaborator

sckott commented Feb 18, 2016

@jeroenooms thoughts on the windows problem above #77 (comment)

@sckott
Copy link
Collaborator

sckott commented Feb 18, 2016

started incorporating on this branch https://github.com/ropensci/geojsonio/tree/geojsonrw-fix just to see how it all works - haven't incorporated into the geojsonio functions yet

there is a problem on check though:

* checking compiled code ... NOTE
File ‘geojsonio/libs/geojsonio.so’:
  Found ‘___stdoutp’, possibly from ‘stdout’ (C)
    Object: ‘captured_write_ogr.o’

Compiled code should not call entry points which might terminate R nor
write to stdout/stderr instead of to the console, nor the system RNG.

See ‘Writing portable packages’ in the ‘Writing R Extensions’ manual.

Some discussion on it here http://stackoverflow.com/questions/28004717/r-check-doesnt-like-stdcout-c

@ateucher
Copy link
Member Author

@sckott I sent you a PR on that branch with a Windows hack. I'm sure @hrbrmstr will have a better solution, but I wanted to give it a try :)

@hrbrmstr
Copy link

those are perfect Windows augmentations, @ateucher. I'm not worried abt that error msg. I can give verbiage for a CRAN submission with that there. I've not exhausted all other possibilities yet, tho. Since we're assuming gdal is on systems anyway, I cld just write something that taps directly into the C/C++ gdal lib and try to overcome this a different way.

I say "just" a bit too lightly there :-)

TIL: git clone git@github.com:… on Amtrak commuter train wifi was, perhaps, not the best idea I've had today.

@sckott
Copy link
Collaborator

sckott commented Feb 19, 2016

thanks @ateucher

using capturedWriteOGR() wasn't working too well unless I was in load_all() context. made a tiny exported fxn to use it

devtools::install_github("ropensci/geojsonio@geojsonrw-fix")
library("geojsonio")
library("rgdal")
cities <- readOGR(system.file("vectors", package = "rgdal")[1], "cities")
cwgr(cities[1:10,], "cities")

@sckott
Copy link
Collaborator

sckott commented Apr 7, 2016

@ateucher @hrbrmstr added a layer_options param to the capturedWriteOGR function - as we want to allow users to pass options to the geojson driver.

using it now in geojson_write - works fine with inputs of small size - however, seems to hang and/or crash with inputs of any non-small size - not sure what could be the problem

@hrbrmstr
Copy link

hrbrmstr commented Apr 7, 2016

oh. wow. i need to get back to this. :-)

On Thu, Apr 7, 2016 at 7:05 PM, Scott Chamberlain notifications@github.com
wrote:

@ateucher https://github.com/ateucher @hrbrmstr
https://github.com/hrbrmstr added a layer_options param to the
capturedWriteOGR function - as we want to allow users to pass options to
the geojson driver.

using it now in geojson_write - works fine with inputs of small size -
however, seems to hang and/or crash with inputs of any non-small size - not
sure what could be the problem


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#77 (comment)

@sckott
Copy link
Collaborator

sckott commented May 3, 2016

@hrbrmstr any chance you can take a look at this again? i can alternatively pull in Jeroen

@hrbrmstr
Copy link

hrbrmstr commented May 3, 2016

aye. back to 100% health and virtually caught up at work.

On Tue, May 3, 2016 at 12:48 PM, Scott Chamberlain <notifications@github.com

wrote:

@hrbrmstr https://github.com/hrbrmstr any chance you can take a look at
this again? i can alternatively pull in Jeroen


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#77 (comment)

@javrucebo
Copy link

With the main bottleneck of geojson_json addressed in #85, I was also benchmarking the Disk-IO overhead of rgdal::writeOGR discussed in the issue here.

Although I do appreciate every effort in speeding up code, I do not really see the big bottleneck with the round-trip to disk on my system. In following benchmarks I tried to avoid the influence of disk-caches by flushing and also write/read large amounts of random data in between steps, but still it can not be out-ruled that disk-cache has positive impact (which in real life is good).

input <- raster::getData('GADM', country='FRA', level=1)
system.time(rgdal::writeOGR(input, "/tmp/todisk", "", driver="GeoJSON"))
#   user  system elapsed 
#  1.612   0.099   1.715 
system.time(rgdal::writeOGR(input, "/tmp/ramdisk/toram", "", driver="GeoJSON"))
#   user  system elapsed 
#  1.349   0.067   1.422 
system.time(rgdal::writeOGR(input, "/vsistdout/", "", driver="GeoJSON"))
#   user  system elapsed 
#  1.357   0.072   1.430

input <- raster::getData('GADM', country='CAN', level=3)
system.time(rgdal::writeOGR(input, "/tmp/todisk", "", driver="GeoJSON"))
#   user  system elapsed 
# 53.917   1.628  56.806 
system.time(rgdal::writeOGR(input, "/tmp/ramdisk/toram", "", driver="GeoJSON"))
#   user  system elapsed 
# 54.672   0.892  55.656 
system.time(rgdal::writeOGR(input, "/vsistdout/", "", driver="GeoJSON"))
#   user  system elapsed 
# 54.266   0.923  55.282
  • There is virtual no difference between writing to ramdisk or to stdout.
  • There is a small difference between disk and ramdisk, I do not consider this are relevant given the overall running time.
  • We can also see that user and elapsed time are pretty similar, so there is not much waiting involved, most of it is processor time

For reading it does not make any difference either whether the data is already in RAM or if it is read from disk.
I did a couple of runs in random order and with disk flushing/large random writes/reads in between.

fdsk <- "/tmp/todisk"
fram <- "/tmp/ramdisk/toram"
system.time(readChar(fdsk, file.info(fdsk)$size))
#   user  system elapsed 
# 13.900   0.663  14.568 
system.time(readChar(fram, file.info(fram)$size))
#   user  system elapsed 
# 13.973   0.645  14.631 

So the main speed-up which potentially can be saved is the overhead of needing to read-in the data after converting. I doubt it will be the full 14 seconds above though as it includes the converting of the captured output.

@sckott
Copy link
Collaborator

sckott commented Jun 4, 2016

Thanks for the detailed notes here @javrucebo - We'll definitely try avoiding writing to disk with writeOGR, and if no speed up, then not really worth including

@mdsumner
Copy link

mdsumner commented Jan 14, 2017

Still a pressing issue? We can easily leverage sf to do this, I might need help with the structure-composing on the other side, but the sf-decomposing is no problem, and having a formalized pathway from sf to geojson-structures is something that would be really useful.

@sckott
Copy link
Collaborator

sckott commented Jan 15, 2017

Definitely want to avoid round tripping if possible. We've definitely sped this up a lot - if we can make it faster I'm all for it.

Seems like @ateucher has done sf to geojson already, yes? So I think we need sp classes to geojson - We have some of the sp classes covered in the geojson package - that Jeroen added - but we're not importing that functionality here yet, see https://github.com/ropensci/geojson/blob/master/R/as.geojson.R#L18-L58 and https://github.com/ropensci/geojson/blob/master/R/sp_to_geojson.R

@ateucher
Copy link
Member Author

I haven't looked closely enough at the sp -> geojson in the geojson package yet, but I'm not convinced it deals properly with polygon holes in SpatialPolygons[DataFrames]s. But sf does Spatial -> sf without sp in st_as_sfc, so I think we can borrow the logic from there to go sf -> geojson. Code is here

@sckott
Copy link
Collaborator

sckott commented Jan 16, 2017

@ateucher sounds good to use sf to do it

@ateucher
Copy link
Member Author

@sckott @mdsumner do you mean to convert Spatial* objects to geojson by converting Spatail* -> sf[c] -> geojson? Wouldn't that be inefficient? Also, then we'd want to Import sf as well as sp, which up until this point we'd been trying to avoid.

@sckott
Copy link
Collaborator

sckott commented Jan 16, 2017

@ateucher ah, so by

borrow the logic

you just meant borrow some code, not use sf?

@mdsumner
Copy link

Yes, it'd be good to use the sf approach, and possibly you could remove the dep on sp completely. but sf bundles with GDAL, GEOS and PROJ.4 so it's not exactly small. But fwiw it's very efficient to convert to sf, and it would allow you to remove those converters from your package if you relied on sf.

I don't have enough understanding of geojsonio design to know what's best (I see import(sp) and rgdal is done holus!), but I do think you might be able away with copying the sf code to decompose sp and avoid the imports completely. (though the coercion stuff will take a bit of work, I would start by modifying the sp and rgdal imports to be granular, and then work on chopping them out. I'm not sure how much of the Rings stuff is handled by sf, but the to-DataFrame parts could be simpler, I think - and would be better in a separate package. That's sort of what spbabel was for, so you could give it anything Spatial* and then spbabel::sp() the resulting table/s and it would could hide a lot of stuff, but it's not clear that's going to help now.

@mdsumner
Copy link

I made a start on the imports here, but to finish it might be more complicated than I'd like :)
https://github.com/mdsumner/geojsonio/tree/sp-imports

@sckott
Copy link
Collaborator

sckott commented Jan 16, 2017

rgdal is done holus

what does this mean?

I think we'd still want to keep sp in Suggests (the only load for users that do sp stuff) at least for users that want to convert sp class objects to geojson

@ateucher
Copy link
Member Author

In my opinion, the goal would be to move sp, rgdal, rgeos, and sf to Suggests, and only use them for changing CRSs, and probably going from geojson to Spatial or sf classes.

I think we should be able to go from sf to geojson (already done) and Spatial* to geojson largely without using those packages.

@sckott:

you just meant borrow some code, not use sf?

Yup.

@sckott
Copy link
Collaborator

sckott commented Jan 16, 2017

Agree with you @ateucher - ideally all those pkgs in Suggests

@mdsumner
Copy link

@sckott I just mean the entire package is imported "holus bolus", rather than just the functions needed. I'd like to make those imports granular so it's clear what's used and what isn't, which is a stepping stone to Suggests.

@sckott
Copy link
Collaborator

sckott commented Jan 17, 2017

ah, makes sense @mdsumner

@sckott
Copy link
Collaborator

sckott commented Nov 28, 2019

@ateucher i've replaced rgdal with sf - now that we're using sf::st_write know of any way to avoid the disk write with that fxn?

@ateucher
Copy link
Member Author

I don't think so @sckott, pretty similar behaviour to rgdal in that sense...

@mdsumner
Copy link

geojsonsf?

@ateucher
Copy link
Member Author

I had a similar thought @mdsumner. I think I'll probably use it in rmapshaper too

@sckott
Copy link
Collaborator

sckott commented Nov 30, 2019

good idea @mdsumner

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants