
Add function to recursively export all Socrata data to local directory #126

Open
5 of 9 tasks
tomschenkjr opened this issue Mar 5, 2017 · 25 comments

@tomschenkjr
Contributor

tomschenkjr commented Mar 5, 2017

The ls.socrata() function supports listing all data on a data portal, while read.socrata() downloads individual datasets. Thus, the two can be combined under export.socrata() to download all of the files, neatly compress them, and place them in a single directory.

This sort of functionality can be used by people wishing to archive data portals or to help migrate from one platform to another. The function should focus on saving the data to local or cloud-based storage (e.g., S3) and should avoid loading all of the data into memory.

Solving #124 will allow easier integration between ls.socrata() and read.socrata(), so RSocrata v1.7.2-7 or above will be required.

  • Output compressed CSVs into a directory named after the domain itself (e.g., data.cityofchicago.org/)
  • Timestamp each individual file (files should be individually timestamped since an entire portal export may take awhile)
  • Ensure that GeoJSON files are exported and compressed (I don't think "Add support for geojson downloads" #43 needs to be completed first)
  • Ensure Shapefiles, KML/KMZ, and other "attach"-based files are also exported and compressed
  • Allow users to supply a token in export.socrata() to avoid throttling
  • Allow users to choose between CSV or JSON outputs (CSVs are data.frame friendly, but JSON downloads faster from Socrata)
  • Allow users to choose between compressed or uncompressed output
  • Support output to cloud services such as S3
  • Write unit tests

An initial alpha is on this gist. This is now on the issue126 branch. Feedback is encouraged as I've done limited testing at this point.
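
A minimal usage sketch, assuming the branch is installed from GitHub and that the only required argument is the portal URL (the final signature may grow the options listed above):

# Install the development branch and export an entire portal
devtools::install_github("Chicago/RSocrata", ref = "issue126")
library(RSocrata)

# Downloads every dataset listed in data.json into a directory named
# after the domain, e.g. ./data.cityofchicago.org/
export.socrata("https://data.cityofchicago.org/")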

@tomschenkjr tomschenkjr self-assigned this Mar 5, 2017
@joshmwolff

Really glad to see this enhancement request. I've used this function combination before and it works well, so I'm glad to see that you're thinking of combining them. An interesting extension of this enhancement -- and one perhaps not vulnerable to throttling -- would be to enable RSocrata users to export a list of column names associated with each dataset (rather than the datasets themselves). That would allow users to investigate which datasets share which fields.

@tomschenkjr
Contributor Author

Thanks, @joshmwolff. It's an interesting idea and I can see it being generally useful. But, as I think about it, would it be most useful to retain all of the columns in memory instead of writing them to disk?

@joshmwolff

@tomschenkjr: I think you're right: it would be inefficient to use the export.socrata() function to get a list of all column names for all datasets within a domain if that means downloading all datasets into memory instead of writing them to disk. That said, I'm not sure I'd necessarily want to download all datasets to disk if I were only interested in keeping column names and ultimately combining those names into a single data table. I see, however, that Socrata's Discovery API will in fact return dataset column names for datasets in a particular domain. Can RSocrata hit the Discovery API? That might enable the package to capture the column names without having to also download the column data itself.

@tomschenkjr
Contributor Author

@joshmwolff - right now we're not using the Discovery API, but it's something we're planning for an upcoming release.

In either case, I think read.socrata() could be modified to only extract the column names and not the data. Whether that's through the same SoDA API calls or through the Discovery/Catalog API is probably more a technical/optimization question.

Would you mind opening a new issue on this? I think it's a worthwhile conversation to track as a separate feature from this one.

@tomschenkjr
Contributor Author

Outlining some thoughts on unit tests for this function (a rough testthat sketch follows the list):

  • Count the number of files written to the directory and compare it to the count of data sets listed in ls.socrata().
  • Compare the file names written to disk to the names expected to be written (may need to be careful of certain OSes having filename length limitations).
  • Ensure the directory created aligns to the domain name being exported.
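
A minimal sketch of the first check, assuming the directory layout described in the checklist above (one directory per domain, one file per dataset); the test itself is illustrative, not code from the branch:

library(testthat)
library(RSocrata)

test_that("export.socrata writes one file per listed dataset", {
  domain <- "https://data.cityofchicago.org/"
  export.socrata(domain)

  # ls.socrata() returns one row per dataset in the portal's data.json
  listed <- nrow(ls.socrata(domain))

  # Exported files land in a directory named after the domain itself
  written <- length(list.files("data.cityofchicago.org"))

  expect_equal(written, listed)
})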

@tomschenkjr tomschenkjr modified the milestone: 1.8.0 Mar 15, 2017
@joshmwolff

FWIW, I've been playing with Socrata's "Discovery API" and it works well as a means of creating a small data frame of dataset names and other metadata. The following worked for me:

library(jsonlite)

# Discovery API catalog search, limited to datasets on a single domain
API <- 'http://api.us.socrata.com/api/catalog/v1?only=datasets&domains=data.cambridgema.gov'
RawMetadata <- fromJSON(txt = API)

# "results" holds one entry per dataset; "resource" contains the names and other metadata
Metadata <- RawMetadata[['results']]
Metadata <- Metadata$resource

This avoids having to write any files to disk, as you're just storing a single small dataframe in memory (in Cambridge's case, 88 rows by 17 columns).

@tomschenkjr
Contributor Author

Looks interesting, and we're definitely interested in incorporating the Discovery API (i.e., #114).

Would this fit in line with #128?

@joshmwolff

Whoops. You're right: my comment belongs in #128 rather than here. Feel free to ignore.

@tomschenkjr
Contributor Author

I've pushed the branch to the repo: issue126.

/cc @nicklucius @geneorama

@nicklucius
Contributor

I've pushed my first stab at downloading non-tabular data files. Here's how it works: if the first download URL available is not a CSV, the httr::GET() contents are written straight to a file and read.socrata() is not called. The original filename and proper file extension are appended to the output filename.

For geographical data with multiple download choices, it looks like KML is listed first, so that is what gets saved.
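
A rough sketch of that path, assuming httr; the URL and the content-disposition parsing are illustrative, not the exact branch code:

library(httr)

# The first distribution URL for this dataset is not a CSV, so skip
# read.socrata() and stream the raw response body straight to disk
url <- "https://data.cityofchicago.org/api/geospatial/qixn-wjxu?method=export&format=KML"
response <- GET(url)

# Recover the original filename and extension from the Content-Disposition header
disposition <- headers(response)[["content-disposition"]]
filename <- sub('.*filename="?([^";]+)"?.*', "\\1", disposition)

writeBin(response$content, filename)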

Still to do:

  • compress the file - I had issues with this

@tomschenkjr
Contributor Author

@nicklucius - that's great. I'll check off export for Shapefile/KML/KMZ for now. Looks like this will also work for geojson, but I'll play with that for a bit to see how it works.

The compression is surprisingly tricky.

@tomschenkjr
Contributor Author

Scratch the geojson remark; that's certainly taken care of. I'll check that off, too.

@tomschenkjr
Contributor Author

Getting an error when testing with the bulk files. Appears to be an error with the file name structure. Looking into it.

> export.socrata("https://data.cityofchicago.org/")
 Hide Traceback
 
 Rerun with Debug
 Error in file(con, "wb") : cannot open the connection 
3. file(con, "wb") 
2. writeBin(response$content, filename) 
1. export.socrata("https://data.cityofchicago.org/") 
In addition: Warning message:
In file(con, "wb") :
  cannot open file 'data.cityofchicago.org/qixn-wjxu_2017-05-06_133501."Street Sweeping - 2017 - Map.kml"': Invalid argument

@tomschenkjr
Contributor Author

Ok - I've fixed the above error and also added the app_token parameter.

Later today, I can move this over to the dev branch to be part of the nightly build.
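
The failure above came from quotes in the dataset title ending up in the output path. A sketch of the kind of filename cleanup that addresses it (the exact code in the branch may differ):

# Strip quotes and other characters that are not legal in filenames on
# common operating systems before building the output path
raw_title <- '"Street Sweeping - 2017 - Map.kml"'
clean_title <- gsub('["*?<>|:]', "", raw_title)

filename <- paste0("data.cityofchicago.org/",
                   "qixn-wjxu_2017-05-06_133501.", clean_title)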

@tomschenkjr
Contributor Author

As for unit testing, one option is to set up a valid but fake data.json file that is just a subset of data, placed at a file location we can control. Those files can point to actual data from the Chicago data portal.

This is a bit of work, so I don't necessarily like it, but it is one option. A rough sketch of the fixture side of that idea is below.
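
This assumes the fixture is a hand-trimmed data.json stored with the tests; the path is illustrative, and export.socrata() would still need a way to be pointed at it:

library(jsonlite)

# A trimmed data.json whose few entries point at real Chicago data portal
# datasets; simplifyVector = FALSE keeps the DCAT structure as a list
fixture <- fromJSON("tests/testthat/data/data.json", simplifyVector = FALSE)

# The number of datasets an export run against this fixture should produce
length(fixture$dataset)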

@tomschenkjr
Contributor Author

It's now on dev branch.

@chrismetcalf

Hey guys, been on paternity leave (👶🍼💩) so I haven't had my head in GitHub enough, but @tomschenkjr was kind enough to alert me via email to what is going on.

Big question that'll help this be performant on Socrata portals - for the actual data dumping, are you using the export links (/api/views/$id/rows.$format) or the SODA2 APIs (/resource/$id.$format)?

The former will allow you to download the dataset export in one big file, and takes better advantage of caching where available. It should be faster for you.
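
For reference, the two URL shapes for the same dataset look like this (the four-by-four id is just an example):

# Bulk export endpoint: one cached file containing the full dataset
export_url <- "https://data.cityofchicago.org/api/views/xzkq-xp2w/rows.csv"

# SODA2 resource endpoint: paged API access, slower for whole-portal dumps
soda_url <- "https://data.cityofchicago.org/resource/xzkq-xp2w.csv"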

@nicklucius
Contributor

@chrismetcalf - Congrats! I can understand how GitHub might not be the first thing on your mind right now.

The export.socrata() function uses the export links that look like /api/views/$id/rows.$format to download data. Is this the faster method? If so, I wonder if that is something to keep in mind for read.socrata() development.

@chrismetcalf

Yes, @nicklucius, it looks like you're all good! The only other recommendation I would make would be to watch for eTag headers if you've got local caching, but if you're using the export links you're already taking advantage of our bulk export and server-side caching.
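
A sketch of the eTag idea with httr, assuming a previous export stored each file's ETag somewhere local (the caching layer itself is hypothetical):

library(httr)

url <- "https://data.cityofchicago.org/api/views/xzkq-xp2w/rows.csv"
cached_etag <- "\"abc123\""  # ETag saved from the previous export (hypothetical)

response <- GET(url, add_headers(`If-None-Match` = cached_etag))

if (status_code(response) == 304) {
  message("Not modified since the last export; keeping the local copy")
} else {
  writeBin(response$content, "xzkq-xp2w.csv")
}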

@geneorama
Member

Documenting two ideas from the analytics meeting today:

  1. We need to save off the ls.socrata() results so that we have the information about the data sets in the future (a quick sketch is below).
  2. I think it would be good to separate the two pieces of functionality into their own functions for easier unit testing.
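
For the first point, a minimal sketch of saving the catalog snapshot alongside the exported files (the file name is just a suggestion):

library(RSocrata)

catalog <- ls.socrata("https://data.cityofchicago.org/")

# Keep the catalog as it looked at export time, so dataset metadata
# (titles, issued/modified dates, distribution URLs) is preserved
saveRDS(catalog, file = "data.cityofchicago.org/ls_socrata_snapshot.rds")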

@James-SR

James-SR commented Mar 21, 2018

Not sure if this error message should be a concern?

export.socrata("https://opendata.cheshireeast.gov.uk/")
Warning messages:
1: In no_deniro(result[[columnName]]) : NAs introduced by coercion
2: In no_deniro(result[[columnName]]) : NAs introduced by coercion
3: In no_deniro(result[[columnName]]) : NAs introduced by coercion

It appears to have extracted some files, but I'm not sure it is the complete set of public files.

@tomschenkjr
Contributor Author

@James-SR - thanks. It's still a beta feature, so it's good to see some use cases that produce warnings. When we develop it more, we will look at these warnings and handle them more elegantly.

The easiest way to check is whether the number of exported documents is the same as the number of entries at https://opendata.cheshireeast.gov.uk/data.json

@tomschenkjr
Contributor Author

@James-SR - The latest build of export.socrata has been pushed to the issue126 branch. I tested the Cheshire East portal and did not receive the error. This could be because we fixed it or because the offending data set or element was changed on the portal itself.

Feel free to test this again; I hope it is still useful.

@tomschenkjr
Contributor Author

There is a quirk with some data listed in Socrata that I wanted to document. Socrata supports several HTML-based "non-data" formats in the data.json file, which we use to list all available data through ls.socrata(). First, Socrata has "Stories", which let users create HTML sites (example). In my recent commit, I've opted to simply skip these files.

Second, Socrata also supports "external data". Sometimes these external data are links to HTML web pages, while other times they link to actual data. For instance, this dataset is linked to a web page not hosted by Socrata.

Our function simply does not handle this well because HTML websites do not have a content disposition. I've made a change to ignore links that do not have a content disposition, so these kinds of sites are skipped. Sometimes external data is actually data (e.g., CSV), so those files should still be downloaded; other content, e.g., HTML, will be skipped.

The downside of these approaches is that some information will be ignored, which could cause confusion. The former scenario is easy to detect because data.json does not list a distributionUrl for Stories. However, the latter case is more complicated because external data always displays a distributionUrl, even if it points to an HTML-based site.

This also makes it more difficult to write unit tests. For instance, this approach means the data.json file does not correspond one-for-one to the data actually downloaded, and there isn't an indicator in data.json for what is ultimately downloaded.

I will think of ways to resolve these issues.

@tomschenkjr
Contributor Author

I've resolved the issue by having the function follow these rules (sketched in code after the list):

  • Download if it is a CSV (mediaType is text/csv)
  • Ignore if there is no downloadUrl (usually because there simply isn't data)
  • Ignore if it's an HTML file (mediaType is text/html)
  • Otherwise, download everything through a normal GET command.
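
Roughly, the decision logic looks like this; this is a sketch of the rules above rather than the exact branch code, and the field names follow the data.json/DCAT convention:

download_distribution <- function(distribution, path) {
  if (identical(distribution$mediaType, "text/csv")) {
    # Rule 1: CSVs go through the existing read.socrata() path
    write.csv(read.socrata(distribution$downloadURL), path, row.names = FALSE)
  } else if (is.null(distribution$downloadURL)) {
    # Rule 2: no downloadURL usually means there simply isn't any data
    invisible(NULL)
  } else if (identical(distribution$mediaType, "text/html")) {
    # Rule 3: skip HTML pages such as Stories and external web links
    invisible(NULL)
  } else {
    # Rule 4: everything else is fetched with a normal GET and written as-is
    response <- httr::GET(distribution$downloadURL)
    writeBin(response$content, path)
  }
}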

tomschenkjr pushed a commit to tomschenkjr/RSocrata that referenced this issue Jan 5, 2020
  * Ignores HTML files (e.g., Socrata Pages)
  • Ignores cases where there isn't any data
  * Will download (uncompressed) PDFs, Word, Excel, PowerPoint, plain text attachments.