
Support for Data Management API #186

Open · ryan-hall opened this issue Dec 5, 2019 · 6 comments

@ryan-hall

Enhancement:
The Socrata Data Management API (https://socratapublishing.docs.apiary.io/#) enables Socrata publishing features like dataset drafts and on-platform data transformations. It would be ideal if RSocrata offered data publishers a convenient method for using the Data Management API codepath when sending data to Socrata, in addition to the current SODA endpoints.

@tomschenkjr
Contributor

@ryan-hall thanks for the suggestion. Of course, it would be interesting to support as much of the Socrata API as possible. I've finally had a chance to review the documentation, so here are a few questions to help me understand current capabilities and what the Data Management API can provide.

First, I want to make sure we understand the new capabilities this API provides.

Overall flow

It seems there are two main verbs that are reused across all actions: creating revisions and applying revisions. For some functions, there are other steps in between, but these two are universal. Depending on the action, creating and applying revisions may need different arguments.

Questions

  1. Are creating revisions and applying revisions truly universal? Does it make sense to have a pair of universal functions to handle them?
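
To make question 1 concrete, here is a minimal sketch of what such a universal pair might look like, assuming httr for the HTTP calls. The function names are placeholders and the endpoint paths are just my reading of the apiary docs, so treat this as something to verify rather than a finished implementation.

```r
library(httr)

# Sketch only: open a revision on an existing dataset (identified by its 4x4 id).
create_revision <- function(domain, fourfour, email, password) {
  resp <- POST(
    sprintf("https://%s/api/publishing/v1/revision/%s", domain, fourfour),
    authenticate(email, password),
    body = list(action = list(type = "update")),
    encode = "json"
  )
  stop_for_status(resp)
  content(resp)$resource  # should include the revision_seq that later calls need
}

# Sketch only: apply a revision so the changes land in the published copy.
apply_revision <- function(domain, fourfour, revision_seq, email, password) {
  resp <- PUT(
    sprintf("https://%s/api/publishing/v1/revision/%s/%s/apply",
            domain, fourfour, revision_seq),
    authenticate(email, password)
  )
  stop_for_status(resp)
  content(resp)
}
```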

Creating a new data set

It appears the new API supports creating a new data set through the API (🙌). Creating a new data set requires several steps: creating a revision, creating a source, uploading the source, and applying the revision.

If I understand this correctly, this gives us a few options. While there are multiple steps to create a data set, we could provide several functions to make the process easier (a sketch of how they might fit together follows the list):

  • Create a revision
  • Create a new source
  • Upload a source
  • (Optional) Add a new column to a source file before uploading.
  • Apply revision
  • A function that performs all of these actions (e.g., creates a revision, source, and uploads in a single function). However, this would exclude adding a new column before uploading since that does not make sense.
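
And here is a rough sketch of the "single function" option, again assuming httr; the function name, endpoint paths, and payload shapes are assumptions to check against the Data Management API docs rather than anything that exists in RSocrata today.

```r
library(httr)

# One-shot sketch: open a revision, attach a source, upload a CSV, apply.
publish_csv_dsmapi <- function(domain, fourfour, csv_path, email, password) {
  base <- sprintf("https://%s/api/publishing/v1", domain)
  auth <- authenticate(email, password)

  # 1. Open a revision on the dataset
  rev <- content(POST(sprintf("%s/revision/%s", base, fourfour), auth,
                      body = list(action = list(type = "update")),
                      encode = "json"))
  seq <- rev$resource$revision_seq

  # 2. Create a source attached to that revision
  src <- content(POST(sprintf("%s/revision/%s/%s/source", base, fourfour, seq), auth,
                      body = list(source_type = list(type = "upload",
                                                     filename = basename(csv_path))),
                      encode = "json"))

  # 3. Upload the CSV bytes to the source
  POST(sprintf("%s/source/%s", base, src$resource$id), auth,
       content_type("text/csv"), body = upload_file(csv_path))

  # 4. Apply the revision so the update is published
  PUT(sprintf("%s/revision/%s/%s/apply", base, fourfour, seq), auth)
}
```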

Questions

  1. Can users later modify the public and private status of the data set? For instance, create a private data set and then use the API to change it to public?
  2. Are all of the returned responses exclusively done through JSON?
  3. After creating the data set, can I use typical methods to publish the data (e.g., write.socrata())?
  4. For creating a new source, it appears the fields are: field_name, display_name, position, and transform.
    1. What is position, and what are its potential inputs?
    2. How do you think we might handle the transformation functions? Use raw string inputs as part of an input field?
  5. What does creating a data set from an external link do? I'm not following that example use case.

Updating metadata

Awesome that there is support for updating metadata for both data sets and columns. It appears these are the valid metadata fields that can be updated for the entire data set:

  • tags
  • privateMetadata
  • name
  • licenseId
  • license.name
  • description
  • category
  • attributionLink
  • attribution

For column metadata, it appears you can update the following metadata fields:

  • display_name
  • field_name (same as API name)
  • description
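
To make the shape of an update concrete, here is an illustrative named list covering the dataset-level fields above and one way it might be serialized; the values are placeholders, and the exact JSON envelope the API expects should be confirmed against the docs.

```r
library(jsonlite)

# Placeholder values; field names mirror the list above.
dataset_metadata <- list(
  name            = "Building Permits",
  description     = "Permits issued since 2015",
  category        = "Buildings",
  tags            = c("permits", "construction"),
  attribution     = "Department of Buildings",
  attributionLink = "https://example.gov/buildings",
  licenseId       = "CC_30_BY"   # placeholder; valid ids need to be confirmed
)

toJSON(list(metadata = dataset_metadata), auto_unbox = TRUE, pretty = TRUE)
```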

Questions

  • Are there limitations on data types, lengths, or anything else for metadata inputs?
  • What's the difference between licenseId and license.name?
  • We've only supported gathering metadata from the data.json file (ls.socrata()). Related to this, is there another officially supported way to read metadata? Is it the views API?

Updating Column Data Types

This seems straightforward.

Legacy support for write.socrata()

The current write.socrata() function uses traditional POST and PUT calls.
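
For reference, the current SODA-based call looks roughly like this (the endpoint and credentials below are placeholders):

```r
library(RSocrata)

new_rows <- read.csv("new_rows.csv")  # data frame matching the dataset schema

# UPSERT issues a POST; REPLACE issues a PUT against the same SODA endpoint.
write.socrata(
  dataframe             = new_rows,
  dataset_json_endpoint = "https://data.example.gov/resource/xxxx-xxxx.json",
  update_mode           = "UPSERT",  # or "REPLACE"
  email                 = Sys.getenv("SOCRATA_EMAIL"),
  password              = Sys.getenv("SOCRATA_PASSWORD")
)
```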

  • Do you foresee this functionality being supported for the foreseeable future?
  • Can you intermix the use of the Data Management API with write.socrata calls? That is, can I use write.socrata() after initially creating a data set through the Data Management API? Namely, I'm wondering if we can keep write.socrata() to handle dataset updates.
  • Is the Data Management API only compatible with data sets beyond a specific version (e.g., 2.1)?

@matt-sandgren

@tomschenkjr Hi Tom, I work for Fulton County Government in GA, and I'd be very interested in contributing to this. We've been using the write.socrata() function to update all of our datasets, but we have an increasing number of them that use some of the Data Management transforms. I've been playing around with the Data Management API recently, and I have a working function that addresses this much:

A function that performs all of these actions (e.g., creates a revision, source, and uploads in a single function). However, this would exclude adding a new column before uploading since that does not make sense.

I have very limited git/github experience, so I'm not sure what the best way to share that code would be, if you're interested in taking a look.

  • RE your question on position, that's a numeric argument that determines the column order. From their docs: "If you're adding a new column, position is a required field that determines the column order."

  • For handling transformation functions, using raw string input is the easiest way I can think of, probably passed to a function in the form of a named list (sketched after this list).

  • I believe you can use write.socrata() after initially creating a dataset through the Data Management API, as long as that dataset doesn't use any data transformations, but I'm only 80% sure of that so we'd want to test.
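
Here is one shape that named list could take; the field names mirror the ones listed earlier (field_name, display_name, position, transform), and the transform expressions are just illustrative strings, not something I've run against the API.

```r
# Illustrative only: column specs with transforms passed as raw strings.
new_columns <- list(
  list(field_name   = "location",
       display_name = "Location",
       position     = 5,
       transform    = "geocode(`address`, `city`, `state`, `zip`)"),
  list(field_name   = "report_year",
       display_name = "Report Year",
       position     = 6,
       transform    = "to_number(date_extract_y(`report_date`))")
)
```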

@tomschenkjr
Contributor

@matt-sandgren - thanks! That's terrific, and I'm happy that you'd be interested in helping. Certainly understand that you don't have a lot of GitHub experience, but perhaps we can do some preliminary sharing of the code to fit it within the RSocrata package.

Not sure if you've done this before, but you can share code in a gist. You can simply copy/paste code without needing to use any Git commands. If you create a "Public Gist", we can then take a look at the code. Just copy and paste the link once it's created.

@matt-sandgren

Here's what I've got so far.

  • I've broken one small function out, but it can likely be broken up further.
  • Any dataset can be updated, not just ones that use data transforms on Socrata.

@ryan-hall
Author

I've been exploring separate functions for each step in this fork. Thanks for sharing what you've been using @matt-sandgren. I've focused on the flow for updating a dataset from a local csv.

To some of your original questions, @tomschenkjr:

Overall Flow

  1. Yes, they are fairly universal. You can always open a new revision on a dataset. You can open multiple revisions on a single dataset. And applying a revision is the same no matter what you've done within the revision. The only nuance to applying is permissions/approvals (see point 1 below).

Creating a new data set or updating an existing dataset

  1. There's a separate Permissions API for changing the sharing scope of an asset. In a data update, DSMAPI will simply use the current permissions scope. It gets a little tricky when applying a revision runs up against an Approvals queue, controlled by an Approvals API. For example, if the dataset is public but dataset updates go through the approval queue, applying the revision will "fail" because the dataset has entered the approvals queue and needs an approvals decision before the update takes effect in the published copy.
    • Supporting the Permissions API seems like another enhancement to consider.
    • Supporting the Approvals API within these DSMAPI write functions needs further thought.
  2. DSMAPI returns responses only in JSON, yes.
  3. write.socrata() will not do anything against a revision/draft, but is still valid with a published dataset. The real crux of the API choice (SODA vs DSMAPI) is that if you rely on some on-Socrata data transforms (like creating a new geocoded Point column), write.socrata() will not apply those transforms (the Point column would be empty for any new rows sent with write.socrata())
  4. For a new column, some fields like position are required. But this is only if you are adding a new column within the revision, which may be infrequent.
    • I'm curious if there's a use case for regular schema changes, like frequent new columns or altered transforms, or bulk schema changes where this occurs.
  5. You can pass DSMAPI a URL of a parseable file instead of a local file. This would be prudent to support in any "create" functions.
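
A sketch of how the external-URL case might look in a "create source" helper, assuming httr; the source_type payload shape and the endpoint path are assumptions to verify against the docs.

```r
library(httr)

# Sketch only: attach a source that points at an external, parseable file URL.
create_url_source <- function(domain, fourfour, revision_seq, file_url, email, password) {
  resp <- POST(
    sprintf("https://%s/api/publishing/v1/revision/%s/%s/source",
            domain, fourfour, revision_seq),
    authenticate(email, password),
    body = list(source_type = list(type = "url", url = file_url)),
    encode = "json"
  )
  stop_for_status(resp)
  content(resp)
}
```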

Updating Metadata

  1. Some fields may have limitations; I need to look into this further. Standard API field name rules apply. Column descriptions have no length limitation as far as I know, but they are plain text only, with no HTML support.
  2. I believe licenseId refers to the baked-in license options on Socrata; I would need to check.
  3. The Metadata API allows for reading and writing dataset-level metadata. See the issue145 branch on my fork for a basic read metadata function. The Metadata API does not read or write column-level metadata however.

Legacy support for write.socrata()

  1. Yes, the SODA endpoints are still supported and writing directly to the dataset is still supported. This will continue into the foreseeable future.
  2. You can, but only if your source matches the schema of the Socrata destination exactly. And if any transforms at all are used (a Point column made from lat/long, a year value transformed into a DateTime, a case() statement used to map a set of values to new values, some numerical or percentage calculation, etc.), they will not be applied with write.socrata(). @matt-sandgren, you're right on about the "as long as that dataset doesn't use any data transformations".
  3. Definitely would want to test this more to see where there are edge cases.

Note on validating an upload to a revision

When data is uploaded to a revision, Socrata checks every column for data type errors and runs all transforms against the new data. This can be quick, or it can take a hot minute. If you try to apply the revision before the validation step has finished, the apply step will fail.

You can check finished_at and failed_at in the response from the upload-to-source step to ensure that:

  1. the upload has finished processing and
  2. no columns failed to process (transform/data types)
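
A simple polling sketch against those fields, assuming httr; the endpoint path and the location of finished_at/failed_at in the response are my assumptions and should be checked against the actual upload response.

```r
library(httr)

# Poll the source until validation finishes, erroring out if anything failed.
wait_for_processing <- function(domain, source_id, email, password,
                                poll_seconds = 5, max_tries = 60) {
  url  <- sprintf("https://%s/api/publishing/v1/source/%s", domain, source_id)
  auth <- authenticate(email, password)
  for (i in seq_len(max_tries)) {
    res <- content(GET(url, auth))$resource
    if (!is.null(res$failed_at))   stop("A column failed to process; check the transforms/data types")
    if (!is.null(res$finished_at)) return(invisible(TRUE))
    Sys.sleep(poll_seconds)
  }
  stop("Timed out waiting for Socrata to finish validating the upload")
}
```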

@mlamias

mlamias commented Dec 9, 2020

Enhancement:
The Socrata Data Management API (https://socratapublishing.docs.apiary.io/#) enables Socrata publishing features like dataset drafts and on-platform data transformations. It would be ideal if RSocrata offered data publishers a convenient method for using the Data Management API codepath when sending data to Socrata, in addition to the current SODA endpoints.

Yes! I would really love to see the developer API Key supported for UPSERT and REPLACE. Any plans to include this soon?
