Data catalog integration #214

danielfdsilva · 2018-04-09T16:37:59Z

From the proposal:

Develop an integration with the WB Data Catalog that makes it easier to discover and use input data prepared by others.

Users upload data to the Data Catalog and tag it with a specific tag (this will be well documented);
When setting up a project, the users can indicate whether they want to pull data from the Data Catalog;
When they select the Data Catalog as the source, the platform will fetch data and prepare it for use by RAM.

The development of this feature depends on the ability to programmatically access and retrieve data from the Data Catalog. Most importantly:

List datasets by tag and collection (e.g. RAM-ready, road-network, population)
Retrieve datasets.

danielfdsilva · 2018-04-10T16:06:35Z

The flow for the data catalog integration will be:

Search for the appropriate datasets using the search-service/search_api/datasets endpoint, filtering by RAM-specific tags (Ex: https://datacatalog.worldbank.org/search-service/search_api/datasets?filter[field_tags]=1230).
Select the first resource in field_resources.
Use the api/3/action/resource_show to get information on the resource and download link (Ex: https://datacatalog.worldbank.org/api/3/action/resource_show?id=102055).
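
The three steps above can be sketched roughly as follows. This is only an illustration, not the actual implementation: the helper names are hypothetical, and the response shape (`field_resources` → `und` → `target_id`) is assumed from the sample search result posted further down in this thread. No request is made here; the functions only build the URLs and pick the resource.

```python
from typing import Any, Dict, Optional
from urllib.parse import urlencode

BASE_URL = "https://datacatalog.worldbank.org"


def search_url(tag_id: int, limit: Optional[int] = None) -> str:
    """Step 1: build the dataset search URL filtered by a tag id."""
    params: Dict[str, Any] = {}
    if limit is not None:
        params["limit"] = limit
    params["filter[field_tags]"] = tag_id
    # safe="[]" keeps the brackets unescaped, as in the URLs quoted in this thread.
    query = urlencode(params, safe="[]")
    return f"{BASE_URL}/search-service/search_api/datasets?{query}"


def first_resource_id(dataset: Dict[str, Any]) -> Optional[str]:
    """Step 2: select the first resource listed in field_resources (assumed shape)."""
    resources = dataset.get("field_resources", {}).get("und", [])
    return resources[0]["target_id"] if resources else None


def resource_show_url(resource_id: Any) -> str:
    """Step 3: build the resource_show URL used to get the download link."""
    return f"{BASE_URL}/api/3/action/resource_show?{urlencode({'id': resource_id})}"
```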

Important notes:

The tags used in the datasets must be one of ram-poi, ram-origins, ram-road-network, ram-profile, ram-admin depending on the data type of the resource.
We're only going to consider the first item of the resources field.
The resource is expected to be in the format needed by RAM.

There are a couple of things we need to move this forward, namely:

Add the tags ram-poi, ram-origins, ram-road-network, ram-profile, ram-admin to the system and provide us with their ids.
Upload sample data to use during development.

danielfdsilva · 2018-04-12T08:45:09Z

Server implementation for data catalog

The server will have an endpoint to return the options for a given file type (poi, origins, road-network, profile, admin) - `/projects/setup-options`.
When the endpoint is queried, the system will check the internal cache for data. If the data is recent enough it is returned; otherwise it will be fetched from the World Bank Data Catalog and stored for subsequent requests, ensuring a speedy application.
When trying to submit the project/scenario creation form, the system will check if the selected option is available. If it isn't, an error is returned and the cache is cleared, ensuring that the next request will have fresh data.

Cache table schema:

field	type
id	integer
type	string
dataset_id	string
dataset_name	string
resource_id	string
resource_url	string
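
A minimal in-memory sketch of the caching behaviour described above, for discussion only. The real cache lives in a database table with the schema listed; here `CacheRow` just mirrors those fields. The class name, the `fetch` callback, and the one-hour default for "recent enough" are assumptions made for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class CacheRow:
    """Mirrors the cache table schema above."""
    id: int
    type: str
    dataset_id: str
    dataset_name: str
    resource_id: str
    resource_url: str


class SetupOptionsCache:
    def __init__(
        self,
        fetch: Callable[[str], List[CacheRow]],  # hypothetical catalog fetcher
        max_age_seconds: float = 3600.0,         # assumed freshness window
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, List[CacheRow]]] = {}

    def options(self, file_type: str) -> List[CacheRow]:
        """Return cached rows if recent enough, otherwise fetch and store them."""
        now = self._clock()
        entry = self._entries.get(file_type)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]
        rows = self._fetch(file_type)
        self._entries[file_type] = (now, rows)
        return rows

    def check_selected(self, file_type: str, resource_id: str) -> bool:
        """On form submit: verify the option still exists; clear the cache if not."""
        available = any(r.resource_id == resource_id for r in self.options(file_type))
        if not available:
            self._entries.clear()  # next request will fetch fresh data
        return available
```

Passing the clock and the fetcher in as arguments keeps the sketch testable without network access.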

olafveerman · 2018-06-05T14:45:10Z

IDs of the tags:

tag	id
`ram-rn`	1412
`ram-admin`	1413

Still have to figure out the ids for the other tags (pop and POI)

The following endpoint can be used to filter by type:

https://datacatalog.worldbank.org/search-service/search_api/datasets?limit=10&filter[field_tags]=1412

cc @danielfdsilva

olafveerman · 2018-07-26T14:01:01Z

This feature is ready to be tested. For this we'd need the following:

the tag ids of ram-poi, ram-origins
data that has been tested in RAM, uploaded to the Data Catalog. This will allow us to run the final tests before merging in the PR.

@qli1205 Can you help us with this?

olafveerman · 2018-09-14T13:58:07Z

Todo:

test if we can import a single POI dataset with multiple resources. Clara will upload a dataset for us to test with
Clara will also provide links to a POI and origin dataset so we can determine the id of ram-poi and ram-origins

@qli1205

olafveerman · 2018-09-20T18:50:19Z

To test the POI dataset, we can use the following: https://datacatalog.worldbank.org/dataset/china-ghuizou-poi

danielfdsilva · 2018-09-21T17:22:00Z

We were able to test the data catalog workflow with the new data you uploaded. The POI and Admin Boundaries work well, but we need input from your side on the Road Network and Origins.

Please see our notes below.

POI

Thanks for uploading the new POI data. We implemented a way to consume a single dataset with multiple resources.
Both single and multi-resource datasets will work fine, but the latter is preferred as it makes the import process faster for the user. Example of the 2 different options:

{
  "143771": {
    "title": "China - Guizhou: Heatlth",
    "nid": "143771",
    "field_resources": {
      "und": [
        { "target_id": "143772" }
      ]
    }
  },
  "144180": {
    "title": "China - Ghuizou: POI",
    "nid": "144180",
    "field_resources": {
      "und": [
        { "target_id": "144181" },
        { "target_id": "144182" },
        { "target_id": "144183" }
      ]
    }
  }
}
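
To illustrate how both shapes can be handled the same way, here is a small sketch that lists every resource id per dataset from a search result like the one above. The function name is hypothetical and the sample is a trimmed copy of that result; it is not the code that was merged.

```python
from typing import Any, Dict, List


def resource_ids_by_dataset(search_result: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each dataset nid to the list of its resource target ids."""
    result: Dict[str, List[str]] = {}
    for nid, dataset in search_result.items():
        refs = dataset.get("field_resources", {}).get("und", [])
        result[nid] = [ref["target_id"] for ref in refs]
    return result


# Trimmed copy of the example above.
sample = {
    "143771": {"nid": "143771", "field_resources": {"und": [{"target_id": "143772"}]}},
    "144180": {
        "nid": "144180",
        "field_resources": {
            "und": [
                {"target_id": "144181"},
                {"target_id": "144182"},
                {"target_id": "144183"},
            ]
        },
    },
}

print(resource_ids_by_dataset(sample))
# {'143771': ['143772'], '144180': ['144181', '144182', '144183']}
```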

Admin boundaries

Correct format and working perfectly.

Road-network

The entry we found in the catalog is in GeoJSON format. The road-network must be osm-xml as specified in the help.
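
A quick way to tell the two formats apart before import could look like the sketch below. It is an assumption-laden illustration (the function name is made up, and it only inspects the top-level structure): OSM XML has an `osm` root element, while GeoJSON is a JSON object whose `type` is `FeatureCollection` or `Feature`.

```python
import json
import xml.etree.ElementTree as ET


def detect_road_network_format(text: str) -> str:
    """Return 'osm-xml', 'geojson' or 'unknown' based on the top-level structure."""
    stripped = text.lstrip()
    if stripped.startswith("<"):
        try:
            root = ET.fromstring(stripped)
        except ET.ParseError:
            return "unknown"
        return "osm-xml" if root.tag == "osm" else "unknown"
    if stripped.startswith("{"):
        try:
            doc = json.loads(stripped)
        except json.JSONDecodeError:
            return "unknown"
        if isinstance(doc, dict) and doc.get("type") in ("FeatureCollection", "Feature"):
            return "geojson"
    return "unknown"
```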

Population data

There were no entries for population data in the catalog. Please upload it and provide us with a link.

The population data must be a GeoJSON file where each feature has at least a population estimate property with an integer value. To make the import process as easy as possible, the resource metadata should include a tag that specifies which attributes are related to population. As we're not very familiar with CKAN, we can work with you to figure out how to best incorporate this data.
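
As a rough sketch of the requirement above, the check below returns the indexes of features that lack an integer population value. The property name `population` is only a placeholder, since the actual attribute name would come from the resource metadata once we agree on how to tag it.

```python
from typing import Any, Dict, List


def invalid_population_features(
    collection: Dict[str, Any],
    population_key: str = "population",  # placeholder name, not agreed yet
) -> List[int]:
    """Return indexes of features whose population property is missing or not an integer."""
    bad: List[int] = []
    for index, feature in enumerate(collection.get("features", [])):
        properties = feature.get("properties") or {}
        value = properties.get(population_key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            bad.append(index)
    return bad
```

An empty list means every feature satisfies the minimum requirement.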

@qli1205

danielfdsilva mentioned this issue Apr 17, 2018

Add support for wbcatalog source #216

Merged

danielfdsilva mentioned this issue Sep 24, 2018

WB Catalog fixes #234

Merged
