Data catalog integration #214

Open
danielfdsilva opened this issue Apr 9, 2018 · 7 comments

@danielfdsilva
Collaborator

From the proposal:


Develop an integration with the WB Data Catalog that makes it easier to discover and use input data prepared by others.

  1. Users upload data to the Data Catalog and tag it with a specific tag (this will be well documented);
  2. When setting up a project, the users can indicate whether they want to pull data from the Data Catalog;
  3. When they select the Data Catalog as the source, the platform will fetch data and prepare it for use by RAM.

The development of this feature depends on the ability to programmatically access and retrieve data from the Data Catalog. Most importantly:

  • List datasets by tag and collection (e.g. RAM-ready, road-network, population);
  • Retrieve datasets.
@danielfdsilva
Collaborator Author

danielfdsilva commented Apr 10, 2018

The flow for the data catalog integration will be:

Important notes:

  • The tags used in the datasets must be one of ram-poi, ram-origins, ram-road-network, ram-profile, ram-admin depending on the data type of the resource.
  • We're only going to consider the first item of the resources field.
  • The resource is expected to be in the format needed by RAM.
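
A minimal sketch of these notes in Python. The tag names are the ones listed above; the function names and the dataset shape (resources listed under field_resources -> und as target_id entries, as in the catalog responses quoted further down this thread) are assumptions for illustration, not the actual RAM implementation.

```python
# Tag names come from the notes above; everything else is illustrative.
RAM_TAGS = {
    "poi": "ram-poi",
    "origins": "ram-origins",
    "road-network": "ram-road-network",
    "profile": "ram-profile",
    "admin": "ram-admin",
}


def tag_for_type(file_type):
    """Return the catalog tag expected for a RAM file type."""
    try:
        return RAM_TAGS[file_type]
    except KeyError:
        raise ValueError(f"Unknown RAM file type: {file_type!r}")


def first_resource_id(dataset):
    """Return the id of the dataset's first resource, or None if it has none.

    Assumes resources are listed under field_resources -> und.
    """
    resources = (dataset.get("field_resources") or {}).get("und", [])
    return resources[0]["target_id"] if resources else None
```

Only the first entry is read, matching the note that the remaining items of the resources field are ignored.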

There are a couple of things we need to move this forward, namely:

  • Add the tags ram-poi, ram-origins, ram-road-network, ram-profile, ram-admin to the system and provide us with their ids.
  • Upload sample data to use during development.

@danielfdsilva
Collaborator Author

danielfdsilva commented Apr 12, 2018

Server implementation for data catalog

  • The server will have an endpoint that returns the options for a given file type (poi, origins, road-network, profile, admin): /projects/setup-options.
  • When the endpoint is queried, the system will check the internal cache for data. If the data is recent enough it is returned; otherwise it is fetched from the World Bank Data Catalog and stored for subsequent requests, which keeps the application fast.
  • When the project/scenario creation form is submitted, the system will check whether the selected option is still available. If it isn't, an error is returned and the cache is cleared, ensuring that the next request gets fresh data.
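
The caching behaviour described above could be sketched roughly as follows, in Python. This is an in-memory stand-in, not the server code: the class name, the max_age_seconds default, the injected fetch callable, and the choice to validate against the currently cached options are all assumptions.

```python
import time


class OptionsCache:
    """In-memory stand-in for the cache table, keyed by file type."""

    def __init__(self, fetch, max_age_seconds=3600, clock=time.time):
        self._fetch = fetch              # callable(file_type) -> list of option dicts
        self._max_age = max_age_seconds  # assumed threshold for "recent enough"
        self._clock = clock              # injectable so tests can control time
        self._entries = {}               # file_type -> (stored_at, options)

    def get_options(self, file_type):
        """Return cached options if fresh, otherwise fetch and store them."""
        entry = self._entries.get(file_type)
        if entry is not None and self._clock() - entry[0] < self._max_age:
            return entry[1]
        options = self._fetch(file_type)
        self._entries[file_type] = (self._clock(), options)
        return options

    def validate_selection(self, file_type, resource_id):
        """Check a submitted option; clear the cache for that type if it is missing."""
        options = self.get_options(file_type)
        if any(o["resource_id"] == resource_id for o in options):
            return True
        # Selection not available: drop the entry so the next request refetches.
        self._entries.pop(file_type, None)
        return False
```

A caller would turn a False result from validate_selection into the error response mentioned above.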

Cache table schema:

field         type
id            integer
type          string
dataset_id    string
dataset_name  string
resource_id   string
resource_url  string
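
For illustration, the schema above could be created like this with Python's built-in sqlite3 module. The table name wbcatalog_cache, the use of SQLite, and the primary key on id are assumptions; only the column names and types come from the table above.

```python
import sqlite3

# Column names and types follow the schema above; the table name is hypothetical.
CREATE_CACHE_TABLE = """
CREATE TABLE IF NOT EXISTS wbcatalog_cache (
    id INTEGER PRIMARY KEY,
    type TEXT,
    dataset_id TEXT,
    dataset_name TEXT,
    resource_id TEXT,
    resource_url TEXT
)
"""


def create_cache_table(conn):
    """Create the cache table on an open sqlite3 connection."""
    conn.execute(CREATE_CACHE_TABLE)
    conn.commit()
```

Note that the schema as listed has no timestamp column, so this sketch does not say how "recent enough" would be determined.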

@olafveerman
Contributor

IDs of the tags:

tag id
ram-rn 1412
ram-admin 1413

We still have to figure out the IDs of the other tags (population and POI).

The following endpoint can be used to filter by type:

https://datacatalog.worldbank.org/search-service/search_api/datasets?limit=10&filter[field_tags]=1412
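
A small Python sketch of building that URL for any tag id. The endpoint and the limit and filter[field_tags] parameters are taken from the URL above; the function name is hypothetical.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = (
    "https://datacatalog.worldbank.org/search-service/search_api/datasets"
)


def search_url(tag_id, limit=10):
    """Build the search URL that filters datasets by a tag id."""
    # safe="[]" keeps the square brackets unescaped, as in the URL above.
    query = urlencode({"limit": limit, "filter[field_tags]": tag_id}, safe="[]")
    return f"{SEARCH_ENDPOINT}?{query}"
```

Calling search_url(1412) reproduces the URL above; search_url(1413) would give the equivalent query for ram-admin.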

cc @danielfdsilva

@olafveerman
Contributor

This feature is ready to be tested. For this we'd need the following:

  • the tag IDs of ram-poi and ram-origins
  • data that has been tested in RAM, uploaded to the Data Catalog. This will allow us to run the final tests before merging the PR.

@qli1205 Can you help us with this?

@olafveerman
Contributor

Todo:

  • test if we can import a single POI dataset with multiple resources. Clara will upload a dataset for us to test with
  • Clara will also provide links to a POI and origin dataset so we can determine the id of ram-poi and ram-origins

@qli1205

@olafveerman
Contributor

To test the POI dataset, we can use the following: https://datacatalog.worldbank.org/dataset/china-ghuizou-poi

@danielfdsilva
Collaborator Author

We were able to test the data catalog workflow with the new data you uploaded. The POI and Admin Boundaries work well, but we need input from your side on the Road Network and Origins.

Please see our notes below.

POI

Thanks for uploading the new POI data. We implemented a way to consume a single dataset with multiple resources.
Both single-resource and multi-resource datasets will work fine, but the latter is preferred as it makes the import process faster for the user. Example of the two options:

{
  "143771": {
    "title": "China - Guizhou: Heatlth",
    "nid": "143771",
    "field_resources": {
      "und": [
        { "target_id": "143772" }
      ]
    }
  },
  "144180": {
    "title": "China - Ghuizou: POI",
    "nid": "144180",
    "field_resources": {
      "und": [
        { "target_id": "144181" },
        { "target_id": "144182" },
        { "target_id": "144183" }
      ]
    }
  }
}
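
As a rough Python sketch, the resource ids of both kinds of dataset can be read the same way. The response shape is the one shown above (trimmed here); the function name is hypothetical and this is not the actual import code.

```python
def resource_ids_by_dataset(response):
    """Map each dataset nid to the list of its resource ids."""
    result = {}
    for nid, dataset in response.items():
        resources = (dataset.get("field_resources") or {}).get("und", [])
        result[nid] = [resource["target_id"] for resource in resources]
    return result


# Trimmed copy of the response above: one single-resource dataset
# and one multi-resource dataset.
sample = {
    "143771": {"field_resources": {"und": [{"target_id": "143772"}]}},
    "144180": {
        "field_resources": {
            "und": [
                {"target_id": "144181"},
                {"target_id": "144182"},
                {"target_id": "144183"},
            ]
        }
    },
}

ids = resource_ids_by_dataset(sample)
```

Here ids maps "143771" to one resource id and "144180" to three, so a multi-resource POI dataset can be imported in a single step.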

Admin boundaries

Correct format and working perfectly.

Road-network

The entry we found in the catalog is in GeoJSON format. The road network must be in OSM XML format, as specified in the help.

Population data

There were no entries for population data in the catalog. Please upload it and provide us with a link.

The population data must be a GeoJSON file in which each feature has at least one property holding a population estimate as an integer value. To make the import process as easy as possible, the resource metadata should include a tag that specifies which attributes are related to population. As we're not very familiar with CKAN, we can work with you to figure out how best to incorporate this data.
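
To make the requirement concrete, here is a minimal Python check for it. The property name passed as population_key (for example "population") is an assumption, since the attribute name is exactly what the metadata tag would have to specify; the function name is hypothetical too.

```python
def invalid_population_features(geojson, population_key):
    """Return the indexes of features lacking an integer population value."""
    bad = []
    for index, feature in enumerate(geojson.get("features", [])):
        value = (feature.get("properties") or {}).get(population_key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            bad.append(index)
    return bad
```

An empty result means every feature carries an integer under the given key; any returned index points at a feature that would need fixing before upload.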

@qli1205
