Set of examples to generate different types of metadata with R packages

juldebar/R_Metadata

Main releases of the code are available on Zenodo with this DOI:

DOI

Implementation of FAIR data management plans with R programming language and usual data sources in scientific context

This repository provides 3 example workflows that generate metadata (compliant with OGC standards) from different data sources using R scripts. Metadata can be pushed directly from R to a CSW server (e.g. Geonetwork), and data managed in a Postgres database can also be published in Geoserver (WMS/WFS) from R.

Each sub-folder contains an example workflow dedicated to a specific kind of data source:

  • Flat files: a CSV file or Google Spreadsheet is used to edit metadata for different datasets. The data structure of the spreadsheet relies on the main DCMI metadata elements (one column per type of element). This workflow can work either with a simple CSV file (local access) or with the same file stored in a collaborative environment, which facilitates online editing by multiple users without versioning issues (we give an example in a Google Spreadsheet).
  • SQL / Relational database / RDBMS: this workflow uses a "metadata" table (with a structure similar to the spreadsheet used in the previous workflow) and is only implemented for Postgres with PostGIS (other RDBMS, e.g. MySQL, could easily be added).
  • NetCDF files / OPeNDAP accessible on a Thredds server: this workflow can extract metadata from NetCDF files that are either stored locally or remotely accessible through the OPeNDAP protocol (e.g. from a Thredds server).

These R codes can be executed online

All the code can be executed online in the RStudio server provided by the D4science infrastructure. If you want to try it, please ask for a login (and briefly explain why).

Pre-requisites

Make sure that the following pre-requisites are met:

if (!require("pacman")) install.packages("pacman")
pacman::p_load(uuid,raster,ncdf4,gsheet,XML,devtools,RPostgreSQL,jsonlite,googleVis,rgeos,rgdal,sf)
devtools::install_github("openfigis/RFigisGeo")

If rgdal is not available for your version of R, install it from source or update your R version.

Installing the R packages on Linux might require the following underlying OS packages (tested on Debian / Ubuntu):

(sudo) apt-get install netcdf-bin libcurl4-openssl-dev  libssl-dev r-cran-ncdf4 libxml2-dev libgdal-dev gdal-bin libgeos-dev udunits-bin libudunits2-dev
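Before running a workflow, it can help to check which of the R packages listed above are still missing. The following is a minimal sketch using only base R; the helper name `missing_packages` is hypothetical and not part of this repository.

```r
# Packages listed in the pre-requisites above.
required <- c("uuid", "raster", "ncdf4", "gsheet", "XML", "devtools",
              "RPostgreSQL", "jsonlite", "googleVis", "rgeos", "rgdal", "sf")

# Hypothetical helper: return the names of the packages that are not installed.
# It only inspects the library paths and does not load any package.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

missing <- missing_packages(required)
if (length(missing) > 0) {
  message("Missing packages: ", paste(missing, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

If the list is not empty, the packages can be installed with the commands given above before moving on to Step 1.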

Step 1: Execute the default workflow (spreadsheet use case)

Once you have set up the execution environment (see the list of OS and R packages in the section above), it is recommended to start with the workflow that uses a Google Spreadsheet as a (meta)data source, since it is the easiest one to run. This will help you understand how to deal with the json configuration file as well as the logic shared by all workflows.

Just change a few lines in 2 files

Once done with pre-requisites (see previous section):

  • change the working directory in the main script of the workflow to match the actual (local) path of this github repository on your PC,
  • edit the content of the json configuration file template (there is one specific json file per workflow / type of data source) to specify how to connect to the components of your spatial data infrastructure and to give the URLs of the spreadsheets (storing metadata and related contacts):
    • if you want to use the BlueBridge / D4science infrastructure components (e.g. RStudio server, Geoserver / Geonetwork), you have to set the token of your personal account: you need to register first,
    • at this stage, it is recommended to keep the default URLs of the Google Spreadsheets (you will replace them with yours once you have checked that the workflow can be executed with its default settings),
    • set the credentials of your Geonetwork or CSW server (see here),
    • rename this file as follows: "workflow_configuration_Dublin_Core_gsheet.json",
  • execute the main script of the workflow, read the logs and check that Geonetwork is accessible from R.

If it works properly, all datasets described in the spreadsheet containing the Dublin Core metadata elements should be displayed as metadata sheets published in the Geonetwork / CSW server.
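The file preparation described in this step (renaming the configuration template and pointing R at the local copy of the repository) can be sketched in base R as follows. The folder path and the template file name below are hypothetical stand-ins; only the target name "workflow_configuration_Dublin_Core_gsheet.json" comes from the instructions above.

```r
# Hypothetical paths: adapt them to your local clone of the repository.
repo_dir      <- file.path(tempdir(), "R_Metadata")  # stand-in for the local clone
template_file <- file.path(repo_dir, "workflow_configuration_template.json")  # hypothetical name
config_file   <- file.path(repo_dir, "workflow_configuration_Dublin_Core_gsheet.json")

# For this self-contained demo only: create the folder and an empty template.
dir.create(repo_dir, showWarnings = FALSE, recursive = TRUE)
writeLines("{}", template_file)

# Copy the template under the name expected by the workflow (the template is kept).
copied <- file.copy(template_file, config_file, overwrite = TRUE)
if (!copied) stop("Could not create ", config_file)

# Point the working directory at the repository folder, then restore it.
old_wd <- setwd(repo_dir)
message("Working directory: ", getwd())
setwd(old_wd)
```

Copying rather than renaming keeps the original template available in case the edited configuration needs to be reset.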

Usual Errors

  • Your token is not set although you use Geonetwork / Geoserver in the BlueBridge infrastructure
  • You are using emails in the metadata spreadsheet which are not declared in the contacts spreadsheet
  • You did not comply with the syntactic rules:
    • contacts: see related wiki section
    • provenance: see related wiki section
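The second error above (emails used in the metadata spreadsheet but not declared in the contacts spreadsheet) can be detected before running the workflow. The sketch below uses toy data frames in place of the two spreadsheets; the column names `contact` and `email` and the ";" separator are assumptions, not the actual structure used by this repository.

```r
# Toy stand-ins for the two spreadsheets; column names are hypothetical.
metadata <- data.frame(
  title   = c("Dataset A", "Dataset B"),
  contact = c("alice@example.org", "bob@example.org;carol@example.org"),
  stringsAsFactors = FALSE
)
contacts <- data.frame(
  email = c("alice@example.org", "bob@example.org"),
  stringsAsFactors = FALSE
)

# Split multi-valued cells (assumed ";" separator), trim whitespace, deduplicate.
used_emails <- unique(trimws(unlist(strsplit(metadata$contact, ";", fixed = TRUE))))

# Emails referenced in the metadata but missing from the contacts sheet.
undeclared <- setdiff(used_emails, contacts$email)
if (length(undeclared) > 0) {
  message("Undeclared contacts: ", paste(undeclared, collapse = ", "))
}
```

With the toy data above, only carol@example.org is reported, since it appears in the metadata but not in the contacts.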

Once this works, you can start tuning the workflow to plug in other data sources and use other contacts.

Step 2 : Tune the workflow to fit your needs

Once you have been able to execute the workflow with the provided templates and your SDI, you can customize the workflow to fit your specific needs.

Whatever the data source to be plugged in, the most important steps remain the same (see details in the previous section):

  • the modification of the main script,
  • the editing of the json configuration file templates (one specific json file per workflow / type of data source) to indicate how to connect to the components of your spatial data infrastructure and to give the URLs of the Google Spreadsheets you created.

Plug your data sources (spreadsheets, Postgres database, Thredds server) and your applications

When it works, you can try to execute the same workflow with your own spreadsheets, and the other workflows with additional data sources (Postgres and Thredds / NetCDF files).

  • you have created your own google spreadsheets to describe:
  • For the Postgres workflow, you have to specify how to use the additional applications:
    • set the credentials of your Postgres server,
    • set the credentials of your Geoserver, which will be used to make datasets available through the WMS / WFS access protocols,
    • execute the main script of the workflow and read the logs to check that the third-party applications (e.g. Postgres, Geonetwork, Geoserver) are accessible from R.

Postgres data source use case

In this case, it is required:

  • to specify the credentials to access the database in the configuration file workflow_configuration_Postgres_template.json: cf these lines,
  • to prepare the list of SQL queries with which your datasets can be physically extracted from the Postgres database (and stored as CSV files),
  • to specify a user who can create tables and views:
    • the metadata table, which describes the list of datasets for which we will create metadata (ISO 19115 in Geonetwork) and access protocols (OGC WMS/WFS from Geoserver),
    • one view per dataset, where columns are renamed as follows:
      • the name of the date column "AS date"
      • the name of the geometry column "AS geom"
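The column renaming described above can be illustrated by generating the SQL of one such view from R. This is a minimal sketch in base R, not the code used by the workflow: the helper `build_view_sql` and all table, view and column names are hypothetical, and identifiers are quoted naively (embedded double quotes are not escaped).

```r
# Hypothetical helper: build the SQL statement for one dataset view in which
# the date and geometry columns are exposed as "date" and "geom".
build_view_sql <- function(view_name, table_name, date_column, geom_column,
                           other_columns = character(0)) {
  quote_id <- function(x) sprintf('"%s"', x)  # naive identifier quoting
  select_list <- c(
    quote_id(other_columns),
    sprintf("%s AS date", quote_id(date_column)),
    sprintf("%s AS geom", quote_id(geom_column))
  )
  sprintf("CREATE OR REPLACE VIEW %s AS SELECT %s FROM %s;",
          quote_id(view_name),
          paste(select_list, collapse = ", "),
          quote_id(table_name))
}

# Example with made-up names.
sql <- build_view_sql("view_catches", "catches", "catch_date", "the_geom",
                      other_columns = c("species", "value"))
cat(sql, "\n")
```

The resulting string could then be sent to the database by a user allowed to create views, which is why the configuration needs such a user.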

(De)activation of the different steps

The different steps of the workflow can be (de)activated independently through the values of the "actions" listed in the json configuration file:

  "actions": {
    "create_metadata_table": false,
    "create_sql_view_for_each_dataset": true,
    "data_wms_wfs": true,
    "data_csv": false,
    "metadata_iso_19115": false,
    "metadata_iso_19110": false,
    "write_metadata_EML": false,
    "main": "write_Dublin_Core_metadata"
  }
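To illustrate how such flags can drive a script, the sketch below parses the "actions" block shown above with the jsonlite package (listed in the pre-requisites) and lists the steps that are switched on. How the workflow itself reads these values, and how it interprets the "main" entry, is not shown here; this is only an assumption-based example.

```r
library(jsonlite)

# Inline copy of the "actions" block above, wrapped in braces to form valid JSON.
config_json <- '{
  "actions": {
    "create_metadata_table": false,
    "create_sql_view_for_each_dataset": true,
    "data_wms_wfs": true,
    "data_csv": false,
    "metadata_iso_19115": false,
    "metadata_iso_19110": false,
    "write_metadata_EML": false,
    "main": "write_Dublin_Core_metadata"
  }
}'

config  <- fromJSON(config_json)
actions <- config$actions

# Keep only the true/false entries ("main" holds a name, not a flag).
flags   <- actions[vapply(actions, is.logical, logical(1))]
enabled <- names(flags)[unlist(flags)]
print(enabled)
```

A step can then be guarded with a test such as `if (isTRUE(actions$data_csv)) { ... }`, so that setting a value to false in the configuration file skips it.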

NetCDF / NCML (OPeNDAP / Thredds server) use case

Main scripts for metadata creation and publication

The most important scripts for metadata creation are the following
