genesapi-cli

A command-line interface to download, process and index German public statistics data from GENESIS instances like www.regionalstatistik.de into a JSON format that can then be loaded via Logstash into an Elasticsearch index.

This package takes a modular approach so that it can be used for different purposes when working with German public statistics data from GENESIS.

For example, it is used within (and developed for) the datengui.de tech stack to provide a GraphQL API that feeds the website.

install

genesapi-cli requires Python 3.

We recommend managing your Python packages via pip and virtual environments.

FIXME: genesapi-cli relies on a fork of regenesis that needs to be installed manually beforehand; pull requests to hook this properly into our setup.py are welcome ;)

pip install -e git+git@github.com:datenguide/regenesis.git#egg=regenesis

pip install -e git+git@github.com:datenguide/genesapi-cli.git#egg=genesapi

This will install the genesapi command-line interface and all of its requirements, such as pandas (see the pandas install documentation if you run into trouble during installation).

Note: Although the truly great regenesis package is a bit outdated, genesapi only imports some parts of it, so don't worry if you can't get regenesis running by yourself locally. But if you do get it running, its creator @pudo and the datenguide team would be very happy about pull requests :-)

After installing, you should be able to type this into your command line:

genesapi -h

usage

Wait: If you ended up here because you just want to set up a small local instance of the genesapi Flask app that exposes the GraphQL API, follow the steps described there.

tl;dr

The complete process to download the full GENESIS data and load it into an Elasticsearch index is listed below – continue reading the Full Documentation if you need to understand exactly what's going on.

Before actually executing this, read at least the notes below this list.

mkdir ./data/
CATALOG=catalog.yml genesapi fetch ./data/
genesapi build_regions ./data/ > regions.json
genesapi build_schema ./data/ > schema.json
genesapi build_es_template ./schema.json > template.json
curl -H 'Content-Type: application/json' -XPOST http://localhost:9200/_template/genesapi -d@template.json
genesapi jsonify ./data/ | logstash -f logstash.conf
genesapi status --host localhost:9200 --index genesapi > status.csv

This downloads roughly 1.2 GB from the GENESIS SOAP API. The Elasticsearch index will be around 8 GB in the end. The SOAP API is very slow, so the complete download (currently 2278 single CSV files, aka cubes) takes several hours, as does the indexing into Elasticsearch.

Plus, most of the scripts make use of Python 3's built-in multiprocessing, so all cores of the machine running these commands will be under heavy load most of the time.

It is also highly recommended to make yourself a bit familiar with Elasticsearch and Logstash and how to tune them for your machine(s).
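
For unattended runs, the steps above can be wrapped in a small driver script. Below is a minimal sketch, assuming genesapi, curl and logstash are on your PATH, a catalog.yml sits in the working directory, and Elasticsearch listens on localhost:9200; this script is not part of genesapi itself.

# pipeline.py: a sketch of the tl;dr steps above
import os
import subprocess

env = dict(os.environ, CATALOG='catalog.yml')
os.makedirs('./data/', exist_ok=True)

subprocess.run(['genesapi', 'fetch', './data/'], env=env, check=True)

# tasks that print their result to stdout get redirected into files
for cmd, outfile in [
    (['genesapi', 'build_regions', './data/'], 'regions.json'),
    (['genesapi', 'build_schema', './data/'], 'schema.json'),
    (['genesapi', 'build_es_template', './schema.json'], 'template.json'),
]:
    with open(outfile, 'w') as f:
        subprocess.run(cmd, stdout=f, check=True)

# apply the index template, then pipe the facts straight into logstash
subprocess.run(['curl', '-H', 'Content-Type: application/json', '-XPOST',
                'http://localhost:9200/_template/genesapi',
                '-d@template.json'], check=True)
jsonify = subprocess.Popen(['genesapi', 'jsonify', './data/'],
                           stdout=subprocess.PIPE)
subprocess.run(['logstash', '-f', 'logstash.conf'],
               stdin=jsonify.stdout, check=True)
jsonify.stdout.close()
jsonify.wait()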

Example data

To play around with this package without downloading from the GENESIS API, there is some pre-downloaded data in the example/ folder of this repo.

Full Documentation

This package is split into several small tasks that can all be invoked via

genesapi <task> arg1 --kwarg1 foo

The tasks are all steps within the general "data pipeline" that downloads raw CSV data cubes and transforms them into a JSON-serializable format.

Tasks:

  1. fetch
  2. build_schema
  3. build_regions
  4. build_markdown
  5. build_es_template
  6. jsonify
  7. status

For transforming CSV data cubes to JSON facts, only fetch and jsonify are necessary.

To provide the context data & schema definitions needed to load the jsonified data into Elasticsearch, the other tasks are used in between (see below).

Logging

genesapi prints a lot to stdout (for instance the jsonified facts, so that they can easily be piped to Logstash), so logging happens on stderr.

You can adjust the logging level (default: INFO) to any valid python logging level, for example:

genesapi --logging DEBUG <task> <args>

fetch

Download CSV data (aka cubes) from a GENESIS instance like www.regionalstatistik.de and store them somewhere in the local filesystem.

Cubes are stored as revisions: if a cube is updated, the old one will still be present in the filesystem. The fetch command also stores some meta information about the download process in the given directory.

Create a catalog in YAML format (see example/catalog.yml for details; basically, just put in your credentials) and pass its path via the environment variable CATALOG.

usage: genesapi fetch [-h] [--new] [--prefix PREFIX] storage

positional arguments:
  storage          Directory where to store cube data

optional arguments:
  -h, --help       show this help message and exit
  --new            Initialize Storage if it doesn't exist and start
                   downloading
  --prefix PREFIX  Prefix of cube names to restrict downloading, e.g. "111"

Example:

CATALOG=catalog.yml genesapi fetch ./data/cubes/

prefix

You can filter for a prefix of the cube names with the --prefix option.

To retrieve only cubes for the statistic id "11111":

CATALOG=catalog.yml genesapi fetch ./data/cubes/ --prefix 11111

jsonify

Transform downloaded cubes (CSV files) into facts (JSON lines).

A fact is a unique value for a specific topic at a specific location at a specific point in time (or timespan):

  • value: a number, either int or float
  • topic: a broader topic like "work statistics" described with a combination of measures and their dimensions, e.g. "Gender: Female, Age: 15 to 20 years"
  • location: Germany itself or a state, district or municipality in Germany.
  • time: either a year or a specific date.

For example: the number of flats (WOHNY1) that have 5 rooms (WHGGR1 = WHGRME05) in the German municipality Baddeckenstedt (id: 03158402) in the state of Niedersachsen in 2016 (year) was 1120 (value).

In the current implementation, this fact looks like this in JSON:

{
    "year" : "2016",
    "WOHNY1" : {
        "error" : "0",
        "quality" : "e",
        "locked" : "",
        "value" : 1120
    },
    "STAG" : {
        "until" : "2016-12-31T23:59:59",
        "value" : "31.12.2016",
        "from" : "2016-12-31T00:00:00"
    },
    "fact_id" : "394ce1e5e76fdb9599c46ecbb3db6c8f8ae09c33",
    "id" : "03158402",
    "cube" : "31231GJ006",
    "GEMEIN" : "03158402",
    "WHGGR1" : "WHGRME05",
    "lau" : 1
}

See this fact query at api.genesapi.org

You can either store the facts as JSON files on disk or directly pipe them to Logstash.

Note: Storing a whole JSON-serialized GENESIS dump to disk requires a lot of time and space. The option to store the facts as JSON files exists more for debugging purposes or to share serialized subsets of the data across devices or people. We recommend directly piping to Logstash if you want to feed a complete Elasticsearch index (which takes a lot of time and space, too...).
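
Because each record is a single JSON object per line, the output can be consumed as a stream without loading everything into memory. A minimal sketch for reading facts that were written to disk with --output (the *.json glob is an assumption; adapt it to the files you actually get):

# read_facts.py: stream facts produced by `genesapi jsonify --output ./facts/ ...`
import json
from pathlib import Path

for path in Path('./facts/').glob('**/*.json'):
    with open(path) as f:
        for line in f:                     # one fact per line
            fact = json.loads(line)
            print(fact['fact_id'], fact['cube'], fact['id'])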

usage: genesapi jsonify [-h] [--output OUTPUT] [--pretty] storage

positional arguments:
  storage          Directory with raw cubes downloaded via the `fetch` command

optional arguments:
  -h, --help       show this help message and exit
  --output OUTPUT  Output directory. If none, print each record per line to
                   stdout
  --pretty         Print pretty indented json (for debugging purposes)

How to use this command to feed an Elasticsearch index

Download and install Logstash, then use the Logstash config in this repo.

genesapi jsonify cubes | logstash -f logstash.conf

See here a more detailed description of how to set up an Elasticsearch cluster for genesapi.
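
If you'd rather not run Logstash at all, the facts can also be pushed to Elasticsearch's _bulk endpoint directly. The sketch below is a hand-rolled alternative, not part of the genesapi tooling; the index name genesapi and host localhost:9200 are assumptions, and on Elasticsearch 6.x you may need a document type in the URL (e.g. .../genesapi/doc/_bulk).

# bulk_index.py: a hand-rolled alternative to the Logstash setup
import json
import sys
import urllib.request

URL = 'http://localhost:9200/genesapi/_bulk'   # assumed index name
BATCH = 1000                                   # facts per bulk request

def flush(buf):
    req = urllib.request.Request(URL, data=''.join(buf).encode('utf-8'),
                                 headers={'Content-Type': 'application/x-ndjson'})
    urllib.request.urlopen(req)

buf = []
for line in sys.stdin:
    fact = json.loads(line)
    # one action line plus one source line per fact
    buf.append(json.dumps({'index': {'_id': fact['fact_id']}}) + '\n')
    buf.append(line if line.endswith('\n') else line + '\n')
    if len(buf) >= 2 * BATCH:
        flush(buf)
        buf = []
if buf:
    flush(buf)

Usage: genesapi jsonify ./data/cubes/ | python bulk_index.py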

build_regions

Create an id => region mapping for all regions in JSON format.

{
    "08425": {
        "id": "08425", // AGS for the region
        "name": "Alb-Donau-Kreis", // Nicely formated name of the region
        "type": "Landkreis", // Type of region (e.g. Kreisfreie Stadt, Regierungsbezirk)
        "level": 3, // NUTS level (1-3), LAU (4)
        "duration": {
            "from": "2012-01-01", // ISO dates for earliest available statistical measure
            "until": "2019-12-31"  // ISO dates for latest available statistical measure
        }
    },
}
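
Once saved to disk, the mapping is handy for quick lookups. A minimal sketch, assuming the output was redirected to regions.json:

import json

with open('regions.json') as f:
    regions = json.load(f)

print(regions['08425']['name'])   # Alb-Donau-Kreis

# e.g. all NUTS-3 regions
districts = [r for r in regions.values() if r['level'] == 3]
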
usage: genesapi build_regions [-h] storage

positional arguments:
  storage     Directory with raw cubes downloaded via the `fetch` command

optional arguments:
  -h, --help  show this help message and exit

Example:

genesapi build_regions ./data/cubes/ > regions.json

build_schema

The schema is needed for the Flask app and for the tasks build_es_template and build_markdown.

This command grabs the raw cubes and extracts the measures ("Merkmal") structure out of them into a JSON format printed to stdout.

usage: genesapi build_schema [-h] directory

positional arguments:
  directory             Directory with raw cubes downloaded via the `fetch`
                        command

optional arguments:
  -h, --help            show this help message and exit

Example:

genesapi build_schema ./data/cubes/ > schema.json
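
The exact layout of schema.json is easiest to learn by looking at real output, but since it is plain JSON you can get a quick overview with the standard library. A minimal sketch (the only assumption is that the top-level keys are the extracted measures):

import json

with open('schema.json') as f:
    schema = json.load(f)

# list the extracted top-level entries
for key in sorted(schema):
    print(key)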

build_es_template

Create a template mapping for Elasticsearch, based on the schema from build_schema.

usage: genesapi build_es_template [-h] [--index INDEX] [--shards SHARDS]
                                  [--replicas REPLICAS]
                                  schema

positional arguments:
  schema               JSON file from `build_schema` output

optional arguments:
  -h, --help           show this help message and exit
  --index INDEX        Name of elasticsearch index
  --shards SHARDS      Number of shards for elasticsearch index
  --replicas REPLICAS  Number of replicas for elasticsearch index

Example:

genesapi build_es_template ./data/schema.json > template.json

Apply this template (the index name genesapi could be anything):

curl -H 'Content-Type: application/json' -XPOST http://localhost:9200/_template/genesapi -d@template.json

See here a more detailed description of how to set up an Elasticsearch cluster for genesapi.

build_markdown

Export each measure from the schema to a markdown file with frontmatter that can be used to generate a documentation page powered by Jekyll or Gatsby.

usage: genesapi build_markdown [-h] schema output

positional arguments:
  schema      JSON file from `build_schema` output
  output      Output directory.

optional arguments:
  -h, --help  show this help message and exit

Example:

genesapi build_markdown ./data/schema.json ../path-to-my-jekyll/_posts/
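
To check what was exported, the frontmatter (conventionally delimited by ---) can be split off without assuming any particular field names. A minimal sketch (the *.md glob and the output path are assumptions):

from pathlib import Path

for path in Path('../path-to-my-jekyll/_posts/').glob('*.md'):
    text = path.read_text()
    if text.startswith('---'):
        frontmatter = text.split('---', 2)[1]   # between the first two --- markers
        print(path.name)
        print(frontmatter.strip())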

status

Obtain metadata for cubes in the storage like last downloaded, last exported, number of facts...

Optionally retrieve the number of facts for each cube from Elasticsearch to compare.

usage: genesapi status [-h] [--host HOST] [--index INDEX] storage

positional arguments:
  storage        Directory to storage

optional arguments:
  -h, --help     show this help message and exit
  --host HOST    Elastic host:port to obtain stats from
  --index INDEX  Elastic index

Example:

genesapi status regionalstatistik --host localhost:9200 --index genesapi > status.csv
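
The output is CSV printed to stdout, so the resulting file can be inspected with the standard library without assuming specific column names. A minimal sketch:

import csv

with open('status.csv') as f:
    for row in csv.DictReader(f):
        print(row)   # one dict per cube, keyed by the CSV header row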

Storage

The storage manages cube data on disk, downloads from webservices, and exports to JSON facts.

It can be created and updated with the fetch command (see above).

It allows partial updates (when cubes change).

All information is stored in the filesystem, so there is no need for an extra database to keep track of the status of the cubes.

A Storage has a base directory with this layout:

./
    webservice_url                  -   plain text file containing the webservice url used
    last_updated                    -   plain text file containing date in isoformat
    last_exported                   -   plain text file containing date in isoformat
    logs/                           -   folder for keeping logfiles
    11111BJ001/                     -   directory for cube name "11111BJ001"
        last_updated                -   plain text file containing date in isoformat
        last_exported               -   plain text file containing date in isoformat
        current/                    -   symbolic link to the latest revision directory
        2019-08-07T08:40:20/        -   revision directory for given date (isoformat)
            downloaded              -   plain text file containing date in isoformat
            exported                -   plain text file containing date in isoformat
            meta.yml                -   original metadata from webservice in yaml format
            data.csv                -   original csv data for this cube
        2017-06-07T08:40:20/        -   an older revision...
            ...
    11111BJ002/                     -   another cube...
        ...
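
Because everything is plain files and directories, a storage can be inspected with a few lines of Python. A sketch based on the layout above:

from pathlib import Path

storage = Path('./data/cubes/')
for cube_dir in sorted(storage.iterdir()):
    if not cube_dir.is_dir() or cube_dir.name == 'logs':
        continue
    last_updated = (cube_dir / 'last_updated').read_text().strip()
    current = (cube_dir / 'current').resolve()   # symlink to the latest revision
    print(cube_dir.name, last_updated, current.name)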
