medialab/ricardo_data

The RICardo dataset

The RICardo dataset compiles bilateral international trade flows of the 19th century from trade statistics sources (primary sources, secondary sources and recent estimations).

We created a web application to visually explore this dataset. This application is not only a final product but also a research tool: it helped us curate the dataset by providing data quality feedback, and it supports research work.

This dataset is meant to evolve. You can follow our work on the RICardo hypotheses.org blog.

To learn more about this dataset

Dedinger, Béatrice, and Paul Girard. 2017. "Exploring trade globalization in the long run: The RICardo project". Historical Methods: A Journal of Quantitative and Interdisciplinary History 50 (1): 30-48. doi:10.1080/01615440.2016.1220269.
the paper at Historical Methods
our preprint version: 01-May-2016

Get the data

To download the data, you can:

  • use the published dataset, available from the DOI below or in the release section;
  • to get the latest data version, clone this repository and use a database script to combine the data (see the dedicated section).

How to cite

DOI: 10.5281/zenodo.1119592
Béatrice Dedinger, & Paul Girard. (2017). RICardo dataset 2017.12 (Version 2017.12) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.1119592

License

The RICardo dataset is made available under the Open Database License: http://opendatacommons.org/licenses/odbl/1.0/. Any rights in individual contents of the database are licensed under the Database Contents License: http://opendatacommons.org/licenses/dbcl/1.0/

Compile the dataset

If you want the latest data version, processed by our deduplication algorithm, you can use the scripts in database_scripts.

First prepare your python environment:

$ pyenv virtualenv 3.8 ricardo_data
$ pyenv activate ricardo_data
$ pip install -r requirements.txt

Only the pip install step is mandatory, but using pyenv and virtualenv is strongly recommended.

Aggregate the many data/flows/source(s).csv files into a single data/flows.csv:

$ cd database_scripts
$ python flows.py aggregate

Deduplicate trade flows (primary sources, general/special...):

$ python flows.py deduplicate

This script outputs a sqlite database and RICardo_trade_flows_deduplicated.csv. These deduplicated data are the ones used in our data exploration website.
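As a quick sanity check after deduplication, the CSV output can be inspected with the Python standard library alone. The sketch below is only an illustration: the helper name summarize_csv is hypothetical, the file is assumed to sit in the current directory, and no column names are assumed since they are read from the header row.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, row_count) for a comma-separated UTF-8 file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Iterating consumes the header first, then counts the data rows.
        row_count = sum(1 for _ in reader)
        columns = list(reader.fieldnames or [])
    return columns, row_count


if __name__ == "__main__":
    # Assumed location: adjust to wherever flows.py wrote the file.
    output = Path("RICardo_trade_flows_deduplicated.csv")
    if output.exists():
        columns, row_count = summarize_csv(output)
        print(f"{row_count} flows, columns: {', '.join(columns)}")
    else:
        print(f"{output} not found; run 'python flows.py deduplicate' first")
```

The same helper works on any file under ./data, since they share the same csv conventions (utf-8, comma separated).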

Repository structure

./data

Data are provided in csv format (utf-8, comma separated):

  • flows/: the trade flows transcribed from sources, one CSV file per source. See ./database_scripts to learn how to combine the flow data
  • sources.csv: volumes of statistics, books or research papers used to compile the flows table
  • RICentities.csv: RICentities are the unified nomenclature of trade reporting and partner names
  • RICentities_group.csv: some RICentities are of type 'group'. This table shows which entities are part of RICentities groups
  • entity_names.csv: this table documents how the original partner and reporting names found in sources have been translated into the unified nomenclature
  • exchange_rates.csv: exchange rates used to convert trade flows to pound sterling
  • currencies.csv: currencies translation table
  • expimp_spegen.csv: export/import and special/general translation table

The precise format (list of type of fields) of those csv files is described in the datapackage.json file. Learn more about data packages on the frictionless data website.
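To see which fields each csv file declares, the descriptor can be read directly. The following is a minimal sketch that relies only on the generic Data Package layout (a "resources" list whose entries carry a "schema" with "fields"); the function name describe_datapackage is hypothetical, and the descriptor is assumed to be reachable as datapackage.json from the current directory.

```python
import json
from pathlib import Path


def describe_datapackage(descriptor):
    """Map each resource name to a list of (field name, field type) pairs."""
    description = {}
    for resource in descriptor.get("resources", []):
        schema = resource.get("schema", {})
        # A schema may also be given as a path or URL string; skip those here.
        fields = schema.get("fields", []) if isinstance(schema, dict) else []
        label = resource.get("name") or str(resource.get("path"))
        description[label] = [
            (field.get("name"), field.get("type")) for field in fields
        ]
    return description


if __name__ == "__main__":
    path = Path("datapackage.json")  # assumed location of the descriptor
    if path.exists():
        descriptor = json.loads(path.read_text(encoding="utf-8"))
        for name, fields in describe_datapackage(descriptor).items():
            print(name)
            for field_name, field_type in fields:
                print(f"  {field_name}: {field_type}")
    else:
        print(f"{path} not found")
```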

./database_scripts

This folder contains some Python and bash scripts, including flows.py, which aggregates and deduplicates the flows (see "Compile the dataset" above).

More to be documented soon...

Deprecated Python 2 scripts yet to be ported

  • RICardo_sqlite_creation.py: compiles the data csv files into a sqlite database (see RICardo_schema.sql)
  • update_csv_from_sqlite.py: updates the data folder from the RICardo sqlite database. This script is used to update the data folder after data have been edited in batch through sql queries. Some examples of such queries can be found in the update_data_scripts folder.
  • test folder: a series of Python scripts which apply automatic tests to the RICardo_viz.sqlite database. They output various data quality reports in the out_data folder
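For a first look at a sqlite database such as RICardo_viz.sqlite without relying on the deprecated scripts, Python's built-in sqlite3 module is enough. This is a hedged sketch: the helper names list_tables and count_rows are hypothetical, the database path is an assumption, and table names are discovered at run time rather than taken from RICardo_schema.sql.

```python
import os
import sqlite3


def list_tables(connection):
    """Return the names of all tables in the SQLite database, sorted."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def count_rows(connection, table):
    """Count the rows of one table, accepting only names that exist."""
    # Table names cannot be bound as SQL parameters, so validate them first.
    if table not in list_tables(connection):
        raise ValueError(f"unknown table: {table}")
    return connection.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


if __name__ == "__main__":
    path = "RICardo_viz.sqlite"  # assumed location of the database file
    if os.path.exists(path):
        connection = sqlite3.connect(path)
        try:
            for table in list_tables(connection):
                print(f"{table}: {count_rows(connection, table)} rows")
        finally:
            connection.close()
    else:
        print(f"{path} not found")
```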

./update_data_scripts

This folder documents the data update sessions: original files, data update sql queries, notes... Note that not all modifications are listed in this folder. For an exhaustive record of the changes made to the data, use the git history.

Supported by

This work has been supported by the Agence Nationale de la Recherche under the reference RICARDO ANR-06-BLAN-0332 and by the Sciences Po Scientific Advisory Board.
Sciences Po, médialab     Sciences Po, Centre d'Histoire     funded by l'Agence Nationale de la recherche