The RICardo dataset

The RICardo dataset compiles bilateral international trade flows of the 19th century from trade statistics sources (primary sources, secondary sources and recent estimations).

We created a web application to explore this dataset visually. This application is not only a final product but also a research tool: it helped us curate the dataset by providing data quality feedback, and it supports research work.

This dataset is meant to evolve. You can follow our work on the RICardo hypothèses.org blog.

To learn more about this dataset

Dedinger, Béatrice, and Paul Girard. 2017. « Exploring trade globalization in the long run: The RICardo project ». Historical Methods: A Journal of Quantitative and Interdisciplinary History 50 (1): 30‑48. doi:10.1080/01615440.2016.1220269.
the paper at Historical Methods
our preprint version: 01-May-2016

Get the data

To download the data you can:

use the published dataset, available from the DOI below or from the release section;
get the latest data version by cloning this repository and using a database script to combine the data (see the dedicated section).

How to cite

Béatrice Dedinger, & Paul Girard. (2017). RICardo dataset 2017.12 (Version 2017.12) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.1119592

License

The RICardo dataset is made available under the Open Database License: http://opendatacommons.org/licenses/odbl/1.0/. Any rights in individual contents of the database are licensed under the Database Contents License: http://opendatacommons.org/licenses/dbcl/1.0/

Compile the dataset

To build the latest data version with our deduplication algorithm, use the scripts in database_scripts.

First prepare your Python environment:

$ pyenv virtualenv 3.8 ricardo_data
$ pyenv activate ricardo_data
$ pip install -r requirements.txt

Only the pip install step is mandatory, but using pyenv and virtualenv is strongly recommended.

Aggregate the many data/flows/source(s).csv files into one data/flows.csv:

$ cd database_scripts
$ python flows.py aggregate
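To illustrate what this aggregation step amounts to, here is a minimal sketch. It is not the actual flows.py implementation, which may filter or transform rows differently. It simply concatenates every CSV file found in a folder into one file, using the union of their column names. The function name aggregate_csv_folder is an assumption made for this example.

```python
import csv
from pathlib import Path


def aggregate_csv_folder(folder, output_path):
    """Concatenate all CSV files of `folder` into `output_path`.

    Hypothetical helper for illustration only: the real logic lives in
    flows.py and may differ. Returns the number of data rows written.
    """
    files = sorted(Path(folder).glob("*.csv"))

    # First pass: collect the union of column names, in first-seen order.
    fieldnames = []
    for path in files:
        with open(path, newline="", encoding="utf-8") as f:
            for name in csv.DictReader(f).fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)

    # Second pass: copy every row, leaving missing columns empty.
    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
        writer.writeheader()
        for path in files:
            with open(path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Calling aggregate_csv_folder("../data/flows", "flows_preview.csv") from the database_scripts folder would write a preview file without touching data/flows.csv; both paths are assumptions about where the command is run.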

Deduplicate trade flows (primary sources, general/special...):

$ python flows.py deduplicate

This script outputs a sqlite database and RICardo_trade_flows_deduplicated.csv. These deduplicated data are the ones used in our data exploration website.
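Once the sqlite database has been generated, a quick way to check its content is to list its tables and their row counts. The sketch below uses only Python's standard sqlite3 module and assumes nothing about the RICardo schema. The function name list_tables is hypothetical, and the path to the database file has to be supplied by the caller.

```python
import sqlite3


def list_tables(db_path):
    """Return {table_name: row_count} for every table of a SQLite file.

    Note: sqlite3.connect creates an empty database if `db_path` does not
    exist, in which case the result is an empty dict.
    """
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        counts = {}
        for name in names:
            # Quote the identifier so unusual table names do not break the query.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
    finally:
        conn.close()
```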

Repository structure

./data

Data are provided in csv format (utf-8, comma separated):

flows/: the trade flows transcribed from sources, one CSV file per source. See ./database_scripts to learn how to combine the flow data
sources.csv: volumes of statistics, books or research papers used to compile the flows table
RICentities.csv: RICentities are the unified nomenclature of trade reporting and partner names
RICentities_group.csv: some RICentities are of type 'group'. This table shows which entities are part of RICentities groups
entity_names.csv: this table documents how the original partner and reporting names found in sources have been translated into the unified nomenclature
exchange_rates.csv: exchange rates used to convert trade flows to pound sterling
currencies.csv: currencies translation table
expimp_spegen.csv: export/import and special/general translation table

The precise format (list of type of fields) of those csv files is described in the datapackage.json file. Learn more about data packages on the frictionless data website.
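As a sketch of how this description can be used, the snippet below lists the fields and types declared for each resource of a parsed datapackage.json. It relies only on the key names defined by the Frictionless Data Package specification (resources, name, schema, fields, type). The function name describe_datapackage is an assumption made for this example, and resources without an inline schema are reported with an empty field list.

```python
import json


def describe_datapackage(package):
    """Map each resource name to its list of (field name, field type) pairs.

    `package` is the dict obtained by parsing a datapackage.json file.
    A field without a declared type is reported with type None.
    """
    description = {}
    for resource in package.get("resources", []):
        schema = resource.get("schema")
        # A schema may also be given as a path or URL string; skip those here.
        if not isinstance(schema, dict):
            schema = {}
        description[resource.get("name", "<unnamed>")] = [
            (field["name"], field.get("type"))
            for field in schema.get("fields", [])
        ]
    return description


def load_datapackage(path):
    """Parse a datapackage.json file from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

For instance, describe_datapackage(load_datapackage("datapackage.json")), run from the repository root, returns one entry per declared resource.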

./database_scripts

This folder contains Python and bash scripts:

deduplicate_flows.py: prepares and filters the flows data and combines them into a sqlite database ready to serve the RICardo online exploration tool. This script also creates the few CSV exports included in the tool.
deploy_data.sh: copies the RICardo data into the RICardo web application folder pointed to in the config.py configuration file.

and more to be documented soon...

Deprecated Python 2 scripts yet to be ported

RICardo_sqlite_creation.py: compile data csv files in a sqlite database (see RICardo_schema.sql)
update_csv_from_sqlite.py: update the data folder from the RICardo sqlite database. This script is used to update the data folder after having edited data in batch through sql queries. Some examples of such scripts can be found in the update_data_scripts folder.
test folder: a series of Python scripts that apply automatic tests to the RICardo_viz.sqlite database. They output various data quality reports in the out_data folder

./update_data_scripts

This folder documents the data update sessions: original files, data update SQL queries, notes... Note that not all modifications are listed in this folder. For an exhaustive record of the changes made to the data, use the git history.

Supported by

This work has been supported by the Agence Nationale de la Recherche under the reference RICARDO ANR-06-BLAN-0332 and by the Sciences Po Scientific Advisory Board.
