Content Database

Set up the Content Database

The Script SQL folder contains various files that help build the content database. To run them, go to the Database/Interactive SQL tab (Virtuoso Interactive SQL).

1. Set up the structure

If this is your first instantiation, use the global script cdb_global_v2.sql.

If you are updating an existing database, the needed scripts can be found in each specific folder.

2. Static data

Some tables have to be filled in order for the project to work, such as:

  • Named entities
  • Modality

3. Statistics Explained Data

As before, if this is your first instantiation of the database, use the global script cdb_global_v2.sql. If it is an update, the scripts needed can be found in the Statistics Explained folder. Launch the scripts in the following order:

Once the structure is set, you can launch the following files to fill the modality tables.

Once the database is set, you can start launching the various spiders.

4. Eurostat glossary

Regarding the structure, if you used the cdb_global_v2.sql file you can go straight to the data insertion part. If not, go to the Estat13k folder and launch the following scripts:

To gather the glossary, instead of scraping the data we used the bulk download option and created SQL queries from it.

First, the modality queries (estat13k_modalities_data.sql) have to be launched.

Then launch estat13k_glossary_data.sql. To do so, use the following Jupyter Notebook: cdb_insert.ipynb

Finally, you can add the last queries: estat13k_stat_and_measurement_unit_data.sql
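The contents of cdb_insert.ipynb are not reproduced here. As a rough illustration of why a notebook helps with large files, the following minimal sketch splits a SQL script into individual statements and executes them with periodic commits. It assumes a generic Python DB-API connection; the function names `split_statements` and `run_in_batches` and the batch size are hypothetical, and the demo uses an in-memory SQLite database only so that the sketch runs anywhere.

```python
import sqlite3
from typing import Iterable, Iterator


def split_statements(sql_text: str) -> Iterator[str]:
    """Yield statements separated by ';'.

    Semicolons inside single-quoted string literals are kept.
    Limitation of this sketch: SQL comments containing quotes are not handled.
    """
    buf = []
    in_quote = False
    for ch in sql_text:
        if ch == "'":
            # A doubled quote ('') toggles twice, so the state stays correct.
            in_quote = not in_quote
        if ch == ";" and not in_quote:
            stmt = "".join(buf).strip()
            if stmt:
                yield stmt
            buf = []
        else:
            buf.append(ch)
    tail = "".join(buf).strip()
    if tail:
        yield tail


def run_in_batches(conn, statements: Iterable[str], batch_size: int = 500) -> int:
    """Execute statements one by one, committing every batch_size statements.

    Returns the number of statements executed.
    """
    cur = conn.cursor()
    count = 0
    for stmt in statements:
        cur.execute(stmt)
        count += 1
        if count % batch_size == 0:
            conn.commit()
    conn.commit()
    return count


if __name__ == "__main__":
    # Demo only: SQLite stands in for the real content database (assumption).
    demo_sql = (
        "CREATE TABLE t (label TEXT);"
        "INSERT INTO t VALUES ('a;b');"
        "INSERT INTO t VALUES ('it''s');"
    )
    demo_conn = sqlite3.connect(":memory:")
    print(run_in_batches(demo_conn, split_statements(demo_sql), batch_size=2))
```

With a real database, the text of a file such as estat13k_glossary_data.sql would be read from disk and passed to `split_statements`, and `conn` would be whatever connection object your driver provides.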

5. CodeList and datasets

Regarding the structure, if you used the cdb_global_v2.sql file you can go straight to the data insertion part. If not, go to the CodeList and datasets folder and launch the following script:

As before, we did not scrape the following data. We first downloaded the raw data and then created SQL queries to fill the database.

The first step is to launch estat_codelist_label_data.sql and then, using cdb_insert.ipynb, launch each file estat_dictionnary_code_batchX.sql, X=1,...,5.

At this stage, the codelists and codes are all in the content database. However, we found that some codes have to be added to the time dictionary for our work on the datasets to function. You will find the elements to add in the estat_dictionnary_code_data_time_addition.sql file.

Then you can add some datasets. First launch the estat_dataset_label_data.sql file, then estat_dataset_code_data.sql to create the links between datasets and codelists. If the last file is too heavy, the cdb_insert.ipynb file can be used.
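The launch order described in this section can be summarised in a short sketch. The file names come from the text above; the helper name `codelist_script_order` and its `n_batches` parameter are hypothetical and only serve to make the sequence explicit.

```python
from typing import List


def codelist_script_order(n_batches: int = 5) -> List[str]:
    """Return the CodeList and datasets scripts in the order they should be launched."""
    # 1. Codelist labels first.
    scripts = ["estat_codelist_label_data.sql"]
    # 2. Code batches X = 1..n_batches (launched through cdb_insert.ipynb).
    scripts += [
        f"estat_dictionnary_code_batch{x}.sql" for x in range(1, n_batches + 1)
    ]
    # 3. Extra time codes, then dataset labels, then dataset/codelist links.
    scripts += [
        "estat_dictionnary_code_data_time_addition.sql",
        "estat_dataset_label_data.sql",
        "estat_dataset_code_data.sql",
    ]
    return scripts


if __name__ == "__main__":
    for name in codelist_script_order():
        print(name)
```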

6. Taxonomy, Terminology, Topic Model

For each of these, the Script SQL folder contains a folder of the same name holding the structure of the needed tables.

7. The CDB schema

Please see CDB schema. The file CDB_tables.docx briefly describes the main tables in the Content Database, i.e. those that were actually used in the Use Cases. Tables that were left unused are not included in this description.