eurostat/NLP4Stat

NLP4Stat

Project organisation

  • Software Environment: contains instructions on how to install and connect to the Virtuoso server.

  • Content Database: contains instructions on how to set up the content database and how to scrape and load the data into it. It also contains sub-folders with enrichment code.

  • Knowledge Database - see dedicated section.

  • Knowledge Database latest documentation: the latest documentation of the knowledge database after the re-organisation, in a Word file, as of July 2022. It includes a description of the ontologies used, the alignment with external Linked Open Data (LOD) vocabularies and the structure of the ontology files.

  • Content and Knowledge Database documentation: the documentation in a Word file, as of February 2022. It describes the version of the knowledge database before the re-organisation and remains valid only for the content database.

  • Use Case A:

    • Use Case A Widgets Demo: a demonstration of ipywidgets only, as part of deliverable D3.1. It is superseded by the code below, which is part of deliverable D3.2.

    • Use Case A Query builder: a query builder with inputs from the database (SE articles and SE Glossary articles). The latest version (June 2022) is a Google Colab notebook demonstrating how all data are read from the Knowledge Database with SPARQL queries; the performance improvement is significant.
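As a rough illustration of reading from the Knowledge Database with SPARQL, the sketch below builds a query and encodes it as an HTTP GET request using only the standard library. The endpoint URL and the dct:title predicate are assumptions for illustration; the actual graphs and predicates are defined in the notebooks.

```python
from urllib.parse import urlencode

# Hypothetical Virtuoso SPARQL endpoint; the real URL is configured in the notebooks.
ENDPOINT = "http://localhost:8890/sparql"

# Illustrative query: fetch article titles. The graph and predicate names
# used by the NLP4Stat Knowledge Database may differ.
query = """
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?article ?title
WHERE {
  ?article dct:title ?title .
}
LIMIT 10
"""

def build_request_url(endpoint, sparql):
    """Encode a SPARQL query as a GET request URL asking for JSON results."""
    params = urlencode({"query": sparql, "format": "application/sparql-results+json"})
    return f"{endpoint}?{params}"

url = build_request_url(ENDPOINT, query)
```

Sending `url` with any HTTP client (or SPARQLWrapper) would return the bindings as JSON; only the request construction is shown here so the sketch runs offline.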

    • Use Case A Faceted search: faceted search with inputs from the database (SE articles). Among other things, the code assigns the majority of the SE articles to one or more themes, sub-themes and categories. Revised in January 2022 to read all inputs from the database. The more recent version (May 2022, in Google Colab) uses the Knowledge Database to return, for each SE article in the results, all related resources (not only the related SE articles, as in previous versions).
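A toy sketch of the theme-assignment idea: an article is matched to every theme whose keywords it mentions, so one article can land in several facets. The theme names and keyword sets below are invented; the real code derives the assignment from database content.

```python
# Hypothetical theme -> keyword mapping; the real assignment comes from the database.
THEME_KEYWORDS = {
    "Economy and finance": {"gdp", "inflation", "deficit"},
    "Population and social conditions": {"migration", "population", "employment"},
}

def assign_themes(text):
    """Return every theme whose keyword set overlaps the article text."""
    words = set(text.lower().split())
    return sorted(t for t, kw in THEME_KEYWORDS.items() if words & kw)

themes = assign_themes("Population change and employment in the EU")
```

Because the check is a simple set intersection per theme, an article mentioning both GDP and migration is assigned to both facets, mirroring the "possibly more than one" behaviour described above.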

    • Use Case A Graphical exploration: two applications for graphical exploration, one in R Shiny and another in MS Power BI. See the separate description in this link, which includes links to short documentation for the two applications. Revised in January 2022.

  • Use Case B:

    • Use Case B Query builder: a query builder using content from the SE Glossary articles, the SE articles and the OECD Glossary of Statistical Terms. Revised in January 2022 to eliminate external files. The latest version (June 2022) is a Google Colab notebook demonstrating how all data are read from the Knowledge Database with SPARQL queries; the performance improvement is significant.
    • Use Case B Faceted search: uses Eurostat themes and sub-themes to search articles from the OECD Glossary of Statistical Terms. Among other things, the code uses a correspondence between a) Eurostat's themes and sub-themes and b) the OECD Glossary themes. Revised in January 2022 to eliminate external files. The latest version (May 2022, in a Google Colab notebook) uses the Knowledge Database to return OECD themes, both for the displayed articles and for their related ones.
    • Use Case B Graphical exploration - Power BI: an MS Power BI application. See the separate documentation in this folder. Revised in January 2022.
    • Use Case B SE OECD Common NPs: finds common noun phrases in Statistics Explained articles and OECD Glossary articles, with the objective of creating a common vocabulary for labelling both sources. The code reads from the database a manually filtered set of the noun phrases found in the SE articles, keeping the most "useful" ones. The common vocabulary is used in the Power BI application of Use Case B. The folder also contains the output file (SE_vs_OECD_Glossary_Noun_Phrases.xlsx). Revised in January 2022.
    • Use Case B Topic Modelling: a demonstration code showing the application of topic-modelling results based on Eurostat's content to OECD Glossary articles. See also the note in this folder. Revised in January 2022.
    • Use Case B Scraping OECD: code for scraping the OECD Glossary articles and writing the scraped content to the Content Database. Revised in January 2022.
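The common-vocabulary step above can be sketched as a normalise-and-intersect operation over the noun phrases of the two sources. The phrase lists below are invented for illustration, and the real code additionally applies the manual filtering of SE noun phrases stored in the database.

```python
def normalise(phrase):
    """Lower-case a phrase and collapse internal whitespace."""
    return " ".join(phrase.lower().split())

# Invented example phrases; the real lists come from the SE articles (after
# manual filtering) and the OECD Glossary articles.
se_phrases = ["Gross Domestic Product", "unemployment rate", "price index"]
oecd_phrases = ["gross domestic product", "Price  Index", "labour force"]

# The common vocabulary is the intersection of the two normalised sets.
common = sorted({normalise(p) for p in se_phrases} & {normalise(p) for p in oecd_phrases})
```

Phrases surviving the intersection can then serve as shared labels for articles from either source, as in the Power BI application.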
  • Use Case C:

    • Use Case C Word embeddings: word vectors trained on SE articles and SE Glossary articles, applied to the identification of Eurostat datasets. The processing of the SE and SE Glossary articles is needed for the first run only, to save the vectors model; after that, it suffices to load the model. The model is also saved in plain-text format for inspection (see the file SE_GL_wordvectors.txt in the folder). Revised in February 2022; it no longer requires the external file table_of_contents.xml. There is a Jupyter notebook version and a Google Colab one; the file name of the latter starts with GC_.
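How word vectors support this kind of matching can be shown with cosine similarity over toy vectors. The three-dimensional vectors below are made up; the real model is trained on the SE and SE Glossary corpus and reloaded from disk on subsequent runs.

```python
import math

# Made-up 3-d vectors standing in for the trained word embeddings.
vectors = {
    "unemployment": [0.9, 0.1, 0.0],
    "employment":   [0.8, 0.2, 0.1],
    "forestry":     [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest(word):
    """Vocabulary word closest to `word` (excluding itself)."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))

best = nearest("unemployment")
```

In the notebooks the same nearest-neighbour idea, applied to real embeddings, links query terms to the vocabulary of dataset descriptions.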

    • Use Case C Topic modelling and Word embeddings: a combination of topic modelling and word embeddings for the identification of statistical datasets. Revised in February 2022; it no longer requires the external file table_of_contents.xml. One can either re-create the LDA model or load the saved one from the previous code, from the file lda_model.pl contained in the compressed file lda_model.rar; a copy is included in the Use Case C/Data folder. The latest version is a Google Colab notebook, adjusted in June 2022 to enrich the user's query with related terms and synonyms from an external ontology, ConceptNet, using two methods to access such terms.

      • Main features:
      • Carry out topic modelling with a sufficiently large corpus (Statistics Explained articles and Statistics Explained Glossary articles) and a large number of topics (1000), and extract significant (lemmatized) keywords. The objectives are twofold:
        • to cover the whole corpus, and thus the "correlated" datasets, at a high granularity,
        • to avoid using common ("dominating") words in the matches with the user's query.
      • Enhance these keywords with their closest terms from the word embeddings created exclusively from Eurostat's content. The resulting large number of keywords can then differentiate the datasets.
      • Match the sentence(s) entered in the query (similarly enhanced, and enriched from ConceptNet) with datasets, based on the number of keywords found in the datasets' simple or full descriptions.
      • Give first priority to matches with words in the enhanced topic-modelling dictionary, and second priority to matches with any other words, to avoid "dominating" terms.
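The two-tier priority above can be sketched by scoring each dataset with a pair (dictionary matches, other matches) and comparing the pairs lexicographically. The dictionary, dataset codes and descriptions below are invented for illustration.

```python
# Invented stand-in for the enhanced topic-modelling dictionary.
topic_dictionary = {"unemployment", "labour", "employment"}

# Invented dataset codes and descriptions.
datasets = {
    "une_rt_m": "monthly unemployment rate by sex and age",
    "nama_10_gdp": "gdp and main components output expenditure and income",
}

def score(query, description):
    """(dictionary matches, other matches) -- compared lexicographically,
    so a single dictionary word outranks any number of 'other' words."""
    q, d = set(query.lower().split()), set(description.lower().split())
    matched = q & d
    in_dict = len(matched & topic_dictionary)
    return (in_dict, len(matched) - in_dict)

def rank(query):
    """Datasets ordered best-first by the two-tier score."""
    return sorted(datasets, key=lambda code: score(query, datasets[code]), reverse=True)

ranked = rank("unemployment rate by age")
```

Putting the dictionary count first in the tuple is what keeps frequent "dominating" words from deciding the ranking on their own.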
    • Use Case C BERT model: a BERT model based on the SentenceTransformers Python framework, which uses the BERT algorithm to compute sentence embeddings. The strategy is to use a pre-trained BERT model, fine-tune it on the available corpus, and then use the "retrieve & re-rank pipeline" approach for ranking the matches, as is suggested for complex semantic-search scenarios. This is a Google Colab notebook and requires a GPU to run properly. It also requires setting up a Google Drive to store the model and retrieve it in re-runs, avoiding the long computation time the model requires. Revised in February 2022; it no longer requires the external file table_of_contents.xml. Adjusted in June 2022 to replace all SQL queries with SPARQL queries to the Knowledge Database and to add the entries from Eurostat's Concepts and Definitions Database as inputs to the fine-tuning of the model.
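The shape of a retrieve & re-rank pipeline can be shown offline: a cheap first stage narrows the candidates, then a finer scorer re-orders them. In the notebook the retriever is an S-BERT bi-encoder and the re-ranker a cross-encoder; here both stages are stand-in lexical scorers so the sketch runs without a GPU.

```python
# Invented document ids and descriptions.
documents = {
    "d1": "unemployment rate by age",
    "d2": "unemployment benefits expenditure",
    "d3": "forestry production and trade",
}

def retrieve(query, k=2):
    """Stage 1 (stand-in for the bi-encoder): top-k documents by raw word overlap."""
    q = set(query.lower().split())
    return sorted(documents, key=lambda d: -len(q & set(documents[d].split())))[:k]

def rerank(query, candidates):
    """Stage 2 (stand-in for the cross-encoder): re-order the shortlist by
    overlap ratio, a finer but more expensive score applied to few documents."""
    q = set(query.lower().split())
    def ratio(d):
        words = set(documents[d].split())
        return len(q & words) / len(words)
    return sorted(candidates, key=ratio, reverse=True)

hits = rerank("unemployment rate", retrieve("unemployment rate"))
```

The point of the split is cost: the expensive scorer only ever sees the short list produced by the cheap one, which is why the pattern is recommended for large semantic-search corpora.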

  • Use Case D:

    • Use Case D using BERT: a demonstration code showing the logic of the databot. The notebook, Use_Case_D_Using_BERT_v1.ipynb, runs in Google Colab and requires a CUDA-enabled GPU. It can be used in bot conversations to identify either datasets or SE Glossary articles. The component for identifying similar datasets or SE Glossary articles is based on the same S-BERT model used in Use Case C. The most time-consuming part is the fine-tuning, which can be skipped once it has been run.
    • Use Case D DeepPavlov: code with the same logic, but implemented with the DeepPavlov framework and with some changes and improvements. Instructions are included at the top of the notebook, Use_Case_D_DeepPavlov_v2_rev_June2022.ipynb. The requirements (a CUDA-enabled GPU) and the targets of the conversation with the databot (either datasets or SE Glossary articles) are the same. The code was adjusted in June 2022 to read all data with SPARQL queries from the Knowledge Database and to include the terms and definitions from Eurostat's Concepts and Definitions Database as additional inputs in the fine-tuning stage.
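The core databot step, picking the closest dataset or SE Glossary article for a user question, can be sketched as below. The "embedding" here is a bag-of-words stand-in for the S-BERT sentence embeddings used in the notebooks, and the target ids and texts are invented, so the sketch runs without a GPU.

```python
# Invented targets of the two kinds the databot can return.
targets = {
    "dataset:une_rt_m": "monthly unemployment rate",
    "dataset:for_vol": "volume of timber removals in forestry",
    "glossary:Unemployment": "unemployment means being without work and seeking it",
}

def answer(question, kind):
    """Best target of the requested kind ('dataset' or 'glossary') by word
    overlap -- a stand-in for S-BERT similarity over sentence embeddings."""
    q = set(question.lower().split())
    candidates = {k: v for k, v in targets.items() if k.startswith(kind)}
    return max(candidates, key=lambda k: len(q & set(candidates[k].split())))

reply = answer("what is the unemployment rate", "dataset")
```

In a conversation, the `kind` argument corresponds to the user's choice between dataset and glossary targets described above.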
