
HALvesting

Harvests open scientific papers from HAL.


HALvesting is a Python project designed to crawl data from the HAL (Hyper Articles en Ligne) repository. It provides functionality to fetch data from HAL and to process it for further analysis.

The latest dump can be found on HuggingFace.

Features

Requirements

You will need Python > 3.8 and an internet connection.

Installation

  1. Clone the repository:
git clone https://github.com/Madjakul/HALvesting.git
  2. Navigate to the project directory:
cd HALvesting
  3. Install the required dependencies:
pip install -r requirements.txt
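
If you prefer to keep the dependencies isolated, a virtual environment works as well. This step is optional and not part of the project's own instructions; the environment name below is arbitrary:

# Optional: create and activate a virtual environment before installing.
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt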

Usage

The easiest way is to edit the scripts scripts/fetch_data.sh, scripts/merge_data.sh, scripts/enrich_data.sh, and scripts/filter_data.sh as needed and launch them. However, you can also run the Python scripts directly with the appropriate arguments, as shown below.
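
For example, assuming the variables inside the shell scripts have been set to your liking (paths, dates, and so on are defined there), a run then boils down to:

# Edit the variables in scripts/fetch_data.sh first, then:
bash scripts/fetch_data.sh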

Fetching Data

This script passes a query to HAL's API and fetches all the open-access papers before storing them in a JSON file. The fetched data comprises only the papers' metadata, for example:

[
    {
        "halid": "02975689",
        "lang": "ar",
        "domain": [ "math.math-ho", "shs.edu" ],
        "timestamp": "2024/03/04 16:36:13",
        "year": "2020",
        "url": "https://hal.science/hal-02975689/file/DAD-TGP_Encyclopedia_Arabic.pdf"
    },
    ...
]
>>> python3 fetch_data.py -h
usage: fetch_data.py [-h] [--query [QUERY]] [--from_date FROM_DATE] [--from_hour FROM_HOUR] [--to_date TO_DATE] [--to_hour TO_HOUR] --pdf PDF --response_dir RESPONSE_DIR [--pdf_dir [PDF_DIR]] [--num_chunks [NUM_CHUNKS]]

Arguments used to fetch data.

options:
  -h, --help            show this help message and exit
  --query [QUERY]       Query used to request APIs.
  --from_date FROM_DATE
                        Minimum submission date of documents.
  --from_hour FROM_HOUR
                        Minimum submission hour of documents.
  --to_date TO_DATE     Maximum submission date of documents.
  --to_hour TO_HOUR     Maximum submission hour of documents.
  --pdf PDF             Set to `true` if you want to download the PDFs.
  --response_dir RESPONSE_DIR
                        Target directory used to store fetched data.
  --pdf_dir [PDF_DIR]   Target directory used to store the PDFs.
  --num_chunks [NUM_CHUNKS]
                        Number of semaphores for the PDF downloader.
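
As a sketch, a full fetch with PDF download could look like the following; the query, the date and hour formats, and the paths are illustrative assumptions rather than values prescribed by the project:

# Hypothetical values: adjust the query, dates, and directories to your needs.
python3 fetch_data.py \
    --query "*" \
    --from_date 2024-01-01 --from_hour 00:00:00 \
    --to_date 2024-03-01 --to_hour 23:59:59 \
    --pdf true \
    --response_dir ./responses \
    --pdf_dir ./pdfs \
    --num_chunks 8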

Post-process Data

This script merges the fetched metadata with the text generated by GROBID and harvesting. The output is a set of compressed JSON files sorted by language.

>>> python3 merge_data.py -h
usage: merge_data.py [-h] --js_dir_path JS_DIR_PATH --txts_dir_path TXTS_DIR_PATH --output_dir_path OUTPUT_DIR_PATH --version VERSION

Arguments used to fetch data.

options:
  -h, --help            show this help message and exit
  --js_dir_path JS_DIR_PATH
                        Folder containing fetched data.
  --txts_dir_path TXTS_DIR_PATH
                        Folder containing the txt files.
  --output_dir_path OUTPUT_DIR_PATH
                        Final folder containing the processed data for HuggingFace.
  --version VERSION     Version of the dump starting at '1.0'.
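
A sketch of an invocation, assuming the fetched JSON files and the GROBID text files live in local folders; the paths below are placeholders:

# Placeholder paths: point them to your fetched JSON and extracted txt files.
python3 merge_data.py \
    --js_dir_path ./responses \
    --txts_dir_path ./txts \
    --output_dir_path ./data \
    --version 1.0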

Enrich Data

This script adds new keys to the merged data. The added values are pushed to HuggingFace for convenience. For more information, see https://huggingface.co/datasets/Madjakul/HALvest-R

>>> python3 enrich_data.py -h
usage: enrich_data.py [-h] [--dataset_checkpoint DATASET_CHECKPOINT] [--cache_dir_path [CACHE_DIR_PATH]] [--dataset_config_path DATASET_CONFIG_PATH] [--download_models DOWNLOAD_MODELS] [--kenlm_dir_path KENLM_DIR_PATH] [--num_proc NUM_PROC]
                      [--batch_size BATCH_SIZE] [--output_dir_path OUTPUT_DIR_PATH] [--tokenizer_checkpoint [TOKENIZER_CHECKPOINT]] [--use_fast [USE_FAST]] [--load_from_cache_file [LOAD_FROM_CACHE_FILE]] --version VERSION

Download Sentencepiece and KenLM models for supported languages.

options:
  -h, --help            show this help message and exit
  --dataset_checkpoint DATASET_CHECKPOINT
                        Name of the HuggingFace dataset to be processed.
  --cache_dir_path [CACHE_DIR_PATH]
                        Path to the HuggingFace cache directory.
  --dataset_config_path DATASET_CONFIG_PATH
                        Path to the txt file containing the dataset configs to process.
  --download_models DOWNLOAD_MODELS
                        Set to `true` if you want to download the KenLM models.
  --kenlm_dir_path KENLM_DIR_PATH
                        Path to the directory containing the sentencepiece and kenlm models.
  --num_proc NUM_PROC   Number of processes to use for processing the dataset.
  --batch_size BATCH_SIZE
                        Number of documents loaded per proc.
  --output_dir_path OUTPUT_DIR_PATH
                        Path to the directory where the processed dataset will be saved.
  --tokenizer_checkpoint [TOKENIZER_CHECKPOINT]
                        Name of the HuggingFace tokenizer model to be used.
  --use_fast [USE_FAST]
                        Set to `true` if you want to use the Rust-based tokenizer from HF.
  --load_from_cache_file [LOAD_FROM_CACHE_FILE]
                        Set to `true` if some of the enriching functions have been altered.
  --version VERSION     Version of the dump starting at '1.0'.
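
A sketch of an invocation; the dataset checkpoint, config file, and directories are assumptions chosen to illustrate the flags documented above, not values fixed by the project:

# Hypothetical checkpoint, config file, and paths; adjust to your setup.
python3 enrich_data.py \
    --dataset_checkpoint Madjakul/HALvest \
    --dataset_config_path ./configs/languages.txt \
    --download_models true \
    --kenlm_dir_path ./kenlm \
    --num_proc 4 \
    --batch_size 1000 \
    --output_dir_path ./data-enriched \
    --version 1.0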

Filter Data

This script filters gibberish documents out of the raw data: https://huggingface.co/datasets/Madjakul/HALvest

>>> python3 filter_data.py -h
usage: filter_data.py [-h] [--dataset_checkpoint DATASET_CHECKPOINT] [--cache_dir_path [CACHE_DIR_PATH]] [--dataset_config_path DATASET_CONFIG_PATH] [--num_proc NUM_PROC] [--batch_size BATCH_SIZE] [--output_dir_path OUTPUT_DIR_PATH]
                      [--load_from_cache_file [LOAD_FROM_CACHE_FILE]] --version VERSION

Arguments used to filter the dataset.

options:
  -h, --help            show this help message and exit
  --dataset_checkpoint DATASET_CHECKPOINT
                        Name of the HuggingFace dataset to be processed.
  --cache_dir_path [CACHE_DIR_PATH]
                        Path to the HuggingFace cache directory.
  --dataset_config_path DATASET_CONFIG_PATH
                        Path to the txt file containing the dataset configs to process.
  --num_proc NUM_PROC   Number of processes to use for processing the dataset.
  --batch_size BATCH_SIZE
                        Number of documents loaded per proc.
  --output_dir_path OUTPUT_DIR_PATH
                        Path to the directory where the processed dataset will be saved.
  --load_from_cache_file [LOAD_FROM_CACHE_FILE]
                        Set to `true` if some of the enriching functions have been altered.
  --version VERSION     Version of the dump starting at '1.0'.
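
As with the other steps, here is a sketch of an invocation with placeholder values; the checkpoint, config file, and paths are assumptions:

# Hypothetical checkpoint, config file, and paths; adjust to your setup.
python3 filter_data.py \
    --dataset_checkpoint Madjakul/HALvest \
    --dataset_config_path ./configs/languages.txt \
    --num_proc 4 \
    --batch_size 1000 \
    --output_dir_path ./data-filtered \
    --version 1.0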

Citation

To cite HALvesting/HALvest:

@software{almanach_halvest_2024,
  author = {Kulumba, Francis and Antoun, Wissam and Vimont, Guillaume and Romary, Laurent},
  title = {HALvest: Open Scientific Papers Harvested from HAL.},
  month = {April},
  year = {2024},
  company = {Almanach},
  url = {https://github.com/Madjakul/HALvesting}
}

Acknowledgement

The code and the dataset have been built upon the following work:

GROBID: A machine learning software for extracting information from scholarly documents
https://github.com/kermitt2/grobid

harvesting: Collection of parsers for scientific data

License

This project is licensed under the Apache License 2.0.