stav121/warc-parser

warc-parser

WARC/0.18 File metadata parser and indexer.

Requirements

  • Python 3.6 (or greater)
  • psycopg2 2.8.5
  • pyyaml 5.1
  • validators 0.18.2
  • beautifulsoup4 4.9.3

About

This project extracts and indexes metadata from files in the WARC/0.18 format.

The input WARC/0.18 file is processed and the metadata is saved in a Postgres Database table.
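
As an illustration of the processing step described above, here is a minimal sketch of reading record headers from a WARC stream. It is not the project's actual implementation: the function names `iter_warc_headers` and `read_warc_headers` are hypothetical, and the sketch assumes each record starts with a `WARC/` version line, followed by `Name: value` header fields, a blank line, and a content block whose size is given by `Content-Length`.

```python
import gzip
import io


def iter_warc_headers(stream):
    """Yield one dict of header fields per record in a binary WARC stream.

    Hypothetical helper for illustration; assumes well-formed records.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers = {"version": line.strip().decode("ascii")}
        # Read "Name: value" lines until the blank line that ends the header.
        for raw in iter(stream.readline, b""):
            raw = raw.strip()
            if not raw:
                break
            name, _, value = raw.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Skip the content block so the next readline starts at the next record.
        stream.read(int(headers.get("Content-Length") or 0))
        yield headers


def read_warc_headers(path):
    """Collect the headers of every record in a gzip-compressed WARC file."""
    with gzip.open(path, "rb") as stream:
        return list(iter_warc_headers(stream))
```

Each yielded dict could then be mapped onto the columns of the database table. Skipping the content block by its declared length keeps the sketch from mistaking payload lines for record boundaries.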

Usage

First, configure the database connection in the config.yaml file.

Inside the database, create the table defined in the docker/init.sql script.

Alternatively, you can use the docker-compose file in the docker/ folder to spawn a database.

Run: cd docker/ && docker-compose up -d

Example usage of the script:

python3 warcparser.py -f input/15.warc.gz -c config.yaml -n=corpus-name
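
For readers wondering how the three flags fit together, the command line above could be parsed roughly as in the sketch below. This is an assumption, not the script's actual code: only the short flags `-f`, `-c` and `-n` come from the example, while the long option names, help texts and the `build_parser` function are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown in the example command.

    The long option names and help strings are assumptions for illustration.
    """
    parser = argparse.ArgumentParser(
        description="WARC/0.18 file metadata parser and indexer"
    )
    parser.add_argument("-f", "--file", required=True,
                        help="path to the input .warc.gz file")
    parser.add_argument("-c", "--config", required=True,
                        help="path to the YAML file with the database settings")
    parser.add_argument("-n", "--name", required=True,
                        help="name of the corpus the file belongs to")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.file, args.config, args.name)
```

argparse accepts both `-n corpus-name` and the `-n=corpus-name` form used in the example.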

Author

Stavros Grigoriou
