Skip to content
This repository has been archived by the owner on May 16, 2023. It is now read-only.

sp1thas/book-depository-dataset

Repository files navigation

rip

Bookdepository has been discontinued, as a result, this project is now piece of history. Don't try to scrape anything, won't work.


Book Depository Dataset

pre-commit.ci status testing python-version kaggle-dataset Code style: black

The source code of Book Depository Dataset. Here you will find the implementation for data extraction (scrapy spider), parsing and EDA.

Dataset is also available here as kaggle dataset

Project Structure

  • crawler: scrapy crawler for data extraction

  • parser: python script for data transformation and dataset creation

  • eda: Exploratory Data Analysis on dataset

Step to reproduce

  1. Run scrapy crawler in order to retrieve data from bookdepository.com
  2. Run parser in order to create the dataset

Crawler

scrapy-version

This scrapy project is used to extract the majority of books from bookdepository.com. If you want to extract the data on your own, please keep settings file as is.

Usage

Use crawler as a common scrapy project:

poetry run scrapy crawl bdepobooks -o data/raw/textual/books.jsonlines

Scraping process will take more than a week. (scraping rate: ~50 items/minute). After crawling, data/raw/textual/books.jsonlines will contain all the raw data of books. Downloaded images can be found under the data/raw/media/full folder.

Parser

This submodule is about parsing and manipulating the raw data in order to create the dataset in a tabular format (csv).

Usage

Use the parser directly from command line, just provide the .jsonlines file with raw data and the output directory.

python parse_dataset.py -h
optional arguments:
  -h, --help            show this help message and exit
  -i INP, --input-file INP
                        Input file path
  -o OUT, --output-folder OUT
                        Output folder path

Working example

poetry run python src/parser/parse_dataset.py \
                  --input-jsonb data/raw/textual/books.jsonlines \
                  --input-images data/raw/media/full \
                  --output-folder data/parsed

This script will create a collection of .csv and .zip files in data/parsed/ folder.

Citation

 @misc{simakis_2020,
	title={Book Depository Dataset},
	url={https://www.kaggle.com/ds/467291},
	DOI={10.34740/kaggle/ds/467291},
	publisher={Kaggle},
	author={Simakis, Panagiotis},
	year={2020}
}

Sponsor

A shout-out for the sponsors of this project:

Disclaimer

All books are hosted by bookdepository.com. The use of dataset is fair use for academic purposes.