Scraping BooksToScrape

About the project

OpenClassrooms Python Developer Project #2: Use Python Basics for Market Analysis

Tested on Windows 10, Python 3.9.5.

Objectives

Scrape books.toscrape.com with BeautifulSoup4 and Requests, export the data to .csv files and download cover images to the "exports" folder.

Implementation of the ETL process:

Extract relevant and specific data from the source website;
Transform, filter and clean data;
Load data into searchable and retrievable files.
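
The transform and load steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the record fields, the sample values and the function names `transform` and `load` are all assumptions, and the extraction step (Requests plus BeautifulSoup4) is replaced by a hard-coded list so the example runs offline.

```python
import csv
import io

# Hypothetical raw records, standing in for what an extraction step might return.
RAW_BOOKS = [
    {"title": "  Example Book A ", "price": "£51.77", "rating": "Three"},
    {"title": "Example Book B", "price": "£12.50", "rating": "One"},
]

# Assumed mapping from a rating word to a number.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def transform(raw):
    """Clean one raw record: trim the title, convert price and rating to numbers."""
    return {
        "title": raw["title"].strip(),
        "price": float(raw["price"].lstrip("£")),
        "rating": RATING_WORDS.get(raw["rating"], 0),
    }


def load(books, stream):
    """Write cleaned records to a CSV stream, header row first."""
    writer = csv.DictWriter(stream, fieldnames=["title", "price", "rating"])
    writer.writeheader()
    writer.writerows(books)


cleaned = [transform(book) for book in RAW_BOOKS]
buffer = io.StringIO()  # in-memory stand-in for a file in the "exports" folder
load(cleaned, buffer)
print(buffer.getvalue())
```

Keeping each stage in its own function lets the cleaning rules be tested without touching the network.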

Post-course optimisation

This project has been optimised after the end of the OpenClassrooms course. To view the previously delivered version, please check this commit.

Improvements made to this project include:

Using OOP for the main scraper
Optimising loops for faster execution time
Parsing of command line arguments for options:
- JSON export option
- Ignore images option
- One-file export option
Progress bars (tqdm)

Setup

Clone the repository

git clone https://github.com/hmignon/P2_mignon_helene.git

Create the virtual environment

cd P2_mignon_helene
python -m venv env
Activate the environment: source env/bin/activate (macOS and Linux) or env\Scripts\activate (Windows)

Install required packages

pip install -r requirements.txt

Usage

To scrape the entirety of books.toscrape.com to .csv files, use the command python main.py.

Options

Use python main.py --help to view all options.

--categories: Scrape one or several categories. This argument takes category names and/or full URLs. For example, the following two commands yield the same results:

python main.py --categories travel
python main.py --categories http://books.toscrape.com/catalogue/category/books/travel_2/index.html

To scrape a selection of categories, pass the selected names and/or URLs separated by spaces.

Note: selecting the same category several times (e.g. python main.py --categories travel travel) exports its data only once.

python main.py --categories classics thriller
python main.py --categories http://books.toscrape.com/catalogue/category/books/classics_6/index.html thriller
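
One way to treat names and URLs as equivalent, and to drop repeated categories, is to reduce every argument to a lowercase category name before scraping. The sketch below assumes category URLs follow the `.../books/<name>_<id>/index.html` pattern seen in the examples above. The helper names `category_slug` and `unique_categories` are hypothetical, and the project may resolve categories differently.

```python
import re
from urllib.parse import urlparse


def category_slug(arg):
    """Return a lowercase category name from either a name or a category URL."""
    if arg.startswith(("http://", "https://")):
        # .../category/books/travel_2/index.html -> ["...", "travel_2", "index.html"]
        parts = [p for p in urlparse(arg).path.split("/") if p]
        if len(parts) >= 2:
            # Drop the trailing numeric id: "travel_2" -> "travel"
            arg = re.sub(r"_\d+$", "", parts[-2])
    return arg.lower()


def unique_categories(args):
    """Normalise every argument and remove duplicates, keeping first-seen order."""
    return list(dict.fromkeys(category_slug(a) for a in args))


selection = unique_categories([
    "http://books.toscrape.com/catalogue/category/books/classics_6/index.html",
    "thriller",
    "classics",
])
print(selection)
```

`dict.fromkeys` is used instead of `set` so that categories are processed in the order they were given.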

-c or --csv: Export data to .csv files.
-j or --json: Export data to .json files.

Note: -j and -c can be combined to export to both formats in a single scraping run.

--one-file: Export all data to a single .csv/.json file.
--ignore-covers: Skip cover image downloads.
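
The options listed above map naturally onto `argparse`. The following is a sketch of how such a parser could be declared, not a copy of the project's `main.py`: the help strings, the `build_parser` name and the fallback to CSV when no format flag is given are assumptions (the fallback is inferred from the default `python main.py` behaviour described under Usage).

```python
import argparse


def build_parser():
    """Declare the command line options described in this README."""
    parser = argparse.ArgumentParser(
        prog="main.py", description="Scrape books.toscrape.com"
    )
    parser.add_argument(
        "--categories", nargs="+", default=[], metavar="NAME_OR_URL",
        help="category names and/or full URLs, separated by spaces",
    )
    parser.add_argument("-c", "--csv", action="store_true",
                        help="export data to .csv files")
    parser.add_argument("-j", "--json", action="store_true",
                        help="export data to .json files")
    parser.add_argument("--one-file", action="store_true",
                        help="export all data to a single file")
    parser.add_argument("--ignore-covers", action="store_true",
                        help="skip cover image downloads")
    return parser


# Parse an explicit list here so the example runs without a real command line.
args = build_parser().parse_args(["--categories", "classics", "thriller", "-j"])

# Assumption: export to CSV when neither -c nor -j was requested.
formats = [name for name in ("csv", "json") if getattr(args, name)] or ["csv"]
print(args.categories, formats, args.one_file, args.ignore_covers)
```

Note that argparse exposes `--one-file` and `--ignore-covers` as `args.one_file` and `args.ignore_covers`, with dashes converted to underscores.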

Using .csv files

If you wish to open the exported .csv files in any spreadsheet software (Microsoft Excel, LibreOffice/OpenOffice Calc, Google Sheets...), please make sure to select the following options:

UTF-8 encoding
comma , as separator
double quote " as string delimiter
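
The same three settings apply when reading an export from Python instead of a spreadsheet. The round trip below is a small sketch using the standard `csv` module. The file name, column names and row content are made up for illustration and do not reflect the real export layout.

```python
import csv
import os
import tempfile

# Hypothetical row containing a comma, a double quote and a non-ASCII character.
rows_out = [{"title": "Café, Culture", "description": 'She said "hello"'}]
path = os.path.join(tempfile.mkdtemp(), "sample.csv")

# Write with UTF-8 encoding, comma separator and double-quote string delimiter.
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["title", "description"], delimiter=",", quotechar='"'
    )
    writer.writeheader()
    writer.writerows(rows_out)

# Read back with the same settings; fields come back unchanged.
with open(path, newline="", encoding="utf-8") as f:
    rows_in = list(csv.DictReader(f, delimiter=",", quotechar='"'))

print(rows_in)
```

If the separator or quote character is set differently, the comma inside the title would split the field into two columns, which is the usual symptom of a wrong import setting.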
