
Scraping BooksToScrape (P2 OC D-A Python): Use Python basics for market analysis


hmignon/P2_BooksToScrape


Scraping BooksToScrape


About the project

OpenClassrooms Python Developer Project #2: Use Python Basics for Market Analysis

Tested on Windows 10, Python 3.9.5.

Objectives

Scrape books.toscrape.com with BeautifulSoup4 and Requests, export the data to .csv files, and download the cover images to the "exports" folder.

Implementation of the ETL process:

  • Extract relevant and specific data from the source website;
  • Transform, filter and clean data;
  • Load data into searchable and retrievable files.
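The transform and load steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the raw field formats (a price such as "£51.77", an availability string such as "In stock (22 available)", a rating word such as "Three") are assumptions about what the extract step returns, and every function and field name here is hypothetical.

```python
import csv
import io
import re

# Hypothetical mapping from rating words to integers.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def transform(raw: dict) -> dict:
    """Clean one raw record whose values are all strings, as scraped."""
    price = re.search(r"\d+(?:\.\d+)?", raw["price"])
    stock = re.search(r"\d+", raw["availability"])
    return {
        "title": raw["title"].strip(),
        "price": float(price.group()) if price else None,
        "stock": int(stock.group()) if stock else 0,
        "rating": RATING_WORDS.get(raw["rating"], 0),
    }


def load(records: list, stream) -> None:
    """Write cleaned records as CSV to an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=["title", "price", "stock", "rating"])
    writer.writeheader()
    writer.writerows(records)


# Assumed shape of one extracted record (sample values for illustration).
raw_record = {
    "title": "  A Light in the Attic ",
    "price": "£51.77",
    "availability": "In stock (22 available)",
    "rating": "Three",
}

clean = transform(raw_record)
buffer = io.StringIO()
load([clean], buffer)
print(buffer.getvalue())
```

Keeping the cleaning logic in a function that takes and returns plain dictionaries makes it easy to test without any network access.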

Post-course optimisation

This project has been optimised after the end of the OpenClassrooms course. To view the previously delivered version, please check this commit.

Improvements made to this project include:

  • Using OOP for the main scraper
  • Optimising loops for faster execution time
  • Parsing of command line arguments for options:
    • JSON export option
    • Ignore images option
    • One-file export option
  • Progress bars (tqdm)

Setup

Clone the repository

  • git clone https://github.com/hmignon/P2_mignon_helene.git

Create the virtual environment

  • cd P2_mignon_helene
  • python -m venv env
  • Activate the environment: source env/bin/activate (macOS and Linux) or env\Scripts\activate (Windows)

Install required packages

  • pip install -r requirements.txt

Usage

To scrape the entirety of books.toscrape.com to .csv files, use the command python main.py.

Options

Use python main.py --help to view all options.

  • --categories: Scrape one or several categories. This argument takes category names and/or full URLs. For example, the following two commands yield the same result:
python main.py --categories travel
python main.py --categories http://books.toscrape.com/catalogue/category/books/travel_2/index.html

To scrape a selection of categories, pass the selected names and/or URLs separated by a single space:

python main.py --categories classics thriller
python main.py --categories http://books.toscrape.com/catalogue/category/books/classics_6/index.html thriller

Note: selecting the same category several times (e.g. python main.py --categories travel travel) will only export its data once.
  • -c or --csv: Export data to .csv files.
  • -j or --json: Export data to .json files.

Note: -j and -c can be used together to export to both formats during the same scraping run.

  • --one-file: Export all data to a single .csv/.json file.
  • --ignore-covers: Skip cover image downloads.
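As a rough illustration of how the options above could be parsed, here is a minimal argparse sketch. It is not the project's actual main.py: the helper names (build_parser, category_name, unique_categories) are hypothetical, and the URL handling assumes category URLs follow the .../category/books/<name>_<id>/index.html pattern shown in the examples. What the real script does when neither -c nor -j is given is not modelled here.

```python
import argparse
from urllib.parse import urlparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags (help texts are paraphrased)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--categories", nargs="+", default=[],
                        help="category names and/or full category URLs")
    parser.add_argument("-c", "--csv", action="store_true", help="export to .csv")
    parser.add_argument("-j", "--json", action="store_true", help="export to .json")
    parser.add_argument("--one-file", action="store_true", help="single export file")
    parser.add_argument("--ignore-covers", action="store_true",
                        help="skip cover image downloads")
    return parser


def category_name(value: str) -> str:
    """Reduce a category name or URL to a bare lowercase name."""
    if value.startswith(("http://", "https://")):
        # .../category/books/travel_2/index.html -> "travel_2" -> "travel"
        slug = urlparse(value).path.rstrip("/").split("/")[-2]
        return slug.rsplit("_", 1)[0].lower()
    return value.lower()


def unique_categories(values: list) -> list:
    """Normalise names and drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(category_name(v) for v in values))


args = build_parser().parse_args([
    "--categories", "travel",
    "http://books.toscrape.com/catalogue/category/books/travel_2/index.html",
    "thriller", "-c", "-j",
])
print(unique_categories(args.categories))
print(args.csv, args.json, args.one_file, args.ignore_covers)
```

Normalising names and URLs to the same key before deduplicating is one way to get the "exported only once" behaviour described in the note, whether a category is given by name, by URL, or both.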

Using .csv files

To open the exported .csv files in spreadsheet software (Microsoft Excel, LibreOffice/OpenOffice Calc, Google Sheets...), select the following import options:

  • UTF-8 encoding
  • comma , as separator
  • double quote " as string delimiter
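The same three settings apply when reading the exports back in Python. The sketch below writes and then re-reads a small file with the standard csv module to show how they fit together. The file name and the sample row are made up for illustration, and the column names are hypothetical rather than the project's real export schema.

```python
import csv
import os
import tempfile

# Made-up sample row containing a comma and double quotes.
rows = [{"title": "Example, Book", "description": 'Says "hello"'}]

path = os.path.join(tempfile.mkdtemp(), "sample.csv")

# Write: UTF-8 encoding, comma separator, double quote as string delimiter.
with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "description"],
                            delimiter=",", quotechar='"')
    writer.writeheader()
    writer.writerows(rows)

# Read back with the same settings.
with open(path, encoding="utf-8", newline="") as f:
    restored = list(csv.DictReader(f, delimiter=",", quotechar='"'))

print(restored == rows)
```

Fields that contain the separator or the quote character are quoted on write and unquoted on read, which is why the string delimiter setting matters for titles and descriptions that include commas.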
