
Scraper Exercise

This repository provides an example of a Command Line Interface (CLI) tool for web scraping with input validation. The program is written in Python; users control the scraping process through input arguments supplied on the command line, which are parsed and validated with the argparse module.

Implementation Notes

  • Iterative algorithm - uses fewer system resources than recursion, so the upper bound on memory usage (and therefore on crawl size) is significantly higher
  • Online algorithm - the queue of jobs can grow or shrink dynamically, since the whole working set is not known in advance
  • Multithreading with worker-pool execution - downloading the HTML is external I/O and causes significant slowdowns, so a pool of worker threads speeds up execution considerably (a minimal sketch of this design follows the list)
  • Synchronization mechanisms for multithreading (no busy-waiting) - non-atomic operations on the shared data structures, and communication between the threads, are synchronized with proper primitives rather than polling
  • Wrapper classes that create an abstraction for the end user - ease of use, abstraction, facade
  • Additional extra arguments inside the code - allow more control over scalability and fine-tuning for specific requirements
  • Dynamic resource scaling based on usage - because the algorithm is online, system usage scales with the available workload up to an upper bound
  • CLI usage - clean command-line usage and integration
  • Import vs. run as main - the code can be used via an import or run directly from the command line as the main module, with no imports required
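For illustration, a minimal sketch of this iterative, queue-driven worker-pool design might look as follows. This is not the repository's implementation: the names crawl and fetch_links are hypothetical, link extraction is reduced to a crude regex standing in for LinkExtractor, and a depth limit stands in for the real extract_amount/unique controls.

import queue
import re
import threading
import urllib.request

def fetch_links(url):
    # Download a page and return the absolute links found in it.
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return re.findall(r'href="(https?://[^"]+)"', html)

def crawl(base_url, max_depth, num_threads=4):
    jobs = queue.Queue()          # online job queue; blocking get() means no busy-waiting
    seen = set()
    seen_lock = threading.Lock()  # guards the non-atomic check-then-add on `seen`

    def worker():
        while True:
            item = jobs.get()     # blocks until work (or a shutdown sentinel) arrives
            if item is None:
                jobs.task_done()
                return
            url, depth = item
            try:
                for link in fetch_links(url):
                    with seen_lock:
                        if link in seen:
                            continue
                        seen.add(link)
                    if depth + 1 < max_depth:
                        jobs.put((link, depth + 1))  # the queue grows as work is discovered
            except OSError:
                pass              # a failed download just drops that job
            finally:
                jobs.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(num_threads)]
    for t in threads:
        t.start()
    jobs.put((base_url, 0))
    jobs.join()                   # wait until every queued job has been processed
    for _ in threads:
        jobs.put(None)            # one shutdown sentinel per worker
    for t in threads:
        t.join()
    return seen

Because queue.Queue.get() blocks on a condition variable internally, idle workers sleep instead of spinning, which is the "no busy-waiting" point above.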

Prerequisites

  • Python 3.10 or later installed on your system.

Getting Started

  1. Clone the repository or copy the code to your local machine.
  2. (Recommended) Create a virtual environment (for example with conda or venv):
python -m venv venv
  3. Install the required Python dependencies:
pip install -r ./requirements/publish.txt

Usage

from scraper import Scraper, ScraperWorker, LinkExtractor

NUM_THREADS = 4
base_url = "https://www.ynetnews.com/"  # URL of the website to scrape
extract_amount = 5                      # number of items to extract
max_depth = 2                           # maximum crawl depth
unique = True                           # only extract unique items

s = Scraper(
    NUM_THREADS,
    ScraperWorker,
    dict(extractor_class=LinkExtractor)
)
s.scrape(
    base_url,
    extract_amount,
    max_depth,
    unique
)

or

python .\scraper\ https://www.ynetnews.com/ 5 2 true 

Arguments:

  • base_url (string): The URL of the website to be scraped.

  • extract_amount (positive integer): The number of items to extract during the scraping process.

  • max_depth (positive integer): The maximum depth of pages to scrape.

  • unique (boolean): Whether the scraper should extract only unique items. Accepted values are "True" or "False".
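As an illustration of how such validation can be expressed with argparse, a sketch along these lines would enforce the rules above. The helper names positive_int and str_to_bool are hypothetical; the repository's actual parser may differ.

import argparse

def positive_int(value):
    # Reject zero and negative values, not just non-integers.
    n = int(value)
    if n <= 0:
        raise argparse.ArgumentTypeError(f"{value!r} is not a positive integer")
    return n

def str_to_bool(value):
    # Accept "True"/"False" in any letter case from the command line.
    if value.lower() == "true":
        return True
    if value.lower() == "false":
        return False
    raise argparse.ArgumentTypeError(f"{value!r} is not a boolean")

parser = argparse.ArgumentParser(description="Web scraper with input validation")
parser.add_argument("base_url", type=str, help="URL of the website to scrape")
parser.add_argument("extract_amount", type=positive_int, help="number of items to extract")
parser.add_argument("max_depth", type=positive_int, help="maximum crawl depth")
parser.add_argument("unique", type=str_to_bool, help="extract unique items only (True/False)")
args = parser.parse_args()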
