
Scraper Exercise

This repository provides an example of a Command Line Interface (CLI) tool for web scraping with input validation. The program is written in Python; users control the scraping process through input arguments supplied on the command line, which are parsed and validated with the argparse module.

Implementation Notes

  • Iterative algorithm - uses fewer system resources than recursion, so the upper bound on memory usage (and therefore on crawl size) is significantly higher
  • Online algorithm - the queue of jobs can grow or shrink dynamically, since the whole working set is not known in advance
  • Multithreading with worker-pool execution - downloading the HTML is external I/O and causes significant slowdowns, so a pool of worker threads speeds up execution considerably (a minimal sketch of this design follows the list)
  • Synchronization mechanisms for multithreading (no busy-waiting) - non-atomic operations on the shared data structures, and communication between the threads, are synchronized with proper primitives rather than polling
  • Wrapper classes that create an abstraction for the end user - ease of use, abstraction, facade
  • Additional extra arguments inside the code - allow more control over scalability and fine-tuning for specific requirements
  • Dynamic resource scaling based on usage - because the algorithm is online, system usage scales with the available workload up to an upper bound
  • CLI usage - clean command-line usage and integration
  • Import vs. run as main - the code can be used via an import or run directly from the command line as the main module, with no imports required
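For illustration, a minimal sketch of this iterative, queue-driven worker-pool design might look as follows. This is not the repository's implementation: the names crawl and fetch_links are hypothetical, link extraction is reduced to a crude regex standing in for LinkExtractor, and a depth limit stands in for the real extract_amount/unique controls.

import queue
import re
import threading
import urllib.request

def fetch_links(url):
    # Download a page and return the absolute links found in it.
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return re.findall(r'href="(https?://[^"]+)"', html)

def crawl(base_url, max_depth, num_threads=4):
    jobs = queue.Queue()          # online job queue; blocking get() means no busy-waiting
    seen = set()
    seen_lock = threading.Lock()  # guards the non-atomic check-then-add on `seen`

    def worker():
        while True:
            item = jobs.get()     # blocks until work (or a shutdown sentinel) arrives
            if item is None:
                jobs.task_done()
                return
            url, depth = item
            try:
                for link in fetch_links(url):
                    with seen_lock:
                        if link in seen:
                            continue
                        seen.add(link)
                    if depth + 1 < max_depth:
                        jobs.put((link, depth + 1))  # the queue grows as work is discovered
            except OSError:
                pass              # a failed download just drops that job
            finally:
                jobs.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(num_threads)]
    for t in threads:
        t.start()
    jobs.put((base_url, 0))
    jobs.join()                   # wait until every queued job has been processed
    for _ in threads:
        jobs.put(None)            # one shutdown sentinel per worker
    for t in threads:
        t.join()
    return seen

Because queue.Queue.get() blocks on a condition variable internally, idle workers sleep instead of spinning, which is the "no busy-waiting" point above.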

Prerequisites

  • Python 3.10 or later installed on your system.

Getting Started

  1. Clone the repository or copy the code to your local machine.
  2. (Recommended) Create a virtual environment (for example with conda or venv):
python -m venv venv
  3. Install the required Python dependencies:
pip install -r ./requirements/publish.txt

Usage

from scraper import Scraper, ScraperWorker, LinkExtractor

NUM_THREADS = 4
base_url = "https://www.ynetnews.com/"  # URL of the website to scrape
extract_amount = 5                      # number of items to extract
max_depth = 2                           # maximum crawl depth
unique = True                           # only extract unique items

s = Scraper(
    NUM_THREADS,
    ScraperWorker,
    dict(extractor_class=LinkExtractor)
)
s.scrape(
    base_url,
    extract_amount,
    max_depth,
    unique
)

or

python .\scraper\ https://www.ynetnews.com/ 5 2 true 

Arguments:

  • base_url (string): The URL of the website to be scraped.

  • extract_amount (positive integer): The number of items to extract during the scraping process.

  • max_depth (positive integer): The maximum depth of pages to scrape.

  • unique (boolean): Whether the scraper should extract only unique items. Accepted values are "True" or "False".
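As an illustration of how such validation can be expressed with argparse, a sketch along these lines would enforce the rules above. The helper names positive_int and str_to_bool are hypothetical; the repository's actual parser may differ.

import argparse

def positive_int(value):
    # Reject zero and negative values, not just non-integers.
    n = int(value)
    if n <= 0:
        raise argparse.ArgumentTypeError(f"{value!r} is not a positive integer")
    return n

def str_to_bool(value):
    # Accept "True"/"False" in any letter case from the command line.
    if value.lower() == "true":
        return True
    if value.lower() == "false":
        return False
    raise argparse.ArgumentTypeError(f"{value!r} is not a boolean")

parser = argparse.ArgumentParser(description="Web scraper with input validation")
parser.add_argument("base_url", type=str, help="URL of the website to scrape")
parser.add_argument("extract_amount", type=positive_int, help="number of items to extract")
parser.add_argument("max_depth", type=positive_int, help="maximum crawl depth")
parser.add_argument("unique", type=str_to_bool, help="extract unique items only (True/False)")
args = parser.parse_args()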
