
python-utilities

dfs.py

Concurrency Constructs:

The provided Python script uses several concurrency constructs to crawl a website efficiently and build a sitemap:

  1. Threading: The script uses Python's built-in threading module, together with concurrent.futures, to create and manage threads. Three concurrency constructs are employed:

    • Locks: Locks are used for synchronization in multithreaded environments. In this script, threading.Lock() is utilized to protect shared data structures. Specifically, visited_links_lock and sitemap_links_lock are used to ensure thread-safe access to the visited_links and sitemap_links sets, respectively.

    • Condition Variables: A condition variable (sitemap_ready_condition) is employed for synchronization and signaling between threads. It ensures that the sitemap is written only when all links have been processed. The sitemap_ready_condition.wait() method is used to pause the main thread until the condition is met, and sitemap_ready_condition.notify() is used to signal when the sitemap is ready to be written.

    • ThreadPoolExecutor: The concurrent.futures.ThreadPoolExecutor manages a pool of worker threads, allowing the script to crawl multiple links in parallel. The crawl_website function submits tasks to the pool using executor.submit. A sketch combining these three constructs follows this list.
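
A minimal sketch of how these three constructs fit together. The shared-state names (visited_links_lock, sitemap_links_lock, sitemap_ready_condition) follow the description above, while the pending-task counter, helper functions, worker count, and start URL are illustrative assumptions rather than the script's actual code:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

visited_links = set()
visited_links_lock = threading.Lock()

sitemap_links = set()
sitemap_links_lock = threading.Lock()

sitemap_ready_condition = threading.Condition()
pending_tasks = 0  # hypothetical completion counter; the real script may track this differently

def mark_visited(url):
    """Thread-safe check-and-add: True only the first time a URL is seen."""
    with visited_links_lock:
        if url in visited_links:
            return False
        visited_links.add(url)
        return True

def record_link(url):
    """Thread-safe insertion into the shared sitemap set."""
    with sitemap_links_lock:
        sitemap_links.add(url)

def crawl(url, executor):
    global pending_tasks
    try:
        if mark_visited(url):
            record_link(url)
            # ... fetch the page here and call submit_crawl() on each discovered link ...
    finally:
        with sitemap_ready_condition:
            pending_tasks -= 1
            if pending_tasks == 0:
                sitemap_ready_condition.notify()  # all work done; wake the waiting thread

def submit_crawl(url, executor):
    global pending_tasks
    with sitemap_ready_condition:
        pending_tasks += 1
    executor.submit(crawl, url, executor)

with ThreadPoolExecutor(max_workers=8) as executor:
    submit_crawl("https://example.com", executor)
    with sitemap_ready_condition:
        sitemap_ready_condition.wait_for(lambda: pending_tasks == 0)
# sitemap_links now holds every collected URL; safe to write from a single thread.
```

The counter lets the main thread block on sitemap_ready_condition until every submitted task has finished, which is the wait/notify signaling pattern the bullets above describe.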

Depth-First Search (DFS) Algorithm:

The crawling logic in this script is based on the Depth-First Search (DFS) algorithm, which is a fundamental graph traversal algorithm. In the context of web crawling, the DFS algorithm is adapted as follows:

  • Starting Point: The DFS starts at the base_url, which serves as the root of the traversal.

  • Visited Links Set: To prevent revisiting the same URL, a visited_links set is maintained. When a link is visited, it is added to this set.

  • Recursion: The core of the DFS lies in the recursive nature of the crawl_website function. For each valid link found on a page, a new task is submitted to the thread pool to crawl that link. This process continues until the specified depth (max_depth) is reached or there are no more links to explore.

  • Unique Links Set: To ensure that the sitemap contains only unique links, a unique_links set is maintained. This set stores all unique links encountered during the crawling process.

  • Sitemap Creation: The sitemap is constructed in XML format (sitemap.xml) and is written only once all links have been processed. The XML tree is built from the unique links, and the sitemap file is written by a single thread to avoid concurrency issues; see the sketch after this list.
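
For reference, here is a sequential sketch of the DFS adaptation and sitemap writing described above; a threaded version would submit each recursive call via executor.submit instead. The crawl_website, max_depth, and sitemap.xml names come from this section, while the HTTP fetching (requests) and the naive link extraction are assumptions:

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlparse

import requests  # assumed HTTP dependency

visited_links = set()
unique_links = set()

def extract_hrefs(html):
    # Naive href extraction, for illustration only; the real script may use an HTML parser.
    return re.findall(r'href=["\'](.*?)["\']', html)

def crawl_website(url, base_url, depth, max_depth):
    """DFS: visit url, record it, then recurse into each unseen same-site link."""
    if depth > max_depth or url in visited_links:
        return
    visited_links.add(url)
    unique_links.add(url)
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        return
    for href in extract_hrefs(html):
        link = urljoin(url, href)
        if urlparse(link).netloc == urlparse(base_url).netloc:  # stay on the site
            crawl_website(link, base_url, depth + 1, max_depth)

def write_sitemap(links, path="sitemap.xml"):
    """Build the XML tree from the unique links; written from a single thread."""
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for link in sorted(links):
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = link
    ET.ElementTree(urlset).write(path, encoding="utf-8", xml_declaration=True)

crawl_website("https://example.com", "https://example.com", depth=0, max_depth=2)
write_sitemap(unique_links)
```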

Overall, this script combines concurrency constructs with the DFS algorithm to efficiently crawl a website, collect links, and create a sitemap, making it a useful tool for various web-related tasks such as SEO analysis and data extraction.
