Web Crawler - Proof of Concept

  • Current version: 1.1.0

Crawler POC using Scala and Akka Streams.

The crawling requests (URLs) are provided by a Google PubSub subscription. The crawler downloads and parses the HTML to extract the next-level URLs. Each new URL is published into the same PubSub topic to repeat the process. The raw HTML contents are dumped into Google Cloud Storage and the crawl-request information is stored in Cassandra.

What is a Web Crawler

A web crawler is a computer program that systematically browses the web. This particular example is used for web scraping (i.e. it downloads the pages' content). In essence, our crawler listens on a Google PubSub topic for crawl requests. Whenever a crawl request is received, the crawling process starts. The process is recursive up to a configurable depth level.
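
The actual message model lives in the source code; as a rough sketch only (names and fields here are illustrative, loosely mirroring the Cassandra schema further below), a crawl request and the depth-limited recursion could look like this:

import java.util.UUID

// Illustrative model only, not the repository's actual case classes.
final case class CrawlRequest(
  id: UUID,        // unique id of this request
  uri: String,     // URL to download
  depth: Int,      // current recursion level
  maxDepth: Int,   // no new requests are published beyond this level
  fromUrl: UUID    // id of the page where this URL was found
)

// A child request inherits maxDepth and increments depth; recursion stops at maxDepth.
def childRequests(parent: CrawlRequest, extractedUrls: List[String]): List[CrawlRequest] =
  if (parent.depth >= parent.maxDepth) Nil
  else extractedUrls.map { url =>
    CrawlRequest(UUID.randomUUID(), url, parent.depth + 1, parent.maxDepth, fromUrl = parent.id)
  }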

Crawling process overview:

a -> b -> c -> |-> d.1 -> e
               |-> d.2
               |-> d.3

Where:

  a   : get crawl-request from Google Cloud PubSub.
  b   : validate request and cache the info.
  c   : download content.
  d.1 : extract new url from content.
  d.2 : save crawl-request into Cassandra.
  d.3 : save content to Google Cloud Storage.
  e   : publish url as new crawl-requests into Google Cloud PubSub.  
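
A minimal Akka Streams sketch of this shape is shown below. The stage bodies and the source/sinks are placeholders (println instead of PubSub, Cassandra, and Cloud Storage), not the repository's actual graph:

import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}
import java.util.UUID

object CrawlPipelineSketch extends App {
  // Placeholder model, reusing the CrawlRequest shape sketched above.
  final case class CrawlRequest(id: UUID, uri: String, depth: Int, maxDepth: Int)
  final case class Page(request: CrawlRequest, html: String)

  implicit val system: ActorSystem = ActorSystem("crawler-sketch")

  val requests = List(CrawlRequest(UUID.randomUUID(), "https://example.com", 0, 2))

  def extractUrls(html: String): List[String] =
    List("https://example.com/about")                                       // stub for HTML parsing

  val validate = Flow[CrawlRequest].filter(r => r.depth <= r.maxDepth)              // (b)
  val download = Flow[CrawlRequest].map(r => Page(r, s"<html>${r.uri}</html>"))     // (c)
  val toCassandra = Sink.foreach[Page](p => println(s"cassandra: ${p.request.uri}")) // (d.2)
  val toStorage   = Sink.foreach[Page](p => println(s"storage: ${p.request.uri}"))   // (d.3)

  Source(requests)                                           // (a) in reality: a PubSub source
    .via(validate)
    .via(download)
    .alsoTo(toCassandra)                                     // (d.2)
    .alsoTo(toStorage)                                       // (d.3)
    .mapConcat(p => extractUrls(p.html))                     // (d.1)
    .runWith(Sink.foreach(url => println(s"publish: $url"))) // (e) in reality: a PubSub sink
}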

The raw contents from the web are stored in Google Cloud Storage. From there we can use a processing engine (e.g. Apache Spark) to analyze the contents or an indexing engine (e.g. Solr) to make the content easy to query.
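
For instance, a downstream batch job might read the dumped HTML straight from the bucket. This is only an illustration: the bucket path below is a placeholder, and it assumes Spark has the GCS connector on its classpath.

import org.apache.spark.sql.SparkSession

// Hypothetical downstream analysis job; "gs://your-bucket/raw-html/" is a placeholder path.
val spark = SparkSession.builder().appName("crawler-html-analysis").getOrCreate()
val pages = spark.sparkContext.wholeTextFiles("gs://your-bucket/raw-html/*")
println(s"Crawled pages: ${pages.count()}")
spark.stop()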

Tech Stack

To guarantee scalability and reliability, this web crawler is written in Scala using the Akka toolkit (i.e., actors and streams). Our implementation aims to follow the Reactive Manifesto.

This is the current (relevant) tech stack:

  • Scala
  • Akka (actors and streams)
  • Google Cloud PubSub
  • Google Cloud Storage
  • Apache Cassandra
  • sbt

Usage

Follow these steps to configure and run the app:

  1. Download this repo: git clone <remote>
  2. Create the configuration file and add credentials: cp src/main/resources/application.conf.example src/main/resources/application.conf
  3. Configure the Cassandra database (see below).
  4. Run the app: sbt run

Configure Cassandra

We use a single Cassandra table to register the URLs that have been crawled so far.

If you don't have Cassandra installed, you can follow this installation guide for Ubuntu (or another Debian-based OS).

To configure Cassandra for this use case, follow these instructions:

  • Start the Cassandra service:
sudo systemctl start cassandra.service
  • Open the Cassandra shell with cqlsh and create the keyspace:
CREATE KEYSPACE crawler WITH REPLICATION = {'class':'SimpleStrategy', 'replication_factor':1};
  • Create the table:
CREATE TABLE crawler.url (
  id uuid,
  uri text,
  depth int,
  max_depth int,
  from_url uuid,
  crawl_request_id uuid,
  timestamp timestamp,
  PRIMARY KEY (id, from_url, crawl_request_id)
);
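
As an illustration of how one crawled URL maps onto this table, a write with the DataStax Java driver could look like the following sketch (this is not the repository's actual persistence code; the connection defaults and all values are placeholders):

import com.datastax.oss.driver.api.core.CqlSession
import java.time.Instant
import java.util.UUID

// Sketch only: connects to a local Cassandra node and inserts one illustrative row.
val session = CqlSession.builder().build()   // defaults to 127.0.0.1:9042

val insert = session.prepare(
  """INSERT INTO crawler.url (id, uri, depth, max_depth, from_url, crawl_request_id, timestamp)
    |VALUES (?, ?, ?, ?, ?, ?, ?)""".stripMargin)

session.execute(insert.bind(
  UUID.randomUUID(),            // id
  "https://example.com",        // uri
  Int.box(0),                   // depth
  Int.box(3),                   // max_depth
  UUID.randomUUID(),            // from_url (placeholder)
  UUID.randomUUID(),            // crawl_request_id (placeholder)
  Instant.now()                 // timestamp
))

session.close()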

Contributions

Feel free to open issues or create PRs. Contact the authors for further information.

License

To be defined.