Web Crawler - Proof of Concept

  • Current version: 1.1.0

Crawler POC using Scala and Akka Streams.

The crawling requests (URLs) are provided by a Google PubSub subscription. The crawler downloads and parses the HTML to extract the next-level URLs. Each new URL is published into the same PubSub topic to repeat the process. The raw HTML contents are dumped into Google Cloud Storage and the crawl-request information is stored in Cassandra.

What is a Web Crawler

A web crawler is a computer program that systematically browses the web. This particular example is used for web scraping (i.e. it downloads the pages' content). In essence, our crawler listens on a Google PubSub topic for crawl requests. Whenever a crawl request is received, the crawling process starts. The process is recursive up to a configurable depth level.
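
The actual message model lives in the source code; as a rough sketch only (names and fields here are illustrative, loosely mirroring the Cassandra schema further below), a crawl request and the depth-limited recursion could look like this:

import java.util.UUID

// Illustrative model only, not the repository's actual case classes.
final case class CrawlRequest(
  id: UUID,        // unique id of this request
  uri: String,     // URL to download
  depth: Int,      // current recursion level
  maxDepth: Int,   // no new requests are published beyond this level
  fromUrl: UUID    // id of the page where this URL was found
)

// A child request inherits maxDepth and increments depth; recursion stops at maxDepth.
def childRequests(parent: CrawlRequest, extractedUrls: List[String]): List[CrawlRequest] =
  if (parent.depth >= parent.maxDepth) Nil
  else extractedUrls.map { url =>
    CrawlRequest(UUID.randomUUID(), url, parent.depth + 1, parent.maxDepth, fromUrl = parent.id)
  }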

Crawling process overview:

a -> b -> c -> |-> d.1 -> e
               |-> d.2
               |-> d.3

Where:

  a   : get crawl-request from Google Cloud PubSub.
  b   : validate request and cache the info.
  c   : download content.
  d.1 : extract new url from content.
  d.2 : save crawl-request into Cassandra.
  d.3 : save content to Google Cloud Storage.
  e   : publish url as new crawl-requests into Google Cloud PubSub.  
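
A minimal Akka Streams sketch of this shape is shown below. The stage bodies and the source/sinks are placeholders (println instead of PubSub, Cassandra, and Cloud Storage), not the repository's actual graph:

import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}
import java.util.UUID

object CrawlPipelineSketch extends App {
  // Placeholder model, reusing the CrawlRequest shape sketched above.
  final case class CrawlRequest(id: UUID, uri: String, depth: Int, maxDepth: Int)
  final case class Page(request: CrawlRequest, html: String)

  implicit val system: ActorSystem = ActorSystem("crawler-sketch")

  val requests = List(CrawlRequest(UUID.randomUUID(), "https://example.com", 0, 2))

  def extractUrls(html: String): List[String] =
    List("https://example.com/about")                                       // stub for HTML parsing

  val validate = Flow[CrawlRequest].filter(r => r.depth <= r.maxDepth)              // (b)
  val download = Flow[CrawlRequest].map(r => Page(r, s"<html>${r.uri}</html>"))     // (c)
  val toCassandra = Sink.foreach[Page](p => println(s"cassandra: ${p.request.uri}")) // (d.2)
  val toStorage   = Sink.foreach[Page](p => println(s"storage: ${p.request.uri}"))   // (d.3)

  Source(requests)                                           // (a) in reality: a PubSub source
    .via(validate)
    .via(download)
    .alsoTo(toCassandra)                                     // (d.2)
    .alsoTo(toStorage)                                       // (d.3)
    .mapConcat(p => extractUrls(p.html))                     // (d.1)
    .runWith(Sink.foreach(url => println(s"publish: $url"))) // (e) in reality: a PubSub sink
}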

The raw contents from the web are stored in Google Cloud Storage. From there we can use a processing engine (e.g. Apache Spark) to analyze the contents or an indexing engine (e.g. Solr) to make the content easy to query.
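
For instance, a downstream batch job might read the dumped HTML straight from the bucket. This is only an illustration: the bucket path below is a placeholder, and it assumes Spark has the GCS connector on its classpath.

import org.apache.spark.sql.SparkSession

// Hypothetical downstream analysis job; "gs://your-bucket/raw-html/" is a placeholder path.
val spark = SparkSession.builder().appName("crawler-html-analysis").getOrCreate()
val pages = spark.sparkContext.wholeTextFiles("gs://your-bucket/raw-html/*")
println(s"Crawled pages: ${pages.count()}")
spark.stop()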

Tech Stack

To guarantee scalability and reliability, this web crawler is written in Scala using the Akka toolkit (i.e., actors and streams). Our implementation aims to follow the Reactive Manifesto.

This is the current (relevant) tech stack:

  • Scala
  • Akka (actors and streams)
  • Google Cloud PubSub
  • Google Cloud Storage
  • Apache Cassandra
  • sbt

Usage

Follow these steps to configure and run the app:

  1. Download this repo: git clone <remote>
  2. Create the configuration file and add credentials: cp src/main/resources/application.conf.example src/main/resources/application.conf
  3. Configure the Cassandra database (see below).
  4. Run the app: sbt run

Configure Cassandra

We use a single Cassandra table to register the URLs that have been crawled so far.

If you don't have Cassandra installed, you can follow this installation guide for Ubuntu (or another Debian-based OS).

To configure Cassandra for this use case, follow these instructions:

  • Start the Cassandra service:
sudo systemctl start cassandra.service
  • Open the Cassandra shell with cqlsh and create the keyspace:
CREATE KEYSPACE crawler WITH REPLICATION = {'class':'SimpleStrategy', 'replication_factor':1};
  • Create the table:
CREATE TABLE crawler.url (
  id uuid,
  uri text,
  depth int,
  max_depth int,
  from_url uuid,
  crawl_request_id uuid,
  timestamp timestamp,
  PRIMARY KEY (id, from_url, crawl_request_id)
);
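
As an illustration of how one crawled URL maps onto this table, a write with the DataStax Java driver could look like the following sketch (this is not the repository's actual persistence code; the connection defaults and all values are placeholders):

import com.datastax.oss.driver.api.core.CqlSession
import java.time.Instant
import java.util.UUID

// Sketch only: connects to a local Cassandra node and inserts one illustrative row.
val session = CqlSession.builder().build()   // defaults to 127.0.0.1:9042

val insert = session.prepare(
  """INSERT INTO crawler.url (id, uri, depth, max_depth, from_url, crawl_request_id, timestamp)
    |VALUES (?, ?, ?, ?, ?, ?, ?)""".stripMargin)

session.execute(insert.bind(
  UUID.randomUUID(),            // id
  "https://example.com",        // uri
  Int.box(0),                   // depth
  Int.box(3),                   // max_depth
  UUID.randomUUID(),            // from_url (placeholder)
  UUID.randomUUID(),            // crawl_request_id (placeholder)
  Instant.now()                 // timestamp
))

session.close()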

Contributions

Feel free to open issues or create PRs. Contact the authors for further information.

License

To be defined.