- Current version: 1.1.0
Crawler POC using Scala and Akka Streams.
The crawl requests (URLs) are provided by a Google PubSub subscription. The crawler downloads and parses the HTML to extract the next-level URLs. Each new URL is published to the same PubSub topic to repeat the process. The raw HTML content is dumped into Google Cloud Storage, and the crawl-request information is stored in Cassandra.
A web crawler is a computer program that systematically browses the web. This particular example is used for web scraping (i.e., it downloads the web's content). In essence, our crawler listens on a Google PubSub topic for crawl requests. Whenever a crawl request is received, the crawling process starts. The process is recursive up to a configured depth level.
Crawling process overview:
```
a -> b -> c -> |-> d.1 -> e
               |-> d.2
               |-> d.3
```
Where:
a : get crawl-request from Google Cloud PubSub.
b : validate request and cache the info.
c : download content.
d.1 : extract new URLs from the content.
d.2 : save crawl-request into Cassandra.
d.3 : save content to Google Cloud Storage.
e : publish each URL as a new crawl request into Google Cloud PubSub.
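The core stages above can be sketched as plain Scala functions. This is an illustrative model only (the names, types, and the naive `href` regex are assumptions, not the actual implementation); the real pipeline wires these stages together with Akka Streams:

```scala
// Illustrative sketch of stages b, d.1 and e; names and types are assumptions.
final case class CrawlRequest(url: String, depth: Int, maxDepth: Int)

object CrawlPipeline {
  // b: validate the request (here only the depth limit is checked)
  def validate(req: CrawlRequest): Option[CrawlRequest] =
    if (req.depth <= req.maxDepth) Some(req) else None

  // d.1: extract next-level URLs from the downloaded HTML (naive href scan;
  // a real crawler would use an HTML parser and resolve relative links)
  private val Href = """href=["']([^"']+)["']""".r
  def extractUrls(html: String): List[String] =
    Href.findAllMatchIn(html).map(_.group(1)).toList

  // e: turn each extracted URL into a new, one-level-deeper crawl request
  def nextRequests(req: CrawlRequest, html: String): List[CrawlRequest] =
    extractUrls(html).map(u => CrawlRequest(u, req.depth + 1, req.maxDepth))
}
```

Note the recursion boundary: a request at the maximum depth is still crawled, but the requests it produces fail `validate` and are dropped, which is what bounds the process.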
The raw contents from the web are stored in Google Cloud Storage. From there we can use a processing engine (e.g. Apache Spark) to analyze the contents, or an indexing engine (e.g. Solr) to make the content easy to query.
To provide scalability and reliability, this web crawler is written in Scala using the Akka toolkit (i.e., actors and streams). Our implementation aims to follow the Reactive Manifesto.
This is a list of the current (relevant) tech stack:
- Scala
- SBT
- Akka (actors + streams)
- Cassandra
- Google Cloud PubSub
- Google Cloud Storage
Follow these steps to configure and run the app:
- Download this repo:
git clone <remote>
- Create the configuration file and add credentials:
cp src/main/resources/application.conf.example src/main/resources/application.conf
- You will need to create a service account for Google Cloud PubSub and Google Cloud Storage.
- Configure Cassandra database.
- Run the app:
sbt run
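The copied example file defines the exact configuration keys. As a rough sketch, the resulting application.conf will carry values along these lines (every key name below is illustrative; check application.conf.example for the real ones):

```
# Illustrative only; the real keys live in application.conf.example.
crawler {
  max-depth = 3
}
gcp {
  project-id       = "my-project"
  pubsub.topic     = "crawl-requests"
  storage.bucket   = "raw-html"
  credentials-file = "/path/to/service-account.json"
}
cassandra {
  contact-point = "127.0.0.1"
  keyspace      = "crawler"
}
```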
We use a single Cassandra table to register the URLs that have been crawled so far.
If you don't have Cassandra installed, you can follow this installation guide for Ubuntu (or other Debian-based OSes).
To configure Cassandra for this use case, follow these instructions:
- Start the Cassandra service:
sudo systemctl start cassandra.service
- Open the Cassandra shell with
cqlsh
and create the keyspace:
CREATE KEYSPACE crawler WITH REPLICATION = {'class':'SimpleStrategy', 'replication_factor':1};
- Create the table:
CREATE TABLE crawler.url (
    id uuid,
    uri text,
    depth int,
    max_depth int,
    from_url uuid,
    crawl_request_id uuid,
    timestamp timestamp,
    PRIMARY KEY (id, from_url, crawl_request_id)
);
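As a quick sanity check that the keyspace and table are in place, you can insert and read back a row by hand from cqlsh (the values below are made up):

```
-- Illustrative values only.
INSERT INTO crawler.url (id, uri, depth, max_depth, from_url, crawl_request_id, timestamp)
VALUES (uuid(), 'https://example.com', 0, 3, uuid(), uuid(), toTimestamp(now()));

SELECT uri, depth FROM crawler.url;
```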
Feel free to add issues or create PRs. Contact the authors for further information.
To be defined.