crawlr

crawlr recursively crawls URLs.

Usage

Build & run

Make sure the Docker daemon is running and that ports 8000 and 5432 aren't already in use by other local services before running the command below; otherwise the services will fail to start. The command spins up a Postgres container, a container running the crawlr API (serving on localhost:8000/), and a worker container for the crawler.

DSN="postgresql://user:test@db:5432/crawlr" MAX_WORKERS=30 docker-compose -f ./deployments/docker-compose.yml up --build

Environment variable meanings:

  • DSN: connection string for the API server and crawler to connect to the database
  • MAX_WORKERS: specifies the number of workers that the crawler can spin up at a time
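
As a rough sketch of how these get consumed, the crawler might read them at startup along these lines (the function name and fallback default below are illustrative, not taken from this repo):

package crawler

import (
	"log"
	"os"
	"strconv"
)

// loadConfig is an illustrative sketch of reading the environment variables
// documented above; the function name and the fallback default are assumptions.
func loadConfig() (dsn string, maxWorkers int) {
	dsn = os.Getenv("DSN")
	if dsn == "" {
		log.Fatal("DSN must be set, e.g. postgresql://user:test@db:5432/crawlr")
	}
	maxWorkers = 30 // assumed fallback when MAX_WORKERS is unset
	if v := os.Getenv("MAX_WORKERS"); v != "" {
		n, err := strconv.Atoi(v)
		if err != nil || n < 1 {
			log.Fatalf("MAX_WORKERS must be a positive integer, got %q", v)
		}
		maxWorkers = n
	}
	return dsn, maxWorkers
}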

To destroy and recreate the database/clean up:

docker container prune && docker volume prune

To run all tests:

go test -v ./...

API

POST /crawl

Request

{
  "url": "mlyzhng.com",
  "levels": 2
}
  • url string: Represents the URL to crawl.
  • levels int: Represents the number of levels of recursion.

Response

{
  "crawl_request_id": 1,
  "url": "mlyzhng.com",
  "levels": 2
}

  • crawl_request_id int: Represents the ID of the created CrawlRequest.
  • url string: Represents the URL to crawl.
  • levels int: Represents the number of levels of recursion.

Example

To create a CrawlRequest starting from mlyzhng.com with 2 levels of recursion:

curl localhost:8000/crawl --data '{"url": "mlyzhng.com", "levels": 2}' | jq
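
For reference, the request and response bodies map onto Go types roughly like this (the struct names are illustrative; only the JSON field names come from the examples above):

package api

// Illustrative types for the /crawl endpoint. Field names match the JSON
// shown above; the struct names themselves are assumptions.
type CrawlRequestBody struct {
	URL    string `json:"url"`    // URL to crawl
	Levels int    `json:"levels"` // number of levels of recursion
}

type CrawlResponseBody struct {
	CrawlRequestID int    `json:"crawl_request_id"` // ID of the created CrawlRequest
	URL            string `json:"url"`
	Levels         int    `json:"levels"`
}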

GET /status/:id

Response

{
  "crawl_request_id": 4,
  "url": "http://mlyzhng.com",
  "completed": 19,
  "in_progress": 0,
  "failed": 1,
  "total": 20
}
  • crawl_request_id int: Represents the ID of the created CrawlRequest.
  • url string: Represents the URL to crawl.
  • completed int: Represents the number of tasks completed.
  • in_progress int: Represents the number of tasks in progress.
  • failed int: Represents the number of tasks failed.
  • total int: Represents the number of tasks attempted in total.

Example

To check the status of the CrawlRequest with id 1:

curl localhost:8000/status/1 | jq

GET /results/:id

Response

{
  "avatars2.githubusercontent.com": 1,
  "dashboard.lob.com": 5,
  "developer.github.com": 1,
  "education.github.com": 2,
  "enterprise.github.com": 1,
  "github.blog": 1,
  "github.com": 12,
  "github.community": 1,
  "githubstatus.com": 1,
  "go.lob.com": 2,
  "help.github.com": 7,
  "i.imgur.com": 11,
  "images-wixmp-ed30a86b8c4ca887773594c2.wixmp.com": 31,
  "lab.github.com": 1,
  "library.lob.com": 4,
  "lob.com": 74,
  "opensource.guide": 1,
  "status.lob.com": 1,
  "support.lob.com": 3,
  "training.github.com": 1,
  "twitter.com": 1,
  "www.facebook.com": 1,
  "www.github.com": 79,
  "www.goodreads.com": 8,
  "www.instagram.com": 2,
  "www.linkedin.com": 3
}

Returns a JSON object mapping each unique hostname found while crawling to the number of times it was seen (excluding the hostname of the original URL supplied when the CrawlRequest was created).

Example

To check the results of the CrawlRequest with id 1:

curl localhost:8000/results/1 | jq

Design Decisions

My design is based on the idea that the internet can be represented as a graph, where pages are nodes and a link on a page represents a directed edge from one page to another (for example, if https://google.com had a link to https://maps.google.com in its HTML body, I would create a directed edge from the page node https://google.com to the page node https://maps.google.com).

I cut down on the links I crawled by ignoring any URLs with non-HTTP/HTTPS protocols, as well as stripping all URLs of their fragments. However, I considered URLs with different protocols to be different pages, even if they otherwise had the same host name, path, and query (e.g., https://google.com was treated as a different page than http://google.com).
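
A minimal sketch of that filtering/normalization using the standard net/url package (the function here is mine, not necessarily how the repo implements it):

package crawler

import "net/url"

// normalizeURL drops the fragment from a raw URL and rejects anything that
// isn't plain http or https. Scheme, host, path, and query are all kept, so
// https://google.com and http://google.com stay distinct pages.
func normalizeURL(raw string) (string, bool) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", false
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", false
	}
	u.Fragment = "" // strip #fragment; everything else stays as-is
	return u.String(), true
}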

Based on the requirements given, I optimized for scalability at the expense of adding a little additional complexity into the system. There are 3 overall parts to this system: the API server, the crawler, and the database. The API server and crawler each live in their own container and are stateless, so they can be scaled up or down as needed.

API Server

When a request to the /crawl endpoint is made, the API server creates a CrawlRequest and inserts the relevant information into the crawl_requests table. The server also creates a task representing the first page node to be crawled (containing the ID of the associated CrawlRequest, the page URL to be crawled, the current level of recursion (0), and the task status NOT_STARTED) and stores it in the tasks table.
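
Sketched as two Postgres inserts (column names beyond the ones described above are guesses at the schema, and a real implementation would likely wrap this in a transaction):

package api

import (
	"context"
	"database/sql"
)

// createCrawlRequest is an illustrative sketch of the /crawl write path:
// insert a crawl_requests row, then seed the tasks table with the level-0
// task. Column names other than those described above are assumptions.
func createCrawlRequest(ctx context.Context, db *sql.DB, url string, levels int) (int64, error) {
	var id int64
	err := db.QueryRowContext(ctx,
		`INSERT INTO crawl_requests (url, levels) VALUES ($1, $2) RETURNING id`,
		url, levels).Scan(&id)
	if err != nil {
		return 0, err
	}
	_, err = db.ExecContext(ctx,
		`INSERT INTO tasks (crawl_request_id, url, level, status)
		 VALUES ($1, $2, 0, 'NOT_STARTED')`,
		id, url)
	return id, err
}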

When a request to the /status/:id endpoint is made, the API server retrieves all tasks associated with the CrawlRequest ID up to the second-to-last level, counts how many tasks are COMPLETED, FAILED, or IN_PROGRESS, and returns these metrics.
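
In rough terms, that aggregation could be a single grouped count (a sketch only; the level filter mentioned above is omitted for brevity):

package api

import (
	"context"
	"database/sql"
)

// countTaskStatuses sketches the /status aggregation: group one CrawlRequest's
// tasks by status and count each bucket. Table and column names follow the
// description above; the query itself is an assumption.
func countTaskStatuses(ctx context.Context, db *sql.DB, crawlRequestID int64) (map[string]int, error) {
	rows, err := db.QueryContext(ctx,
		`SELECT status, COUNT(*) FROM tasks WHERE crawl_request_id = $1 GROUP BY status`,
		crawlRequestID)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	counts := map[string]int{}
	for rows.Next() {
		var status string
		var n int
		if err := rows.Scan(&status, &n); err != nil {
			return nil, err
		}
		counts[status] = n
	}
	return counts, rows.Err()
}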

When a request to the /results/:id endpoint is made, the API server retrieves all tasks associated with the CrawlRequest id and counts the unique hosts seen. This works because the tasks represent all the pages that were crawled through the graph for a single CrawlRequest.
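
A simplified version of that counting step (assuming the crawled page URLs are already in hand; the function and parameter names are illustrative):

package api

import "net/url"

// countHosts tallies hostnames from crawled page URLs, skipping the host of
// the original URL supplied when the CrawlRequest was created. This mirrors
// the /results response shape, but it is only a sketch.
func countHosts(pageURLs []string, originalHost string) map[string]int {
	counts := map[string]int{}
	for _, raw := range pageURLs {
		u, err := url.Parse(raw)
		if err != nil || u.Hostname() == "" || u.Hostname() == originalHost {
			continue
		}
		counts[u.Hostname()]++
	}
	return counts
}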

Crawler

The crawler is responsible for unfolding the graph that the API server relies on. The crawler continually retrieves more incomplete tasks from the tasks table. For every task, the crawler will check if a page node has already been created for the page URL in the page_nodes table.

If no such page node exists, the worker will create a page node, insert it into the page_nodes table, and make a GET request to the URL. It will then create new page nodes for all the new hosts found, as well as edges from the current page node to the newly created page nodes. Otherwise, if a page node has already been created for the URL, the worker retrieves all page nodes with edges whose source node is the current page node.

Then the crawler determines, based on the current level and the total number of expected levels for the CrawlRequest, whether to create more tasks. If so, the crawler takes the URLs of the pages it found (either from the existing edges or from the GET request) and inserts new tasks into the database, incrementing the level of recursion by 1. Once all new tasks (if any) are inserted into the database, the worker marks the current task as COMPLETED.
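
Condensed into pseudocode-style Go, the per-task flow described above looks roughly like this (the Task fields and every helper called here are stand-ins for the real database and HTTP code, and the exact level boundary check depends on how levels are counted):

package crawler

// Task mirrors the task fields described above; the names are illustrative.
type Task struct {
	ID             int64
	CrawlRequestID int64
	URL            string
	Level          int
	TotalLevels    int
}

// processTask sketches the per-task flow: reuse an existing page node's edges
// if the URL was crawled before, otherwise fetch the page and create nodes and
// edges, then fan out new tasks if the recursion depth allows it.
func processTask(t Task) error {
	var nextURLs []string

	node, found, err := lookupPageNode(t.URL) // check page_nodes for this URL
	if err != nil {
		return err
	}
	if found {
		// Already crawled: reuse the edges whose source is this page node.
		nextURLs, err = linkedPageURLs(node)
	} else {
		// New page: insert a page node, GET the URL, then create nodes and
		// edges for everything found in the response body.
		nextURLs, err = fetchAndStorePage(t.URL)
	}
	if err != nil {
		return err
	}

	// Only fan out if the requested depth hasn't been reached yet.
	if t.Level+1 < t.TotalLevels {
		if err := insertTasks(t.CrawlRequestID, nextURLs, t.Level+1); err != nil {
			return err
		}
	}
	return markCompleted(t)
}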

Deployment to Production

Since everything is containerized, it should be relatively simple to deploy this to whatever container orchestration system is preferred (Kubernetes, ECS, etc.). We could use a CI/CD tool like CircleCI or GitHub Actions to run unit and integration tests every time a new feature branch is merged into master, and to automatically deploy to a staging or testing environment where an engineer could do additional QA or testing if necessary. Finally, the engineer can manually deploy changes to a production environment after the new changes have been tested to their satisfaction.

We might also want to use a hosted solution for the database instead of maintaining our own, such as AWS RDS or Google Cloud SQL. (I also barely set up the database in this solution, so for security reasons I'd change the Dockerfile to pass in the database username and password as build arguments instead of specifying them in the file itself.)

In order to scale this, we'd need to put a load balancer in front of the API server to handle requests (and probably set up a domain for the API server, as well as provision TLS/SSL certs if we want the API to support HTTPS).

Future Work

For simplicity's sake, I built this system with the invariant that once a page has been crawled, it'll never be updated again. This means that even if a page has changed since the last time we crawled it, the system doesn't update the corresponding nodes in the graph. This increases the speed of individual crawl requests but decreases the accuracy of returned results. Future work could include checking the Last-Modified, Expires, Cache-Control, or ETag headers for a page to determine whether we should update the page node and its edges (or keeping a timestamp for when we last crawled a page node and always recrawling after a certain amount of time has passed). This would make the returned results more accurate.

I don't have graceful server shutdowns or worker retries implemented, so if a worker dies in the middle of completing a task, the CrawlRequest related to the task will never be completed. I could create an additional service or function to find tasks that have been IN_PROGRESS for more than a specified amount of time, and restart that task by setting the task status back to NOT_STARTED.
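
One low-effort version of that recovery job could be a periodic query along these lines (the started_at column and the query itself are assumptions about the schema):

package crawler

import (
	"context"
	"database/sql"
	"time"
)

// requeueStaleTasks sketches the recovery idea above: any task stuck
// IN_PROGRESS for longer than maxAge gets reset to NOT_STARTED so another
// worker can pick it up.
func requeueStaleTasks(ctx context.Context, db *sql.DB, maxAge time.Duration) (int64, error) {
	res, err := db.ExecContext(ctx,
		`UPDATE tasks SET status = 'NOT_STARTED'
		 WHERE status = 'IN_PROGRESS' AND started_at < $1`,
		time.Now().Add(-maxAge))
	if err != nil {
		return 0, err
	}
	return res.RowsAffected()
}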

I currently don't cache or save computed results for a CrawlRequest anywhere, so in the future I could add a column to the crawl_requests table to store this information and make results return faster.

Other interesting things I'd like to add: distributed request tracing throughout the API and tasks, better logging for the crawler/API (more informative JSON error messages), metrics for how long it takes to crawl an average page, etc. I'd also like to add more thorough test coverage. Currently I'm only testing HTTP response body HTML parsing and URL filtering functionality, since I thought it would take too much time to figure out how to mock the database client for unit testing functions that involve database access.

In the future, to make this a more polite web crawler, it'd be neat to add ways for the crawler to rate-limit itself and to follow robots.txt/metadata so that it only crawls content the site owner wants to be crawled.

Thoughts on Scalability

The API and crawler worker should both be easy to scale horizontally, as they are stateless and rely on the database to store state. They can be scaled up by increasing the number of replicas in the config for whichever container orchestration platform is chosen (for example, in Kubernetes we would create a Deployment object for the API and one for the crawler and specify the number of replicas we want).

The number of workers in each crawler container is also scalable, since we can adjust the number of workers per crawler via a command line flag. I'm slightly worried about how to make the database more scalable; I don't have as much experience there, but I'd love to learn more about it.
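
For the per-container worker count specifically, the basic shape is spinning up that many goroutines that all pull tasks until there's nothing left to claim (a sketch, with claimTask and processTask as hypothetical helpers):

package crawler

import "sync"

// runWorkers sketches scaling workers inside one crawler container:
// maxWorkers goroutines each loop over claimTask/processTask until the task
// source reports nothing left to do.
func runWorkers(maxWorkers int) {
	var wg sync.WaitGroup
	for i := 0; i < maxWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				t, ok := claimTask() // e.g. a SELECT ... FOR UPDATE SKIP LOCKED against the tasks table
				if !ok {
					return
				}
				_ = processTask(t)
			}
		}()
	}
	wg.Wait()
}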

Learnings

My original design did all of the crawling for a single CrawlRequest on a single worker within a crawler instance, which involved working with a lot of goroutines, wait groups, and mutexes. The results/status were then stored in the database once the CrawlRequest was complete. I redesigned the system to distribute tasks among crawlers instead, because some pages have hundreds or thousands of links while others only have one or two. This made it hard to reason about the work each crawler would do in my first design: even if we scaled up the number of workers, some crawlers might end up with orders of magnitude more work than others, depending on the URL they were responsible for crawling.

When implementing the distributed task system, I ran into some issues with concurrent database reads/writes (deadlocks when attempting to update edges for a page node, which also involved creating new page nodes). Simplifying my database queries seemed to help (I ran some manual crawl request tests to check for deadlocks or other errors and didn't see any), but I'd love to learn more about how to design systems/algorithms in a way that provably avoids concurrency issues.

I also realized I had a bug in my cycle detection (the logic for not revisiting previously visited pages): I wasn't counting hosts correctly because I tried to visit each URL only once per crawl request.
