Overview

Castroom is a podcast search engine. It was built primarily as an exercise in writing a distributed web crawler on Kubernetes. It can gather hundreds of thousands of podcasts within a few hours and can be scaled further with a single command.

Project Structure

Discovery

Master

coordinates all the crawler jobs
maintains a local cache (using LevelDB) to prevent the same URL from being crawled multiple times
receives the data that crawler nodes send after crawling a website and pushes it to the queue
sends the data to Elasticsearch on completion
managed by Google Kubernetes Engine
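
The URL cache described above could look roughly like the following. This is a minimal sketch, not the project's code: an in-memory Set stands in for the LevelDB store, and the `SeenCache` class and its method names are hypothetical.

```javascript
// Hypothetical sketch: an in-memory Set stands in for the LevelDB cache,
// so the data here is lost on restart, unlike a real on-disk store.
class SeenCache {
  constructor() {
    this.seen = new Set();
  }

  // Normalize a URL so that trivially different spellings share one key.
  // The URL parser already lowercases the host; we also drop the fragment.
  static normalize(rawUrl) {
    const url = new URL(rawUrl);
    url.hash = "";
    return url.toString();
  }

  // Return only the URLs that have not been seen before and record them,
  // so a later call with the same URLs returns an empty list.
  filterNew(urls) {
    const fresh = [];
    for (const raw of urls) {
      const key = SeenCache.normalize(raw);
      if (!this.seen.has(key)) {
        this.seen.add(key);
        fresh.push(key);
      }
    }
    return fresh;
  }
}
```

Under this sketch, the master would pass every batch of discovered URLs through `filterNew` and schedule only the returned ones for crawling.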

Crawler

crawls iTunes podcast pages and sends batched data to the master node for caching
goes through a proxy to bypass certain restrictions
managed by Google Kubernetes Engine
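
One way to picture the batching step is sketched below. It assumes a fixed batch size and a caller-supplied `send` function; both the `Batcher` class and the `send` callback are hypothetical stand-ins for whatever transport the crawler actually uses to reach the master node.

```javascript
// Hypothetical sketch: `send` stands in for the crawler-to-master transport,
// which the README does not describe.
class Batcher {
  constructor(batchSize, send) {
    this.batchSize = batchSize;
    this.send = send;
    this.pending = [];
  }

  // Queue one record and flush as soon as the batch is full.
  add(record) {
    this.pending.push(record);
    if (this.pending.length >= this.batchSize) {
      this.flush();
    }
  }

  // Send whatever is pending, including a partial batch (for example at
  // shutdown). Does nothing when there is nothing to send.
  flush() {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.send(batch);
  }
}
```

Batching trades a little latency for far fewer requests to the master node; a final `flush()` call makes sure a partial batch is not lost when a crawler finishes.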

API

provides endpoints for querying Elasticsearch and retrieving podcast feed information
hosted on Heroku
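
As an illustration of what a search endpoint might send to Elasticsearch, the sketch below builds a request body with a `multi_match` query and `from`/`size` pagination. The function name, the field names (`title`, `author`, `description`), and the page size cap are assumptions, not the project's actual index mapping.

```javascript
// Hypothetical sketch: field names and limits are assumptions, not taken
// from the project's Elasticsearch mapping.
function buildSearchBody(term, page = 0, pageSize = 20) {
  // Clamp the page size so a client cannot request an unbounded result set.
  const size = Math.min(Math.max(pageSize, 1), 50);
  return {
    from: Math.max(page, 0) * size, // offset of the first hit on this page
    size,
    query: {
      multi_match: {
        query: term.trim(),
        // "^2" boosts matches in the title over the other fields.
        fields: ["title^2", "author", "description"],
      },
    },
  };
}
```

For example, `buildSearchBody("history", 2, 10)` asks for ten hits starting at offset 20.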

Web

frontend for the search engine
managed by Firebase Hosting

Technologies Used

Docker
Google Kubernetes Engine
Amazon Simple Queue Service
Amazon Elasticsearch Service
Heroku
Firebase Hosting
React
Node.js
LevelDB
Datadog
