castroom/castroom

Overview

Castroom is a podcast search engine. It was built primarily as a way to learn how to make a distributed web crawler with Kubernetes. It can gather hundreds of thousands of podcasts within a few hours, and can be scaled up further with a single command.

Project Structure

Discovery

Master

  • coordinates all the crawler jobs
  • maintains a local cache (using LevelDB) to prevent the same URL from being crawled multiple times
  • receives the data each crawler node sends after it finishes crawling a website, and pushes it to the queue
  • sends the data to Elasticsearch on completion
  • managed by Google Kubernetes Engine
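
The URL cache on the master can be pictured with a small sketch. This is not the project's actual code: the class and function names are hypothetical, and an in-memory Map stands in for LevelDB so the example runs without dependencies.

```javascript
// Hypothetical sketch: a Map stands in for the LevelDB cache described above.
class SeenUrlCache {
  constructor(store = new Map()) {
    this.store = store;
  }

  // Normalize so that the same page with different fragments shares one key.
  static keyFor(rawUrl) {
    const url = new URL(rawUrl);
    url.hash = "";
    return url.toString();
  }

  // Returns true the first time a URL is seen, false on every repeat.
  markIfNew(rawUrl) {
    const key = SeenUrlCache.keyFor(rawUrl);
    if (this.store.has(key)) return false;
    this.store.set(key, Date.now());
    return true;
  }
}

// Keep only the URLs that have not been handed to a crawler yet.
function filterUncrawled(cache, urls) {
  return urls.filter((u) => cache.markIfNew(u));
}
```

With a persistent store in place of the Map, the same check would survive a restart of the master, which is presumably why an on-disk store such as LevelDB was chosen.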

Crawler

  • crawls iTunes podcast pages and sends batched data to the master node for caching
  • goes through a proxy to bypass certain restrictions
  • managed by Google Kubernetes Engine
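
One way to batch crawled records before sending them to the master is sketched below. The Batcher class, its batch size, and the send callback are assumptions for illustration; the real crawler's transport and batch size are not described here, and the callback is kept synchronous for simplicity.

```javascript
// Hypothetical batcher: collects records and hands them to `send` in groups.
class Batcher {
  constructor(batchSize, send) {
    this.batchSize = batchSize;
    this.send = send; // assumed to deliver one batch to the master node
    this.pending = [];
  }

  // Queue one record; flush as soon as a full batch is available.
  add(record) {
    this.pending.push(record);
    if (this.pending.length >= this.batchSize) this.flush();
  }

  // Send whatever is pending, even a partial batch (call at the end of a crawl).
  flush() {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.send(batch);
  }
}
```

Batching trades a little latency for far fewer requests to the master, which matters when many crawler nodes report at once.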

API

  • provides endpoints for querying Elasticsearch and retrieving podcast feed information
  • hosted on Heroku
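
As a rough illustration of what a search endpoint might send to Elasticsearch, the sketch below builds a request body with the standard multi_match query. The function name and the field names (title, author, description) are assumptions; the actual index mapping is not documented here.

```javascript
// Builds an Elasticsearch search request body for a free-text podcast query.
// Field names and boosts are assumptions for illustration only.
function buildPodcastSearchBody(text, { from = 0, size = 10 } = {}) {
  const query = text.trim();
  if (query === "") {
    // No search terms: fall back to matching every document.
    return { from, size, query: { match_all: {} } };
  }
  return {
    from,
    size,
    query: {
      multi_match: {
        query,
        // A caret boosts a field, so title matches rank highest.
        fields: ["title^3", "author^2", "description"],
      },
    },
  };
}
```

The returned object would be passed as the body of a search request against the podcast index; from and size give simple pagination.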

Web

  • frontend for the search engine
  • managed by Firebase Hosting

project-structure

Technologies Used

  • Docker
  • Google Kubernetes Engine
  • Amazon Simple Queue Service
  • Amazon Elasticsearch Service
  • Heroku
  • Firebase Hosting
  • React
  • Node.js
  • LevelDB
  • Datadog

Screenshots

Search

Search Results

GIF of Castroom