bfemiano/song_plays_workshop_tutorial

song_plays_workshop_tutorial

This is an advanced workshop for people comfortable programming. We'll be writing code in both Python and Scala. You don't have to be an expert in either of those languages, but some familiarity, at least with Python, is recommended.

Students will write two components:

  1. A series of workflow management classes in Python using the Luigi framework.
  2. A Scala Spark job to join the datasets together and perform some basic group analysis.

The workflow chains together three steps: downloading multiple data sources for a given day (2019-02-08) from S3, passing those sources as input to the Spark program, and verifying that the expected output files are produced.
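
As a rough illustration of that chaining idea, here is a minimal, dependency-free Python sketch. It is not the workshop solution and it does not use Luigi's API; it only mimics the general shape of a task that declares upstream requirements, an output file, and a run step, where a task counts as done once its output file exists. The class names, the `build` helper, and the source names `source_a` and `source_b` are hypothetical, and the S3 download and Spark job are replaced by stub file writes.

```python
from datetime import date
from pathlib import Path


class Task:
    """Minimal stand-in for a workflow task: done when its output file exists."""

    def __init__(self, day: date, workdir: Path):
        self.day = day
        self.workdir = workdir

    def requires(self):
        return []  # upstream tasks; none by default

    def output(self) -> Path:
        raise NotImplementedError

    def run(self) -> None:
        raise NotImplementedError

    def complete(self) -> bool:
        return self.output().exists()


def build(task: Task) -> None:
    """Run upstream tasks first, skipping any task whose output already exists."""
    if task.complete():
        return
    for upstream in task.requires():
        build(upstream)
    task.run()
    # Verify the expected output file was actually produced.
    if not task.complete():
        raise RuntimeError(f"{type(task).__name__} produced no output")


class DownloadSource(Task):
    """Placeholder for an S3 download; writes a stub file instead (assumption)."""

    def __init__(self, day: date, workdir: Path, name: str):
        super().__init__(day, workdir)
        self.name = name

    def output(self) -> Path:
        return self.workdir / f"{self.name}_{self.day.isoformat()}.csv"

    def run(self) -> None:
        self.output().write_text("stub data\n")


class RunAnalysis(Task):
    """Placeholder for the Spark job; depends on every downloaded source."""

    def requires(self):
        # Hypothetical source names, used only for illustration.
        return [
            DownloadSource(self.day, self.workdir, name)
            for name in ("source_a", "source_b")
        ]

    def output(self) -> Path:
        return self.workdir / f"analysis_{self.day.isoformat()}.txt"

    def run(self) -> None:
        # Record which input files were consumed instead of running Spark.
        inputs = [task.output().name for task in self.requires()]
        self.output().write_text("\n".join(inputs) + "\n")


def run_workflow(workdir: Path, day: date = date(2019, 2, 8)) -> Path:
    """Build the final task, and with it the whole dependency chain."""
    task = RunAnalysis(day, workdir)
    build(task)
    return task.output()
```

Calling `run_workflow(Path("some_existing_directory"))` writes the two stub source files and then the analysis file. Calling it a second time does nothing, because the final output already exists. Treating an existing output as proof of completion is what makes the workflow safe to re-run after a partial failure.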

With assistance, students should need about 2 to 3 hours, including VM setup time, to write everything from scratch.

Environment setup:

  1. VM Setup directions (Recommended)
  2. For directions on setting up the compile/run dependencies locally instead of using the VM, see Local Setup.

Workshop directions

  1. First let's code the Luigi Tasks.
  2. Then let's write the Scala Spark Job.
  3. Now let's run the Spark artifact you built with Luigi and put it all together.

Fake data generation (Students can skip this)

For directions on how to run the fake data generator see Fake Data Generator.
