Materials for the Big Data Workshop for DIScNet: 3-5 April 2023.
This will be a 3-day hands-on introduction to the technologies and ideas that are used in building data applications at scale. The content will be delivered virtually via Zoom for the duration of the course.
There are different components to the course including:
- Lectures
- Guided lab exercises
- Supported workshop sessions
During the course we will use Docker to support testing technologies on a local laptop/desktop; some elements of the course will be completed in the cloud using the Databricks Community Edition.
For much of the course you will need a recent install of Anaconda and an environment with Python >= 3.9 and Jupyter Lab > 3.0.
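If you want to confirm that your environment meets these requirements before the first session, a minimal sketch along the following lines may help. It is not part of the official course materials: the helper names `version_tuple` and `check_environment` are made up for illustration, and it assumes Jupyter Lab was installed under the package name `jupyterlab`.

```python
import sys
from importlib import metadata


def version_tuple(version):
    """Turn a version string such as '3.6.1' into a tuple of ints.

    Parsing stops at the first non-numeric suffix (e.g. 'rc1'), and
    trailing zeros are dropped so that '3.0.0' compares equal to '3.0'.
    """
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while parts and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def check_environment():
    """Return (python_ok, lab_version, lab_ok) for the current environment."""
    python_ok = sys.version_info >= (3, 9)
    try:
        # Assumption: the installed distribution is named 'jupyterlab'.
        lab_version = metadata.version("jupyterlab")
    except metadata.PackageNotFoundError:
        lab_version = None
    lab_ok = (
        lab_version is not None
        and version_tuple(lab_version) > version_tuple("3.0")
    )
    return python_ok, lab_version, lab_ok


if __name__ == "__main__":
    python_ok, lab_version, lab_ok = check_environment()
    print("Python >= 3.9:", python_ok)
    print("Jupyter Lab version found:", lab_version)
    print("Jupyter Lab > 3.0:", lab_ok)
```

Run it from inside the activated course environment, so that the check applies to that environment rather than to the system Python.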
The course will be delivered in a dynamic and interactive way. There is a lot of core material but there are more technologies than there is time to explore and so the students will have the opportunity to suggest what elements are focussed on in some of the later sessions. Some of the topics we will be exploring include:
- Python, Jupyter Lab & Pandas
- Apache Spark
- SQL
- NoSQL
- Docker and containerisation
- Data streaming
- Interactive dashboarding
- Workflow orchestration
The course is very focussed around "doing" and "playing" with the tools. To that end there are lots of practical lab components. These exist in the practical-labs folder. Exercises are in numerical order, which should be in sync with the lecture order. Some of the code is available with gaps for you to complete, and in those cases a second copy of the code with the solution is also included.
To support the dynamic nature of the course some of the content will be live coded and will be uploaded to the repository after the session.
Other elements will draw on materials developed by Paul Freemantle (2016-2017) and Julie Weeds (2019-2020), previous instructors for this course.
Their original course materials are available here - we will use some of these materials as appropriate.