Data ingestion with dlt

In this hands-on workshop, we’ll learn how to build data ingestion pipelines.

We’ll cover the following steps:

Extracting data from APIs or files
Normalizing and loading data
Incremental loading

By the end of this workshop, you’ll be able to write data pipelines like a senior data engineer: quickly and concisely, producing pipelines that are scalable and self-maintaining.

Video: https://www.youtube.com/live/oLXhBM7nf2Q

Navigation

Workshop content
Workshop notebook
Homework starter notebook

Resources

Website and community: Visit our docs and join the discussion on our Slack (link at the top of the docs).
Course colab: Notebook.
dltHub community Slack.

Teacher

Welcome to the DataTalks.Club Data Engineering Zoomcamp data ingestion workshop.

My name is Adrian, and I have worked in the data field since 2012.
- I have built many data warehouses, some lakes, and a few data teams.
- 10 years into my career I started working on dlt (“data load tool”), an open source library that enables data engineers to build faster and better.
- I started working on dlt because data engineering is one of the few areas of software engineering where we do not have developer tools to do our work.
- Building better pipelines would require more code re-use - we cannot all just build perfect pipelines from scratch every time.
- And so dlt was born, a library that automates the tedious parts of data ingestion: loading, schema management, data type detection, scalability, self healing, scalable extraction… you get the idea - essentially a data engineer’s “one stop shop” for best practice data pipelining.
- Due to its simplicity of use, dlt enables laymen to
  - build pipelines 5-10x faster than without it;
  - build self healing, self maintaining pipelines with all the best practices of data engineers (automating schema changes removes the bulk of maintenance efforts);
  - govern pipelines with schema evolution alerts and data contracts;
  - and generally develop pipelines like a senior, commercial data engineer.

Course

You can find the course file here. The course has 3 parts:

Extraction Section: In this section we will learn about scalable extraction.
Normalisation Section: In this section we will learn to prepare data for loading.
Loading Section: Here we will learn about incremental loading modes.
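
To give a feel for the first two parts before opening the course file, here is a minimal sketch in plain Python. It is not dlt's implementation: the function names `extract_records` and `flatten`, the sample records, and the double-underscore separator are all assumptions made for illustration. The idea is that extraction yields one record at a time (so memory use stays flat), and normalisation turns nested JSON into flat rows that fit a table.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def extract_records(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record at a time instead of loading everything at once."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


def flatten(record: Dict[str, Any], parent_key: str = "", sep: str = "__") -> Dict[str, Any]:
    """Flatten nested dicts into one level, joining keys with `sep` (illustrative choice)."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical input: two JSON lines, as they might arrive from a file or an API.
raw_lines = [
    '{"id": 1, "name": "Ada", "address": {"city": "London", "zip": "N1"}}',
    '{"id": 2, "name": "Bob", "address": {"city": "Berlin", "zip": "10115"}}',
]

rows = [flatten(record) for record in extract_records(raw_lines)]
print(rows[0])
# {'id': 1, 'name': 'Ada', 'address__city': 'London', 'address__zip': 'N1'}
```

Because `extract_records` is a generator, the same code works unchanged if `raw_lines` is an open file handle rather than a list.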

Homework

The linked colab notebook offers a few exercises to practice what you learned today.

Question 1: What is the sum of the outputs of the generator for limit = 5?

A: 10.23433234744176
B: 7.892332347441762
C: 8.382332347441762
D: 9.123332347441762

Question 2: What is the 13th number yielded by the generator?

A: 4.236551275463989
B: 3.605551275463989
C: 2.345551275463989
D: 5.678551275463989
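
Questions 1 and 2 both come down to consuming a generator. The sketch below shows the two patterns on a hypothetical `squares` generator, which is deliberately not the one from the homework notebook; the helper `nth` is also an assumption for illustration.

```python
from itertools import islice
from typing import Iterator


def squares(limit: int) -> Iterator[int]:
    """Hypothetical generator: yields 1, 4, 9, ... up to limit**2."""
    for n in range(1, limit + 1):
        yield n * n


def nth(gen: Iterator[int], k: int) -> int:
    """Return the k-th yielded value (1-based) without building a list."""
    return next(islice(gen, k - 1, None))


# Sum of everything the generator yields for a given limit.
total = sum(squares(5))  # 1 + 4 + 9 + 16 + 25

# Third value yielded by a fresh generator.
third = nth(squares(10), 3)

print(total, third)  # 55 9
```

Note that a generator can only be consumed once, so create a fresh one for each question.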

Question 3: Append the 2 generators. After correctly appending the data, calculate the sum of all ages of people.

A: 353
B: 365
C: 378
D: 390

Question 4: Merge the 2 generators using the ID column. Calculate the sum of ages of all the people loaded as described above.

A: 215
B: 266
C: 241
D: 258
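
Questions 3 and 4 contrast two loading modes. As a rough sketch of the difference, assuming plain Python lists stand in for a destination table: appending keeps every incoming row, while merging on a key lets a later row replace an earlier row with the same ID. In dlt these roughly correspond to the `append` and `merge` write dispositions; the functions and sample data below are hypothetical and are not the homework data.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def load_append(table: List[Row], new_rows: Iterable[Row]) -> List[Row]:
    """Append mode: keep all existing rows and add every new row."""
    return table + list(new_rows)


def load_merge(table: List[Row], new_rows: Iterable[Row], key: str) -> List[Row]:
    """Merge mode: a new row replaces the existing row that has the same key."""
    by_key = {row[key]: row for row in table}
    for row in new_rows:
        by_key[row[key]] = row
    return list(by_key.values())


# Hypothetical batches: id 2 appears in both, with an updated age.
first_batch = [{"id": 1, "age": 30}, {"id": 2, "age": 40}]
second_batch = [{"id": 2, "age": 41}, {"id": 3, "age": 25}]

appended = load_append(first_batch, second_batch)
merged = load_merge(first_batch, second_batch, key="id")

print(sum(row["age"] for row in appended))  # 136 (id 2 counted twice)
print(sum(row["age"] for row in merged))    # 96 (id 2 counted once, newest age)
```

The two sums differ only because of the duplicated ID, which is exactly what the homework questions ask you to observe on the real data.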

Submit the solution here: https://courses.datatalks.club/de-zoomcamp-2024/homework/workshop1

Next steps

As you are learning the various concepts of data engineering, consider creating a portfolio project that will further your own knowledge.

By demonstrating the ability to deliver end to end, you will have an easier time finding your first role. This will help regardless of whether your hiring manager reviews your project, largely because you will have a better understanding and will be able to talk the talk.

Here are some example projects that others did with dlt:

Serverless dlt-dbt on cloud functions: Article
Bird finder: Part 1, Part 2
Event ingestion on GCP: Article and repo
Event ingestion on AWS: Article and repo
Or see one of the many demos created by our working students: Hacker News, GA4 events, an E-Commerce, Google Sheets, Motherduck, MongoDB + Holistics, Deepnote, Prefect, PowerBI vs GoodData vs Metabase, Dagster, Ingesting events via GCP webhooks, SAP to Snowflake replication, Read emails and send summary to Slack with AI and Kestra, Mode + dlt capabilities, dbt on cloud functions.
If you want to use dlt in your project, check this list of public APIs.

If you create a personal project, consider submitting it to our blog - we will be happy to showcase it. Just drop us a line in the dlt Slack.

And don't forget, if you like dlt:

Give us a GitHub Star!
Join our Slack community

Notes

Add your notes here
