Data ingestion with dlt

In this hands-on workshop, we’ll learn how to build data ingestion pipelines.

We’ll cover the following steps:

  • Extracting data from APIs or files
  • Normalizing and loading data
  • Incremental loading

By the end of this workshop, you’ll be able to write data pipelines like a senior data engineer: quickly and concisely, and in a way that is scalable and self-maintaining.
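
The three steps listed above can be sketched in plain Python. This is a minimal illustration using only the standard library, not dlt's API: the function names (`extract_pages`, `flatten`, `load_incremental`), the in-memory `FAKE_API_PAGES` source, the `updated_at` cursor field, and the double-underscore column naming are all assumptions made for the example.

```python
from typing import Any, Dict, Iterable, Iterator, List

# Hypothetical stand-in for a paginated API: a list of pages of nested records.
FAKE_API_PAGES: List[List[Dict[str, Any]]] = [
    [{"id": 1, "updated_at": 10, "user": {"name": "a", "geo": {"city": "Berlin"}}}],
    [{"id": 2, "updated_at": 20, "user": {"name": "b", "geo": {"city": "Paris"}}}],
]


def extract_pages(pages: List[List[Dict[str, Any]]]) -> Iterator[Dict[str, Any]]:
    """Extract: yield records one at a time instead of building one big list."""
    for page in pages:
        yield from page


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Normalize: turn nested dicts into flat columns such as user__geo__city."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{column}__"))
        else:
            flat[column] = value
    return flat


def load_incremental(
    rows: Iterable[Dict[str, Any]], table: List[Dict[str, Any]], last_cursor: int
) -> int:
    """Incremental load: append only rows newer than the cursor of the last run."""
    new_cursor = last_cursor
    for row in rows:
        if row["updated_at"] > last_cursor:
            table.append(row)
            new_cursor = max(new_cursor, row["updated_at"])
    return new_cursor


if __name__ == "__main__":
    table: List[Dict[str, Any]] = []
    rows = (flatten(r) for r in extract_pages(FAKE_API_PAGES))
    cursor = load_incremental(rows, table, last_cursor=0)
    print(len(table), cursor)
```

Running the same load a second time with the returned cursor adds no rows, which is the point of incremental loading: only data that changed since the last run is moved.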

Video: https://www.youtube.com/live/oLXhBM7nf2Q


Navigation

Resources


Teacher

Welcome to the data ingestion workshop of the DataTalks.Club Data Engineering Zoomcamp.

  • My name is Adrian, and I have worked in the data field since 2012.
    • I have built many data warehouses, some data lakes, and a few data teams.
    • Ten years into my career I started working on dlt (“data load tool”), an open source library that helps data engineers build faster and better.
    • I started working on dlt because data engineering is one of the few areas of software engineering where we lack developer tools for our work.
    • Building better pipelines requires more code reuse: we cannot all build perfect pipelines from scratch every time.
    • And so dlt was born, a library that automates the tedious parts of data ingestion: loading, schema management, data type detection, scalability, self-healing, scalable extraction… you get the idea. It is essentially a data engineer’s “one stop shop” for best-practice data pipelining.
    • Because it is simple to use, dlt enables non-specialists to
      • build pipelines 5-10x faster than without it
      • build self-healing, self-maintaining pipelines with all the best practices of data engineers. Automating schema changes removes the bulk of maintenance effort.
      • govern pipelines with schema evolution alerts and data contracts
      • and generally develop pipelines like a senior, commercial data engineer.
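
To make "schema management" and "schema evolution alerts" concrete, here is a minimal standard-library sketch of inferring column types from data and detecting newly appearing columns. It does not show how dlt implements this; `infer_schema` and `evolve_schema` are hypothetical helper names, and using Python type names as column types is a simplification for the example.

```python
from typing import Any, Dict, Iterable

Schema = Dict[str, str]  # column name -> type name


def infer_schema(rows: Iterable[Dict[str, Any]]) -> Schema:
    """Infer a column -> type-name mapping from the first non-null value seen."""
    schema: Schema = {}
    for row in rows:
        for column, value in row.items():
            if value is not None:
                schema.setdefault(column, type(value).__name__)
    return schema


def evolve_schema(current: Schema, rows: Iterable[Dict[str, Any]]) -> Schema:
    """Add unseen columns to `current` in place and return only the additions."""
    added = {
        column: type_name
        for column, type_name in infer_schema(rows).items()
        if column not in current
    }
    current.update(added)
    return added


if __name__ == "__main__":
    schema = infer_schema([{"id": 1, "name": "a"}])
    # A later batch arrives with an extra column; the pipeline adapts
    # instead of failing, and the additions can be reported.
    added = evolve_schema(schema, [{"id": 2, "name": "b", "age": 30.5}])
    print(added)
```

The returned additions are the kind of information an alert on schema changes could be built from, while a stricter setup might reject the batch instead of accepting the new column.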

Course

You can find the course file here. The course has 3 parts.


Homework

The linked Colab notebook offers a few exercises to practice what you learned today.

Question 1: What is the sum of the outputs of the generator for limit = 5?

  • A: 10.23433234744176
  • B: 7.892332347441762
  • C: 8.382332347441762
  • D: 9.123332347441762

Question 2: What is the 13th number yielded by the generator?

  • A: 4.236551275463989
  • B: 3.605551275463989
  • C: 2.345551275463989
  • D: 5.678551275463989
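
For the two generator questions above, the general pattern for consuming a generator may help. The sketch below uses a hypothetical `squares` generator that is deliberately not the one from the homework notebook, and a hypothetical `nth` helper; it only shows how to sum a generator's output and how to pick the item at a given position.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def squares(limit: int) -> Iterator[int]:
    """Hypothetical generator (not the homework's): yields 1, 4, 9, ..., limit**2."""
    for n in range(1, limit + 1):
        yield n * n


def nth(items: Iterable[T], position: int) -> T:
    """Return the item at the 1-based `position` without building a list."""
    for item in islice(items, position - 1, position):
        return item
    raise IndexError(f"fewer than {position} items were yielded")


if __name__ == "__main__":
    # A generator can be consumed only once, so create a fresh one per question.
    print(sum(squares(5)))
    print(nth(squares(20), 13))
```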

Question 3: Append the 2 generators. After correctly appending the data, calculate the sum of all ages of people.

  • A: 353
  • B: 365
  • C: 378
  • D: 390

Question 4: Merge the 2 generators using the ID column. Calculate the sum of ages of all the people loaded as described above.

  • A: 215
  • B: 266
  • C: 241
  • D: 258
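
Questions 3 and 4 differ only in how the second batch is written: appending keeps every row, while merging on a key replaces rows that share that key. The sketch below illustrates the difference with the standard library only; the two batches are made-up data rather than the homework's generators, and `append_rows` and `merge_rows` are hypothetical helpers, not dlt functions.

```python
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def batch_one() -> Iterator[Row]:
    """Made-up first batch (not the homework data)."""
    yield {"id": 1, "name": "p1", "age": 30}
    yield {"id": 2, "name": "p2", "age": 40}


def batch_two() -> Iterator[Row]:
    """Made-up second batch: id 2 is an updated version of an existing row."""
    yield {"id": 2, "name": "p2", "age": 41}
    yield {"id": 3, "name": "p3", "age": 50}


def append_rows(table: List[Row], rows: Iterable[Row]) -> List[Row]:
    """Append: keep all existing rows and add every incoming row."""
    return table + list(rows)


def merge_rows(table: List[Row], rows: Iterable[Row], key: str) -> List[Row]:
    """Merge: incoming rows replace existing rows that have the same key value."""
    by_key = {row[key]: row for row in table}
    for row in rows:
        by_key[row[key]] = row
    return list(by_key.values())


if __name__ == "__main__":
    appended = append_rows(list(batch_one()), batch_two())
    merged = merge_rows(list(batch_one()), batch_two(), key="id")
    print(len(appended), sum(row["age"] for row in appended))
    print(len(merged), sum(row["age"] for row in merged))
```

With appending, the person with id 2 is counted twice, once per version; with merging, only the latest version remains, so the two sums differ.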

Submit the solution here: https://courses.datatalks.club/de-zoomcamp-2024/homework/workshop1


Next steps

As you are learning the various concepts of data engineering, consider creating a portfolio project that will further your own knowledge.

By demonstrating the ability to deliver end to end, you will have an easier time finding your first role. This will help regardless of whether your hiring manager reviews your project, largely because you will have a better understanding and will be able to talk the talk.

Here are some example projects that others did with dlt:

If you create a personal project, consider submitting it to our blog. We will be happy to showcase it. Just drop us a line in the dlt Slack.

And don't forget, if you like dlt

Notes

  • Add your notes here