anneglienke/101_upsert-delta

"Delta Lake is an open source project that enables building a Lakehouse architecture on top of data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes, such as S3, ADLS, GCS, and HDFS."

Introduction to Delta Upsert

This repository demonstrates a simple ELT process that uses Delta Lake to upsert data and to remove data files that are no longer part of the latest state of the table's transaction log.
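
To make the term concrete, here is a minimal pure-Python sketch of upsert semantics: rows whose key already exists in the target are updated, and rows with a new key are inserted. It illustrates the idea only and is not the repository's code. The `upsert` helper, the sample rows, and the choice of `PassengerId` as the key column are assumptions made for this example.

```python
def upsert(target, updates, key):
    """Return a new list of rows: update rows whose key matches, insert the rest."""
    # Index the current table state by key (hypothetical in-memory "table").
    merged = {row[key]: row for row in target}
    for row in updates:
        # Update on match, insert otherwise; the original rows are not mutated.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return list(merged.values())


# Assumed sample data; PassengerId as the key column is an assumption.
current = [
    {"PassengerId": 1, "Name": "Braund", "Survived": 0},
    {"PassengerId": 2, "Name": "Cumings", "Survived": 1},
]
changes = [
    {"PassengerId": 2, "Name": "Cumings", "Survived": 0},    # existing key: updated
    {"PassengerId": 3, "Name": "Heikkinen", "Survived": 1},  # new key: inserted
]
result = upsert(current, changes, key="PassengerId")
```

Delta Lake applies the same update-or-insert logic to files in a data lake and records each change in the table's transaction log, which is why older data files can later become unreferenced and eligible for removal.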

📝 Table of Contents

  • 1.raw-zone-ingestion - first ingestion to raw-zone
  • 2.raw-zone-incremental - incremental ingestion (append) to raw-zone
  • 3.staging-zone-ingestion - snapshot of the latest state of the table and creation of staging-zone (delta)
  • 4.staging-zone-incremental - incremental snapshot ingestion (delta)
  • Check scripts (check_raw-zone.py, check_staging-zone.py) - scripts to read and monitor tables being created
  • CSV files (titanic.csv, titanic2.csv, titanic3.csv) - simulate changes in tables being ingested
  • Directories (raw-zone, staging-zone) - store the data
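
The two staging-zone steps could be sketched roughly as follows with the Delta Lake Python API (`delta-spark`). This is a hedged outline, not the repository's actual scripts. The function names `merge_condition` and `upsert_to_staging`, the key columns, and the table path are hypothetical, and the sketch assumes a Spark session already configured for Delta Lake.

```python
def merge_condition(keys, target="t", source="s"):
    """Build the SQL join condition used to match target and source rows."""
    return " AND ".join(f"{target}.{k} = {source}.{k}" for k in keys)


def upsert_to_staging(spark, snapshot_df, path, keys):
    """Create the Delta table at `path` on the first run, merge into it afterwards."""
    from delta.tables import DeltaTable  # imported lazily; requires delta-spark

    if not DeltaTable.isDeltaTable(spark, path):
        # First ingestion (step 3): write the snapshot as a new Delta table.
        snapshot_df.write.format("delta").mode("overwrite").save(path)
        return

    # Incremental ingestion (step 4): update matching rows, insert new ones.
    table = DeltaTable.forPath(spark, path)
    (
        table.alias("t")
        .merge(snapshot_df.alias("s"), merge_condition(keys))
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
    # Delete data files that the table no longer references and that are
    # older than the retention threshold (7 days by default).
    table.vacuum()
```

A call might look like `upsert_to_staging(spark, df, "staging-zone/titanic", ["PassengerId"])`, where both the path and the key column are assumptions. Because `vacuum()` respects the retention threshold, files written by a very recent run are typically kept even if they are no longer referenced.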
