
Apache Hudi examples designed to be run on AWS Elastic MapReduce (EMR) via EMR Studio or EMR Notebooks


ev2900/EMR_Studio_Hudi


Apache Hudi Examples


Apache Hudi examples designed to be run on AWS Elastic MapReduce (EMR) via EMR Studio and/or EMR Notebooks.

Reference background on key concepts: if you are new to working with Hudi, it is worth reading about Hudi's timeline, file management, index, table types, query types, copy on write, and merge on read.

If you are not familiar with the core Hudi concepts or are new to Hudi, I highly recommend you watch AWS re:Invent 2019: Insert, upsert, and delete data in Amazon S3 using Amazon.

Environment Set Up

The samples in this repository are designed to run on EMR via EMR Notebooks or EMR Studio. To set up your environment, follow the AWS documentation for EMR Notebooks or EMR Studio.

You can upload the .ipynb files in this repository directly to the Jupyter environments provided by EMR Notebooks / Studio.

Copy on Write

The notebooks in copy_on_write are the best place to start. They cover working with data via Hudi, specific to copy on write tables. The notebooks cover

  • Writing data to S3
  • Reading data from S3
  • Upserting data
  • Incremental querying
  • Point in time querying
  • Deleting data

Both Python and Scala notebooks are available.
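As a rough orientation before opening the notebooks, the copy on write operations listed above mostly come down to which options are passed to the Hudi Spark datasource. The sketch below is not taken from the notebooks: the helper names, the table name, and the `id`, `region`, and `updated_at` field names are hypothetical, and the `hudi` format short name is assumed to be available on your EMR release. The option keys are standard Hudi datasource configs. A point in time query is expressed here as an incremental query that begins at `"000"` (the earliest possible commit) and ends at a chosen instant.

```python
# Sketch only: helper names, field names, and instant times are hypothetical.
WRITE_OPERATIONS = {"insert", "upsert", "bulk_insert", "delete"}


def hudi_write_options(table_name, operation, table_type="COPY_ON_WRITE"):
    """Build Hudi Spark datasource write options for one write operation."""
    if operation not in WRITE_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.operation": operation,
        # Assumed schema: 'id' identifies a record, 'region' partitions the
        # table, and 'updated_at' picks the winner among duplicate keys.
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.partitionpath.field": "region",
        "hoodie.datasource.write.precombine.field": "updated_at",
    }


def incremental_read_options(begin_instant, end_instant=None):
    """Build read options for an incremental (or point in time) query."""
    options = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        # Bounding the end instant turns this into a point in time query.
        options["hoodie.datasource.read.end.instanttime"] = end_instant
    return options


def write_to_hudi(df, base_path, options, first_write=False):
    """Write a Spark DataFrame to a Hudi table under base_path (e.g. on S3)."""
    # 'overwrite' creates the table; 'append' is used for upserts and deletes.
    mode = "overwrite" if first_write else "append"
    df.write.format("hudi").options(**options).mode(mode).save(base_path)


def read_from_hudi(spark, base_path, options=None):
    """Read a Hudi table; with no options this is a plain snapshot read."""
    return spark.read.format("hudi").options(**(options or {})).load(base_path)
```

In a notebook session, an upsert would then look like `write_to_hudi(updates_df, base_path, hudi_write_options("my_table", "upsert"))`, and a delete uses the same call with the `"delete"` operation and a DataFrame holding the records to remove. Here `updates_df`, `base_path`, and `my_table` are placeholders.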

Merge on Read

The notebooks in merge_on_read are the best next step once you understand the copy_on_write notebooks. The merge_on_read notebooks cover

  • Writing data to S3
  • Upserting data
  • Snapshot queries
  • Read optimized queries
  • Compaction

Both Python and Scala notebooks are available.
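To make the difference between the two merge on read query types concrete, here is a minimal sketch of the options involved. It is not the notebooks' code: the helper names, the `id`, `region`, and `updated_at` field names, and the default of 5 delta commits are assumptions chosen for illustration. A snapshot query merges base files with the not yet compacted log files at read time, while a read optimized query reads only the compacted base files, so it can lag behind recent upserts until compaction runs.

```python
# Sketch only: helper names, field names, and defaults are hypothetical.
MOR_QUERY_TYPES = {"snapshot", "read_optimized"}


def merge_on_read_write_options(table_name, inline_compaction=False,
                                max_delta_commits=5):
    """Build upsert options for a merge on read table."""
    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.datasource.write.operation": "upsert",
        # Assumed schema, as in any Hudi write: key, partition, ordering field.
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.partitionpath.field": "region",
        "hoodie.datasource.write.precombine.field": "updated_at",
    }
    if inline_compaction:
        # Compact log files into base files as part of the write itself,
        # once this many delta commits have accumulated.
        options["hoodie.compact.inline"] = "true"
        options["hoodie.compact.inline.max.delta.commits"] = str(max_delta_commits)
    return options


def query_type_options(query_type):
    """Build read options selecting a snapshot or read optimized query."""
    if query_type not in MOR_QUERY_TYPES:
        raise ValueError(f"unsupported query type: {query_type}")
    return {"hoodie.datasource.query.type": query_type}


def read_hudi(spark, base_path, query_type="snapshot"):
    """Load a merge on read table from base_path with the given query type."""
    options = query_type_options(query_type)
    return spark.read.format("hudi").options(**options).load(base_path)
```

Comparing `read_hudi(spark, base_path, "snapshot")` with `read_hudi(spark, base_path, "read_optimized")` right after an upsert, and again after compaction, is one way to see the trade-off between freshness and read cost. `spark` and `base_path` are placeholders for your session and table location.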

Future Improvements to this Repo

  • Hudi SQL example(s)
  • Hudi time travel example(s)
