getting-started-apachespark-apachekafka-apacheiceberg

Getting Started with Apache Spark, Apache Kafka, and Apache Iceberg

This code is from the post here.

It was then extended with a Nessie Catalog integration, written up in this post.

I then updated it for Minio, as posted here.

Up and Running

First make sure you have Python3, Java, Docker and Spark installed.

Start with the Spark download. To install Spark, extract it into a directory, set the SPARK_HOME environment variable to that directory, and then put the bin folder of SPARK_HOME in your PATH. Let me know if I need to add docs for Python, Java and Docker.
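If you want to confirm the setup before going further, a small check like the one below can help. It is only a sketch and not part of this repo: the function name `missing_prerequisites` is made up for illustration, and it assumes that finding `spark-submit` on your PATH is a good enough sign that Spark's bin folder was added correctly.

```python
import os
import shutil

# Executables the walkthrough relies on. "spark-submit" is used here as a
# stand-in for "Spark's bin folder is on PATH" (an assumption, not a rule).
REQUIRED_TOOLS = ("python3", "java", "docker", "spark-submit")


def missing_prerequisites(env=os.environ, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    spark_home = env.get("SPARK_HOME")
    path_entries = env.get("PATH", "").split(os.pathsep)
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif os.path.join(spark_home, "bin") not in path_entries:
        problems.append("SPARK_HOME/bin is not in PATH")
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("MISSING:", problem)
```

Running it prints one line per problem and nothing when everything is in place. The PATH comparison is a plain string match, so a trailing slash on the PATH entry would be reported as missing even though the shell would accept it.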

Next you will need some terminals open.

In the first terminal, let's install the Python libs and start up Kafka, the Schema Repo, Nessie and Minio.

pip3 install -r requirements.txt
docker-compose up

Now, in another terminal, let's send a message to Kafka.

./send.sh

In another terminal, let's create our Iceberg table, managed by the Nessie catalog, with its data stored on Minio.

./create_table.sh --table_name=your_table_name

In the same terminal, let's start up a Spark job to read from Kafka, understand the Avro schema and save the records to the Iceberg table we created.

./kafka-to-iceberg.sh --table_name=your_table_name
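To give a feel for what "understand the Avro schema" can involve, here is a minimal sketch. It assumes the Schema Repo is a Confluent-compatible Schema Registry and that messages use its wire format (one zero "magic" byte, a 4-byte big-endian schema ID, then the Avro body). This README does not confirm that, and the helper name `split_registry_frame` is hypothetical, not something from the repo's scripts.

```python
import struct

MAGIC_BYTE = 0  # first byte of the Confluent Schema Registry wire format
HEADER_SIZE = 5  # 1 magic byte + 4-byte big-endian schema ID


def split_registry_frame(payload: bytes):
    """Split a Kafka message value into (schema_id, avro_body)."""
    if len(payload) < HEADER_SIZE:
        raise ValueError("payload too short for a 5-byte registry header")
    # ">" means big-endian with no padding: signed byte, then unsigned 32-bit int.
    magic, schema_id = struct.unpack(">bI", payload[:HEADER_SIZE])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte {magic}")
    return schema_id, payload[HEADER_SIZE:]
```

For example, `split_registry_frame(b"\x00\x00\x00\x00\x07abc")` returns `(7, b"abc")`. Under the same assumption, the schema ID is what you would look up in the registry, and the remaining bytes are what an Avro decoder would be given.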

Now let's see what's in the table we created. Go to another terminal and run pyspark to run interactive queries.

./mypy.sh

Now go to the queries.py file and copy all of the imports and configuration into the pyspark session. After that you should be able to run your query and see the record you sent to Kafka in the Iceberg table.

spark.sql("select * from nessie.your_table_name").show()

You can send another message and rerun the select to see it show up.
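For orientation, the configuration you copy from queries.py is what lets Spark resolve the `nessie` catalog name used in the query. The sketch below is not the contents of queries.py. It only shows the kind of Iceberg and Nessie catalog properties such a setup commonly uses. The function names, the `main` branch default and every endpoint value passed in are assumptions, so treat queries.py as the source of truth.

```python
def nessie_iceberg_conf(nessie_uri, warehouse, s3_endpoint, ref="main"):
    """Build Spark properties for an Iceberg catalog named 'nessie'."""
    catalog = "spark.sql.catalog.nessie"
    return {
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions"
        ),
        catalog: "org.apache.iceberg.spark.SparkCatalog",
        f"{catalog}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{catalog}.uri": nessie_uri,          # Nessie REST endpoint
        f"{catalog}.ref": ref,                 # Nessie branch to read and write
        f"{catalog}.warehouse": warehouse,     # bucket path for table data
        f"{catalog}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{catalog}.s3.endpoint": s3_endpoint,  # S3-compatible store, e.g. Minio
        f"{catalog}.s3.path-style-access": "true",
    }


def build_session(conf):
    """Create a SparkSession from the properties above (needs pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    builder = SparkSession.builder.appName("iceberg-queries")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

A call might look like `build_session(nessie_iceberg_conf("http://localhost:19120/api/v1", "s3://warehouse/", "http://localhost:9000"))`, where all three values are placeholders for whatever your docker-compose setup exposes. The Iceberg and Nessie jars and the Minio credentials also have to be available to Spark, which this sketch does not cover.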

Thanx =8^) Joe Stein
https://www.twitter.com/charmalloc
https://www.linkedin.com/in/charmalloc
