
Showcase for Hadoop with CDC on Quarkus

This project holds a showcase for Hadoop with CDC on Quarkus.

Make targets

The following make targets exist in the subfolder podman:

Containers and co.

  • pd-machine-create - Create a suitable Qemu machine

  • pd-pod-create - Create the pod with port mappings

  • pd-pod-rm - Remove pod

  • pd-pod-recreate - Remove and recreate the pod

  • pd-build - Build all images

  • pd-init - Create machine, pod and build all images

  • pd-start - Start all containers

The following make targets are available at the top level:

Tools

  • todo - Create todo entry via curl

  • list - List todo entries via curl

  • kat-listen - Listen for Kafka messages

  • kat-send - Send Kafka message

  • psql - Use psql CLI to connect to Postgres

Hive

  • beeline - Start beeline CLI and connect to Hive

  • beeline-hive-select - Select data from hive_todos

  • beeline-debezium-select - Select data from debezium_todos

  • beeline-spark-select - Select data from spark_messages and spark_todos

Spark

  • spark-shell - Start Spark shell and connect to Spark (see the example after this list)

  • spark-beeline - Start Spark Beeline and connect to Spark

  • data-init - Init all Hive data

  • copy - Copy the Scala jar into the Hadoop container
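
Once connected via spark-shell, the initialized data can be sanity-checked directly from the shell. A minimal sketch, assuming the table names used by the beeline-spark-select target above:

// Inside spark-shell: list the tables and peek at the CDC data
spark.sql("show tables").show()
spark.sql("select * from spark_todos limit 10").show()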

Browser

  • open-namenode - Open the namenode in a browser

  • open-datanode - Open the datanode in a browser

  • open-spark-master - Open the Spark master in a browser

  • open-spark-slave - Open the Spark slave in a browser

  • open-spark-shell - Open the Spark shell in a browser

  • open-resourcemanager - Open the ResourceManager in a browser

  • open-debezium - Open Debezium in a browser

  • open-app - Open the Quarkus Dev tools in a browser

How to use

Initial setup

  1. Create podman machine: make -C podman pd-machine-create

  2. Start podman machine: make -C podman pd-machine-start

  3. Create pod: make -C podman pd-pod-create

  4. Build all containers: make -C podman pd-build

  5. Start all containers: make -C podman pd-start

Start everything on the host

  1. Init the Hive tables: make init

  2. Compile the Scala jar: make scala

  3. SSH into the Hadoop pod: make ssh

Run everything via SSH

  1. Copy the Scala jar to Hadoop: make copy

  2. Init the Spark tables: make init

  3. Run the Scala jar: make run (see the sketch after this list)

  4. Create a todo via curl: make todo
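
The Scala jar holds the showcase's actual Spark job. Purely as an illustration of the kind of job make run starts (not the repo's real code; the bootstrap server and topic name are assumptions), a Structured Streaming job consuming the Debezium topic might look like this:

// Illustrative sketch only: consume CDC events from Kafka via Spark Structured Streaming.
// Requires the spark-sql-kafka package; the bootstrap server and topic name below are
// assumptions, not the showcase's actual values.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "showcase.public.todos")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")

events.writeStream
  .format("console")
  .start()
  .awaitTermination()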

Problems

Dockerfile

Apparently, the datanodes rely on native C libraries that I couldn't keep from dumping core on Alpine/musl.

Podman

Starting with Podman 4.4.1, the default privileges for chroot were dropped, which led to the following problem on connection:

ssh: Connection closed by 127.0.0.1 port 22
sshd: chroot("/run/sshd"): Operation not permitted [preauth]

sshd relies on chroot for privilege separation, so the capability has to be granted again when the container is created (e.g. via --cap-add SYS_CHROOT).

Scala

Errors like the following indicate a Scala version mismatch:

java.lang.NoSuchMethodError: 'scala.collection.immutable.ArraySeq scala.runtime.ScalaRunTime$.wrapRefArray(java.lang.Object[])'
Caused by: java.lang.ClassNotFoundException: scala.$less$colon$less

Make sure the Scala version of the jars/dependencies matches the Scala version of Spark.

This can be easily checked with:

mvn dependency:tree
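
In addition to the dependency tree, Spark's own Scala version can be verified from inside spark-shell (plain Spark/Scala API, nothing repo-specific):

// Inside spark-shell: print Spark's version and the Scala version it runs on
println(spark.version)
println(scala.util.Properties.versionString)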

Hive

The JDBC connection string for Hive, for either the anonymous user or hduser, is the following:

jdbc:hive2://localhost:10000/default
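
For a programmatic check, plain JDBC works as well. A minimal sketch, assuming the Hive JDBC driver (org.apache.hive:hive-jdbc) is on the classpath and hduser needs no password:

import java.sql.DriverManager

object HiveConnect {
  def main(args: Array[String]): Unit = {
    // Connect as hduser; an empty password is an assumption for this setup
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hduser", "")
    try {
      val rs = conn.createStatement().executeQuery("show tables")
      while (rs.next()) println(rs.getString(1))
    } finally conn.close()
  }
}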

And adding the external Debezium table works like this:

add jar /home/hduser/hive/lib/iceberg-hive-runtime-1.1.0.jar;
create external table debezium
stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
location 'hdfs://localhost:9000/warehouse/debeziumevents/debeziumcdc_showcase_public_todos'
TBLPROPERTIES ('iceberg.catalog'='location_based_table');
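
Since the table is location-based, Spark can also read it directly by path, without going through the Hive metastore. A sketch, assuming a matching iceberg-spark-runtime jar is on the classpath:

// Load the location-based Iceberg table directly from its HDFS path
val todos = spark.read
  .format("iceberg")
  .load("hdfs://localhost:9000/warehouse/debeziumevents/debeziumcdc_showcase_public_todos")

todos.show()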

Hadoop

Hadoop includes a Jakarta-enabled version of Jetty, but even in v3.3.6 many of the servlets still implement javax.servlet.* interfaces, which doesn't work in a Jakarta project.

java.lang.RuntimeException: java.lang.NoSuchMethodError: 'void org.eclipse.jetty.servlet.ServletHolder.<init>(javax.servlet.Servlet)'
