
Showcase for Hadoop with CDC on Quarkus

This project holds a showcase for Hadoop with CDC on Quarkus.

Make targets

The following make targets exist in the subfolder podman:

Containers and co.

  • pd-machine-create - Create a suitable Qemu machine

  • pd-pod-create - Create the pod with port mappings

  • pd-pod-rm - Remove pod

  • pd-pod-recreate - Remove and recreate the pod

  • pd-build - Build all images

  • pd-init - Create machine, pod and build all images

  • pd-start - Start all containers

The following make targets are available at the top level:

Tools

  • todo - Create todo entry via curl

  • list - List todo entries via curl

  • kat-listen - Listen for Kafka messages

  • kat-send - Send Kafka message

  • psql - Use psql CLI to connect to Postgres

Hive

  • beeline - Start beeline CLI and connect to Hive

  • beeline-hive-select - Select data from hive_todos

  • beeline-debezium-select - Select data from debezium_todos

  • beeline-spark-select - Select data from spark_messages and spark_todos

Spark

  • spark-shell - Start Spark shell and connect to Spark (see the example after this list)

  • spark-beeline - Start Spark Beeline and connect to Spark

  • data-init - Init all Hive data

  • copy - Copy the Scala jar into the Hadoop container
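
Once connected via spark-shell, the initialized data can be sanity-checked directly from the shell. A minimal sketch, assuming the table names used by the beeline-spark-select target above:

// Inside spark-shell: list the tables and peek at the CDC data
spark.sql("show tables").show()
spark.sql("select * from spark_todos limit 10").show()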

Browser

  • open-namenode - Open the namenode in a browser

  • open-datanode - Open the datanode in a browser

  • open-spark-master - Open the Spark master in a browser

  • open-spark-slave - Open the Spark slave in a browser

  • open-spark-shell - Open the Spark shell in a browser

  • open-resourcemanager - Open the ResourceManager in a browser

  • open-debezium - Open Debezium in a browser

  • open-app - Open the Quarkus Dev tools in a browser

How to use

Initial setup

  1. Create podman machine: make -C podman pd-machine-create

  2. Start podman machine: make -C podman pd-machine-start

  3. Create pod: make -C podman pd-pod-create

  4. Build all containers: make -C podman pd-build

  5. Start all containers: make -C podman pd-start

Start everything on the host

  1. Init the Hive tables: make init

  2. Compile the Scala jar: make scala

  3. SSH into the Hadoop pod: make ssh

Run everything via SSH

  1. Copy the Scala jar to Hadoop: make copy

  2. Init the Spark tables: make init

  3. Run the Scala jar: make run (see the sketch after this list)

  4. Create a todo via curl: make todo
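
The Scala jar holds the showcase's actual Spark job. Purely as an illustration of the kind of job make run starts (not the repo's real code; the bootstrap server and topic name are assumptions), a Structured Streaming job consuming the Debezium topic might look like this:

// Illustrative sketch only: consume CDC events from Kafka via Spark Structured Streaming.
// Requires the spark-sql-kafka package; the bootstrap server and topic name below are
// assumptions, not the showcase's actual values.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "showcase.public.todos")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")

events.writeStream
  .format("console")
  .start()
  .awaitTermination()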

Problems

Dockerfile

Apparently, the datanodes rely on native C libraries that I couldn't keep from dumping core on Alpine/musl.

Podman

Starting with Podman 4.4.1, the default privileges for chroot were dropped, which led to the following problem on connection:

ssh: Connection closed by 127.0.0.1 port 22
sshd: chroot("/run/sshd"): Operation not permitted [preauth]

sshd relies on chroot for privilege separation, so the capability has to be granted again when the container is created (e.g. via --cap-add SYS_CHROOT).

Scala

Errors like the following indicate a Scala version mismatch:

java.lang.NoSuchMethodError: 'scala.collection.immutable.ArraySeq scala.runtime.ScalaRunTime$.wrapRefArray(java.lang.Object[])'
Caused by: java.lang.ClassNotFoundException: scala.$less$colon$less

Make sure the Scala version of the jars/dependencies matches the Scala version of Spark.

This can be easily checked with:

mvn dependency:tree
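
In addition to the dependency tree, Spark's own Scala version can be verified from inside spark-shell (plain Spark/Scala API, nothing repo-specific):

// Inside spark-shell: print Spark's version and the Scala version it runs on
println(spark.version)
println(scala.util.Properties.versionString)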

Hive

The JDBC connection string for Hive, for either the anonymous user or hduser, is the following:

jdbc:hive2://localhost:10000/default
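
For a programmatic check, plain JDBC works as well. A minimal sketch, assuming the Hive JDBC driver (org.apache.hive:hive-jdbc) is on the classpath and hduser needs no password:

import java.sql.DriverManager

object HiveConnect {
  def main(args: Array[String]): Unit = {
    // Connect as hduser; an empty password is an assumption for this setup
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hduser", "")
    try {
      val rs = conn.createStatement().executeQuery("show tables")
      while (rs.next()) println(rs.getString(1))
    } finally conn.close()
  }
}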

And adding the external Debezium table works like this:

add jar /home/hduser/hive/lib/iceberg-hive-runtime-1.1.0.jar;
create external table debezium
stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
location 'hdfs://localhost:9000/warehouse/debeziumevents/debeziumcdc_showcase_public_todos'
TBLPROPERTIES ('iceberg.catalog'='location_based_table');
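
Since the table is location-based, Spark can also read it directly by path, without going through the Hive metastore. A sketch, assuming a matching iceberg-spark-runtime jar is on the classpath:

// Load the location-based Iceberg table directly from its HDFS path
val todos = spark.read
  .format("iceberg")
  .load("hdfs://localhost:9000/warehouse/debeziumevents/debeziumcdc_showcase_public_todos")

todos.show()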

Hadoop

Hadoop includes a Jakarta-enabled version of Jetty, but even in v3.3.6 many of the servlets still implement javax.servlet.* interfaces, which doesn't work in a Jakarta project.

java.lang.RuntimeException: java.lang.NoSuchMethodError: 'void org.eclipse.jetty.servlet.ServletHolder.<init>(javax.servlet.Servlet)'
