
A quickstart demo to showcase Hudi functionalities using docker along with support for integration-tests #455

Merged
merged 1 commit into from Oct 2, 2018

Conversation

bvaradar (Contributor)

Comes with foundations for adding docker integration tests. Docker images are built with Hadoop 2.8.4, Hive 2.3.3, and Spark 2.3.1.

Demo using docker containers with documentation

@bvaradar (Contributor, Author)

@vinothchandar @n3nash : Rebased this branch against master. Tests pass locally. Please use this instead of vinothchandar#4.

@bvaradar (Contributor, Author)

@vinothchandar : The build is failing here:


```
[ERROR] Failed to execute goal on project hoodie-hadoop-base-docker: Could not resolve dependencies for project com.uber.hoodie:hoodie-hadoop-base-docker:pom:0.4.4-SNAPSHOT: Could not transfer artifact com.uber.hoodie:hoodie-hadoop-docker:jar:0.4.4-SNAPSHOT from/to Maven repository (https://central.maven.org/maven2/): Host name 'central.maven.org' does not match the certificate subject provided by the peer (CN=repo1.maven.org, O="Sonatype, Inc", L=Fulton, ST=MD, C=US) -> [Help 1]
[ERROR]
```

This is a brand new module. If I run `mvn clean package`, it passes, but `mvn clean install` fails. Have you noticed this before when new modules were added to Hudi?


Stock Tracker data will be used to showcase both different Hudi Views and the effects of Compaction.

Take a look at the director `docker/demo/data`. There are 2 batches of stock data - each at 1 minute granularity.
Contributor
directory*

@sungjuly (Member) left a comment

This is a huge help in understanding Hoodie, thanks!

```
# Schedule a compaction. This will use Spark Launcher to schedule compaction
hoodie:stock_ticks->compaction schedule
....
Compaction instance : 20180910234509
```
Member

It seems `compaction schedule` is trying to kick off YARN instead of the Spark cluster. Here's a log from testing:

```
2018-09-15 03:59:11 INFO  TimelineClientImpl:297 - Timeline service address: http://historyserver:8188/ws/v1/timeline/
2018-09-15 03:59:11 INFO  RMProxy:98 - Connecting to ResourceManager at resourcemanager:8032
2018-09-15 03:59:11 INFO  AHSProxy:42 - Connecting to Application History server at historyserver/172.22.0.7:10200
2018-09-15 04:14:10 ERROR SparkContext:91 - Error initializing SparkContext.
java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "resourcemanager":8032; java.net.UnknownHostException; For more details see:  http://wiki.apache.org/hadoop/UnknownHost
    at sun.reflect.GeneratedConstructorAccessor20.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
```

Contributor (Author)

Thanks @sungjuly . I have rebased and this issue should be fixed now.

@@ -14,11 +14,11 @@ Check out code and pull it into Intellij as a normal maven project.

Normally build the maven project, from command line
```
$ mvn clean install -DskipTests
$ mvn clean install -DskipTests -skipITs
```

Member

`-skipITs` should be `-DskipITs`

@bvaradar bvaradar changed the title Build and deploy Hoodie Docker containers for integration-tests along with demo [WIP] Build and deploy Hoodie Docker containers for integration-tests along with demo Sep 21, 2018
@bvaradar (Contributor, Author)

@sungjuly @n3nash : Thanks a lot for the feedback. I have revamped this PR and made a few improvements. Here is the list:

  1. Reimplemented the docker files to reduce image sizes. I am seeing a 3-4x reduction (from 4-5 GB down to 0.8-1.5 GB).
  2. Published all the docker images to docker-hub (temporarily under my docker-hub account) so that there is no need to build the docker images as part of running the script.
  3. Made both the docker images and the compose setup follow a convention so that a single image can satisfy both use-cases:
    (a) Standalone - the images published to docker-hub have the hudi jars built in, so they are self-contained.
    (b) Mounting convention - the images allow overriding those hudi jars with locally built jars, so the same setup can be used for running integ-testing against local code changes (look at the compose script and hoodie-integ/pom.xml).
  4. Updated the quickstart docs to reflect the changes above.

I have marked this PR as WIP since I need to showcase incremental view as part of demo.

@n3nash @sungjuly : Can you please follow the quickstart steps and let me know how it goes?
Our aim is an easy-to-use, hassle-free hudi quickstart, and I would appreciate your feedback on that.


```
cd docker
./setup_demo.sh
```

@sungjuly (Member), Sep 21, 2018

I've encountered this. It seems related to the docker version. Here's my environment, to help figure out the problem.

  • docker-engine: 18.06.1-ce (there's no way to downgrade to 17.12 since the official site only supports the latest version for Mac)
  • docker-compose: 1.22.0
```
➜  docker git:(docker) ✗ ./setup_demo.sh
Removing network compose_default
Creating network "compose_default" with the default driver
Creating kafkabroker ...
Creating zookeeper                 ... error
Creating hive-metastore-postgresql ...
Creating namenode                  ...
Creating kafkabroker               ... error
Creating hive-metastore-postgresql ... error
ERROR: for zookeeper  Cannot create container for service zookeeper: Conflict. The container name "/zookeeper" is already in use by container "8fb174b9eb8c40bdc20e0ed1b042a793fe49ec3dcea1f353c250b482c7c80019". You have to remove (or rename) that container to be able to reuse that name.

ERROR: for kafkabroker  Cannot create container for service kafka: Conflict. The container name "/kafkabroker" is already in use. You have to remove (or rename) that container to be able to reuse that name.
```

Contributor (Author)

@sungjuly : OK, can you check `docker container ls` and see if there are stopped/running containers with the same names? If you find them, can you remove them using `docker rm`?

This is a one-time issue for those who tried this PR before. I moved the compose script to a different file, but the container names are the same. Removing the docker containers should hopefully fix the issue.

For a different reason, I reset my docker (Mac) installation to "factory defaults", and that also fixed the problem.
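The cleanup steps above can be sketched as a small script (a sketch, not part of the PR; the container names are taken from the compose output quoted earlier, and the DRY_RUN guard only prints the commands so the list can be reviewed before anything is removed):

```shell
#!/bin/sh
# One-time cleanup sketch for stale demo containers. Adjust the name list to
# whatever `docker container ls -a` actually shows on your machine.
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands (default), 0 = really remove
cleanup_demo_containers() {
  for name in zookeeper kafkabroker namenode hive-metastore-postgresql; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "docker rm -f $name"
    else
      docker rm -f "$name" 2>/dev/null || true
    fi
  done
}
cleanup_demo_containers
```

Run it once with the default dry-run, inspect the printed `docker rm -f` commands, then rerun with `DRY_RUN=0` to actually remove the containers.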

Member

The docker env was fully cleaned up before I tested. Here's the output of `docker container ls`:

```
➜  docker git:(docker) ✗ docker container ls
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
```

Member

It worked after resetting to factory defaults on docker! thank you! @bvaradar


@bvaradar (Contributor, Author)

@sungjuly: I am also online in the hoodie slack channel. Ping me there if you need a quicker reply.

Thanks,
Balaji.V

@sungjuly (Member)

@bvaradar would you please share the link for the hoodie slack channel? I don't see any information, thank you!

@sungjuly (Member)

nvmd, I found this - #143 (comment)

@sungjuly (Member) left a comment

A huge thank you, @bvaradar. It's super helpful for understanding hoodie more!



```
# Execute the compaction
hoodie:stock_ticks->compaction run --compactionInstant 20180910234509 --parallelism 2 --sparkMemory 1G --schemaFilePath /var/demo/config/schema.avsc --retry 1
```
Member

It would be great if you could add more description: the `compactionInstant` value should be updated based on the previous results of `compactions show all`.
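For illustration, the flow this comment asks to document might look like the following in the CLI (a sketch only; `<instant-from-above>` is a placeholder for the instant time reported by `compactions show all`, not a real value):

```
# List compactions and note the latest pending instant time
hoodie:stock_ticks->compactions show all
# Use that instant (not a hard-coded one) in the run command
hoodie:stock_ticks->compaction run --compactionInstant <instant-from-above> --parallelism 2 --sparkMemory 1G --schemaFilePath /var/demo/config/schema.avsc --retry 1
```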



@bvaradar bvaradar changed the title [WIP] Build and deploy Hoodie Docker containers for integration-tests along with demo A quickstart demo to showcase Hudi functionalities using dockers along with support for integration-tests Sep 24, 2018
@bvaradar bvaradar changed the title A quickstart demo to showcase Hudi functionalities using dockers along with support for integration-tests A quickstart demo to showcase Hudi functionalities using docker along with support for integration-tests Sep 24, 2018
@bvaradar (Contributor, Author) left a comment

@vinothchandar @n3nash @sungjuly : Updated PR with incremental view demo in quickstart. Ready for review.

@bvaradar (Contributor, Author)

Tests were failing because of a log-size limit issue. Fixed as part of PR #465.

```yaml
services:

  namenode:
    image: varadarb/hudi-hadoop_2.8.4-namenode:latest
```
Contributor

Is this correct? Should the username be here? @bvaradar

Contributor (Author)

@n3nash : I am not aware of any headless docker-hub account. I have created a ticket to provide more context and assigned it to @vinothchandar:

#469

Once the ticket is resolved, we can replace the image locations in the Dockerfile, pom.xml, and the compose scripts.

```yaml
      - /tmp/hadoop_data:/hadoop/dfs/data

  historyserver:
    image: varadarb/hudi-hadoop_2.8.4-history:latest
```
Contributor

same here

```
exit
```

#### Step 5 (b): Run Spark-SQl Queries
Contributor (Author)

@sungjuly @vinothchandar @n3nash : added spark-sql example here

Member

👍 I've tested the new scenarios. It worked properly! Separately creating MOR/COW tables is a good idea for explaining the internals. Thank you for your work, speaking as a user!

Member

+1 It is so much better to test stuff now.. :) Thank you for doing this, even as co-creator of the project :) ha ha.

@vinothchandar (Member) left a comment

Works really well. Took me 20 minutes to get the containers up, on a decent internet connection in India.

```
2018-09-24 22:20:00 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-09-24 22:20:00 INFO SparkContext:54 - Successfully stopped SparkContext
# Run the following spark-submit command to execute the delta-streamer and ingest to stock_ticks_mor dataset in HDFS
spark-submit --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer /var/hoodie/ws/hoodie-utilities/target/hoodie-utilities-0.4.4-SNAPSHOT.jar --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource --source-ordering-field ts --target-base-path /user/hive/warehouse/stock_ticks_mor --target-table stock_ticks_mor --props /var/demo/config/kafka-source.properties
```
Member

This is now hoodie-utilities-0.4.4-SNAPSHOT.jar with the new release; same comment on the other 3 occurrences as well. Can we name the jar differently inside the container, without the version? That way it will continue to work. (I understand it takes away knowing what version is being tested.) Thoughts?

Member

Also, DeltaStreamer now takes a required --storage-type COPY_ON_WRITE/MERGE_ON_READ argument. We could use this and get rid of the steps to create the dataset manually via the CLI? DeltaStreamer will create the dataset if the basePath does not exist.

Contributor (Author)

@vinothchandar : I have made changes to treat the hoodie-utilities jar the same way as the other bundle jars. The utilities jar is now part of the docker image with the version removed. It is also available via the alias $HUDI_UTILITIES_BUNDLE. I have updated the quickstart and uploaded newer docker images with the "latest" tag, so you should see the changes when you set up the demo again. The second step, explicitly initializing Hudi datasets using the Hudi CLI, has also been removed.
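As an illustration of the alias-based invocation (a sketch, not the exact quickstart text: the fallback jar path is an assumption about the image layout, the flags are copied from the spark-submit command quoted earlier in this thread, and the command is echoed rather than executed):

```shell
#!/bin/sh
# Inside the demo containers the alias is expected to point at the
# versionless utilities bundle; the fallback path here is an assumption.
HUDI_UTILITIES_BUNDLE=${HUDI_UTILITIES_BUNDLE:-/var/hoodie/ws/hoodie-utilities/target/hoodie-utilities.jar}
cmd="spark-submit \
 --class com.uber.hoodie.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
 --storage-type MERGE_ON_READ \
 --source-class com.uber.hoodie.utilities.sources.JsonKafkaSource \
 --source-ordering-field ts \
 --target-base-path /user/hive/warehouse/stock_ticks_mor \
 --target-table stock_ticks_mor \
 --props /var/demo/config/kafka-source.properties"
# Echoed rather than executed so the command can be inspected first.
echo "$cmd"
```

Because the jar name no longer embeds a version, the same command keeps working across releases without editing the docs.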

Member

@bvaradar: per docker/compose#3574, I don't think `compose up` will pull in the latest images. I am going to remove all containers and retry. We probably need better support in the script for optionally doing a `compose pull` first?

Member

That worked, @bvaradar. One more issue I found: between recreating containers, the /tmp/hadoop* paths on the host machine need to be blown away. Can we add an `rm -rf` before the `mkdir -p` in the setup script? Other than that, I verified that building and the deltastreamer work.
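The two fixes discussed in this thread (wiping the stale /tmp/hadoop* state and pulling images before bringing the cluster up) can be sketched as follows. The compose steps are echoed placeholders, since the real compose file name lives under the docker/ directory of this PR:

```shell
#!/bin/sh
# 1. Blow away stale HDFS state on the host so the NameNode does not come up
#    in safe mode after containers are recreated.
rm -rf /tmp/hadoop_name /tmp/hadoop_data
mkdir -p /tmp/hadoop_name /tmp/hadoop_data
# 2. Pull the latest published images before bringing the cluster up.
#    (<compose-file> is a placeholder; these lines are echoed, not executed.)
echo "docker-compose -f <compose-file>.yml pull"
echo "docker-compose -f <compose-file>.yml up -d"
```

This mirrors the ordering the comments converge on: clean host state first, then `pull`, then `up`.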

Lets run similar queries against M-O-R dataset. Lets look at both
ReadOptimized and Realtime views supported by M-O-R dataset

# Run agains ReadOptimized View. Notice that the latest timestamp is 10:29
Member

typo: against

Contributor (Author)

Fixed.

1 row selected (6.326 seconds)


# Run agains Realtime View. Notice that the latest timestamp is again 10:29
Member

typo: against

Contributor (Author)

Fixed.


running in spark-sql

```
$SPARK_INSTALL/bin/spark-shell --jars $HUDI_SPARK_BUNDLE --master local[2] --driver-class-path $HADOOP_CONF_DIR --conf spark.sql.hive.convertMetastoreParquet=false --deploy-mode client --driver-memory 1G --executor-memory 3G --num-executors 1 --packages com.databricks:spark-avro_2.11:4.0.0
```

Member

Add `docker exec -it adhoc-1 /bin/bash`?

Contributor (Author)

Added.


#### Step 7(b): Run Spark SQL Queries

Running the same queries in Spark-SQl:
Member

nit: SQL

Contributor (Author)

Fixed.

@vinothchandar (Member) left a comment

Took a pass; the other changes seem OK. Let me know once you have fixed the minor issues I left in the comments.

```properties
hoodie.deltastreamer.schemaprovider.source.schema.file=/var/demo/config/schema.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=/var/demo/config/schema.avsc
# Kafka Source
#hoodie.deltastreamer.source.kafka.topic=uber_trips
```
Member

nit: remove this line?

Contributor (Author)

Done.

```xml
<packaging>maven-plugin</packaging>
<name>docker-maven-plugin</name>
<description>A maven plugin for docker</description>
<url>https://github.com/spotify/docker-maven-plugin</url>
```
Member

Is the plugin not published anywhere we can pull it down from?

Contributor (Author)

Good catch. This file was accidentally added.

@bvaradar (Contributor, Author)

bvaradar commented Oct 1, 2018

@vinothchandar : Thanks a lot for the review comments. Incorporated them and updated the PR.

@vinothchandar (Member)

@bvaradar reported two issues with respect to re-initing containers; otherwise it's good for merging. I will go ahead and do that.

@@ -0,0 +1,49 @@
[![Gitter chat](https://badges.gitter.im/gitterHQ/gitter.png)](https://gitter.im/big-data-europe/Lobby)

# docker-hive
Member

@bvaradar is this file meant to be checked in?

Contributor (Author)

Missed this. I have removed it now.

@@ -0,0 +1,15 @@
# Create host mount directory and copy
mkdir -p /tmp/hadoop_name
Member

Add a protective `rm -rf` here? Otherwise the NN will start in safe mode since it does not recognize the data.

Contributor (Author)

Once the docker containers are stopped, the files are deleted using `rm -rf`. I did not want to do the deletion before `docker-compose down`, so that the docker containers can shut down more cleanly. Will chat f2f to explain more.

@vinothchandar (Member)

@bvaradar actually, let's do a quick sync before I merge this.

@bvaradar (Contributor, Author)

bvaradar commented Oct 1, 2018

@vinothchandar : Changed the setup_demo script to do a `docker-compose pull` before `docker-compose up`, in order to pull the latest version of the docker images. Addressed the other comments.

@bvaradar bvaradar force-pushed the docker branch 2 times, most recently from 0aec4bd to 3f3f97b Compare October 1, 2018 20:20
…er integration tests. Docker images built with Hadoop 2.8.4 Hive 2.3.3 and Spark 2.3.1 and published to docker-hub

Look at quickstart document for how to setup docker and run demo
4 participants