
Add a docker-compose environment for local/integration testing #58

Merged: 30 commits into dbt-labs:master on Mar 17, 2020

Conversation

@dmateusp (Contributor) commented Mar 7, 2020

Hi there,

Relevant:

@aaronsteers shared his original local environment; I diverged quite a bit from it, so I decided to open a fresh PR.

What this PR adds:

  • a local docker-compose environment that starts a Thrift server (built within the repo) and a Postgres instance (backing the Hive metastore)
  • mounted volumes, so users can see data being created under ./.spark-warehouse and ./.hive-metastore
  • a Spark UI at localhost:4040, where users can watch queries execute

I hope this helps with integration testing and makes it easier for people to get started with dbt-spark or develop the plugin.
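Roughly, the compose setup looks like this. This is a simplified sketch, not the exact file in the diff; the Postgres image tag, the metastore data path, and the port mappings are assumptions based on common defaults:

version: "3.7"
services:

  # Spark Thrift server, built from the Dockerfile in this repo
  thrift:
    build: docker/thrift
    ports:
      - "10000:10000"  # JDBC endpoint that dbt connects to
      - "4040:4040"    # Spark UI, for watching queries execute
    depends_on:
      - hive-metastore
    volumes:
      # persist table data on the host, so docker-compose down is safe
      - ./.spark-warehouse/:/usr/local/spark/spark-warehouse

  # Postgres instance backing the Hive metastore
  hive-metastore:
    image: postgres:11
    volumes:
      # persist metastore metadata next to the data (path assumed)
      - ./.hive-metastore/:/var/lib/postgresql/data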

cc @Fokko

@jtcohen6 mentioned this pull request Mar 9, 2020
@jtcohen6 (Contributor) left a comment:

This is so cool. Thank you so much for the hard work here, @dmateusp! I took it for a spin locally and was amazed by the ease of setup. I think this is going to enable local integration testing and containerized CI in a way that accelerates the pace of contribution.

Given that this is a fork off of @aaronsteers' work in #55, is everyone on board with preferring this one? I prefer the addition of a Postgres dependency to a MySQL one. I'm open to hearing if there's significant functionality supported in the other approach and omitted here.

If it's okay with you, I want to wait on merging this until @beckjake has a chance to give it a once-over. (He's on vacation this week.)

@aaronsteers (Contributor) commented Mar 9, 2020:

> This is so cool. Thank you so much for the hard work here, @dmateusp! I took it for a spin locally and was amazed by the ease of setup. I think this is going to enable local integration testing and containerized CI in a way that accelerates the pace of contribution.

I agree! Fantastic work - and thank you @dmateusp for your effort in getting this revamped and cleaned up.

> Given that this is a fork off of @aaronsteers' work in #55, is everyone on board with preferring this one? I prefer the addition of a Postgres dependency to a MySQL one. I'm open to hearing if there's significant functionality supported in the other approach and omitted here.

@jtcohen6 - YES - for my part, at least, I do agree: this is far cleaner than the initial version I posted in #55, and I am happy to close or deprioritize #55 in favor of this approach. I may still iterate on some version of this for my own needs in a standalone image, but I can use the core Dockerfile as the source image for further downstream work. Meanwhile, the core image here will be leaner and easier to maintain.

Comment on lines 3 to 4
ARG HADOOP_VERSION=2.7.7
ARG HADOOP_MINOR_VERSION=2.7
A contributor commented:

Rather than declaring two args and having to keep them in sync, what about calculating HADOOP_MINOR_VERSION from HADOOP_VERSION?

I'm not an expert at bash substitution by any means, but I believe this does the trick:

# Get 2-part minor version string (e.g. `2.7.7` -> `2.7`)
ENV HADOOP_MINOR_VERSION=${HADOOP_VERSION%.*}

@dmateusp (Author) replied:

Step 4/26 : ARG HADOOP_MINOR_VERSION=${HADOOP_VERSION%.*}
ERROR: Service 'thrift' failed to build: failed to process "${HADOOP_VERSION%.*}": missing ':' in substitution

I don't think this is supported by Docker; I'm curious if/how you made it work.
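For what it's worth, Docker's Dockerfile parser only supports a limited set of substitutions in ARG/ENV (${var}, ${var:-default}, ${var:+value}), but POSIX trimming like ${var%.*} does work inside a RUN instruction, because RUN is executed by a shell. A rough sketch of that workaround (the base image and the echo line are illustrative only, not taken from this PR):

FROM debian:buster
ARG HADOOP_VERSION=2.7.7
# ARG/ENV can't evaluate ${HADOOP_VERSION%.*}, but /bin/sh can:
# derive the minor version inside the RUN step where it's needed.
RUN HADOOP_MINOR_VERSION="${HADOOP_VERSION%.*}" \
    && echo "Hadoop minor version: ${HADOOP_MINOR_VERSION}"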

ARG HADOOP_MINOR_VERSION=2.7
ARG HADOOP_HOME=/usr/local/hdp
ARG SPARK_NAME=spark-${SPARK_VERSION}-bin-hadoop${HADOOP_MINOR_VERSION}
ARG SPARK_HOME=/usr/local/spark
A contributor commented:

Just curious: do we need SPARK_HOME as an ARG, or would a simple ENV do the trick? (I'm not sure what the use case for having this as an ARG would be.)

Same question for SPARK_NAME, since we already have ARGs for the Spark and Hadoop version strings.

@dmateusp (Author) replied:

Thanks for the feedback here. I removed some of those ARGs; to clean up the usage of ARG and ENV, I also re-use the base image.

Re-using a base image seems to be a Docker trick for propagating the environment, and I'm happy with how it simplified the ARG/ENV usage in the Dockerfile:

moby/moby#37345 (comment)
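The pattern from that thread, roughly. The stage and variable names here are illustrative, not the exact ones in this PR:

FROM debian:buster AS base
# ENV set in the base stage...
ENV SPARK_HOME=/usr/local/spark

FROM base
# ...is inherited by any stage that builds FROM it,
# so it doesn't need to be re-declared per stage.
RUN echo "SPARK_HOME is ${SPARK_HOME}"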

@aaronsteers (Contributor) left a comment:

I've left a few comments/questions/suggestions inline with the code, but I see no blockers or glaring issues. I would still wait on approval from someone else on the core team, but for my part, I would be happy to see this move forward.

Also, a general disclaimer: since I am working on another project now, I haven't had time to do any real testing, so I will have to lean on others for testing-based feedback.

Daniel Mateus Pires and others added commits March 10, 2020 22:31:

  • Swap Spark instructions and Hadoop instructions
  • Re-use base image to share ENV
  • Remove some ARGs

Co-Authored-By: Aaron Steers <18150651+aaronsteers@users.noreply.github.com>
@NielsZeilemaker (Collaborator) commented:

Hi @aaronsteers, I've updated our Spark Docker Hub image so it can run a Thrift server.
We actively maintain these images and test them whenever a new version of Spark comes out.
Maybe you could swap out your Dockerfile and switch to ours?

https://github.com/godatadriven-dockerhub/spark

I've included a thrift server example docker-compose file in the root of the repo:
https://github.com/godatadriven-dockerhub/spark/blob/master/docker-compose-thrift.yml
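Presumably, switching would mostly mean replacing the build: stanza with an image: reference, along these lines. The tag and any command/configuration are omitted here as assumptions; the linked compose file is the authoritative example:

services:

  thrift:
    # pull the maintained image instead of building docker/thrift locally
    image: godatadriven/spark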

@Fokko (Contributor) left a comment:

LGTM, thanks for picking this up @dmateusp 👍

services:

  thrift:
    build: docker/thrift
A contributor commented:

I would be interested in taking @NielsZeilemaker's suggestion and using a pre-existing image instead of building one from scratch.

A contributor replied:

We're also open to PRs :-)

@dmateusp (Author) replied:

Oh, of course; I didn't know there was something out there.

    depends_on:
      - hive-metastore
    volumes:
      - ./.spark-warehouse/:/usr/local/spark/spark-warehouse
A contributor commented:

Just curious: when would you use this? The data will be mounted onto the root fs, while all the metadata is inside the Docker images.

@dmateusp (Author) replied:

I mount another volume here: https://github.com/fishtown-analytics/dbt-spark/pull/58/files#diff-4e5e90c6228fd48698d074241c2ba760R20

So you have both the metadata and the data persisted locally.

I just think it's nicer to know you can run docker-compose down if something goes wrong and still keep your metadata/data somewhere.

Co-Authored-By: Fokko Driesprong <fokko@driesprong.frl>
@dmateusp (Author) commented:

@Fokko @NielsZeilemaker thanks for sharing the godatadriven image!

However, would you look into godatadriven-dockerhub/spark#1?

My docker-compose setup sometimes crashes (typically on the first run, when the database files aren't initialized yet). I solved it in this repository by adding a retry mechanism to the entrypoint.

After that's solved, I don't see any objections to removing the Docker image I created here in favor of the godatadriven-hosted image.
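A retry mechanism of that flavor typically looks something like the following sketch. The host name, port, and timing are illustrative, and it assumes the Postgres client tools (pg_isready) are available in the image; it is not the exact entrypoint from this PR:

#!/bin/sh
# Block until the metastore database accepts connections,
# then hand off to the container's real command.
until pg_isready -h hive-metastore -p 5432; do
  echo "Hive metastore not ready yet; retrying in 5s..."
  sleep 5
done
exec "$@"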

@Fokko (Contributor) commented Mar 14, 2020

@dmateusp (Author) commented:

I cleaned up this PR to reuse godatadriven's image

@Fokko (Contributor) left a comment:

Awesome work @dmateusp! LGTM

version: "3.7"
services:

  thrift:
A contributor commented:

Maybe thrift isn't the best name; I'd rather call it Spark, or Spark2. It would be great if we could test dbt against Spark 2 and 3 in the future :-)

@dmateusp (Author) replied:

I changed the names to be more specific

@jtcohen6 (Contributor) commented:

Really exciting stuff in here, folks!

@beckjake Could you give this a once-over when you get a chance?

@Fokko (Contributor) commented Mar 16, 2020:

It would be great if we could run this in CI, without having to wait for the manual integration tests :)

@dmateusp (Author) commented:

I can look into #61 in a separate PR; hopefully it helps with 0.15.0!

@jtcohen6 (Contributor) commented:

Right on.

Additionally, there are a lot of conversations happening on our end right now about how we can write and implement better Spark integration tests. The goal is to find a happy medium between the dbt-core integration tests and the proof-of-concept dbt-integration-tests repo.

@beckjake (Contributor) left a comment:

This looks great to me! I have some tiny suggested docs changes, but with those tweaks to my profiles.yml I was able to get this up and running.

dmateusp and others added 2 commits March 17, 2020 12:58
Co-Authored-By: Jacob Beck <beckjake@users.noreply.github.com>
Co-Authored-By: Jacob Beck <beckjake@users.noreply.github.com>
@beckjake (Contributor) left a comment:

Looks great, I love this! It was so easy to get Spark up and running locally.

@jtcohen6 jtcohen6 merged commit 55b236c into dbt-labs:master Mar 17, 2020
@dmateusp mentioned this pull request Mar 17, 2020