
Add a docker-compose environment for local/integration testing #58

Merged: 30 commits into dbt-labs:master on Mar 17, 2020

Conversation

@dmateusp (Contributor) commented Mar 7, 2020

Hi there,

Relevant:

@aaronsteers shared his original local environment; I diverged quite a bit from it, so I decided to open a fresh PR.

What this PR adds:

  • a local docker-compose environment that starts a Thrift server (built within the repo) and a Postgres instance (backing the Hive metastore)
  • mounted volumes, so users can see data being created under ./.spark-warehouse and ./.hive-metastore
  • a Spark UI at localhost:4040, where users can watch queries execute

I hope this helps with integration testing and makes it easier for people to get started with dbt-spark or develop the plugin.
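Roughly, the compose setup looks like this. This is a simplified sketch, not the exact file in the diff; the Postgres image tag, the metastore data path, and the port mappings are assumptions based on common defaults:

version: "3.7"
services:

  # Spark Thrift server, built from the Dockerfile in this repo
  thrift:
    build: docker/thrift
    ports:
      - "10000:10000"  # JDBC endpoint that dbt connects to
      - "4040:4040"    # Spark UI, for watching queries execute
    depends_on:
      - hive-metastore
    volumes:
      # persist table data on the host, so docker-compose down is safe
      - ./.spark-warehouse/:/usr/local/spark/spark-warehouse

  # Postgres instance backing the Hive metastore
  hive-metastore:
    image: postgres:11
    volumes:
      # persist metastore metadata next to the data (path assumed)
      - ./.hive-metastore/:/var/lib/postgresql/data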

cc @Fokko

@jtcohen6 mentioned this pull request Mar 9, 2020
@jtcohen6 (Contributor) left a comment:

This is so cool. Thank you so much for the hard work here, @dmateusp! I took it for a spin locally and was amazed by the ease of setup. I think this is going to enable local integration testing and containerized CI in a way that accelerates the pace of contribution.

Given that this is a fork off of @aaronsteers' work in #55, is everyone on board with preferring this one? I prefer the addition of a Postgres dependency to a MySQL one. I'm open to hearing if there's significant functionality supported in the other approach and omitted here.

If it's okay with you, I want to wait on merging this until @beckjake has a chance to give it a once-over. (He's on vacation this week.)

@aaronsteers (Contributor) commented Mar 9, 2020:

> This is so cool. Thank you so much for the hard work here, @dmateusp! I took it for a spin locally and was amazed by the ease of setup. I think this is going to enable local integration testing and containerized CI in a way that accelerates the pace of contribution.

I agree! Fantastic work - and thank you @dmateusp for your effort in getting this revamped and cleaned up.

> Given that this is a fork off of @aaronsteers' work in #55, is everyone on board with preferring this one? I prefer the addition of a Postgres dependency to a MySQL one. I'm open to hearing if there's significant functionality supported in the other approach and omitted here.

@jtcohen6 - YES - for my part, at least, I do agree: this is far cleaner than the initial version I posted in #55, and I am happy to close or deprioritize #55 in favor of this approach. I may still iterate on some version of this for my own needs in a standalone image, but I can use the core Dockerfile as the source image for further downstream work. Meanwhile, the core image here will be leaner and easier to maintain.

Comment on lines 3 to 4
ARG HADOOP_VERSION=2.7.7
ARG HADOOP_MINOR_VERSION=2.7
A contributor commented:

Rather than declaring two args and having to keep them in sync, what about calculating HADOOP_MINOR_VERSION from HADOOP_VERSION?

I'm not an expert at bash substitution by any means, but I believe this does the trick:

# Get 2-part minor version string (e.g. `2.7.7` -> `2.7`)
ENV HADOOP_MINOR_VERSION=${HADOOP_VERSION%.*}

@dmateusp (Author) replied:

Step 4/26 : ARG HADOOP_MINOR_VERSION=${HADOOP_VERSION%.*}
ERROR: Service 'thrift' failed to build: failed to process "${HADOOP_VERSION%.*}": missing ':' in substitution

I don't think this is supported by Docker; I'm curious if/how you made it work.
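For what it's worth, Docker's Dockerfile parser only supports a limited set of substitutions in ARG/ENV (${var}, ${var:-default}, ${var:+value}), but POSIX trimming like ${var%.*} does work inside a RUN instruction, because RUN is executed by a shell. A rough sketch of that workaround (the base image and the echo line are illustrative only, not taken from this PR):

FROM debian:buster
ARG HADOOP_VERSION=2.7.7
# ARG/ENV can't evaluate ${HADOOP_VERSION%.*}, but /bin/sh can:
# derive the minor version inside the RUN step where it's needed.
RUN HADOOP_MINOR_VERSION="${HADOOP_VERSION%.*}" \
    && echo "Hadoop minor version: ${HADOOP_MINOR_VERSION}"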

ARG HADOOP_MINOR_VERSION=2.7
ARG HADOOP_HOME=/usr/local/hdp
ARG SPARK_NAME=spark-${SPARK_VERSION}-bin-hadoop${HADOOP_MINOR_VERSION}
ARG SPARK_HOME=/usr/local/spark
A contributor commented:

Just curious: do we need SPARK_HOME as an ARG, or would a simple ENV do the trick? (I'm not sure what the use case for having this as an ARG would be.)

Same question for SPARK_NAME, since we already have ARGs for the Spark and Hadoop version strings.

@dmateusp (Author) replied:

Thanks for the feedback here. I removed some of those ARGs; to clean up the usage of ARG and ENV, I also re-use the base image.

Re-using a base image seems to be a Docker trick for propagating the environment, and I'm happy with how it simplified the ARG/ENV usage in the Dockerfile:

moby/moby#37345 (comment)
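The pattern from that thread, roughly. The stage and variable names here are illustrative, not the exact ones in this PR:

FROM debian:buster AS base
# ENV set in the base stage...
ENV SPARK_HOME=/usr/local/spark

FROM base
# ...is inherited by any stage that builds FROM it,
# so it doesn't need to be re-declared per stage.
RUN echo "SPARK_HOME is ${SPARK_HOME}"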

@aaronsteers (Contributor) left a comment:

I've left a few comments/questions/suggestions inline with the code, but I see no blockers or glaring issues. I would still wait on approval from someone else on the core team, but for my part, I would be happy to see this move forward.

Also, a general disclaimer: since I am working on another project now, I haven't had time to do any real testing, so I will have to lean on others for testing-based feedback.

Daniel Mateus Pires and others added commits March 10, 2020 22:31:

  • Swap Spark instructions and Hadoop instructions
  • Re-use base image to share ENV
  • Remove some ARGs

Co-Authored-By: Aaron Steers <18150651+aaronsteers@users.noreply.github.com>
@NielsZeilemaker (Collaborator) commented:

Hi @aaronsteers, I've updated our Spark Docker Hub image so it can run a Thrift server.
We actively maintain these images and test them whenever a new version of Spark comes out.
Maybe you could swap out your Dockerfile and switch to ours?

https://github.com/godatadriven-dockerhub/spark

I've included a thrift server example docker-compose file in the root of the repo:
https://github.com/godatadriven-dockerhub/spark/blob/master/docker-compose-thrift.yml
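Presumably, switching would mostly mean replacing the build: stanza with an image: reference, along these lines. The tag and any command/configuration are omitted here as assumptions; the linked compose file is the authoritative example:

services:

  thrift:
    # pull the maintained image instead of building docker/thrift locally
    image: godatadriven/spark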

@Fokko (Contributor) left a comment:

LGTM, thanks for picking this up @dmateusp 👍

services:

  thrift:
    build: docker/thrift
A contributor commented:

I would be interested in taking @NielsZeilemaker's suggestion and using a pre-existing image instead of building one from scratch.

A contributor replied:

We're also open to PRs :-)

@dmateusp (Author) replied:

Oh, of course; I didn't know there was something out there.

    depends_on:
      - hive-metastore
    volumes:
      - ./.spark-warehouse/:/usr/local/spark/spark-warehouse
A contributor commented:

Just curious: when would you use this? The data will be mounted onto the root fs, while all the metadata is inside the Docker images.

@dmateusp (Author) replied:

I mount another volume here: https://github.com/fishtown-analytics/dbt-spark/pull/58/files#diff-4e5e90c6228fd48698d074241c2ba760R20

So you have both the metadata and the data persisted locally.

I just think it's nicer to know you can run docker-compose down if something goes wrong and still keep your metadata/data somewhere.

Co-Authored-By: Fokko Driesprong <fokko@driesprong.frl>
@dmateusp (Author) commented:

@Fokko @NielsZeilemaker thanks for sharing the godatadriven image!

However, would you look into godatadriven-dockerhub/spark#1?

My docker-compose setup sometimes crashes (typically on the first run, when the database files aren't initialized yet). I solved it in this repository by adding a retry mechanism to the entrypoint.

After that's solved, I don't see any objections to removing the Docker image I created here in favor of the godatadriven-hosted image.
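A retry mechanism of that flavor typically looks something like the following sketch. The host name, port, and timing are illustrative, and it assumes the Postgres client tools (pg_isready) are available in the image; it is not the exact entrypoint from this PR:

#!/bin/sh
# Block until the metastore database accepts connections,
# then hand off to the container's real command.
until pg_isready -h hive-metastore -p 5432; do
  echo "Hive metastore not ready yet; retrying in 5s..."
  sleep 5
done
exec "$@"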

@Fokko (Contributor) commented Mar 14, 2020

@dmateusp (Author) commented:

I cleaned up this PR to reuse godatadriven's image

@Fokko (Contributor) left a comment:

Awesome work @dmateusp! LGTM

version: "3.7"
services:

  thrift:
A contributor commented:

Maybe thrift isn't the best name; I'd rather call it Spark, or Spark2. It would be great if we could test dbt against Spark 2 and 3 in the future :-)

@dmateusp (Author) replied:

I changed the names to be more specific

@jtcohen6 (Contributor) commented:

Really exciting stuff in here, folks!

@beckjake Could you give this a once-over when you get a chance?

@Fokko (Contributor) commented Mar 16, 2020:

It would be great if we could run this in CI, without having to wait for the manual integration tests :)

@dmateusp (Author) commented:

I can look into #61 in a separate PR; hopefully it helps with 0.15.0!

@jtcohen6 (Contributor) commented:

Right on.

Additionally, there are a lot of conversations happening on our end right now about how we can write and implement better Spark integration tests. The goal is to find a happy medium between the dbt-core integration tests and the proof-of-concept dbt-integration-tests repo.

@beckjake (Contributor) left a comment:

This looks great to me! I have some tiny suggested docs changes, but with those tweaks to my profiles.yml I was able to get this up and running.

dmateusp and others added 2 commits March 17, 2020 12:58
Co-Authored-By: Jacob Beck <beckjake@users.noreply.github.com>
Co-Authored-By: Jacob Beck <beckjake@users.noreply.github.com>
@beckjake (Contributor) left a comment:

Looks great, I love this! It was so easy to get Spark up and running locally.

@jtcohen6 jtcohen6 merged commit 55b236c into dbt-labs:master Mar 17, 2020
@dmateusp mentioned this pull request Mar 17, 2020