
Add terraform to setup Dataflow on GCP #2

Open
batpad opened this issue Sep 7, 2022 · 7 comments
batpad commented Sep 7, 2022

We should add a gcp directory in the terraform folder to provision Dataflow on GCP, so that the setup of GCP bakeries can be managed within this repository.

From a chat with @yuvipanda: "it just needs to provision the service account for dataflow".
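For concreteness, a minimal sketch of what such a terraform/gcp directory might contain, assuming the standard Terraform Google provider (all names and the var.project_id variable are placeholders, not the actual configuration):

```hcl
# Hypothetical sketch: provision a service account for Dataflow jobs
# and grant it the worker role. Names are placeholders.
resource "google_service_account" "dataflow" {
  account_id   = "pangeo-forge-dataflow"
  display_name = "Pangeo Forge Dataflow runner"
}

resource "google_project_iam_member" "dataflow_worker" {
  project = var.project_id
  role    = "roles/dataflow.worker"
  member  = "serviceAccount:${google_service_account.dataflow.email}"
}
```

The exact roles needed would depend on how the bakery submits jobs and where it stages artifacts.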

@cisaacstern

This repo is awesome! Thanks for getting this started. I also really like the name.

Re: GCP Terraform: each time the orchestrator FastAPI backend is released, the release script runs the Terraform in pangeo-forge/dataflow-status-monitoring@github-app-hook to set up (or verify, if it already exists) the infrastructure required for sending job completion notifications back to us when Dataflow jobs either succeed or fail.

That dataflow-status-monitoring code is mounted in orchestrator as a submodule, and called from here. Note that orchestrator imports dataflow-status-monitoring into a few different terraform environments (here), so that releasing development instances of the app doesn't inadvertently break the production infrastructure.

@echarles

Now that the runner is using Flink (pangeo-forge/pangeo-forge-runner#21), is any external Beam cluster (Dataflow on GCP) still needed?

I am still trying to understand the architecture by reading https://flink.apache.org/ecosystem/2020/02/22/apache-beam-how-beam-runs-on-top-of-flink.html and https://beam.apache.org/documentation/runners/flink, and I wonder whether Beam is still in the picture or whether Flink alone is enough to handle the jobs.
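For context on how these fit together: in Beam's model, Flink and Dataflow are both runners for the same pipeline, selected purely via pipeline options. A small illustrative sketch (the runner names are real Beam runners; the project, bucket, and cluster values are placeholders, and the helper is hypothetical):

```python
# The same Beam pipeline code can target either backend; only the
# argv-style pipeline options change. Values below are placeholders.
dataflow_args = [
    "--runner=DataflowRunner",
    "--project=my-gcp-project",
    "--region=us-central1",
    "--temp_location=gs://my-bucket/tmp",
]

flink_args = [
    "--runner=FlinkRunner",
    "--flink_master=localhost:8081",  # assumes a reachable Flink cluster
]

def runner_of(args):
    """Return the value of the --runner flag from a Beam-style argv list."""
    opts = dict(a.lstrip("-").split("=", 1) for a in args)
    return opts["runner"]
```

So Beam stays in the picture either way; Flink replaces Dataflow as the execution backend, not Beam itself.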

@echarles

Well, I guess Dataflow is still needed; I am still trying to find where Flink is configured to use it.

Another question: is there any appetite to run Beam on Kubernetes and get rid of Dataflow, as described in https://python.plainenglish.io/apache-beam-flink-cluster-kubernetes-python-a1965f37b7cb?
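In that Kubernetes setup, the pipeline would be submitted through Beam's portable runner to a job server fronting a Flink cluster in the cluster. A hedged sketch of what the options might look like (all hostnames are hypothetical in-cluster service names, not anything deployed here):

```python
# Hypothetical portable-runner options for Beam-on-Flink-on-Kubernetes.
# The job server forwards the pipeline to the Flink cluster; workers run
# in an external environment pool. All endpoints are placeholders.
portable_args = [
    "--runner=PortableRunner",
    "--job_endpoint=beam-jobserver.flink.svc.cluster.local:8099",
    "--environment_type=EXTERNAL",
    "--environment_config=beam-worker-pool.flink.svc.cluster.local:50000",
]

def parse_beam_args(args):
    """Turn a Beam-style argv list into an options dict."""
    return dict(a.lstrip("-").split("=", 1) for a in args)
```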

@cisaacstern

Hi @echarles, thanks for chiming in here. This repo is a placeholder that we have not done much work on yet. Currently, we are interested in supporting Flink in addition to Dataflow, but not as a replacement for it. Some basic Flink configuration can be found in these tests, but we do not currently run any Flink in production; all of our production workloads are on Dataflow. If you're interested in participating in the conversation, we'd welcome you to join our recurring Pangeo Forge coordination call, which is listed on this calendar and also discussed here for any on-the-fly schedule adjustments.

@echarles

Thx @cisaacstern, I will join the next meeting on Monday, Jan 2nd.


cisaacstern commented Dec 30, 2022

Great, @echarles! Looking forward to it.


echarles commented Jan 2, 2023

Thx for the warm welcome at today's meeting. I understand things are evolving at the moment with the introduction of the new GCP Cloud Runner. My goal is to run the services on K8s and not depend on GCP. Is that already possible/documented? If not, what is missing to make it happen?
