
Add terraform to setup Dataflow on GCP #2

Open
batpad opened this issue Sep 7, 2022 · 7 comments
batpad commented Sep 7, 2022

We should add a gcp directory in the terraform folder to provision Dataflow on GCP, so that the setup of GCP bakeries can be managed within this repository.

From a chat with @yuvipanda: "it just needs to provision the service account for dataflow".
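For concreteness, a minimal sketch of what such a terraform/gcp directory might contain, assuming the standard Terraform Google provider (all names and the var.project_id variable are placeholders, not the actual configuration):

```hcl
# Hypothetical sketch: provision a service account for Dataflow jobs
# and grant it the worker role. Names are placeholders.
resource "google_service_account" "dataflow" {
  account_id   = "pangeo-forge-dataflow"
  display_name = "Pangeo Forge Dataflow runner"
}

resource "google_project_iam_member" "dataflow_worker" {
  project = var.project_id
  role    = "roles/dataflow.worker"
  member  = "serviceAccount:${google_service_account.dataflow.email}"
}
```

The exact roles needed would depend on how the bakery submits jobs and where it stages artifacts.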

@cisaacstern

This repo is awesome! Thanks for getting this started. I also really like the name.

Re: GCP Terraform: each time the orchestrator FastAPI backend is released, the release script runs the Terraform in pangeo-forge/dataflow-status-monitoring@github-app-hook to set up (or verify, if it already exists) the infrastructure required for sending job completion notifications back to us when Dataflow jobs either succeed or fail.

That dataflow-status-monitoring code is mounted in orchestrator as a submodule, and called from here. Note that orchestrator imports dataflow-status-monitoring into a few different terraform environments (here), so that releasing development instances of the app doesn't inadvertently break the production infrastructure.

@echarles

Now that the runner is using Flink (pangeo-forge/pangeo-forge-runner#21), is any external Beam cluster (Dataflow on GCP) still needed?

I am still trying to understand the architecture by reading https://flink.apache.org/ecosystem/2020/02/22/apache-beam-how-beam-runs-on-top-of-flink.html and https://beam.apache.org/documentation/runners/flink, and I wonder whether Beam is still in the picture or whether Flink alone is enough to handle the jobs.
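For context on how these fit together: in Beam's model, Flink and Dataflow are both runners for the same pipeline, selected purely via pipeline options. A small illustrative sketch (the runner names are real Beam runners; the project, bucket, and cluster values are placeholders, and the helper is hypothetical):

```python
# The same Beam pipeline code can target either backend; only the
# argv-style pipeline options change. Values below are placeholders.
dataflow_args = [
    "--runner=DataflowRunner",
    "--project=my-gcp-project",
    "--region=us-central1",
    "--temp_location=gs://my-bucket/tmp",
]

flink_args = [
    "--runner=FlinkRunner",
    "--flink_master=localhost:8081",  # assumes a reachable Flink cluster
]

def runner_of(args):
    """Return the value of the --runner flag from a Beam-style argv list."""
    opts = dict(a.lstrip("-").split("=", 1) for a in args)
    return opts["runner"]
```

So Beam stays in the picture either way; Flink replaces Dataflow as the execution backend, not Beam itself.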

@echarles

Well, I guess Dataflow is still needed; I am still trying to find where Flink is configured to use it.

Another question: is there any appetite to run Beam on Kubernetes and get rid of Dataflow, as described in https://python.plainenglish.io/apache-beam-flink-cluster-kubernetes-python-a1965f37b7cb?
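In that Kubernetes setup, the pipeline would be submitted through Beam's portable runner to a job server fronting a Flink cluster in the cluster. A hedged sketch of what the options might look like (all hostnames are hypothetical in-cluster service names, not anything deployed here):

```python
# Hypothetical portable-runner options for Beam-on-Flink-on-Kubernetes.
# The job server forwards the pipeline to the Flink cluster; workers run
# in an external environment pool. All endpoints are placeholders.
portable_args = [
    "--runner=PortableRunner",
    "--job_endpoint=beam-jobserver.flink.svc.cluster.local:8099",
    "--environment_type=EXTERNAL",
    "--environment_config=beam-worker-pool.flink.svc.cluster.local:50000",
]

def parse_beam_args(args):
    """Turn a Beam-style argv list into an options dict."""
    return dict(a.lstrip("-").split("=", 1) for a in args)
```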

@cisaacstern

Hi @echarles, thanks for chiming in here. This repo is a placeholder that we have not done much work on yet. Currently, we are interested in supporting Flink in addition to Dataflow, but not as a replacement for it. Some basic Flink configuration can be found in these tests, but we do not currently run any Flink in production; all of our production workloads are on Dataflow. If you're interested in participating in the conversation, we'd welcome you to join our recurring Pangeo Forge coordination call, which is listed on this calendar and also discussed here for any on-the-fly schedule adjustments.

@echarles

Thx @cisaacstern, I will join the next meeting on Monday, Jan 2nd.


cisaacstern commented Dec 30, 2022

Great, @echarles! Looking forward to it.


echarles commented Jan 2, 2023

Thx for the warm welcome at today's meeting. I understand things are evolving at the moment with the introduction of the new GCP Cloud Runner. My goal is to run the services on K8s and not depend on GCP. Is that already possible/documented? If not, what is missing to make it happen?
