
targets integration #155

Open
MarkEdmondson1234 opened this issue Dec 10, 2021 · 4 comments
Labels: enhancement (New feature or request)

@MarkEdmondson1234 (Owner)

Getting some feedback here ropensci/targets#720

GCP is already available via:

library(future)
library(targets)
library(googleComputeEngineR)

# launch a cluster of GCE VMs to act as future workers
vms <- gce_vm_cluster()

# create (rather than globally set) a cluster plan pointing at the VMs,
# so it can be handed to targets
plan <- future::tweak(cluster, workers = as.cluster(vms))
tar_resources_future(plan = plan)
...

But I think there is an opportunity to move this in a more serverless direction, as Cloud Build steps seem to map seamlessly to tar_target() calls, if a way of communicating between the steps can be found.

As an example, a googleCloudRunner equivalent of the targets minimal example would be:

library(googleCloudRunner)

bs <- c(
    cr_buildstep_gcloud("gsutil",
                        id = "raw_data_file",
                        args = c("cp",
                                 "gs://your-bucket/data/raw_data.csv",
                                 "/workspace/data/raw_data.csv")),
    # normally would not use readRDS()/saveRDS() across multiple steps,
    # but kept here for the sake of example
    cr_buildstep_r("read_csv('/workspace/data/raw_data.csv', col_types = cols()) %>% saveRDS('raw_data')",
                   id = "raw_data",
                   name = "verse"),
    cr_buildstep_r("readRDS('raw_data') %>% filter(!is.na(Ozone)) %>% saveRDS('data')",
                   id = "data",
                   name = "verse"),
    cr_buildstep_r("create_plot(readRDS('data')) %>% saveRDS('hist')",
                   id = "hist",
                   waitFor = "data", # waits only on 'data', so runs concurrently with 'fit'
                   name = "verse"),
    cr_buildstep_r("biglm(Ozone ~ Wind + Temp, readRDS('data'))",
                   waitFor = "data", # waits only on 'data', so runs concurrently with 'hist'
                   id = "fit",
                   name = "gcr.io/mydocker/biglm")
)
bs |> cr_build_yaml() 

Normally I would put all the R steps in one buildstep sourced from a file, but I have added the readRDS() %>% ... %>% saveRDS() calls to illustrate the functionality that I think targets could take care of.
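
For comparison, a minimal sketch of that single-buildstep version, assuming a hypothetical pipeline.R file holding all the R steps (cr_buildstep_r() accepts a file ending in .R as well as inline code):

# all pipeline logic lives in one (hypothetical) pipeline.R file
bs_single <- cr_buildstep_r("pipeline.R", name = "verse")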

The multi-step version produces this YAML object, which I think maps closely to targets:

==cloudRunnerYaml==
steps:
- name: gcr.io/google.com/cloudsdktool/cloud-sdk:alpine
  entrypoint: gsutil
  args:
  - cp
  - gs://your-bucket/data/raw_data.csv
  - /workspace/data/raw_data.csv
  id: raw_data_file
- name: rocker/verse
  args:
  - Rscript
  - -e
  - read_csv('/workspace/data/raw_data.csv', col_types = cols()) %>% saveRDS('raw_data')
  id: raw_data
- name: rocker/verse
  args:
  - Rscript
  - -e
  - readRDS('raw_data') %>% filter(!is.na(Ozone)) %>% saveRDS('data')
  id: data
- name: rocker/verse
  args:
  - Rscript
  - -e
  - create_plot(readRDS('data')) %>% saveRDS('hist')
  id: hist
  waitFor:
  - data
- name: gcr.io/mydocker/biglm
  args:
  - Rscript
  - -e
  - biglm(Ozone ~ Wind + Temp, readRDS('data'))
  id: fit
  waitFor:
  - data

(more build args here)

Run the build on GCP via bs |> cr_build_yaml() |> cr_build()

And/or each buildstep could be its own dedicated cr_build(), with the build's artifacts uploaded/downloaded after each run, for example:
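
A rough sketch of one step as its own dedicated build, using cr_build_yaml_artifact() to upload the step's output to a bucket (the bucket name and saveRDS() call are illustrative additions):

fit_build <- cr_build_yaml(
  steps = cr_buildstep_r("biglm(Ozone ~ Wind + Temp, readRDS('data')) %>% saveRDS('fit')",
                         id = "fit",
                         name = "gcr.io/mydocker/biglm"),
  # upload the step's 'fit' output as a build artifact
  artifacts = cr_build_yaml_artifact("fit", bucket = "your-bucket")
)
fit_build |> cr_build()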

This holds several advantages:

  • Each step can be executed in its own environment
  • Each step can use differing amounts of resources (e.g. a 32-core build step vs a 1-core one; see the sketch after this list)
  • Start-up and tear-down are handled automatically
  • Multiple languages could be used within a task step
  • Up to 24 hours of compute time per step
  • A default of 30 concurrent steps, with quotas up to 100, and an unlimited build queue
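
Since Cloud Build's machineType option applies per build, a resource-hungry step would run as its own dedicated build. A minimal sketch, assuming a hypothetical fit.R script (E2_HIGHCPU_32 is Cloud Build's 32-core machine type):

# give this one build 32 cores; other builds keep the 1-core default
cr_build_yaml(
  steps = cr_buildstep_r("fit.R", name = "verse"),
  options = list(machineType = "E2_HIGHCPU_32")
) |> cr_build()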

I see this as a tool better than Airflow for visualising DAGs and for taking care of the state management of whether each node needs to run, but with a lot of scale for building each step in a cloud environment.

@MarkEdmondson1234 (Owner Author) commented Dec 14, 2021

The function cr_build_targets() helps set up some boilerplate code to download the targets metadata from the specified GCS bucket, run the pipeline, and upload the artifacts back to the same bucket. It needs some tests to check it is respecting the right targets skips etc.

cr_build_targets(path = tempfile())

# adding custom environment args and secrets to the build
cr_build_targets(
  task_image = "gcr.io/my-project/my-targets-pipeline",
  options = list(env = c("ENV1=1234",
                         "ENV_USER=Dave")),
  availableSecrets = cr_build_yaml_secrets("MY_PW", "my-pw"),
  task_args = list(secretEnv = "MY_PW"))

Resulting in this build:

==cloudRunnerYaml==
steps:
- name: gcr.io/google.com/cloudsdktool/cloud-sdk:alpine
  entrypoint: bash
  args:
  - -c
  - gsutil -m cp -r ${_TARGET_BUCKET}/* /workspace/_targets || exit 0
  id: get previous _targets metadata
- name: ubuntu
  args:
  - bash
  - -c
  - ls -lR
  id: debug file list
- name: gcr.io/my-project/my-targets-pipeline
  args:
  - Rscript
  - -e
  - targets::tar_make()
  id: target pipeline
  secretEnv:
  - MY_PW
timeout: 3600s
options:
  env:
  - ENV1=1234
  - ENV_USER=Dave
substitutions:
  _TARGET_BUCKET: gs://mark-edmondson-public-files/googleCloudRunner/_targets
availableSecrets:
  secretManager:
  - versionName: projects/mark-edmondson-gde/secrets/my-pw/versions/latest
    env: MY_PW
artifacts:
  objects:
    location: gs://mark-edmondson-public-files/googleCloudRunner/_targets/meta
    paths:
    - /workspace/_targets/meta/**
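
A sketch of submitting the generated build, assuming cr_build_targets() writes a standard cloudbuild yaml to path (cr_build() accepts a cloudbuild yaml file location):

tmp <- tempfile(fileext = ".yaml")
cr_build_targets(path = tmp)
# submit the generated build to Cloud Build
cr_build(tmp)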

@MarkEdmondson1234 (Owner Author)

Tests are working now which confirm that a targets build can reuse a previous build's artifacts, and also reruns if the sources are updated: https://github.com/MarkEdmondson1234/googleCloudRunner/pull/159/files

@MarkEdmondson1234 (Owner Author)

Need two modes(?): one where all target files use the upcoming targets GCS integration, which will download artifacts as needed, and one where the data is loaded from other sources (files etc.) kept in a normal GCS bucket. A sketch of the first mode follows.
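
A minimal sketch of the first mode, using the (then-upcoming) targets GCS integration; the bucket name is illustrative:

# in _targets.R
library(targets)
tar_option_set(
  resources = tar_resources(
    gcs = tar_resources_gcs(bucket = "my-targets-bucket")
  ),
  # store target artifacts in GCS rather than the local _targets store
  repository = "gcs"
)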

MarkEdmondson1234 added a commit that referenced this issue Dec 19, 2021
@MarkEdmondson1234 (Owner Author) commented Dec 19, 2021

Added cr_buildstep_targets() to prepare for sending up individual build steps: cr_buildstep_targets_setup() downloads the meta folder, and cr_buildstep_targets_teardown() uploads the changed targets files to the bucket. A sketch of composing them follows.
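
A rough sketch of how these might compose into one build; the argument names below are assumptions for illustration, not the final API:

bs <- c(
  # download the previous _targets/meta folder from the bucket
  cr_buildstep_targets_setup(bucket = "gs://my-bucket/_targets"),
  # run the pipeline
  cr_buildstep_targets("targets::tar_make()"),
  # upload the changed targets files back to the bucket
  cr_buildstep_targets_teardown(bucket = "gs://my-bucket/_targets")
)
bs |> cr_build_yaml() |> cr_build()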
