
targets integration #155

Open
MarkEdmondson1234 opened this issue Dec 10, 2021 · 4 comments
Labels: enhancement (New feature or request)

@MarkEdmondson1234 (Owner)

Getting some feedback here ropensci/targets#720

GCP is already available via:

library(future)
library(targets)
library(googleComputeEngineR)

# launch a cluster of GCE VMs to act as future workers
vms <- gce_vm_cluster()

# create (rather than globally set) a cluster plan pointing at the VMs,
# so it can be handed to targets
plan <- future::tweak(cluster, workers = as.cluster(vms))
tar_resources_future(plan = plan)
...

But I think there is an opportunity to move this in a more serverless direction, as Cloud Build steps seem to map seamlessly to tar_target() calls, if a way of communicating between the steps can be found.

As an example, a googleCloudRunner equivalent of the targets minimal example would be:

library(googleCloudRunner)

bs <- c(
    cr_buildstep_gcloud("gsutil",
                        id = "raw_data_file",
                        args = c("cp",
                                 "gs://your-bucket/data/raw_data.csv",
                                 "/workspace/data/raw_data.csv")),
    # normally would not use readRDS()/saveRDS() across multiple steps,
    # but kept here for the sake of example
    cr_buildstep_r("read_csv('/workspace/data/raw_data.csv', col_types = cols()) %>% saveRDS('raw_data')",
                   id = "raw_data",
                   name = "verse"),
    cr_buildstep_r("readRDS('raw_data') %>% filter(!is.na(Ozone)) %>% saveRDS('data')",
                   id = "data",
                   name = "verse"),
    cr_buildstep_r("create_plot(readRDS('data')) %>% saveRDS('hist')",
                   id = "hist",
                   waitFor = "data", # waits only on 'data', so runs concurrently with 'fit'
                   name = "verse"),
    cr_buildstep_r("biglm(Ozone ~ Wind + Temp, readRDS('data'))",
                   waitFor = "data", # waits only on 'data', so runs concurrently with 'hist'
                   id = "fit",
                   name = "gcr.io/mydocker/biglm")
)
bs |> cr_build_yaml() 

Normally I would put all the R steps in one buildstep sourced from a file, but I have added the readRDS() %>% ... %>% saveRDS() calls to illustrate the functionality that I think targets could take care of.
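
For comparison, a minimal sketch of that single-buildstep version, assuming a hypothetical pipeline.R file holding all the R steps (cr_buildstep_r() accepts a file ending in .R as well as inline code):

# all pipeline logic lives in one (hypothetical) pipeline.R file
bs_single <- cr_buildstep_r("pipeline.R", name = "verse")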

The multi-step version produces this YAML object, which I think maps closely to targets:

==cloudRunnerYaml==
steps:
- name: gcr.io/google.com/cloudsdktool/cloud-sdk:alpine
  entrypoint: gsutil
  args:
  - cp
  - gs://your-bucket/data/raw_data.csv
  - /workspace/data/raw_data.csv
  id: raw_data_file
- name: rocker/verse
  args:
  - Rscript
  - -e
  - read_csv('/workspace/data/raw_data.csv', col_types = cols()) %>% saveRDS('raw_data')
  id: raw_data
- name: rocker/verse
  args:
  - Rscript
  - -e
  - readRDS('raw_data') %>% filter(!is.na(Ozone)) %>% saveRDS('data')
  id: data
- name: rocker/verse
  args:
  - Rscript
  - -e
  - create_plot(readRDS('data')) %>% saveRDS('hist')
  id: hist
  waitFor:
  - data
- name: gcr.io/mydocker/biglm
  args:
  - Rscript
  - -e
  - biglm(Ozone ~ Wind + Temp, readRDS('data'))
  id: fit
  waitFor:
  - data

(more build args here)

Run the build on GCP via bs |> cr_build_yaml() |> cr_build()

And/or each buildstep could be its own dedicated cr_build(), with the build's artifacts uploaded/downloaded after each run, for example:
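
A rough sketch of one step as its own dedicated build, using cr_build_yaml_artifact() to upload the step's output to a bucket (the bucket name and saveRDS() call are illustrative additions):

fit_build <- cr_build_yaml(
  steps = cr_buildstep_r("biglm(Ozone ~ Wind + Temp, readRDS('data')) %>% saveRDS('fit')",
                         id = "fit",
                         name = "gcr.io/mydocker/biglm"),
  # upload the step's 'fit' output as a build artifact
  artifacts = cr_build_yaml_artifact("fit", bucket = "your-bucket")
)
fit_build |> cr_build()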

This holds several advantages:

  • Each step can be executed in its own environment
  • Each step can use differing amounts of resources (e.g. a 32-core build step vs a 1-core one; see the sketch after this list)
  • Start-up and tear-down are handled automatically
  • Multiple languages could be used within a task step
  • Up to 24 hours of compute time per step
  • A default of 30 concurrent steps, with quotas up to 100, and an unlimited build queue
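
Since Cloud Build's machineType option applies per build, a resource-hungry step would run as its own dedicated build. A minimal sketch, assuming a hypothetical fit.R script (E2_HIGHCPU_32 is Cloud Build's 32-core machine type):

# give this one build 32 cores; other builds keep the 1-core default
cr_build_yaml(
  steps = cr_buildstep_r("fit.R", name = "verse"),
  options = list(machineType = "E2_HIGHCPU_32")
) |> cr_build()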

I see this as a tool better than Airflow for visualising DAGs and for taking care of the state management of whether each node needs to run, but with a lot of scale for building each step in a cloud environment.

@MarkEdmondson1234 (Owner Author) commented Dec 14, 2021

The function cr_build_targets() helps set up some boilerplate code to download the targets metadata from the specified GCS bucket, run the pipeline, and upload the artifacts back to the same bucket. It needs some tests to check it is respecting the right targets skips etc.

cr_build_targets(path = tempfile())

# adding custom environment args and secrets to the build
cr_build_targets(
  task_image = "gcr.io/my-project/my-targets-pipeline",
  options = list(env = c("ENV1=1234",
                         "ENV_USER=Dave")),
  availableSecrets = cr_build_yaml_secrets("MY_PW", "my-pw"),
  task_args = list(secretEnv = "MY_PW"))

Resulting in this build:

==cloudRunnerYaml==
steps:
- name: gcr.io/google.com/cloudsdktool/cloud-sdk:alpine
  entrypoint: bash
  args:
  - -c
  - gsutil -m cp -r ${_TARGET_BUCKET}/* /workspace/_targets || exit 0
  id: get previous _targets metadata
- name: ubuntu
  args:
  - bash
  - -c
  - ls -lR
  id: debug file list
- name: gcr.io/my-project/my-targets-pipeline
  args:
  - Rscript
  - -e
  - targets::tar_make()
  id: target pipeline
  secretEnv:
  - MY_PW
timeout: 3600s
options:
  env:
  - ENV1=1234
  - ENV_USER=Dave
substitutions:
  _TARGET_BUCKET: gs://mark-edmondson-public-files/googleCloudRunner/_targets
availableSecrets:
  secretManager:
  - versionName: projects/mark-edmondson-gde/secrets/my-pw/versions/latest
    env: MY_PW
artifacts:
  objects:
    location: gs://mark-edmondson-public-files/googleCloudRunner/_targets/meta
    paths:
    - /workspace/_targets/meta/**
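
A sketch of submitting the generated build, assuming cr_build_targets() writes a standard cloudbuild yaml to path (cr_build() accepts a cloudbuild yaml file location):

tmp <- tempfile(fileext = ".yaml")
cr_build_targets(path = tmp)
# submit the generated build to Cloud Build
cr_build(tmp)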

@MarkEdmondson1234 (Owner Author)

Tests are working now which confirm that a targets build can reuse a previous build's artifacts, and also reruns if the sources are updated: https://github.com/MarkEdmondson1234/googleCloudRunner/pull/159/files

@MarkEdmondson1234 (Owner Author)

Need two modes(?): one where all target files use the upcoming targets GCS integration, which will download artifacts as needed, and one where the data is loaded from other sources (files etc.) kept in a normal GCS bucket. A sketch of the first mode follows.
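
A minimal sketch of the first mode, using the (then-upcoming) targets GCS integration; the bucket name is illustrative:

# in _targets.R
library(targets)
tar_option_set(
  resources = tar_resources(
    gcs = tar_resources_gcs(bucket = "my-targets-bucket")
  ),
  # store target artifacts in GCS rather than the local _targets store
  repository = "gcs"
)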

MarkEdmondson1234 added a commit that referenced this issue Dec 19, 2021
@MarkEdmondson1234 (Owner Author) commented Dec 19, 2021

Added cr_buildstep_targets() to prepare for sending up individual build steps: cr_buildstep_targets_setup() downloads the meta folder, and cr_buildstep_targets_teardown() uploads the changed targets files to the bucket. A sketch of composing them follows.
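
A rough sketch of how these might compose into one build; the argument names below are assumptions for illustration, not the final API:

bs <- c(
  # download the previous _targets/meta folder from the bucket
  cr_buildstep_targets_setup(bucket = "gs://my-bucket/_targets"),
  # run the pipeline
  cr_buildstep_targets("targets::tar_make()"),
  # upload the changed targets files back to the bucket
  cr_buildstep_targets_teardown(bucket = "gs://my-bucket/_targets")
)
bs |> cr_build_yaml() |> cr_build()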
