
Add parallel option to gcs_copy_object() #130

Open
tkoncz opened this issue Oct 5, 2020 · 3 comments


@tkoncz

tkoncz commented Oct 5, 2020

Currently (as I understand it), gcs_copy_object() only supports sequential copying: you loop over a list of objects and call the copy on each one separately.
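For illustration, the sequential behaviour described above looks roughly like this (my own sketch, not code from the package; the bucket names are placeholders):

```r
# Illustrative only: each gcs_copy_object() call blocks until that
# object's copy completes, so N objects take N round trips in sequence.
library(googleCloudStorageR)

objects <- c("data/file1.csv", "data/file2.csv", "data/file3.csv")

lapply(objects, function(obj) {
  gcs_copy_object(
    source_object      = obj,
    destination_object = obj,
    source_bucket      = "my-source-bucket",       # example names
    destination_bucket = "my-destination-bucket"
  )
})
```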

However, the gsutil CLI allows a multi-threaded copy with the -m parameter:
https://cloud.google.com/storage/docs/gsutil/commands/cp#description

Do you think this could be implemented in the R package?

If so, let me know if you need help with this :)

Thanks,
Tamas

@MarkEdmondson1234
Collaborator

MarkEdmondson1234 commented Oct 6, 2020

Thanks for raising the issue. I was thinking of doing this via future, but that uses up CPU cores; since then I have found curl::curl_fetch_multi(), which is much better, but it means using curl instead of httr under the hood, which is a bit more complicated. I have an implementation for Cloud Run URLs that could be used as a basis though:

https://github.com/MarkEdmondson1234/googleCloudRunner/blob/b0db74e914b81737c57556216104f782f6313d4b/R/jwt-requests.R#L124-L154

cr_jwt_async <- function(urls, token, ...){

  # called by curl when a request errors at the transport level
  failure <- function(str){
    cat(paste("Failed request:", str), file = stderr())
  }

  # collect successful response bodies via <<- from the done callback
  results <- list()
  success <- function(x){
    if(x$status_code == 200){
      results <<- append(results, list(rawToChar(x$content)))
    } else {
      myMessage(x$status_code, "failure for request", x$url, level = 3)
    }
  }

  pool <- new_pool()

  # queue every URL on the pool; nothing is sent yet
  lapply(urls, function(x){
    myMessage("Calling asynch: ", x, level = 3)
    h <- new_handle(url = x, ...)
    h <- cr_jwt_with_curl(h = h, token = token)
    curl_fetch_multi(x,
                     done = success, fail = failure,
                     handle = h, pool = pool)
  })

  # perform all queued requests concurrently; blocks until done
  multi_run(pool = pool)

  results
}

If you want to take a look at it that would be very welcome :)

If done, it should be done in such a manner that all GCS function operations benefit, possibly even pulling it up to googleAuthR so all libraries have access to it.
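A rough sketch of how that pattern might be adapted for GCS copies, to make the idea concrete. Everything here is an assumption for illustration: gcs_copy_objects_async() and gcs_rewrite_url() are hypothetical names, not part of googleCloudStorageR, and a real implementation would need to go through googleAuthR's authentication rather than take a raw token. The endpoint follows the GCS JSON API objects rewrite pattern (b/SOURCE/o/OBJECT/rewriteTo/b/DEST/o/OBJECT).

```r
# Sketch only: hypothetical parallel copy built on curl's multi interface.
library(curl)

gcs_copy_objects_async <- function(source_objects, source_bucket,
                                   destination_bucket, token) {

  # hypothetical helper building the JSON API rewriteTo URL per object
  gcs_rewrite_url <- function(obj) {
    enc <- utils::URLencode(obj, reserved = TRUE)
    sprintf(
      "https://storage.googleapis.com/storage/v1/b/%s/o/%s/rewriteTo/b/%s/o/%s",
      source_bucket, enc, destination_bucket, enc
    )
  }

  results <- list()
  pool <- new_pool()

  # queue one POST per object on the shared pool
  lapply(source_objects, function(obj) {
    h <- new_handle(url = gcs_rewrite_url(obj), customrequest = "POST")
    handle_setheaders(h, Authorization = paste("Bearer", token))
    curl_fetch_multi(
      gcs_rewrite_url(obj),
      done = function(x) results[[obj]] <<- x$status_code,
      fail = function(msg) warning("Failed copy for ", obj, ": ", msg),
      handle = h, pool = pool
    )
  })

  multi_run(pool = pool)  # blocks until all copies finish
  results
}
```

The key design point, as in cr_jwt_async() above, is that all handles share one pool, so curl multiplexes the transfers on a single thread instead of spawning workers per object.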

@tkoncz
Author

tkoncz commented Oct 12, 2020

Thanks Mark! I'll take a look when I have the chance, and will let you know how it goes :)

@nturaga

nturaga commented Dec 16, 2022

Has this ever come to fruition? The multithreading option -m would be phenomenal.
