
gcs_get_object(object, parseObject = TRUE) returns raw object from .csv and will not parse into a data frame #184

Open
sarahhirsch opened this issue Aug 30, 2023 · 10 comments


@sarahhirsch

I have just finished authenticating and connecting to my project's bucket from a virtual machine (Debian) in the same project; the bucket holds several fairly large .csv files. I am trying to load these files into my R environment as data frames, but gcs_get_object(object, parseObject = TRUE) only ever returns a raw object.

Both gcs_parse_download(object, encoding = "UTF-8") called directly on the bucket object and content(object) called on the output of gcs_get_object(object, parseObject = TRUE) throw
Error in content(object) : is.response(x) is not TRUE.
(which is expected, at least in the latter case).

I have also tried gcs_parse_download(object, encoding = "ANSI") with the same results.
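
For concreteness, a minimal sketch of what I'm running (the bucket is already set as the global default; the object name is one of my files):

library(googleCloudStorageR)

obj <- gcs_get_object("PDC/pdc_2007_output.csv", parseObject = TRUE)
class(obj)  # "raw" rather than the data.frame I expected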

Do you know what might be happening here? Have I misunderstood one of these functions? How can I get my data into a data frame?

@MarkEdmondson1234
Collaborator

MarkEdmondson1234 commented Aug 31, 2023

The CSV parse function can be configured, and since CSV is not a strict standard, it's possible your file (which you say is big) has some issues. Have you tested on a smaller file first? Perhaps also upload a data.frame and then download it to an object as a check. Please also report your sessionInfo().
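
For example, a minimal round trip (bucket name is hypothetical; use any bucket you can write to):

library(googleCloudStorageR)
gcs_global_bucket("your-test-bucket")  # hypothetical test bucket

# upload a small data.frame, then fetch it back and parse it
gcs_upload(mtcars, name = "mtcars.csv")
df <- gcs_get_object("mtcars.csv", parseObject = TRUE)
str(df)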

@sarahhirsch
Author

sarahhirsch commented Sep 1, 2023

Here's my sessionInfo:

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)

Matrix products: default
BLAS/LAPACK: /opt/conda/lib/libopenblasp-r0.3.12.so

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.8         httr_1.4.6                googleCloudStorageR_0.7.0

loaded via a namespace (and not attached):
 [1] rstudioapi_0.13   magrittr_2.0.1    R6_2.5.0          rlang_1.1.1       fastmap_1.1.0     fansi_0.4.2       tools_4.0.3       utf8_1.2.1       
 [9] cli_3.6.1         googleAuthR_2.0.1 askpass_1.1       ellipsis_0.3.1    openssl_2.1.0     yaml_2.2.1        assertthat_0.2.1  digest_0.6.27    
[17] tibble_3.1.0      gargle_1.5.2      lifecycle_1.0.0   crayon_1.4.1      zip_2.3.0         vctrs_0.3.6       fs_1.5.0          curl_4.3         
[25] memoise_2.0.0     glue_1.4.2        cachem_1.0.4      compiler_4.0.3    pillar_1.5.1      jsonlite_1.7.2    pkgconfig_2.0.3

I tried

gcs_get_object(object, parseObject = TRUE, parseFunction = fread) # fread is what I typically use for .csv

and got the message:

Error: Problem parsing the object with supplied parseFunction.
✖ Downloading PDC/pdc_2007_output.csv ... failed

I haven't gone through all of your suggestions yet, just wanted to follow up on my start. Thank you!

@sarahhirsch
Author

Update:
The file I was trying was 50.1 GB, so I tried again with a 9 GB file (the smallest one I have) and got the same issues.

For context, I'm working on a large machine with 128 vCPUs and 864 GB of RAM.

When I try to do an upload of a very small data.frame:

v1 <- c(1:10)
v2 <- c(11:20)
v3 <- rep("j", 10)

data <- as.data.frame(cbind(v1,v2,v3))
gcs_upload(data)

I get the message

ℹ 2023-09-01 19:15:56 > File size detected as  190 bytes
ℹ 2023-09-01 19:15:56 > Request Status Code:  403
Error in value[[3L]](cond) : http_403 Access denied.

gcs_get_global_bucket() does return the bucket I'm looking for.

I've also tried creating a custom parse function as suggested, but that doesn't seem to be working either. Any thoughts?

@MarkEdmondson1234
Collaborator

The parse function needs an input/output argument; check the help for the exact syntax.
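
Something along these lines should work (a sketch only; fread_parser is a hypothetical name, and the function receives the httr response object and must return the parsed result):

fread_parser <- function(object) {
  # object is the httr response; write its raw body to a temp file for fread
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))
  writeBin(httr::content(object, as = "raw"), tmp)
  data.table::fread(tmp)
}

gcs_get_object("PDC/pdc_2007_output.csv", parseObject = TRUE,
               parseFunction = fread_parser)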

It looks like you don't have access to the bucket you're trying to reach anyhow (a 403 is usually an authentication/permissions issue).

I recommend getting the examples on the website working first; my guess is this is not a bug.

@sarahhirsch
Author

Thank you! I have had some access issues, so I'll check those out. Also, I did see the note about custom functions, but after the fact, so I'll play around with that.

@MarkEdmondson1234
Collaborator

I suspect a custom function isn't necessary if it's an access issue. Check whether your bucket uses "fine-grained" vs "bucket-level" IAM, and/or whether the role granted to the email you are authenticating with is sufficient.

@sarahhirsch
Author

Thank you for this. Would you mind explaining what effect fine-grained vs bucket-level IAM would have on my access and use of this package?

To explain on my end a bit: the original owner/creator of my VM is one of my organization's IT departments (I am an analyst). They have control of the original service account. They created one for me with wide permissions for this purpose, but I'm still getting the same error. I saw in the documentation that you mentioned needing to be an owner to properly authenticate. Would you mind going into more detail about this (both fine-grained vs bucket-level control and the ownership role, or whatever else might be relevant)? This might help us define exactly what service account and permissions I need in order to make this work.

Your package really is an excellent solution for limiting persistent storage usage, especially with large data, so I would really love to get this underway.

@MarkEdmondson1234
Collaborator

You need the Cloud Storage Admin role I think, not Owner. And if you're on a VM on GCP, you can reuse that VM's service account for authentication if it's configured, so perhaps the two are clashing.

When trying the examples, setting options(googleAuthR.verbose = 2) will give more auth info. If there are still issues, please use that and show the code you are using.
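
For example (service account email and bucket name are hypothetical):

options(googleAuthR.verbose = 2)  # more detailed auth/request logging
library(googleCloudStorageR)
gcs_auth(email = "your-sa@your-project.iam.gserviceaccount.com")
gcs_list_objects("your-bucket")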

With fine-grained buckets you can set a different access state on every file uploaded to the bucket. That was the original way buckets were used, but it's a hassle, so with bucket-level access you get the same permissions for all objects within. Confusing the two is a common issue.
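
One way to check from R (a sketch; it requests the full bucket resource, assuming your role is allowed to read it):

b <- gcs_get_bucket("your-bucket", projection = "full")
# TRUE here means uniform bucket-level access (bucket-level IAM) is on
b$iamConfiguration$uniformBucketLevelAccess$enabled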

@sarahhirsch
Author

Thanks! I'll check it out and report back.

@sarahhirsch
Author

sarahhirsch commented Sep 6, 2023

> gcs_auth(email = "gcsemail@project-bucket.iam.gserviceaccount.com")
ℹ 2023-09-06 00:07:11 > Setting client.id from options(googleAuthR.client_id)
Warning messages:
1: replacing previous import ‘lifecycle::last_warnings’ by ‘rlang::last_warnings’ when loading ‘tibble’ 
2: replacing previous import ‘ellipsis::check_dots_unnamed’ by ‘rlang::check_dots_unnamed’ when loading ‘tibble’ 
3: replacing previous import ‘ellipsis::check_dots_used’ by ‘rlang::check_dots_used’ when loading ‘tibble’ 
4: replacing previous import ‘ellipsis::check_dots_empty’ by ‘rlang::check_dots_empty’ when loading ‘tibble’ 
5: replacing previous import ‘lifecycle::last_warnings’ by ‘rlang::last_warnings’ when loading ‘pillar’ 
6: replacing previous import ‘ellipsis::check_dots_unnamed’ by ‘rlang::check_dots_unnamed’ when loading ‘pillar’ 
7: replacing previous import ‘ellipsis::check_dots_used’ by ‘rlang::check_dots_used’ when loading ‘pillar’ 
8: replacing previous import ‘ellipsis::check_dots_empty’ by ‘rlang::check_dots_empty’ when loading ‘pillar’ 
> objects <- gcs_list_objects(bucket = "bucket")
ℹ 2023-09-06 00:07:12 > Token exists.
ℹ 2023-09-06 00:07:12 > Request:  https://www.googleapis.com/storage/v1/b/bucket/o/?pageToken=&versions=FALSE
ℹ 2023-09-06 00:07:12 > No prefixes found
ℹ 2023-09-06 00:07:12 > No paging required
> proj <- "project-bucket"
> buckets <- gcs_list_buckets(proj)
ℹ 2023-09-06 00:07:12 > Token exists.
ℹ 2023-09-06 00:07:12 > Request:  https://www.googleapis.com/storage/v1/b/?project=project-bucket&prefix=&projection=noAcl
> bucket <- "bucket"
> bucket_info <- gcs_get_bucket(bucket)
ℹ 2023-09-06 00:07:12 > Token exists.
ℹ 2023-09-06 00:07:12 > Request:  https://www.googleapis.com/storage/v1/b/bucket?projection=noAcl
> gcs_global_bucket(bucket = "bucket")
Set default bucket name to 'bucket'
> l_pdc <- gcs_get_object(objects$name[[31]], parseObject = TRUE)
ℹ 2023-09-06 00:07:47 > Token exists.
ℹ 2023-09-06 00:07:47 > Request:  https://www.googleapis.com/storage/v1/b/bucket/o/IA%2Fia_output.csv?alt=media
ℹ 2023-09-06 00:07:47 > Request Status Code:  403
ℹ 2023-09-06 00:07:47 > Could not parase error content to JSON
ℹ 2023-09-06 00:07:47 > API error:  Unspecified error
ℹ 2023-09-06 00:07:47 > No retry attempted:  Unspecified error
ℹ 2023-09-06 00:07:47 > Custom error 403 Unspecified error
Error in `abort_http()`:
! http_403 Unspecified error
Run `rlang::last_trace()` to see where the error occurred.
✖ Downloading IA/ia_output.csv ... failed
> v1 <- c(1:10)
> v2 <- c(11:20)
> v3 <- rep("j", 10)
> 
> data <- as.data.frame(cbind(v1,v2,v3))
> gcs_upload(data)
ℹ 2023-09-06 00:11:13 > Set API cache
ℹ 2023-09-06 00:11:13 > File size detected as  190 bytes
ℹ 2023-09-06 00:11:13 > Simple upload
ℹ 2023-09-06 00:11:13 > Request:  https://www.googleapis.com/upload/storage/v1/b/nero-salomon1-bwood/o/?uploadType=media&name=data.csv&predefinedAcl=private
ℹ 2023-09-06 00:11:13 > Could not parse body JSON
ℹ 2023-09-06 00:11:13 > Request Status Code:  403
ℹ 2023-09-06 00:11:13 > API error:  Access denied.
ℹ 2023-09-06 00:11:13 > No retry attempted:  Access denied.
ℹ 2023-09-06 00:11:13 > Custom error 403 Access denied.
Error in value[[3L]](cond) : http_403 Access denied.
