Enable datalake storage #143

fBedecarrats opened this issue Mar 31, 2023 · 52 comments

@fBedecarrats
Collaborator

Background
The package now implements parallelization using a backend that works well with big-data environments. The main bottleneck when large numbers of assets (N) are processed is the way data is stored and accessed. Enabling storage and access through datalake APIs (S3, Azure Blob Storage) could further improve the package's performance.

Definition of done
The package allows referencing datalake storage (Azure or S3) in the same way as it allows referencing a location on the local file system.

Complexity
To be assessed. My first impression is that it would be low to medium.

@fBedecarrats fBedecarrats changed the title to Enable datalake storage Mar 31, 2023
@fBedecarrats
Collaborator Author

Back to this issue. Some benefits would be:

  • avoiding re-downloading resources for each project;
  • quicker read/write when processing in the cloud;
  • better parallel-processing performance, as data transfer between cores seems to be the current bottleneck for this approach.

As a possible first step, I suggest we organize a dedicated webex on this issue with the interested users/developers.

@goergen95
Member

An important distinction to make here is that what is proposed in this issue does not change the overall paradigm of the package ("download first, compute later") but simply aims at allowing users to select (different?) cloud-backend storage providers instead of the local file system. This is not the same as the evolving discussion about the use of cloud-native geospatial formats; that discussion could actually alter the paradigm of the package towards something like "query while computing".

@fBedecarrats
Collaborator Author

For storage services implementing the S3 API specification (AWS or MinIO), two R packages are available:
{aws.s3}, which is the one I use, and {paws}.
{aws.s3} has two functions that would come in particularly handy in our case: s3read_using() and s3write_using() (see documentation).

What I would imagine is the following:
Add an optional parameter like storage_type = to init_portfolio(), which could take values such as c("local_filesystem", "aws_s3", "azure_blob", "gcp_whatever"), with local_filesystem being the default.
Modify the download helpers and get_resources.R to use one of these functions (a rough sketch of the write side follows below).
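For illustration, a minimal sketch of what the write side could look like with s3write_using() (the helper name, bucket, and storage_type flag are hypothetical):

library(aws.s3)
library(sf)

# hypothetical helper: write a prepared vector resource either locally or to S3
write_resource <- function(obj, path, storage_type = "local_filesystem",
                           bucket = "my-mapme-bucket") {
  if (storage_type == "aws_s3") {
    # s3write_using() serializes via FUN(obj, tempfile) and uploads the result
    aws.s3::s3write_using(obj, FUN = sf::write_sf, object = path, bucket = bucket)
  } else {
    sf::write_sf(obj, path)
  }
}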

In calc_indicator.R, the following lines:

if (resource_type == "raster") {
      tindex <- read_sf(available_resources[resource_name], quiet = TRUE)
      out <- .read_raster_source(shp, tindex, rundir)
    } else if (resource_type == "vector") {
      out <- lapply(available_resources[[resource_name]], function(source) {
        tmp <- read_sf(source, wkt_filter = st_as_text(st_as_sfc(st_bbox(shp))))
        st_make_valid(tmp)
      })
      names(out) <- basename(available_resources[[resource_name]])
    } else {
      stop(sprintf("Resource type '%s' currently not supported", resource_type))
    }

would be modified with something like:

    if (resource_type == "raster") {
      if (storage_type == "aws_s3") {
        # s3read_using() fetches the object to a temp file and applies FUN to it
        tindex <- aws.s3::s3read_using(FUN = read_sf,
                                       object = available_resources[resource_name])
        # schematic only: the raster tiles listed in tindex would also need to be
        # read through an S3-aware path
        out <- .read_raster_source(shp, tindex, rundir)
      } else {
        tindex <- read_sf(available_resources[resource_name], quiet = TRUE)
        out <- .read_raster_source(shp, tindex, rundir)
      }
    }

What do you think @goergen95 ?
Just to clarify, this only takes advantage of the performance gains of reading cloud storage from a cloud computing environment.
Further enhancement could come from improving the spatial filtering when reading, so that every read focuses only on the area of interest.

@goergen95
Member

Here are some thoughts:

  1. Supporting each of the cloud infrastructures increases the dependencies of the package. From my point of view it is a very particular use case, so I would opt to make this optional for users who need it (i.e. moving the additional dependencies to Suggests and making sure that the required namespaces are available).
  2. Concerning data I/O, you could investigate how far the GDAL Virtual File System drivers could be used to avoid additional dependencies.
  3. I am against an additional argument in init_portfolio(). Internal code should decide whether to write to the local file system or a supported cloud storage based on the string supplied to outdir (see the sketch after this list).
  4. The code to support this should not be implemented directly in get_resources() or similar. In order to allow efficient maintenance and testing, we would need to see working back-end code for writing and reading raster and vector data for both the local file system and the cloud storage types. These methods should then be called in get_resources() and elsewhere.
  5. Why do you expect improvements in the read performance in the cloud? I see the benefit that you can store the data in a shared bucket (or whatever name the providers give these things now) but I expect it to be slower compared to storing the data on the machine where your R instance runs.
  6. We already apply spatial filters when reading in the resources for a specific polygon.
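To illustrate point 3, a minimal sketch of how the backend could be derived from the outdir string alone (the helper name and the recognized prefixes are hypothetical):

# hypothetical helper: infer the storage backend from outdir
.detect_storage_type <- function(outdir) {
  if (grepl("^(/vsis3/|s3://)", outdir)) {
    "s3"
  } else if (grepl("^(/vsiaz/|az://)", outdir)) {
    "azure"
  } else {
    "local"
  }
}

.detect_storage_type("/vsis3/my-bucket/mapme_biodiversity") # "s3"
.detect_storage_type("~/data/mapme")                        # "local"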

@fBedecarrats
Collaborator Author

The GDAL virtual file system seems indeed a great lead, although I get stuck at the point where I would expect the package to identify the resources. Below is a detailed example (not that reproducible if you don't have access to my platform; maybe that could be worked out).
I have a MinIO S3 bucket named "fbedecarrats". On it, there is a folder "mapme_biodiversity" with a subfolder "chirps" that contains all the global resources used by the mapme.biodiversity package for CHIRPS.

library(tidyverse)
library(aws.s3) # the package used to access the S3 API 

get_bucket_df("fbedecarrats", prefix = "mapme_biodiversity", region = "") %>%
  head(5) %>%
  pluck("Key")

# [1] "mapme_biodiversity/chirps/chirps-v2.0.1981.01.cog" "mapme_biodiversity/chirps/chirps-v2.0.1981.02.cog"
# [3] "mapme_biodiversity/chirps/chirps-v2.0.1981.03.cog" "mapme_biodiversity/chirps/chirps-v2.0.1981.04.cog"
# [5] "mapme_biodiversity/chirps/chirps-v2.0.1981.05.cog"

Using the GDAL Virtual File System driver for S3, access to files stored in S3 is straightforward: one just needs to specify the location in the S3 bucket as if it were on the local filesystem and add "/vsis3/" at the beginning. Nota bene: the credentials to access the S3 storage must be set (this is automatic on my cloud environment, but otherwise they need to be specified manually).

library(terra)
chirps1 <- rast("/vsis3/fbedecarrats/mapme_biodiversity/chirps/chirps-v2.0.1981.01.cog")
print(chirps1)
# class       : SpatRaster 
# dimensions  : 2000, 7200, 1  (nrow, ncol, nlyr)
# resolution  : 0.05, 0.05  (x, y)
# extent      : -180, 180, -50, 50  (xmin, xmax, ymin, ymax)
# coord. ref. : lon/lat WGS 84 (EPSG:4326) 
# source      : chirps-v2.0.1981.01.cog 
# name        : chirps-v2.0.1981.01 

The init_portfolio function seems to work at first sight.

library(sf)
library(mapme.biodiversity)
neiba <- system.file("extdata", "sierra_de_neiba_478140_2.gpkg", 
                     package = "mapme.biodiversity") %>%
  sf::read_sf()

pf <- init_portfolio(neiba, years = 2000:2020, 
                     outdir = "/vsis3/fbedecarrats/mapme_biodiversity")
str(pf)
# sf [1 × 6] (S3: sf/tbl_df/tbl/data.frame)
#  $ WDPAID   : num 478140
#  $ NAME     : chr "Sierra de Neiba"
#  $ DESIG_ENG: chr "National Park"
#  $ ISO3     : chr "DOM"
#  $ geom     :sfc_POLYGON of length 1; first list element: List of 4
#   ..$ : num [1:1607, 1:2] -71.8 -71.8 -71.8 -71.8 -71.8 ...
#   ..$ : num [1:5, 1:2] -71.4 -71.4 -71.4 -71.4 -71.4 ...
#   ..$ : num [1:4, 1:2] -71.5 -71.5 -71.5 -71.5 18.6 ...
#   ..$ : num [1:5, 1:2] -71.5 -71.5 -71.5 -71.5 -71.5 ...
#   ..- attr(*, "class")= chr [1:3] "XY" "POLYGON" "sfg"
#  $ assetid  : int 1
#  - attr(*, "sf_column")= chr "geom"
#  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA
#   ..- attr(*, "names")= chr [1:5] "WDPAID" "NAME" "DESIG_ENG" "ISO3" ...
#  - attr(*, "nitems")= int 1
#  - attr(*, "bbox")= 'bbox' Named num [1:4] -71.8 18.6 -71.3 18.7
#   ..- attr(*, "names")= chr [1:4] "xmin" "ymin" "xmax" "ymax"
#   ..- attr(*, "crs")=List of 2
#   .. ..$ input: chr "WGS 84"
#   .. ..$ wkt  : chr "GEOGCRS[\"WGS 84\",\n    DATUM[\"World Geodetic System 1984\",\n        ELLIPSOID[\"WGS 84\",6378137,298.257223"| __truncated__
#   .. ..- attr(*, "class")= chr "crs"
#  - attr(*, "resources")= list()
#  - attr(*, "years")= int [1:21] 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
#  - attr(*, "outdir")= chr "/vsis3/fbedecarrats/mapme_biodiversity"
#  - attr(*, "tmpdir")= chr "/tmp/RtmpXASngm"
#  - attr(*, "verbose")= logi TRUE
#  - attr(*, "testing")= logi FALSE

However, although all the COG files are present in the chirps subfolder, the resources are not recognized and the package attempts to download them again (which is not possible, as it cannot write to S3 with this protocol).

pf <- pf %>%
  get_resources("chirps")
# Starting process to download resource 'chirps'........
#   |                                                  | 0 % ~calculating  
# <simpleWarning in download.file(missing_urls[i], missing_filenames[i], quiet = TRUE,     mode = ifelse(Sys.info()["sysname"] == "Windows", "wb", "w")): URL https://data.chc.ucsb.edu/products/CHIRPS-2.0/global_monthly/cogs/chirps-v2.0.1981.01.cog: cannot open destfile '/vsis3/fbedecarrats/mapme_biodiversity/chirps/chirps-v2.0.1981.01.cog', reason 'No such file or directory'>
# Warning message:
# Download for resource chirps failed. Returning unmodified portfolio object.

pf <- pf %>%
   calc_indicators("precipitation_chirps",
                    engine = "exactextract",
                    scales_spi = 3,
                    spi_prev_years = 8)
# Error in .check_existing_resources(existing_resources, required_resources,  : 
#   The following required resource is not available: chirps.

The resources don't get recognized because they are indexed with the local path, e.g. "/home/onyxia/work/perturbations_androy/chirps/chirps-v2.0.1981.01.cog". I'll try to modify and replace the index.

# Read existing
tindex <- st_read("/vsis3/fbedecarrats/mapme_biodiversity/chirps/tileindex_chirps.gpkg")
# Correct path
tindex2 <- tindex %>%
  mutate(location = str_replace(location, 
                                "/home/onyxia/work/perturbations_androy/",
                                "/vsis3/fbedecarrats/mapme_biodiversity/"))
# write locally
st_write(tindex2, "tileindex_chirps.gpkg")
# replace object in S3
put_object(file = "tileindex_chirps.gpkg",
    object = "mapme_biodiversity/chirps/tileindex_chirps.gpkg",
    bucket = "fbedecarrats",
    region = "",
    multipart = TRUE)

After correcting the paths in the tile index, the resources are still not recognized.

pf <- init_portfolio(neiba, years = 2000:2020, 
                     outdir = "/vsis3/fbedecarrats/mapme_biodiversity")
# Starting process to download resource 'chirps'........
#   |                                                  | 0 % ~calculating  
# <simpleWarning in download.file(missing_urls[i], missing_filenames[i], quiet = TRUE,     mode = ifelse(Sys.info()["sysname"] == "Windows", "wb", "w")): URL https://data.chc.ucsb.edu/products/CHIRPS-2.0/global_monthly/cogs/chirps-v2.0.1981.01.cog: cannot open destfile '/vsis3/fbedecarrats/mapme_biodiversity/chirps/chirps-v2.0.1981.01.cog', reason 'No such file or directory'>
# Warning message:
# Download for resource chirps failed. Returning unmodified portfolio object. 
pf <- pf %>%
  get_resources("chirps")
# Error in .check_existing_resources(existing_resources, required_resources,  : 
#   The following required resource is not available: chirps.

I don't understand why the package does not identify that the resource is already present, as it would do on the local filesystem.

@goergen95
Member

We are indeed in need of a reproducible example. Please set one up using the minio docker image, covering first principles. We need to figure out how we can authenticate against the S3 server, and write and read vector and raster data using the driver, before we can expect the package to auto-magically handle S3.

@fBedecarrats
Collaborator Author

You are right, we need a reproducible example. Instead of creating an ad hoc S3 server that will need credentials anyway, I wonder whether it would not be simpler to generate tokens that enable another user to access my S3 bucket on the existing MinIO server from anywhere, or tokens that give access to a running pod (RStudio server) where all the environment parameters are pre-set to connect to the S3 bucket. I would need to communicate the tokens through a private channel though. What do you think @goergen95?

@goergen95
Member

I do not favor that option because it is not reproducible by anyone else. It is also not about the environment parameters, because this is something the code will have to take care of eventually, and I would actually like to see what is needed in terms of parameters in a reproducible example. Also, we will have to think about how to include tests for the new functionality in the package eventually. You might take some inspiration on how to set things up from this repository here.

@Jo-Schie
Member

Jo-Schie commented Jun 21, 2023

A question on the side: how great are the performance gains @fBedecarrats? Is it possible to make a benchmark? My intuition would tell me that local file storage is always superior, if "local" means in this context "locally in the cloud", wherever your R environment is installed. I would not expect read and write to be much faster with cloud-optimized storage, nor that to be the bottleneck for mass processing... but of course I might be wrong, so a benchmark would be really great.

If the performance gains for individual users are not much higher, then the value added of this feature would be to enable more collaboration across users and projects for a specific IT setup... We should discuss how far we want to support this, because there might be multiple solutions to that problem, and I would see it more on the side of IT architects to enable collaboration within a specific IT infrastructure given a tool/technology that exists... instead of the other way around (making your tool fit a multitude of environments/IT setups)...

Note: in this specific case a shared network drive within your environment might already solve the problem, and there would be no need for AWS or the like...

@fBedecarrats
Collaborator Author

> A question on the side: how great are the performance gains @fBedecarrats? Is it possible to make a benchmark? My intuition would tell me that local file storage is always superior, if "local" means in this context "locally in the cloud", wherever your R environment is installed. I would not expect read and write to be much faster with cloud-optimized storage, nor that to be the bottleneck for mass processing... but of course I might be wrong, so a benchmark would be really great.
>
> If the performance gains for individual users are not much higher, then the value added of this feature would be to enable more collaboration across users and projects for a specific IT setup... We should discuss how far we want to support this, because there might be multiple solutions to that problem, and I would see it more on the side of IT architects to enable collaboration within a specific IT infrastructure given a tool/technology that exists... instead of the other way around (making your tool fit a multitude of environments/IT setups)...
>
> Note: in this specific case a shared network drive within your environment might already solve the problem, and there would be no need for AWS or the like...

Yes, your comment echoes @goergen95's comment above:

> Why do you expect improvements in the read performance in the cloud? I see the benefit that you can store the data in a shared bucket (or whatever name the providers give these things now) but I expect it to be slower compared to storing the data on the machine where your R instance runs.

I am really not sure about this, but I thought it might help in the following sense:
When I set a parallel computing strategy, for instance with the {future} option plan(cluster), the current R process becomes one worker, and {future} creates additional workers. Apparently, reading and transferring data between workers becomes a bottleneck when I reach 10-12 workers. My hypothesis is that the initial process must transfer its data to the other workers and that this is slow.
My idea then is that if all workers read the data from a third-party source with very good performance (i.e. S3 in my case, or Azure Blob Storage in yours), even with concurrent access, then the bottleneck is removed and we would see significant performance improvements above 10-12 parallel workers, which is not the case currently. The contrast is sketched below.
But I don't clearly understand the parallelization process and it is guesswork at this stage. I think it is worthwhile to implement the S3 reading for the sake of data sharing among several analyses, and a performance improvement would be the cherry on the cake if it really works.
Does that make sense?
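For illustration only, a toy contrast between the two patterns (the S3 path is hypothetical and assumes the usual AWS_* environment variables are set):

library(future)
library(future.apply)
library(terra)
plan(multisession, workers = 4)

big_matrix <- matrix(runif(2e6), ncol = 4) # stand-in for a large in-memory resource

# (a) current pattern: big_matrix is a global, so it is serialized and shipped
#     from the main session to every worker
res_a <- future_lapply(1:4, function(i) mean(big_matrix[, i]))

# (b) hypothesized alternative: each worker only receives a path and reads the
#     data itself from shared object storage (hypothetical bucket/object)
res_b <- future_lapply(1:4, function(i) {
  r <- terra::rast("/vsis3/my-bucket/chirps/chirps-v2.0.1981.01.cog")
  terra::global(r, "mean", na.rm = TRUE)[[1]]
})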

@goergen95
Member

I agree in that we need to see some specific benchmark scripts to further discuss this issue.
Also, consider that we make some promises in the README:

> It supports computational efficient routines and heavy parallel computing in cloud-infrastructures such as AWS or AZURE using in the statistical programming language R.

So I think it is worthwhile to investigate how we can deliver on that promise by supporting different types of cloud storage "natively" in the package. I don't expect performance improvements right away but I also do not think it is a priority at this stage.

@Jo-Schie
Member

Jo-Schie commented Jun 21, 2023

It is definitely an interesting hypothesis by @fBedecarrats. Seen from this perspective, scalable read and write may solve a bottleneck at the point where adding extra CPUs no longer yields any significant gains. You may also read the Wikipedia article on Amdahl's law for starters.

> In computer architecture, Amdahl's law (or Amdahl's argument) is a formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved. It states that "the overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used".

And further below:

> Amdahl's law does represent the law of diminishing returns if one is considering what sort of return one gets by adding more processors to a machine, if one is running a fixed-size computation that will use all available processors to their capacity. Each new processor added to the system will add less usable power than the previous one. Each time one doubles the number of processors the speedup ratio will diminish, as the total throughput heads toward the limit of 1/(1 − p).
> This analysis neglects other potential bottlenecks such as memory bandwidth and I/O bandwidth. If these resources do not scale with the number of processors, then merely adding processors provides even lower returns.
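For reference, the formula behind the quoted passages, with p the parallelizable fraction of the workload and s the number of processors:

S(s) = \frac{1}{(1 - p) + \frac{p}{s}}, \qquad \lim_{s \to \infty} S(s) = \frac{1}{1 - p}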

This is quite similar to what we observed in @Ohm-Np's master's thesis. Maybe @Ohm-Np can link an online copy of his thesis here?

@goergen95
Member

Ok, can we agree that this issue here is about enabling (some) cloud storage types? I would suggest that we discuss parallelization strategies and improvements elsewhere and further down the line.

@fBedecarrats
Collaborator Author

> Ok, can we agree that this issue here is about enabling (some) cloud storage types? I would suggest that we discuss parallelization strategies and improvements elsewhere and further down the line.

Yes!

@fBedecarrats
Collaborator Author

> We are indeed in need of a reproducible example. Please set one up using the minio docker image, covering first principles. We need to figure out how we can authenticate against the S3 server, and write and read vector and raster data using the driver, before we can expect the package to auto-magically handle S3.

OK. After several attempts, it seems that I cannot set up Docker on the Linux pods I am using on Kubernetes. Nor can I do it on my work Windows PC. I need to find a machine on which I can launch Docker. I don't know when I will be able to achieve that.

@goergen95
Member

> Yes!

Great! Then I would suggest focusing on S3 and Azure Blob as a starting point, and maybe Google Cloud Storage later. For S3 it should be possible to use minio to set up a testing environment; I am not sure about Azure. Reading and writing geospatial data through GDAL should be easy. The main problem I see is that we cannot list already existing files on these systems without further dependencies. We thus need reprexes for both storage types that show how to read/write raster and vector data and list existing files.

@fBedecarrats
Collaborator Author

fBedecarrats commented Jun 22, 2023

I just found this simple way to launch MinIO (without Docker) on a Linux environment:

wget https://dl.min.io/server/minio/release/linux-amd64/archive/minio_20230619195250.0.0_amd64.deb -O minio.deb
sudo dpkg -i minio.deb
mkdir ~/minio
minio server ~/minio --console-address :9090

After this, the MinIO console is accessible locally on the indicated IP:port with the credentials printed in the terminal.
Similar setup procedures are available for macOS and Windows (it does not work with Windows 11, however, so I cannot test it locally).

@fBedecarrats
Collaborator Author

Here is a complete procedure to run MinIO and access it from R.

In one terminal, run:

# Install MinIO
wget https://dl.min.io/server/minio/release/linux-amd64/archive/minio_20230619195250.0.0_amd64.deb -O minio.deb
sudo dpkg -i minio.deb
mkdir ~/minio
minio server ~/minio --console-address :9090

The terminal will remain busy as long as MinIO is running. Open another terminal and run:

# Install MinioClient
wget https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc
sudo mv mc /usr/local/bin/mc

# Create an alias
mc alias set local http://127.0.0.1:9000 minioadmin minioadmin
mc admin info local

# Create a bucket
mc mb local/mapme

# Create a test file on the local filesystem
printf "blabla\nblibli\nbloblo" >> test.txt

# Send the test file to MinIO
mc cp test.txt local/mapme/test.txt

Now, we will use R to connect to the minio server that runs on the local machine:

library(aws.s3)
library(tidyverse)


# Set the environment variables that aws.s3 uses to connect to MinIO
Sys.setenv("AWS_ACCESS_KEY_ID" = "minioadmin",
           "AWS_SECRET_ACCESS_KEY" = "minioadmin",
           "AWS_DEFAULT_REGION" = "",
           "AWS_SESSION_TOKEN" = "",
           "AWS_S3_ENDPOINT"= "localhost:9000")

get_bucket("mapme", region = "", use_https = FALSE)
# Bucket: mapme 
# 
# $Contents
# Key:            test.txt 
# LastModified:   2023-06-23T11:21:09.747Z 
# ETag:           "cb7a754ec0d230b2a3e28ccb55957e6d" 
# Size (B):       20 
# Owner:          minio 
# Storage class:  STANDARD 

s3read_using(FUN = readLines,
             object = "test.txt",
             bucket = "mapme",
             opts = list("region" = "", "use_https" = "FALSE"))
# [1] "blabla" "blibli" "bloblo"
# Warning message:
#   In FUN(tmp, ...) :
#   incomplete final line found on '/tmp/RtmpCJttXA/file2504cdaa3d9.txt'

Can you please test that it works on your side @goergen95 ?

@fBedecarrats
Collaborator Author

I tried with some geographic data using the GDAL S3 driver, but for now it doesn't work with the local MinIO (although it works with the remote MinIO from SSP Cloud, see the example above). I think this is because the local server doesn't use HTTPS.

library(aws.s3)
library(tidyverse)


# Set the environment variables that aws.s3 uses to connect to MinIO
Sys.setenv("AWS_ACCESS_KEY_ID" = "minioadmin",
           "AWS_SECRET_ACCESS_KEY" = "minioadmin",
           "AWS_DEFAULT_REGION" = "",
           "AWS_SESSION_TOKEN" = "",
           "AWS_S3_ENDPOINT"= "localhost:9000")

get_bucket("mapme", region = "", use_https = FALSE)
# Bucket: mapme 
# 
# $Contents
# Key:            test.txt 
# LastModified:   2023-06-23T11:21:09.747Z 
# ETag:           "cb7a754ec0d230b2a3e28ccb55957e6d" 
# Size (B):       20 
# Owner:          minio 
# Storage class:  STANDARD 

s3read_using(FUN = readLines,
             object = "test.txt",
             bucket = "mapme",
             opts = list("region" = "", "use_https" = "FALSE"))
# [1] "blabla" "blibli" "bloblo"
# Warning message:
#   In FUN(tmp, ...) :
#   incomplete final line found on '/tmp/RtmpCJttXA/file2504cdaa3d9.txt'



library(mapme.biodiversity)
library(sf)
library(terra)

# create an AOI like in package documentation
aoi <- system.file("extdata", "sierra_de_neiba_478140.gpkg", 
                        package = "mapme.biodiversity") %>%
  read_sf() %>%
  st_cast("POLYGON")

aoi_gridded <- st_make_grid(
  x = st_bbox(aoi),
  n = c(10, 10),
  square = FALSE
) %>%
  st_intersection(aoi) %>%
  st_as_sf() %>%
  mutate(geom_type = st_geometry_type(x)) %>%
  filter(geom_type == "POLYGON") %>%
  select(-geom_type, geom = x) %>%
  st_as_sf()

# get some GFC resource
sample_portfolio <- init_portfolio(aoi_gridded, years = 2010,
  outdir = ".") %>%
  get_resources("gfw_treecover")

# Copy the GFC resource to minio
put_object(file = "gfw_treecover/Hansen_GFC-2021-v1.9_treecover2000_20N_080W.tif",
           object = "gfw_treecover/Hansen_GFC-2021-v1.9_treecover2000_20N_080W.tif",
           bucket = "mapme",
           region = "", 
           use_https = FALSE)

get_bucket("mapme", region = "", use_https = FALSE)
# Bucket: mapme 
# 
# $Contents
# Key:            gfw_treecover/Hansen_GFC-2021-v1.9_treecover2000_20N_080W.tif 
# LastModified:   2023-06-23T11:39:38.021Z 
# ETag:           "ff12537644a35a34f88483b88d51e1fe" 
# Size (B):       119150611 
# Owner:          minio 
# Storage class:  STANDARD 
# 
# $Contents
# Key:            test.txt 
# LastModified:   2023-06-23T11:21:09.747Z 
# ETag:           "cb7a754ec0d230b2a3e28ccb55957e6d" 
# Size (B):       20 
# Owner:          minio 
# Storage class:  STANDARD 


my_rast <- rast("/vsis3/mapme/gfw_treecover/Hansen_GFC-2021-v1.9_treecover2000_20N_080W.tif")
# Error: [rast] file does not exist: /vsis3/mapme/gfw_treecover/Hansen_GFC-2021-v1.9_treecover2000_20N_080W.tif
# In addition: Warning message:
#   In new_CppObject_xp(fields$.module, fields$.pointer, ...) :
#   GDAL Error 11: CURL error: Could not resolve host: mapme.localhost

@goergen95
Member

Hi!
I set up a Gist using Docker, consisting of an RStudio and a minio server. You might still be able to use the R script, with some adaptations, for your setup?

The results are the following:
If we assume that the environment variables are set up correctly, we can use the {aws.s3} package to write data to a bucket. The GDAL driver does not offer write capabilities for either raster or vector data. We can use both {aws.s3} and GDAL to read data from the minio bucket. Using GDAL, two environment variables have to be set for it to resolve the location of the bucket correctly.

Conclusion:
Even though we could use GDAL for reading data, it will fail if some env vars are not set up correctly. I thus opt for using {aws.s3} to set up the read/write methods required to support S3 storage types in the package.

@goergen95
Member

{aws.s3} currently does not rely on the AWS_DEFAULT_REGION and AWS_REGION env vars by default (as per cloudyr/aws.s3#371). We thus would either need custom code to better support both AWS and minio, or we should look into alternatives (e.g. arrow).

@fBedecarrats
Collaborator Author

> {aws.s3} currently does not rely on the AWS_DEFAULT_REGION and AWS_REGION env vars by default (as per cloudyr/aws.s3#371). We thus would either need custom code to better support both AWS and minio, or we should look into alternatives (e.g. arrow).

Yep. This region thing is the usual suspect whenever there is a problem. My understanding is that {aws.s3} sets us-east-1 as the default if you don't specify a region, so you need to pass region = "" in many situations (see the examples for s3read_using() or get_bucket() above).

@fBedecarrats
Collaborator Author

> Yep. This region thing is the usual suspect whenever there is a problem. My understanding is that {aws.s3} sets us-east-1 as the default if you don't specify a region, so you need to pass region = "" in many situations (see the examples for s3read_using() or get_bucket() above).

It's here: https://github.com/cloudyr/aws.s3/blob/master/R/get_location.R

@fBedecarrats
Collaborator Author

Now that we have a working reproducible workflow, and keeping in mind this question with the region, the next questions could be:

  1. how do we work on this? Shall we create a testing branch dedicated to this feature in the original repo, or shall each of us work on separate forks?
  2. what do you think would be the most efficient approach to enable S3 read/write while minimizing the implications for the existing functions?

Some ideas for 2:

  • if we do not want to add arguments to init_portfolio(), should it test whether outdir is a cloud storage service and store that as an attribute of the output portfolio object (e.g. attr(x, "cloud_storage") <- cloud_storage)?
  • if so, should the functions get_resources() and calc_indicators() have a variant for writing and reading that uses s3read_using() and s3write_using() if attr(x, "cloud_storage") == TRUE?

I'm mentioning these ideas, but I think they are not very satisfying as they would overload existing functions. Ideally, we should make this modular and have specific independent functions that handle the cloud-storage specifics, while minimizing the modifications of existing functions. What do you think?

@goergen95
Member

Regarding 1: creating a dedicated branch in this repo is the way to go in my view.
Regarding 2:

  • I would like users to specify something like outdir = "s3://<bucket-name>". We then assume that environment variables are set up correctly and the rest should be auto-magically handled
  • for this to work, we definitely need to modularize the read/write code out of get_resources() and calc_indicators()
  • I think the best way for get_resources() to work is to download data to a temporary directory and then either push the data to outdir on the local file system or to the respective cloud storage type (see the sketch after this list)
  • I am not sure about the best way to handle reading data within calc_indicators()
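As an illustration of the third point, a rough sketch of a modular write step (the helper name, prefix handling, and object key are hypothetical; aws.s3 is assumed to be configured through the usual AWS_* environment variables):

library(aws.s3)

# hypothetical helper: push a file from the temporary download directory either
# to a local outdir or to an S3 bucket given as "s3://<bucket-name>"
.push_resource <- function(local_file, outdir) {
  if (grepl("^s3://", outdir)) {
    bucket <- sub("^s3://", "", outdir)
    aws.s3::put_object(file = local_file,
                       object = file.path("resources", basename(local_file)),
                       bucket = bucket)
  } else {
    file.copy(local_file, file.path(outdir, basename(local_file)))
  }
}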

@fBedecarrats
Collaborator Author

Thanks. I won't be able to work on this today and tomorrow, but I think that I will be able to dedicate some time on Wednesday and Thursday.

@goergen95
Member

Just another note: {aws.s3} seems to be no longer actively maintained (last release on 2020-04-07). I think it is high-quality, but I would not like to add an unmaintained dependency to the package. I think it would be worth investigating some alternatives.

@fBedecarrats
Collaborator Author

{paws} is actively maintained, but maybe less mature... https://github.com/paws-r/paws

@fBedecarrats
Collaborator Author

The main drawback I see is that, although {paws} has many functions to interact with S3, it lacks equivalents to aws.s3::s3read_using() and aws.s3::s3write_using(). These functions are simple (see code here) as they mostly rely on aws.s3::put_object() and aws.s3::save_object(), for which we have equivalents in {paws} (paws::put_object() and paws::get_object() respectively).
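For what it is worth, a rough, untested sketch of how an s3read_using()-style helper could be emulated with {paws.storage} (client configuration is assumed to come from the usual AWS_* environment variables; the bucket and key below are placeholders):

library(paws.storage)

# hypothetical helper mimicking aws.s3::s3read_using()
s3_read_using <- function(FUN, object, bucket, ...) {
  s3 <- paws.storage::s3()
  resp <- s3$get_object(Bucket = bucket, Key = object)
  tmp <- tempfile(fileext = paste0(".", tools::file_ext(object)))
  writeBin(resp$Body, tmp) # Body is returned as a raw vector
  on.exit(unlink(tmp))
  FUN(tmp, ...)
}

# e.g. s3_read_using(sf::read_sf, "chirps/tileindex_chirps.gpkg", "mapme")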

@goergen95
Member

{paws} seems to be quite heavy on dependencies... maybe it's a valid alternative if we can rely on just {paws.storage}.

@fBedecarrats
Collaborator Author

> {paws} seems to be quite heavy on dependencies... maybe it's a valid alternative if we can rely on just {paws.storage}.

Apparently all required functions are in {paws.storage}. Note however that {paws.storage} imports {paws.common}. @goergen95, can you please confirm whether {paws.storage} would be a valid alternative considering its dependencies, before I start investing time in understanding its functionality and seeing how it could be used for our purpose?

@goergen95
Member

No. Currently I am providing feedback on things you might consider when implementing the feature. I don't know the best way forward, and enabling S3 data storage is currently not on my priority list.

@fBedecarrats
Collaborator Author

Hmmm. Does that mean that I might spend several days of work developing a functional pull request that would finally be refused because it adds too many dependencies to the package?

@goergen95
Member

I think it is fine to continue with {aws.s3}. There are several reverse dependencies on CRAN and my guess is that the CRAN core maintainers will keep it functional on CRAN for that reason. We just cannot expect functional improvements in the future, reactions to issues, etc., and that is something to consider before taking blind action. I currently do not see a better alternative (e.g. paws or arrow), but that is just a first impression. Concerning the merging of a potential PR, to which I will happily give feedback, I am fine with moving additional dependencies to Suggests so that users who need them are able to use them. The fewer additional dependencies the better, and I think aws.s3 is quite lightweight on dependencies.

@fBedecarrats
Collaborator Author

> I think it is fine to continue with {aws.s3}. There are several reverse dependencies on CRAN and my guess is that the CRAN core maintainers will keep it functional on CRAN for that reason. We just cannot expect functional improvements in the future, reactions to issues, etc., and that is something to consider before taking blind action. I currently do not see a better alternative (e.g. paws or arrow), but that is just a first impression. Concerning the merging of a potential PR, to which I will happily give feedback, I am fine with moving additional dependencies to Suggests so that users who need them are able to use them. The fewer additional dependencies the better, and I think aws.s3 is quite lightweight on dependencies.

Thank you for the clarification! I will move forward with {aws.s3} and move the additional dependencies to Suggests then.

@goergen95
Member

{minioclient} will most probably hit CRAN soon. Could you please have a look too at whether it fits our purposes? I will try to do the same within the week.
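For reference, an untested sketch of what listing and copying with {minioclient} might look like (mc() wraps the MinIO mc command-line client; the alias and bucket names follow the MinIO example above):

library(minioclient)
install_mc() # downloads the mc binary on first use

# register the local MinIO server under the alias "local"
mc("alias set local http://localhost:9000 minioadmin minioadmin")

mc_ls("local/mapme")                       # list existing files in the bucket
mc_cp("test.txt", "local/mapme/test.txt")  # copy a local file into the bucket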

@fBedecarrats
Collaborator Author

Nice!!! I will check that.

@fBedecarrats
Collaborator Author

It seems that when it comes to reading an object from S3, these three packages (paws, aws.s3 and minioclient) always copy the whole object to the local filesystem. This does not seem efficient, and I think that /vsis3 remains the best option for reading at the calc_indicators() stage.
These packages could however be used at the get_resources() stage to write the acquired and prepared resources.

@goergen95
Member

goergen95 commented Jun 29, 2023

At least using terra >= v1.7.39 we can directly read and write raster/vector to an S3 location (and probably other GDAL-supported virtual file systems). See the latest version of the Gist.

@fBedecarrats
Collaborator Author

> At least using terra >= v1.7.39 we can directly read and write raster/vector to an S3 location (and probably other GDAL-supported virtual file systems). See the latest version of the Gist.

Wow, this looks great! Do you know if something similar is in sight for sf for vector data? S3 also refers to one of R's object-oriented systems, so when I google for sf + S3, all I find are false positives (e.g. https://r-spatial.github.io/sf/reference/Ops.html).

@goergen95
Member

sf and stars both support read/write from virtual file systems. Take a look at the Gist where I show the usage for each package.
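To summarize what that means in practice, a condensed, untested sketch (paths are placeholders following the MinIO example above, and the usual AWS_* environment variables are assumed to be set):

library(terra)
library(sf)

# raster: read from and write to an S3 location via GDAL's /vsis3/ driver
r <- rast("/vsis3/mapme/gfw_treecover/Hansen_GFC-2021-v1.9_treecover2000_20N_080W.tif")
writeRaster(r, "/vsis3/mapme/copies/treecover_copy.tif", overwrite = TRUE)

# vector: the same pattern works with sf
v <- read_sf("/vsis3/mapme/chirps/tileindex_chirps.gpkg")
write_sf(v, "/vsis3/mapme/copies/tileindex_copy.gpkg")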

@cboettig

Hi @goergen95 @fBedecarrats ,

Apologies for jumping in at random; I just came across your package from that aws.s3 thread (mentioning minioclient) and got excited about what you're building here! Amazing stuff.

I've also been diving into the GDAL virtual filesystem -- just a note that while sf::st_read() will take vector sources with VSI prefixes, I believe you actually end up reading the entire asset into memory if you just do vector_sf <- read_sf("/vsis3/my-bucket/vector_sf.gpkg"). In order to take advantage of a range request to download only a subset of the polygons, say, you will want to pass some well-known text to the wkt_filter argument of st_read(), or pass a vector object to the filter argument of terra::vect().
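A minimal illustration of the two approaches (bucket and layer names are hypothetical):

library(sf)
library(terra)

# area of interest used to restrict the read
aoi <- st_as_sfc(st_bbox(c(xmin = -71.8, ymin = 18.5, xmax = -71.3, ymax = 18.8),
                         crs = st_crs(4326)))

# sf: only features intersecting the WKT geometry are fetched
subset_sf <- read_sf("/vsis3/my-bucket/vector_sf.gpkg",
                     wkt_filter = st_as_text(aoi))

# terra: pass a SpatVector (or SpatExtent) to the filter argument
subset_vect <- vect("/vsis3/my-bucket/vector_sf.gpkg", filter = vect(aoi))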

Some quick examples here: https://schmidtdse.github.io/biodiversity-catalog/examples/cloud-basics.html

Hope you don't mind me hopping in here. I'm eager to learn more about your project and find ways to contribute. Kind regards,

Carl

@goergen95
Member

Hi Carl,

thanks for chiming in and congrats on the publication of minioclient on CRAN! 🎉

The issues you raised recently with the R spatial packages have helped a lot in figuring out the usage of GDAL's virtual file system drivers. We are trying to extend mapme.biodiversity to work seamlessly in cloud environments. The current implementation assumes data to be located on the local file system. I have the hope that using GDAL's VSI capabilities will actually bring us a long way. For example, we already apply a WKT filter for vector data on the local file system:

tmp <- read_sf(source, wkt_filter = st_as_text(st_as_sfc(st_bbox(shp))))

For raster data, we have come up with a solution using VRT if the raster is provided in tiles covering the globe and two or more tiles intersect with a given polygon:

unique_bboxes <- unique(unlist(all_bboxes))
layer_index <- which(all_bboxes == unique_bboxes[[1]])
temporal_gap <- layer_index[2] - layer_index[1] - 1
out <- lapply(layer_index, function(j) {
  # collect the tiles belonging to the same timestep and combine them in a VRT
  target_files <- tindex$location[j:(j + temporal_gap)]
  org_filename <- basename(target_files[1])
  filename <- tools::file_path_sans_ext(org_filename)
  vrt_name <- file.path(rundir, sprintf("vrt_%s.vrt", filename))
  tmp <- terra::vrt(target_files, filename = vrt_name)
  names(tmp) <- org_filename
  tmp
})

I think most of this should work with minor modifications using /vsixx strings, but any advice from your side would be very much appreciated.

As mentioned earlier in this thread, there still remains the issue of listing existing files in cloud buckets and I think that minioclient seems like an excellent choice to provide this functionality for both S3 and GCS.

Best,
Darius.

@cboettig

Hi Darius,

Very cool!

For raster tiles, I'm curious if you've tried out gdalcubes (see paper, github)? We've found it very helpful for doing highly parallel + lazy operations on large collections of rasters; especially when metadata is available from a STAC catalog.

@fBedecarrats
Collaborator Author

> Some quick examples here: https://schmidtdse.github.io/biodiversity-catalog/examples/cloud-basics.html

Excellent, thank you!

@fBedecarrats
Collaborator Author

For some unknown reason, reading via "/vsis3/..." paths doesn't work when I use the minio testing server set up on my local machine. However, it works with the full MinIO instance hosted on SSP Cloud. Maybe something to do with HTTP vs. HTTPS?
See: https://github.com/mapme-initiative/mapme.biodiversity/blob/s3_testing/trials/testing_s3.R
I will be traveling with very limited access to the Internet for 10 days. I'll follow up on this as soon as possible. Thanks @goergen95 and @cboettig for your guidance above!

@goergen95
Member

Could you try setting AWS_HTTPS = "FALSE"?

@fBedecarrats
Collaborator Author

> Could you try setting AWS_HTTPS = "FALSE"?

Alas, I tried adding "AWS_HTTPS" = "FALSE" to the Sys.setenv() call, but it didn't work...

@cboettig

cboettig commented Jul 3, 2023

@fBedecarrats I believe you also need to set

Sys.setenv("AWS_VIRTUAL_HOSTING"="FALSE")

and of course "AWS_S3_ENDPOINT" which you probably did already (e.g. see these notes). Also you might try making the bucket public read (download) and confirm that vsicurl mechanism works?

@fBedecarrats
Collaborator Author

> @fBedecarrats I believe you also need to set
>
> Sys.setenv("AWS_VIRTUAL_HOSTING"="FALSE")

Yesssss! It works with the following parameters:

Sys.setenv("AWS_ACCESS_KEY_ID" = "minioadmin",
           "AWS_SECRET_ACCESS_KEY" = "minioadmin",
           "AWS_S3_ENDPOINT"= "localhost:9000",
           "AWS_SESSION_TOKEN" = "",
           "AWS_HTTPS" = "FALSE",
           "AWS_VIRTUAL_HOSTING"="FALSE")

@fBedecarrats
Collaborator Author

Hi! Everything seems to work and I am preparing a PR to include these features in the package (branch s3_storage). I was not able to finish before leaving on holiday, so I'll get back to it mid-August. Have a nice summer!

@goergen95
Member

Hi all, please see this branch where I am testing the use of GDAL drivers to support both remote files and cloud storage solutions. It is still very much WIP, but the main ideas are already there.
