Enable datalake storage #143
Back to this issue. Some benefits would be:
Possible first steps could be:
I suggest we organize a specific webex focusing on this issue with the interested users/developers. |
An important distinction to realize here is that what is proposed in this issue does not change the overall paradigm of the package ("download first, computation later") but simply aims at allowing users to select (different?) cloud-backed storage providers instead of the local file directory. This is not the same as the evolving discussion about the usage of cloud-native geospatial formats. That discussion could actually alter the paradigm of the package to something like "query while computing". |
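To illustrate what "query while computing" with cloud-native formats could look like (not what this issue proposes): with a COG behind GDAL's /vsicurl/ driver, only the byte ranges needed for a spatial subset are requested. A minimal sketch, with a purely hypothetical URL:

```r
library(terra)

# Hypothetical COG served over HTTP; /vsicurl/ lets GDAL issue range requests
cog_url <- "/vsicurl/https://example.org/data/chirps-v2.0.1981.01.cog"
r <- rast(cog_url)

# Cropping to a small window only downloads the overlapping blocks,
# not the full file -- this is the "query while computing" pattern
aoi <- ext(-72, -71, 18, 19)
r_sub <- crop(r, aoi)
```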
For packages using the S3 API specification (AWS or MinIO), two R packages are available: {aws.s3} and {paws}. What I would imagine is the following. In the package, the current block:

```r
if (resource_type == "raster") {
  tindex <- read_sf(available_resources[resource_name], quiet = TRUE)
  out <- .read_raster_source(shp, tindex, rundir)
} else if (resource_type == "vector") {
  out <- lapply(available_resources[[resource_name]], function(source) {
    tmp <- read_sf(source, wkt_filter = st_as_text(st_as_sfc(st_bbox(shp))))
    st_make_valid(tmp)
  })
  names(out) <- basename(available_resources[[resource_name]])
} else {
  stop(sprintf("Resource type '%s' currently not supported", resource_type))
}
```

would be modified with something like:

```r
if (resource_type == "raster") {
  if (storage_type == "aws_s3") {
    tindex <- aws.s3::s3read_using(FUN = read_sf,
                                   object = available_resources[resource_name])
    out <- aws.s3::s3read_using(FUN = .read_raster_source,
                                object = available_resources[resource_name])
  } else {
    tindex <- read_sf(available_resources[resource_name], quiet = TRUE)
    out <- .read_raster_source(shp, tindex, rundir)
  }
}
```

What do you think @goergen95 ? |
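For reference, a minimal sketch of how aws.s3::s3read_using() is typically called (bucket and object names here are hypothetical). Note that it downloads the object to a temporary local file and then applies FUN to that local copy:

```r
library(aws.s3)
library(sf)

# Hypothetical bucket and object; s3read_using() fetches the object to a
# temporary local file and then calls FUN on that file path
tindex <- s3read_using(
  FUN    = read_sf,
  object = "chirps/tileindex_chirps.gpkg",
  bucket = "mapme_biodiversity",
  opts   = list(region = "")
)
```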
Here are some thoughts:
|
The GDAL virtual file system seems indeed a great lead, although I'm stuck at the point where I would expect the package to identify the resources. Below is a detailed example (although not that reproducible if you don't have access to my platform; maybe that could be worked out).

```r
library(tidyverse)
library(aws.s3) # the package used to access the S3 API
get_bucket_df("fbedecarrats", prefix = "mapme_biodiversity", region = "") %>%
head(5) %>%
pluck("Key")
# [1] "mapme_biodiversity/chirps/chirps-v2.0.1981.01.cog" "mapme_biodiversity/chirps/chirps-v2.0.1981.02.cog"
# [3] "mapme_biodiversity/chirps/chirps-v2.0.1981.03.cog" "mapme_biodiversity/chirps/chirps-v2.0.1981.04.cog"
# [5] "mapme_biodiversity/chirps/chirps-v2.0.1981.05.cog" Using the GDAL Virtual File System driver for S3, the access to files stores in S3 is straightforward: one just need to specify the location on the S3 bucket like if it was on the local filesystem and add "/vsis3/" at the beginning. Nota bene: the credentials to access the S3 storage must be set (it is automatic on my cloud environment, but otherwise it needs to be specified manually). library(terra)
chirps1 <- rast("/vsis3/fbedecarrats/mapme_biodiversity/chirps/chirps-v2.0.1981.01.cog")
print(chirps1)
# class : SpatRaster
# dimensions : 2000, 7200, 1 (nrow, ncol, nlyr)
# resolution : 0.05, 0.05 (x, y)
# extent : -180, 180, -50, 50 (xmin, xmax, ymin, ymax)
# coord. ref. : lon/lat WGS 84 (EPSG:4326)
# source : chirps-v2.0.1981.01.cog
# name : chirps-v2.0.1981.01
```

The init_portfolio() function seems to work at first sight.

```r
library(sf)
library(mapme.biodiversity)
neiba <- system.file("extdata", "sierra_de_neiba_478140_2.gpkg",
package = "mapme.biodiversity") %>%
sf::read_sf()
pf <- init_portfolio(neiba, years = 2000:2020,
outdir = "/vsis3/fbedecarrats/mapme_biodiversity")
str(pf)
# sf [1 × 6] (S3: sf/tbl_df/tbl/data.frame)
# $ WDPAID : num 478140
# $ NAME : chr "Sierra de Neiba"
# $ DESIG_ENG: chr "National Park"
# $ ISO3 : chr "DOM"
# $ geom :sfc_POLYGON of length 1; first list element: List of 4
# ..$ : num [1:1607, 1:2] -71.8 -71.8 -71.8 -71.8 -71.8 ...
# ..$ : num [1:5, 1:2] -71.4 -71.4 -71.4 -71.4 -71.4 ...
# ..$ : num [1:4, 1:2] -71.5 -71.5 -71.5 -71.5 18.6 ...
# ..$ : num [1:5, 1:2] -71.5 -71.5 -71.5 -71.5 -71.5 ...
# ..- attr(*, "class")= chr [1:3] "XY" "POLYGON" "sfg"
# $ assetid : int 1
# - attr(*, "sf_column")= chr "geom"
# - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA
# ..- attr(*, "names")= chr [1:5] "WDPAID" "NAME" "DESIG_ENG" "ISO3" ...
# - attr(*, "nitems")= int 1
# - attr(*, "bbox")= 'bbox' Named num [1:4] -71.8 18.6 -71.3 18.7
# ..- attr(*, "names")= chr [1:4] "xmin" "ymin" "xmax" "ymax"
# ..- attr(*, "crs")=List of 2
# .. ..$ input: chr "WGS 84"
# .. ..$ wkt : chr "GEOGCRS[\"WGS 84\",\n DATUM[\"World Geodetic System 1984\",\n ELLIPSOID[\"WGS 84\",6378137,298.257223"| __truncated__
# .. ..- attr(*, "class")= chr "crs"
# - attr(*, "resources")= list()
# - attr(*, "years")= int [1:21] 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
# - attr(*, "outdir")= chr "/vsis3/fbedecarrats/mapme_biodiversity"
# - attr(*, "tmpdir")= chr "/tmp/RtmpXASngm"
# - attr(*, "verbose")= logi TRUE
# - attr(*, "testing")= logi FALSE However, although all the cog files are present in the chirps subfolder, the resource were not recongized and the package attemps to download them again (which is not possible, as it cannot write on S3 with this protocol). pf <- pf %>%
get_resources("chirps")
# Starting process to download resource 'chirps'........
# | | 0 % ~calculating
# <simpleWarning in download.file(missing_urls[i], missing_filenames[i], quiet = TRUE, mode = ifelse(Sys.info()["sysname"] == "Windows", "wb", "w")): URL https://data.chc.ucsb.edu/products/CHIRPS-2.0/global_monthly/cogs/chirps-v2.0.1981.01.cog: cannot open destfile '/vsis3/fbedecarrats/mapme_biodiversity/chirps/chirps-v2.0.1981.01.cog', reason 'No such file or directory'>
# Warning message:
# Download for resource chirps failed. Returning unmodified portfolio object.
pf <- pf %>%
calc_indicators("precipitation_chirps",
engine = "exactextract",
scales_spi = 3,
spi_prev_years = 8)
# Error in .check_existing_resources(existing_resources, required_resources, :
# The following required resource is not available: chirps.
```

The resources don't get recognized because they are indexed with the local path, e.g. "/home/onyxia/work/perturbations_androy/chirps/chirps-v2.0.1981.01.cog". I'll try to modify and replace the tile index.

```r
# Read existing tile index
tindex <- st_read("/vsis3/fbedecarrats/mapme_biodiversity/chirps/tileindex_chirps.gpkg")
# Correct path
tindex2 <- tindex %>%
mutate(location = str_replace(location,
"/home/onyxia/work/perturbations_androy/",
"/vsis3/fbedecarrats/mapme_biodiversity/"))
# write locally
st_write(tindex2, "tileindex_chirps.gpkg")
# replace object in S3
put_object(file = "tileindex_chirps.gpkg",
object = "mapme_biodiversity/chirps/tileindex_chirps.gpkg",
bucket = "fbedecarrats",
region = "",
multipart = TRUE)
```

After correcting the paths in the tile index, the presence of the resources is still not recognized.

```r
pf <- init_portfolio(neiba, years = 2000:2020,
outdir = "/vsis3/fbedecarrats/mapme_biodiversity")
# Starting process to download resource 'chirps'........
# | | 0 % ~calculating
# <simpleWarning in download.file(missing_urls[i], missing_filenames[i], quiet = TRUE, mode = ifelse(Sys.info()["sysname"] == "Windows", "wb", "w")): URL https://data.chc.ucsb.edu/products/CHIRPS-2.0/global_monthly/cogs/chirps-v2.0.1981.01.cog: cannot open destfile '/vsis3/fbedecarrats/mapme_biodiversity/chirps/chirps-v2.0.1981.01.cog', reason 'No such file or directory'>
# Warning message:
# Download for resource chirps failed. Returning unmodified portfolio object.
pf <- pf %>%
get_resources("chirps")
# Error in .check_existing_resources(existing_resources, required_resources, :
# The following required resource is not available: chirps.
```

I don't understand why the package does not identify that the resource is already present, as it would do on the local filesystem. |
We are indeed in need of a reproducible example. Please set one up using the MinIO docker image, covering the basic principles first. We need to figure out how we can authenticate against the S3 server, and write and read vector and raster data using the driver, before we can expect the package to auto-magically handle S3. |
You are right, we need a reproducible example. Instead of creating an ad hoc S3 server that will need credentials anyways, I wonder if it is not simpler if I generate tokens that enable another user to access my S3 bucket on the existing Minio server from anywhere, or generate tokens that enable another user to access a running pod (RStudio server) where all the environment parameters are pre-set to connect to the S3 bucket. I would need to communicate the tokens through a private channel though. What do you think @goergen95 ? |
I do not favor that option because it is not reproducible by anyone else. It is also not about the environment parameters, because this is something the code will have to take care of eventually, and I would actually like to see what is needed in terms of parameters in a reproducible example. Also, we will have to think about how to include tests for the new functionality in the package eventually. You might take some inspiration on how to set things up from this repository here. |
A question on the side: how great are the performance gains @fBedecarrats? Is it possible to make a benchmark? My intuition would tell me that local file storage is always superior, if "local" means in this context "locally in the cloud", wherever your R environment is installed. I would not expect read and write to be much faster with the cloud-optimized storages and that being the bottleneck for mass processing... but of course I might be wrong, so a benchmark would be really great. If performance gains for individual users are not much higher, then the value added of this feature would be to enable more collaboration across users and projects for a specific IT setup... we should discuss how far we want to support this, because there might be multiple solutions to that problem and I would see it more on the side of IT architects to enable collaboration within a specific IT infrastructure given a tool/technology that exists... instead of the other way around (making your tool fit a multitude of environments/IT setups)... Note: In this specific case a shared network drive within your environment might already solve the problem and there would be no need for AWS or whatsoever... |
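A minimal sketch of what such a benchmark could look like (paths and the bucket layout are hypothetical; it simply times the same extraction against a local copy and a /vsis3/ path):

```r
library(terra)
library(bench)

# Hypothetical paths: the same COG stored locally and on an S3 bucket
local_path <- "data/chirps-v2.0.1981.01.cog"
s3_path    <- "/vsis3/my-bucket/chirps/chirps-v2.0.1981.01.cog"

aoi <- ext(-72, -71, 18, 19)  # small area of interest

bench::mark(
  local = global(crop(rast(local_path), aoi), "mean", na.rm = TRUE),
  s3    = global(crop(rast(s3_path), aoi), "mean", na.rm = TRUE),
  iterations = 5,
  check = FALSE  # results should match; we only care about timings here
)
```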
Yes, your comment echoes @goergen95's comment above.
I am really not sure about this, but I thought that this might improve in the following sense: |
I agree that we need to see some specific benchmark scripts to further discuss this issue.
So I think it is worthwhile to investigate how we can deliver on that promise by supporting different types of cloud storage "natively" in the package. I don't expect performance improvements right away but I also do not think it is a priority at this stage. |
It is definitely an interesting hypothesis by @fBedecarrats. Seeing it from this perspective, a scalable read and write may solve a bottleneck where extra CPU on top just does not yield any significant results anymore. You may also read the Wikipedia article on Amdahl's law on this for starters: "In computer architecture, Amdahl's law (or Amdahl's argument) is a formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved. It states that 'the overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used'." And further below: "Amdahl's law does represent the law of diminishing returns if one is considering what sort of return one gets by adding more processors to a machine, if one is running a fixed-size computation that will use all available processors to their capacity. Each new processor added to the system will add less usable power than the previous one. Each time one doubles the number of processors the speedup ratio will diminish, as the total throughput heads toward the limit of 1/(1 − p)." This is quite close to what we observed in @Ohm-Np's master thesis. Maybe @Ohm-Np can link an online copy of his thesis here? |
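For reference, Amdahl's law in its usual form, with $p$ the parallelizable fraction of the workload and $s$ the speedup factor of that part (e.g. the number of processors):

$$
S_{\text{latency}}(s) = \frac{1}{(1 - p) + \frac{p}{s}}, \qquad \lim_{s \to \infty} S_{\text{latency}}(s) = \frac{1}{1 - p}
$$

So with, say, p = 0.9, no number of additional processors can push the overall speedup beyond 10x, which is why the serial part (here: reading and writing the data) eventually becomes the limiting factor.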
Ok, can we agree that this issue here is about enabling (some) cloud storage types? I would suggest that we discuss parallelization strategies and improvements elsewhere and further down the line. |
Yes! |
OK. After several attempts, it seems that I cannot set up Docker with the Linux pods I am using on Kubernetes. I cannot do it with my work Windows PC either. I need to find a machine on which I can launch Docker. I don't know when I will be able to achieve that. |
Great! Then I would suggest focusing on S3 and Azure Blob as a starting point, maybe Google Cloud Storage later. For S3 it should be possible to use MinIO to set up a testing environment. I am not sure about Azure. Reading and writing geospatial data through GDAL should be easy. The main problem I see is that we cannot list already existing files on these systems without further dependencies. We thus need reprexes for both storage types that show how to read/write raster and vector data and list existing files. |
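A rough sketch of what such a reprex could cover for S3 (bucket names and paths are hypothetical): reading goes through GDAL's /vsis3/ driver, while listing falls back on an extra dependency such as {aws.s3}, since GDAL offers no simple file listing from R:

```r
library(terra)
library(sf)
library(aws.s3)

# Credentials/endpoint as environment variables (here for a local MinIO)
Sys.setenv(
  AWS_ACCESS_KEY_ID     = "minioadmin",
  AWS_SECRET_ACCESS_KEY = "minioadmin",
  AWS_S3_ENDPOINT       = "localhost:9000",
  AWS_HTTPS             = "FALSE",
  AWS_VIRTUAL_HOSTING   = "FALSE"
)

# Read raster and vector data directly from the bucket (hypothetical objects)
r <- rast("/vsis3/mapme/gfw_treecover/tile.tif")
v <- read_sf("/vsis3/mapme/chirps/tileindex_chirps.gpkg")

# Listing existing files requires an S3 client package
keys <- get_bucket_df("mapme", region = "", use_https = FALSE)$Key

# Writing through /vsis3/ works for sequentially written formats; formats that
# need random access (e.g. GPKG) may require the GDAL config option
# CPL_VSIL_USE_TEMP_FILE_FOR_RANDOM_WRITE=YES
# writeRaster(r, "/vsis3/mapme/out/result.tif", overwrite = TRUE)
```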
I just found out this simple way to launch minio (without docker) on a linux environment:
After this, the MinIO client is accessible locally on the IP:port and with the credentials provided in the terminal. |
This is a complete procedure to run MinIO and access it from R. In one terminal, run:

```sh
# Install MinIO
wget https://dl.min.io/server/minio/release/linux-amd64/archive/minio_20230619195250.0.0_amd64.deb -O minio.deb
sudo dpkg -i minio.deb
mkdir ~/minio
minio server ~/minio --console-address :9090
```

The terminal will remain busy as long as MinIO is running. Open another terminal and run:

```sh
# Install MinIO Client (mc)
wget https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc
sudo mv mc /usr/local/bin/mc
# Creates an alias
mc alias set local http://127.0.0.1:9000 minioadmin minioadmin
mc admin info local
# Creates a bucket
mc mb local/mapme
# Create a test file on the local filesystem
printf "blabla\nblibli\nbloblo" >> test.txt
# Send the test file to Minio
mc cp ~/work/test.txt local/mapme/test.txt
```

Now, we will use R to connect to the MinIO server that runs on the local machine:

```r
library(aws.s3)
library(tidyverse)
# Set environment variables that aws.s3 use to connect to minIO
Sys.setenv("AWS_ACCESS_KEY_ID" = "minioadmin",
"AWS_SECRET_ACCESS_KEY" = "minioadmin",
"AWS_DEFAULT_REGION" = "",
"AWS_SESSION_TOKEN" = "",
"AWS_S3_ENDPOINT"= "localhost:9000")
get_bucket("mapme", region = "", use_https = FALSE)
# Bucket: mapme
#
# $Contents
# Key: test.txt
# LastModified: 2023-06-23T11:21:09.747Z
# ETag: "cb7a754ec0d230b2a3e28ccb55957e6d"
# Size (B): 20
# Owner: minio
# Storage class: STANDARD
s3read_using(FUN = readLines,
object = "test.txt",
bucket = "mapme",
opts = list("region" = "", "use_https" = "FALSE"))
# [1] "blabla" "blibli" "bloblo"
# Warning message:
# In FUN(tmp, ...) :
# incomplete final line found on '/tmp/RtmpCJttXA/file2504cdaa3d9.txt'
```

Can you please test that it works on your side @goergen95 ? |
I made an attempt with some geographic data using the GDAL S3 driver, but for now it doesn't work with the local MinIO (although it works with the remote MinIO from SSP Cloud, see the example above). I think this is due to the fact that the local resource doesn't use HTTPS.

```r
library(aws.s3)
library(tidyverse)
# Set environment variables that aws.s3 use to connect to minIO
Sys.setenv("AWS_ACCESS_KEY_ID" = "minioadmin",
"AWS_SECRET_ACCESS_KEY" = "minioadmin",
"AWS_DEFAULT_REGION" = "",
"AWS_SESSION_TOKEN" = "",
"AWS_S3_ENDPOINT"= "localhost:9000")
get_bucket("mapme", region = "", use_https = FALSE)
# Bucket: mapme
#
# $Contents
# Key: test.txt
# LastModified: 2023-06-23T11:21:09.747Z
# ETag: "cb7a754ec0d230b2a3e28ccb55957e6d"
# Size (B): 20
# Owner: minio
# Storage class: STANDARD
s3read_using(FUN = readLines,
object = "test.txt",
bucket = "mapme",
opts = list("region" = "", "use_https" = "FALSE"))
# [1] "blabla" "blibli" "bloblo"
# Warning message:
# In FUN(tmp, ...) :
# incomplete final line found on '/tmp/RtmpCJttXA/file2504cdaa3d9.txt'
library(mapme.biodiversity)
library(sf)
library(terra)
# create an AOI like in package documentation
aoi <- system.file("extdata", "sierra_de_neiba_478140.gpkg",
package = "mapme.biodiversity") %>%
read_sf() %>%
st_cast("POLYGON")
aoi_gridded <- st_make_grid(
x = st_bbox(aoi),
n = c(10, 10),
square = FALSE
) %>%
st_intersection(aoi) %>%
st_as_sf() %>%
mutate(geom_type = st_geometry_type(x)) %>%
filter(geom_type == "POLYGON") %>%
select(-geom_type, geom = x) %>%
st_as_sf()
# get some GFC resource
sample_portfolio <- init_portfolio(aoi_gridded, years = 2010,
outdir = ".") %>%
get_resources("gfw_treecover")
# Copy the GFC resource to minio
put_object(file = "gfw_treecover/Hansen_GFC-2021-v1.9_treecover2000_20N_080W.tif",
object = "gfw_treecover/Hansen_GFC-2021-v1.9_treecover2000_20N_080W.tif",
bucket = "mapme",
region = "",
use_https = FALSE)
get_bucket("mapme", region = "", use_https = FALSE)
# Bucket: mapme
#
# $Contents
# Key: gfw_treecover/Hansen_GFC-2021-v1.9_treecover2000_20N_080W.tif
# LastModified: 2023-06-23T11:39:38.021Z
# ETag: "ff12537644a35a34f88483b88d51e1fe"
# Size (B): 119150611
# Owner: minio
# Storage class: STANDARD
#
# $Contents
# Key: test.txt
# LastModified: 2023-06-23T11:21:09.747Z
# ETag: "cb7a754ec0d230b2a3e28ccb55957e6d"
# Size (B): 20
# Owner: minio
# Storage class: STANDARD
my_rast <- rast("/vsis3/mapme/gfw_treecover/Hansen_GFC-2021-v1.9_treecover2000_20N_080W.tif")
# Error: [rast] file does not exist: /vsis3/mapme/gfw_treecover/Hansen_GFC-2021-v1.9_treecover2000_20N_080W.tif
# In addition: Warning message:
# In new_CppObject_xp(fields$.module, fields$.pointer, ...) :
# GDAL Error 11: CURL error: Could not resolve host: mapme.localhost
```
|
Hi! The results are the following: Conclusion: |
|
Yep. This region thing is the usual suspect for any problem. My understanding is that in many situations {aws.s3} sets the region automatically by querying the bucket location (via get_location()).
It's here: https://github.com/cloudyr/aws.s3/blob/master/R/get_location.R |
Now that we have a working reproducible workflow, and keeping in mind this question about the region, the next questions could be:
Some ideas for 2:
|
Regarding 1: creating a dedicated branch in this repo is the way to go in my view.
|
Thanks. I won't be able to work on this today and tomorrow, but I think that I will be able to dedicate some time on Wednesday and Thursday. |
Just another note: |
|
The main drawback I see is that, although {paws} has many functions to interact with S3, it lacks the equivalent to
{paws} seems to be quite heavy on dependencies... maybe it's a valid alternative if we can just rely on {paws.storage} |
Apparently all required functions are in {paws.storage}. Note however that paws.storage imports {paws.common}. @goergen95, can you please confirm if paws.storage would be a valid alternative considering its dependencies, before I start investing time in understanding its functionalities and seeing how it could be used for our purpose? |
No, currently I am providing feedback on things you might consider when implementing the feature. I don't know the best way forward and it is currently not on my priority list to enable S3 data storage. |
Hmmm. Does that mean that I might spend several days of work to develop a functional pull request that would be finally refused because it adds too many dependencies to the package? |
I think it is fine to continue with {aws.s3}, as long as the new dependencies go to Suggests. |
Thank you for the clarification! I will move forward with {aws.s3} and the dependencies in Suggests then. |
{minioclient} will most probably hit CRAN soon. Could you please also have a look at whether it fits our purposes? I will try to do the same within the week. |
Nice!!! I will check that. |
It seems that when it comes to reading an object from S3, these three packages (paws, aws.s3 and minioclient) always copy the whole object to the local filesystem. This does not seem efficient, and I think /vsis3 remains the best option for reading at the calc_indicators() stage. |
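To illustrate the difference (bucket and object names are hypothetical, credentials assumed to be configured): an S3 client such as {aws.s3} downloads the complete object before anything can be read, while GDAL's /vsis3/ driver only issues range requests for the blocks actually needed:

```r
library(aws.s3)
library(terra)

# Client-style access: the whole GeoTIFF is copied to a local temp file,
# and FUN (here rast) is applied to that local copy
r1 <- s3read_using(FUN = rast,
                   object = "gfw_treecover/tile.tif",
                   bucket = "mapme")

# GDAL virtual file system: open lazily and read only the cropped window
r2 <- rast("/vsis3/mapme/gfw_treecover/tile.tif")
r2_sub <- crop(r2, ext(-72, -71, 18, 19))
```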
At least using |
Wow, this looks great! Do you know if something similar is in sight for sf for vector data? S3 also refers to the object-oriented system in R, so when I google for sf + S3, all I find are false positives (e.g. https://r-spatial.github.io/sf/reference/Ops.html) |
sf and stars both support read/write from virtual file systems. Take a look at the Gist where I show the usage for each package. |
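For instance, a minimal sketch of reading vector data with sf straight from a bucket via /vsis3/ (paths are hypothetical), optionally restricting the read to features intersecting an area of interest:

```r
library(sf)

# Open a GeoPackage stored on S3 without downloading it first
tindex <- read_sf("/vsis3/mapme/chirps/tileindex_chirps.gpkg")

# Optionally push a spatial filter down to GDAL so only intersecting
# features are read
bb  <- st_bbox(c(xmin = -72, ymin = 18, xmax = -71, ymax = 19),
               crs = st_crs(4326))
sub <- read_sf("/vsis3/mapme/chirps/tileindex_chirps.gpkg",
               wkt_filter = st_as_text(st_as_sfc(bb)))
```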
Hi @goergen95 @fBedecarrats, apologies for jumping in at random; I just came across your package. I've also been diving into GDAL virtual filesystems -- some quick examples here: https://schmidtdse.github.io/biodiversity-catalog/examples/cloud-basics.html Hope you don't mind me hopping in here. I'm eager to learn more about your project and find ways to contribute. Kind regards, Carl |
Hi Carl, thanks for chiming in and congrats on the publication! The issues you raised recently with R spatial packages have helped a lot to figure out the usage of GDAL's virtual file system drivers. We are trying to extend mapme.biodiversity/R/calc_indicator.R, line 154 at 1f4c7f7.
For raster data, we have come up with a solution using VRT if the raster is provided in tiles covering the globe and two or more tiles intersect with a given polygon: mapme.biodiversity/R/calc_indicator.R, lines 206 to 216 at 1f4c7f7.
I think most of this should work with minor modifications. As mentioned earlier in this thread, there still remains the issue of listing existing files in cloud buckets. Best, |
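A rough sketch of the VRT-over-/vsis3/ idea described above (tile paths and the area of interest are hypothetical): build a single virtual mosaic from the tiles intersecting a polygon and crop to its extent:

```r
library(terra)
library(sf)

# Hypothetical tile paths on an S3 bucket
tiles <- c(
  "/vsis3/mapme/gfw_treecover/Hansen_treecover2000_20N_080W.tif",
  "/vsis3/mapme/gfw_treecover/Hansen_treecover2000_20N_070W.tif"
)

# Mosaic the tiles lazily into a single VRT, then crop to the polygon's
# extent; only the required blocks are fetched from the bucket
vrt_rast <- vrt(tiles)
poly <- read_sf("aoi.gpkg")  # hypothetical area of interest
out <- crop(vrt_rast, ext(vect(poly)))
```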
Excellent, thank you! |
For some unknown reason, reading with the "/vsis3/" method doesn't work when I use the MinIO testing server set up on my local machine. However, it works with the complete version hosted on SSP Cloud. Maybe something to do with HTTP vs. HTTPS? |
Could you try setting "AWS_HTTPS" = "FALSE"? |
Alas, I tried adding "AWS_HTTPS" = "FALSE" to the Sys.setenv() call, but it didn't work... |
@fBedecarrats I believe you also need to set "AWS_VIRTUAL_HOSTING" = "FALSE", and of course keep the other settings from before. |
Yesssss! It works with the following parameters:

```r
Sys.setenv("AWS_ACCESS_KEY_ID" = "minioadmin",
"AWS_SECRET_ACCESS_KEY" = "minioadmin",
"AWS_S3_ENDPOINT"= "localhost:9000",
"AWS_SESSION_TOKEN" = "",
"AWS_HTTPS" = "FALSE",
"AWS_VIRTUAL_HOSTING"="FALSE") |
Hi! Everything seems to work and I am preparing a PR to include these features in the package (in a dedicated branch). |
Hi all, please find this branch where I am testing using GDAL drivers to support both remote files as well as cloud storage solutions. It is still very much WIP, but the main ideas are there already. |
Background
The package implements parallelization, now using a backend that works well with big-data environments. The main bottleneck now, when large numbers of assets (N) are used, is the way data is stored and accessed. Enabling storage of and access to data through datalake APIs (S3, Azure Blob Storage) could further enhance the package's performance.
Definition of done
The package allows referencing datalake storage (Azure or S3) in the same way as it allows referencing a location on the local file system.
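In other words, something along these lines should work identically whether outdir points to a local folder or to a cloud location (the S3 path below is hypothetical):

```r
library(mapme.biodiversity)
library(sf)

aoi <- read_sf(system.file("extdata", "sierra_de_neiba_478140_2.gpkg",
                           package = "mapme.biodiversity"))

# Local filesystem (current behaviour)
pf_local <- init_portfolio(aoi, years = 2000:2020, outdir = "./data")

# Datalake storage (target behaviour): same call, S3/Azure-backed outdir
pf_s3 <- init_portfolio(aoi, years = 2000:2020,
                        outdir = "/vsis3/my-bucket/mapme_biodiversity")
```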
Complexity
To be assessed. My first impression is that it would be low to medium.