Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Function blackmarbler::bm_extract is not downloading #10

Open
alexyshr opened this issue Mar 9, 2024 · 12 comments
Open

Function blackmarbler::bm_extract is not downloading #10

alexyshr opened this issue Mar 9, 2024 · 12 comments

Comments

@alexyshr
Copy link

alexyshr commented Mar 9, 2024

Can you please advise me on why the function blackmarbler::bm_extract is not downloading the files as expected? Despite the system indicating that the files were processed 100%, no downloads are appearing in my designated folders. I used a local Administrator account on Windows 10 to run the R script, but no files are being written to the disk drive. I did not encounter any error messages or warnings during the execution of the script. I just installed the development version of blackmarbler package from GitHub, but It didn't work with the CRAN version either. I'm running the last R version [4.3.3]. Any assistance in resolving this issue would be greatly appreciated. I am using my NASA-EARTHDATA-TOKEN to define the bearer variable.

library(sf)
library(dplyr)
library(ggplot2)
library(purrr)
library(lubridate)
bearer <- "NASA-EARTHDATA-TOKEN"
#test
roi_sf <- gadm(country = "GHA", level = 1, path = tempdir()) |> st_as_sf()
blackmarbler::bm_extract(
  roi_sf = roi_sf,
  product_id = "VNP46A2",
  date = seq.Date(
    from = lubridate::ymd("2023-01-01"),
    to = lubridate::ymd("2023-01-02"), by = 1
  ),
  bearer = bearer,
  output_location_type = "file",
  file_dir = file.path(getwd())
)

This is the outcome:
error

@alexyshr
Copy link
Author

Just to clarify, the function is always downloading to temporal files but not to the folder defined in file_dir

@ramarty
Copy link
Contributor

ramarty commented Mar 27, 2024

@alexyshr Thanks for flagging this! The issue should be fixed in the dev version (# install.packages("devtools") devtools::install_github("worldbank/blackmarbler"), but let me know if you're still running into issues.

@alexyshr
Copy link
Author

alexyshr commented Mar 28, 2024

Thank you @ramarty. After reinstalling from GitHub. I am using the function as below:

  ntl_counties_df <- bm_extract(
    roi_sf = counties,
    product_id = "VNP46A2",
    date = days_to_get_ntl,
    bearer = bearer,
    variable = variable,
    quality_flag_rm = quality_flag_rm,
    output_location_type = "file",
    file_dir = file_dir,
    file_skip_if_exists = TRUE
  )

Inside file_dir I have all the hdf5 files. After running the code the bm_extract command placed all the Rds (e.g. VNP46A2_t2017_08_21.Rds) files, but their size is only 3k.

The process fails. I tried multiple times. The content of ntl_counties_df is empty or null. After the second execution of the code, the system does not try to download files anymore.

Thank you!

@alexyshr
Copy link
Author

The content of the Rds files is OK, but the combination of those results in long format inside the variable ntl_counties_df does not work.

@ramarty
Copy link
Contributor

ramarty commented Mar 28, 2024

@alexyshr This is all helpful feedback. A couple things:

1/ After running the below code, do you get values for ntl_counties_df?

ntl_counties_df <- bm_extract( roi_sf = counties, product_id = "VNP46A2", date = days_to_get_ntl, bearer = bearer, variable = variable, quality_flag_rm = quality_flag_rm )

2/ I'll update the documentation to make this more clear (or just update what the package does to give a more expected result). But when:

  • output_location_type = "file", both bm_extract and bm_raster only output data to the directory specified in file_dir. The functions don't return data to R. (Right now I tibble of 0 rows is returned, but I should change this to NULL -- or maybe have the function load all the Rds files, append, then return that -- in addition to saving the Rds files.
  • When output_location_type = "memory" (the default), the functions return data. So here, ntl_counties_df should then have data -- but no Rds files are saved.

I've used output_location_type = "file" if I want to download data for a lot of dates, then I'll do something like: list.files(file_dir, full.names = F, pattern = "*.Rds") |> map_df(readRDS) to get the appended dataframe.

Happy to clarify more, or let me know if the above code still doesn't work.

Thanks!

@alexyshr
Copy link
Author

@ramarty Thank you!

1/ It returns data in ntl_counties_df after downloading all the HDF5 files to temporal places.

  1. When using these options:
  ntl_counties_df <- bm_extract(
    roi_sf = counties,
    product_id = "VNP46A2",
    date = days_to_get_ntl, # as.character "2018-05-25",#,
    bearer = bearer,
    variable = variable,
    quality_flag_rm = quality_flag_rm,
    output_location_type = "memory",
    file_dir = file_dir,
    file_skip_if_exists = TRUE
  )

After re-downloading all the HDF5 files to temporary locations, the result is stored in ntl_counties_df.

  1. When using these options:
  ntl_counties_df <- bm_extract(
    roi_sf = counties,
    product_id = "VNP46A2",
    date = days_to_get_ntl, # as.character "2018-05-25",#,
    bearer = bearer,
    variable = variable,
    quality_flag_rm = quality_flag_rm,
    output_location_type = "file",
    file_dir = file_dir,
    file_skip_if_exists = TRUE
  )
  ntl_counties_df <- file_dir |>
    list.files(pattern = "*.Rds", full.names = TRUE) |>
    purrr::map_df(readRDS)

I was able to gather all the rds inside ntl_counties_df. The commands did not initiate any additional downloads of files.

  1. There are multiple issues with previous approach (3)

4.1

If I change the variable to read from HDF5 files, the system does not replace existing Rds files

VNP46A2_t2017_08_29.Rds" already exists; skipping.
VNP46A2_t2017_08_30.Rds" already exists; skipping
...

4.2

What if we have other Rds (not related) inside that folder. The integrated Rds will be wrong.

For instance, an additional Rds created as below. Those files will provoke a bad result.

  # # Append daily-level datasets into one file

  file_dir |>
    list.files(pattern = "*.Rds", full.names = TRUE) |>
    purrr::map_df(readRDS) |>
    saveRDS(file.path(
      file_dir,
      paste0(hurricanename, "_", as.character(year), "_ntl_daily.Rds")
    ))

4.3 What if we change the geometry of roi_sf? The Rds need to be recreated.

4.4 What if we change other parameters of the package commands? The Rds need to be recreated.

I think a better solution could be:

  • RDS creation should always occur within a temporary folder or within the specified file_dir with guaranteed unique names. This implies that the commands will not reuse RDS files (unless you have a way to prove that the exact command outcome was already created -- then the reuse of Rds will be valid)
  • Our package functions should internally utilize the purrr::map_df operation to consistently return the data after executing the functions, regardless of the chosen file or memory options (output_location_type).
  • In the event of creating temporary folders, perhaps the command's output could report the path of the temporary folder to the user.

Thank you!

@alexyshr
Copy link
Author

I thought that removing the Rds would be the solution to avoid re-downloading HDF5 files, but it seems that once no Rds are located inside file_dir (only hdf5 files are inside it), the next code is again downloading all HDF5 files. The step that I want to avoid is to download HDF5 files multiple times.

  ntl_counties_df <- bm_extract(
    roi_sf = counties,
    product_id = "VNP46A2",
    date = days_to_get_ntl, # as.character "2018-05-25",#,
    bearer = bearer,
    variable = variable,
    quality_flag_rm = quality_flag_rm,
    output_location_type = "file",
    file_dir = file_dir,
    file_skip_if_exists = TRUE
  )

Thank you for your time. I am not in a hurry. Take your time. I know that this package is just starting. I will try working with HDF5 files directly using stars or terra and sf.

@ramarty
Copy link
Contributor

ramarty commented Mar 28, 2024

@alexyshr These are excellent comments — thanks so much. Great point about making variable names more unique—such as incorporating the variable in the name.

And yep, definitely agree about returning the appended dataframe even when output_location_type = "file" is specified. Agree that would make more sense for the user.

Helpful point too on wanting to keep the HDF5 files to, for example, extract multiple variables but avoid re-downloading the data (which can be slow). I'll think on the best way to implement this, but figure could add a function that just downloads HDF5 files to a directory, then could add an option in bm_raster and bm_extract to read data from the HDF5 files instead of downloading those files.

Thanks again — will work on making these improvements!

@alexyshr
Copy link
Author

alexyshr commented Apr 4, 2024

Hi @ramarty ,

I have edited the current version in GitHub to avoid downloading HDF5 files if they exist in file_dir.

I know you are changing multiple things in the package (migration to TERRA) but I would like you to consider the changes that can be useful for the user. Please find the code attached.

I did all the changes testing the function bm_extract. I think the function bm_raster is also working by default, but some changes in the messages and warnings need to be made (in case you decide to accept some or all the proposed changes).

Main considerations:

  • check_all_tiles_exist: If TRUE the code does not stop but reports the missing tiles (messages and warnings). Currently, it also generates a CSV file with the report of missing tiles. If FALSE it does not stop or report missing tiles.
  • file_skip_if_exists: It is related only to override or not Rds files.
  • In your original version, it seems add_n_pixels = FALSE does not work when interpol_na = TRUE (not fixed here).
  • I have tested all the variable values inside the function bm_raster and it seems to work.
  • bm_extract always returns the data in long format (independent of file_dir value)

You can test this version with the code below (those dates in the code have one missing tile for the Puerto Rico area)

source("blackmarbler.R")


#library(blackmarbler)
library(geodata)
library(sf)
library(stars)
library(raster)
library(ggplot2)
library(dplyr)
library(tools)
library(terra)
library(zoo)
library(lubridate)

library(purrr) #blackmarbler.R
library(httr)
library(exactextractr)
library(tidyr)
library(knitr)
library(readr)
library(hdf5r)
library(tidyr)
library(stringr)

#libraries for the blackmarbler function
library(stringr)

bearer <- "..."


variable <- "Gap_Filled_DNB_BRDF-Corrected_NTL"
#[1]:"DNB_BRDF-Corrected_NTL"
#[2]:"DNB_Lunar_Irradiance"
#[3]:"Gap_Filled_DNB_BRDF-Corrected_NTL"
#[4]:"Latest_High_Quality_Retrieval"
#[5]:"Mandatory_Quality_Flag"
#[6]:"QF_Cloud_Mask"
#[7]:"Snow_Flag"


file_dir <- "C:/temp/bm"

the_dates  <- c("09/27/2017", "09/30/2017")
start_date <- as.Date(the_dates[1], format = "%m/%d/%Y")

stop_date <- as.Date(the_dates[2], format = "%m/%d/%Y")

days_to_get_ntl <- seq(start_date, stop_date, by = "day")

#Retrieve GADM polygons of Puerto Rico (PRI) (4326)
#
the_geometry <- gadm(
  country = "PRI",
  level = 1,
  path = "C:/temp/bm",
) |>
  st_as_sf()


#filter the field NAME to get only "Guaynabo" and "Arroyo"
the_geometry <- the_geometry %>%
  filter(NAME_1 %in% c("Guaynabo", "Arroyo"))


aggregation_fun <- c("mean")

values_df_long <- bm_extract(
  roi_sf = the_geometry,
  product_id = "VNP46A2",
  date = days_to_get_ntl,
  bearer = bearer,
  aggregation_fun = aggregation_fun,
  add_n_pixels = TRUE,
  variable = variable,
  quality_flag_rm = 2,
  check_all_tiles_exist = TRUE,
  interpol_na = FALSE,
  output_location_type = "file", # memory, file
  file_dir = file_dir,
  file_prefix = NULL,
  file_skip_if_exists = FALSE, #TRUE do not replace
  quiet = FALSE
)
[blackmarbler.zip](https://github.com/worldbank/blackmarbler/files/14875276/blackmarbler.zip)


#bm_raster is working but some updates are needed
# raster_stack <- bm_raster(
#   roi_sf = the_geometry,
#   product_id = "VNP46A2",
#   date = days_to_get_ntl,
#   bearer = bearer,
#   variable = variable,
#   quality_flag_rm = 2,
#   check_all_tiles_exist = FALSE,
#   interpol_na = FALSE,
#   output_location_type = "file", # memory, file
#   file_dir = file_dir,
#   file_prefix = NULL,
#   file_skip_if_exists = FALSE, #TRUE do not replace
#   quiet = FALSE
# )

@ramarty
Copy link
Contributor

ramarty commented Apr 6, 2024

@alexyshr thanks so much -- these changes are great & I agree makes sense. Aiming to go through this week, so should have an update by end of week!

@ramarty
Copy link
Contributor

ramarty commented Apr 19, 2024

@alexyshr Apologies that its taken me forever to make changes, but few updates (on the development version -- not on cran yet)

  1. Switched from raster to terra
  2. Added an h5_dir parameter. If NULL (default), h5 files get saved to a working directory & deleted. However, if you set a file path, h5 files get saved to that file path and are not deleted -- and the function first checks if needed h5 files are in the directory before downloading
  3. When output_location_type is set to file, the function still returns expected output (raster or dataframe)
  4. output_location_type is set to file, the filenames default to more unique names (eg, the variable is included)

I'll do some additional tests before updating these to CRAN -- but definitely let me know if you run into more issues or if have more feedback!

(Also working on allowing one to use bm_extract using terra::extract -- potentially an option to use either that or exact_extract).

@alexyshr
Copy link
Author

That is wonderful. I will check this tomorrow. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants