
Adding functions and scripts for downloading, extracting, and processing observations, initial conditions, land cover types, and ERA5 drivers for anchor sites within NA. #3278

Status: Open. 72 commits into base: develop.
Conversation

@DongchenZ (Contributor) commented Mar 12, 2024

Description

This PR includes:

  1. Script for preparing required data sets (ERA5; AGB, LAI, SMAP, and SOC; land cover) for anchor sites.
  2. Script for preparing initial conditions (AGB, LAI, soil moisture, SOC) for anchor sites.
  3. Updated function for searching ecoregions within NA.
  4. Functions for downloading and extracting soil moisture from the CDS server.
  5. Function for downloading and extracting MODIS land cover products.
  6. Function for extracting AGB initial conditions from GeoTIFF files.
  7. Function for extracting ISCN SOC from an existing Rdata file.

Motivation and Context

Review Time Estimate

  • Immediately
  • Within one week
  • When possible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • My name is in the list of CITATION.cff
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@infotroph (Member) commented Mar 13, 2024

As currently structured, this PR adds more than 100 megabytes of data files to the package, which is far outside the norm for R package size -- CRAN wants special justification for anything with more than 5 MB of data or more than 10 MB for the whole package. Do the shapefiles and Rdata files really need to be distributed as part of data.land, or can they be stored elsewhere and read in as needed?

@mdietze (Member) left a comment

For the process of data prep for large-scale SDA runs, there needs to be better overall documentation of how one does that. Even if someone else found these functions and scripts, they wouldn't know how to run them without some sort of README.md. Indeed, for prep scripts that only make sense to run manually (as opposed to cron jobs or some other automation) you might consider writing them as Rmd files.

#prepare observations
settings <- PEcAn.settings::read.settings("/projectnb/dietzelab/dongchen/anchorSites/SDA/pecan.xml")
settings$state.data.assimilation$Obs_Prep$outdir <- "/projectnb/dietzelab/dongchen/anchorSites/Obs"
obs <- PEcAnAssimSequential::SDA_OBS_Assembler(settings)
Member:

Is this one step (3 lines of code) loading the full list of all of your data constraints? That needs better documentation. Also, why is settings being loaded a second time? And why is one of the paths being manually rewritten? Also, what script is building the settings file to begin with? I expected you to be starting from either a file of sites (id, lat, lon, etc) or a query of a site group.

DongchenZ (Contributor Author):

Here, we need to create the old pecan.xml file using the Create_multi_settings.R script for the anchor site group on the Bety DB (group ID 1000000033). The reason is that the Bety DB is currently down, and it's challenging to create new records in it. Therefore, we first need to grab what we have previously, iteratively add new sites to the site info, and write them into the new pecan.xml file.
Once we have every site prepared in Bety, we can easily create the pecan.xml file just by using the Create_multi_settings.R script and ignoring the chunk that pulls new sites into the existing database.

DongchenZ (Contributor Author):

I also left this comment on the script so people will know how to deal with it appropriately.

obs <- PEcAnAssimSequential::SDA_OBS_Assembler(settings)

#prepare LC and PFTs
LC <- PEcAn.data.remote::MODIS_LC_prep(site_info = site_info, time_points = settings$state.data.assimilation$start.date, qc.filter = T)
Member:

Is "LC" land cover?

DongchenZ (Contributor Author):

Yes. I replaced every "LC" with "land cover" in the comments.

#'
#' @examples
#' @author Dongchen Zhang
#' @importFrom magrittr %>%
@mdietze (Member):

Is there a reason you need to use magrittr pipes over R native pipes |> ? If yes, I think I saw recent PRs where @infotroph imported from dplyr instead of magrittr (presumably to reduce dependencies?)

@infotroph (Member):

@mdietze you're right that I switched data.atmosphere to import from dplyr, and yes it was to reduce dependencies, but looks like the existing pipe imports in data.land are still from magrittr. I support keeping this one from magrittr and switching in a separate PR.

As for |> vs %>%, I've still been using %>% in packages that are already using it, just for consistency. I don't object to switching to |> in new code -- when we do start using native pipes in a given package, we should add Depends: R (>= 4.1) to its DESCRIPTION.

DongchenZ (Contributor Author):

Reverted back to dplyr
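For reference, the two pipe styles discussed above look like this; a minimal sketch, assuming nothing beyond base R (the native `|>` pipe requires R >= 4.1, which would be declared in DESCRIPTION as `Depends: R (>= 4.1)`):

```r
# Native pipe: no package dependency, but needs R >= 4.1.
x <- c(3, 1, 2)
sorted_native <- x |> sort()

# The magrittr-style equivalent would be `x %>% sort()`, which requires
# importing %>% from magrittr or dplyr, e.g. `#' @importFrom dplyr %>%`
# in the Roxygen header.
sorted_native
```

As infotroph notes, the main trade-off is consistency with a package's existing pipe imports versus dropping a dependency.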

)
})
#check if the token exists for the cdsapi.
if (!file.exists(file.path(Sys.getenv("HOME"), ".cdsapirc")))
Member:

Likewise, you definitely need to document the need to set up an API key as part of the function documentation. All these error messages are great, but it would be super frustrating to try to use this function in practice, as it would just keep giving you different error messages until you finally got it working. I suspect most users would give up before they got through all the dependencies.

DongchenZ (Contributor Author):

Fixed.
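One way to address the reviewer's frustration point is a single up-front pre-flight check that reports every missing prerequisite at once, rather than failing one dependency at a time. A base-R sketch, where the helper name is hypothetical and `message()` stands in for PEcAn.logger calls; the `~/.cdsapirc` token path is the one used by the PR:

```r
# Hypothetical pre-flight helper: collect all missing setup steps for the
# CDS API and report them together before attempting any download.
check_cds_setup <- function(home = Sys.getenv("HOME")) {
  problems <- character(0)
  token_file <- file.path(home, ".cdsapirc")
  if (!file.exists(token_file)) {
    problems <- c(problems, paste0(
      "No CDS API token found at ", token_file,
      ". Register on the CDS website and save your API key there."))
  }
  # Further checks (python available, cdsapi module importable, ...)
  # would append to `problems` here.
  if (length(problems) > 0) {
    message(paste(problems, collapse = "\n"))
    return(FALSE)
  }
  TRUE
}
```

Running this once at the top of `download.SM_CDS` would surface all setup requirements in a single message.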

@@ -0,0 +1,117 @@
#read settings.
Member:

Similar to the other script, this needs better overall documentation.

DongchenZ (Contributor Author):

Fixed.

modules/data.remote/R/MODIS_LAI_prep.R (resolved)
@@ -0,0 +1,41 @@
#' Extract ensemble above-ground biomass density from pre-existing GeoTIFF file for the SDA workflow.
Member:

Ok, so presumably this isn't a generic approach to reading any AGB geotiff (of which there are many in the world), but something for working with some specific product? That needs to be well documented, including being clear in the function name.

DongchenZ (Contributor Author):

fixed.

"Please make sure it is installed to a location accessible to `reticulate`.",
"You should be able to install it with the following command: ",
"`pip install --user cdsapi`.",
"The following error was thrown by `reticulate::import(\"cdsapi\")`: ",
Member:

This sort of dependency on a python library needs to be documented in your Roxygen metadata.

I also found this article on managing R library dependencies on python libraries: https://cran.r-project.org/web/packages/reticulate/vignettes/python_dependencies.html
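As a sketch of what that Roxygen documentation might look like (the exact wording is illustrative, not the PR's actual text; the `pip install --user cdsapi` command and `~/.cdsapirc` token file are from the PR's own error messages):

```r
#' @details
#' This function requires the python package \code{cdsapi}, accessed through
#' \code{reticulate}. Install it with \code{pip install --user cdsapi} and
#' place your CDS API token in \code{~/.cdsapirc} before calling this
#' function.
#'
#' @section Python dependencies:
#' See the reticulate vignette on managing python dependencies for options
#' such as declaring the requirement in DESCRIPTION's \code{SystemRequirements}.
```

This keeps the setup instructions visible in the rendered help page rather than buried in runtime error messages.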

settings <- PEcAn.settings::read.settings("/projectnb/dietzelab/dongchen/anchorSites/SDA/pecan.xml")

in.path <- "/projectnb/dietzelab/hamzed/ERA5/Data/Ensemble"
out.path <- "/projectnb/dietzelab/dongchen/anchorSites//ERA5_2012_2021"
Member:

Since this is a stand-alone script, I think it would be good to include an overall introductory comment block explaining what the purpose of the script is and how to use it, and then additional comments on variables like these (settings, in.path, out.path) explaining what they should be.

Also, is there something here that's specific to anchor sites or would this script do data prep for any set of sites (in North America? in the world?) and the initial list of sites just happens to be anchor sites?

DongchenZ (Contributor Author):

  1. I replaced it with a well-documented Rmd file.
  2. It's preparing data for the 342 NA anchor sites, which combine the Bety site group (1000000033) and some sites in the Google Sheet (https://docs.google.com/spreadsheets/d/1n7pVUcrYrB0S8bqrj77tUNrLHA2c_yqzkZL8mgdVDjs/edit#gid=0). I also left comments with this information.

```{r}
#filter based on NA boundary, land cover.
#boundary
site_eco <- PEcAn.data.land::EPA_ecoregion_finder(pre_site_info$lat, pre_site_info$lon)
Member:

Ok for this PR, but we should update to make sure we're not excluding Central America, the Caribbean, and Hawaii (can discuss the latter, since it probably falls outside your ERA5 box). Offline we discussed alternative ecoregion maps for Central and South America.

#' @author Dongchen Zhang
#' @importFrom dplyr %>%
download.SM_CDS <- function(outfolder, time.points, overwrite = FALSE, auto.create.key = FALSE) {
###################################Introduction on how to play with the CDS python API##########################################
Member:

I'd recommend moving all of this instruction text out of the function and into the Roxygen as part of the Details section.

DongchenZ (Contributor Author):

Fixed.

#' @param in.path Physical path to where the unzipped soil moisture files are downloaded.
#' @param out.path Where the final CSV file will be stored.
#' @param allow.download Flag determining whether to automatically download files if they are not available.
#' @param search_window Search window for locating available soil moisture values.
Member:

not clear what units this is in. pixels? km? degrees? if pixels, include a reminder of how big a pixel is for this data product.

DongchenZ (Contributor Author):

It's the time search window (in days). Because it takes ~10 days for this product to achieve global coverage, we need to account for that to grab sufficient estimates for all locations we are interested in.

DongchenZ (Contributor Author):

Updated the documentation to make it more clear.
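Given the author's clarification that `search_window` is temporal (in days), the lookup it describes can be sketched in base R; this is an illustrative stand-in, not the PR's actual implementation, and the function name is hypothetical:

```r
# Hypothetical sketch: given a target date, return the nearest date with an
# available file inside a +/- search_window-day window. Since the product
# needs ~10 days for full global coverage, 10 is a natural default.
nearest_available_date <- function(target, available, search_window = 10) {
  target <- as.Date(target)
  available <- as.Date(available)
  in_window <- available[abs(as.numeric(available - target)) <= search_window]
  if (length(in_window) == 0) return(NA)  # nothing close enough in time
  in_window[which.min(abs(as.numeric(in_window - target)))]
}
```

Documenting the units this concretely (days, with the ~10-day coverage rationale) is exactly what the reviewer was asking for.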

PEcAn.logger::logger.info("Try download from cds server.")
ncfile <- c(ncfiles[dates.ind.exist], PEcAn.data.land::download.SM_CDS(in.path, dates[dates.ind.download])) %>% sort()
} else {
PEcAn.logger::logger.severe("The download is not enabled, skip to the next time point.")
Member:

logger.severe won't skip to the next time point, it will kill the whole R job

DongchenZ (Contributor Author):

Replaced it with logger.info.
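The distinction matters because `logger.severe` terminates the whole R session, so "skip to the next time point" needs an informational log plus an explicit `next`. A toy base-R sketch of the intended control flow, with `message()` standing in for `PEcAn.logger::logger.info` and made-up availability flags:

```r
# Skip-and-continue semantics: log the problem, then `next` to the
# following iteration instead of killing the whole job.
processed <- c()
time_points <- c("2012-07-15", "2013-07-15", "2014-07-15")
download_enabled <- c(TRUE, FALSE, TRUE)  # toy flags for illustration
for (i in seq_along(time_points)) {
  if (!download_enabled[i]) {
    message("Download not enabled, skipping time point ", time_points[i])
    next  # continue with the remaining time points
  }
  processed <- c(processed, time_points[i])
}
```

With `logger.severe` in place of the `message()`/`next` pair, the loop would never reach the third time point.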

@@ -0,0 +1,158 @@
---
title: "Initial condition prep script for NA anchor sites"
Member:

I feel like this script should probably be in the inst/anchor folder.

Also, as a general comment (not just this script), if you're going to do prep in Rmd instead of R, I'd recommend embedding more text-based descriptions and explanations. Think about the next cohort of folks to use these tools and what they need to know that's still in your head instead of written out explicitly.

On this point, also a reminder to think about what in this PR also needs to be added to the general pecan documentation and change logs. For example, there are whole new sources of input data that need to be included. Any new pecan.xml tags also need to be documented.

DongchenZ (Contributor Author):

Fixed.

@@ -5,21 +5,21 @@
#' @param outdir Where the final CSV file will be stored.
#' @param search_window Search window for locating available LAI values.
#' @param export_csv Decide if we want to export the CSV file.
#' @param skip_high_sd Whether to skip observations with high standard error.
Member:

Rather than a boolean, what if you instead passed in the SD threshold you want to use and just set the default really high (e.g. 100) so that it's essentially "off" by default. Alternatively, you could set it to the current threshold and have it "on" by default (SD of 10, which implies a CI of ~40, on a variable that goes from 0-6 is already very wide)

DongchenZ (Contributor Author):

I added an sd_threshold parameter that defaults to NULL, which can be used to filter out the data.
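The resolution the author describes can be sketched in a few lines of base R; the function and column names here are illustrative, not the PR's actual code:

```r
# Hypothetical filter: sd_threshold = NULL (the default) disables
# filtering entirely; any numeric value drops observations whose
# standard deviation exceeds it.
filter_by_sd <- function(obs, sd_threshold = NULL) {
  if (is.null(sd_threshold)) return(obs)
  obs[obs$sd <= sd_threshold, , drop = FALSE]
}
```

This follows the reviewer's suggestion of a numeric threshold rather than a boolean, with NULL playing the role of the "really high" off switch.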

#' @param site_info Bety list of site info including site_id, lon, and lat.
#' @param time_points A vector containing each time point within the start and end date.
#' @param outdir Where the final CSV file will be stored.
#' @param export_csv Decide if we want to export the CSV file.
Member:

these two tags seem redundant. If outdir == NULL then you don't want to export the CSV. That said, I'd expand the explanation text to make that point explicit (i.e. function returns data object by default, outdir is NULL by default, but if you want to export to a CSV you specify the path here)

DongchenZ (Contributor Author):

Fixed.
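The convention the reviewer proposes, returning the data object by default and writing a CSV only when a path is given, can be sketched as follows (names hypothetical):

```r
# The function always returns the data object; exporting a CSV is a side
# effect that happens only when `outdir` is non-NULL (NULL is the default).
prep_data <- function(dat, outdir = NULL) {
  if (!is.null(outdir)) {
    utils::write.csv(dat, file.path(outdir, "data.csv"), row.names = FALSE)
  }
  dat
}
```

This removes the redundant `export_csv` flag: `outdir = NULL` means no export, and a supplied path means export there.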

#' @param time_points A vector containing each time point within the start and end date.
#' @param outdir Where the final CSV file will be stored.
#' @param export_csv Decide if we want to export the CSV file.
#' @param qc.filter Decide if we want to filter data by the QC band.
Member:

  1. recommend qc_filter as the variable name
  2. similar to earlier point, rather than having this be a boolean, better would be to pass in the requested QC flag (e.g. set this to qc_filter = c("000", "001") by default) and then say set to NULL or FALSE or something like that to turn off QC filtering.

DongchenZ (Contributor Author):

fixed.
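The reviewer's second point, passing the accepted QC flags instead of a boolean, can be sketched like this (function and column names illustrative; the default flags follow the reviewer's suggestion):

```r
# Hypothetical QC handling: `qc_filter` holds the accepted QC flag strings;
# NULL turns filtering off entirely.
apply_qc_filter <- function(dat, qc_filter = c("000", "001")) {
  if (is.null(qc_filter)) return(dat)
  dat[dat$qc %in% qc_filter, , drop = FALSE]
}
```

Compared with a boolean, this lets callers tighten or loosen the accepted flags without touching the function.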

if (!is.null(outdir)) {
if(file.exists(file.path(outdir, "LC.csv"))){
PEcAn.logger::logger.info("Extracting previous MODIS Land Cover file!")
Previous_CSV <- utils::read.csv(file.path(outdir, "LC.csv"))
Member:

This sort of cache behavior is unintuitive and isn't documented. Please explain this in the Roxygen

DongchenZ (Contributor Author):

fixed.
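The cache behavior under review, reuse a previous LC.csv and only fetch sites it doesn't cover, can be made explicit in a sketch like this. The helper name and `fetch_new` callback are hypothetical stand-ins for the real MODIS extraction:

```r
# Sketch of the cache pattern: if LC.csv already exists in outdir, reuse
# its rows and only fetch land cover for site ids not already covered,
# then write the combined table back as the new cache.
load_or_fetch_lc <- function(site_ids, outdir, fetch_new) {
  cache_file <- file.path(outdir, "LC.csv")
  cached <- if (file.exists(cache_file)) utils::read.csv(cache_file) else NULL
  missing_ids <- setdiff(site_ids, cached$site_id)
  fresh <- if (length(missing_ids) > 0) fetch_new(missing_ids) else NULL
  combined <- rbind(cached, fresh)
  utils::write.csv(combined, cache_file, row.names = FALSE)
  combined
}
```

Spelling this out in the Roxygen Details (as the reviewer asks) would make the otherwise surprising "reads a CSV it previously wrote" behavior discoverable.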

#' @examples
#' @author Dongchen Zhang
#' @importFrom magrittr %>%
Prep_AGB_IC_from_geotiff <- function(site_info, paths.list, ens) {
Member:

This function appears to make a lot of product-specific assumptions and isn't general to ANY geotiff containing AGB data. At this point I'd recommend just changing the name to something that's data product specific, and then when you encounter other geotiffs with differently formatted AGB data you can think more explicitly about how to generalize this function (e.g. reading in metadata on variable names, projection, etc.) versus writing different functions for different products.

DongchenZ (Contributor Author):

Fixed.

@DongchenZ DongchenZ requested a review from mdietze May 16, 2024 20:52
3 participants