
Adding functions and scripts for downloading, extracting, and processing observations, initial conditions, land cover types, and ERA5 drivers for anchor sites within NA. #3278

Status: Open. 72 commits into base: develop.
Conversation

@DongchenZ (Contributor) commented Mar 12, 2024

Description

This PR includes:

  1. Script for preparing required data sets (ERA5; AGB, LAI, SMAP, and SOC; land cover) for anchor sites.
  2. Script for preparing initial conditions (AGB, LAI, soil moisture, SOC) for anchor sites.
  3. Updated function for searching ecoregions within NA.
  4. Functions for downloading and extracting soil moisture from the CDS server.
  5. Function for downloading and extracting MODIS land cover products.
  6. Function for extracting AGB initial conditions from GeoTIFF files.
  7. Function for extracting ISCN SOC from an existing Rdata file.

Motivation and Context

Review Time Estimate

  • Immediately
  • Within one week
  • When possible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • My name is in the list of CITATION.cff
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@infotroph (Member) commented Mar 13, 2024

As currently structured, this PR adds more than 100 megabytes of data files to the package, which is far outside the norm for R package size -- CRAN wants special justification for anything with more than 5 MB of data or more than 10 MB for the whole package. Do the shapefiles and Rdata files really need to be distributed as part of data.land, or can they be stored elsewhere and read in as needed?

@mdietze (Member) left a comment

For the process of data prep for large-scale SDA runs, there needs to be better overall documentation of how one does that. Even if someone else found these functions and scripts, they wouldn't know how to run them without some sort of README.md. Indeed, for prep scripts that only make sense to run manually (as opposed to cron jobs or some other automation) you might consider writing them as Rmd files.

#prepare observations
settings <- PEcAn.settings::read.settings("/projectnb/dietzelab/dongchen/anchorSites/SDA/pecan.xml")
settings$state.data.assimilation$Obs_Prep$outdir <- "/projectnb/dietzelab/dongchen/anchorSites/Obs"
obs <- PEcAnAssimSequential::SDA_OBS_Assembler(settings)
Member:

Is this one step (3 lines of code) loading the full list of all of your data constraints? That needs better documentation. Also, why is settings being loaded a second time? And why is one of the paths being manually rewritten? Also, what script is building the settings file to begin with? I expected you to be starting from either a file of sites (id, lat, lon, etc) or a query of a site group.

DongchenZ (Contributor Author):

Here, we need to create the old pecan.xml file using the Create_multi_settings.R script for the anchor site group on the Bety DB (group ID 1000000033). The reason is that the Bety DB is currently down, and it's challenging to create new records in it. Therefore, we first need to grab what we have previously, iteratively add new sites to the site info, and write them into the new pecan.xml file.
Once we have every site prepared in Bety, we can easily create the pecan.xml file just by using the Create_multi_settings.R script and ignoring the chunk that pulls new sites into the existing database.

DongchenZ (Contributor Author):

I also left this comment on the script so people will know how to deal with it appropriately.

obs <- PEcAnAssimSequential::SDA_OBS_Assembler(settings)

#prepare LC and PFTs
LC <- PEcAn.data.remote::MODIS_LC_prep(site_info = site_info, time_points = settings$state.data.assimilation$start.date, qc.filter = T)
Member:

Is "LC" land cover?

DongchenZ (Contributor Author):

Yes. I replaced every "LC" with "land cover" in the comments.

#'
#' @examples
#' @author Dongchen Zhang
#' @importFrom magrittr %>%
@mdietze (Member):

Is there a reason you need to use magrittr pipes over R native pipes |> ? If yes, I think I saw recent PRs where @infotroph imported from dplyr instead of magrittr (presumably to reduce dependencies?)

@infotroph (Member):

@mdietze you're right that I switched data.atmosphere to import from dplyr, and yes it was to reduce dependencies, but looks like the existing pipe imports in data.land are still from magrittr. I support keeping this one from magrittr and switching in a separate PR.

As for |> vs %>%, I've still been using %>% in packages that are already using it, just for consistency. I don't object to switching to |> in new code -- when we do start using native pipes in a given package, we should add Depends: R (>= 4.1) to its DESCRIPTION.

DongchenZ (Contributor Author):

Reverted back to dplyr
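For reference, the two pipe styles discussed above look like this; a minimal sketch, assuming nothing beyond base R (the native `|>` pipe requires R >= 4.1, which would be declared in DESCRIPTION as `Depends: R (>= 4.1)`):

```r
# Native pipe: no package dependency, but needs R >= 4.1.
x <- c(3, 1, 2)
sorted_native <- x |> sort()

# The magrittr-style equivalent would be `x %>% sort()`, which requires
# importing %>% from magrittr or dplyr, e.g. `#' @importFrom dplyr %>%`
# in the Roxygen header.
sorted_native
```

As infotroph notes, the main trade-off is consistency with a package's existing pipe imports versus dropping a dependency.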

)
})
#check if the token exists for the cdsapi.
if (!file.exists(file.path(Sys.getenv("HOME"), ".cdsapirc")))
Member:

Likewise, you definitely need to document the need to set up an API key as part of the function documentation. All these error messages are great, but it would be super frustrating to try to use this function in practice, as it would just keep giving you different error messages until you finally got it working. I suspect most users would give up before they got through all the dependencies.

DongchenZ (Contributor Author):

Fixed.
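One way to address the reviewer's frustration point is a single up-front pre-flight check that reports every missing prerequisite at once, rather than failing one dependency at a time. A base-R sketch, where the helper name is hypothetical and `message()` stands in for PEcAn.logger calls; the `~/.cdsapirc` token path is the one used by the PR:

```r
# Hypothetical pre-flight helper: collect all missing setup steps for the
# CDS API and report them together before attempting any download.
check_cds_setup <- function(home = Sys.getenv("HOME")) {
  problems <- character(0)
  token_file <- file.path(home, ".cdsapirc")
  if (!file.exists(token_file)) {
    problems <- c(problems, paste0(
      "No CDS API token found at ", token_file,
      ". Register on the CDS website and save your API key there."))
  }
  # Further checks (python available, cdsapi module importable, ...)
  # would append to `problems` here.
  if (length(problems) > 0) {
    message(paste(problems, collapse = "\n"))
    return(FALSE)
  }
  TRUE
}
```

Running this once at the top of `download.SM_CDS` would surface all setup requirements in a single message.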

@@ -0,0 +1,117 @@
#read settings.
Member:

Similar to the other script, this needs better overall documentation.

DongchenZ (Contributor Author):

Fixed.

modules/data.remote/R/MODIS_LAI_prep.R (resolved)
@@ -0,0 +1,41 @@
#' Extract ensemble above-ground biomass density from pre-existing GeoTIFF file for the SDA workflow.
Member:

Ok, so presumably this isn't a generic approach to reading any AGB geotiff (of which there are many in the world), but something for working with some specific product? That needs to be well documented, including being clear in the function name.

DongchenZ (Contributor Author):

fixed.

"Please make sure it is installed to a location accessible to `reticulate`.",
"You should be able to install it with the following command: ",
"`pip install --user cdsapi`.",
"The following error was thrown by `reticulate::import(\"cdsapi\")`: ",
Member:

This sort of dependency on a python library needs to be documented in your Roxygen metadata.

I also found this article on managing R library dependencies on python libraries: https://cran.r-project.org/web/packages/reticulate/vignettes/python_dependencies.html
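As a sketch of what that Roxygen documentation might look like (the exact wording is illustrative, not the PR's actual text; the `pip install --user cdsapi` command and `~/.cdsapirc` token file are from the PR's own error messages):

```r
#' @details
#' This function requires the python package \code{cdsapi}, accessed through
#' \code{reticulate}. Install it with \code{pip install --user cdsapi} and
#' place your CDS API token in \code{~/.cdsapirc} before calling this
#' function.
#'
#' @section Python dependencies:
#' See the reticulate vignette on managing python dependencies for options
#' such as declaring the requirement in DESCRIPTION's \code{SystemRequirements}.
```

This keeps the setup instructions visible in the rendered help page rather than buried in runtime error messages.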

settings <- PEcAn.settings::read.settings("/projectnb/dietzelab/dongchen/anchorSites/SDA/pecan.xml")

in.path <- "/projectnb/dietzelab/hamzed/ERA5/Data/Ensemble"
out.path <- "/projectnb/dietzelab/dongchen/anchorSites//ERA5_2012_2021"
Member:

Since this is a stand-alone script, I think it would be good to include an overall introductory comment block explaining what the purpose of the script is and how to use it, and then additional comments on variables like these (settings, in.path, out.path) explaining what they should be.

Also, is there something here that's specific to anchor sites or would this script do data prep for any set of sites (in North America? in the world?) and the initial list of sites just happens to be anchor sites?

DongchenZ (Contributor Author):

  1. I replaced it with a well-documented Rmd file.
  2. It's preparing data for the 342 NA anchor sites, which combine the Bety site group (1000000033) and some sites in the Google Sheet (https://docs.google.com/spreadsheets/d/1n7pVUcrYrB0S8bqrj77tUNrLHA2c_yqzkZL8mgdVDjs/edit#gid=0). I also left comments with this information.

```{r}
#filter based on NA boundary, land cover.
#boundary
site_eco <- PEcAn.data.land::EPA_ecoregion_finder(pre_site_info$lat, pre_site_info$lon)
Member:

Ok for this PR, but we should update to make sure we're not excluding Central America, the Caribbean, and Hawaii (can discuss the latter, since it probably falls outside your ERA5 box). Offline we discussed alternative ecoregion maps for Central and South America.

#' @author Dongchen Zhang
#' @importFrom dplyr %>%
download.SM_CDS <- function(outfolder, time.points, overwrite = FALSE, auto.create.key = FALSE) {
###################################Introduction on how to play with the CDS python API##########################################
Member:

I'd recommend moving all of this instruction text out of the function and into the Roxygen as part of the Details section.

DongchenZ (Contributor Author):

Fixed.

#' @param in.path Physical path to where the unzipped soil moisture files are downloaded.
#' @param out.path Where the final CSV file will be stored.
#' @param allow.download Flag determining whether to automatically download files if they are not available.
#' @param search_window Search window for locating available soil moisture values.
Member:

not clear what units this is in. pixels? km? degrees? if pixels, include a reminder of how big a pixel is for this data product.

DongchenZ (Contributor Author):

It's the time search window (in days). Because it takes ~10 days for this product to achieve global coverage, we need to account for that to grab sufficient estimates for all locations we are interested in.

DongchenZ (Contributor Author):

Updated the documentation to make it more clear.
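Given the author's clarification that `search_window` is temporal (in days), the lookup it describes can be sketched in base R; this is an illustrative stand-in, not the PR's actual implementation, and the function name is hypothetical:

```r
# Hypothetical sketch: given a target date, return the nearest date with an
# available file inside a +/- search_window-day window. Since the product
# needs ~10 days for full global coverage, 10 is a natural default.
nearest_available_date <- function(target, available, search_window = 10) {
  target <- as.Date(target)
  available <- as.Date(available)
  in_window <- available[abs(as.numeric(available - target)) <= search_window]
  if (length(in_window) == 0) return(NA)  # nothing close enough in time
  in_window[which.min(abs(as.numeric(in_window - target)))]
}
```

Documenting the units this concretely (days, with the ~10-day coverage rationale) is exactly what the reviewer was asking for.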

PEcAn.logger::logger.info("Try download from cds server.")
ncfile <- c(ncfiles[dates.ind.exist], PEcAn.data.land::download.SM_CDS(in.path, dates[dates.ind.download])) %>% sort()
} else {
PEcAn.logger::logger.severe("The download is not enabled, skip to the next time point.")
Member:

logger.severe won't skip to the next time point, it will kill the whole R job

DongchenZ (Contributor Author):

Replaced it with logger.info.
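The distinction matters because `logger.severe` terminates the whole R session, so "skip to the next time point" needs an informational log plus an explicit `next`. A toy base-R sketch of the intended control flow, with `message()` standing in for `PEcAn.logger::logger.info` and made-up availability flags:

```r
# Skip-and-continue semantics: log the problem, then `next` to the
# following iteration instead of killing the whole job.
processed <- c()
time_points <- c("2012-07-15", "2013-07-15", "2014-07-15")
download_enabled <- c(TRUE, FALSE, TRUE)  # toy flags for illustration
for (i in seq_along(time_points)) {
  if (!download_enabled[i]) {
    message("Download not enabled, skipping time point ", time_points[i])
    next  # continue with the remaining time points
  }
  processed <- c(processed, time_points[i])
}
```

With `logger.severe` in place of the `message()`/`next` pair, the loop would never reach the third time point.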

@@ -0,0 +1,158 @@
---
title: "Initial condition prep script for NA anchor sites"
Member:

I feel like this script should probably be in the inst/anchor folder.

Also, as a general comment (not just this script), if you're going to do prep in Rmd instead of R, I'd recommend embedding more text-based descriptions and explanations. Think about the next cohort of folks to use these tools and what they need to know that's still in your head instead of written out explicitly.

On this point, also a reminder to think about what in this PR also needs to be added to the general pecan documentation and change logs. For example, there are whole new sources of input data that need to be included. Any new pecan.xml tags also need to be documented.

DongchenZ (Contributor Author):

Fixed.

@@ -5,21 +5,21 @@
#' @param outdir Where the final CSV file will be stored.
#' @param search_window Search window for locating available LAI values.
#' @param export_csv Decide if we want to export the CSV file.
#' @param skip_high_sd Whether to skip observations with high standard error.
Member:

Rather than a boolean, what if you instead passed in the SD threshold you want to use and just set the default really high (e.g. 100) so that it's essentially "off" by default. Alternatively, you could set it to the current threshold and have it "on" by default (SD of 10, which implies a CI of ~40, on a variable that goes from 0-6 is already very wide)

DongchenZ (Contributor Author):

I added an sd_threshold parameter that defaults to NULL, which can be used to filter out the data.
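The resolution the author describes can be sketched in a few lines of base R; the function and column names here are illustrative, not the PR's actual code:

```r
# Hypothetical filter: sd_threshold = NULL (the default) disables
# filtering entirely; any numeric value drops observations whose
# standard deviation exceeds it.
filter_by_sd <- function(obs, sd_threshold = NULL) {
  if (is.null(sd_threshold)) return(obs)
  obs[obs$sd <= sd_threshold, , drop = FALSE]
}
```

This follows the reviewer's suggestion of a numeric threshold rather than a boolean, with NULL playing the role of the "really high" off switch.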

#' @param site_info Bety list of site info including site_id, lon, and lat.
#' @param time_points A vector containing each time point within the start and end date.
#' @param outdir Where the final CSV file will be stored.
#' @param export_csv Decide if we want to export the CSV file.
Member:

these two tags seem redundant. If outdir == NULL then you don't want to export the CSV. That said, I'd expand the explanation text to make that point explicit (i.e. function returns data object by default, outdir is NULL by default, but if you want to export to a CSV you specify the path here)

DongchenZ (Contributor Author):

Fixed.
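The convention the reviewer proposes, returning the data object by default and writing a CSV only when a path is given, can be sketched as follows (names hypothetical):

```r
# The function always returns the data object; exporting a CSV is a side
# effect that happens only when `outdir` is non-NULL (NULL is the default).
prep_data <- function(dat, outdir = NULL) {
  if (!is.null(outdir)) {
    utils::write.csv(dat, file.path(outdir, "data.csv"), row.names = FALSE)
  }
  dat
}
```

This removes the redundant `export_csv` flag: `outdir = NULL` means no export, and a supplied path means export there.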

#' @param time_points A vector containing each time point within the start and end date.
#' @param outdir Where the final CSV file will be stored.
#' @param export_csv Decide if we want to export the CSV file.
#' @param qc.filter Decide if we want to filter data by the QC band.
Member:

  1. recommend qc_filter as the variable name
  2. similar to earlier point, rather than having this be a boolean, better would be to pass in the requested QC flag (e.g. set this to qc_filter = c("000", "001") by default) and then say set to NULL or FALSE or something like that to turn off QC filtering.

DongchenZ (Contributor Author):

fixed.
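The reviewer's second point, passing the accepted QC flags instead of a boolean, can be sketched like this (function and column names illustrative; the default flags follow the reviewer's suggestion):

```r
# Hypothetical QC handling: `qc_filter` holds the accepted QC flag strings;
# NULL turns filtering off entirely.
apply_qc_filter <- function(dat, qc_filter = c("000", "001")) {
  if (is.null(qc_filter)) return(dat)
  dat[dat$qc %in% qc_filter, , drop = FALSE]
}
```

Compared with a boolean, this lets callers tighten or loosen the accepted flags without touching the function.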

if (!is.null(outdir)) {
if(file.exists(file.path(outdir, "LC.csv"))){
PEcAn.logger::logger.info("Extracting previous MODIS Land Cover file!")
Previous_CSV <- utils::read.csv(file.path(outdir, "LC.csv"))
Member:

This sort of cache behavior is unintuitive and isn't documented. Please explain this in the Roxygen

DongchenZ (Contributor Author):

fixed.
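The cache behavior under review, reuse a previous LC.csv and only fetch sites it doesn't cover, can be made explicit in a sketch like this. The helper name and `fetch_new` callback are hypothetical stand-ins for the real MODIS extraction:

```r
# Sketch of the cache pattern: if LC.csv already exists in outdir, reuse
# its rows and only fetch land cover for site ids not already covered,
# then write the combined table back as the new cache.
load_or_fetch_lc <- function(site_ids, outdir, fetch_new) {
  cache_file <- file.path(outdir, "LC.csv")
  cached <- if (file.exists(cache_file)) utils::read.csv(cache_file) else NULL
  missing_ids <- setdiff(site_ids, cached$site_id)
  fresh <- if (length(missing_ids) > 0) fetch_new(missing_ids) else NULL
  combined <- rbind(cached, fresh)
  utils::write.csv(combined, cache_file, row.names = FALSE)
  combined
}
```

Spelling this out in the Roxygen Details (as the reviewer asks) would make the otherwise surprising "reads a CSV it previously wrote" behavior discoverable.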

#' @examples
#' @author Dongchen Zhang
#' @importFrom magrittr %>%
Prep_AGB_IC_from_geotiff <- function(site_info, paths.list, ens) {
Member:

This function appears to make a lot of product-specific assumptions and isn't general to ANY geotiff containing AGB data. At this point I'd recommend just changing the name to something that's data product specific, and then when you encounter other geotiffs with differently formatted AGB data you can think more explicitly about how to generalize this function (e.g. reading in metadata on variable names, projection, etc.) versus writing different functions for different products.

DongchenZ (Contributor Author):

Fixed.

@DongchenZ DongchenZ requested a review from mdietze May 16, 2024 20:52
3 participants