Adding functions and scripts for downloading, extracting, and processing observations, initial conditions, land cover types, ERA5 drivers for anchor sites within NA. #3278
base: develop
Conversation
…ecoregion codes (L1 and L2).
As currently structured this PR adds more than 100 megabytes of data files to the package, which is far outside the norm for R package size -- CRAN wants special justification for anything with more than 5 MB of data or more than 10 MB for the whole package. Do the shapefiles and Rdata files really need to be distributed as part of data.land, or can they be stored elsewhere and read in as needed?
For the process of data prep for large-scale SDA runs, there needs to be better overall documentation of how one does that. Even if someone else found these functions and scripts, they wouldn't know how to run them without some sort of README.md. Indeed, for prep scripts that only make sense to run manually (as opposed to cron jobs or some other automation) you might consider writing them as Rmd files.
```r
#prepare observations
settings <- PEcAn.settings::read.settings("/projectnb/dietzelab/dongchen/anchorSites/SDA/pecan.xml")
settings$state.data.assimilation$Obs_Prep$outdir <- "/projectnb/dietzelab/dongchen/anchorSites/Obs"
obs <- PEcAnAssimSequential::SDA_OBS_Assembler(settings)
```
Is this one step (3 lines of code) loading the full list of all of your data constraints? That needs better documentation. Also, why is settings being loaded a second time? And why is one of the paths being manually rewritten? Also, what script is building the settings file to begin with? I expected you to be starting from either a file of sites (id, lat, lon, etc) or a query of a site group.
Here, we need to create the old pecan.xml file using the Create_multi_settings.R script for the anchor site group on the Bety DB (group ID 1000000033). The reason is that the Bety DB is currently down, and it's challenging to create new records within it. Therefore, we first need to grab what we have previously, iteratively add new sites to the site info, and write them into the new pecan.xml file.
Once every site is prepared in Bety, we can easily create the pecan.xml file by just using the Create_multi_settings.R script and ignore the chunk that pulls new sites into the existing database.
I also left this comment on the script so people will know how to deal with it appropriately.
```r
obs <- PEcAnAssimSequential::SDA_OBS_Assembler(settings)

#prepare LC and PFTs
LC <- PEcAn.data.remote::MODIS_LC_prep(site_info = site_info, time_points = settings$state.data.assimilation$start.date, qc.filter = T)
```
Is "LC" land cover?
Yes, I replaced every "LC" with "land cover" in the comments.
modules/data.land/R/ISCN_extract.R
Outdated
```r
#'
#' @examples
#' @author Dongchen Zhang
#' @importFrom magrittr %>%
```
Is there a reason you need to use magrittr pipes over R native pipes |> ? If yes, I think I saw recent PRs where @infotroph imported from dplyr instead of magrittr (presumably to reduce dependencies?)
@mdietze you're right that I switched data.atmosphere to import from dplyr, and yes it was to reduce dependencies, but looks like the existing pipe imports in data.land are still from magrittr. I support keeping this one from magrittr and switching in a separate PR.
As for |> vs %>%, I've still been using %>% in packages that are already using it, just for consistency. I don't object to switching to |> in new code -- when we do start using native pipes in a given package, we should add `Depends: R (>= 4.1)` to its DESCRIPTION.
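For illustration, a minimal sketch of the two styles under discussion (generic base R, not PEcAn code):

```r
# The magrittr pipe (%>%) requires an @importFrom from magrittr or dplyr;
# the native pipe (|>) is built into R >= 4.1 and needs no import, but
# bumps the package's minimum R version.
x <- c(1, 4, 9)

# native pipe: no package dependency
total_native <- x |> sqrt() |> sum()

# equivalent without any pipe, for packages that must support older R
total_base <- sum(sqrt(x))
```

Both expressions compute the same value; the trade-off is purely dependencies vs. minimum R version.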
Reverted back to dplyr
```r
  )
})
#check if the token exists for the cdsapi.
if (!file.exists(file.path(Sys.getenv("HOME"), ".cdsapirc")))
```
Likewise, you definitely need to document the need to set up an api key as part of the function documentation. All these error messages are great, but it would be super frustrating to try to use this function in practice as it would just keep giving you different error messages until you finally got it working. I suspect most users would give up before they got through all the dependencies.
Fixed.
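One way to reduce that frustration is a single up-front check that tells the user everything they need to do. A hedged sketch (the function name is hypothetical, not PEcAn API; check the Copernicus CDS documentation for the current `url:` value):

```r
# Sketch: fail early with one consolidated message if the CDS API key file
# is missing, rather than surfacing dependency errors one at a time.
check_cdsapirc <- function(home = Sys.getenv("HOME")) {
  rc <- file.path(home, ".cdsapirc")
  if (!file.exists(rc)) {
    stop(
      "CDS API credentials not found at ", rc, ".\n",
      "Register at the Copernicus Climate Data Store, then create the file with:\n",
      "  url: <CDS API endpoint>\n",
      "  key: <your personal access token>",
      call. = FALSE
    )
  }
  invisible(rc)
}
```

Pointing the same text into the Roxygen documentation means users see the setup steps before their first failed run.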
@@ -0,0 +1,117 @@
```r
#read settings.
```
similar to other script, need better overall documentation
Fixed.
@@ -0,0 +1,41 @@
```r
#' Extract above ensemble ground biomass density from pre-existing GeoTIFF file for the SDA workflow.
```
Ok, so presumably this isn't a generic approach to reading any AGB geotiff (of which there are many in the world), but something for working with some specific product? That needs to be well documented, including being clear in the function name.
fixed.
```r
"Please make sure it is installed to a location accessible to `reticulate`.",
"You should be able to install it with the following command: ",
"`pip install --user cdsapi`.",
"The following error was thrown by `reticulate::import(\"cdsapi\")`: ",
```
This sort of dependency on a python library needs to documented in your Roxygen metadata.
I also found this article on managing R library dependencies on python libraries: https://cran.r-project.org/web/packages/reticulate/vignettes/python_dependencies.html
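A hedged sketch of what that Roxygen metadata could look like (exact wording is up to the author; the reticulate vignette linked above describes the recommended patterns in detail):

```r
#' @details
#' This function requires the Python package `cdsapi` to be importable by
#' `reticulate` (e.g. installed via `pip install --user cdsapi`), plus a
#' valid `~/.cdsapirc` credentials file from the Copernicus Climate Data
#' Store. See the package README for setup instructions.
```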
```r
settings <- PEcAn.settings::read.settings("/projectnb/dietzelab/dongchen/anchorSites/SDA/pecan.xml")

in.path <- "/projectnb/dietzelab/hamzed/ERA5/Data/Ensemble"
out.path <- "/projectnb/dietzelab/dongchen/anchorSites//ERA5_2012_2021"
```
Since this is a stand-alone script, I think it would be good to include an overall introductory comment block explaining what the purpose of the script is and how to use it, and then additional comments on variables like these (settings, in.path, out.path) explaining what they should be.
Also, is there something here that's specific to anchor sites or would this script do data prep for any set of sites (in North America? in the world?) and the initial list of sites just happens to be anchor sites?
- I replaced it with a well-documented Rmd file.
- It's preparing data for the 342 NA anchor sites that combine the Bety site group (1000000033) and some sites in the google sheet (https://docs.google.com/spreadsheets/d/1n7pVUcrYrB0S8bqrj77tUNrLHA2c_yqzkZL8mgdVDjs/edit#gid=0). I also left comments with this information.
… download is problematic.
…r MODIS LAI extraction.
```r
#filter based on NA boundary, land cover.
#boundary
site_eco <- PEcAn.data.land::EPA_ecoregion_finder(pre_site_info$lat, pre_site_info$lon)
```
Ok for this PR, but we should update to make sure we're not excluding Central America, the Caribbean, and Hawaii (we can discuss the latter, since it probably falls outside your ERA5 box). Offline we discussed alternative ecoregion maps for Central and South America.
```r
#' @author Dongchen Zhang
#' @importFrom dplyr %>%
download.SM_CDS <- function(outfolder, time.points, overwrite = FALSE, auto.create.key = FALSE) {
  ###################################Introduction on how to play with the CDS python API##########################################
```
I'd recommend moving all of this instruction text out of the function and into the ROxygen as part of the Details section
Fixed.
modules/data.land/R/extract_SM_CDS.R
Outdated
```r
#' @param in.path physical paths to where the unziped soil moisture files are downloaded.
#' @param out.path Where the final CSV file will be stored.
#' @param allow.download Flag determine if we want to automatic download files if they are not available.
#' @param search_window search window for locate available soil moisture values.
```
not clear what units this is in. pixels? km? degrees? if pixels, include a reminder of how big a pixel is for this data product.
It's the time search window (in days). Because it takes ~10 days for this product to achieve global coverage, we need to account for that to grab sufficient estimates for all locations of interest.
Updated the documentation to make it more clear.
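Given that reply, the matching logic is day-based; a hypothetical sketch of what the window selection amounts to (names are illustrative, not the actual extract_SM_CDS internals):

```r
# Hypothetical: given a target date, collect candidate observation dates
# within +/- search_window days, since the product needs ~10 days to
# accumulate global coverage.
dates_in_window <- function(target, available, search_window = 10) {
  target <- as.Date(target)
  available <- as.Date(available)
  available[abs(as.numeric(available - target)) <= search_window]
}
```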
modules/data.land/R/extract_SM_CDS.R
Outdated
```r
PEcAn.logger::logger.info("Try download from cds server.")
ncfile <- c(ncfiles[dates.ind.exist], PEcAn.data.land::download.SM_CDS(in.path, dates[dates.ind.download])) %>% sort()
} else {
  PEcAn.logger::logger.severe("The download is not enabled, skip to the next time point.")
```
logger.severe won't skip to the next time point, it will kill the whole R job
Replaced it with logger.info.
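For clarity, the control-flow point here is that a severe/fatal log call terminates the whole R job, so "skip to the next time point" needs a non-fatal message plus `next`. A minimal self-contained sketch (using `message()` as a stand-in for `PEcAn.logger::logger.info`, and illustrative variable names):

```r
time_points <- as.Date(c("2012-07-15", "2013-07-15", "2014-07-15"))  # illustrative
download_enabled <- FALSE
processed <- as.Date(character(0))

for (t in seq_along(time_points)) {
  if (!download_enabled) {
    message("Download not enabled; skipping time point ", time_points[t])
    next  # continue with the remaining time points instead of aborting the job
  }
  processed <- c(processed, time_points[t])
}
```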
@@ -0,0 +1,158 @@
```
---
title: "Initial condition prep script for NA anchor sites"
```
I feel like this script should probably be in the inst/anchor folder.
Also, as a general comment (not just this script), if you're going to do prep in Rmd instead of R, I'd recommend embedding more text-based descriptions and explanations. Think about the next cohort of folks to use these tools and what they need to know that's still in your head instead of written out explicitly.
On this point, also a reminder to think about what in this PR also needs to be added to the general pecan documentation and change logs. For example, there's whole new sources of input data that need to be included. Any new pecan.xml tags also need to be documented
Fixed.
@@ -5,21 +5,21 @@
```r
#' @param outdir Where the final CSV file will be stored.
#' @param search_window search window for locate available LAI values.
#' @param export_csv Decide if we want to export the CSV file.
#' @param skip_high_sd if we want to skip observations with high standard error.
```
Rather than a boolean, what if you instead passed in the SD threshold you want to use and just set the default really high (e.g. 100) so that it's essentially "off" by default. Alternatively, you could set it to the current threshold and have it "on" by default (SD of 10, which implies a CI of ~40, on a variable that goes from 0-6 is already very wide)
I added a `sd_threshold` argument that defaults to NULL, which can be used to filter out the data.
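A minimal sketch of that interface, assuming a data frame with an `sd` column (illustrative names, not the actual MODIS_LAI_prep code):

```r
# sd_threshold = NULL (default) disables filtering; a numeric value
# drops observations whose standard deviation exceeds it.
filter_by_sd <- function(obs, sd_threshold = NULL) {
  if (is.null(sd_threshold)) return(obs)
  obs[obs$sd <= sd_threshold, , drop = FALSE]
}
```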
```r
#' @param site_info Bety list of site info including site_id, lon, and lat.
#' @param time_points A vector contains each time point within the start and end date.
#' @param outdir Where the final CSV file will be stored.
#' @param export_csv Decide if we want to export the CSV file.
```
these two tags seem redundant. If outdir == NULL then you don't want to export the CSV. That said, I'd expand the explanation text to make that point explicit (i.e. function returns data object by default, outdir is NULL by default, but if you want to export to a CSV you specify the path here)
Fixed.
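The suggested simplification amounts to letting `outdir` carry both meanings; a hypothetical sketch (the helper name and `obs.csv` filename are illustrative):

```r
# outdir = NULL (default): return the data object only, write nothing.
# outdir = <path>: also export <outdir>/obs.csv.
maybe_export <- function(dat, outdir = NULL) {
  if (!is.null(outdir)) {
    utils::write.csv(dat, file.path(outdir, "obs.csv"), row.names = FALSE)
  }
  dat
}
```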
```r
#' @param time_points A vector contains each time point within the start and end date.
#' @param outdir Where the final CSV file will be stored.
#' @param export_csv Decide if we want to export the CSV file.
#' @param qc.filter decide if we want to filter data by the qc band.
```
- recommend qc_filter as the variable name
- similar to earlier point, rather than having this be a boolean, better would be to pass in the requested QC flag (e.g. set this to qc_filter = c("000", "001") by default) and then say set to NULL or FALSE or something like that to turn off QC filtering.
fixed.
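A sketch of the suggested interface, assuming a data frame with a `qc` column of flag strings (illustrative names, not the actual implementation):

```r
# qc_filter takes the accepted QC flags directly (e.g. c("000", "001"));
# NULL turns QC filtering off entirely.
filter_by_qc <- function(dat, qc_filter = c("000", "001")) {
  if (is.null(qc_filter)) return(dat)
  dat[dat$qc %in% qc_filter, , drop = FALSE]
}
```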
```r
if (!is.null(outdir)) {
  if(file.exists(file.path(outdir, "LC.csv"))){
    PEcAn.logger::logger.info("Extracting previous MODIS Land Cover file!")
    Previous_CSV <- utils::read.csv(file.path(outdir, "LC.csv"))
```
This sort of cache behavior is unintuitive and isn't documented. Please explain this in the Roxygen
fixed.
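The cache pattern being documented is simple enough to state in one helper; a hypothetical sketch (function name is illustrative):

```r
# If <outdir>/LC.csv exists, previously extracted rows are reused and only
# new sites need fetching; otherwise NULL signals a cold start. This is the
# behavior that should be spelled out in the Roxygen docs.
load_lc_cache <- function(outdir) {
  cache <- file.path(outdir, "LC.csv")
  if (file.exists(cache)) utils::read.csv(cache) else NULL
}
```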
```r
#' @examples
#' @author Dongchen Zhang
#' @importFrom magrittr %>%
Prep_AGB_IC_from_geotiff <- function(site_info, paths.list, ens) {
```
This function appears to make a lot of product-specific assumptions and isn't general to ANY geotiff containing AGB data. At this point I'd recommend just changing the name to something that's data-product specific, and then when you encounter other geotiffs with differently formatted AGB data you can think more explicitly about how to generalize this function (e.g. reading in metadata on variable names, projection, etc) versus writing different functions for different products.
Fixed.
Description
This PR includes:
Motivation and Context
Review Time Estimate
Types of changes
Checklist: