
Proposal: Record time dimension in stars objects using CFtime package #674

Open

pvanlaake opened this issue Mar 13, 2024 · 6 comments

@pvanlaake
pvanlaake commented Mar 13, 2024

Pinging @mdsumner and @dblodgett for information and possible feedback.

Time in stars

stars currently uses a combination of POSIXct, Date and PCICt to represent the time dimension in raster data sets, predominantly (only?) those based on the NetCDF format. stars selects between these representations based on the "calendar" reported in the metadata of variables: "360_day" and "365_day" map to PCICt, the remaining calendars to POSIXct or Date.

Time information in many data sets is based on the udunits format, which is used in both the COARDS and CF Metadata Conventions, in common use for climate and weather data and other environmental data sets. That format is parsed in multiple locations, such as parse_netcdf_meta() in dimensions.R, read_mdim() in mdim.R, and .get_nc_time() in ncdf.R, with several helper functions for interpretation and writing. There are some differences in how the parsing is performed across these functions. As I see it, there are several issues with this:

  • A single file opened with the three options for reading in a CF-compliant NetCDF file (read_stars(), read_mdim(), read_ncdf()) may result in different time representations.
  • A unit of "days since ..." will be parsed to a Date in read_mdim(). The offset values, however, are doubles and any fractional part is dropped. This is not a hypothetical possibility; the CMIP5/6 standards require that the time coordinate is recorded for the middle of the observation period ("For time-mean data, a time coordinate value must be defined as the mid-point of the interval over which the average is computed. (More generally, this same rule applies whenever time-bounds are included: the time coordinate value should be the mean of the two time bounds.)"), so daily data is always recorded at 12 noon, monthly data at noon on day 15 for a month with 31 days, etc. There are also CMIP 6hr data sets that use "days since ..." with fractional offsets, so those data sets cannot be read with read_mdim() at all without loss of data integrity. Writing such data back to file will result in a non-compliant time representation.
  • For both POSIXct and PCICt the time information is represented in seconds, which potentially leads to a loss of information. Upon writing a stars object to file, the time representation may be different from the source. In add_units_attr() in mdim.R, for instance, an effort is made to select a time unit that gives integer offset values, so a CF-compliant daily data set with "days since ..." and the time coordinate at noon will become "hours since ...".
  • Both POSIXct and PCICt use 1970-01-01 00:00:00 as their origin and time information from the file is converted to that origin. With data representing periods prior to that origin the offsets are negative. This is a violation of CMIP5/6 standards which require that "all values of the time coordinate should be positive and monotonically increasing". Climate projection data uses 1850-01-01 as the start of the historical simulation period so there is lots of data out there where this may produce non-compliant results. (I am not sure if the COARDS and CF Metadata Conventions themselves allow negative offsets.)
  • The logic to identify coordinate variables that represent time is not guaranteed to find the time dimension in read_ncdf(). In .get_nc_time() time information is identified only if there is an "axis = T" attribute. While many data sets (but far from all) will indeed report axis attributes this is not a requirement in the COARDS and CF Metadata Conventions. Both conventions state that "the attribute axis may be attached to a coordinate variable and given one of the values X, Y, Z or T". Further, "a reference time coordinate is identifiable from its units string alone" and "optionally, the time coordinate may be indicated additionally by providing the standard_name attribute with an appropriate value, and/or the axis attribute with the value T" (note the "may be"s). Checking just for the attribute "units = ..." is more reliable as there is no default and no alternative encoding.
  • The julian calendar is not supported by POSIXct, Date or PCICt.
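The fractional-offset point can be illustrated in plain base R. The dates below are hypothetical, chosen to match the 1850-01-01 origin used by CMIP historical runs:

```r
# A CMIP-style daily mean is recorded at mid-interval: offset 45.5 in
# "days since 1850-01-01" means noon on 1850-02-15.
origin <- as.Date("1850-01-01")
t_noon <- as.POSIXct(origin, tz = "UTC") + 45.5 * 86400

# Truncating the fractional part, as happens when the offset is forced
# into a Date, silently discards the 12:00 information:
d <- origin + floor(45.5)

format(t_noon, "%Y-%m-%d %H:%M", tz = "UTC")  # "1850-02-15 12:00"
format(d)                                     # "1850-02-15"
```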

In summary, stars is not consistent in how "time" is interpreted and represented, and there may be a loss of data integrity.

Proposal

CFtime is a package that supports management and interpretation of the "time" dimension as defined in the CF Metadata Conventions (and thus the COARDS convention). All 9 defined calendars are supported, as well as all reasonable units (second, minute, hour, day, month, year). The package is pure R and has no dependencies outside of default R packages. The package does not read or write files itself and thus works with any file driver that can produce the required dimension data from files that store time information using the udunits format; RNetCDF and ncdf4 both work, and the GDAL drivers for NetCDF (including mdim) can easily be made to work with CFtime.

I see several advantages to using CFtime in stars:

  • CFtime is a relatively complete and accurate implementation of "time" information as defined by the CF Metadata Conventions (and COARDS), supporting all defined calendars and all reasonable time units.
  • CFtime can handle time dimension information for read_stars(), read_mdim() and read_ncdf(), providing consistency to the user of stars.
  • The code is much more concise than use of, especially, PCICt (see examples below).
  • Writing data back to file can use the exact same time definition as the source file.
  • CFtime has logic to determine completeness of time series that are not numerically equidistant. Many CMIP5/6 data sets of monthly, seasonal or yearly data use a "days since ..." unit because the udunits definition of units coarser than a day is not the same as that used in climate modeling, and use of the units "month" and "year" is discouraged. So monthly data are recorded as being between 29.5 and 31 units apart. CFtime will correctly interpret this as a complete time series, while regular_intervals() in dimensions.R reports it as irregular.
  • CFtime supports merging of time information from multiple source files.
  • CFtime can produce a logical vector for subsetting based on timestamps; this could support functions like slice.stars() using an index consisting of a character vector of length 1 or 2 with the (extreme) value(s of the range) to extract.
  • CFtime can produce calendar-aware factors that can make a function like aggregate.stars() more versatile and user-friendly. This functionality goes beyond what aggregate.stars() (currently) does, particularly with regard to epoch-based factors. Exposing this CFtime functionality through stars, or extending aggregate.stars() with these options, would give users more modeling options.
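As a minimal sketch of the advantages above, assuming the CFtime API referenced elsewhere in this issue (CFtime(), CFtimestamp(), CFfactor(); exact signatures may differ between versions):

```r
library(CFtime)

# A year of daily data on a 360_day calendar, with mid-interval
# offsets as prescribed by CMIP
cf <- CFtime("days since 2050-01-01", "360_day", 0:359 + 0.5)

# Calendar-aware timestamps, without forcing the data into POSIXct
head(CFtimestamp(cf))

# A calendar-aware factor, e.g. to drive aggregate.stars() by month;
# on this calendar every month has exactly 30 days
f <- CFfactor(cf, period = "month")
table(f)
```

Note that no conversion to POSIXct or Date happens anywhere in this sketch; the 360_day calendar is preserved throughout.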

Examples

Package CFtime is being integrated into package ncmeta; the dev version on GitHub already contains it. Time information is stored in the new extended attribute of the nc_meta object. In ncdf.R, the functions .get_time_meta(), .get_nc_time() and make_cal_time2() reduce to this:

.get_nc_time <- function(dimensions, coord_var, meta) {
    # Is there a CFtime instance in the extended metadata?
    tidx <- which(sapply(meta$extended$time, inherits, "CFtime"))
    if (length(tidx) > 0L) {
        # What is the name of the time dimension?
        tname <- meta$extended$name[[tidx]]
        # Is the CFtime instance tied to a coord_var of the variable?
        nidx <- which(simplify2array(coord_var) == tname)
        if (length(nidx) > 0L) {
            # If so, grab the CFtime instance and attach it to the time dimension
            time <- meta$extended$time[[tidx]]
            dimensions[[tname]] <- create_dimension(values = time, refsys = "CFtime")
        }
    }
    dimensions
}

In read_mdim(), encapsulated function create_units() becomes:

create_units <- function(x) {
    u = attr(x, "units")
    x = structure(x, units = NULL) # remove attribute
    if (is.null(u) || u %in% c("", "none")) return(x)

    cal <- if (!is.null(a <- attr(x, "attributes"))) a["calendar"] else NULL
    time <- try(CFtime::CFtime(u, cal), silent = TRUE)  # cheaply try if we can make CFtime
    if (inherits(time, "CFtime")) time + x              # if we have CFtime, add the offsets
    else x                                              # if not, fail gracefully
}

The same simplification can be made in read_stars(). Note that the call to get_pcict() is gone. The only refsys that will be created for the time dimension is CFtime (or something like that), making the code simpler in other places as well. If needed, a CFtime object can create a vector of POSIXct values if the calendar is compatible (standard, gregorian, proleptic_gregorian): posix_time <- CFtimestamp(time, asPOSIX = TRUE).

Implementation

Is there interest to evaluate the use of CFtime in stars?

If so, I can put in time. As developer of CFtime I know that package inside out and I can make any requisite changes to smooth the integration. For the basic read/write/print operations I can integrate that functionality fairly quickly. I will most likely need some support on extending functions like slice() and aggregate(), even if only on stress-testing modified code and sternly lecturing me on the tidyverse.

The ncmeta package will be updated in the near future. The CFtime package will likely need an update as well.

Any other areas of change that I have not identified could be included as well.

Thinking about a 2-3 month timeline to release to allow for additional features and stress-testing.

Happy to team up with other interested contributors.


@edzer
Member

edzer commented Mar 21, 2024

Is there interest to evaluate the use of CFtime in stars?

Absolutely!

@edzer
Member

edzer commented Mar 22, 2024

Do you also plan to provide as.POSIXct() and as.Date() methods for CFtime objects? This would be needed to make as.POSIXct.stars() work. I think that would be needed for integrating CF datasets with other time-related datasets; even if this is somewhat black magic going from "360_day" to POSIX time, it is needed, and I see having that as the main benefit of PCICt.

Date can have fractional values, by the way; if stars truncates them then that is an error:

> structure(19804.5, class = "Date") |> as.POSIXct()
[1] "2024-03-22 12:00:00 UTC"

@pvanlaake
Author

First the good bits:

CFtime already has function CFtimestamp() to produce POSIXct values. as.POSIXct.stars() would look like this:

as.POSIXct.stars = function(x, ...) {
    d = st_dimensions(x)
    for (i in seq_along(d)) {
        if (d[[i]]$refsys == "CFtime") {
            p = try(CFtime::CFtimestamp(d[[i]]$values, asPOSIX = TRUE), silent = TRUE)
            if (inherits(p, "try-error"))
                stop(paste("cannot create a POSIXct dimension from a non-compliant calendar:",
                           CFtime::CFcalendar(d[[i]]$values)))
            d[[i]] = create_dimension(values = p)
        }
    }
    structure(x, dimensions = d)
}

Assuming there will only be POSIXct and CFtime refsys values, the latter is converted to POSIXct and a new dimension created for it. The expensive expand_dimensions() call is no longer required. Compatible calendars are converted to POSIXct; incompatible calendars generate an error.

There are likely going to be other areas where the CFtime package needs an upgrade to support the functionality of stars - I will ensure that any such changes are made.

This supports conversion of a large proportion of NetCDF data sets out there, particularly re-analysis data such as ERA5 and other data sets of observational data. That includes climate projections that use the standard, gregorian or proleptic_gregorian calendars, but not any such data using any of the other CF calendars. But then again, climate projections are not supposed to be compared to other data sets on a daily basis anyway - read on.

Next the major issue:

CFtime does not support conversion between calendars (and it never will as long as I am maintaining the package). In fact, the package was conceived and developed because, in my opinion, PCICt is built on a wrong premise and invites users to do things they shouldn't be doing.

The non-standard calendars are used for climate projections and like data. That data is not intended to be assessed for individual absolute values and most certainly not for comparison of daily projections between different models, even when the calendars are the same. The projected data should also not be compared to observed data, they are not predictions after all. So when the PCICt package states that "PCICt does the work of normalizing the calendars and making points in time on seperate calendars cross comparable" it proposes to do something that was never intended by the data producers and neither does it make sense when considering the physics behind the modelling using these calendars. Rather than "normalize calendars", users should "normalize data" for individual data sets (a.k.a. computing anomalies or climatologies, which typically involves aggregating daily data to monthly or seasonal data over longer periods of time) and only then compare between different data sets or to observed data. The vignettes of the CFtime package elaborate this in more detail, with an example of the correct analysis procedure.

I realize that many users are looking for a way to compare data from different models and observations on a daily basis. I have just not heard a compelling case anywhere why that would be necessary or advisable, even for processing efficiency - I am open to being convinced if you have a case to present. Rather than accommodating the default argument of "put it all in a data.frame to analyze all data in one go" I would propose to explain why that is a bad idea. In more practical terms, I will happily write an extensive vignette explaining the issue and demonstrating how to do it the right way. What I will not do is fudge calendars to enable bad analyses.

Given this, please let me know if you prefer "black magic" or CFtime.

@edzer
Member

edzer commented Mar 25, 2024

So, replacing PCICt with CFtime will limit the functionality of stars, because the things you can now do (e.g. convert dates from a 360-day calendar to POSIXct date/times) are things you consider "black magic" that shouldn't be put in the hands of end users. I am all in favor of guiding users in what to do, but not so much of limiting what they can do - they will simply seek another way.

I see some parallel here with datum transformations, which are also approximate and black magic to most, but enable you to combine datasets associated with one datum with those associated with another. Approximate and all, but despite the black magic important to have, and part of accepted scientific practice (there is no alternative).

I see some of the necessity of being able to convert the funny (year_360, year_365) calendars to regular types (POSIXct, Date) as a consequence of R not having other time types (e.g. for year, or month-of-year, to which you would like to aggregate both) and of having strong tooling for POSIXct and Date (try methods(class = "POSIXct")); for instance, cut is a very convenient one to use before aggregate.
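The cut-then-aggregate pattern referred to here, in plain base R with synthetic data on the standard calendar:

```r
# 90 daily values starting 2024-01-01 on the standard calendar
t <- as.POSIXct("2024-01-01", tz = "UTC") + (0:89) * 86400
x <- sin((0:89) / 14)

# cut() bins the POSIXct axis into calendar months; aggregate() then
# computes a monthly mean. This tooling only works on POSIXct/Date,
# which is why conversion from the CF calendars comes up at all.
m <- cut(t, breaks = "month")
aggregate(x, by = list(month = m), FUN = mean)
```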

How would you advise an econometrist or insurance person to combine economic forecasts associated with a regular Date field with climate forecasts associated with a year_360 calendar? I probably agree with what's written in the vignette you point to, but the code mess that arises when working with raw arrays and other objects describing what the dimensions reflect is exactly what stars tries to prevent.

@pvanlaake
Author

I'm not sure that datum transformations would compare, unless people start producing their own transformation parameters. MAUP is a better analogue in my opinion, in that people may combine or compare things in inappropriate ways. But let's rest that discussion for now.

How about the following: stars will use CFtime for its internal management of the time dimension on any data set that uses COARDS or the CF Metadata Conventions to encode time. as.POSIXct.stars() will support conversion of non-standard calendars to POSIXct using PCICt (with a warning, I would suggest). Like this the full current functionality of stars is maintained, while achieving the consistency and compatibility provided by CFtime.

@edzer
Member

edzer commented Mar 26, 2024

I think that is a great idea!
