Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature]: Set (optional) weight threshold for averaging operations #531

Open
pochedls opened this issue Aug 15, 2023 · 6 comments
Open
Labels
type: enhancement New enhancement request work: complex

Comments

@pochedls
Copy link
Collaborator

Is your feature request related to a problem?

When xCDAT performs an averaging operation (spatial or temporal) with missing data, it assigns missing values a weight of zero (which is correct). But this can be misleading if some or the majority of data is missing. Imagine if data was missing in spring, fall, and summer (leaving only winter data). xCDAT would take the annual average and report a very cold annual average temperature. A user is usually aware of this kind of thing, but may miss it if a small number of grid cells are missing part of the dataset (or are missing anomaly values, which would be harder to recognize as "weird").

See the example below:

import xcdat as xc
import numpy as np
# load example dataset
fn = '/p/css03/esgf_publish/CMIP6/CMIP/NCAR/CESM2/amip/r1i1p1f1/Amon/tas/gn/v20190218/tas_Amon_CESM2_amip_r1i1p1f1_gn_195001-201412.nc'
ds = xc.open_dataset(fn)
# select 12 months of data (for some midlatitude grid cell over Asia)
ds = ds.isel(time=np.arange(0, 12), lat=[150], lon=[50]).squeeze()
print('range of seasonal cycle: ' + str(np.min(ds.tas.values)) + ' - ' + str(np.max(ds.tas.values)))
print('annual average: ' + str(ds.temporal.group_average('tas', freq='year').tas.values[0]))
# NaN out data over Jan - Nov
ds['tas'][0:11] = np.nan
print('annual average (with NaNs): ' + str(ds.temporal.group_average('tas', freq='year').tas.values[0]))

range of seasonal cycle: 260.47635 - 297.10477
annual average: 278.48889160156256
annual average (with NaNs Jan - Nov): 265.1025406006844

A similar situation can arise if a timestep of observations is missing part of the field during spatial averaging operations (e.g., missing the tropics, leading to too cold global temperatures).

Describe the solution you'd like

This would need to be mapped out more, but it might be useful to have a weight_threshold optional argument that allows the user to specify the minimum temporal/spatial weight needed to produce a value. For example weight_threshold=0.9 would require 90% of the spatial / temporal data within the spatial or temporal averaging window be present.

CDAT had similar functionality for temporal averaging calculations (see Specifying Data Coverage Criteria. I'm not sure if there was any similar functionality for spatial averaging.

Describe alternatives you've considered

No response

Additional context

No response

@pochedls pochedls added the type: enhancement New enhancement request label Aug 15, 2023
@lee1043
Copy link
Collaborator

lee1043 commented Aug 15, 2023

@pochedls thank you for posting this. I think this would be very important, especially when there is time-varying missing data in different locations, which is not uncommon in observation dataset. @gleckler1 handles observation datasets using xcdat for obs4mips, and this subject is going to be very relevant to his work. Datasets from obs4mips are used as reference datasets in PMP, thus this issue is also related to PMP.

@gleckler1
Copy link

gleckler1 commented Aug 16, 2023

@pochedls @lee1043 Thanks for bringing this up! It would be great to have some sort of weight_threshold ... is the thinking that if the threshold was not met an NaN would be given or an error would be raised, ? It would be great to have this for both time and space but my first choice would be time. Thanks again for thinking of this.

@taylor13
Copy link

Here's an exchange I had with Chris Golaz related to this in 2018 and addresses the question of an annual mean. It requires one to form monthly means, then if at least one month of data was available during each season, then an annual mean would be calculated. If all 3 months were missing in one or more seasons, the annual mean would be considered missing.

I'll also try to find my notes on using a centroid criteria for means of cyclical data (like the annual cycle). It gave the user more control over how much data could be missing in a situation like that.


Chris,

I think the seasonal climatology assumes the first month in the time-series is January, so that part needs to be generalized to handle different starting months.

happy coding,
Karl

On 5/23/18 3:05 PM, Chris Golaz wrote:
Karl,

Thanks for that. This algorithm makes perfect sense to me with reasonable choices for propagating missing values from monthly to seasonal and annual. Maybe I should code it up in Fortran to see how much faster than Python it can be...

As far as I can tell, the extra complication in cdutil comes from the added flexibility on how missing values propagate up in the average (with the option of specifying a threshold and centroid).

-Chris

On 05/23/2018 02:43 PM, Karl Taylor wrote:
what, you aren't at NOAA?
Here's what I sent:

-------- Forwarded Message --------
Subject: Fwd: climatologies
Date: Wed, 23 May 2018 11:27:25 -0700
From: Karl Taylor taylor13@llnl.gov
To: Chris Golaz chris.golaz@noaa.gov

Hi Chris,

I can't find notes from 2001, but below I've copied a suggestion I made for computing climatologies for use with the metric package. The pseudo-code can handle missing months. It computes climatological monthly means first, then from those it computes the seasonal means and the annual means.

If this has been implemented, it was probably implemented on top of for fundamental CDMS functions.

Hope this helps a little,
Karl

-------- Forwarded Message --------
Subject: climatologies
Date: Thu, 9 Jul 2015 13:51:31 -0700
From: Karl Taylor taylor13@llnl.gov
To: Doutriaux, Charles doutriaux1@llnl.gov, Gleckler, Peter pgleckler@llnl.gov, Durack, Paul J. durack1@llnl.gov

Hi Charles and all,

Here's a proposal for how to compute climatological means, starting with
monthly data. Does anyone want to suggest a different approach?

Let's consider a seasonal mean missing if the climatological monthly
mean for one or more of the 3 months in that season is/are missing.
Let's consider an annual mean missing if the climatological monthly mean
for one or more of the 12 months of the year is/are missing.

Then I suggest the following algorithm:

Let f(x, m, y) be the value at grid-cell x, year y, and month m.
Let w(m, y) be the length of month m of year y.

*** First compute monthly climatologies for each grid cell [C(x, m)]

loop over grid cells (x)

    loop over months of year (m)

        C(x, m) = 0.
        A(x, m) = 0.
        n = 0

          loop over years (y)

                If (f(x, m, y) .not. missing)
                    n = n + 1
                    C(x, m) = C(x, m) + f(x, m, y) * w(m, y)
                    A(x, m) = A(x, m) + w(m, y)

            end y loop

            if (n = 0)
                C(x, m) = missing
            else
                C(x, m) = C(x, m) / A(x,m)
                A(x, m) = A(x, m) / n
            endif

    end m loop

end x loop

*** Now compute seasonal mean climatologies [Cs(x, s)]:

loop over grid cells (x)

    loop over 4 seasons (s)

        Cs(x, s) = 0.

        m1 = 3*s
        m2 = m1 + 1
        m3 = m1 + 2
        if m1 = 0 then m1 = 12

                If   A(x, m1) * A(x, m2) * A(x, m3)  = 0.  then

                    Cs(x, s) = missing

                else

                    Cs(x, s) = (C(x, m1)*A(x, m1) + C(x, m2)*A(x, 

m2) + C(x, m3)*A(x, m3) ) / (A(x, m1) + A(x, m2) + A(x, m3))

        end s loop

end x loop

*** Now compute annual mean climatology [Ca(x)]:

loop over grid cells (x)

    Ca(x) = 0.
    Aa(x) = 0.

    loop over months (m)

            If A(x,m) = 0. then

                Ca(x) = missing
                exit loop over months

            else

                Ca(x) = Ca(x) + C(x, m)*A(x, m)
                Aa(x) = Aa(x) + A(x, m)

            endif

    end m loop

    If Aa(x) > 0. then

        Ca(x) = Ca(x) / Aa(x)

    endif

end x loop

Notes:

  1. if there are no missing data and an integral number of years is
    considered, then Ca(x) will be equivalent to a straight-forward average
    of all the monthly values (each weighted by month length); otherwise the
    two means can differ.

  2. If we want to invariably compute all three climatologies (monthly,
    seasonal and annual), then the x loop can be placed around all three
    (instead of having 3 separate x loops)

that's all,
Karl

@taylor13
Copy link

I can't find my notes on this, but there is some description of how the centroid method I came up with is applied in https://cdat.llnl.gov/documentation/utilities/utilities-1.html under "temporal averaging". In general two criteria are set: a minimum coverage of the time period (threshold), and a constraint requiring data to be near the centroid of times sampled.

For a simple time-series (assumed not to be quasi-cyclic), the centroid is calculated as a simple mean of all times with data. This may be useful in deciding whether you can calculate a meaningful mean over a interval that includes a trend. If the mean of the sampling times is too close to one end the interval, then you'll get a non-representative time-mean.

For quasi-cyclic data like the diurnal cycle or the annual cycle, the centroid is calculated as for a two-dimensional field. For an annual mean to be calculated from monthly values, for example, you can specify that the centroid lie near the point calculated when all months are available. You basically treat each month as a point on an analog clock face, and leaving out the missing months calculate the centroid of the remaining months. Assume you've centered the clock on a polar coordinate system. If the radial distance to the centroid is less than some threshold, then the mean of the monthly values will give a reasonable annual mean. You might also, of course, set a minimum number of months as a second criteria.

@tomvothecoder
Copy link
Collaborator

Hi @taylor13, thank you for your input! We are planning to carefully review your notes about weight thresholds and the "centroid" function.

Here are more related comments:

You also sent us a helpful email about this on 7/14/21:

I designed the “missing data” treatment of that function, which I thought was useful and not trivial to include.

It’s not that the missing data can’t be easily handled, but that a user may want to set criteria for excluding locations entirely from a climatology if the amount of missing data is excessive or unrepresentative. For example suppose in some location, all of the winter data are missing; you might want to exclude that location if you’re computing an annual mean climatology because the annual mean would be skewed warm. Even in transition seasons (e.g., spring) if data are missing for the first month of the season, but are available for the last month, the seasonal mean will be skewed and not accurately represent the seasonal mean.

As I recall, CDAT allowed the user to set two criteria for excluding a location (i.e., setting it to “missing”) to mask out locations with insufficient data. The user could require a certain percentage of the data be present and the user could require the “centroid” of the contributing data to be within a certain distance of the location of the centroid of a dataset without missing data. For data that is cyclical (like the annual cycle) the centroid for climatologies is computed by considering each month to be located at the 12 divisions of a clock face. Without missing data the centroid would lie at the center of the circle. If data for the winter months were missing (months 12, 1, and 2), then the centroid would be offset toward the 7. However if months 1, 3, 5, 7, 9, and 11 were missing, the centroid would lie again at the center of the circle (and the annual mean computed from these values would more likely be accurate than one in which the winter months were missing. By specifying how close to the center of the circle the centroid must be, a user can control when to include a given location in the annual mean climatology. Centroids for individual seasons are similarly computed.

I’m not sure where in the code this all is done, but there is some discussion under https://cdat.llnl.gov/documentation/utilities/utilities-1.html (I meant to send that link earlier but must have not.) Look at the paragraphs under the heading “Specifying Data Coverage Criteria”.

@taylor13
Copy link

thanks for linking in the earlier input. I think the xcdat strategy should probably be to implement something rather simple like the "seasons" approach suggested in #531 (comment) . Two cases might be commonly encountered. Monthly mean data covering multiple years. Here, if there are no big trends, the multi-year calendar months could be averaged to form a single annual cycle. The other is to compute a time-series of annual means. For this case an annual mean could be calculated when at least one month of data was available for each of the 4 seasons. Seasonal means would be calculated from the months available within each season and then the annual mean would be calculated from all 4 seasons. (for seasonal means, the months should be weighted by the length of each month, and for annual means the seasons should be weighted by the length of each season.)

there are another of other simple options, one might adopt, so perhaps others can weigh in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: enhancement New enhancement request work: complex
Projects
Status: Todo
Development

No branches or pull requests

5 participants