[Feature]: Set (optional) weight threshold for averaging operations #531
Comments
@pochedls thank you for posting this. I think this would be very important, especially when there is time-varying missing data in different locations, which is not uncommon in observational datasets. @gleckler1 handles observational datasets using xcdat for obs4mips, so this subject will be very relevant to his work. Datasets from obs4mips are used as reference datasets in PMP, so this issue is also relevant to PMP.
@pochedls @lee1043 Thanks for bringing this up! It would be great to have some sort of weight_threshold. Is the thinking that if the threshold was not met, a NaN would be given or an error would be raised? It would be great to have this for both time and space, but my first choice would be time. Thanks again for thinking of this.
Here's an exchange I had with Chris Golaz in 2018 that relates to this and addresses the question of an annual mean. It requires one to form monthly means; then, if at least one month of data was available during each season, an annual mean would be calculated. If all 3 months were missing in one or more seasons, the annual mean would be considered missing. I'll also try to find my notes on using a centroid criterion for means of cyclical data (like the annual cycle). It gave the user more control over how much data could be missing in a situation like that.

Chris, I think the seasonal climatology assumes the first month in the time-series is January, so that part needs to be generalized to handle different starting months.

happy coding,

On 5/23/18 3:05 PM, Chris Golaz wrote:

Thanks for that. This algorithm makes perfect sense to me with reasonable choices for propagating missing values from monthly to seasonal and annual. Maybe I should code it up in Fortran to see how much faster than Python it can be... As far as I can tell, the extra complication in cdutil comes from the added flexibility on how missing values propagate up in the average (with the option of specifying a threshold and centroid).

-Chris

On 05/23/2018 02:43 PM, Karl Taylor wrote:

-------- Forwarded Message --------

Hi Chris,

I can't find notes from 2001, but below I've copied a suggestion I made for computing climatologies for use with the metric package. The pseudo-code can handle missing months. It computes climatological monthly means first, then from those it computes the seasonal means and the annual means. If this has been implemented, it was probably implemented on top of for fundamental CDMS functions.

Hope this helps a little,

-------- Forwarded Message --------

Hi Charles and all,

Here's a proposal for how to compute climatological means, starting with

Let's consider a seasonal mean missing if the climatological monthly

Then I suggest the following algorithm:

Let f(x, m, y) be the value at grid-cell x, year y, and month m.

*** First compute monthly climatologies for each grid cell [C(x, m)]
loop over grid cells (x)
end x loop

*** Now compute seasonal mean climatologies [Cs(x, s)]:
loop over grid cells (x)
m2) + C(x, m3)*A(x, m3) ) / (A(x, m1) + A(x, m2) + A(x, m3))
end x loop

*** Now compute annual mean climatology [Ca(x)]:
loop over grid cells (x)
end x loop

Notes:

that's all,
I can't find my notes on this, but there is some description of how the centroid method I came up with is applied in https://cdat.llnl.gov/documentation/utilities/utilities-1.html under "temporal averaging". In general two criteria are set: a minimum coverage of the time period (threshold), and a constraint requiring data to be near the centroid of times sampled.

For a simple time-series (assumed not to be quasi-cyclic), the centroid is calculated as a simple mean of all times with data. This may be useful in deciding whether you can calculate a meaningful mean over an interval that includes a trend. If the mean of the sampling times is too close to one end of the interval, then you'll get a non-representative time-mean.

For quasi-cyclic data like the diurnal cycle or the annual cycle, the centroid is calculated as for a two-dimensional field. For an annual mean to be calculated from monthly values, for example, you can specify that the centroid lie near the point calculated when all months are available. You basically treat each month as a point on an analog clock face and, leaving out the missing months, calculate the centroid of the remaining months. Assume you've centered the clock on a polar coordinate system. If the radial distance to the centroid is less than some threshold, then the mean of the monthly values will give a reasonable annual mean. You might also, of course, set a minimum number of months as a second criterion.
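As a rough illustration of the clock-face idea described above, here is a minimal sketch in plain Python. The function names (`cyclic_centroid_radius`, `passes_centroid_test`) and the default limits are hypothetical and chosen only for this example; this is not the cdutil implementation.

```python
import math


def cyclic_centroid_radius(present_months, n_points=12):
    """Radial distance from the clock centre to the centroid of the sampled months.

    Each month (1..n_points) is placed on a unit circle. A radius of 0.0 means
    the available months balance each other out around the cycle; a radius
    near 1.0 means they are clustered in one part of the cycle.
    """
    months = list(present_months)
    if not months:
        return math.nan
    angles = [2.0 * math.pi * m / n_points for m in months]
    x = sum(math.cos(a) for a in angles) / len(angles)
    y = sum(math.sin(a) for a in angles) / len(angles)
    return math.hypot(x, y)


def passes_centroid_test(present_months, max_radius=0.5, min_months=6):
    """Apply both criteria: a minimum month count and a centroid-radius limit.

    The default limits are arbitrary placeholders, not recommended values.
    """
    months = list(present_months)
    if len(months) < min_months:
        return False
    return cyclic_centroid_radius(months) <= max_radius


# All twelve months: centroid sits at the clock centre.
print(cyclic_centroid_radius(range(1, 13)))
# Winter only (Dec, Jan, Feb): centroid is far from the centre.
print(cyclic_centroid_radius([12, 1, 2]))
```

Two months on opposite sides of the clock (for example January and July) also give a radius of zero, which is why a minimum number of months is useful as a second criterion.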
Hi @taylor13, thank you for your input! We are planning to carefully review your notes about weight thresholds and the "centroid" function. Here are more related comments:
You also sent us a helpful email about this on 7/14/21:
Thanks for linking in the earlier input. I think the xcdat strategy should probably be to implement something rather simple like the "seasons" approach suggested in #531 (comment). Two cases might be commonly encountered. The first is monthly mean data covering multiple years. Here, if there are no big trends, the multi-year calendar months could be averaged to form a single annual cycle. The second is computing a time series of annual means. For this case an annual mean could be calculated when at least one month of data was available for each of the 4 seasons. Seasonal means would be calculated from the months available within each season, and then the annual mean would be calculated from all 4 seasons. (For seasonal means, the months should be weighted by the length of each month, and for annual means the seasons should be weighted by the length of each season.) There are a number of other simple options one might adopt, so perhaps others can weigh in.
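The annual-mean rule above could be sketched roughly as follows, in plain Python. The function name `annual_mean_from_seasons` is hypothetical, and the sketch makes two simplifying assumptions that are not part of the suggestion: a non-leap year, and a DJF season built from the December, January, and February of the same calendar year.

```python
import math

# Days per calendar month (assumption: non-leap year).
DAYS_IN_MONTH = {1: 31, 2: 28, 3: 31, 4: 30, 5: 31, 6: 30,
                 7: 31, 8: 31, 9: 30, 10: 31, 11: 30, 12: 31}

# Assumption: DJF uses months from the same calendar year.
SEASONS = {"DJF": (12, 1, 2), "MAM": (3, 4, 5),
           "JJA": (6, 7, 8), "SON": (9, 10, 11)}


def annual_mean_from_seasons(monthly):
    """Annual mean from a dict {month number: value}.

    Missing months may be absent or NaN. The result is NaN if any season
    has no data at all; otherwise seasons are averaged from their available
    months (weighted by month length) and combined into an annual mean
    (weighted by full season length).
    """
    weighted_sum = 0.0
    total_days = 0
    for months in SEASONS.values():
        available = [m for m in months
                     if m in monthly and not math.isnan(monthly[m])]
        if not available:
            return math.nan  # a whole season is missing
        days_present = sum(DAYS_IN_MONTH[m] for m in available)
        season_mean = sum(monthly[m] * DAYS_IN_MONTH[m]
                          for m in available) / days_present
        season_days = sum(DAYS_IN_MONTH[m] for m in months)
        weighted_sum += season_mean * season_days
        total_days += season_days
    return weighted_sum / total_days


# One month available per season: an annual mean is still produced.
print(annual_mean_from_seasons({1: 0.0, 4: 10.0, 7: 20.0, 10: 10.0}))
# No summer data: the annual mean is treated as missing.
print(annual_mean_from_seasons({1: 0.0, 4: 10.0, 10: 10.0}))
```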
Is your feature request related to a problem?
When xCDAT performs an averaging operation (spatial or temporal) with missing data, it assigns missing values a weight of zero (which is correct). But this can be misleading if some or the majority of data is missing. Imagine if data was missing in spring, fall, and summer (leaving only winter data). xCDAT would take the annual average and report a very cold annual average temperature. A user is usually aware of this kind of thing, but may miss it if a small number of grid cells are missing part of the dataset (or are missing anomaly values, which would be harder to recognize as "weird").
See the example below:
A similar situation can arise if a timestep of observations is missing part of the field during spatial averaging operations (e.g., missing the tropics, leading to too cold global temperatures).
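To make the spatial case concrete, here is a small sketch in plain Python (not xCDAT code). The latitude bands and temperatures are invented for illustration, and `area_weighted_mean` is a hypothetical helper that uses cos(latitude) weights and gives missing (NaN) cells a weight of zero.

```python
import math


def area_weighted_mean(lats, values):
    """Return (mean, coverage) using cos(latitude) weights.

    NaN values get zero weight. `coverage` is the fraction of the total
    weight that actually contributed to the mean.
    """
    weights = [math.cos(math.radians(lat)) for lat in lats]
    present = [(w, v) for w, v in zip(weights, values) if not math.isnan(v)]
    used = sum(w for w, _ in present)
    if used == 0.0:
        return math.nan, 0.0
    mean = sum(w * v for w, v in present) / used
    return mean, used / sum(weights)


# Hypothetical zonal-mean temperatures (deg C) on six latitude bands.
lats = [-75, -45, -15, 15, 45, 75]
temps = [-20.0, 8.0, 26.0, 26.0, 8.0, -20.0]
temps_no_tropics = [-20.0, 8.0, math.nan, math.nan, 8.0, -20.0]

full_mean, full_cov = area_weighted_mean(lats, temps)
gappy_mean, gappy_cov = area_weighted_mean(lats, temps_no_tropics)
print(full_mean, full_cov)
print(gappy_mean, gappy_cov)
```

With the tropical bands missing, the mean is much colder even though the calculation itself is "correct"; the coverage value is the quantity a threshold could be applied to.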
Describe the solution you'd like
This would need to be mapped out more, but it might be useful to have a weight_threshold optional argument that allows the user to specify the minimum temporal/spatial weight needed to produce a value. For example, weight_threshold=0.9 would require 90% of the spatial / temporal data within the spatial or temporal averaging window be present.

CDAT had similar functionality for temporal averaging calculations (see Specifying Data Coverage Criteria). I'm not sure if there was any similar functionality for spatial averaging.
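A minimal sketch of how such an option might behave, written in plain Python rather than against the xCDAT API. The function `weighted_mean_with_threshold` is hypothetical, and returning NaN when the threshold is not met (rather than raising an error) is an assumption made only for this example.

```python
import math


def weighted_mean_with_threshold(values, weights, weight_threshold=None):
    """Weighted mean that treats NaN values as having zero weight.

    If `weight_threshold` is given (a fraction between 0 and 1), the result
    is NaN unless the non-missing values carry at least that fraction of
    the total weight in the averaging window.
    """
    if weight_threshold is not None and not 0.0 <= weight_threshold <= 1.0:
        raise ValueError("weight_threshold must be between 0 and 1")
    total = sum(weights)
    present = [(v, w) for v, w in zip(values, weights) if not math.isnan(v)]
    present_weight = sum(w for _, w in present)
    if present_weight == 0.0:
        return math.nan
    if weight_threshold is not None and present_weight / total < weight_threshold:
        return math.nan
    return sum(v * w for v, w in present) / present_weight


# Hypothetical monthly temperatures (deg C) with only winter data present.
nan = math.nan
days = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
winter_only = [-5.0, -3.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, -4.0]

# Current behaviour: a very cold "annual" mean built from three months.
print(weighted_mean_with_threshold(winter_only, days))
# With a 90% threshold: the annual mean is reported as missing.
print(weighted_mean_with_threshold(winter_only, days, weight_threshold=0.9))
```

Leaving the threshold as `None` by default keeps the existing behaviour, so the check would be opt-in.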
Describe alternatives you've considered
No response
Additional context
No response