Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: temporal.group_average daily to monthly assigns wrong time index (and bounds) #586

Open
durack1 opened this issue Jan 12, 2024 · 8 comments
Labels
duplicate This issue or pull request already exists

Comments

@durack1
Copy link
Collaborator

durack1 commented Jan 12, 2024

What happened?

I have been attempting to generate monthly mean from daily data, and expected the process-resolved time value to generate an index that is the middle of the month. The data example is 1 through 31 January 1998 daily files of SST. I would expect for a monthly average over these days [1998-01-01 00:00 ... 1998-01-31 00:00] the new time value represents the mean of the period e.g. [1998-01-16 12:00], whereas I get [1998-01-01 00:00], and no time_bounds assigned

What did you expect to happen? Are there are possible answers you came across?

I expected a new monthly mean value to have the time axis value [1998-01-16 12:00], which also includes time_bounds [1998-01-01 00:00, 1998-02-01 00:00].

Working example included in the Jupyter Notebook

Minimal Complete Verifiable Example (MVCE)

Example calculates same January 1998 monthly average is included in the Jupyter Notebook (above)

Relevant log output

See Notebook above

Anything else we need to know?

ping @taylor13, @gleckler1, @lee1043 who I have been discussing this behaviour with

Environment

INSTALLED VERSIONS

commit: None
python: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.90.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2023.12.0
pandas: 2.1.4
numpy: 1.26.3
scipy: 1.11.4
netCDF4: 1.6.5
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.6.3
nc_time_axis: None
iris: None
bottleneck: 1.3.7
dask: 2023.12.1
distributed: 2023.12.1
matplotlib: 3.8.2
cartopy: 0.22.0
seaborn: None
numbagg: None
fsspec: 2023.12.2
cupy: None
pint: None
sparse: 0.15.1
flox: None
numpy_groupies: None
setuptools: 69.0.3
pip: 23.3.2
conda: None
pytest: None
mypy: None
IPython: 8.20.0
sphinx: None

@pochedls
Copy link
Collaborator

This is essentially the same as #565. PRs welcome.

@pochedls
Copy link
Collaborator

@durack1 – as a workaround, you can probably use ds.bounds.add_time_bounds() to create bounds after your temporal averaging operation. If the time point is important, you could also average the resulting time bounds and reassign them to ds.time.

@taylor13
Copy link

taylor13 commented Jan 12, 2024

To be sure. Regarding the statement made above: "The data example is 1 through 31 January 1998 daily files of SST. I would expect for a monthly average over these days [1998-01-01 00:00 ... 1998-01-31 00:00]".

I think the monthly average should be for all data representing intervals partially or fully within the bounds [1998-01-01 00:00 and 1998-02-01 00:00], so if the the coordinates values represent the beginning of daily means (rather than the middle of the day), then indeed data values for coordinates intervals associated with the coordinate values [1998-01-01 00:00 ... 1998-01-31 00:00] will be simply averaged together (no weighting required) to yield a monthly mean extending from 1998-01-01 00:00 to 1998-02-01 00:00. If, however, the coordinates represent the middle of each daily interval, then the first daily sample represents an interval from 1997-12-31 12:00 through 1998-01-01 12:00 and the monthly mean will include 32 daily samples (rather than 31) with the first and last samples included but with half the "weight" of the others. There is further discussion of this at #565.

@durack1 durack1 added the duplicate This issue or pull request already exists label Jan 12, 2024
@durack1
Copy link
Collaborator Author

durack1 commented Jan 12, 2024

This is essentially the same as #565. PRs welcome.

Agreed, this is effectively a dupe although focuses on daily -> monthly, vs monthly -> annual #565. duplicate label assigned

@pochedls
Copy link
Collaborator

#565 also seems to be a duplicate of #534, which is related to #529.

@durack1
Copy link
Collaborator Author

durack1 commented Jan 12, 2024

#565 also seems to be a duplicate of #534, which is related to #529.

@pochedls it seems in #529 there was an xarray version dependence. Is there a test that would catch issues in the case upstream (xarray) changes impact the temporal_average function updates?

@durack1
Copy link
Collaborator Author

durack1 commented Jan 12, 2024

@durack1 – as a workaround, you can probably use ds.bounds.add_time_bounds() to create bounds after your temporal averaging operation. If the time point is important, you could also average the resulting time bounds and reassign them to ds.time.

I had tried this (in the Notebook) but got the error
ValueError: Cannot generate bounds for coordinate variable 'time' which has a length <= 1 (singleton).

The code I attempted was ds.bounds.add_time_bounds(method="freq", freq="month") not sure if this was some noob error

@pochedls
Copy link
Collaborator

I think it is generally not possible to automatically generate bounds if there is only one time value, but I think it could be done in this case (you are specifying the frequency). This would require revisiting this block of code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
duplicate This issue or pull request already exists
Projects
Status: Todo
Development

No branches or pull requests

3 participants