
[Enhancement]: Update coords="minimal" and compat="minimal" as defaults to improve performance of xc.open_mfdataset()? #641

Open
tomvothecoder opened this issue Apr 11, 2024 · 1 comment
Labels
type: enhancement New enhancement request

Comments

tomvothecoder (Collaborator) commented Apr 11, 2024

Is your feature request related to a problem?

xarray.open_mfdataset() has a few issues related to: (1) incorrectly concatenating coords onto variables (e.g., "time" gets added to "lat_bnds"), and (2) performance. xCDAT addresses (1) by defaulting to data_vars="minimal". To address (2), the posts and docs below suggest also adding coords="minimal" and compat="override".

pydata/xarray#1385 (comment)

It is very common for different netCDF files in a "dataset" (a folder) to be encoded differently so we can't set decode_cf=False by default.

there's probably something else going on under the hood that's causing the slowness of open_mfdataset at present.

There's

  1. slowness in reading coordinate information from every file. We have parallel to help a bit here.
  2. slowness in combining each file to form a single Xarray dataset. By default, we do lots of consistency checking by reading data. Xarray allows you to control this, data_vars='minimal', coords='minimal', compat='override' is a common choice.

What you're describing sounds like a failure of lazy decoding or a cftime slowdown (example), which should be fixed. If you can provide a reproducible example, that would help.

pydata/xarray#1385 (comment)

This is an amazing bug. The defaults say data_vars="all", coords="different", which means: always concatenate all data_vars along the concat dimension (here inferred to be "time"), but only concatenate coords if they differ between files.

When decode_cf=False, lat and lon are data_vars and get concatenated without any checking or reading. When decode_cf=True, lat and lon are promoted to coords and then checked for equality across all files. The two variables get read sequentially from every file. This is the slowdown you see.

Once again, this is a consequence of bad defaults for concat and open_mfdataset.

I would follow docs.xarray.dev/en/stable/user-guide/io.html#reading-multi-file-datasets and use data_vars="minimal", coords="minimal", compat="override" which will only concatenate those variables with the time dimension, and skip any checking for variables that don't have a time dimension (simply pick the variable from the first file).
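The behavior described in the quoted comment can be sketched with `xr.concat`, which is what `open_mfdataset` uses internally when combining files. This is an illustrative in-memory example (the synthetic "tas" and "lat_bnds" variables are assumptions for demonstration, not from the issue):

```python
import numpy as np
import xarray as xr

# Two in-memory stand-ins for netCDF files: a time-varying variable ("tas")
# plus a static bounds variable ("lat_bnds") that has no time dimension.
def make_part(t0):
    return xr.Dataset(
        data_vars={
            "tas": (("time", "lat"), np.random.rand(2, 3)),
            "lat_bnds": (("lat", "bnds"), np.zeros((3, 2))),
        },
        coords={"time": [t0, t0 + 1], "lat": [10.0, 20.0, 30.0]},
    )

ds1, ds2 = make_part(0), make_part(2)

# data_vars="all" (the concat default) expands every data variable along the
# concat dimension, so "time" is spuriously added to "lat_bnds".
bad = xr.concat([ds1, ds2], dim="time", data_vars="all")

# With the suggested settings, only variables that already contain "time" are
# concatenated; "lat_bnds" is taken from the first dataset without any
# equality checking (compat="override").
good = xr.concat(
    [ds1, ds2], dim="time",
    data_vars="minimal", coords="minimal", compat="override",
)
```

Here `bad["lat_bnds"]` gains a "time" dimension while `good["lat_bnds"]` keeps its original `("lat", "bnds")` shape, which mirrors both the correctness issue (1) and the performance issue (2): skipping the equality check means the static variables are never read from the other files.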

Describe the solution you'd like

Xarray documentation

A common use-case involves a dataset distributed across a large number of files with each file containing a large number of variables. Commonly, a few of these variables need to be concatenated along a dimension (say "time"), while the rest are equal across the datasets (ignoring floating point differences). The following command with suitable modifications (such as parallel=True) works well with such datasets:

xr.open_mfdataset('my/files/*.nc', concat_dim="time", combine="nested",
                  data_vars='minimal', coords='minimal', compat='override')

This command concatenates variables along the "time" dimension, but only those that already contain the "time" dimension (data_vars='minimal', coords='minimal'). Variables that lack the "time" dimension are taken from the first dataset (compat='override').

Describe alternatives you've considered

No response

Additional context

I don't know how reliable parallel=True is for speeding up the reading of coordinate information. There is an Xarray GitHub issue #7079 with comments suggesting that parallel=True is not thread-safe and might cause resource locking on some filesystems, unlike the default parallel=False. Tony B and I ran into this in e3sm_to_cmip (related issue).

In this e3sm_diags PR, we are getting a TimeoutError: Timed out when using xcdat.open_mfdataset(). There might be performance issues with the underlying call to xarray.open_mfdataset(). I think this e3sm_diags issue is actually related to compatibility with the multiprocessing scheduler manually defined in e3sm_diags (related issue).

@tomvothecoder tomvothecoder added the type: enhancement New enhancement request label Apr 11, 2024
pochedls (Collaborator) commented:

@tomvothecoder – it seems like adding these defaults could be helpful, but this would ideally be tested across many datasets (e.g., in the PMP) before it is rolled out.

Projects
Status: Todo