
[Enhancement]: Update coords="minimal" and compat="minimal" as defaults to improve performance of xc.open_mfdataset()? #641

Open
tomvothecoder opened this issue Apr 11, 2024 · 1 comment
Labels
type: enhancement New enhancement request

Comments

tomvothecoder (Collaborator) commented Apr 11, 2024

Is your feature request related to a problem?

xarray.open_mfdataset() has a few issues related to: (1) incorrectly concatenating coords onto variables (e.g., "time" gets added to "lat_bnds"), and (2) performance. xCDAT addresses (1) by defaulting to data_vars="minimal". To address (2), the posts and docs below suggest also adding coords="minimal" and compat="override".

pydata/xarray#1385 (comment)

It is very common for different netCDF files in a "dataset" (a folder) to be encoded differently so we can't set decode_cf=False by default.

there's probably something else going on under the hood that's causing the slowness of open_mfdataset at present.

There's

  1. slowness in reading coordinate information from every file. We have parallel to help a bit here.
  2. slowness in combining each file to form a single Xarray dataset. By default, we do lots of consistency checking by reading data. Xarray allows you to control this, data_vars='minimal', coords='minimal', compat='override' is a common choice.

What you're describing sounds like a failure of lazy decoding or a cftime slowdown (example), which should be fixed. If you can provide a reproducible example, that would help.

pydata/xarray#1385 (comment)

This is an amazing bug. The defaults say data_vars="all", coords="different", which means: always concatenate all data_vars along the concat dimension (here inferred to be "time"), but only concatenate coords if they differ between files.

When decode_cf=False, lat and lon are data_vars and get concatenated without any checking or reading. When decode_cf=True, lat and lon are promoted to coords and then checked for equality across all files. The two variables get read sequentially from every file. This is the slowdown you see.

Once again, this is a consequence of bad defaults for concat and open_mfdataset.

I would follow docs.xarray.dev/en/stable/user-guide/io.html#reading-multi-file-datasets and use data_vars="minimal", coords="minimal", compat="override" which will only concatenate those variables with the time dimension, and skip any checking for variables that don't have a time dimension (simply pick the variable from the first file).
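The behavior described in the quoted comment can be sketched with `xr.concat`, which is what `open_mfdataset` uses internally when combining files. This is an illustrative in-memory example (the synthetic "tas" and "lat_bnds" variables are assumptions for demonstration, not from the issue):

```python
import numpy as np
import xarray as xr

# Two in-memory stand-ins for netCDF files: a time-varying variable ("tas")
# plus a static bounds variable ("lat_bnds") that has no time dimension.
def make_part(t0):
    return xr.Dataset(
        data_vars={
            "tas": (("time", "lat"), np.random.rand(2, 3)),
            "lat_bnds": (("lat", "bnds"), np.zeros((3, 2))),
        },
        coords={"time": [t0, t0 + 1], "lat": [10.0, 20.0, 30.0]},
    )

ds1, ds2 = make_part(0), make_part(2)

# data_vars="all" (the concat default) expands every data variable along the
# concat dimension, so "time" is spuriously added to "lat_bnds".
bad = xr.concat([ds1, ds2], dim="time", data_vars="all")

# With the suggested settings, only variables that already contain "time" are
# concatenated; "lat_bnds" is taken from the first dataset without any
# equality checking (compat="override").
good = xr.concat(
    [ds1, ds2], dim="time",
    data_vars="minimal", coords="minimal", compat="override",
)
```

Here `bad["lat_bnds"]` gains a "time" dimension while `good["lat_bnds"]` keeps its original `("lat", "bnds")` shape, which mirrors both the correctness issue (1) and the performance issue (2): skipping the equality check means the static variables are never read from the other files.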

Describe the solution you'd like

Xarray documentation

A common use-case involves a dataset distributed across a large number of files with each file containing a large number of variables. Commonly, a few of these variables need to be concatenated along a dimension (say "time"), while the rest are equal across the datasets (ignoring floating point differences). The following command with suitable modifications (such as parallel=True) works well with such datasets:

xr.open_mfdataset('my/files/*.nc', concat_dim="time", combine="nested",
                  data_vars='minimal', coords='minimal', compat='override')

This command concatenates variables along the "time" dimension, but only those that already contain the "time" dimension (data_vars='minimal', coords='minimal'). Variables that lack the "time" dimension are taken from the first dataset (compat='override').

Describe alternatives you've considered

No response

Additional context

I don't know how reliable parallel=True is for speeding up the reading of coordinate information. There is an Xarray GitHub issue #7079 with comments suggesting that parallel=True is not thread-safe and might cause resource locking on some filesystems, unlike the default parallel=False. Tony B and I ran into this in e3sm_to_cmip (related issue).

In this e3sm_diags PR, we are getting a TimeoutError: Timed out when using xcdat.open_mfdataset(). There might be performance issues with the underlying call to xarray.open_mfdataset(). I think this e3sm_diags issue is actually related to compatibility with the multiprocessing scheduler manually defined in e3sm_diags (related issue).

@tomvothecoder tomvothecoder added the type: enhancement New enhancement request label Apr 11, 2024
pochedls (Collaborator) commented:

@tomvothecoder – it seems like adding these defaults could be helpful, but this would ideally be tested across many datasets (e.g., in the PMP) before it is rolled out.

Projects
Status: Todo