Proposed Recipes for the Last Millennium Reanalysis, v2.x #142

Open
CommonClimate opened this issue Jun 28, 2022 · 3 comments

CommonClimate commented Jun 28, 2022

Source Dataset

The Last Millennium Reanalysis (LMR) utilizes an ensemble methodology to assimilate paleoclimate data for the production of annually resolved climate field reconstructions of the Common Era. The data are available from NOAA but are not (as far as we know) enabled for OPeNDAP access, much less cloud access. The PaleoCube project would like to make them available to paleoclimatologists to support several workflows in the cloud.

Gridded fields (sea-level pressure, surface air temperature, sea surface temperature, precipitation, Palmer Drought Severity Index) have the format (time, MCrun, lat, lon), where time is the year, lat is the latitude index, lon is the longitude index, and MCrun is the Monte Carlo iteration index. There are in fact 20 LMR reconstructions contained in these arrays: they differ in the climate model states drawn at random from the CCSM4 Last Millennium simulation to form the prior ensemble, and in the proxies drawn randomly for the reconstruction (75% of all available proxies). All fields are anomalies from the 1951--1980 time mean.
File and variable naming conventions follow as closely as possible those of the NOAA 20th Century Reanalysis.
In addition, there are files with full (5000-member) ensembles for global mean surface temperature, Northern and Southern Hemisphere mean temperature, and various climate indices (e.g., AMO, PDO, AO, NAO, NINO3, SOI).
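For orientation (not part of the proposal), here is a minimal sketch of how one of the gridded files can be inspected with xarray. The URL follows the v2.1 naming pattern that appears in the recipe further down this thread, and downloading the full file for a quick look is assumed to be acceptable.

import urllib.request
import xarray as xr

# v2.1 surface air temperature, ensemble mean; URL follows the naming pattern
# used in the recipe further down this thread
url = ("https://www.ncei.noaa.gov/pub/data/paleo/reconstructions/tardif2019lmr/v2_1/"
       "air_MCruns_ensemble_mean_LMRv2.1.nc")

# download once (the file is a few hundred MB), then open locally with xarray
local_file, _ = urllib.request.urlretrieve(url, "air_MCruns_ensemble_mean_LMRv2.1.nc")
ds = xr.open_dataset(local_file, use_cftime=True)

print(ds.dims)       # expected: time, MCrun (20 Monte Carlo iterations), lat, lon
print(ds.data_vars)  # expected: the 'air' field plus coordinate-bounds variables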

Data from two versions (2.0 and 2.1) are provided, both described in Tardif et al. (2019). Common aspects are:

  • CCSM4 Last Millennium simulation as the source of the prior, with states from 100 randomly drawn years as the prior ensemble in each Monte Carlo realization.
  • Regression-based Proxy System Models, formulated using the seasonal responses of individual records, with bivariate models w.r.t. temperature and precipitation for tree-ring width proxies, and univariate w.r.t. temperature for all other proxy archives.
  • Covariance localization applied with a cut-off length scale of 25000 km.
  • Reconstructions generated at annual resolution, on a 2° x 2° grid.

Differences are related to the set of assimilated proxies:

  • LMR v2.1: Proxies from the PAGES2k (2017) data set*. Corresponds to the results presented in Tardif et al. (2019), section 3, figures 2-5.
  • LMR v2.0: Proxies from PAGES2k (2017) + Anderson et al. (2019) [see figure 8 of Tardif et al. (2019)]. Reconstruction results are discussed in Tardif et al. (2019), section 4.3, and shown in figures 9-10, panels (e) and (f), and in figure 11.
  *With two exceptions: for the Palmyra coral record we used the more recent version from Emile-Geay et al. (2013) instead of the version from Cobb et al. (2003) included in PAGES2k (2017); and for the Kiritimati coral record, the longer record from the Anderson et al. (2019) dataset, taken from Cobb et al. (2013), replaces a slightly shorter record included in PAGES2k (2017).
  • Link to the website / online documentation for the data: https://www.ncei.noaa.gov/access/paleo-search/study/27850

  • The file format is netCDF

  • The source files are organized as follows: for each gridded field there are 4 files (v2.0 and v2.1, each with the ensemble mean and the ensemble spread); for the indices there are 8 files (4 files per LMR "flavor": GMST, NHMT, SHMT, and posterior indices). See the sketch after this list.

  • How are the source files accessed: access protocol unknown, but netCDF files are available here: v2.0 files, v2.1 files

  • Data are public, fully open.
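To make the file organization concrete, here is a minimal sketch enumerating the expected gridded-field file names. Only the v2.1 directory and naming pattern appear elsewhere in this thread; the v2.0 names below are an assumption that it mirrors the v2.1 layout.

# Sketch of the expected source-file layout: per gridded field, v2.0/v2.1 x mean/spread.
# Only the v2.1 directory/naming is confirmed in this thread; the v2.0 variant is assumed.
fields = ['air', 'pdsi', 'pr', 'prate', 'prmsl', 'sst']
versions = {'v2_0': 'LMRv2.0', 'v2_1': 'LMRv2.1'}  # directory -> file-name tag
stem = 'https://www.ncei.noaa.gov/pub/data/paleo/reconstructions/tardif2019lmr/{vdir}/'

for field in fields:
    for vdir, vtag in versions.items():
        for val_type in ('mean', 'spread'):
            print(stem.format(vdir=vdir) + f'{field}_MCruns_ensemble_{val_type}_{vtag}.nc')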

Transformation / Alignment / Merging

No transformation beyond conversion to Zarr. The .nc files can easily be loaded with xarray, so this step should not pose particular problems.
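As a minimal sketch of that step for a single file (assuming it has already been downloaded locally, as in the earlier snippet), with illustrative chunk sizes only; the recipe below handles this at scale via pangeo-forge:

import xarray as xr

# open a previously downloaded LMR netCDF file (see the earlier snippet)
ds = xr.open_dataset("air_MCruns_ensemble_mean_LMRv2.1.nc", use_cftime=True)

# illustrative chunking: one time step per chunk, full field otherwise
ds = ds.chunk({"time": 1})

# write a consolidated Zarr store; the output path is arbitrary here
ds.to_zarr("air_MCruns_ensemble_mean_LMRv2.1.zarr", consolidated=True, mode="w")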

Output Dataset

Zarr format, preferably parked in GCP US-central so it is easily accessible from 2i2c's LinkedEarth research hub.

@jordanplanders (Contributor) commented:

@cisaacstern OK! I think I've got another one in the works! One question that came up, however, is whether it would be best to move these data from their current residence on the NOAA FTP server to THREDDS, and whether that would introduce any new subtleties I should be aware of.

When I ran this locally (with the FTP URLs), it took about two hours.

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern, MergeDim
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# each key encodes a (field, value type) pair as field_valuetype
variables = ['air_mean', 'air_spread',
             'pdsi_mean', 'pdsi_spread',
             'pr_mean', 'pr_spread',
             'prate_mean', 'prate_spread',
             'prmsl_mean', 'prmsl_spread',
             'sst_mean', 'sst_spread']


def make_url(time, variable):
    # split e.g. 'air_mean' into ('air', 'mean'); `time` is required by FilePattern
    # but unused, since each file contains the full time series
    pair = variable.rsplit('_', 1)
    stem = 'https://www.ncei.noaa.gov/pub/data/paleo/reconstructions/tardif2019lmr/v2_1/'
    nc_file = '{_var}_MCruns_ensemble_{val_type}_LMRv2.1.nc'.format(_var=pair[0], val_type=pair[1])
    return stem + nc_file


# the full time series is in each file, each of which is between ~300 MB and ~3 GB
time_concat_dim = ConcatDim("time", [0])
pattern = FilePattern(make_url,
                      time_concat_dim,
                      MergeDim(name="variable", keys=variables))


# renames each data variable to <var>_<value_type>, where <value_type> is "mean" or
# "spread", inferred from the file-level 'comment' attribute
def postproc(ds):
    data_vars = [var for var in ds.data_vars.keys() if 'bound' not in var]

    if 'spread' in ds.attrs['comment'].lower():
        data_type = 'spread'
    elif 'mean' in ds.attrs['comment'].lower():
        data_type = 'mean'
    else:
        raise ValueError("could not infer value type from the 'comment' attribute")

    ds = ds.rename(name_dict={var: '_'.join([var, data_type]) for var in data_vars})

    return ds


# use subset_inputs to make the processing more tractable
recipe = XarrayZarrRecipe(pattern,
                          inputs_per_chunk=1,
                          consolidate_zarr=True,
                          subset_inputs={'time': 42},
                          target_chunks={'time': 1},
                          process_chunk=postproc,
                          copy_input_to_local_file=False,
                          xarray_open_kwargs={'decode_coords': True,
                                              'use_cftime': True,
                                              'decode_times': True})
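To illustrate what postproc does (this is not part of the recipe itself): given a dataset whose global comment attribute mentions "spread", the data variables get a _spread suffix while *_bound variables are left alone. A small synthetic example:

import numpy as np
import xarray as xr

# synthetic stand-in for one LMR chunk: a field variable plus a bounds variable,
# with a global 'comment' attribute indicating the value type
toy = xr.Dataset(
    {"air": (("time", "MCrun", "lat", "lon"), np.zeros((1, 2, 3, 4))),
     "lat_bound": (("lat", "nbnds"), np.zeros((3, 2)))},
    attrs={"comment": "ensemble spread"},
)

print(list(postproc(toy).data_vars))  # expected: ['air_spread', 'lat_bound']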

@jordanplanders (Contributor) commented:

@cisaacstern True to form, I think I might have a way to tackle the un-gridded variables, but that will have to wait for tomorrow :)

@cisaacstern (Member) commented:

One question that came up however is whether it would be best to move these data from their current residence on the NOAA FTP server to THREDDS

Up to you! FWIW, I don't think 2 hrs to cache data is necessarily that long. My intuition is that waiting 2 hrs to cache the data (which only has to happen once) is a smaller price to pay than moving things around on the NOAA side, but I don't know how easy it may (or may not) be to move to THREDDS.
