
Optimal Strategies for Storing Earth Science Datasets in the Commercial Cloud


This repo provides the code for a study of optimal chunking strategies for multi-dimensional data stored in Zarr format in the cloud (AWS S3). Code is provided for rechunking a default Zarr archive, measuring the performance (mainly time and memory consumption) of different chunking strategies, and visualizing the performance data. This study was conducted as part of a Fall 2021/Spring 2022 internship at NASA Goddard Space Flight Center.


The complete list of required packages is provided in env-eisfire.yml. Create the conda environment with the command: conda env create -f env-eisfire.yml. Note that this code is set up to run on an AWS cluster that uses Slurm, so the steps below submit jobs with sbatch.


1. Rechunk dataset


In this study, we use the GEOS-FP dataset in Zarr format stored in the AWS S3 bucket eis-dh-fire/geos-fp-global/; specifically, the inst.zarr store and the BCEXTTAU variable. The default chunking scheme is 5136 chunks in the time dimension, 1 chunk in longitude, and 1 chunk in latitude.
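To make the default scheme concrete, the following minimal sketch counts chunks per dimension from an array shape and a chunk size. The shape and chunk sizes used here are assumptions chosen only so that the result matches the 5136 × 1 × 1 layout described above; they are not read from the store.

```python
import math

def chunk_counts(shape, chunk_sizes):
    """Number of chunks along each dimension for the given chunk sizes."""
    return tuple(math.ceil(n / c) for n, c in zip(shape, chunk_sizes))

# Hypothetical (time, lat, lon) shape; an assumption, not taken from the repo.
shape = (5136, 721, 1152)
# One time step per chunk, each chunk spanning the full lat/lon grid.
default_chunk_sizes = (1, 721, 1152)

counts = chunk_counts(shape, default_chunk_sizes)
total_chunks = math.prod(counts)
print(counts, total_chunks)
```

With one time step per chunk, reading a map at a single datetime touches one chunk, while reading a time series at a single location touches every chunk. This is the trade-off the rechunking experiments explore.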


To rechunk the dataset into a different scheme (e.g., 5136 chunks in time, 100 in longitude, and 100 in latitude), navigate to the directory rechunk/ and, in the main() function of the rechunking script, set the variables time, lat, and lon to the desired values. Each variable accepts a single value or a list of values; the script creates every unique combination of the three. Run the rechunking script with python to automatically launch one cluster job per combination.
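The way single values and lists expand into unique combinations can be sketched with itertools.product. This is an illustration of the idea, not the repository's code: the helper name chunk_combinations and the value 200 are hypothetical.

```python
from itertools import product

def chunk_combinations(time, lat, lon):
    """Return every (time, lat, lon) combination; a scalar counts as a one-item list."""
    def as_list(value):
        return list(value) if isinstance(value, (list, tuple)) else [value]
    return list(product(as_list(time), as_list(lat), as_list(lon)))

# Hypothetical settings: one value for time, two each for lat and lon.
combos = chunk_combinations(time=5136, lat=[100, 200], lon=[100, 200])
for t, la, lo in combos:
    # In the workflow described above, each combination would become one cluster job.
    print(f"time={t} lat={la} lon={lo}")
```

Here one time value and two values each for lat and lon yield four combinations, so four jobs would be launched.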


Job info, progress, and any errors are stored in the .out and .err files in the sub-directory logs-slurm/. The final output Zarr store is written back to S3 (eis-dh-fire/dieumynguyen_rechunked/geos-fp-global_inst/).

2. Measure performance

Input & Info:

After rechunking the dataset to various chunking schemes and storing the different versions of the dataset on S3, we track how the schemes perform for common data access and analysis operations (e.g., extracting a time series at a location or extracting a map or spatial slice at a datetime). Performance metrics include CPU time, peak memory usage, the rechunking time, and Zarr store archive size.
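As a rough illustration of how CPU time and peak memory can be captured for one operation, here is a minimal standard-library sketch. It is an assumption, not the method used in the study: the helper measure and the stand-in operation toy_time_series are hypothetical, and tracemalloc only tracks memory allocated through Python, not the full process footprint.

```python
import time
import tracemalloc

def measure(operation, *args, **kwargs):
    """Run operation once and return (result, cpu_seconds, peak_bytes)."""
    tracemalloc.start()
    cpu_start = time.process_time()
    try:
        result = operation(*args, **kwargs)
        cpu_seconds = time.process_time() - cpu_start
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, cpu_seconds, peak_bytes

def toy_time_series(n_steps):
    # Stand-in for "extract a time series at a location" on a real store.
    return [float(i) for i in range(n_steps)]

result, cpu_seconds, peak_bytes = measure(toy_time_series, 100_000)
print(f"cpu={cpu_seconds:.4f}s peak={peak_bytes / 1e6:.2f} MB")
```

Repeating the call over several trials and averaging would reduce noise from caching and from other processes sharing the node.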


Navigate to directory measure_performance/.

  • To obtain archive size data, use sbatch to submit the corresponding cluster job.
  • To obtain rechunking time, use sbatch to submit the corresponding cluster job.
  • To obtain CPU time and peak memory usage for a given data operation, modify the selected operation in the main() function of the measurement script. Then use sbatch to submit the corresponding cluster job.

Job info, progress, and any errors are stored in the .out and .err files in the sub-directory performance-logs-slurm/.

  • Archive size data is saved in data/geos-fp-global_inst/archive_sizes.csv.
  • Rechunking time data is saved in data/geos-fp-global_inst/rechunking_time.csv.
  • Time and memory data for each operation are saved in data/geos-fp-global_inst/, with the filename indicating the operation and the number of trials/repetitions (e.g., time_series_metrics_ntrials1.csv).
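The file naming convention above can be sketched as follows. The pattern is inferred from the single example filename, and the column names and the row values are placeholders invented for illustration; they are not results or the repository's actual schema.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical column names; the real CSV layout may differ.
FIELDS = ["scheme", "trial", "cpu_seconds", "peak_bytes"]

def metrics_filename(operation, n_trials):
    """Build a name such as time_series_metrics_ntrials1.csv."""
    return f"{operation}_metrics_ntrials{n_trials}.csv"

def write_metrics(out_dir, operation, n_trials, rows):
    """Write one CSV of per-trial metrics and return its path."""
    path = Path(out_dir) / metrics_filename(operation, n_trials)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return path

# Placeholder row, not a measured value.
rows = [{"scheme": "example", "trial": 0, "cpu_seconds": 0.0, "peak_bytes": 0}]
out_dir = Path(tempfile.mkdtemp()) / "geos-fp-global_inst"
path = write_metrics(out_dir, "time_series", 1, rows)
print(path.name)
```

Encoding the operation and trial count in the filename keeps runs with different settings from overwriting each other.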

3. Data visualization


The input to this step is the performance data generated in step 2 (Measure performance).


Navigate to directory visualization/ and use sbatch to submit the visualization job.


The heatmaps and scatterplots shown in the paper are stored in data/geos-fp-global_inst/heatmaps, data/geos-fp-global_inst/normalized_heatmaps, and data/geos-fp-global_inst/scatterplots.

Reference: Nguyen DMT, Cortes JC, Dunn MM, Shiklomanov AN (2022). Optimal Strategies for Storing Earth Science Datasets in the Commercial Cloud.