Data Preparation

All data is from cds.climate.copernicus.eu. For all values except ocean currents I am using a few years of the ERA5 hourly data on single levels from 1979 to present dataset (cds.climate.copernicus.eu). This dataset is an ECMWF reanalysis that holds hourly data for many variables such as wind, pressure, and temperature. Data points were regridded to a regular lat-lon grid of 0.25 degrees. To reduce the download size I only took every 3rd hour (00:00, 03:00, 06:00, ...) and excluded the 20° closest to each pole (keeping 70°S to 70°N). For ocean currents I am using the ORAS5 global ocean reanalysis monthly data from 1958 to present (cds.climate.copernicus.eu). It provides data from a 3D ocean model on a lat-lon grid of approximately 0.25 degrees and includes variables for zonal and meridional water velocity. However, this model only has monthly means.
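For reference, a single ERA5 download request via the CDS API (cdsapi) looks roughly like the sketch below. The variable names, dates, and output path are placeholders; the actual requests are assembled by the download step of the CLI.

import cdsapi  # official CDS API client

c = cdsapi.Client()  # reads credentials from ~/.cdsapirc
c.retrieve(
    "reanalysis-era5-single-levels",
    {
        "product_type": "reanalysis",
        # placeholder variables; the pipeline requests several more
        "variable": ["10m_u_component_of_wind", "10m_v_component_of_wind"],
        "year": "2023",
        "month": "01",
        "day": [f"{d:02d}" for d in range(1, 32)],
        "time": [f"{h:02d}:00" for h in range(0, 24, 3)],  # every 3rd hour
        "area": [70, -180, -70, 180],  # N, W, S, E -- excludes 20° at each pole
        "format": "netcdf",
    },
    "era5_wind_2023_01.nc",  # placeholder output path
)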

I aggregated the data into certain time ranges. Usually I am interested in a certain month of a year. The last year's data gives a very recent reading; the last 10 years' data gives a more robust reading.

Last download: 05.02.2024, v6, each month over the year 2023 and over the years 2019-2023.

Winds Winds are given as u and v vectors in m/s, where u describes the component blowing eastwards and v the component blowing northwards. The actual directions and velocities have to be calculated from these vectors. (Note: 180° is added here because, by convention, I want the direction the wind is coming from!) I then used the Beaufort scale to bin wind velocities into 13 bins and the traditional compass rose bearings (N, NNE, NE, ...) to bin wind directions into 16 bins. For each time range I counted occurrences per direction × velocity bin.
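As a sketch of the conversion and binning (the Beaufort edges are the standard thresholds in m/s and the compass sectors are 22.5° wide; the exact edges used in the pipeline may differ slightly):

import numpy as np

def wind_speed_direction(u, v):
    # Speed in m/s and meteorological direction, i.e. where the wind comes FROM.
    speed = np.hypot(u, v)
    # arctan2(u, v) is the bearing the wind blows TO (clockwise from north);
    # adding 180° gives the direction it comes from.
    direction = (np.degrees(np.arctan2(u, v)) + 180.0) % 360.0
    return speed, direction

# Upper limits (m/s) of Beaufort forces 0..11; force 12 is everything above.
BEAUFORT_EDGES = [0.3, 1.6, 3.4, 5.5, 8.0, 10.8, 13.9, 17.2, 20.8, 24.5, 28.5, 32.7]
COMPASS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
           "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]

def bin_wind(speed, direction):
    force = np.digitize(speed, BEAUFORT_EDGES)             # 13 bins, 0..12
    sector = np.round(direction / 22.5).astype(int) % 16   # 16 compass bearings
    return force, np.array(COMPASS)[sector]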

Temperatures (air) Temperatures are given in Kelvin, extrapolated to 2 m above the Earth's surface. (So this temperature is probably not very reliable in mountains.) For each time range I extracted daily high and low temperatures, then calculated the means and standard deviations of the highs and lows.
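A rough sketch of that aggregation with pandas, assuming a long table of 3-hourly values with columns time, latitude, longitude, and t2m (the column names are illustrative):

import pandas as pd

def daily_high_low_stats(df: pd.DataFrame) -> pd.DataFrame:
    # Daily highs/lows per grid cell, then their mean and std over the whole time range.
    daily = (df.groupby(["latitude", "longitude", df["time"].dt.date])["t2m"]
               .agg(high="max", low="min"))
    return (daily.groupby(["latitude", "longitude"])
                 .agg(high_mean=("high", "mean"), high_std=("high", "std"),
                      low_mean=("low", "mean"), low_std=("low", "std")))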

Precipitation Total precipitation is given in m (the depth of water if it were spread evenly over the grid cell). It includes liquid and frozen water from large-scale and convective precipitation. "This parameter does not include fog, dew or the precipitation that evaporates in the atmosphere before it lands at the surface of the Earth." I sum up all precipitation over each day and report daily averages.
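The corresponding aggregation, again as a sketch with illustrative column names (tp is the ERA5 short name for total precipitation):

import pandas as pd

def daily_precip_stats(df: pd.DataFrame) -> pd.Series:
    # Daily precipitation totals per grid cell, then the average daily total over the range.
    daily_tp = df.groupby(["latitude", "longitude", df["time"].dt.date])["tp"].sum()
    return daily_tp.groupby(["latitude", "longitude"]).mean()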

Sea surface temperature This is the water temperature near the sea surface, given in Kelvin. As with air temperatures, I calculated daily high and low means and standard deviations for each time range.

Wave heights This is the average height of the highest third of sea waves generated by wind and swell. It represents the vertical distance in m between wave crest and wave trough. I used the Douglas scale to bin wave heights into 10 bins. For each time range I counted occurrences per bin.
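A sketch of that binning, using the commonly cited Douglas sea state limits in metres (degrees 0-9; the exact edges used in the pipeline may differ):

import numpy as np

# Upper limits (m) of Douglas sea state degrees 0..8; degree 9 is everything above.
DOUGLAS_EDGES = [0.0, 0.1, 0.5, 1.25, 2.5, 4.0, 6.0, 9.0, 14.0]

def douglas_degree(wave_height):
    # Map significant wave height (m) to a Douglas degree 0..9 (10 bins).
    return np.digitize(wave_height, DOUGLAS_EDGES, right=True)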

Currents Currents are given as rotated zonal and meridional velocities in m/s. These are similar to u- and v-vectors, where zonal describes the eastward and meridional the northward velocity. The actual directions and velocities have to be calculated from these vectors. Like with the winds, all directions are binned into the 16 traditional compass rose bearings (N, NNE, NE, ...). There is no obvious scale for current strength, so I created a binning with increasingly larger intervals.
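A sketch analogous to the winds, but without the 180° shift (assuming, as is usual in oceanography, that the direction of interest is the one the current flows towards); the velocity bin edges below are placeholders, not the ones actually used:

import numpy as np

def current_speed_direction(uo, vo):
    # Speed in m/s and the bearing the current flows towards (clockwise from north).
    speed = np.hypot(uo, vo)
    direction = np.degrees(np.arctan2(uo, vo)) % 360.0
    return speed, direction

# Placeholder edges (m/s) with increasingly larger intervals.
CURRENT_EDGES = [0.05, 0.1, 0.2, 0.4, 0.8, 1.6]

def bin_current_speed(speed):
    return np.digitize(speed, CURRENT_EDGES)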

Steps to Reproduce

Install and activate environment.yml (conda env create -f environment.yml && conda activate prevwinds_prep). There is a Python CLI for the different steps; see the description with python -m main --help. A data directory (--datadir) is used for temporarily storing all files. Its default is [data/](./data/) (in .gitignore here). Raw variables are first downloaded, then extracted into parquet files. Target variables are then derived by aggregation and finally uploaded. Files for extracted variables (data/extracted_*.pq) can be reused, but it makes sense to download everything from scratch after some time because datasets are sometimes updated retroactively.

python -m main --datadir ./data download
python -m main --datadir ./data extract
python -m main --datadir ./data aggregate

All raw downloaded files (for 5 years) are about 130 GB, the extracted files about 100 GB. So it makes sense to process one variable at a time (using --variables). The aggregated files are 2.5 GB.
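For example, to run the pipeline for a single variable (the flag placement and the variable name winds here are assumptions; check python -m main --help for the accepted names):

python -m main --datadir ./data download --variables winds
python -m main --datadir ./data extract --variables winds
python -m main --datadir ./data aggregate --variables winds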

The upload to S3 takes about 9 h. Do it overnight when fewer people are using the network. Whether everything was uploaded correctly can be checked with python -m main check .... If the upload failed for some keys, they can be uploaded separately with python -m main --datadir ./data upload --keys <mykey>.