Bioclimatic variables data extraction at high resolution

About the project

This is a small Python program for quick and easy extraction of bioclimatic variables (commonly used in ecological analyses and modeling) from two of the most prominent climate datasets, Chelsa and WorldClim. To extract the data, you only need the lon/lat (x,y) coordinates of your points, expressed in any Coordinate Reference System, together with the downloaded GeoTIFF files. The program takes care of the offset and scale correction and outputs the data in the proper units for each variable.
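As a rough illustration of what an offset and scale correction involves, the sketch below converts a raw raster cell value to physical units with the usual linear rule `value = raw * scale + offset`. The function name `apply_scale_offset` is hypothetical, and the scale of 0.1 and offset of -273.15 are assumed for illustration only; the actual per-variable values are listed in the metadata PDFs under data/specs.

```python
def apply_scale_offset(raw_value, scale, offset):
    """Convert a raw raster cell value to physical units (linear correction)."""
    return raw_value * scale + offset


# Illustrative numbers only (assumed, not read from a GeoTIFF):
# a temperature stored as tenths of a Kelvin and converted to Celsius.
raw = 2792
celsius = apply_scale_offset(raw, scale=0.1, offset=-273.15)
print(round(celsius, 2))
```

The program applies this kind of correction for you, so the sketch is only meant to show why the raw GeoTIFF values differ from the reported ones.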

A quick way to visualize the results with the Plotly library is also included: run the data_viz.py script, which can be improved and customized.

Databases

WorldClim Bioclimatic variables (latest version 2.1)

Latest publication : WorldClim 2: new 1-km spatial resolution climate surfaces for global land areas (2017)

Historical data range : 1970-2000

Data at 1 km² resolution for all 19 BIOCLIM vars https://www.worldclim.org/

Metadata : bioclim-data-extraction/data/specs/anuclim61.pdf

Chelsa - Climatologies at high resolution for the earth’s land surface areas (latest version V2.1)

Latest publication : Global climate-related predictors at kilometre resolution for the past and future. (preprint)

Historical data range : 1981-2010

Data at 1 km² resolution for all 19 BIOCLIM vars https://chelsa-climate.org

Metadata : bioclim-data-extraction/data/specs/CHELSA_tech_specification_V2.pdf

Installation

Use conda to set up the Python environment

Requirements

For full requirements see requirements.txt. The program uses Python 3 with the following main libraries:

rasterio
pyproj
pandas

Linux-based systems

Clone the repo

git clone git@github.com:simlal/bioclim-data-extraction.git
cd bioclim-data-extraction

Create a new conda environment

conda create --name bioclim --file requirements.txt
conda activate bioclim

There are some unused dependencies in there, but finding compatible versions of rasterio and pyproj is tricky, so just go ahead and use the requirements.txt as is.

Dir structure

.
├── data
│   ├── bioclim
│   ├── specs
│   │   ├── anuclim61.pdf
│   │   └── CHELSA_tech_specification_V2.pdf
│   ├── states.csv
│   └── urls
│       ├── chelsa_bioclim19_S3paths.txt
│       └── worldclim_bioclim19_30s_path.txt
├── scripts
│   ├── config.yaml
│   ├── data_extraction.py
│   ├── download.py
│   └── __init__.py
├── requirements.txt
└── run.py

We will perform our data analysis from the main directory (i.e. run.py in this example).

Download data

With wget

mkdir -p data/bioclim/
wget -i data/urls/chelsa_bioclim19_S3paths.txt -P data/bioclim/
wget -i data/urls/worldclim_bioclim19_30s_path.txt -P data/bioclim/

Directly with Python

Steps

Run download.py from the command line.

python scripts/download.py

chelsa : for Chelsa bioclim19 dataset 
worldclim : for WorldClim bioclim19 dataset 
both : for Chelsa and WorldClim bioclim datasets 

Enter which dataset(s) you would like to download.

Entering a dataset keyword (or 'both') starts the download.

The terminal displays some information about the progress and speed of the download. Note that the download proceeds in chunks.
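For readers unfamiliar with chunked transfers, here is a minimal sketch of the idea: data is read and written in fixed-size pieces, so a large GeoTIFF never has to fit in memory and progress can be reported after each piece. The helper `copy_in_chunks` and the 8192-byte chunk size are assumptions for illustration, not the code in download.py; an in-memory stream stands in for the network response.

```python
import io


def copy_in_chunks(source, destination, chunk_size=8192):
    """Copy a binary stream piece by piece and return the number of bytes written."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means the stream is exhausted
            break
        destination.write(chunk)
        total += len(chunk)  # a progress display could be updated here
    return total


# Stand-in for a remote file: 20000 bytes held in memory.
source = io.BytesIO(b"x" * 20000)
destination = io.BytesIO()
written = copy_in_chunks(source, destination, chunk_size=8192)
print(written)
```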

Extract data for bioclim 1 to 19 + elevation variables

For a single data point

Use the CrsDataPoint class directly from run.py

Create a CrsDataPoint instance

>>> from scripts.data_extraction import CrsDataPoint

>>> sherby_3857 = CrsDataPoint('Sherbrooke', epsg=3857, x=-8002765.769038227, y=5683742.6823244635)
>>> sherby_3857.get_info()
id : Sherbrooke
EPSG : 3857
Map coordinates :
        x = -8002765.769038227
        y = 5683742.6823244635
        (x,y) = (-8002765.769038227, 5683742.6823244635)

Transform to other EPSG if needed

>>> sherby_4326 = sherby_3857.transform_crs()    # default is 4326 (aka GPS)
>>> sherby_4326.get_info()
id : Sherbrooke_transformed
EPSG : 4326
Map coordinates :
   x = -71.89006805555556
   y = 45.39386888888889
   (x,y) = (-71.89006805555556, 45.39386888888889)

Since both Worldclim and Chelsa db (GeoTIFF) are encoded with the EPSG:4326 coordinate reference system (crs), we need to convert the x/y coords to EPSG:4326.

The method transform_crs() will be automatically called when encountering a non-"4326" object and convert the coordinates accordingly.
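To make the conversion concrete for the EPSG:3857 example above, the sketch below applies the standard inverse spherical Mercator formulas using only the standard library. This is not how transform_crs() works internally (the program relies on pyproj, which handles any CRS); the function name `web_mercator_to_lonlat` is hypothetical and the formulas are valid only for EPSG:3857 input.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857 (WGS 84 semi-major axis)


def web_mercator_to_lonlat(x, y):
    """Convert EPSG:3857 map coordinates (meters) to EPSG:4326 lon/lat (degrees)."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(math.atan(math.sinh(y / EARTH_RADIUS_M)))
    return lon, lat


# Sherbrooke coordinates from the example above
lon, lat = web_mercator_to_lonlat(-8002765.769038227, 5683742.6823244635)
print(f"lon={lon:.3f} lat={lat:.3f}")
```

The result should agree with the transformed coordinates reported by get_info() to within rounding.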

Extract bioclim 1 to 19 + elevation from WorldClim or Chelsa datasets

Get values and relevant metadata

>>> from scripts.data_extraction import CrsDataPoint

>>> sherby_3857 = CrsDataPoint('Sherbrooke', epsg=3857, x=-8002765.769038227, y=5683742.6823244635)
>>> sherby_4236_chelsa = sherby_3857.extract_bioclim_elev(dataset='chelsa')     # As dictionary
Data point with x,y other than EPSG:4326. Calling transform_crs() method...
...than extracting values for Sherbrooke_transformed at lon=-71.890 lat=45.394  for all climate variables bio1 to bio19 in CHELSA V2.1 (1981-2010)  + elevation in WorldClim 2.1 dataset...
Done!

# Convert to DataFrame
>>> import pandas as pd
>>> sherby_bio_chelsa = pd.DataFrame([sherby_4236_chelsa])
>>> with pd.option_context('display.max_colwidth', 15):
...     print(sherby_bio_chelsa)       # Output will vary based on terminal width
               id  epsg        lon        lat  bio1 (Celcius)  ... bio19 (kg / m**2 / month)  bio19_longname  bio19_explanation elevation_Meters elevation_explanation
0  Sherbrooke_...  4326 -71.890068  45.393869            6.05  ...           243.0            mean monthl...  The coldest...                158   Elevation i...      

[1 rows x 63 columns]

To get a leaner dataframe with only relevant class information + values

>>> from scripts.data_extraction import trim_data

>>> sherby_bio_chelsa_trimmed = trim_data(sherby_4236_chelsa) # From full dict to trimmed dict
>>> print(pd.DataFrame([sherby_bio_chelsa_trimmed]))       # Display as df
                       id  epsg        lon  ...  bio18 (kg / m**2 / month)  bio19 (kg / m**2 / month)  elevation_Meters
0  Sherbrooke_transformed  4326 -71.890068  ...                      375.8                      243.0               158

[1 rows x 24 columns]
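The drop from 63 to 24 columns is consistent with removing every `*_longname` and `*_explanation` entry and keeping only the identifiers and values. The sketch below shows that kind of filtering on a shortened record; `drop_descriptive_keys` is a hypothetical helper written for illustration, not the actual implementation of trim_data, and the descriptive texts are replaced by placeholders.

```python
def drop_descriptive_keys(record, suffixes=("_longname", "_explanation")):
    """Return a copy of the record without the descriptive metadata entries."""
    return {key: value for key, value in record.items() if not key.endswith(suffixes)}


# Shortened record: values taken from the example output, texts omitted.
full_record = {
    "id": "Sherbrooke_transformed",
    "epsg": 4326,
    "bio19 (kg / m**2 / month)": 243.0,
    "bio19_longname": "(text omitted)",
    "bio19_explanation": "(text omitted)",
    "elevation_Meters": 158,
    "elevation_explanation": "(text omitted)",
}

trimmed_record = drop_descriptive_keys(full_record)
print(list(trimmed_record))
```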

For multiple data points from a csv file

Instantiate from csv

Use the load_csv() classmethod with a csv containing the CrsDataPoint attributes as header

>>> from scripts.data_extraction import CrsDataPoint
>>> from pathlib import Path

# State capitals example
>>> csv_file = Path("./data/us-state-capitals.csv")
>>> data = CrsDataPoint.load_csv(csv_file)      # As list of CrsDataPoint objects

# First 3 capitals
>>> print(data[0:3])
[CrsDataPoint(Montgomery_Alabama, epsg=4326, x=-86.279118, y=32.361538), CrsDataPoint(Juneau_Alaska, epsg=4326, x=-134.41974, y=58.301935), CrsDataPoint(Phoenix_Arizona, epsg=4326, x=-112.073844, y=33.448457)]
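As a rough picture of what loading from a csv involves, the sketch below reads rows whose header matches the constructor arguments and builds one object per row. The `Point` dataclass and `load_points` function are hypothetical stand-ins, not the real CrsDataPoint.load_csv(), and the column names `id`, `epsg`, `x`, `y` are assumed from the constructor shown earlier.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Point:
    """Hypothetical stand-in for CrsDataPoint, holding the same four attributes."""
    id: str
    epsg: int
    x: float
    y: float


def load_points(csv_stream):
    """Build one Point per csv row, converting text fields to numbers."""
    reader = csv.DictReader(csv_stream)
    return [
        Point(row["id"], int(row["epsg"]), float(row["x"]), float(row["y"]))
        for row in reader
    ]


# In-memory csv with the first two capitals from the output above
sample = io.StringIO(
    "id,epsg,x,y\n"
    "Montgomery_Alabama,4326,-86.279118,32.361538\n"
    "Juneau_Alaska,4326,-134.41974,58.301935\n"
)
points = load_points(sample)
print(points[0])
```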

At any point, we can check the complete list of all instantiated objects with the CrsDataPoint.all attribute.

To extract the bioclim 1 to 19 + elevation values for a given database, call the extract_multiple_bioclim_elev(specimens_list, dataset, trimmed=True) function. It returns a dataframe (trimmed or exhaustive depending on the trimmed arg).

>>> from scripts.data_extraction import extract_multiple_bioclim_elev

# Extract bioclim values for all states
>>> df_trimmed = extract_multiple_bioclim_elev(data, 'worldclim', trimmed=True) # if False : full df
Extracting values for Montgomery_Alabama at lon=-86.279 lat=32.362  for all climate variables bio1 to bio19  + elevation in WorldClim 2.1 (1970-2000) dataset...
Done!
Extracting values for Juneau_Alaska at lon=-134.420 lat=58.302  for all climate variables bio1 to bio19  + elevation in WorldClim 2.1 (1970-2000) dataset...
Done!
...
Extracting values for Cheyenne_Wyoming at lon=-104.802 lat=41.146  for all climate variables bio1 to bio19  + elevation in WorldClim 2.1 (1970-2000) dataset...
Done!
>>> df_trimmed = df_trimmed.set_index('id')

# Checking the first five capitals
>>> print(df_trimmed.head())
                       epsg         lon        lat  bio1 (Celcius)  ...  bio17 (kg / m**2 / month)  bio18 (kg / m**2 / month)  bio19 (kg / m**2 / month)  elevation_Meters
id                                                                  ...                                                                                                   
Montgomery_Alabama     4326  -86.279118  32.361538       18.850000  ...                      264.0                      330.0                      392.0                88
Juneau_Alaska          4326 -134.419740  58.301935        5.045834  ...                      306.0                      409.0                      463.0                45
Phoenix_Arizona        4326 -112.073844  33.448457       22.787500  ...                       13.0                       53.0                       59.0               336
Little Rock_Arkansas   4326  -92.331122  34.736009       16.950001  ...                      270.0                      275.0                      301.0               104
Sacramento_California  4326 -121.468926  38.555605       16.587500  ...                        9.0                        9.0                      249.0                 9

[5 rows x 23 columns]

Then save to csv

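>>> data_dir = Path("./data")      # assumed: same directory as the input csv
>>> bioclim_out = data_dir / "us-capitals_bioclim.csv"
>>> if not bioclim_out.is_file():      # do not overwrite an existing file
...     df_trimmed.to_csv(bioclim_out)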

Data visualization

All visualizations are made with the Plotly graphing library for Python. Run the data_viz.py script from the command line with the previously generated csv as follows:

python data_viz.py us-capitals_bioclim.csv

The data_viz.py script was written quickly, just to provide a visualization of the climate data pipeline.

Mapbox

Example with scatterplot on Mapbox

Dotplot for bioclim + elev

Example with bio1 selection from dropdown menu

Example with elevation selection from dropdown menu

Contact

Feel free to contact me by email or any other platform mentioned in my GitHub profile for any questions or feedback!
