stacchip

Dynamically create image chips for earth observation machine learning applications using a custom chip index based on STAC items.

Get a STAC item, index its contents, and create chips dynamically like so:

# Get item from an existing STAC catalog
item = stac.search(...)

# Index all chips that could be derived from the STAC item
index = Indexer(item).create_index()

# Use the index to get RGB array for a specific chip
chip = Chipper(index).chip(x=23, y=42)

Motivation

Remote sensing imagery is typically distributed in large files (scenes) that have on the order of ten thousand pixels in both the x and y directions. This is true for systems like Landsat, Sentinel-1 and Sentinel-2, and aerial imagery such as NAIP.

Machine learning models operate on much smaller image sizes. Many models use 256x256 pixel inputs, and even the largest inputs are in the range of 1000 pixels.

This poses a challenge to modelers, as they have to cut the larger scenes into pieces before passing them to their models. The smaller image snippets are typically referred to as "chips", a term we will use throughout this documentation.

Creating imagery chips tends to be a tedious and slow process, and it is specific to each model. Models have different requirements for image size, datatype, and which spectral bands to include. A set of chips that works for one model might be useless for the next.

Systematizing how chips are tracked and making chip creation more dynamic is a way to work around these difficulties. This is the goal of stacchip. It presents an approach that leverages cloud optimized technology to make chipping simpler, faster, and less static.

Overview

Stacchip relies on three cloud oriented technologies: Cloud Optimized GeoTIFFs (COG), SpatioTemporal Asset Catalogs (STAC), and GeoParquet. Instead of pre-creating millions of files of a fixed size, chips are indexed first in tables, and then created dynamically from the index files when needed. The imagery data itself is kept in its original format and referenced in STAC items.

Creating chips with stacchip is composed of two steps:

  1. Create a stacchip index from a set of STAC items
  2. Dynamically create pixel arrays for any chip in the stacchip index

Indexes can be created separately for different imagery sources, and combined into larger indexes when needed. This makes mixing different imagery sources simple, and allows for flexibility during the modeling process, as imagery sources can be added and removed by only updating the combined index.

The mechanism is purposefully kept as generic as possible. The index creation is done based on a STAC item alone; no other input is needed. Obtaining image data for a chip that is registered in a stacchip index only requires a few lines of code.

The indexer

The Indexer class is built to create a chip index for data registered in a STAC item. The indexer calculates the number of available chips in a STAC item for a given chip size. The resulting chip index is stored as a geoparquet table.

The following example creates an index for the Landsat-9 STAC item that is included in the test suite:

from pystac import Item
from stacchip.indexer import LandsatIndexer

item = Item.from_file(
    "tests/data/landsat-c2l2-sr-LC09_L2SR_086107_20240311_20240312_02_T2_SR.json"
)
indexer = LandsatIndexer(item)
index = indexer.create_index()
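
The resulting index can then be written to disk so that it can be loaded again later, as the chipper and merging examples below do. A minimal sketch, assuming create_index() returns a pyarrow-compatible table and using a placeholder output path:

from pyarrow import parquet as pq

# Persist the chip index as a parquet file for later use.
pq.write_table(index, "landsat-chip-index.parquet")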

Nodata and cloud coverage

Earth observation data often comes in scenes that contain nodata pixels, and the imagery might contain clouds. Statistics on nodata and cloud cover are relevant information for model training. Typically a model is trained with limited amounts of nodata and cloud pixels.

The indexer therefore needs to track these two variables so that the modeler can choose how many nodata and cloudy pixels should be passed to the model. However, how this information is stored varies between image sources.

The indexer class might need adaptation for new data sources. In these cases, the base class has to be subclassed and the get_stats method overridden to produce the right statistics.

The stacchip library has a generic indexer for sources that have neither nodata nor cloudy pixels. It has one indexer that takes a nodata mask as input but assumes that there are no cloudy pixels (useful for Sentinel-1). It also contains specific indexers for Landsat and Sentinel-2. For more information, consult the reference documentation.
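
If none of the built-in indexers fit a new data source, the base class can be subclassed as described above. The sketch below is illustrative only: the base class name, the get_stats return structure, and the chip-count attribute are assumptions, so consult the built-in indexers for the actual interface.

import numpy as np

from stacchip.indexer import ChipIndexer  # assumed name of the base class


class ConstantStatsIndexer(ChipIndexer):
    """Hypothetical indexer for a source with no clouds and no nodata pixels."""

    def get_stats(self):
        # Report zero cloud cover and zero nodata for every chip. Both the
        # return structure and the chip-count attribute used here are
        # assumptions; check the built-in indexers for the real signature.
        nr_of_chips = self.size
        return np.zeros(nr_of_chips), np.zeros(nr_of_chips)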

Chipper

The Chipper class can be used to create chips based on an existing stacchip index.

There are multiple ways to instantiate the Chipper class: point to a parquet file on S3, point to a local parquet file, or pass a geoparquet table object directly. Once instantiated, any chip can be generated for a chip index, or all the chips can be returned by iterating over the chipper.

The following code snippet gives an example that loads the index from a local path.

from stacchip.chipper import Chipper
import geoarrow.pyarrow.dataset as gads

# Load a stacchip index table
dataset = gads.dataset("/path/to/parquet/index", format="parquet")
table = dataset.to_table()

# Get data for a single chip
row = 42
chipper = Chipper(
    bucket="clay-v1-data",
    platform=table.column("platform")[row],
    item_id=table.column("item")[row],
)
chip_index_x = table.column("chip_index_x")[row].as_py()
chip_index_y = table.column("chip_index_y")[row].as_py()
data = chipper.chip(chip_index_x, chip_index_y)
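
As mentioned above, the chipper can also be iterated to produce every chip registered in the index for the item. A minimal sketch, assuming each iteration yields the data for one chip; the exact structure of the yielded objects is not shown here:

# Collect all chips registered in the index for this item.
all_chips = list(chipper)
print(f"Generated {len(all_chips)} chips")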

Merging indexes

Stacchip indexes are geoparquet tables, and as such they can be merged quite easily into a single table. The recommendation is to store the stacchip index for each STAC item in its own subfolder; the files can then be merged, and the STAC item tracked via the folder structure, using the partitioning feature from pyarrow.

The following example assumes that each index file from a single STAC item is in a subfolder that is named after the STAC item id.

from pyarrow import dataset as ds

part = ds.partitioning(field_names=["item_id"])
data = ds.dataset(
    "/path/to/stacchip/indices",
    format="parquet",
    partitioning=part,
)
ds.write_dataset(
    data,
    "/path/to/combined-index",
    format="parquet",
)
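
Once combined, the index can be queried like any other parquet dataset, for example to pull out all chips that belong to a single STAC item. A minimal sketch, assuming the item_id partition field is carried over as a regular column in the combined index; the item id value is a placeholder:

from pyarrow import dataset as ds

# Open the combined index and keep only the rows of one STAC item.
combined = ds.dataset("/path/to/combined-index", format="parquet")
subset = combined.to_table(filter=ds.field("item_id") == "some-item-id")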

Processors

To use stacchip for an existing imagery archive, the indexes need to be created for each scene or STAC item.

Stacchip comes with processors that can be used to collect and index imagery from multiple data sources. This will be extended as the package grows.

Each processor is registered as a command line utility so that it can be scaled easily. Note that these processors are created to work well with AWS Batch, but they are not dependent on it and can also be run locally or anywhere else.

Sentinel-2

The stacchip-sentinel-2 processor CLI command processes Sentinel-2 data. It processes MGRS tiles from a list of tiles in a layer that can be opened by geopandas.

Each MGRS tile is identified by its row index in the source file.

For each tile, it will process the least cloudy image in each quarter from two random years between 2018 and 2023.

The script uses environment variables to determine all inputs:

  1. The index of the MGRS tile to be processed from the source file
  2. The source file for the MGRS tile sample
  3. A target bucket for writing the assets, STAC items, and stacchip index

An example set of environment variables to run this script is:

export AWS_BATCH_JOB_ARRAY_INDEX=0
export STACCHIP_MGRS_SOURCE=https://clay-mgrs-samples.s3.amazonaws.com/mgrs_sample_v02.fgb
export STACCHIP_BUCKET=clay-v1-data

Landsat

The stacchip-landsat processor CLI command processes Landsat data. It will process a list of geometries from a layer that can be opened by geopandas. For each row, it uses the centroid of the geometry to search for Landsat scenes.

For each geometry it will process the least cloudy image in each quarter from two random years between 2018 and 2023. For one year it collects L1 data, and for the other year L2 data. The platform is either Landsat-8 or Landsat-9, depending on availability and cloud cover.

The script uses environment variables to determine all inputs:

  1. The index of the geometry to be processed from the source file
  2. The source sample file
  3. A target bucket for writing the assets, STAC items, and stacchip index

An example set of environment variables to run this script is:

export AWS_BATCH_JOB_ARRAY_INDEX=0
export STACCHIP_SAMPLE_SOURCE=https://clay-mgrs-samples.s3.amazonaws.com/mgrs_sample_v02.fgb
export STACCHIP_BUCKET=clay-v1-data

NAIP

The stacchip-naip processor CLI command processes imagery from the National Agriculture Imagery Program (NAIP).

The sample locations were created using the Natural Earth database as a source. The sample includes all populated places, protected areas and parks, airports, and ports. In addition, we sampled one random point along each river, and one random location within each lake that is registered in Natural Earth. Finally, we sampled 4000 random points. All data was filtered to be within the CONUS region.

Similar to the other processors, the input variables are provided using env vars.

An example set of environment variables to run this script is:

export AWS_BATCH_JOB_ARRAY_INDEX=0
export STACCHIP_SAMPLE_SOURCE=https://clay-mgrs-samples.s3.amazonaws.com/clay_v1_naip_sample_natural_earth.fgb
export STACCHIP_BUCKET=clay-v1-data

LINZ

The stacchip-linz processor CLI command processes data from the high resolution open aerial imagery of New Zealand.

As a sample, we randomly selected 50% of the scenes, with a minimum of 10 and a maximum of 2000 scenes for each catalog that was included. We selected the latest imagery for each of the available regions of New Zealand. The list of catalogs is in the linz processor file.

We also resample all the imagery to 30cm so that the data is consistent.

Similar to the other processors, the input variables are provided using env vars.

An example set of environment variables to run this script is:

export AWS_BATCH_JOB_ARRAY_INDEX=0
export STACCHIP_BUCKET=clay-v1-data

Batch processing

The following base image can be used for batch processing. Installing the package will include the command line utilities for each processor.

FROM python:3.11

RUN pip install https://github.com/Clay-foundation/stacchip/archive/refs/heads/main.zip

Prechip

In cases where chips need to be computed in advance, the stacchip-prechip CLI script is a helper to create npz files from the chips.
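
The resulting files can be read back with numpy when preparing training batches. A minimal sketch, where the file path is a placeholder and the array names stored in each file depend on how the chips were created:

import numpy as np

# Open one precomputed chip file and list the arrays it contains.
with np.load("/path/to/chips/chip-0.npz") as data:
    print(data.files)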