Clay v1 Pipeline Spec #178
Comments
Love this idea. I believe this also restricts us to using COG data on S3 (for speed and windowed reads), but that's an OK cost to bear imo. We should just make sure we can always fall back to static pre-made batches (important to keep that functionality for fine-tuning and for users who want to continue training, potentially offline).
Thanks for the feedback @brunosan. The fallback would be to run this pipeline but store all tiles permanently. That should be fine for smaller datasets, but then maybe this mechanism is overkill and the user is better off simply creating chips with their own scripts. In any case, the proposed mechanism could be used to permanently store chips as well.
@yellowcap Thanks for doing this write-up 🚀, I think it is a good summary of some of the discussions we had. I'll try to carve out some time this week to add tickets for
A few questions/comments about the approach you've described. On the STAC side
On the index table side, for optimized performance we may want to consider inlining all of the relevant STAC item information into columns of a Parquet dataset (with the assets dictionary stored in a serialized column) to avoid having to load and parse a very large JSON index file. This may only become valuable once our index grows beyond ~1M rows. On the temporary data loading side, the depth of my PyTorch knowledge is fairly limited, but we will also need hooks within the training loop to publish an external notification when a batch is completed, so that our infrastructure can remove the assets associated with that batch from fast temporary storage. We will likely also need some mechanism to "peek" at the next batch from the DataLoader's iterator so that it can be preloaded to disk. (This would achieve our temporary-storage "streaming" goals and allow us to use our fast disk capacity efficiently.)
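A minimal sketch of those two hooks, assuming hypothetical `notify_batch_done` and `prefetch_to_disk` callables provided by our infrastructure; in practice they would act on the batch's asset references rather than on the tensors themselves:

```python
import itertools
from collections import deque


class LookaheadLoader:
    """Iterate a DataLoader with one-batch look-ahead: stage the next
    batch's assets to fast disk while the current batch trains, and
    publish a notification when a batch is done so its assets can be
    evicted from temporary storage."""

    def __init__(self, loader, notify_batch_done, prefetch_to_disk, depth=1):
        self.loader = loader                        # a torch DataLoader
        self.notify_batch_done = notify_batch_done  # e.g. publish to a queue
        self.prefetch_to_disk = prefetch_to_disk    # stage assets locally
        self.depth = depth                          # how far ahead to peek

    def __iter__(self):
        it = iter(self.loader)
        buffer = deque()
        # Prime the look-ahead buffer and stage its assets.
        for batch in itertools.islice(it, self.depth):
            self.prefetch_to_disk(batch)
            buffer.append(batch)
        for batch in it:
            self.prefetch_to_disk(batch)  # "peek": stage the next batch
            buffer.append(batch)
            current = buffer.popleft()
            yield current
            self.notify_batch_done(current)  # assets can now be evicted
        while buffer:  # drain the tail
            current = buffer.popleft()
            yield current
            self.notify_batch_done(current)
```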
In a very timely post, @jhamman et al. at Earthmover seem to have managed to stream Zarr data directly to the GPU: https://earthmover.io/blog/cloud-native-dataloader/
Here is a proposal for where to get and store, in STAC, the metadata fields that are required by the Clay v1 model. Everything except the normalization parameters is either native STAC or from the eo and raster extensions. Each of these values would be required for each band. The assumption is that each band eventually becomes its own "token" for the model. Some things like
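For illustration, a hypothetical per-band record along these lines. Keys are flattened with extension prefixes as they might appear in a columnar layout (in native STAC they live unprefixed inside the `eo:bands` and `raster:bands` arrays), and all values are placeholders:

```python
# Hypothetical per-band metadata record; field names and values are
# illustrative, not a final schema.
band_metadata = {
    "name": "B04",                      # native STAC band name
    "eo:common_name": "red",            # eo extension
    "eo:center_wavelength": 0.665,      # micrometers, eo extension
    "eo:full_width_half_max": 0.038,    # micrometers, eo extension
    "raster:spatial_resolution": 10,    # meters, raster extension
    "raster:data_type": "uint16",       # raster extension
    "raster:nodata": 0,                 # raster extension
    "clay:normalization_mean": 1349.2,  # custom field, placeholder value
    "clay:normalization_std": 854.7,    # custom field, placeholder value
}
```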
@yellowcap I'm writing up a slightly modified version of this approach based on the recent demonstration from the Earthmover team: https://github.com/earth-mover/dataloader-demo. I hadn't even considered Dask's in-memory opportunistic caching as a fast intermediate store for pre-fetched chips to maintain high GPU saturation, but it is actually very well aligned with our requirements. One question, which will drive the partitioning/row-group structure of our chip index Parquet file and how we perform sampling for shuffling, is what level of reproducibility we need for batches. With this approach we'll use Dask DataFrame's sample method. We have some options for ensuring that our batch partitioning and shuffling are fully reproducible, but I'll need to consider this in our pipeline architecture. Do we need fully reproducible batch generation in this case?
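As a minimal sketch of what that sampling could look like (path and fraction are illustrative): fixing `random_state` makes the draw repeatable as long as the underlying partitioning stays the same, which is exactly the coupling between partition structure and reproducibility mentioned above.

```python
import dask.dataframe as dd

# Hypothetical chip index location; any parquet dataset works the same way.
index = dd.read_parquet("s3://clay-chips/index/")

# Dask samples per partition: the same seed over the same partition layout
# reproduces the same chips, so repartitioning breaks reproducibility.
sample = index.sample(frac=0.01, random_state=42)
batch_rows = sample.compute()
```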
Hmm, not 100% sure here. It depends on how we distribute batches across GPU nodes. Maybe we need to reproduce the sample on each node and then select from it, or the sample is created centrally and then pushed to the nodes; in that case reproducibility is less important. But I'm not sure about this, maybe @srmsoumya knows more...
There are two more issues that we have to take into account:
I have been working more on the v1 pipeline and realized that we will need cloud and nodata tracking at the chip level. I played around with creating classes where the platform-specific cloud and nodata filters can be added through class method overrides. The current code is in the PR linked above, but the specific pointer is the https://github.com/Clay-foundation/model/blob/clay-v1-pipeline/scripts/pipeline_v1/chip_indexer.py script, which can be used to try them out with the data attached below.

The latest idea is to make copies of the scenes that we want to use in a single bucket that can be hooked directly into FSx. Storing 100 TB permanently would cost about $14k per month, which is still acceptable given the credits we have. If we put the scene files, the STAC items, and a central index in a bucket, then FSx can sync that and we can expose it to our training nodes. The data loader could then do dynamic chipping from the scenes based on the index. This should be quite fast.

If we move ahead with this approach now, we gain time to do the "megabatch" part where we handle batches of scenes. I feel like if we do dynamic chipping and indexing of the chips, that should integrate well with a more batched streaming approach. What do you think @sharkinsspatial
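The actual implementation is in the linked chip_indexer.py; this is only a minimal sketch of the override pattern it describes, with hypothetical method names:

```python
class ChipIndexer:
    """Index the chips in a scene, tracking cloud and nodata per chip."""

    def __init__(self, item):
        self.item = item  # STAC item for the scene

    def cloud_cover(self, x, y):
        """Cloudy fraction of chip (x, y); platforms override this."""
        return 0.0

    def nodata_fraction(self, x, y):
        """Nodata fraction of chip (x, y); platforms override this."""
        return 0.0

    def keep(self, x, y, max_cloud=0.3, max_nodata=0.1):
        """Decide whether chip (x, y) should go into the index."""
        return (
            self.cloud_cover(x, y) <= max_cloud
            and self.nodata_fraction(x, y) <= max_nodata
        )


class Sentinel2Indexer(ChipIndexer):
    def cloud_cover(self, x, y):
        # e.g. window-read the SCL band for this chip and count cloud classes
        raise NotImplementedError
```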
Splitting the indexing and chipping functionality into a separate repo, as this can be done quite universally I think.
Got a first full version of the new v1 pipeline for Sentinel-2, see https://github.com/Clay-foundation/stacchip/blob/main/scripts/sentinel-2-processor.py This will
If we let this run through all 2500 MGRS tiles we have previously sampled, we should have all the S2 data we want for v1. We can test this at scale this week and then replicate it for other sources! Results from a single MGRS tile and one date are here
Ok, this is ready for a first serious test drive tomorrow. The index file is written properly to geoparquet, and both the STAC item and the assets are created as desired. I'll push this to Batch tomorrow; if things go smoothly we will produce 2500 MGRS tiles for 3 years with 4 seasons each, i.e. 30,000 Sentinel-2 scenes. That should be a good start!

[Screenshot: visualization of the geoparquet index using lonboard]
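For reference, inspecting the index like that takes only a couple of lines, assuming geopandas and lonboard (file name illustrative):

```python
import geopandas as gpd
from lonboard import viz

index = gpd.read_parquet("chip-index.parquet")  # geoparquet chip index
viz(index)  # interactive map of chip footprints in a notebook
```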
This is a proposal for a streaming approach that relies on static STAC Indexes and dynamic tiling.
tl;dr
Tracking scenes
The first step is creating a list of scenes from multiple sources, like MODIS, Landsat, Sentinel-2, NAIP, and drone imagery. Each source is processed separately, one by one.
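As a sketch, the per-source scene list could come from a STAC API search; the collection name, AOI, and date range here are illustrative:

```python
from pystac_client import Client

# Public Earth Search STAC API; each source gets its own search like this.
client = Client.open("https://earth-search.aws.element84.com/v1")
search = client.search(
    collections=["sentinel-2-l2a"],
    bbox=[8.0, 46.0, 9.0, 47.0],       # illustrative AOI
    datetime="2023-06-01/2023-06-30",  # illustrative date range
)
scenes = list(search.items())  # static STAC items to store centrally
```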
Track chips in each scene
Track how many chips are in each scene
The index table
We would then store all the static STAC items as JSON files in a central location and index all the chips into a single table.
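A hypothetical shape for that central table, written as geoparquet with one row per chip pointing back to its static STAC item (column names are illustrative):

```python
import geopandas as gpd
from shapely.geometry import box

index = gpd.GeoDataFrame(
    {
        "chip_id": ["S2A_32TNS_20230615_0_0"],
        "stac_item": ["s3://clay-bucket/stac/S2A_32TNS_20230615.json"],
        "chip_x": [0],          # chip column in the scene grid
        "chip_y": [0],          # chip row in the scene grid
        "cloud_cover": [0.02],  # cloudy fraction from the indexer
        "nodata": [0.0],        # nodata fraction from the indexer
    },
    geometry=[box(8.40, 46.10, 8.45, 46.15)],  # chip footprint
    crs="EPSG:4326",
)
index.to_parquet("chip-index.parquet")  # geopandas writes geoparquet
```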
The pipeline would then do random sampling into batches in two steps:
Considerations:
Data Flow
Considerations