Multi-threaded Node.js script to spatially join attributes between two GeoJSON datasets
A spatial merge can be a tedious process when the datasets are large. It can take several hours, if not days.
The trick to implementing a fast spatial join is either to work on a GPU (1,2), which requires expertise in GPU computing, or to distribute the processing load across multiple threads. The latter improves performance by utilising the multiple cores available on a machine.
This algorithm, written in pure JavaScript, distributes the processing load using the Node.js Cluster API for multi-threading, coupled with batch processing of the data on each thread. The geographical analysis is done using the Turf.js library. The data is stored in and processed from a MongoDB database. It can join two layers of 2.5 million and 650,000 features in nearly 2 hours (tested on an Intel i7-6700HQ CPU @ 2.60GHz, 16GB RAM, Windows 10).
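As a rough illustration of the distribution step, the sketch below splits a list of tile IDs into one batch per worker and hands each batch to a worker forked with the Cluster API (which starts worker processes, the "threads" referred to here). The names `partitionTiles`, `run`, and `processTile` are hypothetical and are not taken from this repository; `cluster.isPrimary` assumes Node.js 16 or later.

```javascript
const cluster = require('node:cluster');
const os = require('node:os');

// Split tile IDs into one batch per worker, round-robin, so batch sizes stay balanced.
function partitionTiles(tileIds, workerCount) {
  const batches = Array.from({ length: workerCount }, () => []);
  tileIds.forEach((id, i) => batches[i % workerCount].push(id));
  return batches;
}

// Hypothetical entry point, called from the main script in every process.
// The primary forks one worker per batch; each worker processes its batch
// tile by tile. processTile is a placeholder for the per-tile spatial join.
function run(tileIds, processTile) {
  if (cluster.isPrimary) {
    const workerCount = Math.min(os.cpus().length, tileIds.length);
    partitionTiles(tileIds, workerCount).forEach((batch) => {
      const worker = cluster.fork();
      worker.send({ tileIds: batch });
    });
  } else {
    process.on('message', async ({ tileIds: batch }) => {
      for (const id of batch) await processTile(id);
      process.exit(0);
    });
  }
}
```

Round-robin assignment is only one possible choice; if some tiles hold far more features than others, a work queue that hands out one tile at a time may balance the load better.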
Consider the two images above. One dataset contains building features and can be regarded as the inner layer; the buildings are always inside a building lot. The other dataset contains lot features and can be regarded as the outer layer. We split both datasets into tiles for batch processing. Each tile contains multiple lot features, and within each lot there are multiple building features.
- The features in both the inner and outer layers are clipped to the city boundaries.
- The area inside the city boundaries is divided into two sets of grids: square tiles and triangular tiles. Two sets of grids are necessary so that the spatial merge runs in batches over two separate, non-identical tilings. Otherwise, after a single pass, some salt-and-pepper features are left along the tile boundaries. The tile grid structure is explained with the images below.
**Leftover features at the intersection of square tiles, which need to be accounted for using triangle tiles**
- Next, the tile ID is added to each feature within the respective tile.
- The tiles are then distributed among the worker threads to perform the spatial merge.
- Each thread looks at the features within the tile allotted to it and runs the spatial join.
- This makes the process faster because each thread only scans the features within a tile instead of the whole map. This is made possible by adding tile IDs to each feature in advance.
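The per-tile join described in the steps above can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the script's actual code: it assumes each feature carries a hypothetical `properties.tileId`, uses a plain ray-casting test in place of the Turf.js predicates, and represents each building by the average of its outer-ring vertices.

```javascript
// Ray-casting point-in-polygon test against a polygon's outer ring
// ([[x, y], ...]); a stand-in for a Turf.js predicate in this sketch.
function pointInRing([x, y], ring) {
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const [xi, yi] = ring[i];
    const [xj, yj] = ring[j];
    const crosses = (yi > y) !== (yj > y) &&
      x < ((xj - xi) * (y - yi)) / (yj - yi) + xi;
    if (crosses) inside = !inside;
  }
  return inside;
}

// Vertex average of the outer ring, used as a cheap representative point.
function ringCenter(ring) {
  const pts = ring.slice(0, -1); // drop the closing vertex of a GeoJSON ring
  const sum = pts.reduce(([sx, sy], [px, py]) => [sx + px, sy + py], [0, 0]);
  return [sum[0] / pts.length, sum[1] / pts.length];
}

// Join within one tile: only features tagged with tileId are compared.
// Outer (lot) properties are added to the inner (building) properties;
// on a key clash the building's own value is kept.
function joinTile(tileId, innerFeatures, outerFeatures) {
  const inner = innerFeatures.filter((f) => f.properties.tileId === tileId);
  const outer = outerFeatures.filter((f) => f.properties.tileId === tileId);
  return inner.map((building) => {
    const center = ringCenter(building.geometry.coordinates[0]);
    const lot = outer.find((o) => pointInRing(center, o.geometry.coordinates[0]));
    return lot
      ? { ...building, properties: { ...lot.properties, ...building.properties } }
      : building;
  });
}

// Usage with made-up square features.
const square = (x, y, s) => [[x, y], [x + s, y], [x + s, y + s], [x, y + s], [x, y]];
const feature = (ring, properties) => ({
  type: 'Feature',
  properties,
  geometry: { type: 'Polygon', coordinates: [ring] },
});
const lots = [feature(square(0, 0, 10), { tileId: 1, lotId: 'A' })];
const buildings = [
  feature(square(2, 2, 2), { tileId: 1, name: 'b1' }),
  feature(square(20, 20, 2), { tileId: 1, name: 'b2' }),
];
console.log(joinTile(1, buildings, lots).map((f) => f.properties));
```

Because both layers are filtered by tile ID first, the inner loop compares a building only against the lots of its own tile, which is where the speed-up over scanning the whole map comes from.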
In this case the two datasets to be merged are the Open Map Tiles buildings (inner layer) and the MapPluto lots (outer layer).
- Convert both datasets to GeoJSON files
  - Use tippecanoe-decode to convert .mbtiles to GeoJSON
  - Use ogr2ogr to convert .shp to GeoJSON
- Import both datasets as JSON files into individual collections in a MongoDB database. A local instance should be fine. Refer to the detailed documentation regarding data conversion and importing here
- Set up the MongoDB database
  - Import the Open Map Tiles data to the `buildings` collection and the MapPluto Lots data to the `lots` collection.
  - Import the city boundary data to `cityBoundary`.
  - The `squareGrid` and `triangleGrid` collections are created by running 'step 1.' of `init.js`.
  - Modify the `config.json` values to describe the database and collection names.
// MongoDB Collection Structure
`nycdb` collections
|
|- buildings // Open Map Tiles Buildings
|- lots // MapPluto Lots
|- cityBoundary // City Boundary to clip non-intersecting features
|- squareGrid // Batchwise spatial merge for multi-threaded process
|- triangleGrid // Batch processing, second pass
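To give an idea of what the `squareGrid` collection might hold, the sketch below generates square tiles over a bounding box as GeoJSON polygons, each tagged with a tile ID. It is an assumption-laden illustration: the function name `buildSquareGrid`, the `tileId` property, and the cell size are hypothetical, and it is not stated here how `init.js` actually builds its grids (Turf.js itself ships `squareGrid` and `triangleGrid` helpers that could serve the same purpose).

```javascript
// Build square tiles covering bbox = [minX, minY, maxX, maxY].
// cellSize is in the same units as the bbox (degrees for lon/lat data).
// Tiles on the far edges are trimmed so the grid never exceeds the bbox.
function buildSquareGrid([minX, minY, maxX, maxY], cellSize) {
  const cols = Math.ceil((maxX - minX) / cellSize);
  const rows = Math.ceil((maxY - minY) / cellSize);
  const tiles = [];
  for (let col = 0; col < cols; col++) {
    for (let row = 0; row < rows; row++) {
      const x = minX + col * cellSize;
      const y = minY + row * cellSize;
      const x2 = Math.min(x + cellSize, maxX);
      const y2 = Math.min(y + cellSize, maxY);
      tiles.push({
        type: 'Feature',
        properties: { tileId: col * rows + row }, // hypothetical ID scheme
        geometry: {
          type: 'Polygon',
          coordinates: [[[x, y], [x2, y], [x2, y2], [x, y2], [x, y]]],
        },
      });
    }
  }
  return tiles;
}

// Usage: a 2.5 x 1 box with unit cells yields three tiles in one row.
console.log(buildSquareGrid([0, 0, 2.5, 1], 1).length);
```

Computing each corner from the column and row index, rather than repeatedly adding `cellSize`, avoids accumulating floating-point drift across a large grid.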
- Install Dependencies
npm install # Install dependencies
npm start
- Start Mongo Daemon
mongod
- Run `node init.js` and follow the steps listed sequentially.
Spatial-Join
Enter the step number to execute:
1. Create tile grids for batch processing (required for next steps)
2. Clip inner & outer layer features to city boundary
3. Link inner & outer layer features to tile grids
4. Spatial Merge: Add outer layer properties to inner layer properties
- Create a GUI in Electron
- Run it in the cloud with a higher thread count
- Create a Docker container