
An empirical study and comparison of Deterministic, Statistical, and ML Algorithms for the Spatial Modeling of significant wave height data from NOAA's National Data Buoy Center and other Environment-related datasets


simonsanvil/ndbc-spatial-modeling


Spatial Modeling on NDBC Data

An empirical study and comparison of Deterministic, Statistical, and ML Algorithms for the Spatial Modeling of significant wave height values collected by buoys and sea monitoring stations managed by the United States' National Data Buoy Center (NDBC). The stations studied are located near the coasts of the Southern Atlantic regions of the United States, including those on the Gulf of Mexico and parts of the Caribbean.

ML Spatial Interpolation
Temporal-Spatial Interpolation of wave height in the area at a single timestamp. Black dots indicate the points that were actually sampled. Red circles indicate points that were excluded from the training data.

Techniques Studied:

Technical Approach:

Timeseries of wave height measurements from buoy #42019
  • General preprocessing was done with a kedro pipeline that detects and parses missing values, formats the columns, and converts the data to GeoParquet format (Geopandas was used for read/write operations and for working with the data as geospatial data).
  • The data was then split into training and test sets. The test set itself consisted of several subsets of selected data, each of which was used to evaluate the performance of the algorithms based on the specific spatial configuration of the buoys available in each set.
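The missing-value handling mentioned in the preprocessing step can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the kedro pipeline from the repository. The placeholder tokens, the `WVHT` column name, and the helper names are assumptions; check them against the actual NDBC files before relying on them.

```python
import math

# Assumption: tokens that mark a missing wave-height reading.
# NDBC files commonly use "MM" or runs of 9s; adjust to the real data.
MISSING_TOKENS = {"", "MM", "99", "99.0", "99.00", "999", "999.0"}


def parse_wave_height(raw):
    """Return the wave height as a float, or NaN if the value is missing."""
    token = raw.strip()
    if token in MISSING_TOKENS:
        return math.nan
    try:
        return float(token)
    except ValueError:
        # Unparseable text is treated as missing instead of raising.
        return math.nan


def format_rows(rows):
    """Rename columns and parse wave heights; rows with NaN are kept."""
    return [
        {
            "station_id": row["station_id"],
            # "WVHT" is assumed to be the raw wave-height column name.
            "wave_height": parse_wave_height(row["WVHT"]),
        }
        for row in rows
    ]
```

Keeping missing readings as NaN, instead of dropping the rows, leaves the decision about how to handle gaps to later steps.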
Available Buoys per Set
Test subsets evaluated in this area. Inside red circles are the buoys that were not available in the training set of each period mentioned.
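One way to picture a test subset like the ones above is as a period of time plus a set of buoys withheld from training during that period. The sketch below assumes that reading of the text; the function name, the record layout, the second station id, the dates, and the wave-height values are all illustrative, not taken from the repository.

```python
from datetime import datetime


def split_for_period(records, held_out_stations, start, end):
    """Split the records in [start, end) into (train, test) lists.

    Records from held-out stations form the test subset; records from all
    other stations in the same period form the training data.
    """
    train, test = [], []
    for rec in records:
        if not (start <= rec["timestamp"] < end):
            continue  # outside the evaluated period
        if rec["station_id"] in held_out_stations:
            test.append(rec)
        else:
            train.append(rec)
    return train, test


# Illustrative records only; the values are made up.
records = [
    {"station_id": "42019", "timestamp": datetime(2021, 1, 1), "wave_height": 1.2},
    {"station_id": "42020", "timestamp": datetime(2021, 1, 1), "wave_height": 1.5},
    {"station_id": "42019", "timestamp": datetime(2021, 2, 1), "wave_height": 0.9},
]
train, test = split_for_period(
    records,
    held_out_stations={"42020"},
    start=datetime(2021, 1, 1),
    end=datetime(2021, 1, 15),
)
```

Splitting by station instead of by random row keeps every reading from a withheld buoy out of training, so the error measured on that buoy reflects prediction at an unseen location.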
  • Evaluation was conducted by writing an individual MLFlow experiment for each algorithm. The experiments were then executed in parallel on each of the subsets of the test data (see the experiments/ directory for examples).
  • The results of the experiments were then analyzed by comparing the performance of the algorithms on the test sets.
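The evaluation and comparison steps amount to scoring every algorithm on every test subset. The sketch below shows that loop with RMSE as the metric, assuming each algorithm is wrapped as a callable that maps coordinates to predicted wave heights. It leaves out MLFlow tracking and parallel execution, and the names `rmse` and `evaluate` are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def evaluate(models, test_subsets):
    """Return {model_name: {subset_name: rmse}} for every model/subset pair.

    models: name -> callable taking a list of (lon, lat), returning predictions
    test_subsets: name -> (list of (lon, lat), list of observed heights)
    """
    results = {}
    for model_name, predict in models.items():
        results[model_name] = {
            subset_name: rmse(observed, predict(points))
            for subset_name, (points, observed) in test_subsets.items()
        }
    return results
```

The nested dictionary can then be averaged per subset or per model to produce summaries such as an average RMSE per test set.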

Results:

The results of the study favour the use of ML algorithms over the other methods when they are paired with a strong feature set that captures the spatial distribution of the data well. While the ML algorithms achieve errors similar to those of the other algorithms on sets that test interpolation inside the convex hull of the data (sets A, B, C), they perform much better on points that require extrapolation outside the convex hull of the data (sets D, E, F).
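To make the interpolation/extrapolation distinction concrete, the sketch below checks whether a target point falls outside the convex hull of the training locations. It uses Andrew's monotone chain algorithm and treats (lon, lat) pairs as planar coordinates, which is a simplifying assumption; the function names are hypothetical and not part of the repository.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def requires_extrapolation(hull, point):
    """True if point lies strictly outside the counter-clockwise hull."""
    n = len(hull)
    if n < 3:
        # A degenerate hull has no interior, so treat every point as outside.
        return True
    return any(cross(hull[i], hull[(i + 1) % n], point) < 0 for i in range(n))
```

Under this sketch, a test buoy for which `requires_extrapolation` returns `True` would fall in the extrapolation group, and one for which it returns `False` in the interpolation group.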

Overall error metrics
Avg RMSE per test set
Visual results per evaluated technique

Of the two ML methods, gradient boosting (LightGBM) was the more successful: it was more accurate than Random Forest and also about 3x faster at inference.
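An inference-time comparison of this kind can be approximated with a small timing helper like the one below. It is a generic sketch, not the measurement procedure used in the study: `predict` stands for any model's prediction callable, and the helper name and `repeats` parameter are assumptions.

```python
import time


def best_inference_time(predict, inputs, repeats=5):
    """Return the fastest wall-clock time in seconds over several predict calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        predict(inputs)
        best = min(best, time.perf_counter() - start)
    return best
```

Dividing the time measured for one model by the time measured for another on the same inputs gives a speed-up factor. Taking the fastest of several runs reduces the influence of one-off delays such as cache warm-up.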
