
Try Google BigQuery #15

Open

jameshalgren opened this issue Oct 7, 2022 · 28 comments

@jameshalgren
Contributor

jameshalgren commented Oct 7, 2022

We have a lot of data from the national water model in here and here. And we potentially ~~need~~ want* to get large volumes of that data out very quickly. In general, the extractions will be in the form of time series of forecasts, sometimes combined in their entirety and sometimes sliced to specific sub-parts of the forecast.

We want to try an experiment in BigQuery to see if that platform, proprietary though it is, can render such access.

@michaelames @danames or @KMarkert (welcome!) -- if you have already started an issue, please link it over here.

*My better self forced me to caveat this. Maslow would suggest that we eat first. But without water, we can't grow food!

@jameshalgren
Contributor Author

We could explore something similar with Amazon tools...

@jameshalgren
Contributor Author

This might be similar to what we were already trying to look at with OpenDataCube...

@jameshalgren
Contributor Author

jameshalgren commented Oct 7, 2022

The Google cloud bucket description page has a hint about BigQuery already being used to index some of these data. Perhaps that's something to explore. Did they mean to make it more easily accessible with BigQuery already?

Key metadata for this dataset has been extracted into convenient BigQuery tables, one for each product type (analysis, short-range, medium-range, etc). These tables can be used to query metadata in order to filter the data down to only a subset of raw netcdf4 files available in Google Cloud Storage.
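
Something like the following could be a starting point for exploring those metadata tables -- a minimal sketch only, assuming a hypothetical metadata table and columns (the actual public-dataset schema would need to be checked):

```sql
-- Hypothetical metadata table and columns: list the GCS paths of the
-- short_range channel files for a single reference time.
SELECT file_path
FROM `some-project.nwm_metadata.short_range`
WHERE reference_time = TIMESTAMP '2022-10-07 00:00:00 UTC'
  AND file_path LIKE '%channel_rt%'
ORDER BY file_path;
```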

@jameshalgren
Contributor Author

jameshalgren commented Oct 11, 2022

ping @swainn, @ZacharyWills @mgdenno
All -- For this initial test, what if we start by just making the forecast information (and potentially the retrospective data) available in BigQuery for only the USGS gaged sites?

This table has most of the current features:
https://github.com/NOAA-OWP/hydrotools/blob/main/python/nwm_client/src/hydrotools/nwm_client/data/RouteLink_CONUS_NWMv2.1.6.csv
'nwm_feature_id' in that table will correspond to the 'feature_id' field in the data tables in the GCP bucket. We could also use the 'gage_id' field in the GCP buckets -- I think it's stored as a 16-character byte string, so there are a couple of pitfalls to avoid, but as long as it has a true integer value (the USGS Gage ID), that would indicate that the number is connected to a gage.

This is the help page for the library that implements a crosswalk between the gages and the NWM channel network (using a less human-readable but more maintainable version of the current set of gages):
https://noaa-owp.github.io/hydrotools/hydrotools.nwm_client.utils.html#national-water-model-file-utilities

As a warning, not all the information in there is perfect:
NOAA-OWP/hydrotools#206
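
A rough sketch of what the gaged-site filter might look like in BigQuery, assuming the RouteLink CSV is loaded as a hypothetical crosswalk table (the table and column names below are placeholders):

```sql
-- Keep only channel output at reaches whose crosswalk entry carries a real
-- (integer-valued) USGS gage ID. gage_id is assumed to arrive as a padded string.
SELECT c.*, x.gage_id
FROM `nwm.channel_rt` AS c
JOIN `nwm.routelink_crosswalk` AS x
  ON c.feature_id = x.nwm_feature_id
WHERE SAFE_CAST(TRIM(x.gage_id) AS INT64) IS NOT NULL;
```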

@jameshalgren
Contributor Author

jameshalgren commented Oct 20, 2022

Thinking towards putting this data in a data cube (BigQuery or otherwise)...

The data are pretty simple. Each data value has a reference time (when the forecast was considered valid), a time (which can be inferred from the reference time and file name, which contains the time offset from the reference time), and a spatial reference. The spatial reference is a 1 km grid for the land file, a 250m subset of the same grid for the terrain output, a completely separate polyline vector set for the channel, and a similar point vector set for the reservoir output.

Todo: Get the polyline somewhere on public HTTP -- ask @aaraney?
Todo: Confirm that the forcing files are hourly and on the same spatial domain as the land parameters (presumably after pre-processing ... we might be able to create a pipeline from the GFS/HRRR/MRMS output directly to the pre-processed NWM inputs. Maybe an opportunity for AI/ML)

The most requested data are the channel_rt outputs for all the forecast horizons and geographies (analysis_assim, short_range, medium_range, long_range). Optimizing access to those variables will be the most noticeable. After that, making the forcing values available would be most useful.

Note: This is because the most easily measured value relative to the NWM output is the stream stage, which has its closest analogy in the flow coming out of the channel_rt outputs. With more experiments, there could be a lot of attention paid to the soil moisture values (in the land output) or the water on terrain (terrain_rt) and comparison of these to things coming out of remote sensing platforms (stuff like this soil moisture product. Or this flood product.)

In the channel files, there is no useful distinction between the PR_VI, HI, and CONUS outputs -- meaning someone could reasonably want to query and aggregate them as a single dataset. The HI data produce some 15-minute output whereas all others are hourly, so some attention will need to be paid to that.

The channel sequential feature ids are useless. The link field is the primary ID and the to field is a key reference to the downstream link. The to could theoretically be an array if the downstream connection is a delta or some other distributary; for now, the dataset is a completely coalescing graph from the headwaters downstream. Graph representations of the topology can be built from those fields (the entire concept of the new model is built on those representations), and being able to do something like SELECT * WHERE link in upstream(link) would be pretty cool, if you can think of ways that might be implemented.
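
For what it's worth, BigQuery's recursive CTEs could support that kind of upstream traversal. A rough sketch, assuming a hypothetical topology table with link and to_link columns derived from the RouteLink fields:

```sql
-- Walk the network upstream of a placeholder outlet reach, then pull its values.
WITH RECURSIVE upstream AS (
  SELECT link, to_link
  FROM `nwm.topology`
  WHERE link = 1234                 -- placeholder outlet reach
  UNION ALL
  SELECT t.link, t.to_link
  FROM `nwm.topology` AS t
  JOIN upstream AS u
    ON t.to_link = u.link           -- reaches draining into the current set
)
SELECT c.*
FROM `nwm.channel_rt` AS c
WHERE c.feature_id IN (SELECT link FROM upstream);
```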

The time and reference_time dimensions are key.

We could add spatial indexes for HUC regions, states, RFCs, and other sensible aggregations -- not currently in there, but something that would make sense. Those kinds of spatial indexes would be more important than the strict Lat-Lon. Also, as I mentioned earlier in this thread, the most queried values will be the channel results for the gages (that link takes you to a utility for crosswalking between the NWM and USGS ids). Todo: find or build a super simple example of that...

@jameshalgren
Contributor Author

Regarding most queried data, this comment from @jarq6c is helpful.

I'd also add that the proposed queries seem to skew toward time series at one (or a few) channel features. This makes sense because it is the most common use-case (or rather, the most commonly requested capability). However, it seems to leave the question of scalability unanswered. In my experience, the most common queries have been:

  1. Long time series of simulated streamflow (ANA_NoDA) for an arbitrary channel reach. Reaches are not necessarily limited to USGS gage locations, since some municipalities and academic inquiries manage their own gauge networks, but know which NWM reaches correspond to their gauges.
  2. Multiple forecast streamflow time series (18 hours to 10 days) over a specified period for an arbitrary channel reach.
  3. Multiple simulated streamflow time series from a particular spatially contiguous region affected by a significant flooding event. Spatial scales can vary from a small town or US county to a third of the continent following wide scale tropical events.
  4. Same as 3, but forecast streamflow across multiple issue times.

Evaluations have tended to involve a comparison of (a) simulated vs observed, (b) forecast vs observed, or (c) forecast vs simulated.

Regarding his last point, I wonder what might be the value (out of scope for now) of sticking USGS data in this repository, keyed to the same spatial reference...

@KMarkert

Agreed, the data is fairly straightforward and can be organized in a tabular fashion. My thought is that the tables will correspond to the different file types, such as channel_rt, forcings, reservoir, land, and so on. The variables from the different files can be columns, where we add the time variables as columns, a geo column, and a column to flag what kind of run it is (i.e. assim, short range, etc). The following tables show my thoughts on setting up the tables for channel_rt and forcing:

channel_rt

| time | initialization_time | model_config | nudge | ... | qSfcLatRunoff | feature_id | geom |
| --- | --- | --- | --- | --- | --- | --- | --- |

forcing

| time | initialization_time | model_config | LWDOWN | ... | RAINRATE | geom |
| --- | --- | --- | --- | --- | --- |

Geoms for channel_rt will be the channel linestrings or catchment polygons (which is preferred here?) and the geoms for forcing will be the points for the grid.

I propose we have a separate table with the geo information from the national hydrography database with all of the attributes. That way we can do some simple geo queries directly on the NWM tables but join if we need to pull in the geo attributes.

The differences in times between the different regions will not be an issue; the time column will just have more resolution for the Hawaii areas.
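
As a sketch of what that layout could look like as DDL (the types and the exact set of variables here are illustrative assumptions, not a final schema):

```sql
CREATE TABLE `national_water_model.channel_rt` (
  time                TIMESTAMP,  -- valid time of the value
  initialization_time TIMESTAMP,  -- reference/initialization time of the run
  model_config        STRING,     -- 'assim', 'short_range', 'medium_range', ...
  nudge               FLOAT64,
  qSfcLatRunoff       FLOAT64,
  streamflow          FLOAT64,    -- remaining channel_rt variables as columns
  feature_id          INT64,
  geom                GEOGRAPHY   -- channel linestring or catchment polygon
);
```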

@KMarkert

Structuring the data in this manner will make it possible to address the use cases mentioned and get time series out based on different model runs, spatial location, time, and other metadata. Here is an example query pulling forecasts from the short- and long-range runs for a reach based on an initialization time, assuming the structure I posted above:

SELECT *
FROM `nwm.channel_rt`
WHERE
    feature_id = XXXXXX AND
    initialization_time > TIMESTAMP '2022-10-20' AND
    model_config IN ('short_range', 'long_range')
ORDER BY model_config, time

@danames

danames commented Oct 20, 2022

So it turns out all the github notifications about this thread have been going to my email spam. I've whitelisted it so hopefully I can contribute here....

That being said, here's what I just sent by email.

I propose (and would seriously prefer) that we work together to design a solid table structure for the data before any of this is implemented. From what I understand, it almost doesn't matter how we structure the tables - Google BQ will still pull the data out fast. That being said, I would like to organize a table structure in BQ that is logical, interpretable, and conceptually accessible/understandable to/by hydrologic scientists. None of these items were taken into consideration when the NWM developers decided to just dump out continent scale time step based NetCDF files.

We can do better than that for sure. Here are some design principles I'd like us to consider in this process:

The first design principle would be that we create separate tables for each model configuration (short_range, medium_range, and long_range) as well as for the analysis and assimilation data set and the forcing files. While it would be possible to put lots of this into one massive table - and BQ probably doesn't care - let's please break them apart into separate tables for the sake of simplifying writing SQL queries and for the sake of having a more clearly defined architecture for education, outreach, and utilization.

The second design principle would be to structure the forecast tables and add the data to the tables BY REACH not BY DATE/TIME. So in other words, the reach_id is the first class citizen. I would propose that the data be entered in the table in rows with reach_id, followed by model_run_datetime (the date/time that the model was executed), the variable of interest (i.e. channel flow), forecast datetime (the time step of the actual forecast), value (the value of the variable of interest). By doing this, a SQL query for a particular forecast for a particular variable, for a specific model run and model configuration would find the needed results sequentially in the table. Again, BQ may not care about this, but from a conceptual point of view it makes so much more sense than to have 2.7 million rows for one time step, followed by 2.7 million rows for the next time step. I would wager that our queries will run much faster if we load the data as time series at reaches rather than time stamps.
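
To make that second principle concrete, a query against a table organized this way might look something like the sketch below (the table and column names are placeholders following the row layout described above):

```sql
-- One forecast trace for one variable at one reach from one model run,
-- read out sequentially by forecast time step.
SELECT forecast_datetime, value
FROM `national_water_model.channel_rt_shortrange`
WHERE reach_id = 1234                                      -- placeholder reach
  AND model_run_datetime = TIMESTAMP '2022-10-20 00:00:00 UTC'
  AND variable = 'streamflow'
ORDER BY forecast_datetime;
```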

The third design principle would be that we need to be geospatially connected from the get-go. Each reach_id needs to be a foreign key into a table of reaches with shape and point geometries (full polyline stream segment AND xy representative point). To facilitate queries we should also directly add HUC IDs (at least HUC 12) to the geometry table.

I've started a table design document here: https://docs.google.com/presentation/d/1-qJ2C91Bs9T8CYjECkibvSlKvCrpb-jzKAlWlHZkDC4/edit#slide=id.g16fd5ac5941_0_7 to capture some of these ideas.

@danames

danames commented Oct 20, 2022

Some of my comments feel a bit like non sequiturs after reading the existing discussion, but... I think the principles are still good. Separate tables for the 3 main model configurations would be preferable to one big one, data should be ingested into the tables BY REACH rather than BY TIME STEP, and we should be highly geospatially connected from the start...

@KMarkert

KMarkert commented Oct 20, 2022

@danames Thank you for the thoughts and I understand your perspective. From the Google side, we want YOU/the experts to tell us the best way to format things so that they are useful. With that, I have a few comments:

first design principle

This is doable; the only sticky point from my perspective is that we go from 4 tables (different file types) to 4xn tables, where n is the number of model runs, so some thought needs to be put into how to manage this from the BQ end. BQ uses datasets to organize multiple tables. So would it be preferable to structure things under a larger nwm dataset and have multiple tables per type and model (e.g. channel_rt_assim and channel_rt_longrange)? Or would you like multiple datasets for the different types, with tables for model runs (e.g. nwm_channel_rt.assim, nwm_forcing.long_range)? The first option seems more logical and maintainable from my (and Google's, as the maintainer) perspective, but I would love to hear your thoughts.

second design principle

With the way BQ works, this isn't necessarily an issue. We can order the columns in whatever way makes the most sense (time before reach) but it won't change performance. The only time performance will come into play is with partitioning tables. This will help reduce the volume of data scanned for queries and reduce time and costs. Partitioning by reaches can certainly work.

third design principle

I highly suggest we join the geospatial data to the tables before ingest and store it as a column. The main benefit of this is allowing geospatial queries directly on the tables without having to join the geospatial data (saves time and money for rudimentary geo queries). We can certainly have the larger NHD data ingested if we need to get the geo attributes or do network connectivity. I am not sure if this addresses your point, so please let me know if there is a different way you would like the geo info included if this isn't what you were thinking.
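
As an illustration of the kind of direct geo query this enables, something like the following (hypothetical names; the point and distance are arbitrary) would run against the stored geom column without any join:

```sql
-- All channel features within ~10 km of a point of interest for one run.
SELECT feature_id, time, streamflow
FROM `national_water_model.channel_rt_shortrange`
WHERE ST_DWITHIN(geom, ST_GEOGPOINT(-111.65, 40.25), 10000)  -- distance in meters
  AND initialization_time = TIMESTAMP '2022-10-20 00:00:00 UTC';
```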

Please let me know your thoughts and reactions to my comments.

@danames

danames commented Oct 20, 2022

OK just a quick couple of thoughts (thanks for this discussion, I'm learning a lot)

  1. Definitely don't want to do a new table per model run! Sorry if it came across that way. Rather, I'm proposing one table per model configuration (short, med, long) that gets appended to indefinitely. The only vector data that make sense in BQ are channelRT and reservoirs. The others (terrain and land) are raster and should/could go straight to GEE.

  2. OK, I think I'm talking about partitioning as well. Just want to make sure that this query, "get all values at one feature" (WHERE reach_id=1234), is much faster than this query, "get values at all features" (WHERE time_step=2020-12-15 12:00).

  3. OK, as long as we don't end up repeating the JSON strings over and over. So if 1000 flow records exist for a single reach_id, then we don't want to repeat the GEOJSON 1000x as well. It should be linked, I think...

@aaraney

aaraney commented Oct 21, 2022

Todo: Get the polyline somewhere on public HTTP -- ask @aaraney?

@jameshalgren, is this what you are after? The RouteLink_CONUS.nc is not a polyline; it contains points, one point per reach. I think they are reach centroids? Please correct me if I'm wrong. Each reach has a reference to the next downstream reach if it exists. So you should be able to generate a polyline from this, but its geographic accuracy will be questionable if these are reach centroids.

@jameshalgren
Contributor Author

ping @ma-sada

@jameshalgren
Contributor Author

jameshalgren commented Oct 21, 2022

Todo: Get the polyline somewhere on public HTTP -- ask @aaraney?

@jameshalgren, is this what you are after? The RouteLink_CONUS.nc is not a polyline; it contains points, one point per reach. I think they are reach centroids? Please correct me if I'm wrong. Each reach has a reference to the next downstream reach if it exists. So you should be able to generate a polyline from this, but its geographic accuracy will be questionable if these are reach centroids.

@aaraney I think @KMarkert is talking about joining the table to the NHD geometry. We can derive the centroids for quick display from the lat-lon values that will be in the table, but the actual geometry will be there for geopandas/geoserver/whatever. It might be useful to extract endpoint positions to go with the centroid lat-lons (that's gotta be low priority.) In general, we should think ahead about what will happen here as the hydrofabric is updated.

@KMarkert

@danames I think we are talking about the same thing for point 1. I propose we keep one large dataset (national_water_model; being verbose/explicit is better here for all users who don't know the acronym) and have tables per type/config. From a hierarchy standpoint it will look like:

|-- national_water_model
    |-- channel_rt_assimilation
    |-- channel_rt_shortrange
    |-- ...
    |-- reservoir_longrange

So that all of the tables are under the national_water_model dataset and we can point to the correct table as needed (e.g. FROM `bigquery-public-data.national_water_model.channel_rt_assimilation`)

I didn't consider the raster (land, terrain, forcing) data going into Earth Engine... this will need to be a separate conversation, so maybe we focus on the channel_rt and reservoir data here.

I need to do more digging into partitioning, but from my understanding, from the user's perspective a partitioned table looks like one big table while the backend splits it up. This is as opposed to sharding, which is generally frowned upon. Maybe @ma-sada might have more insight here on partitioning.

For the geo column, yes, each row will have the geometry info (e.g. LINESTRING(...)). Since Google is making this data public, you won't be responsible for the storage volume/costs. Also, compute is generally more expensive, so if you needed to do geo queries with the forecast data and had to join the NHD geoms to the tables for analysis, that would be a lot. Either way you set up the join, one table will be massive, which will be inefficient and expensive.

@jameshalgren
Contributor Author

jameshalgren commented Oct 21, 2022

A couple of things to expand on @danames' comments

  • OK, I think I'm talking about partitioning as well. Just want to make sure that this query, "get all values at one feature" (WHERE reach_id=1234), is much faster than this query, "get values at all features" (WHERE time_step=2020-12-15 12:00).

A related thought -- one useful query will involve joining on lead-time -- if we store the reference_time and time, then that would require a subtraction. (i.e., WHERE t - rt = 6 to get forecast values with a 6 hour lead time.) Kel, is that kind of math going to be slow and therefore should we keep a lead_time field?

Edit: rolled back the table comment -- see here.

The only vector data that makes sense in BQ are channelRT and reservoirs. The others (terrain and land) are raster and should/could go straight to GEE.

Agree that the datatypes are fundamentally different -- data-streams from the terrain, land, and forcing will tend to be treated as rasters and aggregated spatially if anything.

@KMarkert

A related thought -- one useful query will involve joining on lead-time -- if we store the reference_time and time, then that would require a subtraction. (i.e., WHERE t - rt = 6 to get forecast values with a 6 hour lead time.) Kel, is that kind of math going to be slow and therefore should we keep a lead_time field?

This kind of math will not be an issue in terms of performance. I am always in favor of precomputing things that will be commonly used to make queries/use easier, so if you foresee lead time being calculated often, we can include it as a column.
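
For reference, a sketch of both options side by side (placeholder table/column names): computing lead time on the fly with TIMESTAMP_DIFF versus filtering on a precomputed lead_time column.

```sql
-- Computed on the fly:
SELECT feature_id, reference_time, time, streamflow,
       TIMESTAMP_DIFF(time, reference_time, HOUR) AS lead_time_hours
FROM `national_water_model.channel_rt_shortrange`
WHERE TIMESTAMP_DIFF(time, reference_time, HOUR) = 6;   -- 6-hour lead time only

-- With a precomputed column, the same filter is just:
-- WHERE lead_time = 6
```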

@KMarkert

Another thought I have is that these tables should be consistent based on the type. So, if we include a lead_time field for the forecast tables, we should also include this column on the assimilation table and set it to 0. I prefer consistency and think it will reduce questions from a user perspective on why some tables have certain columns vs others.

@swainn

swainn commented Oct 24, 2022

I am always in favor of precomputing things that will be commonly used to make queries/use easier, so if you foresee lead time being calculated often, we can include it as a column.

I agree with @KMarkert on this. Although it won't necessarily provide big savings in terms of runtime computation, precomputing things like this simplifies working with the dataset and is worthwhile from a usability standpoint.

@danames

danames commented Oct 24, 2022

I like this organization with one dataset (national_water_model) and multiple tables for each configuration. This allows for easily adding new tables as new configurations become available (also allows for stopping adding data to a table if/when that configuration stops running at NOAA). Both of which are very likely scenarios with the next-gen model coming online in the coming years.

|-- national_water_model
    |-- channel_rt_assimilation
    |-- channel_rt_shortrange
    |-- ...
    |-- reservoir_longrange

In terms of using the identical table fields for each table... we should discuss this...

It could make sense in the case of the forecast tables (shortrange, mediumrange, longrange) because they are essentially the same type of product, although shortrange is deterministic and the other two have multiple ensemble simulations at each forecast initiation (and forecast timestep). So for the shortrange forecasts we can just identify the ensemble member field as "0", whereas for medium range there will be an indicator for ensembles 1-7 for each initiation time and for each forecast time step. (Though ensemble member 1 extends 10 days, and members 2-7 only go 8.5 days.)

However, the assimilation and analysis dataset is a completely different beast. It does not have any ensemble members, initiation times, or forecast time steps. As I understand it, the AA datasets attempt to recreate historical through current flow conditions in all stream segments, so it is essentially a single value at each reach (not a set of forecasts at initiation times). So that one should probably be set up with its own unique field structure so as not to confuse users.

@ma-sada

ma-sada commented Oct 24, 2022

@KMarkert -- Yes on preferring partitioning over sharding. But we should also look at clustering. If there is a clear hierarchy of fields used to filter most queries, e.g., users will most often filter by reach and then by the forecast interval, clustering can give significant advantages in query time and cost. The downside is that, unlike partitions, the amount of data scanned can't be known in advance, so you can't use BQ's dry_run API to estimate the query cost ahead of time. Partitions and clusters can be used in conjunction, though, so it's possible to get a little of the best of both worlds, if the queries you want to do support it.

Getting to the "right" answer on this requires a little finesse. The single biggest driver in understanding how to model/partition/cluster the data will be understanding the query use cases. @jameshalgren's earlier post was helpful along those lines. Drilling down one layer deeper to pseudocode out what those queries might look like in SQL against an imaginary table could help us tease out some of these nuances.
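
To pseudocode that idea, a table definition might combine both, along the lines of this sketch (names and types are placeholders; the right partition and cluster keys depend on the query patterns discussed above):

```sql
CREATE TABLE `national_water_model.channel_rt_shortrange` (
  feature_id     INT64,
  reference_time TIMESTAMP,
  time           TIMESTAMP,
  ensemble       INT64,
  streamflow     FLOAT64,
  geom           GEOGRAPHY
)
PARTITION BY DATE(reference_time)        -- bounds data scanned; dry_run can estimate cost
CLUSTER BY feature_id, reference_time;   -- co-locates rows for reach-first filters
```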

@jameshalgren
Contributor Author

jameshalgren commented Oct 24, 2022

However, the assimilation and analysis dataset is a completely different beast. It does not have any ensemble members, initiation times, or forecast time steps. As I understand it, the AA datasets attempt to recreate historical through current flow conditions in all stream segments, so it is essentially a single value at each reach (not a set of forecasts at initiation times). So that one should probably be set up with its own unique field structure so as not to confuse users.

It's different, but not that different.

Ways that AnA (Analysis and Assimilation) is different from the forecasts

  1. The input -- AnA uses some sort of best-estimate observation including stream-gage measurements applied via the DA algorithm (or not applied in the no-DA); the forecasts are driven by HRRR, GFS, and CFS for SR, MR, and LR, respectively, and of course, there are no observations to confuse anyone regarding DA.
  2. The time-period and length -- analysis_assim covers the 3 hours up to the reference time, with a daily 28-hour update up to the reference time, analysis_assim_extend, which is (since April 2022) run at a time when a big pile of data are dumped (or something like that -- I have a vague recollection that it is related to the cycles of updates from USGS and satellites into the WCOSS data tanks...?). This is as opposed to the forecasts, which, of course, go into the future from the reference time instead of into the past.
  3. There are DA and no-DA configurations of the AnA -- the forecasts only have no-DA (which is kind of like #1, "Explore geographic plotting methods for NWIS and NWM data", except that I'm referring to the additional output dataset, not just the nature of the inputs.)
  4. The AnA uses tm0XX to refer to the number of hours back from the reference time represented in the data.

Ways that AnA is the same as the forecasts

  1. AnA is the starting point for each of the forecasts and both AnA and Forecast outputs rely on the same reference_time so theoretically, the forecasts are a continuation of the AnA (the DA AnA, at least...)
  2. The variable values are exactly the same in all of the output datasets (land, terrain, reservoir, channel).
  3. The primary key is the same: configuration+reference_time+time+reach_id+variable will give a unique key for every data value (a quick uniqueness check is sketched just after this list).
  4. The time is simply the difference of a number of hours from the reference time. Forecasts are + hours and AnA are - hours.
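
A quick sanity check of that composite key (item 3) could look like this sketch, with placeholder table/column names, flagging any duplicated tuples:

```sql
SELECT configuration, reference_time, time, reach_id, variable, COUNT(*) AS n
FROM `national_water_model.channel_rt_long_format`
GROUP BY configuration, reference_time, time, reach_id, variable
HAVING n > 1;
```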

Tons more (and probably more accurate) detail here: https://water.noaa.gov/about/nwm

Pinging @aaraney and @jarq6c to request correction or relevant clarification of any misstatements here.

@jameshalgren
Contributor Author

jameshalgren commented Oct 24, 2022

Essentially a single value

There are three different hourly values in the regular AnA and 28 in the daily extended output -- sort of like a three hour forecast, only into the past, not the future. It would be very interesting to look and see how often anything changes between the different 3 hour values or between them and the 28 hour output. That would happen if USGS gage data arrived late or if the MRMS cycles were somehow delayed and "old" data were used before a final correction was available. If I'm not mistaken, then, the stitched together tm27-tm03 outputs from the 19Z analysis_assim_extend runs would be the best-available "single value" time series (e.g. here).
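
A rough sketch of that check, assuming an AnA table keyed the way discussed above (all names are placeholders): for each feature and valid time, count how many distinct values show up across the different reference times / tm offsets.

```sql
SELECT
  feature_id,
  time,
  COUNT(DISTINCT streamflow) AS distinct_values,
  MAX(streamflow) - MIN(streamflow) AS max_difference
FROM `national_water_model.channel_rt_assimilation`
WHERE time BETWEEN TIMESTAMP '2022-10-01' AND TIMESTAMP '2022-10-02'
GROUP BY feature_id, time
HAVING distinct_values > 1
ORDER BY max_difference DESC;
```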

@andywood

@jameshalgren -- I just wanted to add that it's also possible to have an ensemble dimension with the analysis & retrospective runs. I run forecasts now that have a 36-member ensemble initialization (from one model but with different met inputs), and other variations could be parameter & structure ensembles (e.g., if we run 5 different configurations that we may want to combine in some way to initialize a forecast or provide for 'monitoring' with structural uncertainty). Most hydrologic forecast systems use deterministic initialization, and that can be a default where the ensemble dimension has length 1.

@jameshalgren
Contributor Author

Good point @andywood. That seems to suggest that we consider adding a separate table with a descriptor for all of the unique forcing configurations associated with the outputs in the various tables -- geography+configuration+ensemble -- with some sort of explanatory field referring to the

different met inputs ...[and the] parameter & structure ensembles

... that you mention.

This might be somewhat boring for the operational runs, but if we create a similar table structure for a datacube to accompany the proposed "model-in-a-box" concept, then this sort of distinction would become very important for running and keeping track of different experimental configurations.

Along those lines, it would be useful, even for the operational configurations, but especially for the experiments, for this table to contain a URL pointing to a configuration deck, allowing, if possible, replication of the execution. For the operational configurations, for instance, it could point to the nco site. Experiments might point to a GitHub repository or a Hydroshare/Zenodo/DOI entry. Bonus points would be given if we could include some sort of validation hash so that an experimenter repeating the experiment could be sure they had fully replicated the original work (and could then confidently say they had made some improvement or change, if that was their goal.)
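
A sketch of what such a configurations table might hold (all names are placeholders, and the hash idea is illustrative only):

```sql
CREATE TABLE `national_water_model.model_configurations` (
  configuration   STRING,   -- e.g. 'short_range', 'medium_range_mem2'
  geography       STRING,   -- e.g. 'conus', 'hawaii', 'prvi'
  ensemble_member INT64,    -- 0 for deterministic runs
  met_inputs      STRING,   -- e.g. 'HRRR', 'GFS', 'CFS', or experimental forcings
  description     STRING,
  config_deck_url STRING,   -- NCO page, GitHub repo, or Hydroshare/Zenodo/DOI
  config_hash     STRING    -- checksum for confirming a replicated run
);
```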

@jameshalgren
Contributor Author

ping @kmsampson

@jameshalgren
Contributor Author

jameshalgren commented Jan 5, 2023

Note: eventually, the script to port the data over from the original netcdf format will live somewhere here: https://github.com/GoogleCloudPlatform/public-datasets-pipelines
