Added functionality to output to parquet #761

karnesh · 2024-04-23T17:27:12Z

This PR adds the option to write flow, velocity and depth to a parquet file. The parquet output is a timeseries (details in Notes) and is required as input for TEEHR, a set of tools for hydrologic model and forecast evaluation. Storing the t-route output in parquet format allows the data to be queried efficiently, and it helps automation by connecting TEEHR with the NextGen water model. CIROH, along with Lynker, has developed NextGen In A Box (NGIAB), a containerized version of the NextGen National Water Model. TEEHR uses DuckDB to query the parquet output from NGIAB stored in the cloud.

Additions

Added functionality to write flow, velocity and depth to parquet format.

Changes

Changed the yaml input file to include the parquet output format:

parquet_output:
    #---------
    parquet_output_folder: output/
    configuration: short_range
    prefix_ids: nex
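
As a rough illustration of how these three keys might be consumed, here is a minimal sketch using only the standard library. The class name `ParquetOutputSketch` and its `from_dict` helper are hypothetical and are not part of t-route, whose real configuration models are pydantic classes; the sketch only assumes that a yaml loader turns the `parquet_output` section into a plain dict.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class ParquetOutputSketch:
    """Hypothetical stand-in for the parquet_output section of the yaml file."""

    parquet_output_folder: Optional[Path] = None
    configuration: Optional[str] = None
    prefix_ids: Optional[str] = None

    @classmethod
    def from_dict(cls, section):
        # Convert the folder to a Path only when the key is present
        folder = section.get("parquet_output_folder")
        return cls(
            parquet_output_folder=Path(folder) if folder is not None else None,
            configuration=section.get("configuration"),
            prefix_ids=section.get("prefix_ids"),
        )


# The dict a yaml loader would be expected to produce for the section above
section = {
    "parquet_output_folder": "output/",
    "configuration": "short_range",
    "prefix_ids": "nex",
}
settings = ParquetOutputSketch.from_dict(section)
```

All three fields default to `None` here so that a missing section simply leaves parquet output disabled; whether the real model treats them as optional is an assumption.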

Testing

The modified code was tested by running the LowerColorado_TX demonstration test.

Screenshots

Here is a screenshot of successful compilation.

Parquet output timeseries data from the LowerColorado_TX demonstration test.

Notes

The flowveldepth dataframe is reshaped into a timeseries containing the following variables:

location_id: string
value: double
value_time: timestamp[us]
variable_name: string
configuration: string
units: string
reference_time: timestamp[us]

The location_id variable contains nexus IDs with 'nex-' prepended.
The configuration (short range, medium range, long range, etc.) must be entered by the user in the input yaml file.

Todos

Checklist

Testing checklist

Target Environment support

Windows
Linux
Browser

hellkite500

In general, I like the idea of adding this as an option for possible output formats. There are some details of the implementation that need to be discussed/considered a bit I think, however.

Also, a more general note: configuring git in your local dev environment to ignore whitespace changes may be helpful on PRs like this which touch many files. It looks like some auto formatting was applied, and the formatting changes can make review more challenging when trying to see all the places in the code where functionality has changed. If you want to contribute formatting changes, I would suggest putting them all into a single commit on a PR, or an independent PR.

hellkite500 · 2024-04-26T13:37:07Z

src/troute-nwm/src/nwm_routing/output.py

+    timeseries_df['units'] = timeseries_df['variable_name'].map(variable_to_units_map)
+    timeseries_df['reference_time'] = start_datetime.date()
+    timeseries_df['location_id'] = timeseries_df['location_id'].astype('string')
+    timeseries_df['location_id'] = 'nex-' + timeseries_df['location_id']


I don't think it is generically appropriate to label all locations in the routing data frame with a nex- prefix. This may be applicable to routing which uses a hy_features network, but it wouldn't be applicable to routing which uses an nhd network. Is there any guarantee before this point that this function only gets applied to hy_features network results? Also, even within a hy_features network, the routed segments this output relates to are the waterbodies, which typically have a wb- prefix in the hydrofabric identifiers, not nex. The nexus and waterbody features are related but distinct concepts.

See comment below...

hellkite500 · 2024-04-26T13:44:53Z

src/troute-nwm/src/nwm_routing/output.py

+
+    df.index.name = 'location_id'
+    df.reset_index(inplace=True)
+    timeseries_df = df.melt(id_vars=['location_id'], var_name='var')


I think if you use value_vars=[ 'q', 'v', 'd' ] as a kwarg to melt, you might have an easier time extracting from un-pivoted table?

It seems like there should be a better way than the method below, which casts to string, manipulates the string, and recasts to numeric/datetime types.

I'm not sure exactly what the df being manipulated looks like at this point, but I would try to consider a different method (or methods) for extracting the needed data from it.

If for some reason this is the only way, then please comment this implementation to describe what the state of the df is and why this is the way it needs to be manipulated.

I looked into the options and this seems to be the only way. I cannot use value_vars=[ 'q', 'v', 'd' ] as a kwarg to melt because of the format of the df column names. Also, the column names need to be manipulated as strings. Please look at the screenshot below.

Each column name consists of a time step and a variable name (q, v or d) in string format. The time step values are used to compute value_time.
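
For comparison, here is a minimal sketch of the kind of un-pivot the review comment asks about, one that avoids the string round trip. It rests on assumptions that are not confirmed in this thread: that the columns are (or can be converted to) a two-level MultiIndex of (time step, variable), that time step n marks the end of the n-th routing step of length dt seconds, and that the hypothetical `VARIABLE_NAMES` mapping matches the names TEEHR expects. The function name `to_long_format` is also hypothetical, and the units column is left out because the thread does not state the units.

```python
import pandas as pd

# Hypothetical mapping from t-route's short keys to long variable names
VARIABLE_NAMES = {"q": "flow", "v": "velocity", "d": "depth"}


def to_long_format(df, start_datetime, dt, configuration, prefix_ids):
    """Un-pivot a wide flow/velocity/depth frame into a long timeseries table.

    ASSUMPTION: df.columns is a two-level MultiIndex of (timestep, variable),
    with integer time steps and variable keys 'q', 'v' or 'd'.
    """
    wide = df.copy()
    wide.index.name = "location_id"
    wide.columns = wide.columns.set_names(["timestep", "variable_key"])

    # melt turns each column level into its own column, keeping the integer
    # time step typed as an integer instead of embedding it in a string
    long = wide.melt(ignore_index=False).reset_index()

    # ASSUMPTION: time step n corresponds to start + (n + 1) * dt seconds
    offsets = pd.to_timedelta((long["timestep"] + 1) * dt, unit="s")
    long["value_time"] = pd.Timestamp(start_datetime) + offsets
    long["reference_time"] = pd.Timestamp(start_datetime)
    long["variable_name"] = long["variable_key"].map(VARIABLE_NAMES)
    long["configuration"] = configuration
    long["location_id"] = prefix_ids + "-" + long["location_id"].astype(str)
    return long.drop(columns=["timestep", "variable_key"])
```

If the column names really are plain strings, they would first have to be parsed into such a MultiIndex once, which still keeps the string handling out of the per-row data.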

hellkite500 · 2024-04-26T13:48:08Z

src/troute-nwm/src/nwm_routing/output.py

+        timeseries_df = _parquet_output_format_converter(flowveldepth, restart_parameters.get("start_datetime"), dt,
+                                                         output_parameters["parquet_output"].get("configuration"))
+
+        parquet_output_segments_str = ['nex-' + str(segment) for segment in parquet_output_segments]


Another use of the nex- prefix that may not be generically appropriate.

I would suggest making it a variable with a default argument in the wrapping function and a value in the yaml:
parquet_output_segments_str = [prefix_str + str(segment) for segment in parquet_output_segments]

We would need to do something more comprehensive to update T-Route to comprehend the naming/labeling of the IDs...

I have modified the PR to add a user-defined value in the yaml file for the prefix string.
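
The suggestion above can be sketched as a small helper. The function name `label_segments` is hypothetical, and both the `None` default (meaning the IDs are left unprefixed, as might suit an nhd network) and the hyphen separator are assumptions rather than what the PR necessarily implements.

```python
def label_segments(segments, prefix_ids=None):
    """Return segment IDs as strings, optionally prefixed with '<prefix_ids>-'.

    ASSUMPTION: prefix_ids comes from the yaml file (for example 'nex' or
    'wb'); when it is missing or empty, the IDs are returned unprefixed.
    """
    if not prefix_ids:
        return [str(segment) for segment in segments]
    return [f"{prefix_ids}-{segment}" for segment in segments]


# Example: the same IDs labeled for two different network conventions
hy_features_ids = label_segments([101, 102], prefix_ids="wb")
nhd_ids = label_segments([101, 102])
```

Keeping the prefix as a parameter means the labeling decision is made once in the configuration rather than being hard-coded where the output is written.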

jameshalgren · 2024-05-20T13:41:35Z

src/troute-config/troute/config/output_parameters.py

 class StreamOutput(BaseModel):
    # NOTE: required if writing StreamOutput files
    stream_output_directory: Optional[DirectoryPath] = None
    stream_output_time: int = 1
-    stream_output_type: streamOutput_allowedTypes = ".nc"
+    stream_output_type:streamOutput_allowedTypes = ".nc"


Remove this formatting change from the commit?

jameshalgren · 2024-05-20T13:42:54Z

src/troute-nwm/src/nwm_routing/output.py

-    lakeids = np.fromiter(crosswalk.keys(), dtype=int)
+    lakeids = np.fromiter(crosswalk.keys(), dtype = int)
    idxs = target_df.index.to_numpy()
    lake_index_intersect = np.intersect1d(
        idxs,
        lakeids,
-        return_indices=True
+        return_indices = True
    )

    # replace lake ids with link IDs in the target_df index array
-    linkids = np.fromiter(crosswalk.values(), dtype=int)
+    linkids = np.fromiter(crosswalk.values(), dtype = int)
    idxs[lake_index_intersect[1]] = linkids[lake_index_intersect[2]]

    # (re) set the target_df index
-    target_df.set_index(idxs, inplace=True)
+    target_df.set_index(idxs, inplace = True)

    return target_df


-def _parquet_output_format_converter(df, start_datetime, dt, configuration):
+def _parquet_output_format_converter(df, start_datetime, dt, configuration, prefix_ids):


We can drop most of these format only changes to keep the PR super clean.

I have reverted the format changes.

karnesh added 5 commits March 5, 2024 11:42

Added functionality to write flow, velocity and depth to parquet

184493d

Modified parquet output format to match TEEHR input

3ab6d38

cleaned the code

77928ba

cleaned the code

5a1563b

sample yaml file changes for parquet output

c2bf517

hellkite500 requested changes Apr 26, 2024

added functionality to include user defined prefix for IDs

80a844d

jameshalgren reviewed May 20, 2024

karnesh added 4 commits May 20, 2024 10:23

reverted back the formatting changes

8b5ac0f

reverted back the formatting changes

c81c2ce

reverted back the formatting changes

e1aa624

reverted back the formatting changes

33a0aa5

karnesh marked this pull request as ready for review May 23, 2024 01:22
