BSB to Athena Datetime Encoding Corrupted #221

T-Man-Stan · 2021-04-22T18:21:21Z

Describe the bug

The year portion of the datetime string (e.g., "2012" in "2012-07-03") is being incorrectly encoded (e.g., "+43969-07-03") when dataframes are written as parquet files during postprocessing of buildstockbatch runs. These tables are written as parquet files and uploaded to S3/Athena in AWS.

When the data is subsequently queried in the EEDR workflow, the parser in the dateutil package fails with the following error: "ParserError: Unknown string format: +43969-07-03"

The error appears to stem from the call to the "write_dataframe_as_parquet(df, fs, filename)" function in postprocessing.py. It seems that pyarrow changed the arguments for the "pq.write_table()" function (see link below).

https://arrow.apache.org/docs/python/parquet.html#storing-timestamps

buildstockbatch/buildstockbatch/postprocessing.py

Lines 147 to 150 in 0d5feb9

    
    def write_dataframe_as_parquet(df, fs, filename):
        tbl = pa.Table.from_pandas(df, preserve_index=False)
        with fs.open(filename, 'wb') as f:
            parquet.write_table(tbl, f, flavor='spark')

Upgrading the Athena engine for our workgroup (eedr) in the AWS console "fixed" the error (i.e., "+43969-07-03" is now read as "2012-07-03"). However, it has caused another error in our querying that we're still looking into. Regardless, the code in postprocessing.py might still need to be updated.

Possible solution - something like:

pq.write_table(table, where, coerce_timestamps='ms',
               allow_truncated_timestamps=True)

OR

[from https://arrow.apache.org/docs/python/parquet.html#storing-timestamps]

"Older Parquet implementations use INT96 based storage of timestamps, but this is now deprecated. This includes some older versions of Apache Impala and Apache Spark. To write timestamps in this format, set the use_deprecated_int96_timestamps option to True in write_table"

To Reproduce
I didn't run this workflow so I'm not totally sure; however, using the newest version of BSB and running the postprocessing script with test data will likely reproduce the behavior, which can then be viewed in Athena. @mleachNREL

Expected behavior
Dates should be encoded as "2012-07-03", not "+43969-07-03", in the time column of our data for all rows and tables.

Platform (please complete the following information):

Simulation platform: Eagle, data gets written to Athena on AWS
BuildStockBatch version, branch, or sha: will follow up on this detail. @mleachNREL
resstock or comstock repo version, branch, or sha: @mleachNREL
Local Desktop OS: data was being accessed from Athena via a jupyter NB on an Eagle node

nmerket · 2021-04-26T19:30:41Z

@T-Man-Stan, here's the current state of things:

Up until recently we had to store all our parquet files with this deprecated timestamp format because Spark and Athena liked them that way. It's the flavor='spark' stuff in postprocessing.py.
In pyarrow 3.0, they changed that argument to something else, so newer outputs are being saved in the newer timestamp format that Athena doesn't like.
Around the same time AWS released a new version of the Athena (prestodb) engine they're calling v2. This new version can read the new timestamp format correctly.
Yesterday I updated everyone's workgroups in Athena to use v2. (They were going to force the update eventually anyway.)
Timestamps work again!

I think to "fix" this we remove the flavor='spark' and tell everyone to use Athena engine v2.

nmerket · 2021-04-26T19:31:55Z

In the meantime, your results will work if you just switch your workgroup to use Athena engine v2.

T-Man-Stan added the bug label Apr 22, 2021

T-Man-Stan assigned elainethale, nmerket and mleachNREL Apr 22, 2021

nmerket mentioned this issue Apr 26, 2021

Writing parquet files with newer datetime format #224 (Merged)

nmerket closed this as completed in #224 Apr 27, 2021
