Describe the bug
The year portion of the datetime string (e.g., "2012" in "2012-07-03") is incorrectly encoded (e.g., as "+43969-07-03") when dataframes are written as parquet files during postprocessing of buildstockbatch runs. These tables are written as parquet files and uploaded to S3/Athena in AWS.
When the data is subsequently queried in the EEDR workflow, the parser in the dateutil package chokes and produces the following error: "ParserError: Unknown string format: +43969-07-03"
The error appears to stem from the call to the write_dataframe_as_parquet(df, fs, filename) function in postprocessing.py. It seems that pyarrow changed the arguments to the pq.write_table() function (see links below).
Upgrading the Athena engine for our workgroup (eedr) in the AWS console "fixed" the error (i.e., "+43969-07-03" is now encoded as "2012-07-03"). However, it has caused another error in our querying that we're still looking into. Regardless, the code in the postprocessing.py script might need to be updated anyway.
"Older Parquet implementations use INT96 based storage of timestamps, but this is now deprecated. This includes some older versions of Apache Impala and Apache Spark. To write timestamps in this format, set the use_deprecated_int96_timestamps option to True in write_table"
To Reproduce
I didn't run this workflow myself, so I'm not certain, but using the newest version of BSB and running the postprocessing script with test data should reproduce the behavior, which can then be viewed in Athena. @mleachNREL
Expected behavior
"+43969-07-03" should be encoded as "2012-07-03" in the time column of our data for all rows and tables.
Platform (please complete the following information):
Simulation platform: Eagle, data gets written to Athena on AWS
BuildStockBatch version, branch, or sha: will follow up on this detail. @mleachNREL
resstock or comstock repo version, branch, or sha: @mleachNREL
Local Desktop OS: data was being accessed from Athena via a Jupyter notebook on an Eagle node
Up until recently we had to store all our parquet files in this deprecated timestamp format because Spark and Athena liked them that way. It's the flavor='spark' stuff in postprocessing.py.
In pyarrow 3.0 they changed that argument, so newer outputs are being saved in the newer timestamp format that Athena doesn't like.
Around the same time AWS released a new version of the Athena (prestodb) engine they're calling v2. This new version can read the new timestamp versions correctly.
Yesterday I updated everyone's workgroups in Athena to use v2. (They were going to force the update eventually anyway.)
Timestamps work again!
I think to "fix" this we remove the flavor='spark' and tell everyone to use Athena engine v2.
Additional context
https://arrow.apache.org/docs/python/parquet.html#storing-timestamps
buildstockbatch/buildstockbatch/postprocessing.py, lines 147 to 150 at 0d5feb9
Possible solution - something like:
pq.write_table(table, where, coerce_timestamps='ms',
               allow_truncated_timestamps=True)
OR set the use_deprecated_int96_timestamps option to True in pq.write_table(), per the pyarrow documentation quoted above.