Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BSB to Athena Datetime Encoding Corrupted #221

Closed
T-Man-Stan opened this issue Apr 22, 2021 · 2 comments · Fixed by #224
Closed

BSB to Athena Datetime Encoding Corrupted #221

T-Man-Stan opened this issue Apr 22, 2021 · 2 comments · Fixed by #224
Assignees
Labels
bug Something isn't working

Comments

@T-Man-Stan
Copy link
Collaborator

Describe the bug

The year portion of the datetime string (e.g., "2012" in "2012-07-03") is being incorrectly encoded (e.g., "+43969-07-03") when dataframes are written as parquet files during postprocessing of buildstockbatch runs. These tables are written as parquet files and uploaded to S3/Athena in AWS.

When data is subsequently queried in the EEDR workflow the parser in the dateutil package barfs and produces the following error: "ParserError: Unknown string format: +43969-07-03"

Error appears to stem from when the "write_dataframe_as_parquet(df, fs, filename)" function is called in postprocessing.py. It seems that pyarrow changed the arguments for the "pq.write_table()" function (see links below).

https://arrow.apache.org/docs/python/parquet.html#storing-timestamps

def write_dataframe_as_parquet(df, fs, filename):
tbl = pa.Table.from_pandas(df, preserve_index=False)
with fs.open(filename, 'wb') as f:
parquet.write_table(tbl, f, flavor='spark')

Upgrading the Athena engine for our working group (eedr) in the AWS console "fixed" the error (i.e., "+43969-07-03" is now encoded as "2012-07-03"), however, it has caused another error in our querying that we're still looking into. Regardless, the code in the postprocessing.py script might need to be updated anyways.

Possible solution - something like:

pq.write_table(table, where, coerce_timestamps='ms',
allow_truncated_timestamps=True)

OR

[from https://arrow.apache.org/docs/python/parquet.html#storing-timestamps]

"Older Parquet implementations use INT96 based storage of timestamps, but this is now deprecated. This includes some older versions of Apache Impala and Apache Spark. To write timestamps in this format, set the use_deprecated_int96_timestamps option to True in write_table"

To Reproduce
I didn't run this workflow so I'm not totally sure, however, using the newest version of BSB and running the postprocessing script with test data will likely re-produce the behavior and this can then be viewed in Athena. @mleachNREL

Expected behavior
"+43969-07-03" should be encoded as "2012-07-03" in the time column of our data for all rows and tables.

Platform (please complete the following information):

  • Simulation platform: Eagle, data gets written to Athena on AWS
  • BuildStockBatch version, branch, or sha: will follow up on this detail. @mleachNREL
  • resstock or comstock repo version, branch, or sha: @mleachNREL
  • Local Desktop OS: data was being accessed from Athena via a jupyter NB on an Eagle node
@T-Man-Stan T-Man-Stan added the bug Something isn't working label Apr 22, 2021
@nmerket
Copy link
Member

nmerket commented Apr 26, 2021

@T-Man-Stan, here's the current state of things:

  1. Up until recently we had to store all our parquet files with this deprecated timeseries format because Spark and Athena liked them that way. It's the flavor='spark' stuff in postprocessing.py.
  2. In pyarrow 3.0, they changed that argument to something else, so newer outputs are being saved in the newer timeseries format that Athena doesn't like.
  3. Around the same time AWS released a new version of the Athena (prestodb) engine they're calling v2. This new version can read the new timestamp versions correctly.
  4. Yesterday I updated everyone's workgroups in Athena to use v2. (They were going to force the update eventually anyway.)
  5. Timestamps work again!

I think to "fix" this we remove the flavor='spark' and tell everyone to use Athena engine v2.

@nmerket
Copy link
Member

nmerket commented Apr 26, 2021

In the mean time, your results will work if you just switch your workgroup to use Athena engine v2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants