Parquet: Handling of datetimes not working as expected #256

Open
totalgit74 opened this issue Nov 29, 2022 · 2 comments

Comments

@totalgit74

When running ChoETL.Parquet 1.0.1.24 with ChoETL.NetStandard 1.2.1.50 I am having an issue retrieving datetime values.

My aim is to use Parquet as an information exchange format between Python and .NET. It has the potential to do so, but inconsistent handling of dates between the two languages is proving a sticking point, particularly when working with large time series. I would like to be able to store and read .NET DateTime values, which I would expect to convert to and from datetime64[ns].

Reading

If I create a simple parquet file from Python

import pandas as pd
dr = pd.date_range(start='2022-11-05', end='2022-12-01', freq='D')
df = pd.DataFrame(dr, columns=['date'])
df.to_parquet('d:/temp/ts.parquet')

and then read it back into a DataTable with reader.AsDataTable(), using a ChoParquetReader whose ChoParquetRecordConfiguration sets ParquetOptions = { TreatBigIntegersAsDates = true }, the column comes back as a plain big integer rather than a DateTime.

Am I misunderstanding this option? I would expect the setting to cause the integer to be converted back to a DateTime.
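
For reference, a big integer in a timestamp column is usually a count of time units since the Unix epoch (1970-01-01 UTC). The sketch below, using only the Python standard library, shows the kind of conversion one would expect such an option to perform. The function name `epoch_int_to_datetime` is hypothetical, and the unit (nanoseconds vs. microseconds) is an assumption: it depends on how the file was written, so it is worth checking the file's schema before relying on it.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Microseconds per unit, written as (numerator, denominator) so that
# nanoseconds can be handled with integer arithmetic only.
_UNITS = {"ns": (1, 1_000), "us": (1, 1), "ms": (1_000, 1), "s": (1_000_000, 1)}


def epoch_int_to_datetime(value: int, unit: str = "ns") -> datetime:
    """Convert an integer count of `unit` since the Unix epoch to a UTC datetime.

    Python's datetime has microsecond resolution, so sub-microsecond digits
    of a nanosecond value are truncated.
    """
    num, den = _UNITS[unit]
    return EPOCH + timedelta(microseconds=value * num // den)


# 2022-11-05 00:00:00 UTC, the first value in the example file,
# expressed as nanoseconds since the epoch (assumed unit).
first = epoch_int_to_datetime(1_667_606_400_000_000_000, unit="ns")
print(first.isoformat())  # 2022-11-05T00:00:00+00:00
```

If the same integer were interpreted with the wrong unit, the result would be off by a factor of a thousand or more, which is one reason to confirm the stored unit first.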

Writing

When writing data, both DateTime and DateTimeOffset appear to be written as DateTimeOffset. This can be shown by:

  • Creating some typed objects containing both DateTime and DateTimeOffset fields
  • Writing to a parquet file using the ChoParquetWriter
  • Reading back into a DataTable as before
  • Viewing the result: both columns clearly hold the same values and the same type.

This is even more of a problem because the data type coming back does not match the data type written out. When using pandas

df.to_parquet(...)
df = pd.read_parquet(...)

datetime values written out have the same type when read back in; there is no conversion from a timezone-naive to a timezone-aware format.
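
The mismatch described here is roughly the .NET counterpart of Python's distinction between naive and timezone-aware datetimes. As a minimal standard-library sketch (the helper names `is_aware` and `same_kind` are hypothetical, and nothing here touches Parquet itself), a round-trip test could assert that awareness is preserved:

```python
from datetime import datetime, timezone


def is_aware(dt: datetime) -> bool:
    """True if dt carries a usable UTC offset (the documented awareness test)."""
    return dt.tzinfo is not None and dt.tzinfo.utcoffset(dt) is not None


def same_kind(written: datetime, read_back: datetime) -> bool:
    """True if a value kept its naive/aware kind across a round trip."""
    return is_aware(written) == is_aware(read_back)


naive = datetime(2022, 11, 5)                       # no offset attached
aware = datetime(2022, 11, 5, tzinfo=timezone.utc)  # offset attached

# A writer that silently attaches an offset turns naive values into aware ones.
print(same_kind(naive, naive))  # True
print(same_kind(naive, aware))  # False
```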

@Cinchoo
Owner

Cinchoo commented Dec 18, 2022

Please take https://www.nuget.org/packages/ChoETL.Parquet/1.0.1.25-beta2 and give it a try.

Let me know.

@totalgit74
Author

Apologies for the delay as I've been on other tasks.

When using

var reader = new ChoParquetReader(path, new ChoParquetRecordConfiguration
{
    ParquetOptions = { TreatBigIntegersAsDates = true }
});

var dt = reader.AsDataTable();

(path is just a string path to a parquet file saved using Python 3.9.12)

I get

System.MissingMethodException: Method not found: 'System.Collections.Generic.IDictionary`2<System.String,System.Object> ChoETL.ChoRecordReader.MigrateToNewSchema(System.Collections.Generic.IDictionary`2<System.String,System.Object>, System.Collections.Generic.IDictionary`2<System.String,System.Type>)'.
   at ChoETL.ChoParquetRecordReader.<AsEnumerable>d__25.MoveNext()
   at ChoETL.ChoParquetRecordReader.<AsEnumerable>d__20.MoveNext()
   at ChoETL.ChoParquetReader`1.<>c__DisplayClass40_0.<GetEnumerator>b__0()
   at ChoETL.ChoEnumeratorWrapper.ChoEnumeratorWrapperInternal`1.MoveNext()
   at ChoETL.ChoEnumeratorWrapper.<BuildEnumerable>d__0`1.MoveNext()
   at System.Linq.Enumerable.WhereSelectEnumerableIterator`2.MoveNext()
   at System.Linq.Enumerable.<OfTypeIterator>d__95`1.MoveNext()
   at ChoETL.ChoPeekEnumerator`1.MoveToNext()
   at ChoETL.ChoPeekEnumerator`1.MoveNext()
   at ChoETL.ChoEnumerableDataReader..ctor(IEnumerable collection, IChoDeferedObjectMemberDiscoverer dom)
   at ChoETL.ChoEnumerableEx.AsDataReader(IEnumerable collection, Action`1 membersDiscovered, String[] selectedFields, String[] excludeFields)
   at ChoETL.ChoParquetReader`1.AsDataReader(Action`1 membersDiscovered)
   at ChoETL.ChoParquetReader`1.AsDataTable(String tableName)
   at Risk.ChoETL.ChoETLArrow.ReadParquet(String path) in D:\Code\Prototype\Parquet\Risk.ChoETL\ChoETLArrow.cs:line 24

The installed packages (.NET Framework 4.8) were:

<?xml version="1.0" encoding="utf-8"?>
<packages>
  <package id="ChoETL.NETStandard" version="1.2.1.61" targetFramework="net48" />
  <package id="ChoETL.Parquet" version="1.0.1.25-beta2" targetFramework="net48" />
  <package id="IronSnappy" version="1.2.2" targetFramework="net48" />
  <package id="Microsoft.CSharp" version="4.4.1" targetFramework="net48" />
  <package id="Newtonsoft.Json" version="13.0.1" targetFramework="net48" />
  <package id="Parquet.Net" version="3.7.4" targetFramework="net48" />
  <package id="System.Buffers" version="4.5.1" targetFramework="net48" />
  <package id="System.CodeDom" version="4.4.0" targetFramework="net48" />
  <package id="System.ComponentModel.Annotations" version="4.4.1" targetFramework="net48" />
  <package id="System.Configuration.ConfigurationManager" version="4.4.1" targetFramework="net48" />
  <package id="System.Data.SqlClient" version="4.8.5" targetFramework="net48" />
  <package id="System.Memory" version="4.5.4" targetFramework="net48" />
  <package id="System.Numerics.Vectors" version="4.5.0" targetFramework="net48" />
  <package id="System.Reflection.Emit" version="4.3.0" targetFramework="net48" />
  <package id="System.Reflection.Emit.Lightweight" version="4.7.0" targetFramework="net48" />
  <package id="System.Runtime.CompilerServices.Unsafe" version="4.5.3" targetFramework="net48" />
</packages>
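
A MissingMethodException generally means the loaded assembly does not contain a method that the calling assembly was compiled against, which often points to mismatched package versions. Whether that is the cause here is an assumption, not something the trace confirms. As a small aid for comparing versions across projects, this sketch lists the packages in a packages.config whose id starts with a given prefix; the function name `package_versions` is hypothetical and the sample data is copied from the list above.

```python
import io
import xml.etree.ElementTree as ET


def package_versions(source, prefix=""):
    """Return {id: version} for <package> entries whose id starts with prefix.

    `source` is a file name or binary file object, as accepted by ET.parse.
    """
    root = ET.parse(source).getroot()
    return {
        pkg.get("id"): pkg.get("version")
        for pkg in root.iter("package")
        if pkg.get("id", "").startswith(prefix)
    }


# In-memory stand-in for a packages.config file on disk.
sample = io.BytesIO(b"""<?xml version="1.0" encoding="utf-8"?>
<packages>
  <package id="ChoETL.NETStandard" version="1.2.1.61" targetFramework="net48" />
  <package id="ChoETL.Parquet" version="1.0.1.25-beta2" targetFramework="net48" />
  <package id="Parquet.Net" version="3.7.4" targetFramework="net48" />
</packages>
""")

cho = package_versions(sample, prefix="ChoETL")
for name, version in cho.items():
    print(f"{name} {version}")
```

Running the same function against each project's packages.config makes it easier to spot a project that still references an older ChoETL build.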

I couldn't find my original prototype code for testing, so I started from scratch.
