Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Account for Timezone when Casting Timestamp to Date32 #5605

Merged
merged 4 commits into from
Apr 15, 2024

Conversation

Lordworms
Copy link
Contributor

@Lordworms Lordworms commented Apr 8, 2024

Which issue does this PR close?

Closes #5598

Rationale for this change

consider time zone when casting timestamp -> Date32

What changes are included in this PR?

Are there any user-facing changes?

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to make use of chrono, i.e. using DateTime<Tz> in order to be correct

self.with_timezone_opt(Some(timezone.into()))
let timezone_str = timezone.into().to_lowercase();

if timezone_str == "utc" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need this to help transfer timezone str to corresponding Tz


impl Tz {
/// get timezone
pub fn get_time_zone(&self) -> FixedOffset {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will misbehave if the chrono-tz feature is enabled and we support timezones with DST

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll change it

@@ -1366,7 +1366,12 @@ impl<T: ArrowTimestampType> PrimitiveArray<T> {

/// Construct a timestamp array with new timezone
pub fn with_timezone(self, timezone: impl Into<Arc<str>>) -> Self {
self.with_timezone_opt(Some(timezone.into()))
let timezone_arc = timezone.into();
if timezone_arc.eq_ignore_ascii_case("utc") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be unnecessary

pub fn get_time_zone_min(&self) -> i32 {
match self.0 {
TzInner::Timezone(tz) => {
let utc_datetime: DateTime<Utc> = Utc::now();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect, the timezone must be computed with reference to the time in question. Perhaps you could look at the other temporal cast kernels for inspiration

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I was just looking for a function to convert the timezone to int32/64

Copy link
Contributor

@liukun4515 liukun4515 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the fix is not right, we should not follow the timezone in the datatype to convert the value of date32.

We need to add the timezone config in the CastOptions which can't be configured by other system like datafusion.

@liukun4515
Copy link
Contributor

The value of timezone in the datatype of timestamp means the value of timestamp is a reference to the UTC/UNIX time.
https://github.com/apache/arrow/blob/main/format/Schema.fbs#L283

@tustvold
Copy link
Contributor

tustvold commented Apr 13, 2024

We need to add the timezone config in the CastOptions which can't be configured by other system like datafusion.

The conversion should take the date of the timestamp in the timezone configured on the timestamp data type, we should not need to configure this on CastOptions, we have very carefully avoided introducing the notion of a "local" timezone.

So for example if we had 2019-10-12T07:20:50.52 +10:00, the date returned would be 2019-10-12 even though with reference to the UTC epoch (which is what the timestamp integer encodes) the date would be 2019-10-11.

@Lordworms
Copy link
Contributor Author

@tustvold Hi, I have refactored the code to use DateTime, please take a look when you have time. Thank you!

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this, I took the liberty of refactoring it slightly and adding a little more test coverage. Nicely done

@tustvold tustvold changed the title adding timezone when casting timestamp Account for Timezone when Casting Timestamp to Date32 Apr 15, 2024
@tustvold tustvold merged commit 89767cc into apache:master Apr 15, 2024
26 checks passed
@liukun4515
Copy link
Contributor

liukun4515 commented Apr 16, 2024

We need to add the timezone config in the CastOptions which can't be configured by other system like datafusion.

The conversion should take the date of the timestamp in the timezone configured on the timestamp data type, we should not need to configure this on CastOptions, we have very carefully avoided introducing the notion of a "local" timezone.

So for example if we had 2019-10-12T07:20:50.52 +10:00, the date returned would be 2019-10-12 even though with reference to the UTC epoch (which is what the timestamp integer encodes) the date would be 2019-10-11.

Hi @tustvold
From the definition of the timestamp in the arrow schema: https://github.com/apache/arrow/blob/main/format/Schema.fbs#L283

It's will cause inconvenient usage, let me give an example:

When I write a parquet file with timestamp column using the spark engine, the timestamp column is encoded by https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#instant-semantics-timestamps-normalized-to-utc and is normalized to UTC.

But when I use arrow-parquet to read the timestamp column, i will just get the column with timestamp(unit, "UTC") data type by this code

fn from_parquet(parquet_type: &Type) -> Result<DataType> {
and
(None, ConvertedType::TIMESTAMP_MILLIS) => Ok(DataType::Timestamp(

It's even possible to use int96 to store the timestamp in the parquet file, we will get the column with timestamp(nanosecond, None) data type by

PhysicalType::INT96 => Ok(DataType::Timestamp(TimeUnit::Nanosecond, None)),

But i need this to used in the UTC-7 timezone, how did you resolve this issue? @alamb

@tustvold
Copy link
Contributor

You will need to explicitly cast the timezone back, unfortunately as parquet has no mechanism for encoding the timezone this will need to be done manually. Arrow writers embed the arrow schema allowing this to be done automatically, but Spark doesn't do this

@alamb
Copy link
Contributor

alamb commented Apr 16, 2024

Maybe we can add some way to the parquet arrow reader to override its choice of data type for certain columns to allow users to specify types for cases where it is not clear from the parquet file itself.

@mapleFU and I have been discussing the need for something similar for deciding what Array type to use when reading strings from Parquet files on #5530 -- see #5530 (comment))

If we had such an API then people using spark created parquet files could specify that the timestamp column should always be UTC (as suggested by @tustvold ) without having to add an explicit cast afterwards

@liukun4515
Copy link
Contributor

Maybe we can add some way to the parquet arrow reader to override its choice of data type for certain columns to allow users to specify types for cases where it is not clear from the parquet file itself.

@mapleFU and I have been discussing the need for something similar for deciding what Array type to use when reading strings from Parquet files on #5530 -- see #5530 (comment))

If we had such an API then people using spark created parquet files could specify that the timestamp column should always be UTC (as suggested by @tustvold ) without having to add an explicit cast afterwards

I think it's a solution adding the option in the parquet reader, it can help us to resolve some issues.

But many issues can't be resolve smoothly.

As describe in the comment: apache/datafusion#9981 (comment)
The same timestamp column maybe used by different timezone user, they may want to do compute on the same timestamp with different timezone. How to do that? @tustvold Do we need to add or wrap the cast expr explicitly to the target timestamp column?

@tustvold
Copy link
Contributor

tustvold commented Apr 17, 2024

Do we need to add or wrap the cast expr explicitly to the target timestamp column?

Yes arrow does not have a notion of "local" timezone, users need to be explicit about the timezone they wish to use for a particular array. Theoretically a DF frontend could add such a notion, but I would strongly recommend against it as it is leads to all sorts of peculiar bugs.

The following would be an example DataFusion SQL query that is explicit about timezones

SELECT arrow_cast(time_column, 'Timestamp(Nanosecond, Some("+07:00"))';

@liukun4515
Copy link
Contributor

Do we need to add or wrap the cast expr explicitly to the target timestamp column?

Yes arrow does not have a notion of "local" timezone, users need to be explicit about the timezone they wish to use for a particular array. Theoretically a DF frontend could add such a notion, but I would strongly recommend against it as it is leads to all sorts of peculiar bugs.

The following would be an example DataFusion SQL query that is explicit about timezones

SELECT arrow_cast(time_column, 'Timestamp(Nanosecond, Some("+07:00"))';

Thanks for your feedback for

Do we need to add or wrap the cast expr explicitly to the target timestamp column?
I also know that adding explicit cast expr is not a good way.

Maybe the suggestion from @alamb is a good solution to resolve this after thinking about it again.
We can specify the timezone which is defined from parquet reader options when read the timestamp column.
what do you think about this method?

@tustvold
Copy link
Contributor

Filed #5657

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

confused result when cast timestamp with timezone to date32
4 participants