Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trino Iceberg Timestamp with Timezone Query Rendering/Display Enhancment #21871

Open
sbrackenbury-teranet opened this issue May 8, 2024 · 2 comments

Comments

@sbrackenbury-teranet
Copy link

sbrackenbury-teranet commented May 8, 2024

A Trino query against an iceberg table timestamp column renders the query result differently from how Spark renders the query result for the same iceberg table:

Spark-SQL:

set time zone local;

CREATE TABLE spark_catalog.ts_iceberg.example_timestamp_ltz (ltz TIMESTAMP) USING iceberg 
LOCATION '<a_path>/ts_iceberg.db/example_timestamp_ltz' 
TBLPROPERTIES ('current-snapshot-id' = 'none', 
                                 'format' = 'iceberg/parquet', 
                                 'format-version' = '2')

insert into ts_iceberg.example_timestamp_ltz values(timestamp '2024-03-25 12:12:12.123456 America/Toronto');


select * from ts_iceberg.example_timestamp_ltz;
2024-03-25 16:12:12.123456  <- UTC normalized and timestamp is rendered using local time zone (UTC)

set time zone 'America/Toronto';
select * from ts_iceberg.example_timestamp_ltz;
2024-03-25 12:12:12.123456  <- timestamp rendered using time zone ‘America/Toronto’

Trino SQL (against the same Spark generated iceberg table above):

set time zone local;
select * from iceberg.ts_iceberg.example_timestamp_ltz;
2024-03-25 16:12:12.123456 UTC  <- timestamp rendered with ‘UTC’ suffix which is different than Spark

set time zone 'America/Toronto';
select * from iceberg.ts_iceberg.example_timestamp_ltz;
2024-03-25 16:12:12.123456 UTC  <- set time zone‘America/Toronto’ ignored – UTC localized with UTC suffix

Postgres rendering behavior for a table with a timestamptz column is like Spark:

create table public.test_iceberg (test_column timestamptz NULL);

ez_utility_uat=> show timezone;
TimeZone
----------
 UTC
(1 row)

ez_utility_uat=> insert into public.test_iceberg_ts_wtz values(timestamptz '2024-03-25 12:12:12.123456 America/Toronto');
INSERT 0 1

ez_utility_uat=> select * from public.test_iceberg_ts_wtz;
          test_column
-------------------------------
 2024-03-25 16:12:12.123456+00  <- UTC normalized and timestamp is rendered using local time zone (UTC)
(1 row)

ez_utility_uat=> set time zone 'America/Toronto';
SET

ez_utility_uat=> select * from public.test_iceberg_ts_wtz;
          test_column
-------------------------------
 2024-03-25 12:12:12.123456-04  <- timestamp rendered using session time zone ‘America/Toronto’
(1 row) 

The only option available in Trino for rendering the timestamp column at the desired timezone is to employ the Trino at_timezone function. This is undesirable as it requires a code change to existing queries. Adding the ability for Trino to transparently render the Timestamp based on a set/configured Timezone similar to how Spark and Postgres behaves is preferred because it is less intrusive.

This PR introduces configurable Trino query behavior for rendering Iceberg table timestamp columns in the same manner that Spark and Postgres render timestamp with time zone data type columns. It simplifies and encourages the adoption of Iceberg format tables that have columns of type timestamp with time zone.

The enhancement ensures that an Iceberg table timestamp column value:

  1. Is normalized to UTC time
  2. Is rendered at query time according to a default or specified session Time Zone
@martint
Copy link
Member

martint commented May 11, 2024

In Trino, timestamp with time zone is rendered with it's stored time zone. This is intentional, as the insertion time zone is considered to have meaning. Readers can choose to render it in a different time zone via at time zone. The session time zone is used for conversions between timestamp and timestamp with time zone inside the query engine, as described by the SQL specification.

There are roughly three phases of processing when reading timestamp values that need to be considered:

  • Internal representation after value is read from storage
  • How the value is transported to the client
  • Rendering the value to the user

Changing the representation after the value is read from storage is problematic. It makes it impossible to observe the value as it was stored, with its original time zone. Changing the representation when the value is transported to the client requires changes to the protocol to allow clients to express such preference. Adjusting the rendering in the client requires each client to be able to do so independently.

As an aside, Iceberg's timestamp with time zone type is stored in UTC, and the timezone information is lost. From the Iceberg spec (https://iceberg.apache.org/spec/#primitive-types):

Timestamp values with time zone represent a point in time: values are stored as UTC and do not retain a source time zone (2017-11-16 17:10:34 PST is stored/retrieved as 2017-11-17 01:10:34 UTC and these values are considered identical).

@sbrackenbury-teranet
Copy link
Author

sbrackenbury-teranet commented May 15, 2024

First of all, thank-you @martint for taking the time to review and consider this issue and our PR.

After reviewing your comments and doing further testing of our PR. We recognize we need to revisit our code and the PR submission.

We still have a strong preference for the rendering/display implementation of timestamp with time zone on the server side and not client side precisely because, to your point, adjusting the rendering/display on the client would be challenging. We prefer a configurable behavior on the server side so as to be transparent to the client, while supporting backward compatibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants