Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add timestamp parser to parse timestamp string with time zone #1539

Closed
wants to merge 37 commits into from

Conversation

res-life
Copy link
Collaborator

@res-life res-life commented Nov 6, 2023

contributes to NVIDIA/spark-rapids#6846
contributes to #1654
closes #1721

Add kernel code for SparkDateTimeUtils functions:

  • SparkDateTimeUtils.stringToTimestamp
  • SparkDateTimeUtils.stringToTimestampWithoutTimeZone
  • SparkDateTimeUtils.stringToTimestampAnsi

Main logic is in: parseTimestampString:

This PR contains 3 parts:

  • parser
  • time rebase part:
    rebase local time construction (year,month,day,hour,minute,second) in a time zone to UTC time.
  • JNI part.

Cast(string as timestamp) for special strings: now, today, ...

from pyspark.sql.types import *
schema = StructType([
    StructField("c1", StringType()),
    StructField("c2", IntegerType()),
])
data = [
    ("today",1),
    ("now",2)
]
df = spark.createDataFrame(
        SparkContext.getOrCreate().parallelize(data, numSlices=2),
        schema).createOrReplaceTempView("tab")
spark.sql("select cast(c1 as timestamp) from tab").show()
  • Spark 311: return non-null values. Supports special strings.
  • Spark 320 and 320+: return null values. Do not support special strings.

Note:

  • This Kernel does not supports special strings: now, today...
  • Will create another PR to handle: Cast to date
  • Do not support parse just time, refer to follow-up issue

Signed-off-by: sperlingxx lovedreamf@gmail.com
Signed-off-by: Chong Gao res_life@163.com

Copy link

@winningsix winningsix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Later, please add some UTs.

src/main/cpp/src/datetime_parser.cu Outdated Show resolved Hide resolved
src/main/cpp/src/datetime_parser.cu Outdated Show resolved Hide resolved
src/main/cpp/src/datetime_parser.hpp Outdated Show resolved Hide resolved
src/main/cpp/src/datetime_parser.cu Outdated Show resolved Hide resolved
src/main/cpp/src/datetime_parser.cu Outdated Show resolved Hide resolved
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be nice to clarify that this is for a single timestamp format. CSV/JSON and others allow you to configure the timestamp format. That is not a requirement here, but a follow on issue should be filed for it.

src/main/cpp/src/datetime_parser.cu Outdated Show resolved Hide resolved
@res-life
Copy link
Collaborator Author

This PR contains 3 parts:

  • parser: ready for review.
  • time rebase part:
    rebase local time construction (year,month,day,hour,minute,second) in a time zone to UTC time. @sperlingxx will add this part in this PR.
  • JNI part. Will add later.

@revans2 Could you first review the parser part?

@res-life res-life assigned res-life and unassigned res-life Dec 14, 2023
@res-life
Copy link
Collaborator Author

res-life commented Dec 14, 2023

It would also be nice to clarify that this is for a single timestamp format. CSV/JSON and others allow you to configure the timestamp format. That is not a requirement here, but a follow on issue should be filed for it.

Currently suports formats:

  `[+-]yyyy*`
  `[+-]yyyy*-[m]m`
  `[+-]yyyy*-[m]m-[d]d`
  `[+-]yyyy*-[m]m-[d]d `
  `[+-]yyyy*-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
  `[+-]yyyy*-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`

For ToUnixTimestamp and GetTimestamp, they require a format parameter. Refer to Spark link

val formatter = formatterOption.getOrElse(getFormatter(fmt.toString))
formatter.parse()

I think our GPU implemetation currently does not support non-utc:

def parseStringAsTimestamp(

We may need a new kernel to replace current GPU implemetation parseStringAsTimestamp

@res-life
Copy link
Collaborator Author

Later, please add some UTs.

Done.

@res-life res-life marked this pull request as ready for review January 2, 2024 06:37
@res-life
Copy link
Collaborator Author

res-life commented Jan 2, 2024

@revans2 @hyperbolic2346 Could you review this PR.

This PR contains 2 parts:

  • Parser: ready for review.
  • JNI part. Ready for review. But will be modified by Alfred to add more parameters.

Alfred will post a PR for a sub-task.

  • Time rebase part. Not ready for review. (Alfred will post)

Maybe we can first merge this PR and then review Alfred's PR(based on this PR).

@res-life res-life changed the title Add timestamp parser to parse timestamp string with time zone Part 1: Add timestamp parser to parse timestamp string with time zone Jan 2, 2024
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks okay to me. I am not thrilled with this only being half done and the APIs give no indication of that. At a minimum they need to be marked in some way so that they are not used while the timezone parts are added in.

I am also not thrilled with the tests being set to test that the answer is wrong. I would like to see each test tagged that it is doing the wrong thing and ideally have code (commented out if needed) for what the correct result really is.

src/main/java/com/nvidia/spark/rapids/jni/CastStrings.java Outdated Show resolved Hide resolved
src/main/java/com/nvidia/spark/rapids/jni/CastStrings.java Outdated Show resolved Hide resolved
src/main/java/com/nvidia/spark/rapids/jni/CastStrings.java Outdated Show resolved Hide resolved
src/main/cpp/tests/datetime_parser.cpp Outdated Show resolved Hide resolved
src/main/cpp/src/datetime_parser.cu Outdated Show resolved Hide resolved
src/main/cpp/src/datetime_parser.cu Outdated Show resolved Hide resolved
src/main/cpp/src/datetime_parser.cu Outdated Show resolved Hide resolved
src/main/cpp/src/datetime_parser.cu Outdated Show resolved Hide resolved
src/main/cpp/src/datetime_parser.cu Outdated Show resolved Hide resolved
@NVnavkumar
Copy link
Collaborator

NVnavkumar commented Jan 3, 2024

The timestamp rebase that is referenced in the TODOs ultimately is the requirement for #6846. Can we update the description so that merging this PR doesn't close that issue? I think the TODOs here should ultimately reference that issue.

Right now from what I'm gathering, this code right now is purely an implementation of the Spark date time parser so that we're more consistent with Spark. This should ultimately close the overflow bug here NVIDIA/spark-rapids#10083.

@res-life res-life changed the title Part 1: Add timestamp parser to parse timestamp string with time zone Add timestamp parser to parse timestamp string with time zone Jan 8, 2024
@res-life res-life marked this pull request as draft January 8, 2024 06:15
@res-life
Copy link
Collaborator Author

build

@res-life
Copy link
Collaborator Author

Two follow-up tasks:

@res-life
Copy link
Collaborator Author

build

Chong Gao added 3 commits January 25, 2024 10:18
@res-life
Copy link
Collaborator Author

[commit]:

Only one follow-up task:
Replace looseInstant with epoch to save computation. Will target Release 24.04.

@res-life
Copy link
Collaborator Author

@revans2 Help review.

Copy link
Collaborator

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. It is very complex and daunting of a task, I am sure. My comments look worse than they really are, this is coming along nicely.

src/main/cpp/src/datetime_parser.hpp Outdated Show resolved Hide resolved
src/main/cpp/src/datetime_parser.cu Outdated Show resolved Hide resolved
src/main/cpp/src/datetime_parser.cu Outdated Show resolved Hide resolved
src/main/cpp/src/datetime_parser.cu Outdated Show resolved Hide resolved
src/main/cpp/src/datetime_parser.cu Outdated Show resolved Hide resolved
src/main/cpp/src/datetime_parser.cu Outdated Show resolved Hide resolved
src/main/cpp/src/datetime_parser.cu Show resolved Hide resolved
src/main/cpp/src/datetime_parser.cu Outdated Show resolved Hide resolved
src/main/cpp/src/datetime_parser.cu Outdated Show resolved Hide resolved
@res-life
Copy link
Collaborator Author

@hyperbolic2346 Thanks a lot for your review.
Only this comment is not done.

It is not ideal to pass in these pointers like this. Better would be having the parse_string_to_timestamp_us return those things via a move. Then they could be const as well.

auto [ts_comp, tz_lit_ptr, tz_lit_len, result] = parse_string_to_timestamp_us(d_str);
switch (result) {
 ...

I'm not strongly against this, but it is more the cudf way.

@res-life
Copy link
Collaborator Author

build

@res-life res-life changed the base branch from branch-24.02 to branch-24.04 January 29, 2024 02:11
@res-life
Copy link
Collaborator Author

@hyperbolic2346 I addressed all the comments, please review agian.

@res-life
Copy link
Collaborator Author

build

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nits. I still want us to have a unified way of dealing with the gap, and I'd rather not incur a lot of tech debt just because we don't want to stop and think about it now.

// use this reference to indicate if time zone cache is initialized.
// `fixedTransitions` saves transitions for deduplicated time zones, diferent time zones
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit different is misspelled.

// Note: Spark uses ZoneId.SHORT_IDS
// `TimeZone.getAvailableIDs` contains all keys in `ZoneId.SHORT_IDS`
// So do not need extra work for ZoneId.SHORT_IDS, here just check this assumption
for (String tz : ZoneId.SHORT_IDS.keySet()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It might be nice to only do this when assertions are enabled, but this is really minor.

@hyperbolic2346
Copy link
Collaborator

My review is for the C++ changes, please address other comments and requests before merging.

@res-life
Copy link
Collaborator Author

A few nits. I still want us to have a unified way of dealing with the gap, and I'd rather not incur a lot of tech debt just because we don't want to stop and think about it now.

Working on the design for DST/TimeAdd, after the design doc are reviewed, then develop unified way first.

@NVnavkumar NVnavkumar marked this pull request as draft February 12, 2024 19:39
@NVnavkumar
Copy link
Collaborator

Converted this to draft, since a newer unified way is being developed to parse timestamps.

@res-life
Copy link
Collaborator Author

Close it first, and will open if it's needed again.

@res-life res-life closed this Apr 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] GpuTimeZoneDB missed some time zones which are not normalized.
7 participants