[RFC] Refactor timestamp codecs #333

Open
flipp5b opened this issue Jan 23, 2024 · 2 comments

flipp5b (Contributor) commented Jan 23, 2024

Currently, LocalDateTime is the "base" type for INT96 timestamp encoding/decoding: conversion of a Timestamp or an Instant goes via LocalDateTime. I see a slight mismatch here:

  • If I'm not mistaken, the INT96 format is UTC-adjusted.
  • LocalDateTime requires an additional piece of data to become UTC-adjusted: a time zone.
  • On the other hand, Instant is already UTC-adjusted (and Timestamp can be directly converted to Instant).

So, it may be a little confusing that you need to specify a timezone to encode/decode an Instant or a Timestamp.
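
The mismatch can be reproduced with plain java.time and java.sql, independently of parquet4s. A minimal sketch follows; the object name ZoneDependenceSketch is made up for illustration and is not part of the library.

```scala
import java.sql.Timestamp
import java.time.{Instant, LocalDateTime, ZoneId, ZoneOffset}

object ZoneDependenceSketch {
  val local: LocalDateTime = LocalDateTime.of(2019, 1, 1, 0, 0)

  // The same LocalDateTime maps to different instants depending on the zone supplied.
  val inUtc: Instant     = local.atZone(ZoneOffset.UTC).toInstant
  val inNairobi: Instant = local.atZone(ZoneId.of("Africa/Nairobi")).toInstant

  // An Instant and a Timestamp already identify a point on the UTC timeline,
  // so converting between them needs no zone.
  val instant: Instant     = Instant.parse("2018-12-31T21:00:00Z")
  val timestamp: Timestamp = Timestamp.from(instant)
}
```

Here inUtc and inNairobi differ by three hours, while timestamp.toInstant returns instant unchanged.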

Instead, we could use Instant as the "base" type, as follows:

private[parquet4s] object TimeValueCodecs {
// ...
  private val SecondsPerDay = TimeUnit.DAYS.toSeconds(1)

  def encodeInstant(instant: Instant): Value = BinaryValue {
    val julianSec  = instant.getEpochSecond + JulianDayOfEpoch * SecondsPerDay
    val julianDays = julianSec / SecondsPerDay
    val nanos      = TimeUnit.SECONDS.toNanos(julianSec % SecondsPerDay) + instant.getNano

    ByteBuffer
      .allocate(12)
      .order(ByteOrder.LITTLE_ENDIAN)
      .putLong(nanos)
      .putInt(julianDays.toInt)
      .array()
  }
// ...
}

trait TimeValueEncoders {
  implicit val localDateTimeEncoder: OptionalValueEncoder[LocalDateTime] = new OptionalValueEncoder[LocalDateTime] {
    def encodeNonNull(data: LocalDateTime, configuration: ValueCodecConfiguration): Value =
      TimeValueCodecs.encodeInstant(localDateTimeToInstant(data, configuration.timeZone))
  }

  implicit val instantEncoder: OptionalValueEncoder[Instant] = new OptionalValueEncoder[Instant] {
    def encodeNonNull(data: Instant, configuration: ValueCodecConfiguration): Value =
      TimeValueCodecs.encodeInstant(data)
  }

  implicit val sqlTimestampEncoder: OptionalValueEncoder[java.sql.Timestamp] = new OptionalValueEncoder[Timestamp] {
    def encodeNonNull(data: Timestamp, configuration: ValueCodecConfiguration): Value =
      TimeValueCodecs.encodeInstant(data.toInstant)
  }
}

In this case, the time zone is specified only for a LocalDateTime, and the encodeInstant method itself looks a bit simpler than encodeLocalDateTime.

flipp5b (Contributor, Author) commented Jan 23, 2024

How naive of me: as always, time-related machinery is like a rabbit hole 😅

Ambiguity

Given

  • Date Time: 2019-01-01T00:00
  • Time Zone: Africa/Nairobi

Current LocalDateTime-based encoder "output"

  • Julian day: 2458485
  • Nanos: -10800000000000

Suggested Instant-based encoder "output"

  • Julian day: 2458484
  • Nanos: 75600000000000

Both outputs denote the same point in time. The Instant-based decoder correctly restores a value produced by the LocalDateTime-based encoder, and vice versa. But this is still an observable behavior change. I suppose this is undesirable (or not?). Anyway, I have no idea which of the outputs is correct/preferable because of the next issue.
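
The equivalence of the two outputs can be checked with a small decoder sketch. This is an illustration under stated assumptions, not the parquet4s decoder: the object name Int96DecodeSketch and the helper toBytes are hypothetical, and JulianDayOfEpoch is assumed to be 2440588, the Julian day number of 1970-01-01.

```scala
import java.nio.{ByteBuffer, ByteOrder}
import java.time.Instant
import java.util.concurrent.TimeUnit

object Int96DecodeSketch {
  // Assumption: Julian day number of 1970-01-01.
  val JulianDayOfEpoch: Long = 2440588L
  val SecondsPerDay: Long    = TimeUnit.DAYS.toSeconds(1)

  // Same 12-byte layout as encodeInstant above:
  // 8 bytes of nanos followed by 4 bytes of Julian day, little-endian.
  def toBytes(julianDay: Int, nanos: Long): Array[Byte] =
    ByteBuffer
      .allocate(12)
      .order(ByteOrder.LITTLE_ENDIAN)
      .putLong(nanos)
      .putInt(julianDay)
      .array()

  def decodeInstant(bytes: Array[Byte]): Instant = {
    val buffer       = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN)
    val nanos        = buffer.getLong
    val julianDay    = buffer.getInt
    val epochSeconds = (julianDay - JulianDayOfEpoch) * SecondsPerDay
    // Instant.ofEpochSecond normalizes the nano adjustment, including negative values.
    Instant.ofEpochSecond(epochSeconds, nanos)
  }
}
```

With this sketch, decoding (2458485, -10800000000000) and (2458484, 75600000000000) yields the same Instant, which is why either decoder can read either encoder's output even though the stored bytes differ.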

INT96 format itself

The INT96 format is deprecated (see here) and even lacks any documentation (see the discussion in apache/parquet-format#49).

With all that said, wouldn't it be better to soft-deprecate the default timestamp codecs in parquet4s and encourage users to choose the INT64-based ones? Also, we could switch to Instant internally and use the LocalDateTime-based implementation only as a fallback for the INT96 format.
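
For comparison, a Parquet INT64 timestamp with the MICROS unit and isAdjustedToUTC = true stores the number of microseconds since the Unix epoch, so it maps onto Instant without any time zone. The sketch below uses only java.time; the object and method names are hypothetical and do not reflect the parquet4s API.

```scala
import java.time.Instant
import java.time.temporal.ChronoUnit

object Int64TimestampSketch {
  // Microseconds since 1970-01-01T00:00:00Z; no time zone is involved.
  def toEpochMicros(instant: Instant): Long =
    ChronoUnit.MICROS.between(Instant.EPOCH, instant)

  // Inverse conversion back to an Instant.
  def fromEpochMicros(micros: Long): Instant =
    Instant.EPOCH.plus(micros, ChronoUnit.MICROS)
}
```

Unlike the INT96 pair of Julian day and nanos, a single INT64 count has only one representation per point in time, so the ambiguity described above does not arise.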

flipp5b changed the title from "Refactor timestamp codecs" to "[RFC] Refactor timestamp codecs" on Jan 24, 2024

mjakubowski84 (Owner) commented Feb 5, 2024

> I suppose this is undesirable (or not?)

I guess there must be a small bug that's causing it. It is definitely undesirable because it produces different data.
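
One way the two encoders could be made to agree, offered as an assumption rather than a diagnosis of the actual bug, is to normalize the (Julian day, nanos) pair so that nanos always falls within a single day. In the sketch below, Int96NormalizeSketch and normalize are hypothetical names.

```scala
import java.util.concurrent.TimeUnit

object Int96NormalizeSketch {
  val NanosPerDay: Long = TimeUnit.DAYS.toNanos(1)

  // Carries whole days out of nanos so that 0 <= nanos < NanosPerDay.
  // floorDiv/floorMod round toward negative infinity, which handles negative nanos.
  def normalize(julianDay: Int, nanos: Long): (Int, Long) = {
    val dayShift = Math.floorDiv(nanos, NanosPerDay)
    (julianDay + dayShift.toInt, Math.floorMod(nanos, NanosPerDay))
  }
}
```

Applied to the LocalDateTime-based output from the example, normalize(2458485, -10800000000000L) gives (2458484, 75600000000000), which matches the Instant-based output.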

> The INT96 format is deprecated (see here) and even lacks any documentation (see discussion apache/parquet-format#49).

Yes, it is deprecated, and yet (!) it is still the default format in such prominent tools as Spark, Impala and many others.
