
HIVE-28249: Parquet legacy timezone conversion converts march 1st to 29th feb and fails with not a leap year exception #5241

Merged
merged 9 commits into apache:master on May 16, 2024

Conversation

simhadri-g (Member) commented May 7, 2024


Example: Invalid date 'February 29' as '200' is not a leap year

What changes were proposed in this pull request?

When handling legacy timestamp conversions in Parquet, 'February 29' of year '200' is an edge case. This is because, according to https://www.lanl.gov/Caesar/node202.html, the Julian day for 200 CE/02/29 in the Julian calendar differs from the Julian day in the Gregorian calendar.

Date (BC/AD)    Date (CE)       Julian Day (Julian Calendar)    Julian Day (Gregorian Calendar)
200 AD/02/28    200 CE/02/28    1794166                         1794167
200 AD/02/29    200 CE/02/29    1794167                         1794168
200 AD/03/01    200 CE/03/01    1794168                         1794168
300 AD/02/28    300 CE/02/28    1830691                         1830691
300 AD/02/29    300 CE/02/29    1830692                         1830692
300 AD/03/01    300 CE/03/01    1830693                         1830692
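
As a sanity check (not part of the patch), the Gregorian column of the table can be reproduced with plain java.time, which exposes the Julian day number of a proleptic Gregorian date via JulianFields:

```java
import java.time.LocalDate;
import java.time.temporal.JulianFields;

public class JulianDayCheck {
    public static void main(String[] args) {
        // java.time uses the proleptic Gregorian calendar, so these match the
        // "Julian Day (Gregorian Calendar)" column above.
        System.out.println(LocalDate.of(200, 2, 28).getLong(JulianFields.JULIAN_DAY)); // 1794167
        System.out.println(LocalDate.of(200, 3, 1).getLong(JulianFields.JULIAN_DAY));  // 1794168
    }
}
```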

Because of this:

int julianDay = nt.getJulianDay(); returns Julian day 1794167
(https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java#L92)

Later, Timestamp result = Timestamp.valueOf(formatter.format(date)); fails: formatter.format(date) returns 29-02-200, as it appears to use the Julian calendar, but Timestamp.valueOf(29-02-200) appears to use the Gregorian calendar and fails with a "not a leap year" exception for 29th Feb 200
(https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/common/type/TimestampTZUtil.java#L196)

Since Hive stores timestamps in UTC, when converting 200 CE/03/01 between time zones Hive runs into this exception and fails with a "not a leap year" exception for 29th Feb 200, even if the actual record inserted was 200 CE/03/01 in the Asia/Singapore time zone.
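
The failure can be reproduced in isolation: java.time applies the proleptic Gregorian leap-year rule, under which year 200 is not a leap year, so a date that is valid in the Julian calendar is rejected (a minimal repro, independent of Hive):

```java
import java.time.DateTimeException;
import java.time.LocalDate;

public class LeapYearRepro {
    public static void main(String[] args) {
        try {
            // Valid in the Julian calendar, invalid in the (proleptic) Gregorian one.
            LocalDate.of(200, 2, 29);
        } catch (DateTimeException e) {
            // Invalid date 'February 29' as '200' is not a leap year
            System.out.println(e.getMessage());
        }
    }
}
```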

Full Stacktrace

java.lang.RuntimeException: java.io.IOException: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/Users/simhadri.govindappa/Documents/apache/hive/itests/qtest/target/localfs/warehouse/test_sgt/sgt000
at org.apache.hadoop.hive.ql.exec.FetchTask.executeInner(FetchTask.java:210)
at org.apache.hadoop.hive.ql.exec.FetchTask.execute(FetchTask.java:95)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:212)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:154)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:149)
at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:185)
at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:230)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:257)
at org.apache.hadoop.hive.cli.CliDriver.processCmd1(CliDriver.java:201)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:127)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:425)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:356)
at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:732)
at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:702)
at org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:116)
at org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:157)
at org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver(TestMiniLlapLocalCliDriver.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.apache.hadoop.hive.cli.control.CliAdapter$2$1.evaluate(CliAdapter.java:135)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at org.junit.runners.Suite.runChild(Suite.java:128)
at org.junit.runners.Suite.runChild(Suite.java:27)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at org.apache.hadoop.hive.cli.control.CliAdapter$1$1.evaluate(CliAdapter.java:95)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:377)
at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:138)
at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:465)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:451)
Caused by: java.io.IOException: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/Users/simhadri.govindappa/Documents/apache/hive/itests/qtest/target/localfs/warehouse/test_sgt/sgt000
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:628)
at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:535)
at org.apache.hadoop.hive.ql.exec.FetchTask.executeInner(FetchTask.java:194)
... 55 more
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/Users/simhadri.govindappa/Documents/apache/hive/itests/qtest/target/localfs/warehouse/test_sgt/sgt000
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:264)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:210)
at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:84)
at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:89)
at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:775)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:339)
at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:566)
... 57 more
Caused by: java.time.DateTimeException: Invalid date 'February 29' as '200' is not a leap year
at java.time.LocalDate.create(LocalDate.java:429)
at java.time.LocalDate.of(LocalDate.java:269)
at java.time.LocalDateTime.of(LocalDateTime.java:361)
at jodd.time.JulianDate.toLocalDateTime(JulianDate.java:344)
at org.apache.hadoop.hive.ql.io.parquet.timestamp.NanoTimeUtils.getTimestamp(NanoTimeUtils.java:111)
at org.apache.hadoop.hive.ql.io.parquet.convert.ETypeConverter$9$2.convert(ETypeConverter.java:782)
at org.apache.hadoop.hive.ql.io.parquet.convert.ETypeConverter$9$2.convert(ETypeConverter.java:764)
at org.apache.hadoop.hive.ql.io.parquet.convert.ETypeConverter$BinaryConverter.setDictionary(ETypeConverter.java:975)
at org.apache.parquet.column.impl.ColumnReaderBase.<init>(ColumnReaderBase.java:415)
at org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:46)
at org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:82)
at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:177)
at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:141)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:230)
... 63 more

Why are the changes needed?

Even if the actual record inserted was 200 CE/03/01 in the Asia/Singapore time zone, Hive runs into an exception and fails with a "not a leap year" exception for 29th Feb 200 when reading the file.

Does this PR introduce any user-facing change?

Is the change a dependency upgrade?

No

How was this patch tested?

A qtest with a Parquet data file that recreates the issue, plus a manual test.

With Parquet 1.8, using the Hive CLI:

  1. Create table in Asia/Singapore Timezone:
    create table default.test_sgt(currtime timestamp) stored as parquet;
  2. Insert record with date 0200-03-01
    insert into default.test_sgt values ('0200-03-01 00:00:00')

Migrate the datafile sgt100

With parquet 1.10.99:

--! qt:timezone:Asia/Singapore
CREATE EXTERNAL TABLE `TEST_SGT`(`currtime` timestamp) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';

LOAD DATA LOCAL INPATH '../../data/files/sgt000' INTO TABLE TEST_SGT;

SELECT * FROM TEST_SGT;

…29th feb and fails with not a leap year exception

Example: Invalid date 'February 29' as '200' is not a leap year
deniskuzZ previously approved these changes May 10, 2024

@deniskuzZ (Member) left a comment:

LGTM +1

Calendar calendar = getGMTCalendar();
calendar.set(Calendar.YEAR, jDateTime.toLocalDateTime().getYear());
Contributor commented:

The fact that the exception originates from JulianDate class makes me believe that this is a bug/problem to be addressed in the jodd library. I also think that this exception is mainly a result of HIVE-25054 which upgraded the jodd library.

Member Author (simhadri-g) replied:

Hi @zabetak, thanks for the review.

The entire issue stems from the removal of JDateTime in jodd v4.3: oblac/jodd@3d9772d

In that change, jodd moved from jodd.datetime.JDateTime to java.time.LocalDateTime, and LocalDateTime doesn't gracefully handle Julian leap-year dates such as 200-02-29.
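
The crux is the difference between the two leap-year rules: year 200 is a leap year under the Julian calendar's every-fourth-year rule, but not under the Gregorian century rule that java.time applies (a small illustration, not Hive code):

```java
public class LeapRules {
    // Julian calendar: every fourth year is a leap year.
    static boolean isJulianLeap(int year) {
        return year % 4 == 0;
    }

    // Gregorian calendar: century years must additionally be divisible by 400.
    static boolean isGregorianLeap(int year) {
        return year % 4 == 0 && (year % 100 != 0 || year % 400 == 0);
    }

    public static void main(String[] args) {
        System.out.println(isJulianLeap(200));    // true
        System.out.println(isGregorianLeap(200)); // false
    }
}
```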

jDateTime = JulianDate.of((double) julianDay);
JulianDate jDateTime = JulianDate.of((double) julianDay);
LocalDateTime localDateTime;
int leapYearDateAdjustment = legacyConversion ? julianLeapYearAdjustment(jDateTime) : 0;
Contributor commented:

Are we sure that this leap year problem affects only the legacyConversion? What is preventing us from getting a problematic julian day when legacyConversion is false?

Comment on lines 137 to 170
private static int julianLeapYearAdjustment(JulianDate jDateTime) {
  Set<Integer> validDaysOfMonth = new HashSet<>();
  validDaysOfMonth.add(1);
  validDaysOfMonth.add(2);
  validDaysOfMonth.add(27);
  validDaysOfMonth.add(28);
  validDaysOfMonth.add(29);
  LocalDateTime localDateTime;
  try {
    localDateTime = jDateTime.toLocalDateTime();
  } catch (DateTimeException e) {
    if (e.getMessage().contains(ErrorMsg.NOT_LEAP_YEAR.getMsg())) {
      return 2;
    } else {
      throw e;
    }
  }

  int year = localDateTime.getYear();
  int dayOfMonth = localDateTime.getDayOfMonth();
  boolean isJulianLeapYear = (year % 4 == 0 && year % 400 != 0 && year % 100 == 0);

  if (year < 1582 && isJulianLeapYear && validDaysOfMonth.contains(dayOfMonth)) {
    switch (localDateTime.getMonth()) {
      case FEBRUARY:
        return -2;
      case MARCH:
        return 2;
      default:
        return 0;
    }
  }
  return 0;
}
Contributor commented:

This logic is complex and pretty expensive in terms of performance for something that will be done for every single timestamp value (especially the try/catch block).

According to the simple test below the problematic Julian days seem to be rather few and they are known in advance:

    int start = JulianDate.of(LocalDate.of(0, 1, 1)).getInteger();
    int end = JulianDate.of(LocalDate.of(9999, 12, 31)).getInteger();
    System.out.println("Start Julian day " + start);
    System.out.println("End Julian day " + end);
    for (int i = start; i <= end; i++) {
      JulianDate date = new JulianDate(i, 0.0);
      try {
        date.toLocalDateTime();
      } catch (Exception e) {
        System.out.println("Problem converting Julian day " + i);
      }
    }
Start Julian day 1721057
End Julian day 5373483
Problem converting Julian day 1757642
Problem converting Julian day 1794167
Problem converting Julian day 1830692
Problem converting Julian day 1903742
Problem converting Julian day 1940267
Problem converting Julian day 1976792
Problem converting Julian day 2049842
Problem converting Julian day 2086367
Problem converting Julian day 2122892
Problem converting Julian day 2195942
Problem converting Julian day 2232467
Problem converting Julian day 2268992

Wouldn't be easier and more efficient to do the shifting based on the 12 dates printed above? Am I missing more problematic dates?
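
A sketch of that precomputed-set idea might look like the following (hypothetical, not the committed fix; the class and method names are made up, and the set is just the 12 Julian day numbers printed above):

```java
import java.util.Set;

public class LegacyJulianDays {
    // The 12 Julian day numbers in years 0-9999 that jodd's JulianDate fails to
    // convert to a LocalDateTime, taken from the loop output above.
    static final Set<Integer> PROBLEMATIC_JULIAN_DAYS = Set.of(
        1757642, 1794167, 1830692, 1903742, 1940267, 1976792,
        2049842, 2086367, 2122892, 2195942, 2232467, 2268992);

    static boolean needsAdjustment(int julianDay) {
        return PROBLEMATIC_JULIAN_DAYS.contains(julianDay);
    }

    public static void main(String[] args) {
        System.out.println(needsAdjustment(1794167)); // true: Julian-calendar 200-02-29
        System.out.println(needsAdjustment(1794168)); // false: 200-03-01
    }
}
```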

Contributor commented:

Moreover, it feels like this kind of shifting should be done inside the JulianDate class. I haven't fully formulated an opinion yet, but I raised oblac/jodd-util#21 to see what the jodd authors have to say about this use case.

Member Author (simhadri-g) replied:

Yes, this is a jodd-time issue caused by this change in jodd: oblac/jodd@3d9772d

In that change, jodd moved from jodd.datetime.JDateTime to java.time.LocalDateTime, and LocalDateTime doesn't gracefully handle Julian leap-year dates such as 200-02-29.

result = Timestamp.valueOf(formatter.format(date));
} catch (IllegalArgumentException ex) {
if (ex.getCause().getMessage().contains(NOT_LEAP_YEAR)) {
int leapYearDateAdjustment = ts.getMonth() == Month.MARCH.getValue() ? 2 : -2;
Member commented:

1 and -1 here are ok

@deniskuzZ (Member) left a comment:

LGTM +1, minor comment + pending tests

sonarcloud bot commented May 15, 2024

Quality Gate passed

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

@simhadri-g simhadri-g merged commit ea70fb8 into apache:master May 16, 2024
5 checks passed