New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HIVE-28249: Parquet legacy timezone conversion converts march 1st to … #5241
Conversation
…29th feb and fails with not a leap year exception Example: Invalid date 'February 29' as '200' is not a leap year
a6b2d47
to
27d6675
Compare
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java
Outdated
Show resolved
Hide resolved
common/src/java/org/apache/hadoop/hive/common/type/TimestampTZUtil.java
Outdated
Show resolved
Hide resolved
common/src/java/org/apache/hadoop/hive/common/type/TimestampTZUtil.java
Outdated
Show resolved
Hide resolved
aea8034
to
5281013
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1
d490b1a
to
88a777e
Compare
88a777e
to
c6941d7
Compare
Calendar calendar = getGMTCalendar(); | ||
calendar.set(Calendar.YEAR, jDateTime.toLocalDateTime().getYear()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that the exception originates from JulianDate
class makes me believe that this is a bug/problem to be addressed in the jodd
library. I also think that this exception is mainly a result of HIVE-25054 which upgraded the jodd library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @zabetak ,
Thanks for the review.
The entire issue starts from Removal of JDateTime here in jodd v4.3 oblac/jodd@3d9772d
In the change , they have moved from jodd.datetime.JDateTime to java.time.LocalDateTime and LocalDateTime doesn't gracefully handle julian leap year dates such as 200-02-29
jDateTime = JulianDate.of((double) julianDay); | ||
JulianDate jDateTime = JulianDate.of((double) julianDay); | ||
LocalDateTime localDateTime; | ||
int leapYearDateAdjustment = legacyConversion ? julianLeapYearAdjustment(jDateTime) : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure that this leap year problem affects only the legacyConversion
? What is preventing us from getting a problematic julian day when legacyConversion
is false
?
private static int julianLeapYearAdjustment(JulianDate jDateTime) { | ||
Set<Integer> validDaysOfMonth = new HashSet<>(); | ||
validDaysOfMonth.add(1); | ||
validDaysOfMonth.add(2); | ||
validDaysOfMonth.add(27); | ||
validDaysOfMonth.add(28); | ||
validDaysOfMonth.add(29); | ||
LocalDateTime localDateTime; | ||
try { | ||
localDateTime = jDateTime.toLocalDateTime(); | ||
} catch (DateTimeException e) { | ||
if (e.getMessage().contains(ErrorMsg.NOT_LEAP_YEAR.getMsg())) { | ||
return 2; | ||
} else { | ||
throw e; | ||
} | ||
} | ||
|
||
int year = localDateTime.getYear(); | ||
int dayOfMonth = localDateTime.getDayOfMonth(); | ||
boolean isJulianLeapYear = (year % 4 == 0 && year % 400 != 0 && year % 100 == 0); | ||
|
||
if (year < 1582 && isJulianLeapYear && validDaysOfMonth.contains(dayOfMonth)) { | ||
switch (localDateTime.getMonth()) { | ||
case FEBRUARY: | ||
return -2; | ||
case MARCH: | ||
return 2; | ||
default: | ||
return 0; | ||
} | ||
} | ||
return 0; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is complex and pretty expensive in terms of performance for something that will be done for every single timestamp value (especially the try catch block).
According to the simple test below the problematic Julian days seem to be rather few and they are known in advance:
int start = JulianDate.of(LocalDate.of(0, 1, 1)).getInteger();
int end = JulianDate.of(LocalDate.of(9999, 12, 31)).getInteger();
System.out.println("Start Julian day " + start);
System.out.println("End Julian day " + end);
for (int i = start; i <= end; i++) {
JulianDate date = new JulianDate(i, 0.0);
try {
date.toLocalDateTime();
} catch (Exception e) {
System.out.println("Problem converting Julian day " + i);
}
}
Start Julian day 1721057
End Julian day 5373483
Problem converting Julian day 1757642
Problem converting Julian day 1794167
Problem converting Julian day 1830692
Problem converting Julian day 1903742
Problem converting Julian day 1940267
Problem converting Julian day 1976792
Problem converting Julian day 2049842
Problem converting Julian day 2086367
Problem converting Julian day 2122892
Problem converting Julian day 2195942
Problem converting Julian day 2232467
Problem converting Julian day 2268992
Wouldn't be easier and more efficient to do the shifting based on the 12 dates printed above? Am I missing more problematic dates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moreover it kinda feels like this kind of shifting should be done inside the JulianDate
class. I haven't fully formulated an opinion yet but I raised oblac/jodd-util#21 to see what the jodd authors have to say about this use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is a jodd time issue cause by this change in jodd time: oblac/jodd@3d9772d
In the jodd change mentioned above , they have moved from jodd.datetime.JDateTime to java.time.LocalDateTime and LocalDateTime doesn't gracefully handle julian leap year dates such as 200-02-29 .
common/src/java/org/apache/hadoop/hive/common/type/TimestampTZUtil.java
Outdated
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java
Outdated
Show resolved
Hide resolved
fe6d136
to
eeac4a5
Compare
result = Timestamp.valueOf(formatter.format(date)); | ||
} catch (IllegalArgumentException ex) { | ||
if (ex.getCause().getMessage().contains(NOT_LEAP_YEAR)) { | ||
int leapYearDateAdjustment = ts.getMonth() == Month.MARCH.getValue() ? 2 : -2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 and -1 here are ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1, minor comment + pending tests
Quality Gate passedIssues Measures |
…29th feb and fails with not a leap year exception
Example: Invalid date 'February 29' as '200' is not a leap year
What changes were proposed in this pull request?
When handling legacy time stamp conversions in parquet,'February 29' year '200' is an edge case.
This is because, according to this: https://www.lanl.gov/Caesar/node202.html
The Julian day for 200 CE/02/29 in the Julian calendar is different from the Julian day in Gregorian Calendar .
Because of this:
int julianDay = nt.getJulianDay();
returns julian day 1794167 https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java#L92Later
Timestamp result = Timestamp.valueOf(formatter.format(date));
formatter.format(date
) returns 29-02-200 as it seems to be using julian calendarbut
Timestamp.valueOf(29-02-200)
seems to be using gregorian calendar and fails with "not a leap year exception" for 29th Feb 200"https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/common/type/TimestampTZUtil.java#L196
Since hive stores timestamp in UTC, when converting 200 CE/03/01 between timezones, hive runs into an exception and fails with "not a leap year exception" for 29th Feb 200 even if the actual record inserted was 200 CE/03/01 in Asia/Singapore timezone.
Full Stacktrace
Why are the changes needed?
Even if the actual record inserted was 200 CE/03/01 in Asia/Singapore timezone hive runs into an exception and fails with "not a leap year exception" for 29th Feb 200 when reading the file.
Does this PR introduce any user-facing change?
Is the change a dependency upgrade?
No
How was this patch tested?
Qtest with a parquet datafile that recreates the issue and manual test.
With parquet 1.8:
with hive cli
create table default.test_sgt(currtime timestamp) stored as parquet;
insert into default.test_sgt values ('0200-03-01 00:00:00')
Migrate the datafile sgt100
With parquet 1.10.99: