Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding LocalDateRange #190

Open
PeterAttardo opened this issue Mar 9, 2022 · 11 comments
Open

Adding LocalDateRange #190

PeterAttardo opened this issue Mar 9, 2022 · 11 comments
Milestone

Comments

@PeterAttardo
Copy link

Per request I am opening an issue to explain #189

My use case comes from my recent exploration of the new Kotlin Spark API. In data engineering, a lot of batch jobs are parameterized with startDate and endDate. Processing the latest date of data in a daily task runner (such as Airflow) is done by feeding the same date as both startDate and endDate; processing a backfill is done by feeding an appropriate range to (re)process.

For some jobs it is necessary to iterate over every date in the range. This could be required if the relevant data files are organized by date in the S3/HDFS/etc path (s3://some-bucket/data_type/{date}/[00..n].parquet), or if the job calls a process that is parameterized by a single date, or another reason.

As mentioned, there is a similar request for Instant Interval #34. I contemplated adding something similar in my pull request (specifically I considered doing for LocalDateTime what I did for LocalDate), but decided against it for the following reasons:

  • The pattern used by the builtin Kotlin progressions requires a default step size. Integers, Longs, and Chars all have a natural value for that step, because they are discrete. LocalDate is likewise discrete, and therefore a progression with a step size of one day is a natural fit. LocalDateTime is not discrete, and therefore there is no natural step size. Should (dateTime1..dateTime2).forEach { ... } run for every second between the two LocalDateTimes? Every hour? Every nanosecond?
  • Without Kotlin's Progression concept being a good fit for LocalDateTime, there is still a valid use case for the general range semantics. However, most of that use case is already covered by the generic ClosedRange<T : [Comparable]<T>> that Kotlin gives you out of the box for any Comparable. (dateTime1..dateTime2).contains(dateTime3) is already provided free of charge.
  • Arithmetic on LocalDateTime is not provided. The reasons for this are likely the same reasons that a LocalDateTime progression is a bad idea.

The third reason does not apply to Instant, but the first two reasons do. If others disagree that they are sufficient to preclude a use case for an InstantRange that applies Kotlin's progression semantics, I can add it to my pull request. But for now I omitted it.

@dkhalanskyjb
Copy link
Contributor

Is it correct to say that the use case is that the API you're working with requires querying data date by date with no option to request the data for a wider range?

@PeterAttardo
Copy link
Author

That would be correct.

@leviathan-n
Copy link

I'm doing this by adding some extensions for ClosedRange<LocalDate> currently
Hope this be accepted

@dkhalanskyjb
Copy link
Contributor

dkhalanskyjb commented Apr 3, 2024

I've browsed through GitHub to see how people are using kotlinx-datetime, and it's clear that this is an important use case (I've seen several implementations of date progressions), and we should implement this. There's a question that we need to address first, though: what constitutes a step between two dates?

  • (The approach taken in Local date range #189) A step is an arbitrary DatePeriod, where months and days can have opposite signs. This leads to some issues already solved in the PR, but also some new ones:
    • When I write date1.rangeTo(date2).step(step) and date2 is later than date1, I don't expect to receive dates earlier than date1, but this can happen;
    • DateTimePeriod(months = 1, days = -30) is considered positive, so (LocalDate(2024, 1, 31)..LocalDate(2024, 1, 30)).step(DatePeriod(months = 1, days = -30)) is empty, even though LocalDate(2024, 1, 31) + DatePeriod(months = 1, days = -30) == LocalDate(2024, 1, 30); that is, actually attempting to iterate over the range produces the same result.
  • A step is an arbitrary DatePeriod, but months and days are required to have the same sign.
    • This limits some of the use cases, but I haven't seen anyone anywhere needing them. Question to the audience: does anyone need date progressions whose steps is some number of months minus some number of days?
    • This is slightly non-idiomatic. DatePeriod(months = 12, days = -1) does seem useless, but it's unclear why passing this particular DatePeriod is forbidden when passing other DatePeriod values is fine.
    • With this, we can easily ensure that iterating never fails. (1..2).step(Int.MAX_VALUE).toList() == listOf(1), not a crash and not some value corrupted by numeric overflows; if we know that a given DatePeriod always acts in the same direction, we can ensure the same behavior for dates.
  • A multiple of DateTimeUnit.DateBased: (date1..date2).step(6, DateTimeUnit.MONTH)
    • This limits some use cases even further. Question to the audience: does anyone need date progressions where the period between dates needs to have both non-zero months and non-zero days, for example, DatePeriod(months = 1, days = 2)? My immediate assumption, given the use cases I've actually seen mentioned in this repo and implemented in the wild, is that no, no one actually needs this.
    • We can ensure that constructing ranges and progressions will never fail with an exception. This is a huge win.
    • How do we represent the step in the progression object?
      • As a DatePeriod that's guaranteed to be some multiple of one DateTimeUnit.DateBased.
        • This can actually result in numeric overflow: (date1..date2).step(Int.MAX_VALUE, DateTimeUnit.MONTH * 12), for example. However, instead of failing, we can just consider the step to be a DatePeriod(months = Int.MAX_VALUE). It's 12 times less than the value that was actually passed in, but the resulting range should be the same in all cases: either empty if date2 < date1 or listOf(date1) otherwise.
        • It's a bit strange to take out what you didn't put. That said, accessing the step is probably a very advanced use case, so this shouldn't cause much confusion in practice.
      • As two fields, like stepScalar: Int and stepUnit: DateTimeUnit.DateBased.
      • Both? stepAsPeriod, stepScalar, and stepUnit?

@PeterAttardo
Copy link
Author

Related to the choice of arbitrary DatePeriod:

I'm not hugely concerned about the first case. A range that goes backward temporarily, but forward ultimately, isn't wrong so much as it is perhaps unexpected. Given that LocalDate arithmetic, with anything other than days, is already a bit of a "here be dragons" proposition, I would expect anyone using complex or compound DatePeriods to be aware of the pitfalls already. For example the implicit rounding down of dates that don't occur in a month means that LocalDate(2001, 1, 31) + DatePeriod(months=1) + DatePeriod(months=-1) yields 2001-01-28; and the fact that arithmetic is done on months prior to days means that LocalDate(2001, 1, 31) + DatePeriod(months=-1, days=31) yields 2001-01-31 whereas if the arithmetic order were flipped it would yield 2001-02-03. Users have to be aware of the underlying assumptions that give rise to those results. Given that the temporarily-backward range is both an uncommon usecase, and one for which the library already assumes a level of user sophistication, I don't think it will present a meaningful problem.

The second case is more interesting. It did occur to me that there could be DatePeriods that behaved differently over an arbitrarily long period than they did over a shorter period, and that treating them as the former may prohibit their use in cases of the latter. It's a bit like a highway that in general trends South to North, but for short sections trends North to South. In those sections are you still heading "North" on the highway, since that's the ultimate direction? Or are you heading South, because that's the instantaneous direction? In terms of date iteration, is DatePeriod(months = 1, days = -30) positive because if you sum it enough times, it will increase? Or is it positive or negative based on the specific months and/or days it is being added to? A related question is how a comparison operation for DatePeriods would work. Does it even make mathematical sense to compare DatePeriods? As a practical matter, I don't see a way to predict these cases without actually conducting the iteration. Empty ranges could only be determined through an exhaustive iteration, which would be exceedingly computationally expensive.

I think ultimately restricting DatePeriods to having the same sign is the cleanest way to guarantee these issues are avoided, but if someone presents a usecase for having mixed signs, that these issues aren't severe enough to restrict that valid usecase.

As a matter of syntax, for step to be an infix function, it can only take a single parameter. So having the step be a DatePeriod is nicer than constructing a scalar/unit pair, or abandoning the infix notation altogether.

@PeterAttardo
Copy link
Author

Thinking about this more, there is an issue that persists in any choice that allows for a step in months (as opposed to just days, or just years*):

If one were to create the following date progression: LocalDate(2001, 1, 31)..LocalDate(2002, 1, 31) step DatePeriod(months=1), would we expect it to yield the 31st of each month, excepting months with 30 days, where it would yield the 30th, and February, where it would yield the 28th? Or would we expect it to yield the 31st of January, and then the 28th of every month thereafter (as once it rounds down to the 28th of February, it stays at the 28th of each month when it progresses from one month to the next)?

The latter is the current behavior with simple DatePeriod arithmetic, as it arises naturally from how month addition works in the library. It feels off, but likely only because I'm (mis)interpreting the progression as being equivalent to "the 31st of each month" when it is not. It may be the case that certain "of each month" progressions are best achieved by creating a stable progression (first of the month, step size of one month, for example), and then modifying each resulting date within the scope function. I don't know if that's a natural first assumption for most users though.

The more I think about it, the more I believe that any progression that allows months inherently comes with caveats. Even if we restrict the steps to only positive months, we would still have caveats like the one above. It may make sense to trend toward maximum freedom, and just put strong warnings in the docs about known quirks.

*technically this would also occur if one tried to iterate, by year, starting on the 29th of February, as one encountered non-leap years, but that's an edge case of an edge case

@dkhalanskyjb
Copy link
Contributor

As a matter of syntax, for step to be an infix function, it can only take a single parameter. So having the step be a DatePeriod is nicer than constructing a scalar/unit pair, or abandoning the infix notation altogether.

It is okay to abandon the infix notation: it's consistent with things like LocalDate(2023, 1, 30).plus(5, DateTimeUnit.DAY), where we can't use the infix notation for plus for the same reason.

LocalDate(2001, 1, 31)..LocalDate(2002, 1, 31) step DatePeriod(months=1), would we expect it to yield the 31st of each month, excepting months with 30 days, where it would yield the 30th, and February, where it would yield the 28th? Or would we expect it to yield the 31st of January, and then the 28th of every month thereafter (as once it rounds down to the 28th of February, it stays at the 28th of each month when it progresses from one month to the next)?

That's a great point. Actually, we can implement this in a way that the last day of the month is always returned (treat this as pseudocode, I neither compiled it nor checked it for edge cases):

sequence {
  var i = 0
  while (true) {
    val nextDate = startDate.plus(i, stepUnit)
    if (nextDate > endDate) break
    yield(nextDate)
    ++i
  }
}

We only have to think carefully about what's less unintuitive / more likely to be useful.

@PeterAttardo
Copy link
Author

I think we can distill this problem down to the following question:

Which of the following accurately describes the next date in a progression?:

  1. The prior date in the progression + the step
  2. The first date in the progression + (the step * the "index" of the date)

The end-of-month edge case is one we've already uncovered where the distinction is meaningful. Are there any others?

@dkhalanskyjb
Copy link
Contributor

  • DateTimeUnit.MONTH: this irregularity only occurs at the end of the month. If the target month contains the day-of-the-month of the source date, then the day-of-the-month is preserved.
  • DateTimeUnit.DAY: there's no distinction at all: date.plus(n, DateTimeUnit.DAY).plus(m, DateTimeUnit.DAY) == date.plus(n + m, DateTimeUnit.DAY).
  • DatePeriod: adding together two DatePeriod values seems semantically meaningless, we can't go down this path at all: addition of DatePeriod is defined to be "first add months, then, add days", so first adding DatePeriod(months = m1, days = d1) and then DatePeriod(months = m2, days = d2) yields date.plus(m1, DateTimeUnit.MONTH).plus(d1, DateTimeUnit.DAY).plus(m2, DateTimeUnit.MONTH).plus(d2, DateTimeUnit.DAY). Additions of days and months can't be swapped, so this is not the same at all as adding DatePeriod(months = m1 + m2, days = d1 + d2). Defining addition on date periods would be a stretch on its own, and including it only in date ranges would mean that one can't use the knowledge of the rest of the library to predict how they behave.

@PeterAttardo
Copy link
Author

Addition on DatePeriods is already defined, is it not? https://github.com/Kotlin/kotlinx-datetime/blob/master/core/common/src/DateTimePeriod.kt#L445

Multiplication could be defined largely the same way:

public operator fun DatePeriod.times(other: Int): DatePeriod = DatePeriod(
    safeMultiply(totalMonths, other),
    safeMultiply(days, other)
)

I think this is ultimately a question of user expectation. Which of the following is more intuitive to a user and less likely to cause confusion:

range[N] = range[N-1] + step
range[N] = range[0] + step * N

@dkhalanskyjb dkhalanskyjb added this to the 0.9.0 milestone Apr 12, 2024
@dkhalanskyjb
Copy link
Contributor

Addition on DatePeriods is already defined, is it not?

Oh, good point, filed an issue: #381

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants
@PeterAttardo @leviathan-n @dkhalanskyjb and others