Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

slow conversion of dates from one tz to another #779

Open
reder2000 opened this issue Mar 1, 2023 · 12 comments
Open

slow conversion of dates from one tz to another #779

reder2000 opened this issue Mar 1, 2023 · 12 comments

Comments

@reder2000
Copy link

Converting a large amount of tz-dates to another tz is painfully slow. This does not look good wrt say, python pandas.

Most of the time is spent in sys_info find_rule(...) which seems to use a linear search.

Maybe a lower_bound could/should be used initially, but the code is somewhat not easy to apprehend and I could not devise a solution by myself.

@Erroneous1
Copy link

I don't know about python pandas, but with Linux's glibc when you convert a struct tm between timezones it only takes into account the current offset from UTC, and not the offset as it was given the original time. This is very fast because it doesn't have to keep looking in the zone_info files, but not accurate.

What you may want to do if you're converting a lot of zoned_times is use the time_zone class's get_info. This will return a sys_info when given a sys_time, but you can also use the local_info.first or local_info.second if you start out with a local_time. From there you can check to see if the next lookup you're going to do is within the sys_info.begin and sys_info.end to make sure you don't need to lookup the info again. Then you can simply add the sys_info.offset to a sys_time or subtract it from a local_time.

@reder2000
Copy link
Author

Clearly approximation is not acceptable for the task I'm pursuing. Not sure about what pandas do though.

Thanks for the help, but it looks more complicated than this. I've stepped through the code. In sys_info find_rule(...) it looks like the search is done linearly through calling find_next_rule
For instance, converting UTC today's date in America/New_York loops 113 times before finding the correct "rule".
The test for accepting rules is not very clear with lots of conversions... Being able to bisect would speed up the thing 20 times.

@HowardHinnant
Copy link
Owner

HowardHinnant commented Mar 1, 2023

Is using USE_OS_TZDB=1 an option for you? This is known to have higher performance. But it isn't an option on Windows.

Documentation: https://howardhinnant.github.io/date/tz.html#Installation

@reder2000
Copy link
Author

Is that different to USE_SYSTEM_TZ_DB (already ON) ?
I have tried USE_OS_TZDB=1 but did not find any performance difference. But maybe the option is not really swallowed (not sure FetchContent_Declare really gets it).
And yes, I'd prefer to have the same performance regardless of OS.

ZonedTime forced me to parallelize / adapt lots of things (csv parser, vector zt change, vector zt serialization with only 1 zt, custom string parser).
And yet date is also much better wrt to performance than MS chrono....

@HowardHinnant
Copy link
Owner

I don't use the CMake systems associated with this project and deeply regret their introduction. They introduce way too much complication for compiling a single source: tz.cpp. Their over encapsulation is a continual source of errors.

For initializing a vector<zoned_time<D>> consider: vector<zoned_time<D>> v{N, zoned_time<D>{}};.

@HowardHinnant
Copy link
Owner

I don't have a quick fix for this besides USE_OS_TZDB=1. However the suggestion by @Erroneous1 is a good one, and is achievable with a little effort.

Consider this HelloWorld timing test of converting 10,000 local times from New York to London:

#include <cassert>
#include <iostream>
#include <stdexcept>
#include <numeric>
#include <vector>

int
main()
{
    using namespace date;
    using namespace std;
    using namespace std::chrono;

    auto tz_ny = locate_zone("America/New_York");
    auto tz_lon = locate_zone("Europe/London");
    vector<local_time<system_clock::duration>> v_ny(10'000);
    for (auto& tp : v_ny)
        tp = tz_ny->to_local(system_clock::now());
    vector<local_time<system_clock::duration>> v_lon;
    v_lon.reserve(v_ny.size());
    auto t0 = steady_clock::now();
    for (auto tp : v_ny)
        v_lon.push_back(tz_lon->to_local(tz_ny->to_sys(tp)));
    auto t1 = steady_clock::now();
    cout << t1 - t0 << '\n';
}

For me this prints out about:

65,199,481ns

However I can optimize it by caching recently used time zones and infos with:

template <class Duration>
auto
to_local(date::sys_time<Duration> tp, date::time_zone const* tz)
{
    using namespace date;

    thread_local time_zone const* tz_save = tz;
    thread_local sys_info info = tz->get_info(tp);

    if (tz != tz_save || !(info.begin <= tp && tp < info.end))
    {
        tz_save = tz;
        info = tz->get_info(tp);
    }
    return local_time<Duration>{tp.time_since_epoch() + info.offset};
}

template <class Duration>
auto
to_sys(date::local_time<Duration> tp, date::time_zone const* tz)
{
    using namespace date;

    auto get_info = [](date::local_time<Duration> tp, date::time_zone const* tz)
        {
            auto info = tz->get_info(tp);
            if (info.result != local_info::unique)
                throw std::runtime_error("local time point is not unique");
            return info;
        };

    thread_local time_zone const* tz_save = tz;
    thread_local local_info info = get_info(tp, tz);

    if (tz != tz_save)
    {
        tz_save = tz;
        info = get_info(tp, tz);
        return sys_time<Duration>{tp.time_since_epoch() - info.first.offset};
    }
    sys_time<Duration> utc_tp{tp.time_since_epoch() - info.first.offset};
    if (info.first.begin <= utc_tp && utc_tp < info.first.end)
        return utc_tp;
    info = get_info(tp, tz);
    return sys_time<Duration>{tp.time_since_epoch() - info.first.offset};
}

The test changes from:

    for (auto tp : v_ny)
        v_lon.push_back(tz_lon->to_local(tz_ny->to_sys(tp)));

to:

    for (auto tp : v_ny)
        v_lon.push_back(to_local(to_sys(tp, tz_ny), tz_lon));

Now my output is more like:

1,951,266ns

About 33x faster. Of course this won't be effective if the time zones involved change a lot, or if the time points are not clumped such that they are often close together. ymmv.

@reder2000
Copy link
Author

I don't use the CMake systems associated with this project and deeply regret their introduction. They introduce way too much complication for compiling a single source: tz.cpp. Their over encapsulation is a continual source of errors.

CMake is a necessary evil. Without it (and vcpkg) C++ folks look like idiots when they try to bring a library in, compared to Python people.

@HowardHinnant
Copy link
Owner

Do you know if it is possible for clients of a repository to manage a CMake-less repository with their own CMake scripts?

@reder2000
Copy link
Author

reder2000 commented Mar 3, 2023

About 33x faster. Of course this won't be effective if the time zones involved change a lot, or if the time points are not clumped such that they are often close together. ymmv.

Thanks a lot, that indeed helped. In more than one place. With // I convert 7 million dates in 1 sec in debug. And I handled default value as well. The whole thing flies.
You should add those function somewhere.

@reder2000
Copy link
Author

Do you know if it is possible for clients of a repository to manage a CMake-less repository with their own CMake scripts?

One can use fetch_content to populate one own's tree at configure time, and include them in their build.
But no one would do that (and no one does, even for header-only libraries) since it is a bit convoluted to find back where those files are stored, and properly set include paths.

I quite honestly think that the CMakeFile correctly transcripts the various build parameters if one needs to move away from vcpkg defaults (which I did for windows while I struggled with pybind11).

@HowardHinnant
Copy link
Owner

You should add those function somewhere.

I'm toying with a new tz database format that has the potential to obsolete any performance concerns. But is is very early days on that...

@Erroneous1
Copy link

Erroneous1 commented Mar 3, 2023

It's possible to produce your own FindDate.cmake file which would allow a CMake based project use this library, however it's generally much less error-prone when using an installed library or a strictly header-only library that doesn't have many flags. Providing your own CMake files, however, solves issues that weren't apparent when you first made date.h

  • date is popular as it solves a major STL hole (yay C++20/23), and many libraries have started to use it
  • date might have its own dependencies (cURL)
  • The intended purpose is to have the final project use date directly, including the header only date.h and tz.h as needed, and optionally compiling tz.cpp, not to have things like date::year_month_day or date::time_zone referenced in intermediary libraries
  • Since it's popular, multiple dependency libraries might require date and might compile with different versions or with different flags
  • Generally to get around compiler options, many libraries install shared/static libraries and headers to the system and provide one or more of:
    • dateconf.h which has the compiled options (e.g. libpng provides pngconf.h)
    • lib/pkgconfig/libdate.pc for pkg-config projects (e.g. libpng provides libpng.pc)
    • lib/cmake/Date/DateConfig.cmake for CMake (e.g. boost provides BoostConfig.cmake)
    • a binary like Date-config which has options for --cppflags, --ldflags, etc. (e.g. ImageMagick provides Magic++-config)

Of these, CMake might be the easiest to do, even though it has its problems. It also has the advantage of being used in vcpkg which takes care of many Microsoft-based developers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants