Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resync resampler on system time changes #802

Open
wants to merge 6 commits into
base: v1.x.x
Choose a base branch
from

Conversation

matthias-wende-frequenz
Copy link
Contributor

The resampler was initializing an internal time on initialization by syncing itself to the system clock.

This sync was never redone and thus lead to out-of-sync issues when the system clock changed, which could cause that the resampling window was empty even when samples, with the new system time, arrived.

@github-actions github-actions bot added the part:data-pipeline Affects the data pipeline label Dec 1, 2023
@matthias-wende-frequenz
Copy link
Contributor Author

We can discuss getting pulling this in and add a test shortly after in a separate PR, to speed up the patch release. What do you think?

@matthias-wende-frequenz matthias-wende-frequenz added the priority:URGENT Address this immediately or the sky will fall label Dec 1, 2023
@llucax
Copy link
Contributor

llucax commented Dec 1, 2023

We can discuss getting pulling this in and add a test shortly after in a separate PR, to speed up the patch release. What do you think?

What's the plan here? If the plan is that this is tested in production (or somewhere else), any reason not to just temporarily point the SDK dependency to your fork's branch (https://github.com/matthias-wende-frequenz/frequenz-sdk-python/tree/fix_resampling)?

I'd be OK to merge first and do the test afterwards if there is no other way around, but if we can do whatever we need without accepting PRs without tests, I think it makes more sense.

@matthias-wende-frequenz
Copy link
Contributor Author

matthias-wende-frequenz commented Dec 1, 2023

I'd be OK to merge first and do the test afterwards if there is no other way around, but if we can do whatever we need without accepting PRs without tests, I think it makes more sense.

I was thinking that we can unblock the fix. But I feel it's worth adding the tests before merging. Also the tests are almost ready. Will add them to this PR soon.

@github-actions github-actions bot added the part:docs Affects the documentation label Dec 2, 2023
@github-actions github-actions bot added the part:tests Affects the unit, integration and performance (benchmarks) tests label Dec 4, 2023
@matthias-wende-frequenz matthias-wende-frequenz marked this pull request as ready for review December 5, 2023 12:08
@matthias-wende-frequenz matthias-wende-frequenz requested a review from a team as a code owner December 5, 2023 12:08
Signed-off-by: Matthias Wende <matthias.wende@frequenz.com>
Adds a boolean parameter if an extra period should be added
to the calculated window end or not.
Signed-off-by: Matthias Wende <matthias.wende@frequenz.com>
Signed-off-by: Matthias Wende <matthias.wende@frequenz.com>
Signed-off-by: Matthias Wende <matthias.wende@frequenz.com>
Signed-off-by: Matthias Wende <matthias.wende@frequenz.com>
Signed-off-by: Matthias Wende <matthias.wende@frequenz.com>
@@ -457,6 +449,34 @@ def remove_timeseries(self, source: Source) -> bool:
return False
return True

def _sync_timer(self, extra_period: bool = True) -> datetime:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silly, but I would either remove this extra period, as is it only a startup glitch we are trying to fix (I guess it made sense in the original code because it was free), or to switch the default, as the normal situation seems to be not adding an extra period.

src/frequenz/sdk/timeseries/_resampling.py Show resolved Hide resolved
# new samples aligned to the new system time have been received.
# Thus we resync `self._window_end` to the new system time in case
# it drifted more then one resampling period away from the system time.
if abs(self._window_end - now) - drift > self._config.resampling_period:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly this is still not enough. Both drifts need to be handled independently, the mixing of both could end up hiding issues. For example:

monotonic clock
0         1         2         3         4         5         6 
|---------|---------|---------|---------|---------|---------|---------> time
o         o         o         o         o         o
          t1        t2        t3        t4        t5                    expected ticks
            T1       T2       T3                     T4                 observed ticks
O           *        *        *                      *
|------------|------------|------------|------------|------------|-------> time
0            1            2            3            4            5
wall clock

resampling_period = 1.0, and the wall clock has an accumulating drift of 0.3s per second.

T1:
* now = ~0.9
* window_end = 1.0
* drift = 0.2
* abs(self._window_end - now) - drift > self._config.resampling_period
  abs(             1.0 - 0.9) - 0.2   > 1.0  
                          0.1 - 0.2   > 1.0
                               -0.1   > 1.0 => False

T2:
* now = ~1.7
* window_end = 2.0
* drift = 0.1
* abs(self._window_end - now) - drift > self._config.resampling_period
  abs(             2.0 - 1.7) - 0.1   > 1.0  
                          0.3 - 0.1   > 1.0
                                0.2   > 1.0 => False

T3:
* now = ~2.4
* window_end = 3.0
* drift = 0
* abs(self._window_end - now) - drift > self._config.resampling_period
  abs(             3.0 - 2.4) - 0.0   > 1.0  
                          0.6 - 0.0   > 1.0
                                0.6   > 1.0 => False

T4:
* now = ~4.1
* window_end = 4.0
* drift = 1.3
* abs(self._window_end - now) - drift > self._config.resampling_period
  abs(             4.0 - 4.1) - 1.3   > 1.0  
                          0.1 - 1.3   > 1.0
                               -1.2   > 1.0 => False

I don't know, from this maybe it's enough to use abs(abs(we - now) - abs(drift)) > period, I have to think about it a bit more. But before doing that, now I really wonder why we are doing all this... 🤔

Don't we just want to always sync to the wall clock, so effectively make the resampler count periods in terms of the wall clock instead of the monotonic clock?

O maybe we just want to emit warnings if the wall clock is drifting? Having a drifting wall clock is really an operational issue, so I'm not sure if we should try to fix it in the code, as it will probably bring all sorts of other issues too.

And actually I'm not even sure we should warn about it in the SDK, probably we need some system-level monitoring for this instead. Is like if there were a disk failure and we try to warn the user about it and cope with it by reallocating stuff. I think is beyond the SDK's responsibilities.

It really looks like we are trying to patch a broken system here, and I'm not sure if it is a good idea...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I quickly drafted a solution that I think makes everything much easier. I just created a new timer type that is attached to the wall clock especially for the Resampler. We could then polish it and move it to the channels repo if we think it is generally useful. For now it doesn't have all the complexity of the missing ticks policies of the channels timer, it only implements the trigger all missing policy that's needed by the resampler.

Here is the proof of concept PR, against this PR:

The resampler gets much simplied with this, as it doesn't need to care about any time management anymore.

@llucax llucax modified the milestones: v1.0.0-rc4, v1.0.0-rc5 Jan 29, 2024
@llucax llucax modified the milestones: v1.0.0-rc5, v1.0.0-rc6 Feb 19, 2024
@llucax llucax modified the milestones: v1.0.0-rc6, v1.0.0-rc7 Mar 26, 2024
@llucax llucax removed the priority:URGENT Address this immediately or the sky will fall label May 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
part:data-pipeline Affects the data pipeline part:docs Affects the documentation part:tests Affects the unit, integration and performance (benchmarks) tests
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

None yet

3 participants