-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resync resampler on system time changes #802
base: v1.x.x
Are you sure you want to change the base?
Resync resampler on system time changes #802
Conversation
We can discuss getting pulling this in and add a test shortly after in a separate PR, to speed up the patch release. What do you think? |
What's the plan here? If the plan is that this is tested in production (or somewhere else), any reason not to just temporarily point the SDK dependency to your fork's branch (https://github.com/matthias-wende-frequenz/frequenz-sdk-python/tree/fix_resampling)? I'd be OK to merge first and do the test afterwards if there is no other way around, but if we can do whatever we need without accepting PRs without tests, I think it makes more sense. |
I was thinking that we can unblock the fix. But I feel it's worth adding the tests before merging. Also the tests are almost ready. Will add them to this PR soon. |
e66b651
to
c84327b
Compare
c84327b
to
c4bb046
Compare
d64c7c2
to
f0ecc06
Compare
Signed-off-by: Matthias Wende <matthias.wende@frequenz.com>
Adds a boolean parameter if an extra period should be added to the calculated window end or not. Signed-off-by: Matthias Wende <matthias.wende@frequenz.com>
Signed-off-by: Matthias Wende <matthias.wende@frequenz.com>
f0ecc06
to
9600ba7
Compare
Signed-off-by: Matthias Wende <matthias.wende@frequenz.com>
Signed-off-by: Matthias Wende <matthias.wende@frequenz.com>
Signed-off-by: Matthias Wende <matthias.wende@frequenz.com>
9600ba7
to
4cc5902
Compare
@@ -457,6 +449,34 @@ def remove_timeseries(self, source: Source) -> bool: | |||
return False | |||
return True | |||
|
|||
def _sync_timer(self, extra_period: bool = True) -> datetime: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Silly, but I would either remove this extra period, as is it only a startup glitch we are trying to fix (I guess it made sense in the original code because it was free), or to switch the default, as the normal situation seems to be not adding an extra period.
# new samples aligned to the new system time have been received. | ||
# Thus we resync `self._window_end` to the new system time in case | ||
# it drifted more then one resampling period away from the system time. | ||
if abs(self._window_end - now) - drift > self._config.resampling_period: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly this is still not enough. Both drifts need to be handled independently, the mixing of both could end up hiding issues. For example:
monotonic clock
0 1 2 3 4 5 6
|---------|---------|---------|---------|---------|---------|---------> time
o o o o o o
t1 t2 t3 t4 t5 expected ticks
T1 T2 T3 T4 observed ticks
O * * * *
|------------|------------|------------|------------|------------|-------> time
0 1 2 3 4 5
wall clock
resampling_period = 1.0, and the wall clock has an accumulating drift of 0.3s per second.
T1:
* now = ~0.9
* window_end = 1.0
* drift = 0.2
* abs(self._window_end - now) - drift > self._config.resampling_period
abs( 1.0 - 0.9) - 0.2 > 1.0
0.1 - 0.2 > 1.0
-0.1 > 1.0 => False
T2:
* now = ~1.7
* window_end = 2.0
* drift = 0.1
* abs(self._window_end - now) - drift > self._config.resampling_period
abs( 2.0 - 1.7) - 0.1 > 1.0
0.3 - 0.1 > 1.0
0.2 > 1.0 => False
T3:
* now = ~2.4
* window_end = 3.0
* drift = 0
* abs(self._window_end - now) - drift > self._config.resampling_period
abs( 3.0 - 2.4) - 0.0 > 1.0
0.6 - 0.0 > 1.0
0.6 > 1.0 => False
T4:
* now = ~4.1
* window_end = 4.0
* drift = 1.3
* abs(self._window_end - now) - drift > self._config.resampling_period
abs( 4.0 - 4.1) - 1.3 > 1.0
0.1 - 1.3 > 1.0
-1.2 > 1.0 => False
I don't know, from this maybe it's enough to use abs(abs(we - now) - abs(drift)) > period
, I have to think about it a bit more. But before doing that, now I really wonder why we are doing all this... 🤔
Don't we just want to always sync to the wall clock, so effectively make the resampler count periods in terms of the wall clock instead of the monotonic clock?
O maybe we just want to emit warnings if the wall clock is drifting? Having a drifting wall clock is really an operational issue, so I'm not sure if we should try to fix it in the code, as it will probably bring all sorts of other issues too.
And actually I'm not even sure we should warn about it in the SDK, probably we need some system-level monitoring for this instead. Is like if there were a disk failure and we try to warn the user about it and cope with it by reallocating stuff. I think is beyond the SDK's responsibilities.
It really looks like we are trying to patch a broken system here, and I'm not sure if it is a good idea...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I quickly drafted a solution that I think makes everything much easier. I just created a new timer type that is attached to the wall clock especially for the Resampler
. We could then polish it and move it to the channels repo if we think it is generally useful. For now it doesn't have all the complexity of the missing ticks policies of the channels timer, it only implements the trigger all missing policy that's needed by the resampler.
Here is the proof of concept PR, against this PR:
The resampler gets much simplied with this, as it doesn't need to care about any time management anymore.
The resampler was initializing an internal time on initialization by syncing itself to the system clock.
This sync was never redone and thus lead to out-of-sync issues when the system clock changed, which could cause that the resampling window was empty even when samples, with the new system time, arrived.