Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sntp: fix the handling of the rtc counter wrapping #148

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

sntp: fix the handling of the rtc counter wrapping #148

wants to merge 2 commits into from

Conversation

ghost
Copy link

@ghost ghost commented Jun 20, 2016

This reworks the accounting of the time to address the wrapping of the
rtc counter. It currently assumes that the time is accessed more
frequently than the counter would wrap. It also adds a semaphore to
synchronize access to the time storage.

#define tim_ref (RTC.SCRATCH[2])
// Calibration value
#define cal (RTC.SCRATCH[3])

// To protect access to the above.
xSemaphoreHandle sntp_sem = NULL;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be declared static, as it's only accessed inside this source file?

@ghost
Copy link
Author

ghost commented Jun 26, 2016

Thank you for the feedback. Changed the semaphore to be static.

I have since tried other ways to improve this code and think more could be done, but this PR fixes some significant issues. Fwiw I found the RTC and/or the calibrations to be rather unstable and gave up on using sntp for obtaining times for a data logging application and will log the rtc count and the calibration and have the time estimated when analysing the data.

@ghost
Copy link
Author

ghost commented Jun 26, 2016

Just add a bit more. I also tried to use the sntp times to estimate the calibration, and only using the internal calibration to reset this if the time step was large. The corrections to the calibration were filtered and it did converge to give a lot better time than just using the internal calibration. However, it still received impulse errors larger than a second and then took time to settle. When there was a network outage it drifted significantly - the error had typically not settled so it had little chance.

I guess it needed to also take into account the jitter etc. Also resetting the time with each sntp response causes jumps in the time, and these can be backwards, so I think that is the wrong approach. When the time difference is within a small bounds it should probably just be making small periodic adjustments to avoid jumps and to avoid going backwards. Surely there is a lot of state-of-the-art NTP code that could be followed.

For my application at hand it seemed to be less demanding on the node to just log the times when the node interacts with the server etc, and it might even have GPS events logged with a time etc in some deployments. A smooth time line might then be fitted to the events during analysis.

@projectgus
Copy link
Contributor

Thanks for this fix and the comments. I think the suggestions you make sound sensible, although I agree that for a lot of purposes this is good enough.

I'm keen to merge this, but I just want to check first if @doragasu - who contributed the original SNTP implementation - has any comments.

@doragasu
Copy link

I had a quick look and it looks good, thanks for the work! I already knew that synchronizing access to time keeping structures was needed, but wasn't brave enought to try because I don't know newlib internals, and I thought that it provided a built-in synchronization mechanism (instead of having to use semaphores). But after reading your patch, I suppose that's not the case.

Other than that, I'd like to throw a comment. Inside sntp_update_rtc() you are reading the timer value after xSemaphoreTake(). I think the order should be reversed, so the instant you read the timer is as close as possible to the instant the SNTP server sends you the updated time. Have in mind that as the semaphore might block, reading the timer after taking the semaphore might introduce very big differences.

If I get some time I'll try reviewing the code more in-depth and testing it.

iosen added 2 commits July 1, 2016 00:05
This reworks the accounting of the time to address the wrapping of the
rtc counter. It currently assumes that the time is accessed more
frequently than the counter would wrap. It also adds a semaphore to
synchronize access to the time storage.
@ghost
Copy link
Author

ghost commented Jun 30, 2016

Thank you for the feedback. @doragasu the mutex protects the state that handles the counter wrapping too, so if the RTC counter were read before taking the mutex then another task might win a race to update the tim_ref in sntp_get_rtc_time() in which case the time difference in sntp_update_rtc would be negative which is not handled there. It might need some more design work.

@doragasu
Copy link

doragasu commented Jul 1, 2016

@Iosen I understand, thanks for explaining. If I read the timer inside sntp_update_rtc(), then after I enter the critical section, the timer wraps and sntp_get_rtc_time() gets executed, tim_ref will be updated and when finally sntp_update_rtc() enters the critical section, things will go nuts.

OK, having that cleared, I have absolutely no problem with the changes, they look good to me.

BTW, when I did some tests of the implementation accuracy, I saw that it is heavily dependent on the network you use. When I did tests at my worplace, I got much lower correction values each time SNTP was updated that when I did the same tests at home (my home network is slower and has greater ping). As SNTP does not take RTT into account, I suppose this is expected.

I don't think it makes too much sense trying to improve this implementation accuracy: if you want better accuracy maybe the correct thing would be implementing a full blown NTP client.

@ghost
Copy link
Author

ghost commented Jul 1, 2016

@doragasu See #157 for an alternative design which seems to be more stable. The results of sdk_system_rtc_clock_cali_proc() seem so noisy here that it is not great for calibration.

@sheinz sheinz added the extra label Nov 29, 2016
@funnydog
Copy link
Contributor

funnydog commented Mar 15, 2017

About the accuracy of the algorithm, I was looking at the relevant RFC4330 and in section 5 there is a simple algorithm that compensates for the round trip time of the packets.

I've hacked together a proof of concept and it seems to work well here: I converted every timestamp to int64_t and changed sntp_update_rtc() to take an int64_t offset instead.

However I deeply changed also the source of sntp.c because it's a mess of #ifdefs and I wanted to understand the code before doing anything, so I'm afraid the changes I made are not directly appliable.

@funnydog
Copy link
Contributor

After a lot of experiments this is how it converges with an update interval of 15 sec (Delay and Offset are in usecs):

Delay: 33945, Offset: 2041547930, Cali: 25804
Delay: 33737, Offset: 13475, Cali: 25827
Delay: 36122, Offset: 9730, Cali: 25843
Delay: 14981, Offset: 186, Cali: 25843
Delay: 17382, Offset: 5573, Cali: 25852
Delay: 15722, Offset: 3952, Cali: 25858
Delay: 15534, Offset: 167, Cali: 25858
Delay: 15403, Offset: 6756, Cali: 25869
Delay: 15337, Offset: 1585, Cali: 25871
Delay: 16062, Offset: -417, Cali: 25871

While the offsets become very small it seems to me that the embedded RTC is quite unstable. The calibration value keeps increasing and still cannot compensate the offsets; and then suddenly the same calibration value is too big.

With longer update intervals the situation gets worse.

Besides my code updates the RTC.SCRATCH[0..1] locations directly so the clock is not monotonic. I tried to adjust gradually but with those fluctuations it doesn't really work.

For the record this is how the sntp_update_rtc() function looks right now:

void sntp_update_rtc(int64_t offset)
{
	static uint32_t last_cal_tim = 0;

	xSemaphoreTake(sntp_sem, portMAX_DELAY);
	uint32_t tim = RTC.COUNTER;
	uint32_t diff = tim - tim_ref;
	tim_ref = tim;

	uint64_t diff_us = ((uint64_t)diff * sntp_cali) >> 12;
	sntp_base = sntp_base + diff_us + offset;
	if (-100000 < offset && offset <= 100000) {
		sntp_cali += offset * 4096 / (tim - last_cal_tim);
	}
	last_cal_tim = tim;

	xSemaphoreGive(sntp_sem);
}

@Rutger798
Copy link

I see there is a fix for the race conditions in the sntp_fun, but nobody merged them?
is the SuperHouse/esp-open-rtos EndOfLife or could you please merge this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants