Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential fix for the celery beat scheduler issue when not running in UTC #4324

Merged
merged 2 commits into from
Dec 13, 2017
Merged

Potential fix for the celery beat scheduler issue when not running in UTC #4324

merged 2 commits into from
Dec 13, 2017

Conversation

matteius
Copy link
Contributor

I first encountered the traced the issue of #4184 in a project space I worked on but the work around was to change to UTC. I was playing with adding logging tonight and used these two django settings to convert between the working and non working version (adding these makes it non-working):

CELERY_ENABLE_UTC = False
CELERY_TIMEZONE = 'America/New_York'

So it seems the issue is that when localize is called by to_local_fallback that it is already time zone aware. In fact it was always aware up the stack, but when it gets converted using astimezone, it becomes a date in the future and so it makes celery beat think that the task is always past due. My print output may not make the most sense, but I annotated a few methods of the same name and you can see in my example the exact moment the datetime went from a -04:00 offset to +20:00. Since this is a common method and this line of code hasn't changed since 2012, I am not sure the exact direction to take fixing the bug, but I do want to fix it. I have much time invested at this point.

[2017-10-11 22:40:24,316] Scheduler: Sending due task service.tasks.send_reminder (service.tasks.send_reminder)
[2017-10-11 22:40:24,317] service.tasks.send_reminder sent. id->2d97d6e5-80ce-488a-83f1-67700dbbc7d7
[2017-10-11 22:40:24,317] crontab.->is_due
[2017-10-11 22:40:24,317] 2017-10-11 22:40:24.316813-04:00
[2017-10-11 22:40:24,318] :remaining_delta start
[2017-10-11 22:40:24,318] :to_local
[2017-10-11 22:40:24,318] 2017-10-11 22:40:24.316813-04:00
[2017-10-11 22:40:24,318] utc not enabled
[2017-10-11 22:40:24,318] <celery.utils.time._Zone object at 0x7f329ba84d30>
[2017-10-11 22:40:24,318] :to_local_fallback
[2017-10-11 22:40:24,318] 2017-10-11 22:40:24.316813-04:00
[2017-10-11 22:40:24,318] <LocalTimezone: UTC-04>
[2017-10-11 22:40:24,318] ::localize
[2017-10-11 22:40:24,318] 2017-10-12 22:40:24.316813+20:00
[2017-10-11 22:40:24,318] <LocalTimezone: UTC-04>
[2017-10-11 22:40:24,318] AttributeError
[2017-10-11 22:40:24,318] 'LocalTimezone' object has no attribute 'normalize'
[2017-10-11 22:40:24,319] 2017-10-12 22:40:24.316813+20:00
[2017-10-11 22:40:24,319] ----
[2017-10-11 22:40:24,319] :to_local
[2017-10-11 22:40:24,319] 2017-10-11 22:40:24.318128-04:00
[2017-10-11 22:40:24,319] utc not enabled
[2017-10-11 22:40:24,319] <celery.utils.time._Zone object at 0x7f329ba84d30>
[2017-10-11 22:40:24,319] :to_local_fallback
[2017-10-11 22:40:24,319] 2017-10-11 22:40:24.318128-04:00
[2017-10-11 22:40:24,319] <LocalTimezone: UTC-04>
[2017-10-11 22:40:24,319] ::localize
[2017-10-11 22:40:24,319] 2017-10-12 22:40:24.318128+20:00
[2017-10-11 22:40:24,319] <LocalTimezone: UTC-04>
[2017-10-11 22:40:24,319] AttributeError
[2017-10-11 22:40:24,319] 'LocalTimezone' object has no attribute 'normalize'
[2017-10-11 22:40:24,319] 2017-10-12 22:40:24.318128+20:00
[2017-10-11 22:40:24,319] ----
[2017-10-11 22:40:24,320] 2017-10-12 22:40:24.316813+20:00
[2017-10-11 22:40:24,320] ffwd(hour=11, minute=0, second=0, microsecond=0, weeks=0, weekday=3)
[2017-10-11 22:40:24,320] 2017-10-12 22:40:24.318128+20:00
[2017-10-11 22:40:24,320] :remaining_delta finish
[2017-10-11 22:40:24,320] -1 day, 12:19:35.681872
[2017-10-11 22:40:24,320] :remaining_delta start
[2017-10-11 22:40:24,320] :to_local
[2017-10-11 22:40:24,320] 2017-10-11 22:40:24.320389-04:00
[2017-10-11 22:40:24,320] utc not enabled
[2017-10-11 22:40:24,320] <celery.utils.time._Zone object at 0x7f329ba84d30>
[2017-10-11 22:40:24,320] :to_local_fallback
[2017-10-11 22:40:24,321] 2017-10-11 22:40:24.320389-04:00
[2017-10-11 22:40:24,321] <LocalTimezone: UTC-04>
[2017-10-11 22:40:24,321] ::localize
[2017-10-11 22:40:24,321] 2017-10-12 22:40:24.320389+20:00
[2017-10-11 22:40:24,321] <LocalTimezone: UTC-04>
[2017-10-11 22:40:24,321] AttributeError
[2017-10-11 22:40:24,321] 'LocalTimezone' object has no attribute 'normalize'
[2017-10-11 22:40:24,321] 2017-10-12 22:40:24.320389+20:00
[2017-10-11 22:40:24,321] ----
[2017-10-11 22:40:24,321] :to_local
[2017-10-11 22:40:24,321] 2017-10-11 22:40:24.320505-04:00
[2017-10-11 22:40:24,321] utc not enabled
[2017-10-11 22:40:24,322] <celery.utils.time._Zone object at 0x7f329ba84d30>
[2017-10-11 22:40:24,322] :to_local_fallback
[2017-10-11 22:40:24,322] 2017-10-11 22:40:24.320505-04:00
[2017-10-11 22:40:24,322] <LocalTimezone: UTC-04>
[2017-10-11 22:40:24,322] ::localize
[2017-10-11 22:40:24,322] 2017-10-12 22:40:24.320505+20:00
[2017-10-11 22:40:24,322] <LocalTimezone: UTC-04>
[2017-10-11 22:40:24,322] AttributeError
[2017-10-11 22:40:24,322] 'LocalTimezone' object has no attribute 'normalize'
[2017-10-11 22:40:24,322] 2017-10-12 22:40:24.320505+20:00
[2017-10-11 22:40:24,322] ----
[2017-10-11 22:40:24,322] 2017-10-12 22:40:24.320389+20:00
[2017-10-11 22:40:24,322] ffwd(hour=11, minute=0, second=0, microsecond=0, weeks=0, weekday=3)
[2017-10-11 22:40:24,322] 2017-10-12 22:40:24.320505+20:00
[2017-10-11 22:40:24,322] :remaining_delta finish
[2017-10-11 22:40:24,323] is due:
[2017-10-11 22:40:24,323] True
[2017-10-11 22:40:24,323] next time to run:
[2017-10-11 22:40:24,323] 0

@matteius matteius changed the title not a permenant fix, but part of the problem Potential fix for the celery beat scheduler issue when not running in UTC Oct 12, 2017
@matteius
Copy link
Contributor Author

matteius commented Oct 12, 2017

#4184

I have pin-pointed the exact line of code that is converting my already timezone aware datetime to a future datetime in an offset +20 that I am sure isn't a real timezone offset.
2017-10-11 22:42:27.041931-04:00 gets converted to 2017-10-12 22:42:27.041931+20:00 at the line in my Pull Request.
Apparently in UTC mode the datetime object stays the same at this point in the code. So what happens next is the result of remaining_delta is interpreted as being -1 day, 1:27:32.958069 behind the task schedule. So it sends out the task and doesn't sleep long because its always behind. It just keeps beating out the tasks, always becuase its -1 days until the task is due.

Granted my PR is commenting out a real old line of code, and yet all the unit tests seemed to pass and it did solve this issue in my testing. Looking from collaborator feedback.

@thedrow
Copy link
Member

thedrow commented Oct 15, 2017

That's quite a scary change you came up with 😃
I'll have to review this more closely.

@matteius
Copy link
Contributor Author

@thedrow I suspect an alternative change might be to check if the dt passed to localize already has tzinfo, and if it does not already have it to still call .astimezone(tz) on it. It seems there are some usages of localize outside the scope of what I was testing, but it also seems there are lacking unit tests around these date time expectations, because as I am able to show, this line of code from 2012 breaks no tests and apparently hinders the BaseSchedule when it calls to_local() without being in UTC time, hence this bug.

@thedrow
Copy link
Member

thedrow commented Oct 16, 2017

The real fix should be to use pytz for localizing timezones since it should handle it correctly.
Furthermore, if you don't mind adding some test coverage for your change that will ensure such timezone bugs won't regress.

@matteius
Copy link
Contributor Author

Are you suggesting that the pytz should be called directly and replace this localize(dt, tz) method? It seems this is a compat method that tries to support multiple versions of pytz. For example, its calling tz.normalize which is present in pytz 2012f however in the latest pytz 2017.2 this always raises AttributeError.

I'd be happy to add some test coverage as soon as I finish rationalizing what the correct code change to resolve this bug actually is. I think my strategy would be to test the scheduler(s) which would add tests these methods implicitly, though let me know if you have more ideas of what test cases we should be adding are.

@thedrow
Copy link
Member

thedrow commented Oct 17, 2017

We should definitely drop support for old pytz versions. We can change our requirements accordingly.

@matteius
Copy link
Contributor Author

I am spending some time reading up on pytz tonight, but after an initial audit of projects I know about there are a wide range of pinned versions of pytz out there. If we can require the latest and greatest to release whatever this refactor will amount to great, but I also just read about an alternative to pytz called pendulum that seems fairly promising in general, curious your thoughts on pytz over pendulum, this is an interesting read: https://pendulum.eustace.io/blog/a-faster-alternative-to-pyz.html

@matteius
Copy link
Contributor Author

matteius commented Oct 18, 2017

So assuming that we stick with pytz, I am seeing a lot of inconsistency between the way that pytz talks about how it "only supports two ways of building a localized time" and "the preferred way of dealing with times is to always work in UTC, converting to localtime only when generating output to be read by humans." Anyway, it seems like all of the time code in celery deserves some kind of review. I just searched open issues for time and I while this PR doesn't fix the issue (its already i trunk) it does indicate the time methods in celery have had a lot of disparate changes:
#4173

Copy link
Member

@thedrow thedrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a small flake8 error.
Other than that, is this PR ready for merge?
Does it fix the issue?

@matteius
Copy link
Contributor Author

matteius commented Oct 23, 2017

I am aware of the error and no @thedrow its not ready for merge. I have some local changesets to this, started adding some unit tests, but the problem is the two I added pass with or without my change and the third one fails both before and after this change. I'll let you know when I think I am complete.

@matteius
Copy link
Contributor Author

@thedrow Well I added some unit tests that I thought would fail on the old code but alas they do not. I know for sure the bug is fixed with this code change however, it had something to do with converting from the already local time (Ex: 2017-10-11 22:40:24.316813-04:00) calling astimezone would yield an incorrect (Ex: 2017-10-12 22:40:24.316813+20:00) which is the same timestamp with a different zone applied, thus incorrect when it gets to is_due the tasks are always seen as due.

I think to really show a unit test of the bug that was fixed, we would have to somehow understand and add tests of the is_due time remaining logic. I'm not sure how much time I have to dedicate to that pursuit right now (no pun intended) however I am happy that this PR passes the CI, adds new tests of the method I changed and actually fixes the bug outlined in the issue tracker. Mostly Done? I'm not sure why I got a flake 8 failure this time ... fixed the prior line length issue. The log seems to contain no real clue on the failure this time and I'm not overly familiar with tox running it locally: https://travis-ci.org/celery/celery/jobs/291876905

@@ -300,7 +300,10 @@ def make_aware(dt, tz):

def localize(dt, tz):
"""Convert aware :class:`~datetime.datetime` to another timezone."""
dt = dt.astimezone(tz)
if is_naive(dt): # Ensure timezone aware datetime
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this code section is what maybe_make_aware performs, isn't that correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@georgepsarakis Trying to get this cleaned up for possible merge for 4.2. I agree that it is a similar check to maybe_make_aware but the bug here is that you can't safely call datetime.astimezone on all timezones. It doesn't make sense to call maybe_make_aware here because that just in turn calls localize which would create an infinite recursive loop. Are you suggesting I actually make a different change here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. My comment was mostly on finding out if the logic is duplicated, or a refactoring would be a better approach.

@georgepsarakis
Copy link
Contributor

@matteius do not spend time on these flake8 errors, I think the latest version now detects more possible issues in the code. We will fix them in a separate PR.

Copy link
Member

@auvipy auvipy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matteius could you please re base and complete the remaining work so that this fix could go for 4.2 release?

@auvipy auvipy added this to the v4.2 milestone Dec 12, 2017
@matteius matteius closed this Dec 12, 2017
@matteius matteius reopened this Dec 12, 2017
@matteius
Copy link
Contributor Author

@auvipy Just spent the last 4 hours trying to do/understand rebase better. After several attempts at pushing out stuff forcfully I think I got it, except who knows if tox will think the imports are in the right order since when I run that locally I get weird results. Strangely, my latest build was canceled and Geoffrey Bauduin started a new build of my PR yet it doesn't show on the output of this page. I am very confused and would like some help closing to bow on this one.

@matteius
Copy link
Contributor Author

Very confusing ... I clearly have an lsort issue but not sure the best way to resolve those other than try random orderings which creates more commits for me to rebase. Ahhhh! Also, a lot of PRs are failing with the TOXENV=bandit and the latest builds to succeed don't even list having run TOXENV=bandit ... how is this possible?

@AlexHill
Copy link
Contributor

Hi @matteius, thanks for sticking with this. Rebase can certainly be tricky if it's not part of your normal workflow.

I've just fixed the problem with bandit, so PRs should build cleanly now on Travis. Not sure about the isort problems - you should be able to see the specific errors in the build log if you push changes.

make astimezone call in localize more safe; with tests
@matteius
Copy link
Contributor Author

Thanks @AlexHill at this point I think the PR is in a final state of readiness. Please let me know if there is anything else I can do to shore it up.

@thedrow thedrow dismissed georgepsarakis’s stale review December 13, 2017 06:41

Code review commented were addressed.

@thedrow thedrow merged commit 9765151 into celery:master Dec 13, 2017
Photonios pushed a commit to SectorLabs/celery that referenced this pull request Feb 21, 2019
make astimezone call in localize more safe; with tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants