This repository has been archived by the owner on Jun 30, 2021. It is now read-only.

Scheduler tasks can hang on dropped HTTP connection #91

Open
icook opened this issue Aug 27, 2014 · 9 comments

@icook
Member

icook commented Aug 27, 2014

Tasks that make remote requests can hang indefinitely if a socket connection is silently dropped. Because APScheduler will not run two instances of the same task at once, the situation never resolves itself without a restart of the scheduler. A simple solution is to set:

socket._GLOBAL_DEFAULT_TIMEOUT = 60

to cause all socket connections to eventually time out. This is likely also the cause of Celery hanging on simplecrypto/pool_list.
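
For comparison, the standard library also exposes a public call for the same purpose, `socket.setdefaulttimeout`. Since `_GLOBAL_DEFAULT_TIMEOUT` is a private sentinel inside the `socket` module, the public call is probably the safer way to get this effect. The sketch below is only an illustration under assumptions: `recv_times_out` is a hypothetical helper, and the local listener that never answers stands in for a silently dropped remote connection.

```python
import socket


def recv_times_out(seconds):
    """Return True if a read from a silent peer times out instead of hanging.

    Hypothetical helper: a local listener that never replies stands in for
    a remote connection that was silently dropped.
    """
    previous = socket.getdefaulttimeout()
    # Applies to every socket created after this call that does not
    # set its own timeout explicitly.
    socket.setdefaulttimeout(seconds)
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client = None
    try:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.listen(1)
        # The connection completes in the kernel backlog; nothing is ever sent.
        client = socket.create_connection(server.getsockname())
        try:
            client.recv(1)  # would block forever without a timeout
            return False
        except socket.timeout:
            return True
    finally:
        if client is not None:
            client.close()
        server.close()
        socket.setdefaulttimeout(previous)  # restore the earlier default
```

In a scheduler process the equivalent would presumably be a single `socket.setdefaulttimeout(60)` at startup, before any task opens a connection. A task that hits the timeout then raises an exception and ends, so its next scheduled run can start.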

This manifested as the worker count going to 0 on SimpleVert.com when a connection to a geo stratum was dropped.
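
The reason one hung run blocks every later run of the same task can be sketched with the standard library alone. This is not APScheduler's code: `run_exclusive` and the module-level lock are hypothetical stand-ins for its one-instance-per-task behaviour, assuming a later run is simply skipped while an earlier one is still active.

```python
import threading

# Hypothetical stand-in for a scheduler's per-task instance limit.
_running = threading.Lock()


def run_exclusive(task):
    """Run task unless a previous run is still in progress.

    Returns True if the task ran, False if the run was skipped.
    """
    if not _running.acquire(False):  # non-blocking attempt
        return False  # an earlier run is still active, so skip this one
    try:
        task()
        return True
    finally:
        _running.release()
```

If `task` blocks forever on a dead socket, the lock is never released and every later call returns `False`, which matches the symptom of the task never running again until the process is restarted.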

icook added the bug label Aug 27, 2014
@ericecook
Member

The worker count on SV is currently 0. A scheduler restart does not appear to have solved the issue.

@icook
Member Author

icook commented Sep 1, 2014

Hmm, how long did you wait after the restart? I believe the worker count cache code only runs every 10 minutes. If you waited longer than that, then perhaps I misdiagnosed...

@ericecook
Member

To clarify: I'm not 100% sure how long I waited, but I'd guesstimate ~15 minutes.

It's quite possible that whatever is causing the issue happened again soon after I restarted, although I did make several attempts.

On several occasions I have restarted the scheduler and had it fix the bug.

The bug is also occurring (pretty frequently) on SimpleDoge at the moment, so it's not related to Geo.

@icook
Member Author

icook commented Sep 1, 2014

Good to know, I'll look into it more once multi is out.

@ericecook
Member

It may actually be a bug in the new code. I'm not sure exactly when it first appeared, but it's only been on Doge + Vert, and only just recently.

@icook
Member Author

icook commented Sep 1, 2014

We haven't deployed code to them in over 3 weeks, so it seems unlikely to be a code change. OVH's network has definitely seemed flakier lately, so I'm betting it's related to that.

@ericecook
Member

I was thinking it could be an issue introduced by the new Powerpool code.

Possibly it's set up to handle HTTP requests differently or something? I don't know; it doesn't seem terribly likely, but the timing is suspicious.

@icook
Member Author

icook commented Sep 2, 2014

We're not running new powerpools on Doge yet, so probably not.

@ericecook
Member

Yeah, that's right, doh. Well, it's not a network issue: Doge is having the problem, and it doesn't use Geos.
