requests leaves a hanging connection for each request #520
Comments
Digging further, this seems to be due to a reference loop, and is really a problem inherited from httplib: the HTTPResponse keeps a reference to the file object as _fileobject. I would argue that this reference should be dropped once the complete response has been read (it should not be possible to read from the socket of a pooled connection via some old response). |
Okay, and the reason this shows up is because there is a reference cycle between the request and response objects. |
Is this the same issue as in #239? |
@shazow any thoughts? |
Are we talking about urllib3 request/response objects, or requests request/response objects? urllib3 cleans up in the response.release_conn method. The connection gets None'd for the exact reason that @Bluehorn mentioned. We could also None the _fileobject reference. We can definitely be more aggressive with closing connections in the cleanup of Requests's context manager. Right now, to manually close connections, you'd need to iterate over each pool and pop its queue of connections, closing them if they're open. If that's what we want to do, then I can definitely add a helper in urllib3 for "closing pools" which will do that in one call. |
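For illustration, a rough sketch of what such a "close all pools" helper could look like; the attribute names (`pools`, `pool.pool`) are assumptions based on urllib3's internals of that era, not a published API:

```python
from queue import Empty

def close_pools(poolmanager):
    # Walk every HTTPConnectionPool held by the manager, drain its queue
    # of idle connections, and close any that are actually open.
    for pool in list(poolmanager.pools.values()):
        while True:
            try:
                # pool.pool is assumed to be the LifoQueue of connections.
                conn = pool.pool.get(block=False)
            except Empty:
                break
            if conn is not None:
                conn.close()
```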
Another approach would be to break the hard cycle between response and request by changing req.response to a weakref proxy before returning it from the session: http://dpaste.org/GOqzA/. |
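The dpaste link has since expired; here is a minimal sketch of the idea, with hypothetical class names, just to show the shape of the cycle and the weakref fix:

```python
import weakref

class Request:
    def __init__(self):
        self.response = None

class Response:
    def __init__(self, request):
        self.request = request  # hard reference: response -> request

def build_response(request):
    response = Response(request)
    # A weakref.proxy here means request -> response is no longer a hard
    # reference, so the pair doesn't form a cycle that keeps the response
    # (and its open socket) alive until a full gc cycle-collection pass.
    request.response = weakref.proxy(response)
    return response
```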
@Bluehorn's test continues to reproduce in trunk for me. Thanks! Here's something else interesting. If I change the test like so:

```diff
diff --git a/conntest.py b/conntest.py
index 3eac6c7..ec30d3f 100644
--- a/conntest.py
+++ b/conntest.py
@@ -2,8 +2,10 @@ import gc, os, subprocess, requests
 gc.disable() # Just to make sure - I have seen this with gc enabled
+s = requests.session()
+
 for x in range(20):
-    requests.head("http://www.google.com/")
+    s.head("http://www.google.com/")
```

I get the same result, 20 leaked connections. |
Sorry, documentation-reading fail. I wasn't consuming the content of the response objects. If I do that, things work as expected, connections get reused, and in the end only one is left alive. |
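For reference, a sketch of the fixed loop under that reading: consuming the content is what lets urllib3 return each socket to the pool for reuse.

```python
import requests

s = requests.session()
for x in range(20):
    r = s.head("http://www.google.com/")
    # Force the (empty) body to be read so the connection can be
    # returned to the pool instead of dangling.
    r.content
```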
Here's my understanding of this issue so far: …
So here are the three viable approaches that occur to me: …
Random thought: it seems like keep-alive should be off by default. Outside the context of a session, we have no way to reuse the connection anyway. |
In favor of option 3: any solution that relies on eager garbage collection won't work with PyPy. |
Once upon a time … Also, it's pretty rare that you'll only make one request to a host in a script, so having … |
Can you clarify the reasons for defaulting `prefetch` to `False`? So --- defaulting … |
Perhaps the original reason was that it made urllib3 and requests perform better in naive benchmarks. Upon further consideration, it also made sense to fetch only the data we're actually using, especially for large responses when we only really care about headers. Someday in the future when we add ETAG-based caching, this will be a necessity. Overall, lots of upsides, no real downsides (until now). I'm not super familiar with the Requests code in this context, but I'd imagine we'd want a reused static PoolManager instance for these things. |
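That headers-only use case looks something like this in today's API, where `stream=True` plays the role that `prefetch=False` did then (a hedged sketch, not code from this thread):

```python
import requests

# Inspect metadata of a large resource without downloading the body.
r = requests.get("http://example.com/huge.iso", stream=True)
print(r.status_code, r.headers.get("Content-Length"))
r.close()  # the body was never read, so release the connection explicitly
```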
A static or module-level mechanism for sharing and reusing the connections would make sense. On the other hand, it would significantly worsen the current "gotcha" with leaked connections, because if a client doesn't call … |
What if we made an explicit … |
Then we'd just leak connections up to the size of the LRU cache (10 by default?). My vote would be to enable `prefetch` by default. |
Is an O(1) leak really a leak? ;) I'm -0 on … |
In the real world, everything is a constant factor ;-) |
Module-level connection pool doesn't sound like a good idea to me. Setting … |
@piotr-dobrogost if we don't prefetch, I don't see a good solution that works on all platforms, including those like PyPy that lack eager garbage collection. For comparison, here's something that always infuriated me about urllib2:

```python
try:
    fileobj = urllib2.urlopen(MY_URL)
    text = fileobj.read()
finally:
    fileobj.close()  # NameError if urlopen itself raised: fileobj was never bound
```

Seems like the 80%, batteries-included thing is to have an API that reads the whole page in a fire-and-forget way. |
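For contrast, a sketch of the tidier shape even the stdlib allows, using contextlib.closing (with `MY_URL` standing in for any URL; `urllib.request` is the Python 3 spelling of urllib2):

```python
import contextlib
import urllib.request

MY_URL = "http://www.google.com/"

# Reads the whole page and releases the socket, without the unbound-name
# hazard of the try/finally version above.
with contextlib.closing(urllib.request.urlopen(MY_URL)) as fileobj:
    text = fileobj.read()
```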
A module-level connection pool? Absolutely not. That's what `Session` objects are for. |
That makes sense, as long as we turn off …
This will resolve the connection leak. Optionally, we could also break the reference cycle, as a memory-usage optimization for CPython. |
All requests are made within a session. |
Right. For "actual Session", read "non-ephemeral Session", i.e., "Session created by the client with the expectation of reuse." |
But I guess this doesn't answer all the questions about the expected behavior of … |
It shouldn't matter if it's ephemeral or not. When the session dies (and the connections therein), so should the connection pool. There should be no specific behavior for different platforms. |
Well, we wouldn't special-case behavior inside Requests itself based on platforms. CPython would do the right thing whether or not the client called `close()`. As for why it would matter whether the session is ephemeral or not, @craigholm makes a good point on #458: a client that does not support persistent connections "must" send `Connection: close` in every request. |
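In requests terms, honoring that rule for a one-off request would look something like this (a sketch, not something the library did automatically):

```python
import requests

# An ephemeral, non-keep-alive request: tell the server up front that
# this connection will not be reused, per the HTTP/1.1 rule above.
r = requests.get("http://www.google.com/",
                 headers={"Connection": "close"})
```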
@Bluehorn that's fair. I think the test case points towards other gotchas that aren't fixed by doing a HEAD. For example: … |
Essentially, some servers don't support HEAD requests. That's why prefetch isn't default. |
Intuitively, a property access is not supposed to have side effects; … Changing a default or creating even a small API break would definitely be unfortunate, but I'm having trouble seeing another way to protect the client from socket leaks. |
A user of Requests shouldn't be even remotely concerned with sockets. The solution's quite simple — don't leave references to sockets/sessions that aren't needed. |
I agree completely: no one should have to think about sockets. But if we don't prefetch, then a reference to the open socket must be retained --- whether the … So when do we clean up this reference? The possible answers seem to be: … |
Let's do both. There's no side effect worth consideration, because that's all handled by connectionpool. |
BTW: I have no problem if not reading the reply keeps the connection alive, as long as this does not apply to HEAD requests. |
* prefetch now defaults to True, ensuring that by default, sockets are returned to the urllib3 connection pool on request end
* sessions now have a close() method, notifying urllib3 to close pooled connections
* the module-level API, e.g., `requests.get('http://www.google.com')`, explicitly closes its session when finished

When prefetch is False, the open socket becomes part of the state of the Response object, and it's the client's responsibility to read the whole body, at which point the socket will be returned to the pool.
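Under that contract, a client wanting deterministic cleanup would look roughly like this; `session()` and `close()` follow the bullets above, so treat the snippet as illustrative rather than as the exact API of any release:

```python
import requests

s = requests.session()
try:
    # prefetch=True by default: the body is read eagerly and the socket
    # goes straight back to the urllib3 pool.
    r = s.get("http://www.google.com/")
    print(len(r.content))
finally:
    s.close()  # ask urllib3 to close the pooled connections
```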
address connection leak issue from #520
Fixed! |
I'm running into this with … |
I believe I have confirmed this to be a problem with sessions + timeout. When using the etcd-py package, a new connection is created any time this call times out; there is no dangling connection when the request returns data: … |
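The snippet itself wasn't captured in this copy of the thread; the shape of the report is roughly the following, with a hypothetical etcd-style URL and timeout, for illustration only:

```python
import requests

s = requests.Session()
try:
    # Long-poll style watch; the reporter saw a new dangling connection
    # per timed-out attempt, but none when data came back in time.
    s.get("http://127.0.0.1:4001/v1/watch/some-key", timeout=5)
except requests.exceptions.Timeout:
    pass
```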
I believe this problem is in urllib3, and will be incidentally fixed by urllib3/urllib3#252. |
@Lukasa thanks for the tip. After more research I am pretty sure this is a problem with the server side code in etcd. Thanks! |
The fix from this issue was reverted in kennethreitz@92d5703? Any reason why? I am specifically talking about the changes in requests/api.py. And can someone please explain what the guideline is for using ephemeral sessions, like most of the requests/api.py shortcuts do? A new session means a new connection pool, which really doesn't make any sense if you want to just do a … |
@mouadino The commit you're talking about was the rewrite that took place for V1.0.0. This was basically a total rewrite of the library, so the fix was not 'reverted', it was just no longer necessary. I'm not entirely clear on what you're asking in the last part, but you've got a substantial misunderstanding of how Requests behaves. When we create a connection pool we don't open all of those connections at once: that would be ridiculous, and HTTP doesn't work that way anyway. We open one connection when you make your first request, and then reuse it as much as possible. There is an inefficiency when using … |
@Lukasa Thanks for the quick response, but you misunderstood my question; sorry for not being clear. What I meant is actually what is described in this bug report. If I have in my script a call like this …, but how about if I do another …? The result now is that we have 2 connections open to our server, which will not be closed until my program stops (cleaned up by the kernel!?) or until the server cleans them up (if it supports a keep-alive timeout). So you can imagine what will happen if I have a lot of … First of all, does this make sense? If so, what is the guideline for using ephemeral sessions, if the latter will not clean up after themselves unless I call …? |
Ah, that makes more sense. =) The answer is no, you won't leak connections. This is because eventually the Session, the Connection Pool it contains, and all of its active connections will get garbage collected, and when they are, any open connections are closed by the garbage collector. Of course, this doesn't matter much, because the server will close them well beforehand. Most HTTP servers don't keep a connection alive for more than a few seconds, because it's a severe resource overhead for that server. The guideline for using ephemeral Sessions is simple: use them if you don't care about efficiency. If you use them, you're accepting that you don't care whether your resources are cleaned up right away or not. If you believe that cleaning up after yourself is important, then you should use an explicit Session. =) |
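The trade-off in code, as a sketch (modern requests, where a Session can be used as a context manager; the URL is a placeholder):

```python
import requests

# Ephemeral: a throwaway Session (and pool) per call; cleanup is left
# to the GC, or to the server's keep-alive timeout.
requests.get("http://example.com/")

# Explicit: one Session, reused across requests and closed deterministically.
with requests.Session() as s:
    s.get("http://example.com/")
    s.get("http://example.com/")  # reuses the pooled connection
```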
Aha, because underneath they are actually sockets, which are closed automatically when they are garbage collected. But relying on the GC is not really good for my use case, and worse, I am dealing with an HTTP server that doesn't clean up opened connections (or maybe I haven't figured out yet how to make it do so). I can see that this is not something that requests can fix :) Thanks again for your response :) |
What about a streaming requests session? How do you ensure it is closed? |
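In modern requests, one way is to close the streaming response explicitly (or use it as a context manager) and then close the Session itself when done; a sketch with a placeholder URL:

```python
import requests

s = requests.Session()
# Closing the response guarantees the connection is released even if we
# stop iterating before the body has been fully read.
with s.get("http://example.com/stream", stream=True) as r:
    for chunk in r.iter_content(chunk_size=8192):
        pass  # process each chunk
s.close()
```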
This was already noted in #458, but deserves an extra ticket: when using the simple API (via requests.head/get/post etc.), the requests module leaves a connection open until it is garbage collected. This eats up resources on the server side and is really impolite.
How to reproduce:
Run this script (can somebody please tell me how one can embed a gist in markdown? The help page shows only the result of embedding...):
https://gist.github.com/2234274
Result on my system: …
I think the sockets should be closed immediately when the session is unrefed. FYI, I found this when testing a local web service by running it in CherryPy, which defaults to 10 threads; after 10 requests, the web server was blocked.