Memory leak (gocoroutines) #45
Hi William, thanks for reporting this! A while back I was working on a simple queue system based on chan. The change is basically this: d4d9c07. Could you test that branch with your scraper and give feedback on it, so that we can merge it to master? I've used that approach a couple of times, but I was just unsure about it.
But the max in-memory request queue size is currently hardcoded: 1,000,000. We could make it configurable though, or use something dynamic as the default, based on RAM, like max queue size = (RAM in MB * 10), with a configurable override on top.
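A minimal sketch of that heuristic (the `queueSize` helper and both parameter names are hypothetical, not the project's API; Go's standard library has no portable way to read total system RAM, so the figure is passed in here):

```go
package main

import "fmt"

// queueSize picks the request queue capacity: an explicit user setting wins,
// otherwise fall back to the proposed dynamic default of RAM-in-MB * 10.
func queueSize(userSetting, ramMB int) int {
	if userSetting > 0 {
		return userSetting // explicit configuration wins
	}
	return ramMB * 10 // dynamic default: RAM in MB * 10
}

func main() {
	fmt.Println(queueSize(0, 8192))      // no override on an 8 GB machine: 81920
	fmt.Println(queueSize(500000, 8192)) // user override: 500000
}
```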
The queue size could be a problem for larger websites, as it'll deadlock once it fills up (I tested with a lower number). Could we perhaps add a select for that, and default to dropping any items queued beyond 1,000,000? I've tested the branch, and memory usage is FAR FAR better, but I've made a couple of tweaks in this PR which increase performance further, in short:
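The "select with a default" idea above can be sketched as a non-blocking send on the buffered queue channel (the `Request` type and `tryEnqueue` name are illustrative stand-ins, not the project's actual API):

```go
package main

import "fmt"

// Request stands in for the real queued request type.
type Request struct{ URL string }

// tryEnqueue attempts a non-blocking send: if the buffered queue is full,
// the item is dropped (returning false) instead of blocking the caller
// and risking a deadlock.
func tryEnqueue(queue chan *Request, r *Request) bool {
	select {
	case queue <- r:
		return true
	default:
		return false // queue full: drop instead of blocking
	}
}

func main() {
	q := make(chan *Request, 2)
	fmt.Println(tryEnqueue(q, &Request{URL: "/a"})) // true
	fmt.Println(tryEnqueue(q, &Request{URL: "/b"})) // true
	fmt.Println(tryEnqueue(q, &Request{URL: "/c"})) // false: capacity 2 exhausted
}
```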
The next optimisation step would be to reduce the weight of each queue item; I can only assume that client.Request is fairly heavy, so even though the changes add some (clear) optimisations, there is room to improve. Although that change is much deeper, I think what this branch brings is already a significant improvement. To illustrate the improvements: 100 threads, Wikipedia.org (I'm polling memory/goroutines each second): Before queue branch:
After queue branch:
After my PR:
The PR mentioned above doesn't cover this challenge, but I also feel the robots middleware should be extracted from the middleware stack and moved into the g.do method. The robots middleware creates I/O overhead which is quite likely unnecessary if the request would otherwise be cancelled by a separate middleware. Perhaps we could extract a third middleware type, "PreProcessMiddleware" or similar, or add a flag during middleware registration that controls when it executes? This way consumers of the library could also defer their own I/O overhead until immediately before execution, rather than incurring it during request addition as is the case right now. That would mean:
Naming might need some work, but I think there is some value in having three different middleware types for different purposes.
Yeah, I understand what you mean. I gave it some more thought: if we change when request middlewares run (just after the do function, rather than when the request is about to be executed), this could break production systems that rely on custom middlewares. So instead of running all the middlewares before the queue, we should add another middleware system that runs just before requests are queued. That way you (or anyone) can cancel a request before it's queued, and cancelled requests won't consume resources either.
BUT, if that's the case, why doesn't the developer do this cancel check themselves just before queuing, rather than relying on middleware? Like before even calling the g.Do method?
As it stands I don't think you lose anything by adding the queue system; it's just an internal method of handling queuing. Regardless, the current use of middleware is already inefficient, so making changes won't affect that (custom middlewares just won't benefit from the changes made here). Personally I don't feel this should be an optional flag; it should be the default. Without the queue, the internals simply block on the semaphore instead (just much heavier, given the goroutines) - it "just" handles queuing more efficiently.
Sorry, I meant changing the middleware system, not the queue. I really like the current state of the queuing system.
I just made some changes on top of your code in the queue branch: I basically added reqPreQueueMiddlewares, and then moved all the middlewares that can cancel a request (allowed domains, duplicate requests, robots) into it, in this commit: 24250c6. Can you take a look? Does it look good to you? It fixes the issue you mentioned.
1M queue slots is ~8MB of memory usage I believe - I think that's a reasonable default. I've checked the commit and all looks good from my perspective - thanks!
At the moment, each request that is queued up is first pushed into a goroutine, which then blocks until space appears in the queue. However, if you're crawling very large websites with lots of links, this causes goroutines to pile up extremely quickly, at ~8kb a pop (one goroutine per link in the "queue").
I attempted to fix this on my side with a semaphore that blocks requests from being added to the queue; however, since the middlewares can cancel the request, no result is exposed to me for releasing the semaphore, and I run into deadlocks.
In short, I feel this needs a queue in the core rather than spinning up tens of thousands of goroutines. If you try crawling, say, Wikipedia with 100 connections, you'll see memory usage accelerate incredibly quickly (into GBs in seconds).
What I was essentially doing on my side was adding a semaphore (very similar to your existing one), but having it block at the point of URL addition/response consumption, before the goroutine is created - which would solve the issue, if it weren't for my deadlocks(!)
Let me know if I can add any more context! :-)