
Memory leak (goroutines) #45

Open
williamjulianvicary opened this issue Apr 1, 2022 · 10 comments

@williamjulianvicary

At the moment, each request that is queued up is first pushed into a goroutine, which then blocks until space appears in the queue. However, if you're crawling very large websites with lots of links, this causes goroutines to build up very quickly, at ~8 KB apiece (one for each link in the "queue").

I attempted to fix this on my side with a semaphore that blocks requests from being added to the queue; however, since the middlewares can cancel a request and I'm not exposed to any kind of response with which to release the semaphore, I run into deadlocks.

In short, I feel this needs a queue in the core rather than spinning up tens of thousands of goroutines. If you try crawling, say, Wikipedia with 100 connections, you'll see memory usage climb incredibly quickly (into GBs within seconds).

What I was essentially doing on my side was adding a semaphore (very similar to your existing semaphore) but having it block at the point of URL addition/response consumption, before the goroutine is created. That would solve the issue, if it weren't for my deadlocks(!)
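Roughly what I was attempting, as a minimal standalone sketch (the names and the fetch placeholder are illustrative, not geziyor's actual internals):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const maxInFlight = 100
	sem := make(chan struct{}, maxInFlight) // capacity = max concurrent requests
	var wg sync.WaitGroup

	for i := 0; i < 10000; i++ {
		url := fmt.Sprintf("https://example.com/page/%d", i)
		sem <- struct{}{} // block HERE, before the goroutine exists, so goroutines never pile up
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // must release even if a middleware cancels the request
			_ = u                    // the actual fetch (g.do) would go here
		}(url)
	}
	wg.Wait()
}
```

The deadlocks come from cancellation paths where that release never happens, which is exactly the part I can't observe from outside the library.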

Let me know if I can add any more context! :-)

@musabgultekin
Collaborator

Hi William. Thanks for reporting that! Once upon a time, I was working on a simple queue system based on a channel.
Here is the branch: https://github.com/geziyor/geziyor/tree/queue

The change is basically like this: d4d9c07

Can you test that branch with your scraper and give feedback on it, so that we can merge it to master? I've used that approach a couple of times, but I was just unsure about it.
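The idea, as a heavily simplified standalone sketch (not the exact branch code; the request type, queue size, and worker count are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

type request struct{ URL string }

func main() {
	queue := make(chan *request, 1000) // bounded in-memory queue
	var wg sync.WaitGroup

	// A fixed pool of workers drains the queue, so the goroutine count stays
	// constant no matter how many links are discovered.
	const workers = 100
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for req := range queue {
				fmt.Println("fetching", req.URL) // the real do(req) would go here
			}
		}()
	}

	// Producers send into the channel instead of spawning a goroutine per request.
	for i := 0; i < 5000; i++ {
		queue <- &request{URL: fmt.Sprintf("https://example.com/%d", i)}
	}
	close(queue)
	wg.Wait()
}
```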

@musabgultekin
Collaborator

But the max in-memory request queue size is hardcoded right now: 1,000,000. We could make it configurable, though, or use something dynamic as the default based on available RAM, like max queue size = (RAM in MB * 10), with a configurable setting on top.
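The arithmetic for the dynamic default would be something like this (a sketch; detecting total RAM is left to the caller, and the ×10 factor is just the formula above):

```go
package queue

// defaultQueueSize returns a RAM-derived default using the rough formula
// above: max queue size = (RAM in MB) * 10.
// e.g. 8 GB of RAM -> 8192 MB -> 81,920 queue slots.
func defaultQueueSize(totalRAMBytes uint64) int {
	ramMB := totalRAMBytes / (1024 * 1024)
	return int(ramMB) * 10
}
```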

@williamjulianvicary
Author

The queue size could be a problem for larger websites, as it'll deadlock once it fills up (I tested with a lower number). Could we perhaps add a select for that and, in the default case, drop any additional items once the queue holds 1,000,000?
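Something like this on the enqueue path (a sketch; the queue and request types are illustrative):

```go
package queue

import "log"

type request struct{ URL string }

// tryEnqueue does a non-blocking send: if the queue is full it drops the
// request instead of blocking, which is what otherwise deadlocks once the
// workers that drain the queue are themselves blocked trying to add to it.
func tryEnqueue(queue chan *request, req *request) bool {
	select {
	case queue <- req:
		return true
	default:
		log.Println("queue full, dropping:", req.URL)
		return false
	}
}
```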

I've tested the branch, and memory usage is far, far better, but I've made a couple of tweaks in this PR which improve performance further. In short:

  • RequestMiddleware is run before the g.do method is called - this avoids queuing up requests unnecessarily, and...
  • importantly, it stops the rate limiter from being invoked (and blocking) for requests that are going to be cancelled!

The next optimisation step would be to reduce the weight of each queue item; I can only assume that client.Request is fairly heavy, so even though these changes add some (clear) optimisations, there is still room to improve. That change is much deeper, though, and I think what this branch brings is already a significant improvement.

To illustrate the improvements: 100 threads, wikipedia.org (I'm polling memory/goroutines each second):

Before queue branch:

> go run main.go
2022/04/01 14:42:56 Starting crawl
2022/04/01 14:42:57 Num goroutines: 665 - 14 MB Mem
2022/04/01 14:42:58 Num goroutines: 1416 - 45 MB Mem
2022/04/01 14:42:59 Num goroutines: 23359 - 151 MB Mem
2022/04/01 14:43:00 Num goroutines: 62491 - 191 MB Mem
2022/04/01 14:43:01 Num goroutines: 85113 - 326 MB Mem
2022/04/01 14:43:02 Num goroutines: 134556 - 399 MB Mem
2022/04/01 14:43:03 Num goroutines: 174339 - 644 MB Mem
2022/04/01 14:43:04 Num goroutines: 242212 - 563 MB Mem
2022/04/01 14:43:05 Num goroutines: 295873 - 875 MB Mem
2022/04/01 14:43:06 Num goroutines: 342488 - 1191 MB Mem

After queue branch:

> go run main.go
2022/04/01 14:41:27 Starting crawl
2022/04/01 14:41:28 Num goroutines: 474 - 37 MB Mem
2022/04/01 14:41:29 Num goroutines: 1039 - 58 MB Mem
2022/04/01 14:41:30 Num goroutines: 1004 - 139 MB Mem
2022/04/01 14:41:31 Num goroutines: 1053 - 170 MB Mem
2022/04/01 14:41:32 Num goroutines: 1283 - 208 MB Mem
2022/04/01 14:41:33 Num goroutines: 1447 - 331 MB Mem
2022/04/01 14:41:34 Num goroutines: 1564 - 392 MB Mem
2022/04/01 14:41:35 Num goroutines: 1520 - 387 MB Mem
2022/04/01 14:41:36 Num goroutines: 1693 - 596 MB Mem
2022/04/01 14:41:37 Num goroutines: 1864 - 707 MB Mem

After my PR:

> go run main.go
2022/04/01 14:42:15 Starting crawl
2022/04/01 14:42:16 Num goroutines: 412 - 23 MB Mem
2022/04/01 14:42:17 Num goroutines: 394 - 74 MB Mem
2022/04/01 14:42:18 Num goroutines: 527 - 151 MB Mem
2022/04/01 14:42:19 Num goroutines: 617 - 153 MB Mem
2022/04/01 14:42:20 Num goroutines: 651 - 168 MB Mem
2022/04/01 14:42:21 Num goroutines: 643 - 182 MB Mem
2022/04/01 14:42:22 Num goroutines: 638 - 167 MB Mem
2022/04/01 14:42:23 Num goroutines: 655 - 124 MB Mem
2022/04/01 14:42:24 Num goroutines: 639 - 130 MB Mem
2022/04/01 14:42:25 Num goroutines: 644 - 224 MB Mem
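For reference, the per-second numbers above were collected with something along these lines (a generic sketch, not the exact harness):

```go
package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	// Log goroutine count and allocated heap once a second while the crawl runs.
	go func() {
		var m runtime.MemStats
		for range time.Tick(time.Second) {
			runtime.ReadMemStats(&m)
			log.Printf("Num goroutines: %d - %d MB Mem",
				runtime.NumGoroutine(), m.Alloc/1024/1024)
		}
	}()

	// ... start the crawl here ...
	time.Sleep(10 * time.Second) // placeholder so the sketch runs on its own
}
```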

@williamjulianvicary
Author

The PR mentioned above doesn't cover this challenge, but I also feel the Robots middleware should be extracted from the middleware stack and moved to the g.do method. The robots middleware also creates I/O overhead which is quite likely unnecessary if a request would otherwise be cancelled by a separate middleware.

Perhaps we could extract a third middleware type, "PreProcessMiddleware" or similar, or add a flag when a middleware is registered saying when it should be executed? That way consumers of the library could also run their own I/O-heavy checks immediately before execution rather than during request addition, as is the case right now.

That would mean:

  • RequestMiddleware - Executed prior to queuing a request
  • PreProcessMiddleware - Executed immediately prior to executing the HTTP request from the queue
  • ResponseMiddleware - as-is

Naming might need some work, but I think there is real value in having three distinct middleware types for distinct purposes.
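A rough sketch of the shape I have in mind (names and signatures are illustrative, not geziyor's current API):

```go
package middleware

import "net/http"

// Request is a placeholder for the crawler's request type.
type Request struct {
	*http.Request
	Cancelled bool // set by any middleware to drop the request
}

// RequestMiddleware runs before a request is queued: cheap, in-memory checks
// (allowed domains, duplicate URLs), so cancelled requests never take a queue slot.
type RequestMiddleware func(r *Request)

// PreProcessMiddleware runs when a request is dequeued, immediately before the
// HTTP call: I/O-heavy checks (robots.txt) and rate limiting belong here.
type PreProcessMiddleware func(r *Request)

// ResponseMiddleware runs after the response is received, exactly as today.
type ResponseMiddleware func(r *Request, resp *http.Response)
```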

@musabgultekin
Collaborator

Yeah, I understand what you mean.
This wasn't a concern before the queueing system; now that we're planning to integrate it, we should think about better ways of doing this, like you mentioned.

On further thought: if we change when the request middlewares run (right after the Do call rather than when the requests are about to be executed), that could affect production systems that rely on custom middlewares.

So, instead of running all the middlewares before the queue, we should add another middleware system that runs just before a request is queued. That way you (or anyone) can cancel the request before it's queued, and cancelled requests won't consume resources either.

@musabgultekin
Collaborator

musabgultekin commented Apr 2, 2022

BUT, if that's the case, why doesn't the developer just do this cancel check themselves before queuing, instead of relying on the middleware? Like, before even calling the g.Do method?

@williamjulianvicary
Author

As it stands, I don't think you lose anything by adding the queue system; it's just an internal method of handling the queuing. Regardless, the current use of middleware is already inefficient, so these changes won't hurt it (it just won't benefit from the changes being made here).

Personally, I don't feel this should be an optional flag; it should be the default. Without the queue, the internals simply block on the semaphore instead (just much more heavily, given the goroutines) - it "just" handles queuing more efficiently.

@musabgultekin
Collaborator

Sorry, I meant changing the middleware system, not the queue. I really like the current state of the queuing system.
So instead of adding a delay middleware, we'll have a different middleware system. Let me write some code so it's easier to understand.

@musabgultekin
Collaborator

I just made some changes in addition to your code in the queue branch.

So I basically added reqPreQueueMiddlewares and moved into it all the middlewares that can cancel a request (allowed domains, duplicate requests, robots).
The delay middleware stays in the request middlewares.
I also made the queue size configurable.

In this commit: 24250c6

Can you take a look? Is this good in your opinion? It fixes the issue you mentioned.
Also, do you think 1M is a good default, or should we change it to something more dynamic, based on the RAM available in the system, etc.?
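In outline, the flow now looks roughly like this (a simplified sketch with illustrative types, not the literal commit):

```go
package crawler

// Pre-queue middlewares run before a request takes a queue slot; the
// remaining request middlewares (e.g. the delay middleware) run on dequeue.

type Request struct {
	URL       string
	Cancelled bool
}

type Middleware func(*Request)

type Crawler struct {
	reqPreQueueMiddlewares []Middleware // allowed domains, duplicate requests, robots
	requestMiddlewares     []Middleware // delay / rate limiting
	queue                  chan *Request
	fetch                  func(*Request)
}

// Do runs the pre-queue middlewares first, so a cancelled request never
// occupies a queue slot or touches the rate limiter.
func (c *Crawler) Do(req *Request) {
	for _, m := range c.reqPreQueueMiddlewares {
		m(req)
		if req.Cancelled {
			return
		}
	}
	c.queue <- req
}

// worker drains the queue, applying the remaining middlewares just before
// the actual HTTP call.
func (c *Crawler) worker() {
	for req := range c.queue {
		for _, m := range c.requestMiddlewares {
			m(req)
			if req.Cancelled {
				break
			}
		}
		if !req.Cancelled {
			c.fetch(req)
		}
	}
}
```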

@williamjulianvicary
Author

1M is ~8 MB of memory usage I believe (assuming the queue holds pointers, that's 1,000,000 × 8 bytes for the channel buffer itself) - I think that's a reasonable default.

I've checked the commit and all looks good from my perspective - thanks!
