
Performance Issues #442

Open · papadako opened this issue Apr 24, 2020 · 5 comments

Comments

@papadako

Dear all,

I am currently experimenting with crawler4j to download pages from the web (ideally billions of pages, if possible). At least in my early experiments, however, this does not seem feasible. For example, if I start with 300 seeds, then after one day of crawling (about 200,000 downloaded pages) things slow down dramatically: CPU usage is at 100% on all of my cores, yet the crawler downloads only about one page per minute.

So, is this expected behavior, or is something wrong with my setup? Are there any guidelines for improving throughput? What are the current bottlenecks in crawler4j that inhibit scaling up?

Best regards
Panagiotis
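
For reference, a minimal crawler4j setup along these lines looks roughly like the sketch below; the storage path, seed URL, and thread count are placeholders, and `MyCrawler` stands for a hypothetical `WebCrawler` subclass:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlLauncher {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j");  // placeholder path; holds the on-disk frontier
        config.setPolitenessDelay(200);                  // ms between requests to the same host
        config.setResumableCrawling(false);              // enabling this adds transactional overhead

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("https://example.org/");      // in the scenario above, ~300 seeds
        controller.start(MyCrawler.class, 8);            // MyCrawler: hypothetical WebCrawler subclass
    }
}
```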

@papadako
Author

I would like to help speed up crawler4j a bit, if possible. Since I am new to the codebase, though, it would be helpful if someone could point me to relevant information, posts, or documentation about possible bottlenecks, or suggest where I should start looking.

P.S.
My hunch is that the bottleneck for scaling up is probably the use of BerkeleyDB.
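
For context, crawler4j's frontier and doc-ID store sit on Berkeley DB JE (`com.sleepycat.je`). The sketch below is illustrative only (class, database, and path names are made up, not crawler4j's actual internals); the point is that every URL scheduled or dequeued translates into puts/gets against an on-disk B-tree, which can become I/O-bound as the frontier grows:

```java
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

import java.io.File;
import java.nio.charset.StandardCharsets;

public class FrontierSketch {
    public static void main(String[] args) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(new File("/tmp/frontier-env"), envConfig); // placeholder path

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        Database pending = env.openDatabase(null, "PendingURLs", dbConfig); // illustrative name

        // Scheduling one URL = one disk-backed put; dequeuing = a cursor read + delete.
        DatabaseEntry key = new DatabaseEntry("00000001".getBytes(StandardCharsets.UTF_8));
        DatabaseEntry value = new DatabaseEntry("https://example.org/".getBytes(StandardCharsets.UTF_8));
        pending.put(null, key, value);

        pending.close();
        env.close();
    }
}
```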

@Chaiavi
Contributor

Chaiavi commented Apr 27, 2020 via email

@papadako
Author

Yep, that would be nice. But I guess we first have to find the bottlenecks in the current implementation. It might be the DB, it might be the way crawler4j uses it, or it might be something else.

@rzo1
Contributor

rzo1 commented Apr 28, 2020

If you want to fetch billions of web pages, you might look into other (distributed) web crawler frameworks written in Java, e.g. Apache Nutch or StormCrawler.

Did you enable resumable crawling? On which OS is your crawler4j instance running?

100% CPU usage sounds like I/O wait. Did you check this?

Before I switched to StormCrawler (for scalability reasons), I used crawler4j quite heavily for focused crawling, and I never experienced such issues.
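
One quick way to check where the time goes, from inside the crawler JVM (or just run `jstack <pid>` against the process), is a periodic thread-state dump. This helper is a sketch, not part of crawler4j:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

public final class ThreadStateDump {
    // Call from inside the crawler JVM, e.g. on a timer: RUNNABLE threads are
    // burning CPU, while BLOCKED/WAITING threads point at lock or I/O contention.
    public static void dump() {
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            System.out.printf("%-50s %s%n", info.getThreadName(), info.getThreadState());
        }
    }
}
```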

@papadako
Author

Thanks for the reply, rzo1. I will try some profiling to see what is going on.
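
JDK Flight Recorder is one low-overhead way to do that profiling. A minimal in-process sketch (JDK 11+; `crawl.jfr` is a placeholder, and `jcmd <pid> JFR.start` from the command line is equivalent):

```java
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

import java.nio.file.Path;
import java.time.Duration;

public final class ProfileCapture {
    public static void recordFor(Duration window, Path out) throws Exception {
        // The built-in "profile" configuration enables CPU, allocation, and I/O events.
        Configuration profile = Configuration.getConfiguration("profile");
        try (Recording recording = new Recording(profile)) {
            recording.setDuration(window);           // recording stops itself after the window
            recording.start();
            Thread.sleep(window.toMillis() + 1_000); // wait for it to finish
            recording.dump(out);                     // e.g. Path.of("crawl.jfr"); open in JDK Mission Control
        }
    }
}
```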
