
Performance Issues #442

Open · papadako opened this issue Apr 24, 2020 · 5 comments

Comments

@papadako

Dear all,

I am currently experimenting with crawler4j to download pages from the web (ideally billions of pages, if possible). At least in my early experiments, however, this does not seem feasible. For example, if I start with 300 seeds, then after one day of crawling (about 200,000 downloaded pages) things slow down dramatically: CPU usage is at 100% on all of my cores, yet the crawler downloads only about one page per minute.

So, is this expected behavior, or is something wrong with my setup? Are there any guidelines for improving throughput? What are the current bottlenecks in crawler4j that inhibit scaling up?

Best regards
Panagiotis
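
For reference, a minimal crawler4j setup along these lines looks roughly like the sketch below; the storage path, seed URL, and thread count are placeholders, and `MyCrawler` stands for a hypothetical `WebCrawler` subclass:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlLauncher {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j");  // placeholder path; holds the on-disk frontier
        config.setPolitenessDelay(200);                  // ms between requests to the same host
        config.setResumableCrawling(false);              // enabling this adds transactional overhead

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("https://example.org/");      // in the scenario above, ~300 seeds
        controller.start(MyCrawler.class, 8);            // MyCrawler: hypothetical WebCrawler subclass
    }
}
```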

@papadako
Author

I would like to help speed up crawler4j a bit, if possible. Since I am new to the codebase, though, it would be helpful if someone could point me to relevant information, posts, or documentation about possible bottlenecks, or suggest where I should start looking.

P.S.
My hunch is that the bottleneck for scaling up is probably the use of BerkeleyDB.
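
For context, crawler4j's frontier and doc-ID store sit on Berkeley DB JE (`com.sleepycat.je`). The sketch below is illustrative only (class, database, and path names are made up, not crawler4j's actual internals); the point is that every URL scheduled or dequeued translates into puts/gets against an on-disk B-tree, which can become I/O-bound as the frontier grows:

```java
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;

import java.io.File;
import java.nio.charset.StandardCharsets;

public class FrontierSketch {
    public static void main(String[] args) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(new File("/tmp/frontier-env"), envConfig); // placeholder path

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        Database pending = env.openDatabase(null, "PendingURLs", dbConfig); // illustrative name

        // Scheduling one URL = one disk-backed put; dequeuing = a cursor read + delete.
        DatabaseEntry key = new DatabaseEntry("00000001".getBytes(StandardCharsets.UTF_8));
        DatabaseEntry value = new DatabaseEntry("https://example.org/".getBytes(StandardCharsets.UTF_8));
        pending.put(null, key, value);

        pending.close();
        env.close();
    }
}
```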

@Chaiavi
Contributor

Chaiavi commented Apr 27, 2020 via email

@papadako
Author

Yep, that would be nice. But I guess we first have to find the bottlenecks in the current implementation. It might be the DB, it might be the way crawler4j uses it, or it might be something else.

@rzo1
Contributor

rzo1 commented Apr 28, 2020

If you want to fetch billions of web pages, you might look into other (distributed) web crawler frameworks written in Java, e.g. Apache Nutch or StormCrawler.

Did you enable resumable crawling? On which OS is your crawler4j instance running?

100% CPU usage sounds like I/O wait. Did you check this?

Before I switched to StormCrawler (for scalability reasons), I used crawler4j quite heavily for focused crawling, and I never experienced such issues.
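
One quick way to check where the time goes, from inside the crawler JVM (or just run `jstack <pid>` against the process), is a periodic thread-state dump. This helper is a sketch, not part of crawler4j:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

public final class ThreadStateDump {
    // Call from inside the crawler JVM, e.g. on a timer: RUNNABLE threads are
    // burning CPU, while BLOCKED/WAITING threads point at lock or I/O contention.
    public static void dump() {
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            System.out.printf("%-50s %s%n", info.getThreadName(), info.getThreadState());
        }
    }
}
```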

@papadako
Author

Thanks for the reply, rzo1. I will try some profiling to see what is going on.
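
JDK Flight Recorder is one low-overhead way to do that profiling. A minimal in-process sketch (JDK 11+; `crawl.jfr` is a placeholder, and `jcmd <pid> JFR.start` from the command line is equivalent):

```java
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

import java.nio.file.Path;
import java.time.Duration;

public final class ProfileCapture {
    public static void recordFor(Duration window, Path out) throws Exception {
        // The built-in "profile" configuration enables CPU, allocation, and I/O events.
        Configuration profile = Configuration.getConfiguration("profile");
        try (Recording recording = new Recording(profile)) {
            recording.setDuration(window);           // recording stops itself after the window
            recording.start();
            Thread.sleep(window.toMillis() + 1_000); // wait for it to finish
            recording.dump(out);                     // e.g. Path.of("crawl.jfr"); open in JDK Mission Control
        }
    }
}
```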
