Reduce memory footprint of multiple logically same RobotsData acquired from different inputs #5

temoto opened this issue Sep 10, 2012 · 2 comments

temoto commented Sep 10, 2012

Currently FromResponseBytes returns a singleton allow-all RobotsData for a 404 status code and a singleton disallow-all RobotsData for 401/403 status codes. For any other input, a unique RobotsData is created, even though different inputs could share a subset of, or all of, their rules. Sharing all rules is equivalent to having another singleton RobotsData.
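A minimal sketch of the behaviour described above; the type and function names are illustrative stand-ins, not the library's actual API. The special status codes map to shared singletons, while every other input allocates its own RobotsData:

```go
package robots

// RobotsData is a simplified stand-in for the library's parsed result type.
type RobotsData struct {
	allowAll    bool
	disallowAll bool
	rules       []string
}

// Shared singletons for the status-code shortcuts described above.
var (
	allowAllData    = &RobotsData{allowAll: true}    // 404: everything allowed
	disallowAllData = &RobotsData{disallowAll: true} // 401/403: everything disallowed
)

// fromResponse returns shared singletons for the special status codes and
// a fresh allocation for everything else.
func fromResponse(statusCode int, body []byte) *RobotsData {
	switch statusCode {
	case 404:
		return allowAllData
	case 401, 403:
		return disallowAllData
	}
	// Any other input gets its own RobotsData, even when its rules are
	// logically identical to one that already exists.
	return &RobotsData{rules: parseLines(body)}
}

// parseLines is a placeholder for the real parser.
func parseLines(body []byte) []string {
	return []string{string(body)}
}
```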

Plan:

  1. Get a representative subset of all robots.txt files (excluding non-200 responses, since those are already covered)
  2. Parse them
  3. Find clusters of unique rule sets and analyse their distribution
    A wild guess is that most (say 95%) fall into either "allow all" or "disallow all".

If the hypothesis is confirmed, some normalization technique could be applied to reduce the memory footprint and improve the cache locality of real-world web crawlers using the robotstxt.go library.

Possible normalizations:

  1. Post-process the array of rules after parsing and try to reduce it to one of the predefined RobotsData singletons. Easy to implement, but useful only for the predefined unique rule sets and relies on singletons. TODO: analyse output distribution, benchmark.
  2. Export a unique value representing the parsed rules. This does not reduce the memory footprint by itself, but provides an instrument for doing so: it works for arbitrary repeating rule sets, and the application may implement an arbitrary cache keyed by it. One candidate for such a value is a pre-processing of the input text that removes comments and white space (see the sketch after this list). TODO: analyse output distribution, benchmark.
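A possible sketch of normalization 2, assuming the exported unique value is a hash of the comment- and whitespace-stripped input. The function name and the exact canonicalization steps are hypothetical, not part of the library:

```go
package robots

import (
	"bufio"
	"bytes"
	"crypto/sha256"
	"strings"
)

// CanonicalKey strips comments and surrounding white space, lowercases
// directive names, drops blank lines, and hashes the result. Two robots.txt
// bodies that differ only in comments or formatting map to the same key.
func CanonicalKey(body []byte) [32]byte {
	var canon strings.Builder
	scanner := bufio.NewScanner(bytes.NewReader(body))
	for scanner.Scan() {
		line := scanner.Text()
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i] // drop comment
		}
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		// Normalize "Field: value" spacing and field-name case.
		if i := strings.IndexByte(line, ':'); i >= 0 {
			field := strings.ToLower(strings.TrimSpace(line[:i]))
			value := strings.TrimSpace(line[i+1:])
			line = field + ":" + value
		}
		canon.WriteString(line)
		canon.WriteByte('\n')
	}
	return sha256.Sum256([]byte(canon.String()))
}
```

An application could then keep a map[[32]byte]*RobotsData and parse a body only when its key is missing from the map, so hosts whose robots.txt files differ only in comments or formatting share a single parsed RobotsData.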

These two techniques do not even conflict: post-processing parsed rules seems a worthwhile optimisation anyway, and exporting a unique value could additionally allow caching of non-trivial but still popular rule sets.

Even in the unlikely event that the distribution of rule sets is closer to uniform, the distribution of individual rules must exhibit large spikes around agent=* and url=/. For that case, the library can return singleton popular rules. Now that I think of it, maintaining a few extremely popular individual Rule singletons could be a worthy optimisation on its own (a sketch follows). TODO: benchmark.
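A rough sketch of what interning a few extremely popular individual rules could look like; the rule type and helper here are hypothetical, not the library's internals:

```go
package robots

// rule is a simplified stand-in for the library's internal rule type.
type rule struct {
	agent string
	path  string
	allow bool
}

// A handful of rules dominate real-world robots.txt files; they can be
// interned so that every parsed file points at the same instance.
var popularRules = map[rule]*rule{
	{agent: "*", path: "/", allow: false}: {agent: "*", path: "/", allow: false},
	{agent: "*", path: "/", allow: true}:  {agent: "*", path: "/", allow: true},
}

// internRule returns a shared instance for the extremely common rules and
// the caller's own instance otherwise.
func internRule(r *rule) *rule {
	if shared, ok := popularRules[*r]; ok {
		return shared
	}
	return r
}
```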

igrigorik commented

I think step #1, as you identified, is to do a crawl and gather some data on the frequency of these rules. My guess is your intuition is right and we'll find that most rules are, in fact, very similar. What those rules are will also dictate the solution.

If in 90% of the cases the rules are a simple allow/disallow all, then just taking care of that would give you a major win. Beyond that, there are all kinds of compression tricks that you could apply towards building a compact representation of the rules. But once again, gather data before making any of these design decisions. :)

temoto commented Oct 2, 2012

Preliminary research based on 36,000 robots.txt files from random web sites in the Alexa top 1M produced the following unsatisfying results: 11.2% of the robots.txt files allow all, 1.5% disallow all. A few observations (e.g. 20% of the hosts are served by Wordpress) lead to the conclusion that the analysed data is biased and not a representative subset of the whole Web. I will conduct another study involving more hosts later.
