Add support for crawling subdomains #27

Open · wants to merge 1 commit into base: next

Conversation

alexspeller

Merge changes to support subdomain crawling from runa@91559bd

@MaGonglei

This feature is very useful.
I think Anemone should also support printing out external links: just print them, but don't scan them in depth.
The link checker tool XENU (http://home.snafu.de/tilman/xenulink.html) has this feature.

@wokkaflokka

MaGonglei: It is very simple to gather external links using Anemone, and comparably simple to check that those links are actually valid. The 'on_every_page' block is very helpful in this regard.

If you'd like some code that does exactly what you are asking, I could send an example your way.
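A minimal sketch of the kind of example wokkaflokka describes, using the on_every_page hook to record external links without following them. The start URL, the Set-based bookkeeping, and the plain host comparison are illustrative assumptions, not part of Anemone's API:

require 'anemone'
require 'set'
require 'uri'

external_links = Set.new

Anemone.crawl("http://www.example.com") do |anemone|
  anemone.on_every_page do |page|
    next unless page.doc  # skip pages that failed to fetch or parse

    page.doc.xpath('//a[@href]').each do |a|
      begin
        link = URI.join(page.url.to_s, a['href'])
      rescue URI::Error, ArgumentError
        next
      end
      # naive check: any link on a different host counts as external
      external_links << link.to_s if link.host && link.host != page.url.host
    end
  end
end

puts external_links.to_a.sort

The external links are only collected here, never enqueued, so the crawl itself stays within the original site; checking whether they actually respond would be a separate pass (for example with Net::HTTP head requests).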

@MaGonglei

Hi wokkaflokka, thanks for your reply.
I think I know what you mean, but I would prefer to have this feature when I initialize the Anemone crawl, like:
Anemone.crawl("http://www.example.com", :external_links => false) do |anemone|
....
end

Because if I use the "on_every_page" block to search for external links (e.g. page.doc.xpath('//a[@href]')), it seems to cost too much CPU and memory.

If I'm wrong, please give me an example.

Thanks.
