This repository has been archived by the owner on Aug 6, 2020. It is now read-only.

Robots.txt not respected if first page is redirected #9

Closed
MothOnMars opened this issue Nov 29, 2017 · 8 comments

Comments

@MothOnMars

If you set Medusa to crawl http://www.foo.com, which is redirected to https://www.foo.com, Medusa will successfully crawl the site, but it will not respect robots.txt. This appears to happen because Robotex attempts to pull the robots.txt file from http://www.foo.com/robots.txt without following the redirect, which results in no robot rules for the domain www.foo.com.

Example:
In https://www.yelp.com/robots.txt:
Disallow: /biz_link

> robotex = Robotex.new "My User Agent"
> robotex.allowed?("https://www.yelp.com/biz_link")
false

> robotex = Robotex.new "My User Agent"
> robotex.allowed?("http://www.yelp.com/biz_link")
true

I'd be happy to put in a PR to resolve this, but I've been going back and forth about whether the fix should be done in Robotex or Medusa.
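
If the fix ends up on the Medusa side, one possible stopgap would be to resolve the seed URL's redirects before any robots rules are consulted. A rough sketch (resolve_redirects is a hypothetical helper, and it assumes absolute Location headers):

require 'net/http'
require 'robotex'

# Follow up to `limit` redirects and return the final URI, so that
# robots.txt is looked up against the host that actually serves the site.
def resolve_redirects(url, limit = 5)
  uri = URI(url)
  limit.times do
    response = Net::HTTP.get_response(uri)
    return uri unless response.is_a?(Net::HTTPRedirection)
    uri = URI(response['location']) # assumes an absolute Location header
  end
  uri
end

seed = resolve_redirects('http://www.yelp.com')
robotex = Robotex.new('My User Agent')
robotex.allowed?(seed.merge('/biz_link').to_s) # => false once the https rules are loaded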

@brutuscat
Owner

@MothOnMars makes sense!

Google's robots.txt documentation says this:

3xx (redirection)
Redirects will generally be followed until a valid result can be found (or a loop is recognized). We will follow a limited number of redirect hops (RFC 1945 for HTTP/1.0 allows up to 5 hops) and then stop and treat it as a 404. Handling of robots.txt redirects to disallowed URLs is undefined and discouraged. Handling of logical redirects for the robots.txt file based on HTML content that returns 2xx (frames, JavaScript, or meta refresh-type redirects) is undefined and discouraged.
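
Purely to illustrate that policy (this is not Robotex's actual code; fetch_robots_txt is a hypothetical helper), the fetch side could look roughly like:

require 'net/http'

# Follow up to 5 redirect hops when fetching robots.txt; if the limit is
# exceeded, or any non-2xx/non-3xx status is returned, treat it as a 404
# (i.e. no rules apply).
def fetch_robots_txt(host, hops = 5)
  uri = URI("http://#{host}/robots.txt")
  hops.times do
    response = Net::HTTP.get_response(uri)
    return response.body if response.is_a?(Net::HTTPSuccess)
    return nil unless response.is_a?(Net::HTTPRedirection)
    uri = URI(response['location']) # assumes an absolute Location header
  end
  nil # too many hops: behave as if robots.txt returned 404
end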

Now, since robotex also seems to be discontinued, I would argue that you could either try to fix this in Medusa or replace robotex with some other working gem that does the same thing. I like both options!

@MothOnMars
Author

Thanks for the feedback, @brutuscat. The other robots parser gems I've looked at also don't have much recent dev activity, so I'll first open a PR to fix this in Robotex. If there's no response at that point, I'll look into swapping out the gem in Medusa.

@MothOnMars
Author

MothOnMars commented Jul 27, 2020

PR for Robotex: chriskite/robotex#8
Issue: chriskite/robotex#7

@brutuscat
Owner

brutuscat commented Jul 28, 2020

@MothOnMars given that this is a problem with robotex, couldn't you just point Bundler at your GitHub branch of the robotex gem in your project? We do not ship a Gemfile.lock, and the current constraint on the gem is >= 1, so for your project this should work: you would be using Medusa but with a newer/better version of robotex. Does that make sense?

See https://bundler.io/guides/git.html
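
For example, in the project's Gemfile (the repository URL and branch name below are placeholders for your actual fork and branch):

# Override Medusa's robotex dependency with a patched branch
gem 'robotex', git: 'https://github.com/your-username/robotex.git', branch: 'follow-redirects'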

@MothOnMars
Author

Yeah, that's how I resolved the issue in my own repo. I was just adding the Robotex PR/issue info here for visibility in case other Medusa users encounter the issue. I'll close this up.

@brutuscat
Owner

@MothOnMars great, thank you. Going forward I will be removing or revamping the abandoned @chriskite gems, either replacing them with gems that are actively supported or forking and modernising their code.

BTW, would you be open to (and have time this year or next for) giving me feedback on the upcoming changes to the Medusa gem? As you can see in #14, some changes are coming that I expect to be somewhat disruptive, and I would like to understand how I could help you and other users migrate to them.

@MothOnMars
Author

MothOnMars commented Jul 30, 2020

Sure, I'd be happy to. Ruby is short on good crawlers, and Medusa has been the best I've found. I'd love to see it become an official gem. I can also take a look at how the moneta-medusa-storage branch would work for us in its current state.

@brutuscat
Owner

@MothOnMars please do!

I also look forward to publishing the gem; it's just that right now I don't consider it to be at v1. Once I'm done replacing all the stalled or old gems, it will be ready.
