[#157319228] implement SearchgovDomain#delay method #49

Merged
MothOnMars merged 2 commits into master from mct/domain_delay on May 7, 2018

Conversation

@MothOnMars (Contributor)

No description provided.

@noremmie (Contributor) left a comment

This LGTM except that the requests to fetch robots.txt files are over HTTP instead of HTTPS. Is that because not all sites support HTTPS? Would it be too crazy to try HTTPS first and fall back on HTTP for sites that aren't HTTPS yet?
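For concreteness, a minimal sketch of that HTTPS-first fallback idea (the `fetch_robots_txt` helper name is hypothetical and not part of this PR):

```ruby
require 'net/http'
require 'openssl'

# Hypothetical sketch: try the HTTPS robots.txt first, and fall back
# to plain HTTP for sites that don't support HTTPS yet.
def fetch_robots_txt(host)
  %w[https http].each do |scheme|
    uri = URI("#{scheme}://#{host}/robots.txt")
    begin
      response = Net::HTTP.get_response(uri)
      return response.body if response.is_a?(Net::HTTPSuccess)
    rescue OpenSSL::SSL::SSLError, SystemCallError, SocketError
      next # this scheme didn't work; try the next one
    end
  end
  nil
end
```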

@noremmie (Contributor) commented May 5, 2018

I should add that the reason I'm wondering whether we should be using HTTPS is not because I think it's important for those requests to be encrypted, but because I'm wondering whether it's important that we verify the authenticity of the site we're sending requests to. If someone somehow hijacks DNS and is therefore able to trick one of our jobs into fetching and processing their own robots.txt file, is there a way they could make that robots.txt file dangerous for us to process? I looked at the robotex gem, and I can't see a way that someone could make a robots.txt file malicious to it.

I'm thinking that maybe the difficulty of hijacking DNS combined with the fact that it might not be possible to craft a malicious robots.txt file makes it not worth the effort to first try HTTPS and then fall back on HTTP.

@noremmie (Contributor) commented May 6, 2018

I think I've convinced myself that using HTTP instead of HTTPS doesn't pose a danger to us, but I've come across a separate issue with the robotex gem when fetching robots.txt over HTTP instead of HTTPS. For sites that automatically redirect HTTP requests to HTTPS (which I'm assuming a lot of .gov sites do), the robotex gem doesn't report the correct crawl delay when using HTTP requests.

This is because the HTTP request to fetch robots.txt gets a redirect response, which robotex does not follow. Since robotex can't find any crawl delay in that response, it returns nil. As an example, look at https://www.gsa.gov/robots.txt and see that it specifies a crawl delay of 10 seconds. If you use robotex with HTTPS, you get the correct crawl delay:

[1] pry(main)> Robotex.new('usasearch').delay('https://www.gsa.gov/')
=> 10

But if you use robotex with HTTP you don't get the correct crawl delay:

[2] pry(main)> Robotex.new('usasearch').delay('http://www.gsa.gov/')
=> nil
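For illustration, the general shape of the fix is to follow redirects when fetching robots.txt. A sketch of that behavior (an assumption about the approach, not the actual patch in chriskite/robotex@51c7c12):

```ruby
require 'net/http'

# Sketch: fetch robots.txt, following a bounded number of redirects so
# an HTTP request that 301s to HTTPS still yields the real file.
# (Assumes absolute Location headers, as in the HTTP -> HTTPS case.)
def get_robots_txt(url, redirect_limit = 5)
  raise 'too many redirects' if redirect_limit.zero?

  response = Net::HTTP.get_response(URI(url))
  case response
  when Net::HTTPSuccess     then response.body
  when Net::HTTPRedirection then get_robots_txt(response['location'], redirect_limit - 1)
  end
end

get_robots_txt('http://www.gsa.gov/robots.txt')
# => body of https://www.gsa.gov/robots.txt, which includes "Crawl-delay: 10"
```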

@MothOnMars (Contributor, Author) commented May 7, 2018

@noremmie , the issue you address is fixed by chriskite/robotex@51c7c12. There's a PR for that in that repo (chriskite/robotex#8), but considering I've gotten no response from the repo owner for months, maybe you could review that in my branch:
MothOnMars/robotex@51c7c12

I should also add that I've considered adding a scheme column to the searchgov_domains table, so that we're not constantly trying one scheme vs another, but I have yet to convince myself that that is the best idea.
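For reference, a hypothetical migration for that idea (the column name, default, and Rails version in the migration superclass are all assumptions, not part of this PR):

```ruby
# Hypothetical sketch: persist each domain's working scheme so we
# don't probe https vs. http on every robots.txt fetch.
class AddSchemeToSearchgovDomains < ActiveRecord::Migration[5.1]
  def change
    add_column :searchgov_domains, :scheme, :string, null: false, default: 'http'
  end
end
```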

@noremmie (Contributor) commented May 7, 2018

@MothOnMars your Robotex change in your redirection branch LGTM, and I also verified that the change does fix the redirect issue. I apologize for not following through and seeing all your proposed changes to that gem. I think it's OK to proceed with this PR by either merging your redirection branch to your master branch or updating the Gemfile here with a reference to your redirection branch.

I think my initial reaction to seeing HTTP instead of HTTPS was dogmatic, and now that I've thought it through more I don't think it's dangerous for us to fetch robots.txt files without first verifying the authenticity of the site we're getting it from. Thanks for putting up with my indecision!

@MothOnMars (Contributor, Author)

No worries. I should have had you review that anyway. I'm going to update the Gemfile and add a spec for the issue I fixed in Robotex.
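A sketch of what such a spec might look like, assuming WebMock for stubbing (illustrative only, not the spec actually added):

```ruby
require 'robotex'
require 'webmock/rspec'

# Illustrative spec: Robotex#delay should follow an HTTP -> HTTPS
# redirect when fetching robots.txt and still report the crawl delay.
describe Robotex do
  it 'reports the crawl delay when robots.txt redirects to HTTPS' do
    stub_request(:get, 'http://www.gsa.gov/robots.txt')
      .to_return(status: 301, headers: { 'Location' => 'https://www.gsa.gov/robots.txt' })
    stub_request(:get, 'https://www.gsa.gov/robots.txt')
      .to_return(status: 200, body: "User-agent: *\nCrawl-delay: 10\n")

    expect(Robotex.new('usasearch').delay('http://www.gsa.gov/')).to eq(10)
  end
end
```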

@noremmie (Contributor) left a comment

LGTM 👍

@MothOnMars MothOnMars merged commit fa6a7a5 into master May 7, 2018
@MothOnMars MothOnMars deleted the mct/domain_delay branch May 7, 2018 20:57