[#157319228] implement SearchgovDomain#delay method #49

Merged
MothOnMars merged 2 commits into master from mct/domain_delay on May 7, 2018

Conversation

@MothOnMars (Contributor)

No description provided.

@noremmie (Contributor) left a comment

This LGTM except that the requests to fetch robots.txt files are over HTTP instead of HTTPS. Is that because not all sites support HTTPS? Would it be too crazy to try HTTPS first and fall back on HTTP for sites that aren't HTTPS yet?
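For concreteness, a minimal sketch of that HTTPS-first fallback idea (the `fetch_robots_txt` helper name is hypothetical and not part of this PR):

```ruby
require 'net/http'
require 'openssl'

# Hypothetical sketch: try the HTTPS robots.txt first, and fall back
# to plain HTTP for sites that don't support HTTPS yet.
def fetch_robots_txt(host)
  %w[https http].each do |scheme|
    uri = URI("#{scheme}://#{host}/robots.txt")
    begin
      response = Net::HTTP.get_response(uri)
      return response.body if response.is_a?(Net::HTTPSuccess)
    rescue OpenSSL::SSL::SSLError, SystemCallError, SocketError
      next # this scheme didn't work; try the next one
    end
  end
  nil
end
```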

@noremmie (Contributor) commented May 5, 2018

I should add that the reason I'm wondering whether we should be using HTTPS is not because I think it's important for those requests to be encrypted, but because I'm wondering whether it's important that we verify the authenticity of the site we're sending requests to. If someone somehow hijacks DNS and is therefore able to trick one of our jobs into fetching and processing their own robots.txt file, is there a way they could make that robots.txt file dangerous for us to process? I looked at the robotex gem, and I can't see a way that someone could make a robots.txt file malicious to it.

I'm thinking that maybe the difficulty of hijacking DNS combined with the fact that it might not be possible to craft a malicious robots.txt file makes it not worth the effort to first try HTTPS and then fall back on HTTP.

@noremmie (Contributor) commented May 6, 2018

I think I've convinced myself that using HTTP instead of HTTPS doesn't pose a danger to us, but I've come across a separate issue with the robotex gem when fetching robots.txt over HTTP instead of HTTPS. For sites that automatically redirect HTTP requests to HTTPS (which I'm assuming a lot of .gov sites do), the robotex gem doesn't report the correct crawl delay when using HTTP requests.

This is because the HTTP request to fetch robots.txt gets a redirect response, which robotex does not follow. Since robotex can't find any crawl delay in that response, it returns nil. As an example, look at https://www.gsa.gov/robots.txt and see that it specifies a crawl delay of 10 seconds. If you use robotex with HTTPS, you get the correct crawl delay:

[1] pry(main)> Robotex.new('usasearch').delay('https://www.gsa.gov/')
=> 10

But if you use robotex with HTTP you don't get the correct crawl delay:

[2] pry(main)> Robotex.new('usasearch').delay('http://www.gsa.gov/')
=> nil
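For illustration, the general shape of the fix is to follow redirects when fetching robots.txt. A sketch of that behavior (an assumption about the approach, not the actual patch in chriskite/robotex@51c7c12):

```ruby
require 'net/http'

# Sketch: fetch robots.txt, following a bounded number of redirects so
# an HTTP request that 301s to HTTPS still yields the real file.
# (Assumes absolute Location headers, as in the HTTP -> HTTPS case.)
def get_robots_txt(url, redirect_limit = 5)
  raise 'too many redirects' if redirect_limit.zero?

  response = Net::HTTP.get_response(URI(url))
  case response
  when Net::HTTPSuccess     then response.body
  when Net::HTTPRedirection then get_robots_txt(response['location'], redirect_limit - 1)
  end
end

get_robots_txt('http://www.gsa.gov/robots.txt')
# => body of https://www.gsa.gov/robots.txt, which includes "Crawl-delay: 10"
```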

@MothOnMars (Contributor, Author) commented May 7, 2018

@noremmie , the issue you address is fixed by chriskite/robotex@51c7c12. There's a PR for that in that repo (chriskite/robotex#8), but considering I've gotten no response from the repo owner for months, maybe you could review that in my branch:
MothOnMars/robotex@51c7c12

I should also add that I've considered adding a scheme column to the searchgov_domains table, so that we're not constantly trying one scheme vs another, but I have yet to convince myself that that is the best idea.
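For reference, a hypothetical migration for that idea (the column name, default, and Rails version in the migration superclass are all assumptions, not part of this PR):

```ruby
# Hypothetical sketch: persist each domain's working scheme so we
# don't probe https vs. http on every robots.txt fetch.
class AddSchemeToSearchgovDomains < ActiveRecord::Migration[5.1]
  def change
    add_column :searchgov_domains, :scheme, :string, null: false, default: 'http'
  end
end
```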

@noremmie (Contributor) commented May 7, 2018

@MothOnMars your Robotex change in your redirection branch LGTM, and I also verified that the change does fix the redirect issue. I apologize for not following through and seeing all your proposed changes to that gem. I think it's OK to proceed with this PR by either merging your redirection branch to your master branch or updating the Gemfile here with a reference to your redirection branch.

I think my initial reaction to seeing HTTP instead of HTTPS was dogmatic, and now that I've thought it through more I don't think it's dangerous for us to fetch robots.txt files without first verifying the authenticity of the site we're getting it from. Thanks for putting up with my indecision!

@MothOnMars (Contributor, Author)

No worries. I should have had you review that anyway. I'm going to update the Gemfile and add a spec for the issue I fixed in Robotex.
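A sketch of what such a spec might look like, assuming WebMock for stubbing (illustrative only, not the spec actually added):

```ruby
require 'robotex'
require 'webmock/rspec'

# Illustrative spec: Robotex#delay should follow an HTTP -> HTTPS
# redirect when fetching robots.txt and still report the crawl delay.
describe Robotex do
  it 'reports the crawl delay when robots.txt redirects to HTTPS' do
    stub_request(:get, 'http://www.gsa.gov/robots.txt')
      .to_return(status: 301, headers: { 'Location' => 'https://www.gsa.gov/robots.txt' })
    stub_request(:get, 'https://www.gsa.gov/robots.txt')
      .to_return(status: 200, body: "User-agent: *\nCrawl-delay: 10\n")

    expect(Robotex.new('usasearch').delay('http://www.gsa.gov/')).to eq(10)
  end
end
```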

@noremmie (Contributor) left a comment

LGTM 👍

@MothOnMars MothOnMars merged commit fa6a7a5 into master May 7, 2018
@MothOnMars MothOnMars deleted the mct/domain_delay branch May 7, 2018 20:57