
Check robots.txt #183

Open · EdJoPaTo opened this issue Apr 24, 2023 · 2 comments
Labels: enhancement (New feature or request) · pinned (Prevent from automatically closing due to inactivity)

Comments

@EdJoPaTo
Owner

Is your feature request related to a problem? Please describe.

Bots on the internet should honor robots.txt (see RFC 9309).

Describe the solution you'd like

Check the robots.txt of every domain being crawled before crawling the actual content.
I think the tool should provide an option to ignore robots.txt, but it should complain about that on stdout when the option is enabled.

The downside is an additional request to the server on every crawl attempt.
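A rough sketch of that per-crawl check (the `naive_robots_allows` helper, the user-agent string and the ignore flag are made up here for illustration; a real implementation should follow the full matching rules of RFC 9309):

```rust
// Sketch only: decide whether a path may be crawled, given an already fetched
// robots.txt body. Group merging, Allow rules and wildcards from RFC 9309 are
// deliberately left out; a real implementation should use a proper parser.
fn naive_robots_allows(robots_txt: &str, user_agent: &str, path: &str) -> bool {
    let mut group_applies = false;
    for line in robots_txt.lines() {
        let line = line.split('#').next().unwrap_or("").trim();
        let Some((key, value)) = line.split_once(':') else { continue };
        let value = value.trim();
        match key.trim().to_ascii_lowercase().as_str() {
            "user-agent" => {
                group_applies = value == "*" || value.eq_ignore_ascii_case(user_agent);
            }
            "disallow" if group_applies && !value.is_empty() && path.starts_with(value) => {
                return false;
            }
            _ => {}
        }
    }
    true
}

fn main() {
    let robots = "User-agent: *\nDisallow: /private\n";
    let ignore_robots_txt = false; // hypothetical config/CLI flag

    for path in ["/blog/feed.xml", "/private/secret.html"] {
        if naive_robots_allows(robots, "website-stalker", path) {
            println!("crawling {path}");
        } else if ignore_robots_txt {
            println!("WARNING: {path} is disallowed by robots.txt, crawling anyway");
        } else {
            println!("skipping {path} (disallowed by robots.txt)");
        }
    }
}
```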

Describe alternatives you've considered

Provide an additional subcommand that checks the domains in the config against their robots.txt. The user of this tool can run that command to see whether the host allows the crawling.
This way the additional requests only happen on demand, and the user can decide to remove crawling attempts that are not allowed.
Maybe integrate this into the check command, which already validates the config, and error when the robots.txt denies a path?
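A sketch of what such a check-style pass over the config could report (ConfigEntry, fetch_robots_txt and the output format are all invented for this example; the real config and HTTP layer will differ):

```rust
// Sketch of a `check`-style robots.txt pass over the config.
struct ConfigEntry {
    url: String,
}

/// Stand-in for an actual HTTP GET of https://<host>/robots.txt.
fn fetch_robots_txt(_host: &str) -> Option<String> {
    Some("User-agent: *\nDisallow: /private\n".to_string())
}

/// Extremely naive: only looks at Disallow prefixes, ignoring user-agent groups.
fn disallowed(robots_txt: &str, path: &str) -> bool {
    robots_txt.lines().any(|line| {
        line.trim()
            .strip_prefix("Disallow:")
            .map(|rule| {
                let rule = rule.trim();
                !rule.is_empty() && path.starts_with(rule)
            })
            .unwrap_or(false)
    })
}

fn main() {
    let entries = [
        ConfigEntry { url: "https://example.com/blog/feed.xml".to_string() },
        ConfigEntry { url: "https://example.com/private/area.html".to_string() },
    ];

    let mut errors = 0;
    for entry in &entries {
        // Crude URL splitting just for the sketch; the tool already has parsed URLs.
        let rest = entry.url.splitn(2, "://").nth(1).unwrap_or(entry.url.as_str());
        let (host, path) = rest.split_once('/').unwrap_or((rest, ""));
        let path = format!("/{path}");

        match fetch_robots_txt(host) {
            Some(robots) if disallowed(&robots, &path) => {
                eprintln!("ERROR {}: disallowed by robots.txt", entry.url);
                errors += 1;
            }
            Some(_) => println!("OK    {}", entry.url),
            None => println!("WARN  {}: could not fetch robots.txt", entry.url),
        }
    }
    std::process::exit(errors);
}
```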

EdJoPaTo added the enhancement and pinned labels on Apr 24, 2023
@Teufelchen1
Collaborator

This is a good idea. I'm not sure which solution I prefer. Maybe the time the robots.txt was last fetched could be cached? That would allow a behavior in between, or a combination of, the two proposed solutions. But I'm not sure whether that is worth the added complexity.

Probing the robots.txt while checking the config seems like behavior we should have, regardless of the other behaviors under discussion. Not only would we check the robots.txt at least once, we could also leverage this to verify that the domains/hosts are actually reachable - a nice bit of extra UX for free. (This of course adds the deployment dependency of being run on an online machine, but given the nature of the tool, that is acceptable imho.)
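A sketch of what such a per-host cache could look like (the names and the 24-hour max age are assumptions; HTTP cache headers could refine the expiry):

```rust
// Sketch of the "cache when robots.txt was last fetched" idea.
use std::collections::HashMap;
use std::time::{Duration, Instant};

struct CachedRobots {
    fetched_at: Instant,
    body: String,
}

struct RobotsCache {
    max_age: Duration,
    entries: HashMap<String, CachedRobots>, // keyed by host
}

impl RobotsCache {
    fn new(max_age: Duration) -> Self {
        Self { max_age, entries: HashMap::new() }
    }

    /// Return the cached body for `host`, or None if missing/stale,
    /// in which case the caller would re-fetch https://<host>/robots.txt.
    fn get(&self, host: &str) -> Option<&str> {
        self.entries
            .get(host)
            .filter(|cached| cached.fetched_at.elapsed() < self.max_age)
            .map(|cached| cached.body.as_str())
    }

    fn insert(&mut self, host: String, body: String) {
        self.entries.insert(host, CachedRobots { fetched_at: Instant::now(), body });
    }
}

fn main() {
    let mut cache = RobotsCache::new(Duration::from_secs(24 * 60 * 60));
    assert!(cache.get("example.com").is_none()); // miss -> fetch and insert
    cache.insert("example.com".to_string(), "User-agent: *\nDisallow:\n".to_string());
    assert!(cache.get("example.com").is_some()); // fresh -> no extra request
}
```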

@Teufelchen1
Collaborator

Okay, I think our best bet is robotstxt. It has zero dependencies and the code looks well commented. An alternative could be robotparser-rs, which depends on url and percent-encoding and has slightly fewer "used by" entries, but seems to be under more active development judging by the git history.
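Roughly how the matcher would be used - going from memory of the robotstxt README, so the exact type and method names (DefaultMatcher, one_agent_allowed_by_robots) should be double-checked against the crate docs:

```rust
// Sketch assuming the robotstxt crate's matcher API as shown in its README;
// verify the exact names before relying on this.
use robotstxt::DefaultMatcher;

fn allowed(robots_body: &str, url: &str) -> bool {
    let mut matcher = DefaultMatcher::default();
    // "website-stalker" is a placeholder user-agent string.
    matcher.one_agent_allowed_by_robots(robots_body, "website-stalker", url)
}

fn main() {
    let robots_body = "User-agent: *\nDisallow: /private\n";
    assert!(allowed(robots_body, "https://example.com/blog/feed.xml"));
    assert!(!allowed(robots_body, "https://example.com/private/secret.html"));
}
```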

I also considered writing a parser ourselves, as I wanted to learn more about parser generators, but all Rust-related projects seemed either overkill or not matching the use case (or what I imagined the use case for a parser generator would be 🤡). One example is pest, which has the very neat ability to take a PEG (think of it as (A)BNF) and generate a matching parser for it. Since the robots.txt RFC specifies the ABNF for parsing, this could be doable, although I didn't investigate the BNF <-> PEG conversion. I believe pest is overkill.
An alternative here could be nom, which is a parser-combinator library. We would break the ABNF from the RFC down into small parsers for each rule, and nom would combine them into a complete and proper parser. The downside is that a lot more work is needed, and we lose the 1:1 "verification" of the implementation against the ABNF.
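For illustration, a tiny sketch of that nom approach (assuming nom 7), covering only the user-agent/allow/disallow lines rather than the full ABNF:

```rust
use nom::{
    branch::alt,
    bytes::complete::{tag_no_case, take_till},
    character::complete::{char, space0},
    combinator::map,
    sequence::{preceded, tuple},
    IResult,
};

/// One robots.txt line, reduced to the rules we care about.
#[derive(Debug, PartialEq)]
enum Line<'a> {
    UserAgent(&'a str),
    Allow(&'a str),
    Disallow(&'a str),
}

/// The value of a rule: everything up to a comment or the end of the line.
fn value(input: &str) -> IResult<&str, &str> {
    take_till(|c| c == '#' || c == '\r' || c == '\n')(input)
}

/// `key *WS ":" *WS value`, e.g. `Disallow: /private`.
fn keyed<'a>(key: &'static str) -> impl FnMut(&'a str) -> IResult<&'a str, &'a str> {
    preceded(tuple((tag_no_case(key), space0, char(':'), space0)), value)
}

/// A single line mapped onto the enum; more rules would be added the same way.
fn line(input: &str) -> IResult<&str, Line<'_>> {
    alt((
        map(keyed("user-agent"), Line::UserAgent),
        map(keyed("allow"), Line::Allow),
        map(keyed("disallow"), Line::Disallow),
    ))(input)
}

fn main() {
    assert_eq!(line("User-Agent: *"), Ok(("", Line::UserAgent("*"))));
    assert_eq!(line("Disallow: /private"), Ok(("", Line::Disallow("/private"))));
}
```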
