Add support for the X-Robots-Tag noml header #365

Open
robrwo opened this issue Nov 28, 2023 · 1 comment

robrwo commented Nov 28, 2023

See https://noml.info/

diff --git a/README.md b/README.md
index 12fd5e6..f9b65d1 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@ For better performance, it's highly recommended to set up a fast dns resolver, s
 
 ## Opt-out directives
 
-Websites can pass the http headers `X-Robots-Tag: noai`, `X-Robots-Tag: noindex` , `X-Robots-Tag: noimageai` and `X-Robots-Tag: noimageindex`
+Websites can pass the http headers `X-Robots-Tag: noai`, `X-Robots-Tag: noindex` , `X-Robots-Tag: noimageai` , `X-Robots-Tag: noimageindex` and `X-Robots-Tag: noml`
 By default img2dataset will ignore images with such headers.
 
 To disable this behavior and download all images, you may pass --disallowed_header_directives '[]'
diff --git a/img2dataset/main.py b/img2dataset/main.py
index 94f32df..bc71418 100644
--- a/img2dataset/main.py
+++ b/img2dataset/main.py
@@ -111,7 +111,7 @@ def download(
 ):
     """Download is the main entry point of img2dataset, it uses multiple processes and download multiple files"""
     if disallowed_header_directives is None:
-        disallowed_header_directives = ["noai", "noimageai", "noindex", "noimageindex"]
+        disallowed_header_directives = ["noai", "noimageai", "noindex", "noimageindex", "noml"]
     if len(disallowed_header_directives) == 0:
         disallowed_header_directives = None

pabl0 commented Dec 14, 2023

I think it would be fair to respect noml as well as noai. Of course, it is just a proposal, but there is almost no cost if this gets implemented.

On the other hand, the typical use case for this software is not crawling for search engines, so perhaps consider removing noindex and noimageindex from the default set? Of course, I can see people who advocate opt-in instead of opt-out (#293) getting even angrier with such a change.
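
To make the two options concrete, here is a rough sketch of how a directive list could be matched against an `X-Robots-Tag` header. This is not img2dataset's actual code: `is_disallowed`, `PROPOSED_DEFAULT` and `AI_ONLY` are hypothetical names for illustration, and the sketch ignores user-agent-scoped values such as `somebot: noai`.

```python
from typing import Iterable, Mapping, Optional


def is_disallowed(headers: Mapping[str, str], disallowed: Optional[Iterable[str]]) -> bool:
    """Hypothetical helper: True if X-Robots-Tag carries a disallowed directive."""
    if not disallowed:
        # None or an empty list means "download everything".
        return False
    blocked = {d.lower() for d in disallowed}
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() != "x-robots-tag":
            continue
        # A single header value may list several comma-separated directives.
        directives = {part.strip().lower() for part in value.split(",")}
        if directives & blocked:
            return True
    return False


# Default set as proposed in the patch above.
PROPOSED_DEFAULT = ["noai", "noimageai", "noindex", "noimageindex", "noml"]
# Alternative: keep only the AI/ML opt-outs and drop the search-index ones.
AI_ONLY = ["noai", "noimageai", "noml"]

print(is_disallowed({"X-Robots-Tag": "noml"}, PROPOSED_DEFAULT))
print(is_disallowed({"X-Robots-Tag": "noindex, nofollow"}, AI_ONLY))
```

With the proposed default, an image served with `noml` would be skipped; with the reduced set, an image marked only `noindex` would still be downloaded.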
