diff --git a/README.md b/README.md
index 12fd5e6..f9b65d1 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@ For better performance, it's highly recommended to set up a fast dns resolver, s
 ## Opt-out directives
-Websites can pass the http headers `X-Robots-Tag: noai`, `X-Robots-Tag: noindex` , `X-Robots-Tag: noimageai` and `X-Robots-Tag: noimageindex`
+Websites can pass the http headers `X-Robots-Tag: noai`, `X-Robots-Tag: noindex` , `X-Robots-Tag: noimageai` , `X-Robots-Tag: noimageindex` and `X-Robots-Tag: noml`
 By default img2dataset will ignore images with such headers.
 To disable this behavior and download all images, you may pass --disallowed_header_directives '[]'
diff --git a/img2dataset/main.py b/img2dataset/main.py
index 94f32df..bc71418 100644
--- a/img2dataset/main.py
+++ b/img2dataset/main.py
@@ -111,7 +111,7 @@ def download(
 ):
     """Download is the main entry point of img2dataset, it uses multiple processes and download multiple files"""
     if disallowed_header_directives is None:
-        disallowed_header_directives = ["noai", "noimageai", "noindex", "noimageindex"]
+        disallowed_header_directives = ["noai", "noimageai", "noindex", "noimageindex", "noml"]
     if len(disallowed_header_directives) == 0:
         disallowed_header_directives = None
I think it would be fair to respect noml as well as noai. Granted, noml is only a proposal so far, but supporting it costs almost nothing.
On the other hand, this software is typically not used to crawl for search engines, so perhaps noindex and noimageindex should be removed from the default set? That said, I can see people who advocate opt-in instead of opt-out (#293) getting even angrier with such a change.
See https://noml.info/
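
To make the proposal concrete, here is a minimal sketch of how the directive list might be resolved and checked against `X-Robots-Tag` values. It is not img2dataset's actual code: `DEFAULT_DIRECTIVES`, `resolve_directives` and `is_disallowed` are hypothetical names, and only the `None` / empty-list handling mirrors the diff above. It also assumes plain comma-separated directives and ignores user-agent-scoped values such as `googlebot: noindex`.

```python
# Hypothetical default set, matching the list proposed in the diff above.
DEFAULT_DIRECTIVES = ["noai", "noimageai", "noindex", "noimageindex", "noml"]


def resolve_directives(disallowed_header_directives=None):
    """Mirror the defaulting logic from the diff.

    None -> use the defaults; an empty list -> None, meaning the check is disabled.
    """
    if disallowed_header_directives is None:
        disallowed_header_directives = list(DEFAULT_DIRECTIVES)
    if len(disallowed_header_directives) == 0:
        return None
    return {d.strip().lower() for d in disallowed_header_directives}


def is_disallowed(x_robots_tag_values, directives):
    """Return True if any X-Robots-Tag value carries a disallowed directive.

    x_robots_tag_values: list of raw header values, e.g. ["noai, noimageai"].
    directives: set returned by resolve_directives, or None to skip the check.
    """
    if directives is None:
        return False
    for value in x_robots_tag_values:
        # Assumption: directives are comma-separated, with no user-agent prefix.
        for token in value.split(","):
            if token.strip().lower() in directives:
                return True
    return False


if __name__ == "__main__":
    print(is_disallowed(["noml"], resolve_directives()))          # True
    print(is_disallowed(["noml"], resolve_directives(["noai"])))  # False
    print(is_disallowed(["noml"], resolve_directives([])))        # False
```

In this sketch, dropping noindex and noimageindex from the defaults would be a one-line change to `DEFAULT_DIRECTIVES`, which is part of why the cost of supporting noml looks small.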