Avoid following advertisements in news feeds and sitemaps #58

Open
sebastian-nagel opened this issue Nov 14, 2023 · 0 comments
See also this discussion on Common Crawl's user group.

Some news sites sell slots in their news feeds and sitemaps and place advertisements there. The crawler follows these links the same way it follows links to news articles. Because news sitemaps are auto-detected, thousands of "news" articles from the advertisement's target site may then be crawled.

Potential ways to fight these ads:

  • block following cross-site links, i.e. implement cross-submission validation
  • disable sitemap auto-detection (of course, this risks losing sitemap seeds if the sitemap URL changes)
  • manually adjust URL filters
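
The first option could look roughly like the sketch below. It is only an illustration in Python, not code from news-crawl or StormCrawler: the function names (`normalized_host`, `is_same_site`, `filter_links`) and the `extra_hosts` parameter are made up for this example. It assumes that comparing hosts (ignoring case and a leading `www.`) is a good enough notion of "same site", and that hosts with a verified cross submission (e.g. the sitemap is listed in that host's robots.txt) are passed in explicitly.

```python
from urllib.parse import urlsplit


def normalized_host(url: str) -> str:
    """Return the lowercased host of a URL without a leading 'www.', or '' if none."""
    try:
        host = (urlsplit(url).hostname or "").lower()
    except ValueError:
        # urlsplit rejects some malformed URLs (e.g. broken IPv6 literals)
        return ""
    return host[4:] if host.startswith("www.") else host


def is_same_site(source_url: str, link_url: str, extra_hosts=frozenset()) -> bool:
    """True if a link found in the feed or sitemap at source_url may be followed.

    extra_hosts is a hypothetical allow-list of hosts for which cross
    submission has been verified by some other means.
    """
    source = normalized_host(source_url)
    target = normalized_host(link_url)
    if not source or not target:
        return False
    return target == source or target in extra_hosts


def filter_links(source_url: str, links, extra_hosts=frozenset()):
    """Keep only the links that pass the same-site check, preserving order."""
    return [link for link in links if is_same_site(source_url, link, extra_hosts)]


if __name__ == "__main__":
    sitemap = "https://www.example-news.com/news-sitemap.xml"
    links = [
        "https://example-news.com/politics/article-1.html",
        "https://ads.example-shop.net/promo",  # advertisement on another site
    ]
    print(filter_links(sitemap, links))
```

A plain host comparison is stricter than a comparison by registered domain, so it would also drop legitimate links to sibling subdomains unless those hosts are added to the allow-list. Comparing registered domains instead would need a public suffix list.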