ignore_regex configuration option in config.cfg is not working properly #217

Open
4 of 7 tasks
marvingabler opened this issue Sep 3, 2021 · 1 comment

marvingabler commented Sep 3, 2021

Mandatory

  • I read the documentation (readme and wiki).
  • I searched other issues (including closed issues) and could not find any to be related. If you find related issues post them below or directly add your issue to the most related one.
  • I confirm that this bug report does not report on a specific news site where news-please does not work. Please keep in mind that news-please is a generic crawler so it is expected that it will not work for all sites well or even at all.

Related issues:

  • add them here

Describe the bug
The ignore_regex configuration option in config.cfg seems to be ignored. URLs that match the specified regex are still being downloaded.

To Reproduce

  1. Update ignore_regex in config.cfg with ^.*video.*$|^.*mediathek.*$
  2. Add http://welt.de to sitelist.hjson
  3. Run news-please and observe that URLs containing video and mediathek are still downloaded

See this regex validation against example welt.de URLs.
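
For reference, the pattern itself can also be checked outside news-please with a few lines of Python. This is only a sketch: `is_ignored` is a hypothetical helper, not a news-please function, and it says nothing about how (or whether) a given crawler applies ignore_regex. The two mediathek URLs are copied from the log in this report; the other two URLs are made-up examples.

```python
import re

# Pattern from step 1 of "To Reproduce".
IGNORE_REGEX = re.compile(r"^.*video.*$|^.*mediathek.*$")


def is_ignored(url: str) -> bool:
    """Hypothetical helper: True if the URL matches the ignore pattern."""
    return IGNORE_REGEX.search(url) is not None


urls = [
    # Copied from the "Checking site" lines in the log.
    "https://www.welt.de/mediathek/dokumentation/technik-und-wissen/"
    "sendung192055601/Extreme-Phaenomene-Die-Macht-der-Natur.html",
    "https://www.welt.de/mediathek/dokumentation/gesellschaft/"
    "sendung192112689/Die-verruecktesten-Urlaubsvideos-Hoellische-Ferien.html",
    # Made-up examples, for illustration only.
    "https://example.com/section/video123/some-clip.html",
    "https://example.com/politik/some-article.html",
]

for url in urls:
    # Print whether each URL would be expected to be skipped.
    print(is_ignored(url), url)
```

If the pattern matches here but the URLs are still downloaded, the pattern is probably not the problem, and the question becomes whether the crawler in use consults ignore_regex at all.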

Expected behavior
According to the comment on line 64 of config.cfg, such URLs should be skipped:
urls which match the following regex are ignored for recursive crawling

Log

[newsplease.pipeline.pipelines:523|INFO] Saving HTML to /home/neo/news-please-repo/data/2021/09/03/welt.de/vermischtes_video192762855_Beliebt-gegen-den-Kater-Gurkensaft-Verkaufsschlager-in-New-York_1630695592.html
[newsplease.pipeline.pipelines:548|INFO] Saving JSON to /home/neo/news-please-repo/data/2021/09/03/welt.de/vermischtes_video192762855_Beliebt-gegen-den-Kater-Gurkensaft-Verkaufsschlager-in-New-York_1630695592.html.json
[newsplease.helper_classes.sub_classes.heuristics_manager:49|INFO] Checking site: https://www.welt.de/mediathek/dokumentation/technik-und-wissen/sendung192055601/Extreme-Phaenomene-Die-Macht-der-Natur.html
[newsplease.pipeline.pipelines:523|INFO] Saving HTML to /home/neo/news-please-repo/data/2021/09/03/welt.de/mediathek_dokumentation_technik-und-wissen_sendung192055601_Extreme-Phaenomene-Die-Macht-der-Natur_1630695592.html
[newsplease.pipeline.pipelines:548|INFO] Saving JSON to /home/neo/news-please-repo/data/2021/09/03/welt.de/mediathek_dokumentation_technik-und-wissen_sendung192055601_Extreme-Phaenomene-Die-Macht-der-Natur_1630695592.html.json
[newsplease.helper_classes.sub_classes.heuristics_manager:49|INFO] Checking site: https://www.welt.de/mediathek/dokumentation/technik-und-wissen/sendung192055609/Extreme-Konstruktionen-Spektakulaere-Bauwerke.html
[newsplease.pipeline.pipelines:523|INFO] Saving HTML to /home/neo/news-please-repo/data/2021/09/03/welt.de/mediathek_dokumentation_technik-und-wissen_sendung192055609_Extreme-Konstruktionen-Spektakulaere-Bauwerke_1630695592.html
[newsplease.pipeline.pipelines:548|INFO] Saving JSON to /home/neo/news-please-repo/data/2021/09/03/welt.de/mediathek_dokumentation_technik-und-wissen_sendung192055609_Extreme-Konstruktionen-Spektakulaere-Bauwerke_1630695592.html.json
[newsplease.helper_classes.sub_classes.heuristics_manager:49|INFO] Checking site: https://www.welt.de/mediathek/dokumentation/gesellschaft/sendung192112689/Die-verruecktesten-Urlaubsvideos-Hoellische-Ferien.html
[newsplease.pipeline.pipelines:523|INFO] Saving HTML to /home/neo/news-please-repo/data/2021/09/03/welt.de/mediathek_dokumentation_gesellschaft_sendung192112689_Die-verruecktesten-Urlaubsvideos-Hoellische-Ferien_1630695592.html
[newsplease.pipeline.pipelines:548|INFO] Saving JSON to /home/neo/news-please-repo/data/2021/09/03/welt.de/mediathek_dokumentation_gesellschaft_sendung192112689_Die-verruecktesten-Urlaubsvideos-Hoellische-Ferien_1630695592.html.json

Versions (please complete the following information):

  • OS: Ubuntu 20.04
  • Python Version 3.8
  • news-please Version 1.5.21

Intent (optional; we'll use this info to prioritize upcoming tasks to work on)

  • personal
  • academic
  • business
  • other
  • Some information on your project: Private playing around

Btw great project!

@flatplate

Which crawler are you using?
