
Finished crawling with no results #175

Open · 3 of 6 tasks
tobiasstrauss opened this issue Aug 26, 2020 · 12 comments

tobiasstrauss commented Aug 26, 2020

Mandatory

  • I read the documentation (readme and wiki).
  • I searched other issues (including closed ones) and could not find any that are related. If you find related issues, post them below or add your issue directly to the most related one.

Related issues:

  • add them here

Describe your question
The given CLI example returns no pages from zeit.de. I have the same problem with other web pages. No error is thrown; it just returns and claims to be finished. So the question is whether there is a way to approach the problem. I have attached the log file.
log.txt

Versions (please complete the following information):

  • OS: Ubuntu 18.04
  • Python Version: 3.6
  • news-please Version: 1.5.13

Intent (optional; we'll use this info to prioritize upcoming tasks to work on)

  • personal
  • academic
  • business
  • other
  • Some information on your project:

I train language models and fine-tune them on other tasks like NER or text classification.

fhamborg (Owner) commented Sep 5, 2020

Strange, especially since there's no error in the log! When not using CLI mode but library mode (see readme.md), does the extraction work for you?
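
For reference, library mode as described in the readme looks roughly like this (the zeit.de front-page URL is just an illustration, not taken from the original report):

```python
# Minimal library-mode sketch per the readme: fetch one page and
# print the two extracted fields this thread is about.
from newsplease import NewsPlease

article = NewsPlease.from_url("https://www.zeit.de/index")
print(article.title)
print(article.maintext)
```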

tobiasstrauss (Author) commented

Actually not. The problem seems to be that one has to accept the advertising consent popup first. The output was:
zeit.de mit Werbung Besuchen Sie zeit.de wie gewohnt mit Werbung und Tracking. Details zum Tracking finden Sie in der Datenschutzerklärung und im Privacy Center . [In English: "zeit.de with advertising. Visit zeit.de as usual, with advertising and tracking. Details on tracking can be found in the privacy policy and in the Privacy Center."]
:-/

fhamborg (Owner) commented Sep 15, 2020

Did I understand you correctly that:

  1. When using library mode, e.g., from_url, you retrieve the above text? In which field of the NewsArticle object is it set, e.g., title, maintext, etc.?

  2. And, respectively, when using CLI mode, nothing is returned, not even an (empty) article object?
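
One way to answer question 1 is to print a few fields and see which one carries the consent text. A sketch, assuming the standard documented NewsArticle attributes and an illustrative URL:

```python
# Sketch: print common NewsArticle fields to locate the consent-popup
# text. The field names are standard extracted attributes; the URL is
# only an example.
from newsplease import NewsPlease

article = NewsPlease.from_url("https://www.zeit.de/index")
for field in ("title", "description", "maintext"):
    print(f"{field}: {getattr(article, field)!r}")
```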

JermellB commented

I had this problem myself; I am pretty sure I had a configuration issue that was failing silently. I remade my configuration file, basing it on the examples, and things seemed to start working. My assumption was some weird tabs-versus-spaces problem.

tobiasstrauss (Author) commented Oct 13, 2020

Did I understand you correctly that:

  1. When using library mode, e.g., from_url, you retrieve the above text? In which field of the NewsArticle object is it set, e.g., title, maintext, etc.?
  2. And, respectively, when using CLI mode, nothing is returned, not even an (empty) article object?

To 1: exactly! I just asked for maintext and title.
To 2: in CLI mode there is not even a folder referring to zeit.de.
Meanwhile I have set up a new system with Ubuntu 20.04. Same problem, also with a new configuration; I just used the configuration given in the example.
This is strange behavior, since other pages like faz seem to work perfectly.
@fhamborg thanks for sharing this great tool. Although zeit.de is not working for me, I was able to crawl many other pages.

edit:
my config file:

```hjson
{
  # Every URL has to be in an array-object in "base_urls".
  # The same URL in combination with the same crawler may only appear once in this array.
  "base_urls" : [
    {
      # zeit.de has a blog which we do not want to crawl
      "url": "http://www.zeit.de",

      "overwrite_heuristics": {
        # because we do not want to crawl that blog, disable all downloads from
        # subdomains
        "is_not_from_subdomain": true
      },
      # Update the condition as well, all the other heuristics are enabled in
      # newscrawler.cfg
      "pass_heuristics_condition": "is_not_from_subdomain and og_type and self_linked_headlines and linked_headlines"
    }
  ]
}
```

peterkabz commented

@tobiasstrauss I agree with you, the issue was the website's consent pop-up at https://www.zeit.de/zustimmung?url=https%3A%2F%2Fwww.zeit.de%2Findex

woxxel commented Jul 22, 2021

Hey there @tobiasstrauss,
you're able to bypass the issue by sending the appropriate cookie with the crawl request (a cookie named 'zonconsent'; you would have to get its value by visiting the site manually once). I've been implementing a couple of changes, including this one, which I could push, though I'm not 100% sure whether there are any legal implications to programmatically bypassing such consent popups. A sketch of the idea follows below.
Is anyone more literate on the relevant legal issues?
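
A minimal sketch of that workaround in library mode, assuming you fetch the HTML yourself and hand it to news-please via from_html; the cookie value is a placeholder you would copy from your browser after accepting the popup once:

```python
# Sketch of the 'zonconsent' workaround: fetch the page with the consent
# cookie attached, then let news-please extract from the raw HTML.
# The cookie value below is a placeholder -- copy the real one from your
# browser's dev tools after accepting the popup once.
import requests
from newsplease import NewsPlease

url = "https://www.zeit.de/index"
cookies = {"zonconsent": "<value-from-your-browser>"}  # placeholder, not a real value

html = requests.get(url, cookies=cookies, timeout=30).text
article = NewsPlease.from_html(html, url=url)
print(article.title)
print(article.maintext)
```

Since the cookie handling happens entirely on the requests side, this sidesteps from_url, which exposes no option for passing cookies.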

SamuelHelspr commented

Hey @woxxel,
I am currently experiencing the same issue as @tobiasstrauss. Could you share your approach to sending the cookie with the crawl request? I tried to implement it myself but have failed so far.
Thanks a lot!

loughnane commented

@SamuelHelspr or @woxxel, have either of you (or anyone reading) figured out how to send a cookie? I've been using the from_url function, and it seems there's no option to pass one.


JermellB commented Aug 8, 2023 via email

loughnane commented

Hey @JermellB, I'd gladly take you up on that patch.

BilalReffas commented

I just had the same experience. Interestingly, some sites (Guardian, FAZ) work fine even though there are ads in between.

But for Spiegel, the maintext is not returned at all for most content.

@JermellB any updates from you? Do you need help getting this patch started?
