
Finished crawling with no results #175

Open · 3 of 6 tasks
tobiasstrauss opened this issue Aug 26, 2020 · 12 comments

tobiasstrauss commented Aug 26, 2020

Mandatory

  • I read the documentation (readme and wiki).
  • I searched other issues (including closed ones) and could not find any that are related. If you find related issues, post them below or add your issue directly to the most related one.

Related issues:

  • add them here

Describe your question
The given CLI example returns no pages from zeit.de. I have the same problem with other web pages. No error is thrown; it just returns and claims to be finished. So the question is whether there is a way to approach the problem. I have attached the log file.
log.txt

Versions (please complete the following information):

  • OS: Ubuntu 18.04
  • Python Version: 3.6
  • news-please Version: 1.5.13

Intent (optional; we'll use this info to prioritize upcoming tasks to work on)

  • personal
  • academic
  • business
  • other
  • Some information on your project:

I train language models and fine-tune them on other tasks like NER or text classification.

fhamborg (Owner) commented Sep 5, 2020

Strange, especially since there's no error in the log! When not using CLI mode but library mode (see readme.md), does the extraction work for you?
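
For reference, library mode as described in the readme looks roughly like this (the zeit.de front-page URL is just an illustration, not taken from the original report):

```python
# Minimal library-mode sketch per the readme: fetch one page and
# print the two extracted fields this thread is about.
from newsplease import NewsPlease

article = NewsPlease.from_url("https://www.zeit.de/index")
print(article.title)
print(article.maintext)
```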

tobiasstrauss (Author) commented

Actually not. The problem seems to be that one has to accept the advertising consent popup first. The output was:
zeit.de mit Werbung Besuchen Sie zeit.de wie gewohnt mit Werbung und Tracking. Details zum Tracking finden Sie in der Datenschutzerklärung und im Privacy Center . [In English: "zeit.de with advertising. Visit zeit.de as usual, with advertising and tracking. Details on tracking can be found in the privacy policy and in the Privacy Center."]
:-/

fhamborg (Owner) commented Sep 15, 2020

Did I understand you correctly that:

  1. When using library mode, e.g., from_url, you retrieve the above text? In which field of the NewsArticle object is it set, e.g., title, maintext, etc.?

  2. And, respectively, when using CLI mode, nothing is returned, not even an (empty) article object?
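
One way to answer question 1 is to print a few fields and see which one carries the consent text. A sketch, assuming the standard documented NewsArticle attributes and an illustrative URL:

```python
# Sketch: print common NewsArticle fields to locate the consent-popup
# text. The field names are standard extracted attributes; the URL is
# only an example.
from newsplease import NewsPlease

article = NewsPlease.from_url("https://www.zeit.de/index")
for field in ("title", "description", "maintext"):
    print(f"{field}: {getattr(article, field)!r}")
```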

JermellB commented

I had this problem myself; I am pretty sure I had a configuration issue that was failing silently. I remade my configuration file, basing it on the examples, and things seemed to start working. My assumption was some weird tabs-versus-spaces problem.

tobiasstrauss (Author) commented Oct 13, 2020

Did I understand you correctly that:

  1. When using library mode, e.g., from_url, you retrieve the above text? In which field of the NewsArticle object is it set, e.g., title, maintext, etc.?
  2. And, respectively, when using CLI mode, nothing is returned, not even an (empty) article object?

To 1: exactly! I just asked for maintext and title.
To 2: in CLI mode there is not even a folder referring to zeit.de.
Meanwhile I have set up a new system with Ubuntu 20.04. Same problem, also with a new configuration; I just used the configuration given in the example.
This is strange behavior, since other pages like faz seem to work perfectly.
@fhamborg thanks for sharing this great tool. Although zeit.de is not working for me, I was able to crawl many other pages.

edit:
my config file:

```hjson
{
  # Every URL has to be in an array-object in "base_urls".
  # The same URL in combination with the same crawler may only appear once in this array.
  "base_urls" : [
    {
      # zeit.de has a blog which we do not want to crawl
      "url": "http://www.zeit.de",

      "overwrite_heuristics": {
        # because we do not want to crawl that blog, disable all downloads from
        # subdomains
        "is_not_from_subdomain": true
      },
      # Update the condition as well, all the other heuristics are enabled in
      # newscrawler.cfg
      "pass_heuristics_condition": "is_not_from_subdomain and og_type and self_linked_headlines and linked_headlines"
    }
  ]
}
```

peterkabz commented

@tobiasstrauss I agree with you, the issue was the website's consent pop-up at https://www.zeit.de/zustimmung?url=https%3A%2F%2Fwww.zeit.de%2Findex

woxxel commented Jul 22, 2021

Hey there @tobiasstrauss,
you're able to bypass the issue by sending the appropriate cookie with the crawl request (a cookie named 'zonconsent'; you would have to get its value by visiting the site manually once). I've been implementing a couple of changes, including this one, which I could push, though I'm not 100% sure whether there are any legal implications to programmatically bypassing such consent popups. A sketch of the idea follows below.
Is anyone more literate on the relevant legal issues?
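
A minimal sketch of that workaround in library mode, assuming you fetch the HTML yourself and hand it to news-please via from_html; the cookie value is a placeholder you would copy from your browser after accepting the popup once:

```python
# Sketch of the 'zonconsent' workaround: fetch the page with the consent
# cookie attached, then let news-please extract from the raw HTML.
# The cookie value below is a placeholder -- copy the real one from your
# browser's dev tools after accepting the popup once.
import requests
from newsplease import NewsPlease

url = "https://www.zeit.de/index"
cookies = {"zonconsent": "<value-from-your-browser>"}  # placeholder, not a real value

html = requests.get(url, cookies=cookies, timeout=30).text
article = NewsPlease.from_html(html, url=url)
print(article.title)
print(article.maintext)
```

Since the cookie handling happens entirely on the requests side, this sidesteps from_url, which exposes no option for passing cookies.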

SamuelHelspr commented

Hey @woxxel,
I am currently experiencing the same issue as @tobiasstrauss. Could you share your approach to sending the cookie with the crawl request? I tried to implement it myself but have failed so far.
Thanks a lot!

loughnane commented

@SamuelHelspr or @woxxel, have either of you (or anyone reading) figured out how to send a cookie? I've been using the from_url function, and it seems there's no option to pass one.


JermellB commented Aug 8, 2023 via email

loughnane commented

Hey @JermellB, I'd gladly take you up on that patch.

BilalReffas commented

I just had the same experience. Interestingly, some sites (Guardian, FAZ) work fine even though there are ads in between.

But for Spiegel, the maintext is not returned at all for most content.

@JermellB any updates from you? Do you need help getting this patch started?
