
Error reddit scraping #1005

Open
MatFrancois opened this issue Jul 5, 2023 · 2 comments

Labels
bug (Something isn't working), module:reddit

Comments

@MatFrancois

Describe the bug

I get the following error when I try to scrape Reddit:
snscrape.base.ScraperException: 4 requests to https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000 failed, giving up.
I also tried the Python package and the subreddit search, but neither works. I tried from another device as well, with the same result...
Any idea?

How to reproduce

Run snscrape -n 100 -vv reddit-search toto
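
This is roughly equivalent to the Python-package attempt mentioned above; a minimal sketch, assuming snscrape's RedditSearchScraper class, which fails with the same ScraperException because it hits the same Pushshift endpoint:

import snscrape.modules.reddit as snreddit

# Search Reddit for "toto" via the snscrape module API instead of the CLI.
scraper = snreddit.RedditSearchScraper('toto')
for i, item in enumerate(scraper.get_items(), start=1):
    print(item)
    if i >= 100:  # mirror the CLI's -n 100
        break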

Expected behaviour

Get data?

Screenshots and recordings

No response

Operating system

Kubuntu 22.04

Python version: output of python3 --version

3.8.8

snscrape version: output of snscrape --version

snscrape 0.7.0.20230622

Scraper

reddit-search

How are you using snscrape?

CLI (snscrape ... as a command, e.g. in a terminal)

Backtrace

snscrape.base.ScraperException: 4 requests to https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000 failed, giving up.

Log output

2023-07-05 09:56:59.215  INFO  snscrape.base  Retrieving https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000
2023-07-05 09:56:59.216  DEBUG  snscrape.base  ... with headers: {'User-Agent': 'snscrape/0.7.0.20230622'}
2023-07-05 09:56:59.216  DEBUG  snscrape.base  ... with environmentSettings: {'verify': True, 'proxies': OrderedDict(), 'stream': False, 'cert': None}
2023-07-05 09:56:59.217  DEBUG  urllib3.connectionpool  Starting new HTTPS connection (1): api.pushshift.io:443
2023-07-05 09:56:59.285  DEBUG  snscrape.base  Connected to: ('172.67.219.85', 443)
2023-07-05 09:56:59.285  DEBUG  snscrape.base  Connection cipher: ('TLS_AES_256_GCM_SHA384', 'TLSv1.3', 256)
2023-07-05 09:56:59.682  DEBUG  urllib3.connectionpool  https://api.pushshift.io:443 "GET /reddit/search/submission?q=toto&limit=1000 HTTP/1.1" 403 30
2023-07-05 09:56:59.684  INFO  snscrape.base  Retrieved https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000: 403
2023-07-05 09:56:59.684  DEBUG  snscrape.base  ... with response headers: {'Date': 'Wed, 05 Jul 2023 07:56:59 GMT', 'Content-Type': 'application/json', 'Content-Length': '30', 'Connection': 'keep-alive', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains', 'CF-Cache-Status': 'BYPASS', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v3?s=y4%2BHEpMTSBJTXPGm4t7j95SB7FGvFVoPOkhN7%2BoPzIMt8rFnrbVatyYC2TKIviyCyOuaYt%2B%2FtN02NPN3AZa%2BCtunP7oatjwYM8k51iOBRkXNrBcTndwFIxVJTfEqILZlwQTp"}],"group":"cf-nel","max_age":604800}', 'NEL': '{"success_fraction":0,"report_to":"cf-nel","max_age":604800}', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare', 'CF-RAY': '7e1e0df6bb3ad4e5-CDG', 'alt-svc': 'h3=":443"; ma=86400'}
2023-07-05 09:56:59.684  INFO  snscrape.base  Error retrieving https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000: non-200 status code, retrying
2023-07-05 09:56:59.684  INFO  snscrape.base  Waiting 1 seconds
2023-07-05 09:57:00.687  INFO  snscrape.base  Retrieving https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000
2023-07-05 09:57:00.687  DEBUG  snscrape.base  ... with headers: {'User-Agent': 'snscrape/0.7.0.20230622'}
2023-07-05 09:57:00.688  DEBUG  snscrape.base  ... with environmentSettings: {'verify': True, 'proxies': OrderedDict(), 'stream': False, 'cert': None}
2023-07-05 09:57:00.809  DEBUG  urllib3.connectionpool  https://api.pushshift.io:443 "GET /reddit/search/submission?q=toto&limit=1000 HTTP/1.1" 403 30
2023-07-05 09:57:00.810  INFO  snscrape.base  Retrieved https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000: 403
2023-07-05 09:57:00.811  DEBUG  snscrape.base  ... with response headers: {'Date': 'Wed, 05 Jul 2023 07:57:00 GMT', 'Content-Type': 'application/json', 'Content-Length': '30', 'Connection': 'keep-alive', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains', 'CF-Cache-Status': 'BYPASS', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v3?s=RRI7V4%2FKORopA2%2FQFWbrrUnkFlm%2Ftd5O9SismrizB9mCRFBeF2tTFM0L%2FhbJTzPPwHYyQiOZ6ZzhjUyUc%2BkSPQla5B1BqN%2BTV3LcE2%2Fv3y9Q%2FYeQHPp6gIGrjqjfaDO8dRC3"}],"group":"cf-nel","max_age":604800}', 'NEL': '{"success_fraction":0,"report_to":"cf-nel","max_age":604800}', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare', 'CF-RAY': '7e1e0dff78f0d4e5-CDG', 'alt-svc': 'h3=":443"; ma=86400'}
2023-07-05 09:57:00.811  INFO  snscrape.base  Error retrieving https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000: non-200 status code, retrying
2023-07-05 09:57:00.811  INFO  snscrape.base  Waiting 2 seconds
2023-07-05 09:57:02.815  INFO  snscrape.base  Retrieving https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000
2023-07-05 09:57:02.815  DEBUG  snscrape.base  ... with headers: {'User-Agent': 'snscrape/0.7.0.20230622'}
2023-07-05 09:57:02.815  DEBUG  snscrape.base  ... with environmentSettings: {'verify': True, 'proxies': OrderedDict(), 'stream': False, 'cert': None}
2023-07-05 09:57:02.938  DEBUG  urllib3.connectionpool  https://api.pushshift.io:443 "GET /reddit/search/submission?q=toto&limit=1000 HTTP/1.1" 403 30
2023-07-05 09:57:02.938  INFO  snscrape.base  Retrieved https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000: 403
2023-07-05 09:57:02.939  DEBUG  snscrape.base  ... with response headers: {'Date': 'Wed, 05 Jul 2023 07:57:02 GMT', 'Content-Type': 'application/json', 'Content-Length': '30', 'Connection': 'keep-alive', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains', 'CF-Cache-Status': 'BYPASS', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v3?s=iwHMuR85a9T6e4AsOzZ3nlYUMI4G2ke71fL7PEhrcNRyy%2BUhlTw9OhJgogU4NAWUKAY1gXhPNQgoSAZSct65B2fLZviQvfVhJwWAS7EWe%2BG0jcjKm4ot9p11cAMDQQQLmJ3P"}],"group":"cf-nel","max_age":604800}', 'NEL': '{"success_fraction":0,"report_to":"cf-nel","max_age":604800}', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare', 'CF-RAY': '7e1e0e0cc998d4e5-CDG', 'alt-svc': 'h3=":443"; ma=86400'}
2023-07-05 09:57:02.939  INFO  snscrape.base  Error retrieving https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000: non-200 status code, retrying
2023-07-05 09:57:02.939  INFO  snscrape.base  Waiting 4 seconds
2023-07-05 09:57:06.945  INFO  snscrape.base  Retrieving https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000
2023-07-05 09:57:06.945  DEBUG  snscrape.base  ... with headers: {'User-Agent': 'snscrape/0.7.0.20230622'}
2023-07-05 09:57:06.945  DEBUG  snscrape.base  ... with environmentSettings: {'verify': True, 'proxies': OrderedDict(), 'stream': False, 'cert': None}
2023-07-05 09:57:07.066  DEBUG  urllib3.connectionpool  https://api.pushshift.io:443 "GET /reddit/search/submission?q=toto&limit=1000 HTTP/1.1" 403 30
2023-07-05 09:57:07.067  INFO  snscrape.base  Retrieved https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000: 403
2023-07-05 09:57:07.067  DEBUG  snscrape.base  ... with response headers: {'Date': 'Wed, 05 Jul 2023 07:57:07 GMT', 'Content-Type': 'application/json', 'Content-Length': '30', 'Connection': 'keep-alive', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains', 'CF-Cache-Status': 'BYPASS', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v3?s=sbdEsUwu7UnHrollCV0oSOt0FUSXPBvUgjqRiWXSV0A%2BNpdcPvsXdaETxaF8GYBdD0k02i5vWa8sK%2FnZnSCNU5T0VPs3FMTx5yhC7E9LkDFzczUz5ZkXmrzoHoN4%2FcQEJYqI"}],"group":"cf-nel","max_age":604800}', 'NEL': '{"success_fraction":0,"report_to":"cf-nel","max_age":604800}', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare', 'CF-RAY': '7e1e0e2699d8d4e5-CDG', 'alt-svc': 'h3=":443"; ma=86400'}
2023-07-05 09:57:07.067  ERROR  snscrape.base  Error retrieving https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000: non-200 status code
2023-07-05 09:57:07.067  CRITICAL  snscrape.base  4 requests to https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000 failed, giving up.
2023-07-05 09:57:07.067  CRITICAL  snscrape.base  Errors: non-200 status code, non-200 status code, non-200 status code, non-200 status code
2023-07-05 09:57:07.118  CRITICAL  snscrape._cli  Dumped stack and locals to /tmp/snscrape_locals_j8mi7h4g
Traceback (most recent call last):
  File "/home/matthieu-inspiron/anaconda3/bin/snscrape", line 8, in <module>
    sys.exit(main())
  File "/home/matthieu-inspiron/anaconda3/lib/python3.8/site-packages/snscrape/_cli.py", line 323, in main
    for i, item in enumerate(scraper.get_items(), start = 1):
  File "/home/matthieu-inspiron/anaconda3/lib/python3.8/site-packages/snscrape/modules/reddit.py", line 219, in get_items
    yield from self._iter_api_submissions_and_comments({type(self)._apiField: self._name})
  File "/home/matthieu-inspiron/anaconda3/lib/python3.8/site-packages/snscrape/modules/reddit.py", line 185, in _iter_api_submissions_and_comments
    tipSubmission = next(submissionsIter)
  File "/home/matthieu-inspiron/anaconda3/lib/python3.8/site-packages/snscrape/modules/reddit.py", line 143, in _iter_api
    obj = self._get_api(url, params = params)
  File "/home/matthieu-inspiron/anaconda3/lib/python3.8/site-packages/snscrape/modules/reddit.py", line 94, in _get_api
    r = self._get(url, params = params, headers = self._headers, responseOkCallback = self._handle_rate_limiting)
  File "/home/matthieu-inspiron/anaconda3/lib/python3.8/site-packages/snscrape/base.py", line 275, in _get
    return self._request('GET', *args, **kwargs)
  File "/home/matthieu-inspiron/anaconda3/lib/python3.8/site-packages/snscrape/base.py", line 271, in _request
    raise ScraperException(msg)
snscrape.base.ScraperException: 4 requests to https://api.pushshift.io/reddit/search/submission?q=toto&limit=1000 failed, giving up.

Dump of locals

I would prefer to send it privately.

Additional context

No response

MatFrancois added the bug (Something isn't working) label on Jul 5, 2023
@JustAnotherArchivist
Owner

Pushshift is effectively dead, so yeah, this is expected and can't work anymore. Pushshift was the only way to (a) retrieve useful search results, since Reddit's own search is awful; (b) get all submissions in a subreddit, since Reddit limits that to 1000 results; and (c) get all submissions/comments by a user, due to the same limitation on Reddit.

Potentially, PullPush could serve as a replacement, but since Reddit's API changes are rolling out this month, I'll wait for that to happen before making any changes.
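
For anyone who wants to experiment in the meantime, here is a hypothetical sketch of querying PullPush directly with requests. The endpoint and parameters are assumptions based on PullPush mirroring the old Pushshift API; this is not something snscrape supports:

import requests

# Assumption: PullPush exposes a Pushshift-compatible submission search endpoint.
resp = requests.get(
    'https://api.pullpush.io/reddit/search/submission/',
    params={'q': 'toto', 'size': 100},
    headers={'User-Agent': 'my-research-script/0.1'},
    timeout=30,
)
resp.raise_for_status()
for submission in resp.json().get('data', []):
    print(submission.get('title'))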

(If the Reddit API itself is sufficient for your purposes, I recommend using PRAW rather than snscrape.)
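
For completeness, a minimal PRAW sketch of the same search through the official Reddit API; the credentials are placeholders you would obtain by registering an app at https://www.reddit.com/prefs/apps, and the official API caps listings at roughly 1000 items:

import praw

# Placeholder credentials; register an app on Reddit to obtain real values.
reddit = praw.Reddit(
    client_id='YOUR_CLIENT_ID',
    client_secret='YOUR_CLIENT_SECRET',
    user_agent='toto-search script by u/your_username',
)

# Search all of Reddit for "toto"; the official API limits how far back this goes.
for submission in reddit.subreddit('all').search('toto', limit=100):
    print(submission.title, submission.url)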

@ihabpalamino

This comment was marked as spam.
