save cookies on redirect #478

Open
zeliboba7 opened this issue Jan 16, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@zeliboba7

zeliboba7 commented Jan 16, 2024

I tried to use trafilatura with a website and got the following error:

(venv) mlosx:~/Sources/python_sitemap$ trafilatura --sitemap "https://www.mvideo.ru/" --list
unknown error: https://www.mvideo.ru HTTPSConnectionPool(host='www.mvideo.ru', port=443): Max retries exceeded with url: https://www.mvideo.ru/ (Caused by ResponseError('too many redirects'))
urllib3.exceptions.ResponseError: too many redirects

The above exception was the direct cause of the following exception:
...
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.mvideo.ru', port=443): Max retries exceeded with url: https://www.mvideo.ru/ (Caused by ResponseError('too many redirects'))

The website appears to set a cookie in its first response and redirect to the same URL. If the next request does not send that cookie back, it redirects again, and so on until the redirect limit is reached. Is it possible to handle this kind of website, i.e. to send the required cookies when following a redirect? wget does this, as shown below.

It seems wget handles cookies by default:

(venv) mlosx:~/Sources/python_sitemap$ wget --server-response https://www.mvideo.ru
--2024-01-16 21:02:48--  https://www.mvideo.ru/
Resolving www.mvideo.ru (www.mvideo.ru)... 185.71.67.88
Connecting to www.mvideo.ru (www.mvideo.ru)|185.71.67.88|:443... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 302 Moved Temporarily
  Server: nginx
  Date: Tue, 16 Jan 2024 14:02:48 GMT
  Content-Length: 0
  Connection: keep-alive
  Location: https://www.mvideo.ru/
  Set-Cookie: __hash_=91c1d62c023630be976ba7643d676c91; Max-Age=1800; Path=/
Location: https://www.mvideo.ru/ [following]
--2024-01-16 21:02:48--  https://www.mvideo.ru/
Reusing existing connection to www.mvideo.ru:443.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Server: nginx
  Date: Tue, 16 Jan 2024 14:02:48 GMT
  Content-Type: text/html
  Content-Length: 30429
  Connection: keep-alive
  Set-Cookie: __lhash_=729e7ad9ccc1bb394a0d3f88f6f97812; Max-Age=604800; Path=/
  last-modified: Mon, 15 Jan 2024 21:39:31 GMT
  cache-control: max-age=0
  accept-ranges: bytes
  MVID-Uber-Trace-Id: dced347839fe0739:721d8cdca998ae5e:dced347839fe0739:1
  expires: Tue, 16 Jan 2024 14:02:48 GMT
  x-powered-by: Express
  set-cookie: MVID_AB_PERSONAL_RECOMMENDS=true; Domain=.mvideo.ru; Path=/; Expires=Tue, 30 Jan 2024 07:01:00 GMT
...
  set-cookie: MVID_ENVCLOUD=prod2; path=/
  rev: 04
  etag: "65a5a613-76dd"
  lbu-ha: prod2
  lbu: prod2-fc
  Cache-Control: no-cache
Length: 30429 (30K) [text/html]
Saving to: ‘index.html’

     0K .......... .......... .........                       100% 42.0M=0.001s

2024-01-16 21:02:49 (42.0 MB/s) - ‘index.html’ saved [30429/30429]

But redirection is repeated when cookies are off:

(venv) mlosx:~/Sources/python_sitemap$ wget --server-response --no-cookies https://www.mvideo.ru
--2024-01-16 21:12:40--  https://www.mvideo.ru/
Resolving www.mvideo.ru (www.mvideo.ru)... 185.71.67.88
Connecting to www.mvideo.ru (www.mvideo.ru)|185.71.67.88|:443... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 302 Moved Temporarily
  Server: nginx
  Date: Tue, 16 Jan 2024 14:12:40 GMT
  Content-Length: 0
  Connection: keep-alive
  Location: https://www.mvideo.ru/
  Set-Cookie: __hash_=91c1d62c023b30beb76ba7643d676c91; Max-Age=1800; Path=/
Location: https://www.mvideo.ru/ [following]
--2024-01-16 21:12:40--  https://www.mvideo.ru/
Reusing existing connection to www.mvideo.ru:443.
HTTP request sent, awaiting response...
  HTTP/1.1 302 Moved Temporarily
  Server: nginx
  Date: Tue, 16 Jan 2024 14:12:40 GMT
  Content-Length: 0
  Connection: keep-alive
  Location: https://www.mvideo.ru/
  Set-Cookie: __hash_=91c1d62c023b30beb76ba7643d676c91; Max-Age=1800; Path=/
Location: https://www.mvideo.ru/ [following]
...
20 redirections exceeded.
(venv) mlosx:~/Sources/python_sitemap$
adbar added the enhancement label Jan 17, 2024
@adbar
Owner

adbar commented Jan 17, 2024

Good point, I know this kind of problem. Two different libraries perform the requests, depending on whether pycurl is installed on the machine. Handling cookies would mean finding a common logic for both, which is complicated given that this library doesn't focus on advanced downloads.

A first goal would be to configure the urllib3 session (the default) to accept and store cookies.
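
As a rough illustration of that first goal, here is a minimal sketch of following redirects by hand while sending back any cookies received along the way. It is not trafilatura's code: `update_cookies`, `fetch_with_cookies` and `make_urllib3_sender` are hypothetical names, and the cookie handling is deliberately simplified (a single flat name/value store, ignoring domain, path and expiry). The sketch assumes that urllib3's `PoolManager` does not store cookies itself, so redirects are followed manually with `redirect=False`.

```python
from http.cookies import SimpleCookie
from typing import Callable, Dict, List, Optional, Tuple
from urllib.parse import urljoin

# A reply is (status, redirect location or None, raw Set-Cookie values, body).
Reply = Tuple[int, Optional[str], List[str], bytes]
Sender = Callable[[str, Dict[str, str]], Reply]

REDIRECT_CODES = {301, 302, 303, 307, 308}


def update_cookies(jar: Dict[str, str], set_cookie_values: List[str]) -> None:
    """Store name/value pairs from raw Set-Cookie header values in `jar`.

    Simplification: attributes such as Domain, Path and Max-Age are ignored.
    """
    for raw in set_cookie_values:
        parsed = SimpleCookie()
        parsed.load(raw)
        for name, morsel in parsed.items():
            jar[name] = morsel.value


def fetch_with_cookies(
    url: str, send: Sender, max_redirects: int = 5
) -> Tuple[int, str, Dict[str, str], bytes]:
    """Follow redirects manually, returning collected cookies on each hop."""
    jar: Dict[str, str] = {}
    for _ in range(max_redirects + 1):
        headers: Dict[str, str] = {}
        if jar:
            headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in jar.items())
        status, location, set_cookies, body = send(url, headers)
        update_cookies(jar, set_cookies)
        if status not in REDIRECT_CODES or not location:
            return status, url, jar, body
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError("too many redirects")


def make_urllib3_sender() -> Sender:
    """Build a sender backed by urllib3 (hypothetical wiring, not trafilatura's)."""
    import urllib3  # imported lazily so the helpers above work without it

    pool = urllib3.PoolManager()

    def send(url: str, headers: Dict[str, str]) -> Reply:
        # redirect=False hands each 3xx response back to the caller.
        response = pool.request("GET", url, headers=headers, redirect=False)
        location = response.get_redirect_location() or None
        cookies = response.headers.getlist("Set-Cookie")
        return response.status, location, cookies, response.data

    return send
```

A call could then look like `fetch_with_cookies("https://www.mvideo.ru/", make_urllib3_sender())`. Keeping the transport behind a small `send` callable is one possible way to share the redirect and cookie logic between urllib3 and pycurl, although pycurl may be better served by libcurl's own cookie engine.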
