I tried to use trafilatura with a website and got the following error:
(venv) mlosx:~/Sources/python_sitemap$ trafilatura --sitemap "https://www.mvideo.ru/" --list
unknown error: https://www.mvideo.ru HTTPSConnectionPool(host='www.mvideo.ru', port=443): Max retries exceeded with url: https://www.mvideo.ru/ (Caused by ResponseError('too many redirects'))
urllib3.exceptions.ResponseError: too many redirects
The above exception was the direct cause of the following exception:
...
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.mvideo.ru', port=443): Max retries exceeded with url: https://www.mvideo.ru/ (Caused by ResponseError('too many redirects'))
It seems the website wants to set cookies on the initial request and then redirects to the same page. If the cookie is not returned in the next request, it redirects again (and again, and again). Is it possible to handle this kind of website, i.e. to send the required cookies on redirect (like wget does, see below)?
Good point, I know this kind of problem. There are two different libraries performing the requests, depending on whether the machine has pycurl or not. Handling this would mean finding a common logic for both, which is complicated given that this library doesn't focus on advanced downloads.
A first goal would be to configure the urllib3 session (the default) to accept and store cookies.
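As a rough illustration of that first goal, here is a minimal sketch of following redirects by hand while replaying cookies. It is not how trafilatura does it: the function name `fetch_with_cookies` and its parameters are made up for this example, `pool` is assumed to be a `urllib3.PoolManager` (or anything with a compatible `request` method), and the cookie handling is deliberately naive (no domain, path or expiry checks).

```python
from http.cookies import SimpleCookie
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def fetch_with_cookies(pool, url, max_redirects=10):
    """Follow redirects manually and send back any cookies the server sets.

    Hypothetical helper: `pool` is assumed to behave like urllib3.PoolManager.
    """
    cookies = {}  # name -> value; domain, path and expiry are ignored on purpose
    for _ in range(max_redirects + 1):
        headers = {}
        if cookies:
            headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
        # redirect=False makes urllib3 hand back the 3xx response instead of following it
        response = pool.request("GET", url, headers=headers, redirect=False)
        # A response may carry several Set-Cookie headers; store each of them
        for raw in response.headers.getlist("Set-Cookie"):
            parsed = SimpleCookie()
            parsed.load(raw)
            for name, morsel in parsed.items():
                cookies[name] = morsel.coded_value
        location = response.headers.get("Location")
        if response.status not in REDIRECT_CODES or not location:
            return response
        # Location may be relative, so resolve it against the current URL
        url = urljoin(url, location)
    raise RuntimeError(f"still redirecting after {max_redirects} hops: {url}")


# Usage (requires urllib3 and network access):
# import urllib3
# response = fetch_with_cookies(urllib3.PoolManager(), "https://www.mvideo.ru/")
```

A real fix would probably need a proper cookie jar and an equivalent setting on the pycurl side, so take this only as a way to reason about the redirect loop.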
It seems wget handles cookies by default, but the redirection is repeated when cookies are off.