Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MechanicalSoup needs an interface to set raw cookie strings #360

Open
johnhawkinson opened this issue Mar 28, 2021 · 10 comments
Open

MechanicalSoup needs an interface to set raw cookie strings #360

johnhawkinson opened this issue Mar 28, 2021 · 10 comments

Comments

@johnhawkinson
Copy link
Contributor

Mechanize has an interface to create cookies from raw cookie strings, browser.set_cookie(cookie_string), https://mechanize.readthedocs.io/en/latest/browser_api.html#mechanize.Browser.set_cookie, and MechanicalSoup does not.

Not having the interface makes it really painful to do this.

My situation is I have to deal with JavaScript that uses document.cookie="blah=value; path=/; domain=.foobar.com".
With Mechanize I just extracted the cookie strings from the JS with a regexp and called browser.set_cookie() on each one.

By contrast, this is the cleanest I could come up with using MechanicalSoup, and it involves using an undocumented private interface to http.cookiejar and is still far hairier than I would like. But manual parsing of the cookie string to cons up a "cookie" object to pass to cookie_jar.set_cookie() would be even worse.

Here's my code, with the javascript parsing omitted:

import http.cookiejar
import mechanicalsoup
import urllib.request

def parse_cookies():
    # ...
    raw_cookies_from_js = [
        "blah=value; path=/; domain=.foobar.com",
        "ughh=morev; path=/; domain=.foobar.com",
        "Session=6l11l1l; path=/; domain=.foobar.com; expires=Thu, 01-Jan-1970 00:00:01 GMT; secure;",
        "CV=; path=/; domain=.foobar.com; expires=Thu, 01-Jan-1970 00:00:01 GMT",
        ]

    fake_headers = []
    for raw_cookie in raw_cookies_from_js:
        fake_headers.append(raw_cookie)

    fake_request = urllib.request.Request('file://localhost')
    cookie_jar = browser.get_cookiejar()
    cookie_attrs = http.cookiejar.parse_ns_headers(fake_headers)
    cookies = cookie_jar._cookies_from_attrs_set(cookie_attrs, fake_request)
    for cookie in cookies:
        cookie_jar.set_cookie(cookie)
@hemberger
Copy link
Contributor

Great question, thanks for asking!

The Requests session has a few methods for cookiejar manipulation, and I could definitely envision adding a set_cookie method to MechanicalSoup that wraps one of these. Here are two options from a nice StackOverflow post: https://stackoverflow.com/a/51904567/8834658 (i.e. session.cookies.set or session.cookies.create_cookie). Would either of these work in your case?

@johnhawkinson
Copy link
Contributor Author

The Requests session has a few methods for cookiejar manipulation

If only they were sufficient.

Here are two options from a nice StackOverflow post: https://stackoverflow.com/a/51904567/8834658 (i.e. session.cookies.set or session.cookies.create_cookie). Would either of these work in your case?

They would not. Both of those take an already-parsed broken-down cookie and add it to the cookie jar:

 session.cookies.set("COOKIE_NAME", "the cookie works", domain="example.com")

But that's not what I have. I have an un-parsed raw cookie string, because that's what the Javascript document.cookie interface uses, so it's what the JS I'm parsing has.

Requests does not appear to have an interface for such cookies — it leverages Python's urllib and urllib and that's why I am calling their methods.

But I definitely do not love what I came up with above. If someone can point me at a better solution, ideally one that does not involve creating a fake request, that would be great!

@hemberger
Copy link
Contributor

Thanks for the clarification. Hmm, what about something like this?

from http.cookies import SimpleCookie
cookie_jar = browser.get_cookiejar()
for raw_cookie in raw_cookies_from_js:
    cookie = SimpleCookie(raw_cookie)
    cookie_jar.set_cookie(cookie)

@johnhawkinson
Copy link
Contributor Author

johnhawkinson commented Mar 29, 2021

cookie = SimpleCookie(raw_cookie)

If only! A similar problem. The SimpleCookie() constructor does not take a raw cookie string at least, it is not usable as input to cookie_jar.set_cookie() when used that way, even though its documentation seems to suggest this would work..

I spent several hours trying to figure out how these interfaces (SimpleCookie, Morsels, http.cookies, etc., etc.) are supposed to work and failed to come up with anything useful. It seems needlessly complex and yet also unusable.

Here's how that plays out:

>>> raw_cookie="blah=value; path=/; domain=.foobar.com"
>>> from http.cookies import SimpleCookie
>>> cookie_jar = browser.get_cookiejar()
>>> cookie = SimpleCookie(raw_cookie)
>>> cookie
<SimpleCookie: blah='value'>
>>> cookie_jar.set_cookie(cookie)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/requests/cookies.py", line 344, in set_cookie
    if hasattr(cookie.value, 'startswith') and cookie.value.startswith('"') and cookie.value.endswith('"'):
AttributeError: 'SimpleCookie' object has no attribute 'value'
>>> cookie_jar
<RequestsCookieJar[]>
>>> 

@hemberger
Copy link
Contributor

Thanks for your follow-up. I guess I was naive in thinking that such a straightforward solution would be possible! It does seem frustrating, but I think that having a hack to make a fake request in MechanicalSoup would be equally unappealing. I don't have the expertise to know if this is a fundamental limitation of how cookies work, or if it's just a poor implementation in urllib. Maybe you could get more visibility of the issue by escalating it to Requests?

@johnhawkinson
Copy link
Contributor Author

Having an ugly hack in MechanicalSoup is unappealing, but not as unappealing as having to have it in every users' code that has to deal with this problem, I think.

I am definitely not convinced that my solution is Correct or even the cleanest. It was the best I could come up with.
I will take another run at it, now 3 weeks later, and see if I missed something easy on the first go 'round.
For instance, your suggestion that that the SimpleCookie() class should take a raw cookie string seems right — it just doesn't work, at least not the way I tried it.

If I don't get anywhere I'll ask with Requests.

I suspect, though, that I will get a response suggesting that it's not a good idea to encourage the use of raw cookie strings, since effectively bypass controls on who can write what cookies, and you wouldn't want Site X to be able to control Site Y's cookies. The rejoinder here is that of course, when matching the brower's JS API, this is the hand we are dealt. But it may be that it's the kind of thing one doesn't encounter for users of Requests who are not dealing with MechanicalSoup.

It belatedly occurs to me to look at what Mechanize does:

https://github.com/python-mechanize/mechanize/blob/3acb1836f3fd8edc5a758a417dd46b53832ae3b5/mechanize/_mechanize.py#L465-L469

        cookiejar = self._ua_handlers["_cookies"].cookiejar
        response = self.response()  # copy
        headers = response.info()
        headers["Set-cookie"] = cookie_string
        cookiejar.extract_cookies(response, self.request)

Hrmm. That does seem cleaner, assuming it translates properly. This time it involves a fake response rather than a fake request.

@ism
Copy link

ism commented Apr 20, 2021

Here's how that plays out:

Hello everyone!
This is a correct way to use SimpleCookie with raw data (taken from https://stackoverflow.com/questions/32281041/converting-cookie-string-into-python-dict)

>>> from http.cookies import SimpleCookie
>>> raw_cookie="blah=value; path=/; domain=.foobar.com"
>>> cookie = SimpleCookie()
>>> cookie.load(raw_cookie)
>>> cookie
<SimpleCookie: blah='value'>

@johnhawkinson
Copy link
Contributor Author

johnhawkinson commented Apr 20, 2021

@ism, that ends up no differently than above in #360 (comment), where we got just as far, setting and examining the SimpleCookie, albeit with the constructor rather than with the .load() method. But once that's done, you need to add it to a cookie jar, and your method fails the same way:

Hello everyone!
This is a correct way to use SimpleCookie with raw data (taken from https://stackoverflow.com/questions/32281041/converting-cookie-string-into-python-dict)

>>> from http.cookies import SimpleCookie
>>> raw_cookie="blah=value; path=/; domain=.foobar.com"
>>> cookie = SimpleCookie()
>>> cookie.load(raw_cookie)
>>> cookie
<SimpleCookie: blah='value'>

So, once we've done that:

>>> cookie_jar = browser.get_cookiejar()
>>> cookie_jar.set_cookie(cookie)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/requests/cookies.py", line 344, in set_cookie
    if hasattr(cookie.value, 'startswith') and cookie.value.startswith('"') and cookie.value.endswith('"'):
AttributeError: 'SimpleCookie' object has no attribute 'value'

If you keep reading the accepted answer to that SO post, after the part you quoted, it gets a bit hokey:

# Even though SimpleCookie is dictionary-like, it internally uses a Morsel object
# which is incompatible with requests. Manually construct a dictionary instead.
cookies = {}
for key, morsel in cookie.items():
    cookies[key] = morsel.value

That, of course, gives you a dictionary that is missing much of the parameters from our test example:

>>> cookies
{'blah': 'value'}

So, yeah, you could use it:

>>> browser.set_cookiejar(cookies)
>>> 

(setting aside the fact that doing so in that fashion would lose any existing cookies in the cookie jar), but the whole point of using http.cookiejar.parse_ns_headers andcookie_jar._cookies_from_attrs_set was the leverage the existing parsing code to handle more complicated raw cookie strings.

So, in sum: no, that doesn't solve the problem.

@ism
Copy link

ism commented Apr 20, 2021

Sorry, it was my bad, i misread you couldn't parse raw cookie.

Now, back to topic:

but the whole point of using http.cookiejar.parse_ns_headers andcookie_jar._cookies_from_attrs_set was the leverage the existing parsing code to handle more complicated raw cookie strings.

So, in sum: no, that doesn't solve the problem.

I don't understand why do MechanicalSoup have to keep and parse all that info? :)

After all what a client is sending to webserver is name/value tuple, removing everything else. From my experience, if additional cookie is needed i just pass it to request as additional header

extra_header = {'Cookie' : 'name=value'}
resp = browser.submit(form, url, headers=extra_header)

As another workaround is to pass your own request Session to browser constructor and have access to all underlying transport variables you can modify. (like you may add a cookie to session.cookies without losing the rest)

To recap, i don't think MechanicalSoup need any extra interface to handle such specific usecase as parsing extra cookie information and pass it to requests.

Cheers!

@johnhawkinson
Copy link
Contributor Author

johnhawkinson commented Apr 20, 2021

I don't understand why do MechanicalSoup have to keep and parse all that info? :)

The additional information adds constraints. It adds, for instance, a time after which you should not send the cookie. It adds path constraints, so if the cookie is not sent to all endpoints on a given web server. It adds domain constraints, so if you're dealing with multiple servers, there is a way to determine which cookies should go to which servers. And perhaps importantly, it adds the secure attribute, which avoids sending cookies in cleartext channels.

All of these things are important to mimicing what a first-class browser does. And where cookies are used for authentication, as they often are, it's important to get them right.

After all what a client is sending to webserver is name/value tuple, removing everything else. From my experience, if additional cookie is needed i just pass it to request as additional header

See above. These are controls that determine when it is appropriate to send the cookie.
As for "i just pass" "additional cookies," that would be because you have special knowledge about them, perhaps because you have parsed the raw cookie strings by hand (or some other mechanism). I'd like to avoid having special knowledge, since that means rolling my own cookie handling code, which is really what we're trying to avoid here.

As another workaround is to pass your own request Session to browser constructor and have access to all underlying transport variables you can modify. (like you may add a cookie to session.cookies without losing the rest)

I don't understand what problem this is trying to solve. Another workaround for what, precisely?

To recap, i don't think MechanicalSoup need any extra interface to handle such specific usecase as parsing extra cookie information and pass it to requests.

Well, if you need to parse raw cookie strings, you need such an interface.
Obviously most people don't have to do so, it's a bit specialized, since most people don't have to deal with Javascript's document.cookie interface.
But I certainly need such an interface, and mechanize provided one, so I think we should as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants