Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Robust form selection should be encouraged #366

Open
johnhawkinson opened this issue Apr 6, 2021 · 3 comments
Open

Robust form selection should be encouraged #366

johnhawkinson opened this issue Apr 6, 2021 · 3 comments

Comments

@johnhawkinson
Copy link
Contributor

johnhawkinson commented Apr 6, 2021

The problem

As I was getting up to speed on Mechanical Soup, I was in a bit of a rush and just ended up using the notional:

browser.select_form(nr=0)
browser["password"] = "correct horse battery staple"
browser.submit_selected()

But this is not desirable — scripts that automate websites should be as robust as possible to minor changes in those websites, and should avoid hard-coding anything that they don't have to. In that case, the form number. Looking at the examples, and indeed the API, though, it seems to be hard to do better.

The tutorial recommends:

browser.select_form('form[action="/post"]')

and the examples either do the same or instead:

browser.select_form('#login form')

These are not robust. They depend on the numbering of the form, the action URL of the form, or the id property of the form. All of these can easily change in ways that shouldn't matter to most users or to automation, but will break.

Instead, the script really should be selecting the form that has the password field.
This was easily accomplished in the OG, perl's WWW::Mechanize:

$mech->form_with_fields('password')

and indeed, you could skip the form selection and combine setting the field with the submission:

$mech->submit_form( with_fields => { password => "correct horse battery staple" } )

And in mechanize it was a bit hairier, but select_form() offered a predicate parameter, allowing:

browser.select_form(predicate=lambda form: form.find_control("password"))

But to accomplish this in MechanicalSoup, it's much harder.

Edit (18 April): This was not quite right. form input is a selector returning an <input> element, not a <form> element, as was desired here. This can be worked around with form:has(input).
select_form() takes a CSS selector, but that can't robustly look for fields, because they can be different tags (input, select, textarea). (Obviously a "password" field is not likely to be anything other than an input, but many other controls are not so specific.):

browser.select_form('form input[name="password"]')

I guess one could write:
browser.select_form('form select[name="password"], textarea[name="password"], input[name="password"]')
# or
[ browser.select_form(f'form {input}[name="password"]') for input in ['input', 'select', 'textarea'] ]
# or
browser.select_form(
", ".join([ f'form {input}[name="password"]' for input in ['input', 'select', 'textarea'] ])
)

But none of that seems reasonable. Especially for something that should be a common task.

We also can give select_form() a bs4 form object, so I suppose we can:

form = browser.page.find(['input', 'textarea', 'select'], attrs={"name": "password"}).find_parent('form')
browser.select_form(form)

but that seems horribly cumbersome and not a pattern we want to recommend.
And worse, none of these scale up for multiple fields. That is,

$mech->submit_form( with_fields => { login => "xkcd", password => "correct horse battery staple" } )

would address a form like this:

<form action="remind"> Need a password reminder? <input name="login"> </form>
<form action="entropy"> Check a password for entropy! <input name="password"> </form>
<form action="login"> Login to the site <input name="login"> <input name="password"> </form>

Yet all of the above MechanicalSoup solutions don't really work for this.

It's also worth noting that presumably the big competition is Puppeteer, and over there it would just be

await page.type('.password input', 'correct horse battery staple');
await page.type(String.fromCharCode(13)); // or something, not actually sure how you submit? looks like a lot of choices.

I guess that's not universally better. It requires you to know the field is an <input>, but it does leverage the browser's computation of which form the input belongs to — something that browsers are particular good at.

Why?

Why did we abandon WWW:Mechanize's approach? It seemed like a pretty good example of an API designed to make it easy to do the Right Thing?

How should we fix it?

We could do more than one of these things:

  1. Give select_form() a predicate option, a callable that it would pass to bs4.find_all("form", callable), allowing
   select_form(predicate=lambda f: form.find(['input', 'textarea', 'select'], attrs={"name": "password"}))

This would add compatibility with mechanize but it's not a nice pattern.

  1. Give select_form() a with_fields allowing
select_form(with_fields=["login", "password"])
  1. Allow selection, filling-of-fields, and submission in one go
submit_form(with_fields={'login': 'xkcd', 'password': 'correct horse battery staple'})

I like this, but not the naming. We already have Browser.submit() which has no form state, and StatefulBrowser.submit_selected(). I'm not sure how we would name such a method without conflicting with Browser — any method like submit_form() would take the same cognitive spot as Browser's submit() and it's not clear how a person is supposed to remember submit() goes to Browser and submit_form() to StatefulBrowser.

Maybe StatefulBrowser should override submit()?

And, of course, we should update the examples and the tutorial as well. (Perhaps those things should be more tightly linked?)

@johnhawkinson
Copy link
Contributor Author

Any ideas?

@hemberger
Copy link
Contributor

Thanks for your thoughtful and detailed analysis of this issue! I agree completely that in an ideal world it would be easy to do things the robust way and hard to do things the fragile way.

These are not robust. They depend on the numbering of the form, the action URL of the form, or the id property of the form. All of these can easily change in ways that shouldn't matter to most users or to automation, but will break.

A minor point of disagreement here, but I would argue that if the id of a form changes, I (personally) would not want to automatically assume that it still does the same thing just because it has the same input fields. Here's a contrived example, but let's say you're downloading photos from some online storage, identified by the date they were taken:

<form id="download">
    <input name="date" />
    <input type="submit" value="Download" />
</form>

Now let's say the site has been updated, and you end up on a page that deletes photos identified by that date instead of downloading them. Then you might have a form:

<form id="delete">
    <input name="date" />
    <input type="submit" value="Delete" />
</form>

But maybe in an alternate reality, the form changed from id="download" to id="downloadPhotos", in which case your id-based selector fails and now you need to spend extra time debugging why it failed. Is it wrong to be conservative here?

This is a long way of saying that what makes form submission and navigation fragile/dangerous depends very sensitively on the source of the HTML that's being processed, and how the user might expect it to change in the future. I think that highlights the strength of CSS selectors:

  1. They are agnostic to their purpose, and are very versatile. Your selectors can be detailed and prescriptive (though admittedly the format is quite dense), or short and simple.
  2. We don't have to maintain any of that code here, allowing MechanicalSoup to remain a lightweight wrapper around these two amazing Python libraries.

I can see where your frustration comes from though. CSS selectors are not perfect, and as you point out they are not very robust to changes in HTML that don't change the nature of the page (which is clearly something that is important to you!).

One thing to note is that bs4 implements CSS selectors via Soup Sieve, which supports the CSS4 :has() pseudo-selector (https://drafts.csswg.org/selectors-4/#relational). I feel like this might work perfectly for some of your use cases, e.g. "just give me the form element that has username and password input children". Obviously this doesn't make it any harder to do things in a fragile way, but maybe we could at least demonstrate how one might use CSS selectors in a robust way in our examples?

I'm not personally opposed to adding some convenience functions like the ones you proposed (especially if someone is willing to contribute them!), but I'd want to make sure to get some feedback from @moy, since he was more involved in the original API design.

Either way, I think we can all agree that selecting a form based on the numerical ordering of forms on the page (and nothing else) is scary, except in trivial examples. Maybe that's something we could specifically do more to discourage. :)

@johnhawkinson
Copy link
Contributor Author

Thanks, Dan, for the feedback! Reordering slightly.

I'm not personally opposed to adding some convenience functions like the ones you proposed (especially if someone is willing to contribute them!), but I'd want to make sure to get some feedback from @moy, since he was more involved in the original API design.

To be clear, I'm certainly willing to contribute such code. I just don't want to put in effort that won't be accepted, and I also think it's worth discussing the proposed API before doing it (but maybe I spend too much time on that!).

A minor point of disagreement here, but I would argue that if the id of a form changes, I (personally) would not want to automatically assume that it still does the same thing just because it has the same input fields.

That's certainly a fair point, and it does seem to vary depending on circumstances, perhaps based on the presence of credentials as an implication for the plausibility of a read-write operation versus a read-only query.

But on the other hand, you could make the same argument about the action of a form, or the name of the form, or the contents of the fields. ("If the action of a form changes, I (personally) would not want to automatically assume that it still does the same thing just because it has the same id.") It seems clear that specifying the action, the id, the name, and the constituent fields is overkill, though, and would probably result in a fragile script.

Looking at our examples, we do have 1 (counting both example{,_manual}.py as one) that filters forms using the id of a parent div; 2 using the id of the form itself; and 2 using the action attribute:

example.py:browser.select_form('#login form')
example_manual.py:login_form = mechanicalsoup.Form(login_page.soup.select_one('#login form'))
expl_duck_duck_go.py:browser.select_form('#search_form_homepage')
expl_google.py:browser.select_form('form[action="/search"]')
expl_httpbin.py:browser.select_form('form[action="/post"]')
expl_qwant.py:browser.select_form('#search-form')

But maybe in an alternate reality, the form changed from id="download" to id="downloadPhotos", in which case your id-based selector fails and now you need to spend extra time debugging why it failed. Is it wrong to be conservative here?

I guess it really depends on how you define these terms — if we're talking about a Google search form, it's more "robust" to submit the form we think is most likely to be correct, because it will work if the form URL changes, and is not going to, say, delete search results from the global Internet. (Is it "conservative"?)

If we're talking about a credentialed photo sharing website, there is a real possibility (though remote) that a download form will turn into a deletion form, and so it is more "conservative" to fail instead of taking the wrong action. (Is it "robust"?)

Probably neither "robust" nor "conservative" is really a great fit for this concept.

One thing to note is that bs4 implements CSS selectors via Soup Sieve, which supports the CSS4 :has() pseudo-selector (https://drafts.csswg.org/selectors-4/#relational). I feel like this might work perfectly for some of your use cases, e.g. "just give me the form element that has username and password input children".

This confused me for a bit when I compared it to the pattern I had originally suggested, and I realized I had made an error in the lead post, suggesting that browser.select_form('form input[name="password"]') was effective, when it was not. That CSS selector returns the <input> element, not the <form> element. I've editted it and added some strikethrough.

But it doesn't seem practical to use :has in this way in raw CSS, because of the multiplicative problem. That is, if we want a form that has a username and password element, without specifying which kind of element they are, that's this:

browser.select_form('form'
                    ':has(select[name="login"],input[name="login"],textarea[name="login"])'
                    ':has(select[name="password"],input[name="password"],textarea[name="password"])')

If we were willing to limit them to <input> and not <select> or <textarea>, it reduces to

browser.select_form('form:has(input[name="login"]):has(input[name="password"]')

which at least fits on one 80-column line, but is not very satisfying.

Obviously this doesn't make it any harder to do things in a fragile way, but maybe we could at least demonstrate how one might use CSS selectors in a robust way in our examples?

I am less concerned with making it harder to do things in a fragile way, since that implies breaking existing APIs.

Either way, I think we can all agree that selecting a form based on the numerical ordering of forms on the page (and nothing else) is scary, except in trivial examples. Maybe that's something we could specifically do more to discourage. :)

It doesn't help that browser.select_form() defaults to this behavior, i.e. nr=0. That's probably right when there's a narrow selector given, but maybe not with selector='form', the default.

I don't see how that could reasonably be deprecated though :(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants