New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Robust form selection should be encouraged #366
Comments
Any ideas? |
Thanks for your thoughtful and detailed analysis of this issue! I agree completely that in an ideal world it would be easy to do things the robust way and hard to do things the fragile way.
A minor point of disagreement here, but I would argue that if the <form id="download">
<input name="date" />
<input type="submit" value="Download" />
</form> Now let's say the site has been updated, and you end up on a page that deletes photos identified by that date instead of downloading them. Then you might have a form: <form id="delete">
<input name="date" />
<input type="submit" value="Delete" />
</form> But maybe in an alternate reality, the form changed from This is a long way of saying that what makes form submission and navigation fragile/dangerous depends very sensitively on the source of the HTML that's being processed, and how the user might expect it to change in the future. I think that highlights the strength of CSS selectors:
I can see where your frustration comes from though. CSS selectors are not perfect, and as you point out they are not very robust to changes in HTML that don't change the nature of the page (which is clearly something that is important to you!). One thing to note is that bs4 implements CSS selectors via Soup Sieve, which supports the CSS4 I'm not personally opposed to adding some convenience functions like the ones you proposed (especially if someone is willing to contribute them!), but I'd want to make sure to get some feedback from @moy, since he was more involved in the original API design. Either way, I think we can all agree that selecting a form based on the numerical ordering of forms on the page (and nothing else) is scary, except in trivial examples. Maybe that's something we could specifically do more to discourage. :) |
Thanks, Dan, for the feedback! Reordering slightly.
To be clear, I'm certainly willing to contribute such code. I just don't want to put in effort that won't be accepted, and I also think it's worth discussing the proposed API before doing it (but maybe I spend too much time on that!).
That's certainly a fair point, and it does seem to vary depending on circumstances, perhaps based on the presence of credentials as an implication for the plausibility of a read-write operation versus a read-only query. But on the other hand, you could make the same argument about the Looking at our examples, we do have 1 (counting both example.py:browser.select_form('#login form')
example_manual.py:login_form = mechanicalsoup.Form(login_page.soup.select_one('#login form'))
expl_duck_duck_go.py:browser.select_form('#search_form_homepage')
expl_google.py:browser.select_form('form[action="/search"]')
expl_httpbin.py:browser.select_form('form[action="/post"]')
expl_qwant.py:browser.select_form('#search-form')
I guess it really depends on how you define these terms — if we're talking about a Google search form, it's more "robust" to submit the form we think is most likely to be correct, because it will work if the form URL changes, and is not going to, say, delete search results from the global Internet. (Is it "conservative"?) If we're talking about a credentialed photo sharing website, there is a real possibility (though remote) that a download form will turn into a deletion form, and so it is more "conservative" to fail instead of taking the wrong action. (Is it "robust"?) Probably neither "robust" nor "conservative" is really a great fit for this concept.
This confused me for a bit when I compared it to the pattern I had originally suggested, and I realized I had made an error in the lead post, suggesting that But it doesn't seem practical to use browser.select_form('form'
':has(select[name="login"],input[name="login"],textarea[name="login"])'
':has(select[name="password"],input[name="password"],textarea[name="password"])') If we were willing to limit them to browser.select_form('form:has(input[name="login"]):has(input[name="password"]') which at least fits on one 80-column line, but is not very satisfying.
I am less concerned with making it harder to do things in a fragile way, since that implies breaking existing APIs.
It doesn't help that I don't see how that could reasonably be deprecated though :( |
The problem
As I was getting up to speed on Mechanical Soup, I was in a bit of a rush and just ended up using the notional:
But this is not desirable — scripts that automate websites should be as robust as possible to minor changes in those websites, and should avoid hard-coding anything that they don't have to. In that case, the form number. Looking at the examples, and indeed the API, though, it seems to be hard to do better.
The tutorial recommends:
and the examples either do the same or instead:
These are not robust. They depend on the numbering of the form, the action URL of the form, or the
id
property of the form. All of these can easily change in ways that shouldn't matter to most users or to automation, but will break.Instead, the script really should be selecting the form that has the
password
field.This was easily accomplished in the OG, perl's WWW::Mechanize:
and indeed, you could skip the form selection and combine setting the field with the submission:
And in
mechanize
it was a bit hairier, butselect_form()
offered apredicate
parameter, allowing:But to accomplish this in MechanicalSoup, it's much harder.
Edit (18 April): This was not quite right.
form input
is a selector returning an<input>
element, not a<form>
element, as was desired here. This can be worked around withform:has(input)
.:select_form()
takes a CSS selector, but that can't robustly look for fields, because they can be different tags (input
,select
,textarea
). (Obviously a "password" field is not likely to be anything other than aninput
, but many other controls are not so specific.)browser.select_form('form input[name="password"]')
I guess one could write:browser.select_form('form select[name="password"], textarea[name="password"], input[name="password"]')
# or
[ browser.select_form(f'form {input}[name="password"]') for input in ['input', 'select', 'textarea'] ]
# or
browser.select_form(
", ".join([ f'form {input}[name="password"]' for input in ['input', 'select', 'textarea'] ])
)
But none of that seems reasonable. Especially for something that should be a common task.
We also can give
select_form()
abs4
form object, so I suppose we can:but that seems horribly cumbersome and not a pattern we want to recommend.
And worse, none of these scale up for multiple fields. That is,
would address a form like this:
Yet all of the above MechanicalSoup solutions don't really work for this.
It's also worth noting that presumably the big competition is Puppeteer, and over there it would just be
I guess that's not universally better. It requires you to know the field is an
<input>
, but it does leverage the browser's computation of which form the input belongs to — something that browsers are particular good at.Why?
Why did we abandon WWW:Mechanize's approach? It seemed like a pretty good example of an API designed to make it easy to do the Right Thing?
How should we fix it?
We could do more than one of these things:
select_form()
a predicate option, acallable
that it would pass tobs4.find_all("form", callable)
, allowingThis would add compatibility with
mechanize
but it's not a nice pattern.select_form()
awith_fields
allowingI like this, but not the naming. We already have
Browser.submit()
which has no form state, andStatefulBrowser.submit_selected()
. I'm not sure how we would name such a method without conflicting withBrowser
— any method likesubmit_form()
would take the same cognitive spot asBrowser
'ssubmit()
and it's not clear how a person is supposed to remembersubmit()
goes toBrowser
andsubmit_form()
toStatefulBrowser
.Maybe
StatefulBrowser
should overridesubmit()
?And, of course, we should update the examples and the tutorial as well. (Perhaps those things should be more tightly linked?)
The text was updated successfully, but these errors were encountered: