
No response for xml page. #373

Open
bedus-creation opened this issue May 29, 2021 · 4 comments

Comments

@bedus-creation

Python version: 3.8.5

I am expecting to get an XML page, but get_current_page() returns NoneType.

import mechanicalsoup

def get_response(url):
    browser = mechanicalsoup.StatefulBrowser(
        soup_config={'features': 'lxml'},
        user_agent='Googlebot/2.1: https://www.google.com/bot.html'
    )
    browser.open(url)
    return browser.get_current_page()

get_response("https://jagirhouse.com/sitemap.xml")
@johnhawkinson
Contributor

In MechanicalSoup, a "page" is defined to be a BeautifulSoup object, and it is assumed that BeautifulSoup can only parse HTML. That assumption is not correct: bs4 can parse XML if instructed to do so (see "Parsing XML" in the bs4 documentation).
The workaround, and probably what you want to do, is to preserve the response and then deal with it yourself, whether through bs4 or some other parser.

>>> import mechanicalsoup
>>> import bs4
>>> browser = mechanicalsoup.StatefulBrowser(
...     soup_config={'features': 'lxml'},
...     user_agent='Googlebot/2.1: https://www.google.com/bot.html'
... )
>>> url="https://jagirhouse.com/sitemap.xml"
>>> response=browser.open(url)
>>> len(response.content)
409
>>> response.headers['content-type']
'text/xml; charset=UTF-8'
>>> soup = bs4.BeautifulSoup(response.content, "xml")
>>> soup.loc
<loc>https://jagirhouse.com/jobs-sitemap.xml</loc>
>>> 

Given that page is documented as "Get the current page as a soup object", this code is probably unnecessarily restrictive:

def add_soup(response, soup_config):
    """Attaches a soup object to a requests response."""
    if ("text/html" in response.headers.get("Content-Type", "") or
            Browser.__looks_like_html(response)):
It should probably also check for text/xml and invoke bs4 on it, and it should probably do so before any Browser.__looks_like_html() heuristics run.
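A relaxed version of that check might look like the following. This is only a sketch of the idea, not the actual MechanicalSoup code; _looks_like_html here is a stand-in for the library's private Browser.__looks_like_html() heuristic:

```python
import bs4

def _looks_like_html(response):
    # Hypothetical stand-in for MechanicalSoup's private
    # Browser.__looks_like_html() heuristic.
    body = response.content.strip().lower()
    return body.startswith(b"<!doctype html") or body.startswith(b"<html")

def add_soup(response, soup_config):
    """Attach a soup to the response for HTML *or* XML bodies (sketch)."""
    content_type = response.headers.get("Content-Type", "")
    # Check for XML first, before any looks-like-HTML heuristics run.
    if "text/xml" in content_type or "application/xml" in content_type:
        response.soup = bs4.BeautifulSoup(response.content, "xml")
    elif "text/html" in content_type or _looks_like_html(response):
        response.soup = bs4.BeautifulSoup(response.content, **soup_config)
    else:
        response.soup = None
```

With this shape, the sitemap from the original report would get an XML soup attached instead of no soup at all.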

On the other hand, I guess it could be argued that the targets of URIs that return non-text/html content are not "pages," but if so, the documentation should be more clear.

@moy
Collaborator

moy commented May 29, 2021

We could probably relax some constraints, but if we do so we also need to be more careful with methods like follow_link, Form and friends, which really assume HTML.

@johnhawkinson
Contributor

Would we?

To the extent the XML has "links" (in which case it is probably XHTML or something similar, which, it seems, uses a variety of content types, including application/xhtml+xml), then it seems like follow_link() and friends would work fine.
And if they fail to parse the content, it is mostly no-harm/no-foul, right?

I suppose some unnecessary computational resources may be wasted in the attempt?

@moy
Collaborator

moy commented May 29, 2021

In the case of XHTML, yes, everything should still work (supporting application/xhtml+xml in addition to text/html would be straightforward and just work).

Other varieties of XML may have a notion of links, but with different syntax. For example, RSS and Atom have a <link> tag, so searching for <a href=...> won't work.
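The difference is easy to see with bs4's XML parser. In this sketch (feed content made up for illustration), the link target is the text of a <link> element, so an HTML-style search for anchors with an href attribute finds nothing:

```python
import bs4

rss = b"""<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
    </item>
  </channel>
</rss>"""

soup = bs4.BeautifulSoup(rss, "xml")
# In RSS the target is the text content of <link>, not an href attribute,
# so code that looks for <a href=...> would find no links at all here.
links = [link.text for link in soup.find_all("link")]
assert soup.find_all("a", href=True) == []
```

(Note that bs4's HTML parsers would mangle this, since HTML treats <link> as a void element with no text content; the "xml" parser preserves it.)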
