
No response for xml page. #373

Open
bedus-creation opened this issue May 29, 2021 · 4 comments

Comments

@bedus-creation

Python version: 3.8.5

I am expecting to get an XML page, but get_current_page() returns NoneType.

import mechanicalsoup

def get_response(url):
    browser = mechanicalsoup.StatefulBrowser(
        soup_config={'features': 'lxml'},
        user_agent='Googlebot/2.1: https://www.google.com/bot.html'
    )
    browser.open(url)
    return browser.get_current_page()

get_response("https://jagirhouse.com/sitemap.xml")
@johnhawkinson
Contributor

In MechanicalSoup, a "page" is defined to be a BeautifulSoup object, and it is assumed that BeautifulSoup can only parse HTML. That assumption is not correct: bs4 can parse XML if instructed to do so (see "Parsing XML" in the bs4 documentation).
The workaround, and probably what you want to do, is to preserve the response and then deal with it yourself, whether through bs4 or some other parser.

>>> import mechanicalsoup
>>> import bs4
>>> browser = mechanicalsoup.StatefulBrowser(
...     soup_config={'features': 'lxml'},
...     user_agent='Googlebot/2.1: https://www.google.com/bot.html'
... )
>>> url="https://jagirhouse.com/sitemap.xml"
>>> response=browser.open(url)
>>> len(response.content)
409
>>> response.headers['content-type']
'text/xml; charset=UTF-8'
>>> soup = bs4.BeautifulSoup(response.content, "xml")
>>> soup.loc
<loc>https://jagirhouse.com/jobs-sitemap.xml</loc>
>>> 

Given that page is documented as "Get the current page as a soup object", this code is probably unnecessarily restrictive:

def add_soup(response, soup_config):
    """Attaches a soup object to a requests response."""
    if ("text/html" in response.headers.get("Content-Type", "") or
            Browser.__looks_like_html(response)):
It should probably also check for text/xml and invoke bs4 on it, and it should probably do so before any Browser.__looks_like_html() heuristics run.
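A relaxed version of that check might look like the following. This is only a sketch of the idea, not the actual MechanicalSoup code; _looks_like_html here is a stand-in for the library's private Browser.__looks_like_html() heuristic:

```python
import bs4

def _looks_like_html(response):
    # Hypothetical stand-in for MechanicalSoup's private
    # Browser.__looks_like_html() heuristic.
    body = response.content.strip().lower()
    return body.startswith(b"<!doctype html") or body.startswith(b"<html")

def add_soup(response, soup_config):
    """Attach a soup to the response for HTML *or* XML bodies (sketch)."""
    content_type = response.headers.get("Content-Type", "")
    # Check for XML first, before any looks-like-HTML heuristics run.
    if "text/xml" in content_type or "application/xml" in content_type:
        response.soup = bs4.BeautifulSoup(response.content, "xml")
    elif "text/html" in content_type or _looks_like_html(response):
        response.soup = bs4.BeautifulSoup(response.content, **soup_config)
    else:
        response.soup = None
```

With this shape, the sitemap from the original report would get an XML soup attached instead of no soup at all.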

On the other hand, I guess it could be argued that the targets of URIs that return non-text/html content are not "pages," but if so, the documentation should be more clear.

@moy
Collaborator

moy commented May 29, 2021

We could probably relax some constraints, but if we do so we also need to be more careful with methods like follow_link, Form and friends, which really assume HTML.

@johnhawkinson
Contributor

Would we?

To the extent the XML has "links" (in which case it is probably XHTML or something similar, which, it seems, uses a variety of content types, including application/xhtml+xml), then it seems like follow_link() and friends would work fine.
And if they fail to parse the content, it is mostly no-harm/no-foul, right?

I suppose some unnecessary computational resources may be wasted in the attempt?

@moy
Collaborator

moy commented May 29, 2021

In the case of XHTML, yes, everything should still work (supporting application/xhtml+xml in addition to text/html would be straightforward and just work).

Other varieties of XML may have a notion of links, but with different syntax. For example, RSS and Atom have a <link> tag, so searching for <a href=...> won't work.
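The difference is easy to see with bs4's XML parser. In this sketch (feed content made up for illustration), the link target is the text of a <link> element, so an HTML-style search for anchors with an href attribute finds nothing:

```python
import bs4

rss = b"""<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
    </item>
  </channel>
</rss>"""

soup = bs4.BeautifulSoup(rss, "xml")
# In RSS the target is the text content of <link>, not an href attribute,
# so code that looks for <a href=...> would find no links at all here.
links = [link.text for link in soup.find_all("link")]
assert soup.find_all("a", href=True) == []
```

(Note that bs4's HTML parsers would mangle this, since HTML treats <link> as a void element with no text content; the "xml" parser preserves it.)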
