No response for xml page. #373
>>> import mechanicalsoup
>>> import bs4
>>> browser = mechanicalsoup.StatefulBrowser(
soup_config={'features': 'lxml'},
user_agent='Googlebot/2.1: https://www.google.com/bot.html'
)
>>> url="https://jagirhouse.com/sitemap.xml"
>>> response=browser.open(url)
>>> len(response.content)
409
>>> response.headers['content-type']
'text/xml; charset=UTF-8'
>>> soup = bs4.BeautifulSoup(response.content, "xml")
>>> soup.loc
<loc>https://jagirhouse.com/jobs-sitemap.xml</loc>
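The missing page is consistent with a content-type gate: a soup is only attached when the response is text/html, so an XML response leaves the page unset. A minimal sketch of that kind of gate (the function name and return values here are illustrative assumptions, not MechanicalSoup's actual code):

```python
# Illustrative sketch of a text/html content-type gate (an assumption
# about the behavior discussed in this issue, not the real browser.py).
def maybe_attach_soup(content, content_type):
    """Return a parsed 'page' only for text/html responses."""
    mime = content_type.split(';')[0].strip().lower()
    if mime == 'text/html':
        return '<soup for %d bytes>' % len(content)
    # text/xml (and every other non-HTML type) falls through to None,
    # which matches the NoneType the reporter sees for the sitemap.
    return None

print(maybe_attach_soup(b'<urlset/>', 'text/xml; charset=UTF-8'))
print(maybe_attach_soup(b'<html/>', 'text/html; charset=UTF-8'))
```

As the transcript shows, parsing `response.content` with BeautifulSoup directly sidesteps the gate entirely.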
Given that MechanicalSoup/mechanicalsoup/browser.py lines 68 to 71 in b5b42e3 …, it should probably check for … as well. On the other hand, I guess it could be argued that the targets of URIs that do not return text/html content are not "pages," but if so, the documentation should be clearer. |
We could probably relax some constraints, but if we do so we also need to be more careful with methods like …. |
Would we? To the extent the XML has "links" (in which case it is probably XHTML or something which… huh, uses a variety of content-types, it seems, including …). I suppose there may be some unnecessary computational resources blown in the attempt? |
In the case of XHTML, yes, everything should still work (supporting …). Other varieties of XML may have a notion of links, but with different syntax. For example, RSS and Atom have a … element. |
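The syntax difference is easy to see by parsing both flavors with the standard library. The payloads below are made-up samples (the sitemap URL echoes the transcript above): a sitemap uses `<loc>`, while RSS uses a `<link>` element for the same idea.

```python
import xml.etree.ElementTree as ET

# Made-up sample documents for illustration only.
sitemap = b"""<?xml version="1.0"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://jagirhouse.com/jobs-sitemap.xml</loc></sitemap>
</sitemapindex>"""

rss = b"""<?xml version="1.0"?>
<rss version="2.0"><channel>
  <item><link>https://example.com/post-1</link></item>
</channel></rss>"""

# Sitemaps live in a namespace, so the query must be qualified;
# plain RSS 2.0 elements are un-namespaced.
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
sitemap_links = [e.text for e in ET.fromstring(sitemap).findall('.//sm:loc', ns)]
rss_links = [e.text for e in ET.fromstring(rss).findall('.//item/link')]

print(sitemap_links)
print(rss_links)
```

A generic "follow links in XML" feature would therefore need per-dialect element names, unlike the single `<a href>` convention of (X)HTML.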
Python version: 3.8.5
I am expecting to get the XML page, but I get NoneType instead.