Suggestions for improvement on this time-consuming function: __looks_like_html() #432

cdhigh · 2024-01-28T13:04:52Z

In practice, this function can be very time-consuming when it encounters image files or other large binary files. It is recommended to change it to this or something similar:

def __looks_like_html(response):
    """Guesses entity type when Content-Type header is missing.
    Since Content-Type is not strictly required, some servers leave it out.
    """
    # text = response.text.lstrip().lower()
    # return text.startswith('<html') or text.startswith('<!doctype')
    return re.search(br'<html|<!doctype', response.content[:200]) is not None

MechanicalSoup/mechanicalsoup/browser.py

Line 62 in 91b1207

def __looks_like_html(response):

moy · 2024-02-08T08:42:43Z

The suggestion looks good, but I think you need to add re.IGNORECASE as the third argument to re.search so that uppercase tags also match. Can you turn this into a proper pull request?

Thanks,
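
For illustration, the two suggestions in this thread could be combined roughly as follows. This is a minimal sketch, not the code merged into MechanicalSoup: the module-level names `_HTML_START` and `_SNIFF_BYTES`, the public function name `looks_like_html`, and the `FakeResponse` stand-in are all hypothetical, and the 200-byte prefix is simply the value proposed above.

```python
import re

# Hypothetical names for illustration; not taken from MechanicalSoup itself.
# Byte pattern with re.IGNORECASE so that <HTML and <!DOCTYPE also match.
_HTML_START = re.compile(br'<html|<!doctype', re.IGNORECASE)
_SNIFF_BYTES = 200  # prefix length proposed in the issue


def looks_like_html(response):
    """Guess whether a response is HTML by inspecting only its first bytes.

    Reading response.content[:N] avoids decoding the whole body to text,
    which is the costly step for large binary payloads.
    """
    return _HTML_START.search(response.content[:_SNIFF_BYTES]) is not None


class FakeResponse:
    """Stand-in for a response object; only the .content attribute is used."""

    def __init__(self, content):
        self.content = content


if __name__ == "__main__":
    print(looks_like_html(FakeResponse(b'<!DOCTYPE html><html></html>')))
    print(looks_like_html(FakeResponse(b'\x89PNG\r\n\x1a\n' + b'\x00' * 1000)))
```

Note that `search` matches anywhere within the inspected prefix, whereas the original code required the stripped text to start with the tag. If the stricter behaviour matters, anchoring the pattern (for example `br'\s*(<html|<!doctype)'` with `match` instead of `search`) would be one way to keep it.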
