Suggestions for improvement on this time-consuming function: __looks_like_html() #432

Open
cdhigh opened this issue Jan 28, 2024 · 1 comment

cdhigh commented Jan 28, 2024

In practice, this function can be very time-consuming when it encounters image files or other large binary files. I suggest changing it to this or something similar:

import re

def __looks_like_html(response):
    """Guesses entity type when Content-Type header is missing.
    Since Content-Type is not strictly required, some servers leave it out.
    """
    # Old check, which decodes the whole body first:
    #text = response.text.lstrip().lower()
    #return text.startswith('<html') or text.startswith('<!doctype')
    # Only inspect the first 200 bytes of the raw body.
    return re.search(br'<html|<!doctype', response.content[:200]) is not None

moy (Collaborator) commented Feb 8, 2024

The suggestion looks good, but I think you need to add re.IGNORECASE as the third argument to re.search so that it also matches uppercase tags. Can you turn this into a proper pull request?

Thanks,
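
For reference, the two suggestions in this thread might be combined roughly as follows. This is only a sketch, not the project's actual code: the name looks_like_html, the prefix_len parameter, and the SimpleNamespace stand-ins for response objects are all assumptions made for illustration.

```python
import re
from types import SimpleNamespace

# Compiled once; re.IGNORECASE also matches <HTML> and <!DOCTYPE>.
_HTML_START = re.compile(br'<html|<!doctype', re.IGNORECASE)

def looks_like_html(response, prefix_len=200):
    """Guess whether a response body is HTML from its first bytes only."""
    # Slicing the raw bytes avoids decoding a large binary body to text.
    return _HTML_START.search(response.content[:prefix_len]) is not None

# Hypothetical stand-ins for real response objects (only .content is used).
html = SimpleNamespace(content=b'  <!DOCTYPE html><HTML><body></body></HTML>')
png = SimpleNamespace(content=b'\x89PNG\r\n\x1a\n' + b'\x00' * 1_000_000)

print(looks_like_html(html))  # True
print(looks_like_html(png))   # False
```

Note that, unlike the original startswith check, this matches the tag anywhere in the first 200 bytes rather than only at the start of the stripped body.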
