image scraping: check for Content-Type header instead of file extensions #282

Open
clemux opened this issue Jan 16, 2022 · 1 comment

@clemux
Contributor

clemux commented Jan 16, 2022

Archivy currently does not save an image if its URL has no file extension. It should check the Content-Type header of the response instead of the file extension in the src attribute.
However, that requires downloading at least part of each file, because a HEAD request does not always report the real type. For example:

In [8]: r = requests.head(
   ...:     "https://core.telegram.org/file/811140081/1/VldnlW70g2I/a140b0216e7d156ebc"
   ...: )

In [9]: r.headers["Content-Type"]
Out[9]: 'text/html'

In [10]: r = requests.get(
    ...:     "https://core.telegram.org/file/811140081/1/VldnlW70g2I/a140b0216e7d156ebc"
    ...: )

In [11]: r.headers["Content-Type"]
Out[11]: 'image/jpeg'
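
A possible way to limit the download cost, as a minimal sketch rather than Archivy's actual code: send a streamed GET so that `requests` returns as soon as the headers arrive, read `Content-Type`, and close the connection before the body is consumed. The helper names (`parse_mime_type`, `is_image`, `fetch_content_type`) and the timeout value are assumptions made for illustration.

```python
def parse_mime_type(content_type: str) -> str:
    """Return the bare MIME type, dropping parameters such as charset."""
    return content_type.split(";", 1)[0].strip().lower()


def is_image(content_type: str) -> bool:
    """True when the Content-Type header value names an image type."""
    return parse_mime_type(content_type).startswith("image/")


def fetch_content_type(url: str, timeout: float = 10.0) -> str:
    """GET the URL but stop after the headers (hypothetical helper).

    With stream=True, requests defers reading the body until it is
    accessed; leaving the with-block closes the connection first.
    """
    # Imported here so the pure helpers above work without requests installed.
    import requests

    with requests.get(url, stream=True, timeout=timeout) as response:
        return response.headers.get("Content-Type", "")
```

With these helpers, a check such as `is_image(fetch_content_type(src))` could stand in for the extension test. The server may still send some body bytes before the connection closes, so this reduces the download rather than eliminating it.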
@Uzay-G
Member

Uzay-G commented Jan 17, 2022

@clemux did some work on this here: clemux@d0df0d8

Will look at it
