image scraping: check for Content-Type header instead of file extensions #282

clemux · 2022-01-16T12:58:55Z

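Archivy does not currently save images whose URL has no file extension. It should check the Content-Type response header instead of the file extension in the src attribute.
However, a HEAD request is not always enough: some servers only report the real type in response to a GET, so at least part of each file has to be downloaded. For example: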

In [8]: r = requests.head(
   ...:     "https://core.telegram.org/file/811140081/1/VldnlW70g2I/a140b0216e7d156ebc"
   ...: )

In [9]: r.headers["Content-Type"]
Out[9]: 'text/html'

In [10]: r = requests.get(
    ...:     "https://core.telegram.org/file/811140081/1/VldnlW70g2I/a140b0216e7d156ebc"
    ...: )

In [11]: r.headers["Content-Type"]
Out[11]: 'image/jpeg'
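
One way to avoid fetching whole files is a streamed GET that reads only the response headers. This is a rough sketch, not Archivy's actual code: `is_image_content_type` and `url_points_to_image` are hypothetical names, and the 10-second timeout is an arbitrary choice.

```python
from typing import Optional


def is_image_content_type(content_type: Optional[str]) -> bool:
    """Return True if a Content-Type header value names an image media type."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("image/")


def url_points_to_image(url: str, timeout: float = 10.0) -> bool:
    """Issue a streamed GET and inspect only the response headers."""
    # Third-party dependency, imported here so the header check above
    # can be used (and tested) without requests installed.
    import requests

    # stream=True defers downloading the body until it is read, so leaving
    # the with-block closes the connection without fetching the full file.
    with requests.get(url, stream=True, timeout=timeout) as response:
        return is_image_content_type(response.headers.get("Content-Type"))
```

With the URL above, `url_points_to_image(...)` should agree with the GET result rather than the HEAD result, assuming the server behaves as in the session shown.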

Uzay-G · 2022-01-17T10:17:55Z

@clemux did some work on this here: clemux@d0df0d8

Will look at it
