
"Content-Encoding" header gets stripped from response headers #1988

Open
mborho opened this issue May 12, 2016 · 14 comments · May be fixed by #4025, #5290 or #5943

Comments

@mborho

mborho commented May 12, 2016

See https://github.com/scrapy/scrapy/blob/master/scrapy/downloadermiddlewares/httpcompression.py#L36

IMHO the "Content-Encoding" header should be preserved, since the spider probably wants to see all the original response headers.

@kmike
Member

kmike commented May 16, 2016

It makes sense to me, +1. Likely the header is changed/deleted to signal that the response body is decoded; we should provide another way to do that (add a flag to response.flags?) if we keep the header.

The change is backwards incompatible, so I think we should make it optional: disabled by default, but enabled in the settings.py generated by scrapy startproject.
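
A minimal sketch of how the flag variant could look (decode_body is a hypothetical helper and "decoded" an illustrative flag name, not actual Scrapy API):

    # Sketch only: preserve the original Content-Encoding header and signal
    # the decode through response.flags instead of deleting the header.
    def process_response(self, request, response, spider):
        encoding = response.headers.get('Content-Encoding')
        if encoding:
            body = decode_body(response.body, encoding)  # hypothetical helper
            # the flag tells the spider the body no longer matches the header
            return response.replace(body=body, flags=response.flags + ['decoded'])
        return response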

@foromer4
Contributor

foromer4 commented May 26, 2016

I started to implement this as suggested, but I'm unsure about the settings in the httpcompression class. I can access the global settings by calling Settings(), but how can I change the value there for test purposes? If I use

    settings = Settings()
    settings.set('HTTPCACHE_REMOVE_ENCODING_HEADER_ON_DECODED_RESPONSE', 'False', 'spider')

in the test code, it does not affect the value read within the httpcompression class ...
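
The reason the override is invisible: a fresh Settings() created inside the middleware only holds the defaults, not the object the test mutated. The usual pattern is to hand the crawler's settings to the middleware via from_crawler and to build the crawler in tests with scrapy.utils.test.get_crawler. A sketch, assuming the middleware gains a from_crawler classmethod and reusing the setting name from the snippet above:

    # Test sketch: get_crawler is a real helper; from_crawler on this
    # middleware is assumed here (it is what the proposed change would add).
    from scrapy.utils.test import get_crawler
    from scrapy.downloadermiddlewares.httpcompression import HttpCompressionMiddleware

    crawler = get_crawler(settings_dict={
        'COMPRESSION_ENABLED': True,
        'HTTPCACHE_REMOVE_ENCODING_HEADER_ON_DECODED_RESPONSE': False,
    })
    mw = HttpCompressionMiddleware.from_crawler(crawler)  # sees the override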

@trim21

trim21 commented Sep 16, 2018

"Content-Encoding" header is still removed from response.header and can't be disabled.
Is there any progress about this issue up to now?
Can I provide any help?

@VorBoto

VorBoto commented Sep 11, 2019

Hello, I am looking to help, and this seemed the least intimidating task.
I want to check that I have my bearings on what needs to be done.

There is a desire to have a bool that can be adjusted directly in default_settings.py, or in a different file, to prevent HttpCompressionMiddleware.process_response() from stripping away the Content-Encoding header value.

Could this be done by passing process_response an instance of settings and then using settings.getbool(), like is done in HttpErrorMiddleware.__init__(self, settings)? Or like some downloader middlewares, such as ajaxcrawl and redirect, that have an int setting passed to them?
Or would a class variable, like the enabled_setting that redirect.BaseRedirectMiddleware has, be preferred?

Also, where are downloader middlewares called or imported? I have been looking but haven't found it yet.

VorBoto pushed a commit to VorBoto/scrapy that referenced this issue Sep 14, 2019
@Gallaecio
Member

Could this be done by passing process_response an instance of settings and then using settings.getbool(), like is done in HttpErrorMiddleware.__init__(self, settings)? Or like some downloader middlewares, such as ajaxcrawl and redirect, that have an int setting passed to them?
Or would a class variable, like the enabled_setting that redirect.BaseRedirectMiddleware has, be preferred?

I believe the way to go is to store crawler.settings into self.settings in __init__, and then use self.settings.getbool() from process_response.

It may be even better to use self.settings.getbool() in __init__ and save the settings value to an instance variable instead; I'm simply not sure whether the settings are fully populated by that time, as I'm not yet familiar enough with the settings life cycle to know that off the top of my head.
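
A sketch of that second variant, with a hypothetical setting name (the actual name was still under discussion in the PRs):

    # Sketch of the from_crawler + __init__ pattern suggested above;
    # HTTPCOMPRESSION_KEEP_ENCODING_HEADER is an illustrative name only.
    class HttpCompressionMiddleware:
        def __init__(self, settings):
            # read the bool once, at construction time
            self.keep_encoding_header = settings.getbool(
                'HTTPCOMPRESSION_KEEP_ENCODING_HEADER', False)

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.settings)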

Also, where are downloader middlewares called or imported? I have been looking but haven't found it yet.

There’s a setting that defines the downloader middlewares you wish to be loaded. If you find where that setting is used in the code, you should be able to locate the point where they are imported, although I don’t think you need to know that to get this implemented.
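
For context, that is the DOWNLOADER_MIDDLEWARES setting, merged at run time with DOWNLOADER_MIDDLEWARES_BASE from default_settings.py. For example, in a project's settings.py:

    # settings.py: downloader middlewares are enabled and ordered by import path
    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    }

(590 matches the default order this middleware has in DOWNLOADER_MIDDLEWARES_BASE.)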

@VorBoto

VorBoto commented Sep 16, 2019

Okay, that was my other thought, but I was worried I'd have to find every instance of HttpCompressionMiddleware being instantiated and add settings as an argument. I figure I can model it after AjaxCrawlMiddleware, which gets passed settings in its __init__ and then keeps an instance variable for what it's looking for: self.lookup_bytes = settings.getint('AJAXCRAWL_MAXSIZE', 32768)?

Also, I need to find where AJAXCRAWL_ENABLED and AJAXCRAWL_MAXSIZE are stored so I can place HTTPCOMPRESSION_HEADERS_KEEP in the (hopefully) correct file alongside them. (AJAXCRAWL_ENABLED at least is in default_settings.py.)
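
For reference, the defaults live in scrapy/settings/default_settings.py, so the new entry could sit right next to the AJAXCRAWL_* ones, along these lines (HTTPCOMPRESSION_HEADERS_KEEP is the name proposed here, not a shipped setting):

    # scrapy/settings/default_settings.py (sketch)
    AJAXCRAWL_ENABLED = False             # existing default, shown for context
    HTTPCOMPRESSION_HEADERS_KEEP = False  # proposed default keeps today's stripping behaviour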

VorBoto pushed 3 commits to VorBoto/scrapy that referenced this issue Sep 16, 2019
@VorBoto

VorBoto commented Sep 17, 2019

I got it to "pass" the tox tests after adding an __init__ that stores an instance variable for whether to keep the headers. When I say pass, it's only for py37, and there are still 13 xfails and 56 skips. I'm assuming that in the output the 'x' and 's' (instead of '.') represent those, respectively. It looks like all but one of the xfails were in test_squeues, and the other was in test_linkextractors.

I did modify the setup in the httpcompression test file by adding from scrapy.utils.test import get_crawler and creating a crawler with both HTTPCOMPRESSION_HEADERS_KEEP and COMPRESSION_ENABLED passed in the settings dict. It failed without COMPRESSION_ENABLED because the from_crawler method used to instantiate the test's self.mw checks first thing whether compression is enabled.

On the last PR I opened, @elacuesta was helpful in guiding me and mentioned a few things. First, that the logic at the tail end might result in the wrong value being kept in 'Content-Encoding'. I think they meant that, since the body is decoded during process_response, the encoding the response arrives with will differ from the encoding it leaves with, so the preserved 'Content-Encoding' value would be wrong with respect to the body actually returned in the processed response. Could you confirm I'm understanding that correctly?
Second, the use of custom metadata in the response was discussed, but they deferred to what @kmike brought up above: placing a flag in the resulting response.

To do that, it looks like I need to use Response.replace(self, *args, **kwargs), which is already used in HttpCompressionMiddleware.process_response. So I was thinking I could add a new entry, corresponding to the proper 'Content-Encoding', to the dict that is ultimately passed to the response.replace call.

This should be up to date if seeing the code makes this easier to understand.
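
A sketch of that last idea, with illustrative names (decoded_body and remaining_encodings stand for the decoded payload and for whatever encodings the middleware did not decode):

    # Sketch only: keep Content-Encoding consistent with the body actually
    # returned, and mark the decode via response.flags as suggested above.
    kwargs = {'body': decoded_body, 'flags': response.flags + ['decoded']}
    if self.keep_encoding_header and remaining_encodings:
        # keep the header, listing only the encodings that still apply
        response.headers['Content-Encoding'] = remaining_encodings
    else:
        del response.headers['Content-Encoding']  # current behaviour
    return response.replace(**kwargs)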

@T0shik

T0shik commented Oct 20, 2021

Does this still need doing? Do you want me to pick it up, or for that matter any other "good first issue"? Just let me know. It seems like there are some open PRs, but most of them are abandoned.

@Gallaecio
Member

There is already #4025.

@Gallaecio
Member

That said, it has been a while, and @VorBoto might have moved on. Maybe someone could look into resuming their work.

@T0shik

T0shik commented Oct 21, 2021

@Gallaecio that's what I saw; I can pick up their branch and finish it up :)

@Gallaecio
Member

Gallaecio commented Oct 21, 2021

Sounds good to me. Thanks!

@delaneyscofield

Does this issue still need to be fixed or has it been resolved?

@Gallaecio
Member

Still pending, with 3 open PRs. It may not be as trivial as it sounds.
