
"Content-Encoding" header gets stripped from response headers #1988

Open
mborho opened this issue May 12, 2016 · 14 comments · May be fixed by #4025, #5290 or #5943

Comments

@mborho

mborho commented May 12, 2016

See https://github.com/scrapy/scrapy/blob/master/scrapy/downloadermiddlewares/httpcompression.py#L36

IMHO the "Content-Encoding" header should be preserved, since the spider probably wants to see all the original response headers.

@kmike
Member

kmike commented May 16, 2016

It makes sense to me, +1. Likely the header is changed/deleted to signal that the response body is decoded; we should provide another way to do that (add a flag to response.flags?) if we keep the header.

The change is backwards incompatible, so I think we should make it optional: disabled by default, but enabled in the settings.py generated by scrapy startproject.
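
A minimal sketch of how the flag variant could look (decode_body is a hypothetical helper and "decoded" an illustrative flag name, not actual Scrapy API):

    # Sketch only: preserve the original Content-Encoding header and signal
    # the decode through response.flags instead of deleting the header.
    def process_response(self, request, response, spider):
        encoding = response.headers.get('Content-Encoding')
        if encoding:
            body = decode_body(response.body, encoding)  # hypothetical helper
            # the flag tells the spider the body no longer matches the header
            return response.replace(body=body, flags=response.flags + ['decoded'])
        return response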

@foromer4
Contributor

foromer4 commented May 26, 2016

I started to implement this as suggested, but I'm unsure about the settings in the httpcompression class. I can access the global settings by calling Settings(), but how can I change the value there for test purposes? If I use

    settings = Settings()
    settings.set('HTTPCACHE_REMOVE_ENCODING_HEADER_ON_DECODED_RESPONSE', 'False', 'spider')

in the test code, it does not affect the value read within the httpcompression class ...
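
The reason the override is invisible: a fresh Settings() created inside the middleware only holds the defaults, not the object the test mutated. The usual pattern is to hand the crawler's settings to the middleware via from_crawler and to build the crawler in tests with scrapy.utils.test.get_crawler. A sketch, assuming the middleware gains a from_crawler classmethod and reusing the setting name from the snippet above:

    # Test sketch: get_crawler is a real helper; from_crawler on this
    # middleware is assumed here (it is what the proposed change would add).
    from scrapy.utils.test import get_crawler
    from scrapy.downloadermiddlewares.httpcompression import HttpCompressionMiddleware

    crawler = get_crawler(settings_dict={
        'COMPRESSION_ENABLED': True,
        'HTTPCACHE_REMOVE_ENCODING_HEADER_ON_DECODED_RESPONSE': False,
    })
    mw = HttpCompressionMiddleware.from_crawler(crawler)  # sees the override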

@trim21

trim21 commented Sep 16, 2018

"Content-Encoding" header is still removed from response.header and can't be disabled.
Is there any progress about this issue up to now?
Can I provide any help?

@VorBoto

VorBoto commented Sep 11, 2019

Hello, I am looking to help, and this seemed the least intimidating task.
I want to check that I have my bearings on what needs to be done.

There is a desire to have a bool that can be adjusted directly in default_settings.py, or in a different file, to prevent HttpCompressionMiddleware.process_response() from stripping away the Content-Encoding header value.

Could this be done by passing process_response an instance of settings and then using settings.getbool(), like is done in HttpErrorMiddleware.__init__(self, settings)? Or like some downloader middlewares, such as ajaxcrawl and redirect, that have an int setting passed to them?
Or would a class variable, like the enabled_setting that redirect.BaseRedirectMiddleware has, be preferred?

Also, where are downloader middlewares called or imported? I have been looking but haven't found it yet.

VorBoto pushed a commit to VorBoto/scrapy that referenced this issue Sep 14, 2019
@Gallaecio
Member

Could this be done by passing process_response an instance of settings and then using settings.getbool(), like is done in HttpErrorMiddleware.__init__(self, settings)? Or like some downloader middlewares, such as ajaxcrawl and redirect, that have an int setting passed to them?
Or would a class variable, like the enabled_setting that redirect.BaseRedirectMiddleware has, be preferred?

I believe the way to go is to store crawler.settings into self.settings in __init__, and then use self.settings.getbool() from process_response.

It may be even better to use self.settings.getbool() in __init__ and save the settings value to an instance variable instead; I'm simply not sure whether the settings are fully populated by that time, as I'm not yet familiar enough with the settings life cycle to know that off the top of my head.
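
A sketch of that second variant, with a hypothetical setting name (the actual name was still under discussion in the PRs):

    # Sketch of the from_crawler + __init__ pattern suggested above;
    # HTTPCOMPRESSION_KEEP_ENCODING_HEADER is an illustrative name only.
    class HttpCompressionMiddleware:
        def __init__(self, settings):
            # read the bool once, at construction time
            self.keep_encoding_header = settings.getbool(
                'HTTPCOMPRESSION_KEEP_ENCODING_HEADER', False)

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler.settings)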

Also, where are downloader middlewares called or imported? I have been looking but haven't found it yet.

There’s a setting that defines the downloader middlewares you wish to be loaded. If you find where that setting is used in the code, you should be able to locate the point where they are imported, although I don’t think you need to know that to get this implemented.
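
For context, that is the DOWNLOADER_MIDDLEWARES setting, merged at run time with DOWNLOADER_MIDDLEWARES_BASE from default_settings.py. For example, in a project's settings.py:

    # settings.py: downloader middlewares are enabled and ordered by import path
    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    }

(590 matches the default order this middleware has in DOWNLOADER_MIDDLEWARES_BASE.)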

@VorBoto

VorBoto commented Sep 16, 2019

Okay, that was my other thought, but I was worried I'd have to find every instance of HttpCompressionMiddleware being instantiated and add settings as an argument. I figure I can model it after AjaxCrawlMiddleware, which gets passed settings in its __init__ and then keeps an instance variable for what it's looking for: self.lookup_bytes = settings.getint('AJAXCRAWL_MAXSIZE', 32768)?

Also, I need to find where AJAXCRAWL_ENABLED and AJAXCRAWL_MAXSIZE are stored so I can place HTTPCOMPRESSION_HEADERS_KEEP in the (hopefully) correct file alongside them. (AJAXCRAWL_ENABLED at least is in default_settings.py.)
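
For reference, the defaults live in scrapy/settings/default_settings.py, so the new entry could sit right next to the AJAXCRAWL_* ones, along these lines (HTTPCOMPRESSION_HEADERS_KEEP is the name proposed here, not a shipped setting):

    # scrapy/settings/default_settings.py (sketch)
    AJAXCRAWL_ENABLED = False             # existing default, shown for context
    HTTPCOMPRESSION_HEADERS_KEEP = False  # proposed default keeps today's stripping behaviour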

VorBoto pushed 3 commits to VorBoto/scrapy that referenced this issue Sep 16, 2019
@VorBoto

VorBoto commented Sep 17, 2019

I got it to "pass" the tox tests after adding an __init__ that stores an instance variable for whether to keep the headers. When I say pass, it's only for py37, and there are still 13 xfails and 56 skips. I'm assuming that in the output the 'x' and 's' (instead of '.') represent those, respectively. It looks like all but one of the xfails were in test_squeues, and the other was in test_linkextractors.

I did modify the setup in the httpcompression test file by adding from scrapy.utils.test import get_crawler and creating a crawler with both HTTPCOMPRESSION_HEADERS_KEEP and COMPRESSION_ENABLED passed in the settings dict. It failed without COMPRESSION_ENABLED because the from_crawler method used to instantiate the test's self.mw checks first thing whether compression is enabled.

On the last PR I opened, @elacuesta was helpful in guiding me and mentioned a few things. First, that the logic at the tail end might result in the wrong value being kept in 'Content-Encoding'. I think they meant that, since the body is decoded during process_response, the encoding the response arrives with will differ from the encoding it leaves with, so the preserved 'Content-Encoding' value would be wrong with respect to the body actually returned in the processed response. Could you confirm I'm understanding that correctly?
Second, the use of custom metadata in the response was discussed, but they deferred to what @kmike brought up above: placing a flag in the resulting response.

To do that, it looks like I need to use Response.replace(self, *args, **kwargs), which is already used in HttpCompressionMiddleware.process_response. So I was thinking I could add a new entry, corresponding to the proper 'Content-Encoding', to the dict that is ultimately passed to the response.replace call.

This should be up to date if seeing the code makes this easier to understand.
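
A sketch of that last idea, with illustrative names (decoded_body and remaining_encodings stand for the decoded payload and for whatever encodings the middleware did not decode):

    # Sketch only: keep Content-Encoding consistent with the body actually
    # returned, and mark the decode via response.flags as suggested above.
    kwargs = {'body': decoded_body, 'flags': response.flags + ['decoded']}
    if self.keep_encoding_header and remaining_encodings:
        # keep the header, listing only the encodings that still apply
        response.headers['Content-Encoding'] = remaining_encodings
    else:
        del response.headers['Content-Encoding']  # current behaviour
    return response.replace(**kwargs)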

@T0shik

T0shik commented Oct 20, 2021

Does this still need doing? Do you want me to pick it up, or for that matter any other "good first issue"? Just let me know. It seems like there are some open PRs, but most of them are abandoned.

@Gallaecio
Member

There is already #4025.

@Gallaecio
Member

That said, it has been a while, and @VorBoto might have moved on. Maybe someone could look into resuming their work.

@T0shik

T0shik commented Oct 21, 2021

@Gallaecio that's what I saw; I can pick up their branch and finish it up :)

@Gallaecio
Member

Gallaecio commented Oct 21, 2021

Sounds good to me. Thanks!

@delaneyscofield

Does this issue still need to be fixed or has it been resolved?

@Gallaecio
Member

Still pending, with 3 open PRs. It may not be as trivial as it sounds.
