New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
"Content-Encoding" header gets stripped from response headers #1988
Comments
It makes sense to me, +1. Likely the header is changed/deleted to signal that response body is decoded; we should provide another way to do that (add a flag to response.flags?) if we keep the header. The change is backwards incompatible, so I think we should make it optional, disable it by default, but enable in settings.py generated by |
I started to implement this as suggested, but I'm unsure about the settings:
In the test code, it does not effect the value read within the httpcompression class ... |
"Content-Encoding" header is still removed from |
Hello, I am looking to try and help and this seemed the least intimidating task. There is a desire to have a bool that can be adjusted directly in default_settings.py, or in a different file, to prevent HttpCompresionMiddleware.process_responce() from stripping away the content-encoding header value. Could this be done by passing process_response an instance of settings then use settings.getbool() like is done in the HttpErrorMiddleware.init(self, settings)? Also like what some downloadermiddleware like in ajaxcrawl and redirect that have an int with setting passed to them? Also where are downloadermiddleares called or imported I have been looking but not finding yet. |
I believe the way to go is to store It may be even better to use
There’s a setting to define the downloader middlewares that you wish to be loaded. If you find where that setting is used in code, you should be able to locate the point where they are imported. Although I don’t think you need to know to get this implemented. |
Okay, that was my other thought but was worried I'd have to find every instance of HttpCompressionMiddleware being instantiated and add settings as an argument. I figure I can model it after the AjaxCrawlMiddleware that gets passed settings for its init and then has an instance variable for what its looking for: Also, find where |
I got it to "pass" the tox tests after adding an init that stores an instance variable 'to keep or not keep' the headers. When I say pass it's only for py37 but there are still 13 xfails and 56 skips. I'm assuming on the feed the 'x' and 's' instead of '.' represent those respectively. It looks like all but one of the xfails was in test_squeues and the other in test_linkextractors. I did modify the set up in the httpcompression test file by adding On the last 'thing' I pull requested @elacuesta was helpful at guiding me mentioning a few things. First, that in the logic at the tail end might result in the wrong value being kept in 'Content-Encoding' I think they meant that as the response is processed its encoding before To do that it looks like I need to use This should be up to date if seeing the code makes this easier to understand. |
does this still need doing, do you want me to pick this up? Or for that fact any other "good first issue"? if just let me know seems like there are some open PR's however most of the are abandoned |
There is already #4025 |
That said, it has been a while, though, and @VorBoto might have moved on. Maybe someone could look into resuming their work. |
@Gallaecio that's what I saw, I can pick up their branch and finish it up ) |
Sounds good to me. Thanks! |
Does this issue still need to be fixed or has it been resolved? |
Still pending, with 3 open PRs. It may not be as trivial as it sounds. |
See https://github.com/scrapy/scrapy/blob/master/scrapy/downloadermiddlewares/httpcompression.py#L36
IMHO the "Content-Encoding" header should get preserved, since the spider probably wants to see all the original response headers.
The text was updated successfully, but these errors were encountered: