Adding keep-encoding option to HttpCompressionMiddleware issue #1988 #4021
Conversation
@elacuesta @Gallaecio Am I allowed to touch tests?
The crawler needs to be passed in this line.
@@ -18,6 +18,12 @@
 class HttpCompressionMiddleware(object):
     """This middleware allows compressed (gzip, deflate) traffic to be
     sent/received from web sites"""

+    def __init__(self, crawler):
+        # setting to decide to keep or discard encoding header
I think these comments are unnecessary, the setting should be documented instead.
@@ -45,8 +51,9 @@ def process_response(self, request, response, spider):
                 # responsetypes guessing is reliable
                 kwargs['encoding'] = None
             response = response.replace(**kwargs)
-            if not content_encoding:
+            if not self.keep_encoding_header:
                 del response.headers['Content-Encoding']
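The branch in the hunk above can be sketched in isolation. This is a minimal illustration of what the proposed setting changes, using a plain dict in place of Scrapy's Headers object; the name keep_encoding_header mirrors the PR, everything else here is illustrative.

```python
def strip_encoding_header(headers, keep_encoding_header):
    """After decompressing a response body, drop 'Content-Encoding'
    unless the user asked to keep it."""
    if not keep_encoding_header:
        # default behaviour: header no longer describes the decoded body
        headers.pop('Content-Encoding', None)
    return headers

# Default: the header is removed after decompression.
headers = {'Content-Encoding': 'gzip', 'Content-Type': 'text/html'}
strip_encoding_header(headers, keep_encoding_header=False)

# Opt-in: the header survives, at the risk of no longer matching
# the (now decompressed) body -- which is the reviewer's objection below.
kept = {'Content-Encoding': 'gzip'}
strip_encoding_header(kept, keep_encoding_header=True)
```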
The problem I see with this solution is that responses could end up with a "Content-Encoding" header that does not fully match the actual encoding.
I think the ideal solution for this would be #1877, but while that is not available, Response.flags is probably the least ugly way, as mentioned in #1988 (comment).
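The Response.flags alternative could look roughly like this: instead of keeping a now-inaccurate 'Content-Encoding' header, tag the response with a flag when it has been decompressed. A tiny stand-in Response class is used here so the sketch is self-contained; the real class is scrapy.http.Response, and the flag name 'decompressed' is a hypothetical choice, not anything the PR settled on.

```python
class Response:
    """Minimal stand-in mimicking the two Scrapy Response features
    used here: a headers mapping and a flags list."""

    def __init__(self, headers=None, flags=None):
        self.headers = dict(headers or {})
        self.flags = list(flags or [])

    def replace(self, **kwargs):
        # Like Scrapy's Response.replace: return a new response with
        # the given attributes overridden.
        return Response(
            headers=kwargs.get('headers', self.headers),
            flags=kwargs.get('flags', self.flags),
        )

resp = Response(headers={'Content-Encoding': 'gzip'})

# After decompressing the body: drop the stale header, but record
# what happened in the flags so downstream code can still tell.
decoded = resp.replace(flags=resp.flags + ['decompressed'])
del decoded.headers['Content-Encoding']
```

Downstream code would then check `'decompressed' in response.flags` rather than inspecting a header that no longer describes the body.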
No worries, I put it in just as you were commenting. I need to slow down a little.
So, first, regarding the from_crawler classmethod I overlooked: have it return cls(crawler.settings), similar to the AjaxCrawl middleware. Then, if I continue down that path, instead of passing HttpCompressionMiddleware anything directly, it should be instantiated through the from_crawler method. That should be accessible, since the HttpCompression test imports what the AjaxCrawl test does, in addition to from scrapy.utils.test import get_crawler.
Then, to fix the possible mismatch of the header value: should I double-check that the bytes passed are in the same encoding as the value specified? That would need to happen before the body is decoded. Or are you thinking that the header value on the returned response object should be set to whatever encoding the body was decoded to?
Then, Response flags: it looks like they are passed when constructing a Response object, or through the replace method, which takes keyword pairs. So I could rework the logic at the tail end to simply replace the encoding header value when it is being kept, because I don't see a way to tack on an additional header value.
@VorBoto Sorry, I started my review before seeing your latest comment.
So @elacuesta, this is what I've done now; I'm just not 100% sure how to handle Response.flags. I see the response gets stripped down during processing, but the headers do get placed in the object that will ultimately be used to replace the kwargs. Would that be the place to adjust the 'Content-Encoding' value, once I know what to adjust it to?
There is a problem with cls not being an object that takes parameters.
I'm sorry, everyone, for clogging this up with a bunch of trash. I am going to do things more properly and run tox before I toss things into the system.
Model this after the ajaxcrawl.py middleware: pass the settings to __init__ and store the setting in an instance variable.
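The ajaxcrawl.py-style wiring suggested here can be sketched as follows. The stub Settings and Crawler classes stand in for Scrapy's real ones so the example runs on its own, and the setting name COMPRESSION_KEEP_ENCODING_HEADER is a placeholder, not a name the PR finalized.

```python
class Settings(dict):
    """Stand-in for scrapy.settings.Settings, supporting getbool."""

    def getbool(self, name, default=False):
        return bool(self.get(name, default))


class Crawler:
    """Stand-in for scrapy.crawler.Crawler: just carries settings."""

    def __init__(self, settings):
        self.settings = settings


class HttpCompressionMiddleware:
    def __init__(self, settings):
        # Read the setting once and store it in an instance variable,
        # the same pattern AjaxCrawlMiddleware uses.
        self.keep_encoding_header = settings.getbool(
            'COMPRESSION_KEEP_ENCODING_HEADER')

    @classmethod
    def from_crawler(cls, crawler):
        # The "return cls(crawler.settings)" shape discussed above.
        return cls(crawler.settings)


# Usage: the middleware is built from a crawler, never instantiated
# with the crawler object itself.
crawler = Crawler(Settings({'COMPRESSION_KEEP_ENCODING_HEADER': True}))
mw = HttpCompressionMiddleware.from_crawler(crawler)
```

With this shape, tests can construct a crawler (e.g. via scrapy.utils.test.get_crawler) with the setting overridden and call from_crawler, rather than passing anything to __init__ directly.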