
Adding keep-encoding option to HttpCompressionMiddleware (issue #1988) #4021

Closed
wants to merge 10 commits

Conversation

@VorBoto commented Sep 16, 2019

Modeled after the ajaxcrawl.py middleware: the settings are passed to __init__ and the setting is stored in an instance variable.
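Roughly, the pattern being borrowed from ajaxcrawl.py would look like the sketch below; the setting name used here is only a placeholder, not necessarily the final one.

    class HttpCompressionMiddleware(object):
        """This middleware allows compressed (gzip, deflate) traffic to be
        sent/received from web sites"""

        def __init__(self, settings):
            # Placeholder setting name; the keep-or-discard choice is stored
            # on the instance, as in AjaxCrawlMiddleware.
            self.keep_encoding_header = settings.getbool('COMPRESSION_KEEP_ENCODING_HEADER')

        @classmethod
        def from_crawler(cls, crawler):
            # Same pattern as AjaxCrawlMiddleware: build the instance from crawler.settings.
            return cls(crawler.settings)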

@VorBoto (Author) commented Sep 16, 2019

@elacuesta @Gallaecio
Looking at the Travis CI runs and at the tests for both httpcompression and ajaxcrawl: AjaxCrawlMiddleware is instantiated through a from_crawler class method, which I assume is what the @classmethod decorator enables with return cls(crawler.settings). For httpcompression, though, the tests just call HttpCompressionMiddleware(). Would I need to go through and add settings support or a from_crawler class method to HttpCompressionMiddleware and load it in the tests the way the AjaxCrawl one is loaded? Or would it be enough to go back to an __init__(self, settings) signature for HttpCompressionMiddleware and pass the settings to it in each test?

Am I allowed to touch tests?

@elacuesta (Member) left a comment


The crawler needs to be passed in this line

@@ -18,6 +18,12 @@
 class HttpCompressionMiddleware(object):
     """This middleware allows compressed (gzip, deflate) traffic to be
     sent/received from web sites"""
 
+    def __int__(self, crawler):
+        # setting to decide to keep or discard encoding header
@elacuesta (Member):

I think these comments are unnecessary, the setting should be documented instead.

@@ -45,8 +51,9 @@ def process_response(self, request, response, spider):
                     # responsetypes guessing is reliable
                     kwargs['encoding'] = None
                 response = response.replace(**kwargs)
-                if not content_encoding:
-                    del response.headers['Content-Encoding']
+                if not self.keep_encoding_header:
@elacuesta (Member) commented Sep 16, 2019

The problem I see with this solution is that responses could end up with a "Content-Encoding" header that does not fully match the actual encoding.
I think the ideal solution for this would be #1877, but while that is not available, Response.flags is probably the less ugly way, as mentioned in #1988 (comment)
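For illustration, the Response.flags route could look roughly like the sketch below; the 'decompressed' flag name is only an example, and whether the Content-Encoding header is then kept or dropped is exactly what the linked discussion is about.

    def add_decompressed_flag(response):
        """Illustration only: record the fact that the body was decompressed
        in Response.flags instead of relying on the Content-Encoding header."""
        return response.replace(flags=response.flags + ['decompressed'])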

@VorBoto (Author):

No worries, I put it in just as you were commenting. I need to slow down a little.

So first, for the from_crawler classmethod I overlooked: have it return cls(crawler.settings), similar to the AjaxCrawl one. Then, if I continue down that path, HttpCompressionMiddleware shouldn't be passed anything directly; it should be instantiated through the from_crawler method. That should be accessible, since the HttpCompression test imports the same things the AjaxCrawl test does, apart from from scrapy.utils.test import get_crawler.

Then, to fix the possible mismatch of the header value: should I double-check that the bytes passed are in the same encoding as the value specified? That would need to happen before the body is decoded. Or are you thinking that the 'Content-Encoding' header on the returned response object should be set to whatever the body was actually decoded to?

Then, response flags. It looks like they are passed when constructing a Response object, or through the replace method, which takes keyword arguments. So the logic at the tail end could be reworked to simply replace the encoding header value when it is being kept, since I don't see a way to tack on an additional header value.
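For the sake of discussion, that tail end could be reshaped along these lines; this is only a rough sketch of the idea, and both keep_encoding_header and the 'decompressed' flag name are placeholders rather than the actual patch.

    def finish_response(response, kwargs, keep_encoding_header):
        """Sketch only: apply the replace() kwargs, then either keep the
        Content-Encoding header (recording the decompression in a flag)
        or drop the header as the middleware does today."""
        response = response.replace(**kwargs)
        if keep_encoding_header:
            response.flags.append('decompressed')  # placeholder flag name
        else:
            # the header is present here because the body was just decompressed
            del response.headers['Content-Encoding']
        return response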

@elacuesta (Member) commented

@VorBoto Sorry, I started my review before seeing your latest comment.
Indeed, you are not only allowed to touch tests, I'd say you are encouraged to do so. In this particular case, new tests are probably needed as well.

@VorBoto (Author) commented Sep 17, 2019

So @elacuesta, this is what I've done now:
I removed the comments.
Placed return cls(crawler.settings) in the from_crawler method.
Placed self.mw = HttpCompressionMiddleware.from_crawler(crawler) in its test; a sketch of how that might look is below.
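Concretely, the test setUp might look roughly like this sketch, following the AjaxCrawl test; the spider name string is a placeholder, and this is not the repository's exact test file.

    from unittest import TestCase

    from scrapy.downloadermiddlewares.httpcompression import HttpCompressionMiddleware
    from scrapy.spiders import Spider
    from scrapy.utils.test import get_crawler


    class HttpCompressionTest(TestCase):

        def setUp(self):
            # get_crawler builds a Crawler around the given spider class, so the
            # middleware can be created the same way Scrapy creates it at runtime.
            crawler = get_crawler(Spider)
            self.spider = crawler._create_spider('example.com')  # placeholder name
            self.mw = HttpCompressionMiddleware.from_crawler(crawler)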

I'm just not 100% sure how to handle Response.flags. I see the response gets stripped down during processing, but the headers do get placed in the object that will ultimately be used for the replace kwargs. Would that be the place to adjust the 'Content-Encoding' value, once I know what to adjust it to?

@VorBoto (Author) commented Sep 17, 2019

There is a problem with cls not being an object that takes parameters.
I will need to read up on the @classmethod decorator and cls before I attempt this again, I think.
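For reference, the "takes no parameters" error is consistent with the __int__ typo in the diff above: because __init__ is misspelled, it is never overridden, so cls(crawler.settings) falls back to object's no-argument constructor. A minimal illustration:

    class Example(object):
        def __int__(self, settings):  # typo: defines __int__, so __init__ is never overridden
            self.settings = settings

        @classmethod
        def from_crawler(cls, crawler):
            # cls is the class itself, so this is Example(crawler.settings); with the
            # typo above it falls back to object.__init__, which takes no arguments.
            return cls(crawler.settings)

    # Renaming __int__ to __init__ makes cls(crawler.settings) work as intended.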

@VorBoto (Author) commented Sep 17, 2019

I'm sorry, everyone, for clogging this up with a bunch of trash. I am going to do things more properly and run tox before I just toss things into the system.
