Adding keep-encoding option to HttpCompressionMiddleware issue #1988 #4021
Conversation
@elacuesta @Gallaecio Am I allowed to touch tests?
The crawler needs to be passed in this line.
@@ -18,6 +18,12 @@
 class HttpCompressionMiddleware(object):
     """This middleware allows compressed (gzip, deflate) traffic to be
     sent/received from web sites"""

+    def __init__(self, crawler):
+        # setting to decide to keep or discard encoding header
I think these comments are unnecessary, the setting should be documented instead.
@@ -45,8 +51,9 @@ def process_response(self, request, response, spider):
                 # responsetypes guessing is reliable
                 kwargs['encoding'] = None
             response = response.replace(**kwargs)
-            if not content_encoding:
+            if not self.keep_encoding_header:
                 del response.headers['Content-Encoding']
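The branch in the hunk above can be sketched in isolation. This is a minimal illustration of what the proposed setting changes, using a plain dict in place of Scrapy's Headers object; the name keep_encoding_header mirrors the PR, everything else here is illustrative.

```python
def strip_encoding_header(headers, keep_encoding_header):
    """After decompressing a response body, drop 'Content-Encoding'
    unless the user asked to keep it."""
    if not keep_encoding_header:
        # default behaviour: header no longer describes the decoded body
        headers.pop('Content-Encoding', None)
    return headers

# Default: the header is removed after decompression.
headers = {'Content-Encoding': 'gzip', 'Content-Type': 'text/html'}
strip_encoding_header(headers, keep_encoding_header=False)

# Opt-in: the header survives, at the risk of no longer matching
# the (now decompressed) body -- which is the reviewer's objection below.
kept = {'Content-Encoding': 'gzip'}
strip_encoding_header(kept, keep_encoding_header=True)
```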
The problem I see with this solution is that responses could end up with a "Content-Encoding" header that does not fully match the actual encoding.
I think the ideal solution for this would be #1877, but while that is not available, Response.flags is probably the least ugly way, as mentioned in #1988 (comment).
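The Response.flags alternative could look roughly like this: instead of keeping a now-inaccurate 'Content-Encoding' header, tag the response with a flag when it has been decompressed. A tiny stand-in Response class is used here so the sketch is self-contained; the real class is scrapy.http.Response, and the flag name 'decompressed' is a hypothetical choice, not anything the PR settled on.

```python
class Response:
    """Minimal stand-in mimicking the two Scrapy Response features
    used here: a headers mapping and a flags list."""

    def __init__(self, headers=None, flags=None):
        self.headers = dict(headers or {})
        self.flags = list(flags or [])

    def replace(self, **kwargs):
        # Like Scrapy's Response.replace: return a new response with
        # the given attributes overridden.
        return Response(
            headers=kwargs.get('headers', self.headers),
            flags=kwargs.get('flags', self.flags),
        )

resp = Response(headers={'Content-Encoding': 'gzip'})

# After decompressing the body: drop the stale header, but record
# what happened in the flags so downstream code can still tell.
decoded = resp.replace(flags=resp.flags + ['decompressed'])
del decoded.headers['Content-Encoding']
```

Downstream code would then check `'decompressed' in response.flags` rather than inspecting a header that no longer describes the body.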
No worries, I put it in just as you were commenting. I need to slow down a little.
So, first, regarding the from_crawler classmethod I overlooked: have it return cls(crawler.settings), similar to the AjaxCrawl middleware. Then, if I continue down that path, instead of passing HttpCompressionMiddleware anything directly, it should be instantiated through the from_crawler method. That should be accessible, since the HttpCompression test imports what the AjaxCrawl test does, in addition to from scrapy.utils.test import get_crawler.
Then, to fix the possible mismatch of the header value: should I double-check that the bytes passed are in the same encoding as the value specified? That would need to happen before the body is decoded. Or are you thinking that the header value on the returned response object should be set to whatever encoding the body was decoded to?
Then, Response flags: it looks like they are passed when constructing a Response object, or through the replace method, which takes keyword pairs. So I could rework the logic at the tail end to simply replace the encoding header value when it is being kept, because I don't see a way to tack on an additional header value.
@VorBoto Sorry, I started my review before seeing your latest comment.
So @elacuesta, this is what I've done now; I'm just not 100% sure how to handle Response.flags. I see the response gets stripped down during processing, but the headers do get placed in the object that will ultimately be used to replace the kwargs. Would that be the place to adjust the 'Content-Encoding' value, once I know what to adjust it to?
There is a problem with cls not being an object that takes parameters.
I'm sorry, everyone, for clogging this up with a bunch of trash. I am going to do things more properly and run tox before I toss things into the system.
Model this after the ajaxcrawl.py middleware: pass the settings to __init__ and store the setting in an instance variable.
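The ajaxcrawl.py-style wiring suggested here can be sketched as follows. The stub Settings and Crawler classes stand in for Scrapy's real ones so the example runs on its own, and the setting name COMPRESSION_KEEP_ENCODING_HEADER is a placeholder, not a name the PR finalized.

```python
class Settings(dict):
    """Stand-in for scrapy.settings.Settings, supporting getbool."""

    def getbool(self, name, default=False):
        return bool(self.get(name, default))


class Crawler:
    """Stand-in for scrapy.crawler.Crawler: just carries settings."""

    def __init__(self, settings):
        self.settings = settings


class HttpCompressionMiddleware:
    def __init__(self, settings):
        # Read the setting once and store it in an instance variable,
        # the same pattern AjaxCrawlMiddleware uses.
        self.keep_encoding_header = settings.getbool(
            'COMPRESSION_KEEP_ENCODING_HEADER')

    @classmethod
    def from_crawler(cls, crawler):
        # The "return cls(crawler.settings)" shape discussed above.
        return cls(crawler.settings)


# Usage: the middleware is built from a crawler, never instantiated
# with the crawler object itself.
crawler = Crawler(Settings({'COMPRESSION_KEEP_ENCODING_HEADER': True}))
mw = HttpCompressionMiddleware.from_crawler(crawler)
```

With this shape, tests can construct a crawler (e.g. via scrapy.utils.test.get_crawler) with the setting overridden and call from_crawler, rather than passing anything to __init__ directly.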