Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Place Content-encoding header into response's flags is desired #5290

Open
wants to merge 42 commits into
base: master
Choose a base branch
from

Conversation

T0shik
Copy link

@T0shik T0shik commented Oct 23, 2021

Fixes #1988

Continuation of #4025 due to abandonment.

additional changes:

  • 'tests/ignores.txt' in conftest.py to rely on Path
  • tests to use unittest assertions
  • remove create_spider_mw and make config test specific (can change this if needed)

Resolves #4025

cwen13 and others added 30 commits September 13, 2019 22:15
# Conflicts:
#	scrapy/downloadermiddlewares/httpcompression.py
#	tests/test_downloadermiddleware_httpcompression.py
Copy link
Member

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for reviving these changes!

Most of my feedback is probably about code you might not even have written yourself, and feedback that I failed to provide to the previous author, so apologies beforehand for that.

conftest.py Outdated Show resolved Hide resolved
scrapy/downloadermiddlewares/httpcompression.py Outdated Show resolved Hide resolved
scrapy/downloadermiddlewares/httpcompression.py Outdated Show resolved Hide resolved
scrapy/downloadermiddlewares/httpcompression.py Outdated Show resolved Hide resolved
scrapy/settings/default_settings.py Outdated Show resolved Hide resolved
tests/test_downloadermiddleware_httpcompression.py Outdated Show resolved Hide resolved
tests/test_downloadermiddleware_httpcompression.py Outdated Show resolved Hide resolved
- HttpCompressionMiddleware constructor deprecation warning
- refactor process_response
- put COMPRESSION_KEEP_ENCODING_HEADER in settings.py.tmpl
- test HttpCompressionMiddleware init
- test COMPRESSION_KEEP_ENCODING_HEADERS presence
- ensure b'decoded' flag is set if decoded
@T0shik
Copy link
Author

T0shik commented Oct 23, 2021

in addition to the previous comments:

  • fixed inconsistency with having COMPRESSION_KEEP_ENCODING_HEADER and COMPRESSION_KEEP_ENCODING_HEADERS
  • HttpCompressionMiddleware.process_response to always return b'decoded' in decoding happened. not when keep_encoding_header is True

@wRAR
Copy link
Member

wRAR commented Oct 25, 2021

Please note that the tests failed.

Copy link
Member

@elacuesta elacuesta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is similar to the one I had before. In this case, the "decoded" flag does not actually say which was the value of the original header.
Since I wrote the above comment, #1877 was solved, so I think now an alternative would be to add a response metadata attribute to store this value (or at a minimum, store content-encoding=<value> as a flag).

Furthermore, I see no reason for the flag to be bytes, existing flags are all str.

scrapy/downloadermiddlewares/httpcompression.py Outdated Show resolved Hide resolved
scrapy/downloadermiddlewares/httpcompression.py Outdated Show resolved Hide resolved
scrapy/downloadermiddlewares/httpcompression.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Oct 26, 2021

Codecov Report

Merging #5290 (ed157d5) into master (9b8285d) will increase coverage by 0.02%.
The diff coverage is 93.54%.

@@            Coverage Diff             @@
##           master    #5290      +/-   ##
==========================================
+ Coverage   88.56%   88.58%   +0.02%     
==========================================
  Files         163      163              
  Lines       10635    10648      +13     
  Branches     1812     1815       +3     
==========================================
+ Hits         9419     9433      +14     
  Misses        939      939              
+ Partials      277      276       -1     
Impacted Files Coverage Δ
scrapy/downloadermiddlewares/httpcompression.py 92.50% <93.33%> (-0.15%) ⬇️
scrapy/settings/default_settings.py 98.75% <100.00%> (+<0.01%) ⬆️
scrapy/core/downloader/__init__.py 92.48% <0.00%> (+1.50%) ⬆️

- amend tests & logic to follow the new default of COMPRESSION_KEEP_ENCODING_HEADER=True
- remove redundant condition
@T0shik
Copy link
Author

T0shik commented Oct 30, 2021

I couldn't figure out the previous build error
https://github.com/scrapy/scrapy/runs/4026144087?check_suite_focus=true

docs/conftest.py:24: in <module>
    CodeBlockParser(future_imports=['print_function']),
E   TypeError: __init__() got an unexpected keyword argument 'future_imports'

I've amended my changes to touch less files, while in there I spotted the mutation on the original response content encoding array and decided to remove it.

I think if the following build fails in a similar fashion I'll need a hand

@Gallaecio
Copy link
Member

I couldn't figure out the previous build error https://github.com/scrapy/scrapy/runs/4026144087?check_suite_focus=true

docs/conftest.py:24: in <module>
    CodeBlockParser(future_imports=['print_function']),
E   TypeError: __init__() got an unexpected keyword argument 'future_imports'

#5299

@T0shik
Copy link
Author

T0shik commented Oct 31, 2021

@Gallaecio thank you, will keep an eye out for when that's merged

@Gallaecio
Copy link
Member

Closing and reopening for a fresh CI run.

@Gallaecio Gallaecio closed this Jan 7, 2022
@Gallaecio Gallaecio reopened this Jan 7, 2022
Copy link
Member

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@T0shik Thanks again for your work. I am very sorry it took me so long to get back to this.

@elacuesta

My concern is similar to the one I had before. In this case, the "decoded" flag does not actually say which was the value of the original header. Since I wrote the above comment, #1877 was solved, so I think now an alternative would be to add a response metadata attribute to store this value (or at a minimum, store content-encoding=<value> as a flag).

I don’t think we need this, since after this change it is possible to access the original header.

Furthermore, I see no reason for the flag to be bytes, existing flags are all str.

While I agree, I think we should keep it as bytes for backward compatibility. We could add it as both, and deprecate somehow the bytes version (wrapping the set to detect access by key?), but I think that would be a non-trivial change and I think it may be best to keep it out of this pull request.

Comment on lines +32 to +49
if not stats:
warnings.warn(
"HttpCompressionMiddleware now accepts a 'stats' parameter which should be specified.",
ScrapyDeprecationWarning,
)
if settings:
self.keep_encoding_header = settings.getbool('COMPRESSION_KEEP_ENCODING_HEADER')
if not self.keep_encoding_header:
warnings.warn(
"Setting COMPRESSION_KEEP_ENCODING_HEADER=False is deprecated",
ScrapyDeprecationWarning,
)
else:
self.keep_encoding_header = False
warnings.warn(
"HttpCompressionMiddleware now accepts a 'settings' parameter which should be specified.",
ScrapyDeprecationWarning,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to add stacklevel=2 to the first and last warn() calls, so that the warning becomes about the super().__init__() line that is missing the corresponding parameters.

Also, I don’t agree with what @elacuesta suggested earlier about changing the warning message in the middle. Precisely because of what he explains between parenthesis (“when not passing settings at all also stops being supported, I guess”), I find the current message confusing, because a user that did not set the setting to any value will now be told not to set it to False, which the user is already not doing. I think the previous message was better. Maybe we could give it additional context for clarification instead:

                warnings.warn(
                    "The default value of COMPRESSION_KEEP_ENCODING_HEADER, "
                    "False, is deprecated, and will stop working and stop "
                    "being its default value in a future version of Scrapy. "
                    "Set COMPRESSION_KEEP_ENCODING_HEADER=True in your "
                    "settings to remove this warning.",
                    ScrapyDeprecationWarning,
                    stacklevel=2,
                )

@@ -37,6 +37,7 @@
COMMANDS_MODULE = ''

COMPRESSION_ENABLED = True
COMPRESSION_KEEP_ENCODING_HEADER = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be False for backward compatibility. It should only be set to True in settings.py.tmpl.

Accordingly, we modify self.assertIn(b'gzip', newresponse.headers['Content-Encoding']) and similar in the non-setting tests to use assertNotIn instead and remove the follow-up line that checks the header value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"Content-Encoding" header gets stripped from response headers
6 participants