Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Added option to disable the webkit in-memory and network caches #821

Merged
merged 7 commits into from
Oct 16, 2018
Merged

[MRG+1] Added option to disable the webkit in-memory and network caches #821

merged 7 commits into from
Oct 16, 2018

Conversation

wsdookadr
Copy link
Contributor

@wsdookadr wsdookadr commented Oct 7, 2018

Hi, I've been reading through the existing open issues.
This pull request is intended to address #203 , #519 , scrapy-plugins/scrapy-splash#168 , #817 , webrecorder/har2warc#1

Please review

Thanks,
Stefan

Note: To try this out, you can use this docker image:

docker pull wsdookadr/splash:v1
docker run -it -p 8050:8050  --name splash_v1 wsdookadr/splash:v1 --disable-browser-caches

And then load a web page and generate a .har file from it like so

curl "http://localhost:8050/render.har?url=https://www.digitalocean.com/community/tutorials/how-to-install-java-with-apt-on-debian-9&timeout=13&wait=4&response_body=1" > /tmp/q.har

If you run this multiple times, the amount of data fetched by curl should be about the same (4mb for that url). Before this pull-request, the first time, it would produce around 4mb, and the subsequent ones about ~200-300k (the .har files would be missing resources that are cached by webkit in its page cache, the first time the page was loaded).

@cathalgarvey cathalgarvey changed the title added option to disable the webkit in-memory and network caches [MRG+1] Added option to disable the webkit in-memory and network caches Oct 8, 2018
@cathalgarvey
Copy link

On cursory review the code changes are clean and their purpose is clear. Some minor formatting changes to comply with PEP8, and a typo in the documentation changes, but nothing that should delay a merge. I'll +1 it and a make a few inline suggestions; if another maintainer can review and confirm also I expect we can merge it.

Copy link

@cathalgarvey cathalgarvey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two minor changes recommended but should not delay a merge. I think this is a valuable feature for some use-cases and adds a level of control that might be useful generally for experimenting and testing.

It would be nice, if the new functionality could be tested, to include tests for it.

docs/faq.rst Outdated Show resolved Hide resolved
splash/server.py Outdated Show resolved Hide resolved
docs/faq.rst Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Oct 8, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@a3f7e75). Click here to learn what that means.
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master     #821   +/-   ##
=========================================
  Coverage          ?   87.46%           
=========================================
  Files             ?       41           
  Lines             ?     5670           
  Branches          ?      781           
=========================================
  Hits              ?     4959           
  Misses            ?      533           
  Partials          ?      178
Impacted Files Coverage Δ
splash/network_manager.py 85.81% <100%> (ø)
splash/server.py 71.06% <100%> (ø)

@kmike
Copy link
Member

kmike commented Oct 9, 2018

Hey @wsdookadr! The PR looks good, but could you please add a test case? Check e.g. here how a custom startup option is tested. I guess you'd need to define some resource in

class Root(Resource):
with caching behavior, then request it

  1. twice with a Splash started with default options - resource should be in har only for one of them,
  2. twice with a Splash started with --disable-browser-caches - resource should be in har in both cases

@wsdookadr
Copy link
Contributor Author

@kmike I've pushed another commit with tests for this. However, it seems like the problem is not being captured by the test, in other words both default options and the new --disable-browser-caches options will bring all the entries with content in the HAR.
The new test can be run using the following command:

pytest -s --verbosity=6 ./splash/tests/test_request_filters.py -k test_disable_browser_caches

@use_chunked_encoding
def render_GET(self, request):
return ("""<html><head>
<link rel="stylesheet" href="subresources-with-caching/style.css" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem could be that related resources don't set any specific cache headers, so the browser doesn't cache them. Could you try setting Cache-Control header for these static resources?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kmike you are correct, I've pushed a new commit that sets that Cache-Control header and now it seems that at least for the image.gif resource, the entire scenario is reproducible.

Used the Cache-Control header to allow the browser to cache
the image.gif that is being fetched. This now actually reproduces
the behaviour of the browser which would discard the image.gif resource
from the HAR if the page/in-memory cache was enabled.
# pytest -s --verbosity=6 ./splash/tests/test_request_filters.py -k test_disable_browser_caches
def test_disable_browser_caches(self):
# entries if run with caches enabled
# in this case we should have [[content,content,content],[content,content,no content]]
Copy link
Member

@kmike kmike Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder why isn't it working for css files. Maybe there are some other response headers needed to make it work, or maybe browser uses ETag or something like that, so there is no "full" response, but still a response. It doesn't look good that our code sets cache headers, but then checks that the resource is not cached; test will break if resource start to be cached.

I think there are two ways to fix that:

  1. Investigate what's going on and make sure css is cached as well;
  2. remove a check for css file, check only html and gif.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you on that and I went with the 2nd approach, I have removed the css checks. I've pushed a new commit for this.
For this pull request I'm ok with having it this way.

return self.Image()
return self

class StyleSheet(Resource):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be removed now, as well as <link rel=stylesheet> tag in html?

@@ -901,6 +901,39 @@ def render_GET(self, request):
return base64.decodebytes(b'R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs=')


class SubresourcesWithCaching(Resource):
"""
Embedded css and image.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docstring is no longer correct, as it embeds only image

@use_chunked_encoding
def render_GET(self, request):
return ("""<html><head>
</head>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can remove <head>, to make an example a bit shorter.

@kmike kmike merged commit 7eb9a3c into scrapinghub:master Oct 16, 2018
@kmike
Copy link
Member

kmike commented Oct 16, 2018

Thanks @wsdookadr! 🎉

@Mideen
Copy link

Mideen commented Dec 19, 2018

Hi team,
When can we expect the next stable version with these changes?
@kmike

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants