adblock: Support cosmetic filtering (element hiding) and scriptlets #7629
Since on that last PR you mentioned "I'm assuming this PR will not get merged before that" I thought I would take the time to add some encouraging comments. No, unfortunately, it's not going to get merged before 3.0 (which makes Qt6 the preferred backend) or even 4.0 (which'll drop webkit and do a bunch of other disruptive maintenance stuff). And even then once we are trying to get through the PR backlog I think this'll be a complicated one due to changes to the emerging extension API. I'm excited to have a concrete and well thought out proposal for that API (you did make it onto my aspirational focus board) but while we are focusing on getting PRs merged I think it's going to be quite hard to plan for how we would like extensions to work with the core stuff at the same time. So having a PR that adds a feature and changes the ostensibly-public API makes it automatically more complex. Anyway, I haven't actually looked at this PR except for a while back in February (I have notes, no idea what's in them) so feel free to tell me that you are already doing it the most correct way. But if you can see some way to make the |
As requested, I've isolated the API changes to another PR (#7630). Unfortunately, I can't just get rid of them since I need to be able to access the |
Added a way to interact with the Tab instance before, after starting, and finishing loading a page
First stab at cosmetic filtering with the python-adblock library. Cosmetic filtering is done in two steps: 1. Lookup url-specific cosmetic selectors. 2. Lookup generic selectors based on the element classes and ids present in the page. In our code, we perform step 1 in a load_started hook. The results are then translated to javascript and inserted into the page using Tab.run_js_async. The injected javascript also returns all the class names and ids that appear in the page. A callback then takes this information and runs step 2. Note that scriptlet injection is not yet available because extra resources need to be downloaded and added to the engine before script injection resolves properly.
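The two-step lookup described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `FakeEngine`, its two methods, `selectors_to_css`, `on_load_started` and `on_page_scanned` are all hypothetical names, and the toy engine stands in for python-adblock rather than reproducing its real interface.

```python
from typing import Iterable, List


def selectors_to_css(selectors: Iterable[str]) -> str:
    """Turn CSS selectors into a single rule hiding every matching element."""
    unique = sorted(set(selectors))
    if not unique:
        return ""
    return ", ".join(unique) + " { display: none !important; }"


class FakeEngine:
    """Toy engine with hard-coded rules (hypothetical, NOT python-adblock's API)."""

    def url_specific_selectors(self, url: str) -> List[str]:
        # Step 1 data: selectors that depend only on the URL being loaded.
        return ["#sidebar-ad"] if "example.com" in url else []

    def generic_selectors(self, classes: Iterable[str],
                          ids: Iterable[str]) -> List[str]:
        # Step 2 data: generic selectors, limited to what the page contains.
        known = {".ad-banner", "#promo"}
        present = {"." + c for c in classes} | {"#" + i for i in ids}
        return sorted(known & present)


def on_load_started(engine: FakeEngine, url: str) -> str:
    """Step 1: CSS that can be computed before anything about the DOM is known."""
    return selectors_to_css(engine.url_specific_selectors(url))


def on_page_scanned(engine: FakeEngine, classes: Iterable[str],
                    ids: Iterable[str]) -> str:
    """Step 2: runs in the callback that receives the page's classes and ids."""
    return selectors_to_css(engine.generic_selectors(classes, ids))
```

In the real flow, the string returned by each step would be handed to the tab (via injected javascript in this commit), and the class and id lists would come back from the script that ran in the page.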
To add resources to the engine, run :adblock-update-resources. Then, the resources should automatically download, cache, and add themselves to the engine. The way the resources are determined is similar to Brave's Rust adblocker code. They download uBlock's redirect-engine.js, parse it, and select resources based on that. They also parse and add the functions and function templates from uBlock's scriptlets.js. We do basically the same thing. Note that this version requires the version of python-adblock whose add_resource function for the engine has 4 arguments (including aliases). This will likely be version 0.5.3. Scriptlet injection seems to mostly work, but there are still some annoying caveats. Because scriptlets are currently injected after the page is loaded, ads on youtube will not be blocked if the video is opened in a new tab.
Also, used a lot of the convenience functions present in the qutebrowser codebase. Everything seems to work, including scriptlets. Also added unit test cases for cosmetic filtering.
To avoid having to do a :adblock-update-resources after every :adblock-update.
After updating to HEAD (from an older, pre 3.0.0 version), got the cannot parse error
Env
|
Now properly removes trailing block comments which appear in the upstream redirect-resources.js. Also handles the inclusion of css resources.
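As an illustration of the kind of cleanup this commit describes, here is a minimal sketch that drops block comments sitting at the very end of a JavaScript source before it is parsed. The function name and regex are assumptions for illustration, not the PR's implementation, and the sketch does not try to handle a `/*` that appears inside a string literal.

```python
import re

# One or more /* ... */ comments (with surrounding whitespace) at the very end
# of the text. The tempered dot `(?:(?!\*/).)*` keeps a single comment from
# spanning across an earlier `*/`, so comments elsewhere in the file survive.
_TRAILING_BLOCK_COMMENTS = re.compile(
    r"(?:\s*/\*(?:(?!\*/).)*\*/)+\s*\Z",
    re.DOTALL,
)


def strip_trailing_block_comments(source: str) -> str:
    """Remove block comments that appear after the last piece of real code."""
    return _TRAILING_BLOCK_COMMENTS.sub("", source)
```

Sources without a trailing comment are returned unchanged, so the helper is safe to run on every downloaded file.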
@duarm Thanks for letting me know. I've fixed the parse error. However, since I last touched the code, gorhill/uBlock@18a84d2 was committed, which completely refactors how scriptlets are written. This means scriptlet injection is basically broken for the moment until I write the code to parse the new format, and that may take some time. To others, on a more practical level, this means you'll likely get ads on youtube and whatnot if you run |
@The-Compiler @toofar I'd like some input. To get scriptlet injection working, we need to parse the new So, we have the following options:
The downside with 1 is that it's brittle and less future-proof. The downside with 2 is that it requires interacting with a javascript engine (such as QJSEngine) from a component, which means we need to expose that via some API as well, making the PR process more complicated. Which would you prefer? |
There's a lot I don't understand, so I'll ask some questions that might seem silly because I don't have the time to do my own research.
|
Let me give a bit more context. To function correctly, the adblock engine must first be loaded with 1) filter lists 2) scriptlets (among other things) before visiting any websites. Then once a website is visited, the adblock engine uses the filter lists to determine which scriptlets to inject. For example, this is a sample rule for scriptlet injection:
This says when youtube.com is visited, run the Thus, code relating to the adblock engine can be divided into two phases.
The question I had asked was in reference to the setup phase (1). To answer your questions:
We cannot inject the whole upstream file as-is, since the filter lists contain the exact details of how the scriptlets should be injected. As you deduced, we have to feed the adblock engine with the scriptlets, and it determines (based on the loaded filter rules) exactly how to inject the scriptlets.
Yes. If you take a closer look, the
This is why I suggested using a javascript engine.
This seems like it would be confusing from a UI perspective. Based on the documentation, Qt's EDIT: Also, thanks for the fast response! |
So if I understand correctly, the scriptlet.js format changed, which is basically the "db" for the scriptlets that are getting injected into the browser engine on page load. And that changed from being a somewhat parseable file to an es6/js file? Just throwing in an idea: would it be possible to write a js script that loads that thing, iterates over all the urls and spits them out in a saner format that is parseable again? |
Yes
That's what option 2 from #7629 (comment) would do and is the "recommended" way to do things according to the adblock-rust developers. However, something needs to execute the js script, which is where the talk of the javascript engine comes into play. |
I'm currently a bit under the water with other stuff (semester starting and me moving), so just a couple of quick comments:
|
Great. Thanks for the pointers. I'll take a look at using QJSEngine and |
I took care of some linter ci failures for you, hope it helps
Thanks. I appreciate it! |
Co-authored-by: port19 <port19@port19.xyz>
This pull request contains the implementation for cosmetic filtering and scriptlet injection (after the qt6 branch merge). Closes #6480.
The main body of the code for this is spread over two files: components/braveadblock.py and components/ublock_resources.py.

Note that these changes work only for WebEngine.
Cosmetic filtering is implemented in two steps. First, before the page loads, the Brave ad blocker engine is queried for URL-specific filters that should be applied. The user-specified css (using stylesheet.js) is then updated to include these filters via the newly added add_dynamic_css function. Next, after the page loads, an asynchronous javascript function gets the list of classes and ids generated on the current page. The callback to this function calls the Brave ad blocker with this information and applies the returned generic filters, once again using the add_dynamic_css function. To facilitate this, the following APIs have been added: api.hook.before_loaded, api.hook.load_started, api.hook.load_finished, api.Tab.add_dynamic_css, api.Tab.remove_dynamic_css. In addition, the change mentioned in the critical changes section was implemented.

For scriptlet injection, in addition to the filter lists, additional javascript templates/functions need to be downloaded and added to the Brave ad blocker before it works properly. The :adblock-update-resources command was added to perform this action. The code for this mostly lies in components/ublock_resources.py and takes heavy inspiration from the original Brave ad blocker implementation (specifically https://github.com/brave/adblock-rust/blob/master/src/resources/resource_assembler.rs). It downloads the resources specified by ublock's redirect-resources.js as well as ublock's scriptlets.js and caches them. These resources are not expected to change as often as the filter lists, so they are managed separately. In addition, the urls are fairly unintuitive to the user, so they were not added as a config option, but rather hardcoded.

The scriptlets are injected at the same time as the dynamic css (i.e., during the api.hook.before_loaded hook), using the newly added function add_web_script. The following API changes have been made to support scriptlet injection: api.Tab.add_web_script, api.Tab.remove_web_script, api.usertypes.InjectionPoint.

Finally, a unit test regarding adblocker resources and cosmetic filtering has been added.
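The download-and-cache behaviour described for the scriptlet resources can be sketched as below. This is a hedged illustration only: `load_resource`, its parameters, and the injected `fetch` callable are hypothetical and do not come from the PR, which may organise its cache differently.

```python
from pathlib import Path
from typing import Callable


def load_resource(name: str, cache_dir: Path,
                  fetch: Callable[[str], str],
                  refresh: bool = False) -> str:
    """Return a resource's text, downloading it only when needed.

    `fetch` is whatever performs the actual download (hypothetical here);
    passing it in keeps the caching logic testable without network access.
    """
    cached = cache_dir / name
    if cached.exists() and not refresh:
        # Reuse the cached copy, e.g. when only the filter lists were updated.
        return cached.read_text(encoding="utf-8")
    text = fetch(name)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_text(text, encoding="utf-8")
    return text
```

Under this sketch, an explicit resource update would call the function with `refresh=True`, while an ordinary filter-list update would leave the cached resources untouched.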
Critical Changes
In addition to being emitted during _load_url_prepare, the api.tab.Tab.before_load_started signal is now also emitted during _on_navigation_request. The reason for this is that _load_url_prepare is not always called before a new page loads. For example, this can happen when a non-clickable object is clicked to open in the background. In this case, TabbedBrowser.tabopen will be called with url=None, so load_url and _load_url_prepare are not called. However, _on_navigation_request is always called, and the url is always known by this point. For a reliable api.hook.before_loaded hook, it is essential we emit before_load_started during _on_navigation_request as well. We don't want to remove the emit from _load_url_prepare either, since that gets called before _on_navigation_request, and we need the hook to fire as soon as possible.

API Changes
api.hook.before_loaded
A hook that is guaranteed to be notified before a page loads. Components which need to do something before a page loads should use this hook. The hook may be called multiple times before a page loads. The api.Tab object which is about to load and the url which we are about to load are passed in as arguments.

api.hook.load_started
A hook that is guaranteed to be notified after the page starts loading. Components which need to do something after a page starts loading should use this hook. The api.Tab object which is starting to load is passed in as an argument.

api.hook.load_finished
A hook that is guaranteed to be notified after the page finishes loading. Components which need to do something after a page load finishes should use this hook. The api.Tab object which just finished loading and a flag indicating whether the load succeeded are passed in as arguments.

api.Tab.add_dynamic_css
Adds css which will get applied to every page the api.Tab object loads. This has lower precedence than the user-specified stylesheet in the user config. This css is expected to change often, hence the "dynamic".

api.Tab.remove_dynamic_css
Removes the aforementioned css which will get applied to every page the api.Tab object loads.

api.Tab.add_web_script
Adds a snippet of javascript to be executed during the api.Tab object's page load. The specific moment at which the web script executes can be controlled with api.usertypes.InjectionPoint. These "web scripts" are intentionally separated from the greasemonkey scripts. Greasemonkey scripts are added and managed by the end user, but these scripts are meant to be used by the components within qutebrowser. This interface also allows for more control over exactly how the added scripts will be executed.

api.Tab.remove_web_script
Removes the aforementioned snippet of javascript to be executed during the api.Tab object's page load.

api.usertypes.InjectionPoint
A type representing the point at which to execute an injected web script.
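To make the shape of these additions more concrete, here is a small self-contained model of how a component might use them. Everything in it is an assumption for illustration: `FakeTab`, the `before_loaded` decorator, `fire_before_loaded`, the method signatures, and the `InjectionPoint` member names are hypothetical and are not taken from the PR. The point it illustrates is that, since before_loaded may fire more than once for a single page load, a handler is safest when it is idempotent.

```python
import enum
from typing import Callable, Dict, List, Tuple


class InjectionPoint(enum.Enum):
    """Hypothetical members; the real type's values are not listed here."""
    DOCUMENT_CREATION = enum.auto()
    DOCUMENT_READY = enum.auto()


class FakeTab:
    """Toy stand-in for api.Tab that only records what was added."""

    def __init__(self) -> None:
        self.dynamic_css: Dict[str, str] = {}
        self.web_scripts: Dict[str, Tuple[str, InjectionPoint]] = {}

    def add_dynamic_css(self, name: str, css: str) -> None:
        self.dynamic_css[name] = css

    def remove_dynamic_css(self, name: str) -> None:
        self.dynamic_css.pop(name, None)

    def add_web_script(self, name: str, source: str,
                       point: InjectionPoint) -> None:
        self.web_scripts[name] = (source, point)

    def remove_web_script(self, name: str) -> None:
        self.web_scripts.pop(name, None)


_before_loaded_hooks: List[Callable[[FakeTab, str], None]] = []


def before_loaded(func: Callable[[FakeTab, str], None]):
    """Register a handler to run before a page loads (hypothetical decorator)."""
    _before_loaded_hooks.append(func)
    return func


def fire_before_loaded(tab: FakeTab, url: str) -> None:
    """Model of the core notifying every registered handler."""
    for hook in _before_loaded_hooks:
        hook(tab, url)


@before_loaded
def apply_filters(tab: FakeTab, url: str) -> None:
    # Entries are keyed by name, so a second notification for the same load
    # overwrites the previous entry instead of adding a duplicate.
    tab.add_dynamic_css("adblock", "#ad { display: none !important; }")
    tab.add_web_script("adblock", "/* scriptlet for " + url + " */",
                       InjectionPoint.DOCUMENT_CREATION)
```

Keying the entries by name is just one way to get idempotence; a component could equally remove its previous css and script before adding the new ones.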
Other changes to core implementation
_update_stylesheet now also includes stylesheet.js in its injected script (wrapped in the global wrapper) because sometimes asynchronous code runs before injected scripts, and stylesheet.set_css will obviously fail if window._qutebrowser.stylesheet has not been defined.

Other considerations
load_finished hook, to make this faster, we may want to have some javascript observe the DOM and update as we go along, but this likely requires Qt's web channels.
components/ublock_resources.py, but python-adblock does not have the interface implemented for that. In the future, if/when python-adblock does implement the interface, we should move the code out from components/ublock_resources.py, but for now, I did it this way just to get things working.

Qt 6
Note that this should work with both Qt 5 and Qt 6 (confirmed to be working with Qt 6.4).