Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Fathom page classification module #78

Closed
wants to merge 15 commits into from

Conversation

victwj
Copy link
Collaborator

@victwj victwj commented May 6, 2021

Description

Resolves #27. Integrates Mozilla Fathom to enable ML classification of pages or DOM elements within a page. At the element level, example applications include recognizing which DOM elements represent overlays, ads, login forms, privacy policies, and cookie banners. At the page level, example applications include recognizing shopping pages, login pages, and news article pages. The user must first train their own model. This library then makes it simple to deploy the trained model through a Rally extension.

Technical

The fathom-web package is installed as devDependency. If the user wants to use this module, they must install this package in their own extension. The broad architecture of this module follows pageManager and pageText. A summary of how it works:

  1. The fathom.js background script exports the webScience.fathom.onFathomData event. The user must call webScience.fathom.onFathomData.addListener() from their own background script, specifying their callback and URL match pattern. Fathom will only be run on URLs which match the specified pattern.
  2. The user must also call the global function window.webScience.fathom.addTrainees() from their own content script to specify their trained Fathom model.
  3. When a tab is opened, the fathom background script sends an webScience.fathom.isClassifiable message to the content script running on the tab. This boolean is determined by the user's URL match patterns.
  4. Upon receiving an isClassifiable == true message, the Fathom content script will proceed to classify the page based on the user's rulesets, which were added via the global window.webScience.fathom.addTrainees().
  5. Once the Fathom content script is done, results are sent to the background script as a webScience.fathom.fathomData message. The Fathom background script then processes the results and calls the user's callbacks on the results.

Discussions

  1. This module currently supports the most basic functionality: run Fathom at most once upon page load, then return the classification results. In reality, the element the user is looking for may be dynamically loaded at some point after the page is loaded. Should a rerunning function based on MutationObserver be implemented in the content script? This would be difficult to do on the user's end, so might be worth doing in this module.

  2. Once an element is classified, should this module handle tracking the amount of time the element is in the user's screen? This should be doable on the user's end and is more study specific, but is likely to be a very commonly used feature, making it worth handling in the module. Most other types of element tracking (such as whether or not the element was clicked) are study specific and would fit better as a user implementation. This module can return the classified DOM elements, after which the user may monitor it as they see fit.

Todos

  1. Memoize page contents to optimize running multiple Fathom rulesets.
  2. Make border drawing a user configurable option.
  3. Potentially change the use of onUpdated as the main event to trigger Fathom runs.
  4. Write jsdoc comments.
  5. Remove console.log lines, which are all temporary and only used for development purposes.
  6. Define a proper classification results schema.

@victwj victwj marked this pull request as ready for review May 6, 2021 19:16
@jonathanmayer jonathanmayer self-requested a review June 8, 2021 17:59
@jonathanmayer jonathanmayer added the enhancement New feature or request label Jun 8, 2021
@@ -18,6 +18,7 @@
"eslint-plugin-import": "^2.22.1",
"eslint-plugin-mocha": "^8.1.0",
"eslint-plugin-node": "^11.1.0",
"fathom-web": "^3.7.3",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please merge in the latest changes on main, since the dependencies changed.


import {ruleset, type} from "fathom-web";

(function() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need an IIFE wrapper. Rollup will add that automatically.

Comment on lines +22 to +40
if ("webScience" in window) {
if ("fathom" in window.webScience) {
return;
}
else {
window.webScience.fathom = {
trainees: new Map(),
addTrainees: addTrainees,
}
}
}
else {
// Else, this is the first webScience initialization
window.webScience = {};
window.webScience.fathom = {
trainees: new Map(),
addTrainees: addTrainees,
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could condense this a bit.

let pageManagerLoaded = false;


// Function to initalize fathom once conditions are fulfilled
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Say more about what this function does.

}
}

// Whether the elements of this page have been classified.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: would be consistent about punctuation.

Comment on lines +86 to +93
/**
* A callback function to add a fathomDataListener.
* @param {fathomDataCallback} listener - The listener that is being removed.
* @param {Object} options - Options for the listener.
* @param {string[] options.matchPatterns} matchPatterns - The match patterns
* for pages where the listener should be notified.
* @private
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Firefox recently added dynamic imports for content scripts. That would be a good way to handle rulesets and avoid the awkwardness with content script globals. Something like...

  • A parameter to addListener specifies a path for one or more ruleset JS files and names for each ruleset.
  • The ruleset JS file must be a module that exports a ruleset() function.
  • When a page visit starts, check the URL against the match pattern set for each listener. Send an array of ruleset names and file paths to the content script.
  • In the content script, import the relevant rulesets, run them, and send back the results.

// When a tab is updated, send it a message if the page should be
// classified with Fathom
// TODO: onUpdated may not be the right event
browser.tabs.onUpdated.addListener((tabId, changeInfo, tab) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would use the pageManager.onPageVisitStart event so you can include a pageId in the message.

const contentScript = await browser.contentScripts.register({
matches: matchPatterns,
js: [{
code: inline.dataUrlToString(fathomContentScript)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We changed how content script imports work.

function messageListener(fathomDataObject) {
console.log("messageListener called");
for (const [listener, listenerRecord] of fathomDataListeners) {
listener(fathomDataObject);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't want to notify the listener unless there's a match.

Comment on lines +185 to +187
if (listenerRecord.matchPatternSet.matches(fathomDataObject.url)) {
listener(fathomDataObject);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need to do this matching twice. Could associate each listener with a unique ID and include that in the message to the content script and the response.

@victwj victwj closed this Nov 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a module for measuring DOM nodes identified by a Fathom classifier
2 participants