-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Fathom page classification module #78
Conversation
@@ -18,6 +18,7 @@ | |||
"eslint-plugin-import": "^2.22.1", | |||
"eslint-plugin-mocha": "^8.1.0", | |||
"eslint-plugin-node": "^11.1.0", | |||
"fathom-web": "^3.7.3", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please merge in the latest changes on main
, since the dependencies changed.
|
||
import {ruleset, type} from "fathom-web"; | ||
|
||
(function() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need an IIFE wrapper. Rollup will add that automatically.
if ("webScience" in window) { | ||
if ("fathom" in window.webScience) { | ||
return; | ||
} | ||
else { | ||
window.webScience.fathom = { | ||
trainees: new Map(), | ||
addTrainees: addTrainees, | ||
} | ||
} | ||
} | ||
else { | ||
// Else, this is the first webScience initialization | ||
window.webScience = {}; | ||
window.webScience.fathom = { | ||
trainees: new Map(), | ||
addTrainees: addTrainees, | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could condense this a bit.
let pageManagerLoaded = false; | ||
|
||
|
||
// Function to initalize fathom once conditions are fulfilled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Say more about what this function does.
} | ||
} | ||
|
||
// Whether the elements of this page have been classified. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: would be consistent about punctuation.
/** | ||
* A callback function to add a fathomDataListener. | ||
* @param {fathomDataCallback} listener - The listener that is being removed. | ||
* @param {Object} options - Options for the listener. | ||
* @param {string[] options.matchPatterns} matchPatterns - The match patterns | ||
* for pages where the listener should be notified. | ||
* @private | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Firefox recently added dynamic imports for content scripts. That would be a good way to handle rulesets and avoid the awkwardness with content script globals. Something like...
- A parameter to
addListener
specifies a path for one or more ruleset JS files and names for each ruleset. - The ruleset JS file must be a module that exports a
ruleset()
function. - When a page visit starts, check the URL against the match pattern set for each listener. Send an array of ruleset names and file paths to the content script.
- In the content script, import the relevant rulesets, run them, and send back the results.
// When a tab is updated, send it a message if the page should be | ||
// classified with Fathom | ||
// TODO: onUpdated may not be the right event | ||
browser.tabs.onUpdated.addListener((tabId, changeInfo, tab) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would use the pageManager.onPageVisitStart
event so you can include a pageId
in the message.
const contentScript = await browser.contentScripts.register({ | ||
matches: matchPatterns, | ||
js: [{ | ||
code: inline.dataUrlToString(fathomContentScript) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We changed how content script imports work.
function messageListener(fathomDataObject) { | ||
console.log("messageListener called"); | ||
for (const [listener, listenerRecord] of fathomDataListeners) { | ||
listener(fathomDataObject); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't want to notify the listener unless there's a match.
if (listenerRecord.matchPatternSet.matches(fathomDataObject.url)) { | ||
listener(fathomDataObject); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need to do this matching twice. Could associate each listener with a unique ID and include that in the message to the content script and the response.
Description
Resolves #27. Integrates Mozilla Fathom to enable ML classification of pages or DOM elements within a page. At the element level, example applications include recognizing which DOM elements represent overlays, ads, login forms, privacy policies, and cookie banners. At the page level, example applications include recognizing shopping pages, login pages, and news article pages. The user must first train their own model. This library then makes it simple to deploy the trained model through a Rally extension.
Technical
The fathom-web package is installed as devDependency. If the user wants to use this module, they must install this package in their own extension. The broad architecture of this module follows pageManager and pageText. A summary of how it works:
fathom.js
background script exports thewebScience.fathom.onFathomData
event. The user must callwebScience.fathom.onFathomData.addListener()
from their own background script, specifying their callback and URL match pattern. Fathom will only be run on URLs which match the specified pattern.window.webScience.fathom.addTrainees()
from their own content script to specify their trained Fathom model.webScience.fathom.isClassifiable
message to the content script running on the tab. This boolean is determined by the user's URL match patterns.isClassifiable == true
message, the Fathom content script will proceed to classify the page based on the user's rulesets, which were added via the globalwindow.webScience.fathom.addTrainees()
.webScience.fathom.fathomData
message. The Fathom background script then processes the results and calls the user's callbacks on the results.Discussions
This module currently supports the most basic functionality: run Fathom at most once upon page load, then return the classification results. In reality, the element the user is looking for may be dynamically loaded at some point after the page is loaded. Should a rerunning function based on MutationObserver be implemented in the content script? This would be difficult to do on the user's end, so might be worth doing in this module.
Once an element is classified, should this module handle tracking the amount of time the element is in the user's screen? This should be doable on the user's end and is more study specific, but is likely to be a very commonly used feature, making it worth handling in the module. Most other types of element tracking (such as whether or not the element was clicked) are study specific and would fit better as a user implementation. This module can return the classified DOM elements, after which the user may monitor it as they see fit.
Todos
onUpdated
as the main event to trigger Fathom runs.console.log
lines, which are all temporary and only used for development purposes.