Erik Rose edited this page Apr 5, 2016 · 16 revisions

Please scribble on this page. Answer unanswered questions. Pose more. Go nuts.

Broad Goal

Classify parts of a web page relative to each other and in absolute terms. Find the most likely title, the most likely body text, the most likely next/prev button. But also decide whether there is likely any body text at all or any next button at all. It's fuzzy classification on independent axes.
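One way to picture "fuzzy classification on independent axes" is the minimal sketch below. Everything in it is hypothetical and only for illustration: the `Candidate` type, the axis names, the scores, and the 0.5 threshold. Each axis is judged separately, and an axis may come back empty when no candidate is convincing enough.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """A page region under consideration (hypothetical stand-in for a DOM node)."""
    label: str
    # Axis name -> fuzzy score in [0, 1]; a missing axis counts as 0.
    scores: dict = field(default_factory=dict)


def classify(candidates, axes, threshold=0.5):
    """For each axis, pick the best-scoring candidate.

    Returns None for an axis when no candidate reaches the threshold,
    which models "there is probably no next button at all".
    """
    result = {}
    for axis in axes:
        best = max(candidates, key=lambda c: c.scores.get(axis, 0.0), default=None)
        if best is not None and best.scores.get(axis, 0.0) >= threshold:
            result[axis] = best
        else:
            result[axis] = None
    return result


# Made-up scores, purely for illustration.
candidates = [
    Candidate("h1.headline", {"title": 0.9, "body": 0.1}),
    Candidate("div#article", {"title": 0.2, "body": 0.8}),
    Candidate("a.footer-link", {"next": 0.3}),
]
winners = classify(candidates, ["title", "body", "next"])
```

Here `winners["title"]` and `winners["body"]` are the relative winners on their axes, while `winners["next"]` is `None` because the best candidate for that axis falls below the absolute threshold.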

Reason about the nature of the page and of the individual components within it. What is the likely intent of a user visiting it? What are the interesting entities? Structure the system so that it's easy to add modules that further enhance classification and process the data it derives.

Our proximal use case is to provide input to a full-text indexer, likely client-side, to augment Awesome Bar results.

Other Possible Uses

  • Provide lighter page downloads for people with lower bandwidth or battery.
  • Meet accessibility needs in clever new ways (key equivalents for prev/next navigation and more).
  • Feed into (or be) a categorizer of web pages so we could, for example, cluster ActivityStream entries.

Prior art

  • Readability (used in Safari, FF Reader View). This is a very good start but has high standards for getting the answer "right", giving up altogether when it lacks confidence. For Awesome Bar purposes, it's more important to index something (even if it's every textual thing on the page) than nothing. Err on the side of extracting too much.
  • OmniWeb's full-text indexing of visited pages
  • ChromeDistractionFreeBrowsing
  • Diffbot, a startup that extracts semantic data from web pages using visual cues
  • FTS plans for Places circa Firefox 3 & 4
  • Methods for Web Content Analysis and Context Detection - research paper from PSU Capstone program
  • Embedly is a service that pulls out things like dominant colors, content, and keywords from pages. We're using it in ActivityStream.

Indicators

Things we can look at to identify The Content or other metadata:

  • A div tag with p tags in it (Readability)
  • HTML density
  • Regions that are visually largest on the page (unprecedented)
  • Regions with id≈"content"
  • Link density
  • Repeatedness of phrases (expensive) (unprecedented)
  • Stability over time (vs. changing ads etc.)
  • oEmbed embeds
  • Microformats (which Firefox has a full parser for already)
  • Open Graph data
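A few of the cheaper indicators above (link density, p tags, ids resembling "content") can be gathered in a single pass over the markup. The following is a rough sketch using only Python's standard `html.parser`; the `IndicatorParser` class, the `indicators` function, and the choice to measure density in characters are assumptions made for illustration, not a settled design.

```python
import re
from html.parser import HTMLParser

# Assumption: "id≈content" is approximated by a case-insensitive substring match.
CONTENT_HINT = re.compile(r"content", re.IGNORECASE)


class IndicatorParser(HTMLParser):
    """Counts text inside and outside links, <p> tags, and content-like ids."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0    # all non-whitespace-trimmed text seen
        self.link_chars = 0    # the part of that text inside <a> elements
        self.paragraphs = 0
        self.content_ids = []
        self._link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1
        elif tag == "p":
            self.paragraphs += 1
        element_id = dict(attrs).get("id") or ""
        if CONTENT_HINT.search(element_id):
            self.content_ids.append(element_id)

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth > 0:
            self._link_depth -= 1

    def handle_data(self, data):
        # Note: this sketch does not skip <script> or <style> contents.
        length = len(data.strip())
        self.text_chars += length
        if self._link_depth:
            self.link_chars += length


def indicators(html):
    """Return a dict of simple indicators for a whole document or fragment."""
    parser = IndicatorParser()
    parser.feed(html)
    parser.close()
    density = parser.link_chars / parser.text_chars if parser.text_chars else 0.0
    return {
        "link_density": density,
        "paragraphs": parser.paragraphs,
        "content_ids": parser.content_ids,
    }


sample = (
    '<div id="main-content"><p>Hello world</p><a href="/x">more</a></div>'
    '<div id="nav"><a href="/">home</a></div>'
)
report = indicators(sample)
```

A real implementation would compute these per region rather than per page, so that a link-heavy nav block and a text-heavy article block can be told apart.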

Extract

  • Content
  • Whether a page appears to be one thing (an article) or a list of things (an index, etc.)
  • Dominant colors
  • Icons (beyond favicon?)
  • Page category (recipe, comic, article, photo, video). What if Firefox had a recipe box, for instance? You could search for recipes that use certain ingredients you have on hand, or that avoid others (vegetarian, say). This opens up the potential for semantic querying.
  • Any Next and Previous buttons (so we can standardize navigation)
  • Nav (so we can put it in a menu)
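As an illustration of the Next/Previous item, the sketch below looks for links that either declare `rel="next"` / `rel="prev"` or whose text says "next" or "previous", and reports nothing when no link is convincing. The `NavLinkParser` class, the `find_nav` function, the weights (1.0 and 0.6), and the 0.5 threshold are all hypothetical choices for the example.

```python
import re
from html.parser import HTMLParser

NEXT_TEXT = re.compile(r"\bnext\b", re.IGNORECASE)
PREV_TEXT = re.compile(r"\b(prev|previous)\b", re.IGNORECASE)


class NavLinkParser(HTMLParser):
    """Collects href, rel tokens, and visible text for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            a = dict(attrs)
            self._current = {
                "href": a.get("href") or "",
                "rel": (a.get("rel") or "").lower().split(),
                "text": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None


def score_link(link, rel_tokens, text_pattern):
    """Assumed weights: an explicit rel value beats a match on link text."""
    if any(token in link["rel"] for token in rel_tokens):
        return 1.0
    if text_pattern.search(link["text"]):
        return 0.6
    return 0.0


def find_nav(html, threshold=0.5):
    """Return the best next/previous hrefs, or None where nothing qualifies."""
    parser = NavLinkParser()
    parser.feed(html)
    parser.close()
    axes = (
        ("next", {"next"}, NEXT_TEXT),
        ("previous", {"prev", "previous"}, PREV_TEXT),
    )
    result = {}
    for name, rel_tokens, pattern in axes:
        best = max(
            parser.links,
            key=lambda link: score_link(link, rel_tokens, pattern),
            default=None,
        )
        if best is not None and score_link(best, rel_tokens, pattern) >= threshold:
            result[name] = best["href"]
        else:
            result[name] = None
    return result


page = (
    '<nav><a href="/">Home</a>'
    '<a href="/page/1" rel="prev">Back</a>'
    '<a href="/page/3">Next page</a></nav>'
)
nav = find_nav(page)
```

With hrefs in hand, a browser could bind the same keys to next and previous on every site that exposes them, and do nothing on pages where both come back as `None`.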

People to talk to

  • Olivier over in Content Services, Rebecca Weis, and Chuck Harmston worked on automatic page categorization.

Crazy ideas

  • Crowdsource-train the thing.