Support for HTML/XML stream parsing/rewriting. #1222

Open
bahrus opened this issue Aug 30, 2023 · 4 comments

Comments

@bahrus

bahrus commented Aug 30, 2023

Proposal for server/service worker-side "template instantiation" - HTML / XML stream parsing/rewriting (including moustache token events)

Author: Bruce B. Anderson

Last Updated: 9/2/2023

Backdrop

One amazing achievement the WHATWG can take pride in over the past decade is its reach beyond the browser: a whole ecosystem has developed that allows "isomorphic" code to work both on the server and in the browser (as well as during the build process), driven by the painstaking standards work of the WHATWG.

In particular, the tech stack that service workers tap into (fetch, streaming, ES modules, caching, etc.) can be utilized on the server side, with solutions like Cloudflare Workers, Deno, Bun, and increasingly Node.

But I believe there is one significant missing piece in the standards, where the WHATWG could benefit from a bit of humility, perhaps, and absorb ideas (and maybe even code) in the opposite direction: Fundamental support for streaming (x)(ht)ml.

Prior heartaches - already cited use cases by people encountering this missing primitive

RSS Feeds

MS Word integration

Nice use case presented here.

ColladaLoader2 support

Mentioned here.

These use cases are just the tip of the iceberg. How long before we hear from folks using any of:

SOAP/XML Services

They're still out there.

XML Vocabularies

XML still has many uses, and is still a standard.

Not supporting this entire data format in such a broad space of development, while supporting JSON, still strikes me as fundamentally unfair, frankly. I think there are understandable reasons for how we ended up here at this point (baby steps, not my department and all that), but it really is not right, long term. I think it is tipping the scales in the IT industry, leaving whole organizations out in the cold, not allowing the two data formats to compete on an even playing field. And it is quite an insult to the origins of the web.

To this vast list of shortchanged parties, let me add my own petty grievances and desires, discussed below.

We are seeing significant interest in solutions like Astro, that enable easy swapping between server-side vs. client-side components.

Processing HTML streams, plugging in / replacing dynamic data into "parts" with the help of language-neutral, declarative "static" templates (as opposed to servlet-like JavaScript) has proven itself over many decades of web development. I think providing some server-side primitives to help these engines be able to handle complex scenarios, including embedding dynamic data into a stream of static templates or dynamic third party content, would be a "slam-dunk" win for the platform.

Such an idea has taken root in a number of these solutions - the HTML Rewriter. This proposal, in essence, seeks to incorporate an enhanced version of that proven, mature solution (with additional support for moustache markers). Honorable mentions go to other packages which certainly get quite a few downloads, if those numbers are to be believed.

Providing this feature would, I believe, address a significant number of use cases, from the mundane but important "slam-dunk" use cases, to the more revolutionary, as discussed below. It would provide the equivalent of JSON.parse, at least (with the help of a small library, which maybe should be included as part of this proposal). And it would provide a good foundation to create a robust DOM object model on top of, starting, perhaps, in userland.

Highlights of the proposal

  1. Add native support for a SAX-like API built into the platform, accessible from workers and the main thread, capable of working with HTML5, with all its quirks. I think the Cloudflare/Bun.js HTMLRewriter API is a good, proven, concrete starting point as far as the basic shape of the API, and in how it integrates with streaming APIs. I have no suggestions on how to improve upon that basic API, so as far as I'm concerned, it is also a good ending point, at least for rewriting operations.
  2. Add (a subset of?) XPath support (which the HTMLRewriter API doesn't currently support).
  3. Crucially, it must provide support for parsing to a rudimentary object model, similar to parsed JSON. That is already possible with the HTMLRewriter, with a judicious paragraph of code. However, I think it would be clearest if another (base?) class, called HTMLReader, were defined, which instead of having a "transform" method would have a "subscribe" method, and whose (base?) handler class would only have access to the properties and methods that read from the stream. Code would still be required to generate whatever object the developer needs out of it. Maybe a generic reference example/utility (equivalent of JSON.parse) could be baked into the platform as part of this step.
  4. Using the same basic API shape, support XML with XPath based "events". (XMLRewriter and XMLReader).
  5. Add special support for configurable interpolation and processing markers that templating engines could build on (e.g. XSLT, Template Instantiation on the server side, etc.). As that is the least proven suggestion, I'm still mulling over what that would look like.
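
As a purely illustrative sketch of what "marker events" in item 5 might be built on, the function below splits a single text chunk into literal text and marker tokens. The function name, the token shape, and the default `{{` / `}}` delimiters are all assumptions made for this example and are not part of any existing API. It also assumes a marker never straddles two stream chunks, which a real implementation would have to handle.

```javascript
// Hypothetical helper, not part of HTMLRewriter or any standard API.
// Splits one text chunk into literal text and marker tokens.
function tokenizeMarkers(text, open = "{{", close = "}}") {
  const tokens = [];
  let pos = 0;
  while (pos < text.length) {
    const start = text.indexOf(open, pos);
    if (start === -1) break; // no more markers in this chunk
    const end = text.indexOf(close, start + open.length);
    if (end === -1) break; // unterminated marker: treat the rest as text
    if (start > pos) {
      tokens.push({ type: "text", value: text.slice(pos, start) });
    }
    tokens.push({
      type: "marker",
      value: text.slice(start + open.length, end).trim(),
    });
    pos = end + close.length;
  }
  if (pos < text.length) {
    tokens.push({ type: "text", value: text.slice(pos) });
  }
  return tokens;
}
```

Because the delimiters are parameters, the same routine could in principle serve engines that use other marker syntaxes, which is one reading of "configurable" in the item above.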

I have too much skin in the game to properly weigh how to prioritize these items, but however they are prioritized, rolling out in stages seems perfectly appropriate (including the supported CSS/XPath matches).

Highlights of open questions (in my mind)

  1. Cloudflare's HTMLRewriter restricts queries to a small subset of the full CSS Selector specification (and modifies the syntax in some cases). There may be some very practical reasons for this (and I think we can live with it). But if it is just a matter of not devoting time to supporting low-usage scenarios, I don't know that we want to create a permanent "ceiling" on the CSS queries allowed.

My personal use cases:

Edge of Tomorrow Architectural Pattern

The first two use cases from my list of petty grievances center on my personal pet peeve: an alarming lack of HTML love shown by the platform. One could argue that these use cases will become superfluous once the platform builds what it has said it will build. But at the rate things are progressing, it will be 2000 B.C. before that happens (as the progress has actually been negative over the past ten years).

The first two use cases center around supporting a userland implementation of "iframes 2.0" without the performance (and rectangular topology) penalty.

To quote the good people of github, addressing the naysayers who argue that a client side include promotes an inferior user experience:

This declarative approach is very similar to SSI or ESI directives. In fact, an edge implementation could replace the markup before it's actually delivered to the client.

<include-fragment src="/github/include-fragment/commit-count" timeout="100">
  <p>Counting commits…</p>
</include-fragment>

A proxy may attempt to fetch and replace the fragment if the request finishes before the timeout. Otherwise the tag is delivered to the client. This library only implements the client side aspect.

So basically, we can have a four-legged "relay race" to deliver content to the user in the most efficient, cost-effective manner possible, to address that critique head-on. A server-side Cloudflare Worker (say) can sift through the HTML it is streaming, and when it encounters an include-type instruction, see if it can optimize the naysayers' user experience, without causing a white screen of death. It can first check its cache for that resource, and if not found, optionally retrieve the HTML include from a CDN or a dynamically generated site or service that uses HTML server rendering, within an extremely tight window of time. Once the deadline is hit, it can "punt" and hand the HTML stream over to the next layer (while caching the resource in a background thread for future requests). The next layer is the service worker, which could isomorphically go through the exact same thought process, again searching its cache and then optionally allowing a limited time window to retrieve the resource, before punting to a web component or custom element enhancement (during template instantiation or, in the worst case, in the live DOM tree).

However, the service worker is currently significantly constrained in its ability to seek out these include statements in the streaming HTML, because the platform offers no support for doing so. The alternative is a 1.2MB polyfill, which almost defeats the purpose (high performance).

Or, if using service workers seems like overkill, a web component or custom enhancement, such as be-written, could handle includes embedded in the streaming HTML. But such solutions already have enough complexity to deal with. Having to build their own parser to parse the HTML as it streams in, searching for includes to inject cached HTML into, would again likely weigh in at hundreds of kilobytes or more, based on the libraries cited above, especially if it strives to do the job right. Waiting for the full HTML to stream before parsing it with built-in APIs wouldn't be particularly efficient either.
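
The per-leg decision described above (check the cache, try to fetch within a tight deadline, otherwise punt while the fetch finishes in the background) can be sketched roughly as follows. This is a minimal sketch under stated assumptions: `resolveInclude`, the `Map`-based cache, and the injected `fetchFragment` function are hypothetical names chosen for illustration, and nothing here depends on a stream parser being available.

```javascript
// Hypothetical helper: one "leg" of the relay race.
// cache: a Map of src -> html (assumption; could be the Cache API instead).
// fetchFragment: async function(src) -> html string (assumption).
async function resolveInclude(src, { cache, fetchFragment, timeoutMs }) {
  if (cache.has(src)) {
    return { html: cache.get(src), source: "cache" };
  }
  // Start the fetch. Whenever it lands, store it for future requests,
  // even if this request has already punted.
  const pending = fetchFragment(src).then((html) => {
    cache.set(src, html);
    return { html, source: "network" };
  });
  pending.catch(() => {}); // avoid an unhandled rejection after a punt

  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(() => resolve({ html: null, source: "punt" }), timeoutMs);
  });
  try {
    return await Promise.race([pending, deadline]);
  } catch {
    return { html: null, source: "punt" }; // fetch failed: leave the tag for the next leg
  } finally {
    clearTimeout(timer);
  }
}
```

A result with `source: "punt"` means the include tag is passed through untouched to the next leg. In a Cloudflare Worker or a service worker, the background fetch would presumably need to be kept alive with something like `waitUntil`; that wiring is omitted here.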

Iframes 2.0 in userland

If the WHATWG is at all interested in improving the end user experience, especially for those dealing with expensive networks (which I suspect they are, at least in theory), then I think they should be bold and show some leadership, and help us buck the industry's addiction to RESTful JSON-only APIs as the only way (outside iframes) of sharing content.

To quote this article:

If a resource never contains private data, then it's totally safe to put Access-Control-Allow-Origin: * on it. Do it! Do it now!

But one issue with embedding an HTML stream from a third party is needing to adjust hyperlinks, image links, etc. so they point to the right place. This is probably the most mundane, slam-dunk reason for supporting this proposal. Again, this is not only an issue in a service worker, but also in the main thread. The be-written custom enhancement, which tries its best to deal with this, has to use mutation observers to adjust links as the HTML streams in and gets written to the DOM. A built-in stream rewriter would be critical for using this library in a production setting outside tightly controlled scenarios. As it is, it often results in 404s getting logged because the URLs aren't adjusted fast enough.
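
To make the link-adjustment case concrete, here is a minimal sketch of an element handler that rebases relative URLs against the third party's address. The handler shape (`element(el)` with `getAttribute` / `setAttribute`) follows Cloudflare's documented HTMLRewriter element handlers; the function name `makeRebaseHandler`, the attribute list, and the example URLs are assumptions for illustration.

```javascript
// Attributes to rebase (assumption: a real version would cover more, e.g. srcset).
const URL_ATTRS = ["href", "src"];

// Hypothetical factory: returns a handler that rewrites relative URLs
// so they resolve against the embedded document's own location.
function makeRebaseHandler(baseUrl) {
  return {
    element(el) {
      for (const name of URL_ATTRS) {
        const value = el.getAttribute(name);
        if (!value || value.startsWith("#")) continue; // keep in-page anchors local
        try {
          el.setAttribute(name, new URL(value, baseUrl).href);
        } catch {
          // Unparseable URL: leave the attribute untouched.
        }
      }
    },
  };
}

// With Cloudflare's or Bun's HTMLRewriter, usage would look roughly like:
//   new HTMLRewriter()
//     .on("a[href], img[src]", makeRebaseHandler("https://third-party.example/docs/page.html"))
//     .transform(response);
```

Because the rewrite happens while the markup streams through, the corrected URLs would reach the DOM before any request for the old ones is issued, which is the failure mode the mutation observer approach runs into.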

i18n support also seems like a good use case.

Other things for which the lack of a stream parser makes life difficult: filtering out parts of the HTML stream, as jQuery supports (filtering out script tags, style tags, etc.).

A primitive that would make developing an HTML/XML Parser somewhat trivial

If this primitive (Cloudflare/Bun.js's HTMLRewriter) were built into the browser, creating a full-blown DOM parser would be quite straightforward, which has been a common (but often thwarted) use case. However, I suggest using clear language to indicate that these APIs can be used for reading as well as writing:

const reader = new HTMLReader();

reader.on("*", {
  element(el) {
    console.log(el.tagName, el.text, el.attributes, el.lastInTextNode); // "body" | "div" | ...
  },
});
...
reader.subscribe(
  new Response(`
<!DOCTYPE html>
<html>
<!-- comment -->
<head>
  <title>My First HTML Page</title>
</head>
<body>
  <h1>My First Heading</h1>
  <p>My first paragraph.</p>
</body>
`));

(Modified from Bun.js documentation, which hopefully is compatible with Cloudflare's API, which is documented with classes.)

In this case, the handler class would only have readonly access to the content.

I don't mean to underestimate that effort. Creating a simple object structure, like JSON parsing provides, seems almost trivial. But creating a full-blown object with bi-directional traversal, supporting CSS or XPath querying and the full gamut of DOM manipulation methods, does seem like significantly more work, and would likewise increase the payload size.
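
The "almost trivial" end of that spectrum can be sketched as a small builder that turns SAX-like events into a plain, JSON-serializable tree. This is only a sketch: the `TreeBuilder` class and its `open` / `text` / `close` event names are hypothetical, and how a real reader would surface end tags and text chunks is left open.

```javascript
// Hypothetical builder: consumes open/text/close events and produces
// a plain object tree, comparable to the output of JSON.parse.
class TreeBuilder {
  constructor() {
    this.root = { tag: "#document", attrs: {}, children: [] };
    this.stack = [this.root]; // open elements, innermost last
  }
  open(tag, attrs = {}) {
    const node = { tag, attrs, children: [] };
    this.stack[this.stack.length - 1].children.push(node);
    this.stack.push(node);
  }
  text(value) {
    if (value.trim() === "") return; // skip whitespace-only text
    this.stack[this.stack.length - 1].children.push(value);
  }
  close() {
    if (this.stack.length > 1) this.stack.pop(); // never pop the root
  }
}

// Driving it by hand with the events a stream reader might emit:
const builder = new TreeBuilder();
builder.open("body");
builder.open("h1", { class: "title" });
builder.text("My First Heading");
builder.close();
builder.close();
console.log(JSON.stringify(builder.root, null, 2));
```

The result only supports downward traversal through `children`. Parent links, querying, and mutation methods are exactly the extra work the paragraph above describes.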

Now what kinds of use cases, running in a service worker, would be better served by a full, bi-directional traversing of the DOM tree, versus use cases that could be done with the more streamlined, low memory SAX-like implementation that can process real time as the HTML/XML streams through the pipeline? I'm not yet sure, but I do suspect, beyond sheer simplicity, that there are such use cases.

But the idea here is that it shouldn't be an either/or. Having a SAX parser like the one Cloudflare/Bun.js provides seems like a must. The DOM traversal argument on top of that seems like icing on the cake, which I hope the platform would eventually support, but which I think could, in the meantime, be built in userland with a relatively tiny footprint.

Link preview functionality

Streaming a must, no need for full traversal.

Building a table of contents dynamically as content streams in

Suppose we request, within a large app, an embedded huge document, and the document starts with a table of contents within a menu. If the table of contents shows (or enables) everything at once, users may get frustrated when the links don't work, not realizing that the issue is that the section the link points to hasn't arrived in the browser. One or two such clicks, and the user may abandon use of the table of contents altogether.

So we need a way for the table of contents to grow in step with the sections of HTML being downloaded. This could be accomplished with a mutation observer, but a more elegant and direct approach, I think, would be using a SAX parser such as Cloudflare's/Bun's HTMLRewriter. I think it would perform better as well. This would not be best solved by a service worker, but rather by two web components or custom enhancements working together in the main thread with streaming HTML.
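
A rough sketch of how such a table of contents could be fed from stream events is shown below. The handler shape (`element` and `text` callbacks, text chunks carrying `text` and `lastInTextNode`) follows the HTMLRewriter handlers documented by Cloudflare; the `makeTocCollector` name and the `onEntry` callback are assumptions. The sketch also assumes headings contain plain text only, with no nested elements.

```javascript
// Hypothetical factory: handlers meant for a heading selector such as "h2[id]".
// onEntry is called once per heading, as soon as its text has streamed in.
function makeTocCollector(onEntry) {
  let current = null; // the heading currently being read, if any
  return {
    element(el) {
      current = { id: el.getAttribute("id"), title: "" };
    },
    text(chunk) {
      if (!current) return;
      current.title += chunk.text; // text may arrive in several chunks
      if (chunk.lastInTextNode) {
        onEntry({ id: current.id, title: current.title.trim() });
        current = null;
      }
    },
  };
}
```

The component rendering the menu would pass an `onEntry` that appends (or enables) a link for each entry, so a link only becomes clickable once its target section has started to arrive.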

Deriving state from HTML as it streams in.

Similar to the table of contents example. Again, mutation observers are probably a working alternative, but at a cost.

Pushing work off the main thread.

I'm not advocating that this proposal go anywhere near supporting updating the DOM from a worker. For the record, I'm not opposing it either. It just seems like an entirely different proposal. But I do suspect such proposals would benefit from being able to parse streaming HTML in the worker, with the help of the platform, but that request isn't made with this particular proposal.

I do think the argument does apply to some degree with HTML that streams through the service worker on its way to the browser's main thread. In that setting, there may be cached, persisted data from previous visits in IndexedDB, and in some of those scenarios, the code that would need to manipulate that data could be complex enough that doing it prior to leaving the service worker would make a tremendous amount of sense, from a performance point of view. I am alluding to thought-provoking arguments like this one. I do think that the platform's inability to merge such computations with the HTML streaming in, due to lack of SAX parsing support, is a barrier to that vision.

Hydrating streaming HTML - my most central interest in this proposal.

As many have argued, great synergies can be achieved when custom enhancement attributes are handled on both the main thread and the server. For many of my enhancements, I first check whether the server has created a button I need (with a certain class, say), and then I need to document: "for a better user experience, please make the server add such and such button with such and such class". If the enhancement finds no such button, it creates it in the main thread, knowing that that isn't the optimal experience.

I would like to instead provide a "server-side" library I can point the developer to that could execute in the two "back-end legs" isomorphically -- the Cloudflare or Bun.js (or Deno) service, and the service worker. This would follow the same "Edge of Tomorrow" approach -- if the remote server has already downloaded the library, great, it will add that button. If not, it can punt: "Sorry, I don't know about that attribute yet, maybe the service worker knows about it, but I might know about it next time it passes through". Same logic in the service worker. And for that approach to work in the service worker, it needs to be able to "subscribe" to certain attributes being present in the HTML as it streams in. Which requires an HTMLRewriter built in to the platform. If the library isn't loaded yet, it can punt to the custom enhancement in the main thread, with a slightly degraded user experience (but at least it wasn't for lack of effort of all the proletariat workers).
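
One possible shape for such an isomorphic library is sketched below, under several assumptions: the attribute name `be-counted`, the marker attribute `data-be-counted-done`, and the `makeHydrationHandler` function are all invented for illustration. Only the element methods used (`hasAttribute`, `setAttribute`, and `append` with `{ html: true }`) come from the HTMLRewriter API as documented by Cloudflare.

```javascript
// Hypothetical: builds a handler from a Map of attribute name -> enhance(el).
// An attribute this leg has not loaded is simply absent from the Map,
// which is the "punt": the element passes through untouched.
function makeHydrationHandler(enhancements) {
  return {
    element(el) {
      for (const [attr, enhance] of enhancements) {
        if (el.hasAttribute(attr)) enhance(el);
      }
    },
  };
}

// Example registry with one invented enhancement.
const enhancements = new Map([
  [
    "be-counted",
    (el) => {
      // Marker attribute (assumption) so later legs do not add the button twice.
      if (el.hasAttribute("data-be-counted-done")) return;
      el.append('<button class="increment">+1</button>', { html: true });
      el.setAttribute("data-be-counted-done", "");
    },
  ],
]);

// The same handler object could be registered on the server and in the
// service worker, e.g. with on("*", makeHydrationHandler(enhancements)).
```

The main-thread enhancement would then look for the marker (or the button itself) and only create the button when neither back-end leg got to it first.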

Knowing that I could use the same API both in a server setting, and in the browser's service worker, would tell me that the approach I'm using will have enough longevity, that it is worth my time to do it. It means developers using a server technology that isn't JavaScript based could at least rely on the service-worker half of the equation. Otherwise, I'd rather target a server technology with more reach (I may anyway, just saying.)

@rniwa
Collaborator

rniwa commented Aug 30, 2023

What is the actual proposal here? It's unclear from the post.

@bahrus
Author

bahrus commented Aug 30, 2023

Basically, incorporating Cloudflare's HTMLRewriter API, which provides a SAX-like, event-driven way of manipulating a stream of HTML, but including support for mustache syntax as well. I've been told that's too specific, so I'm trying to focus on use cases first without jumping to the conclusion, which is probably contributing to the confusing way this is written.

@bahrus
Author

bahrus commented Aug 30, 2023

If such a request seems wildly out of bounds of what the browser vendors think it is appropriate to support, I can save myself and everyone else the effort of tabulating (and reading through) use cases.

@bahrus bahrus changed the title Server-side "template instantiation" Server-side "template instantiation" of an HTML or XML stream. Aug 30, 2023
@bahrus bahrus changed the title Server-side "template instantiation" of an HTML or XML stream. Support for HTML/XML stream parsing/rewriting. Aug 31, 2023
@bahrus
Author

bahrus commented Sep 2, 2023

Thanks for the feedback. Hopefully it is clearer now. If not, let me know what isn't making sense. Thanks for considering this proposal.
