Make http response example more robust #460

Open · wants to merge 10 commits into main from make-http-response-example-more-robust
Conversation

wingman-jr-addon

I've been using WebRequest filtering for quite some time now in one of my plugins, and while it's easy to get started with toy examples, I've found that there is some significant complexity that must be addressed when trying to do any sort of text manipulation in a more robust manner. (Some of these deficiencies are discussed in #364, and partially addressed in the MDN docs.) I thought extending this extension would be the right place to start, as I would have wished to know these details when I started my plugin.

This extension now demonstrates three important things:

  • The accumulation of data through multiple calls to .ondata
  • The decoding of binary data to text in a streaming fashion; not necessarily performant, but it should be correct (see the sketch after this list).
  • Text decoding that tries to respect the page's reported encoding via Content-Type.
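A minimal sketch of the first two points (illustrative of the general pattern, not the PR's exact code; the URL/type filter and manifest/permission wiring are assumptions here):

// Accumulate decoded text across .ondata calls; flush once in .onstop.
function listener(details) {
  let filter = browser.webRequest.filterResponseData(details.requestId);
  let decoder = new TextDecoder("utf-8"); // charset detection omitted here
  let encoder = new TextEncoder();
  let fullStr = "";

  filter.ondata = event => {
    // {stream: true} is critical: a multi-byte character may be split
    // across chunks, and the decoder must carry that state forward.
    fullStr += decoder.decode(event.data, {stream: true});
  };
  filter.onstop = () => {
    // A final argument-less decode() flushes any bytes still buffered.
    fullStr += decoder.decode();
    filter.write(encoder.encode(fullStr.replace(/Test/g, "WebExtension Test")));
    filter.close();
  };
}

browser.webRequest.onBeforeRequest.addListener(
  listener, {urls: ["<all_urls>"], types: ["main_frame"]}, ["blocking"]);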

As an observation, browser support for two things would clean this up tremendously:

  • Exposing the detected character encoding on the webRequest details (at least insofar as it is detectable from headers) would avoid manual, poor detection logic like what I have here.
  • Adding support for different encodings for TextEncoder so that the original character encoding can be preserved rather than always converting to UTF-8.

At any rate, let me know what you think! I hope this will help provide a bit stronger starting point for others using webRequest.

@wingman-jr-addon
Author

wingman-jr-addon commented Nov 7, 2020

@caitmuenster Here is the PR that I was curious about; you can see the question I posted on Discourse as well if you're curious. Thanks! (wingman-jr-addon and jessetrana are both me here, zephyr on Discourse)

@caitmuenster
Contributor

Thanks @wingman-jr-addon! Adding this to @Rob--W's queue for review. :)

@Rob--W self-requested a review on November 10, 2020
@Rob--W (Member) left a comment

In an example we need to strike a balance between completeness of code and simplicity of code. Currently it is neither simple nor complete - see my comments below.

// taken on by the browser. Here, a simplified approach is taken
// and the complexity is hidden in a helper method.
let decoder, encoder;
[decoder, encoder] = detectCharsetAndSetupDecoderEncoder(details);
Member

The logic is complicated but still not a full solution to the problem.

For simplicity, I recommend checking whether the response is (X)HTML (to ignore responses such as images, media, and downloads), and only proceeding if needed. Then the MIME/charset detection logic can be greatly simplified (see also the comment below about charset detection).

You should also check that the HTTP status code is OK before attaching a stream filter.
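A rough sketch of those pre-checks (assuming an onHeadersReceived listener registered with the "responseHeaders" option; the exact MIME list is illustrative):

// Only filter successful (X)HTML responses; leave everything else alone.
function shouldFilter(details) {
  if (details.statusCode < 200 || details.statusCode >= 300) {
    return false; // skip redirects, errors, not-modified responses, ...
  }
  let header = details.responseHeaders.find(
    h => h.name.toLowerCase() === "content-type");
  if (!header) {
    return false; // no declared type: the browser will sniff, so stay out
  }
  let mimeType = header.value.split(";")[0].trim().toLowerCase();
  return mimeType === "text/html" || mimeType === "application/xhtml+xml";
}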

Author

The note about the complex logic is discussed in the other relevant comments below, but duly noted.

I agree it would be better to move the logic for detecting the supported MIME types as a pre-check here.

  • Pre-check MIME types

Also agreed on the HTTP status code check.

  • Add HTTP status code check

// Just change any instance of Example or Test in the HTTP response
// to WebExtension Example or WebExtension Test.
let mutatedStr = fullStr.replace(/Example/g, 'WebExtension Example');
mutatedStr = mutatedStr.replace(/Test/g, 'WebExtension Test');
Member

Remove the second one. It is not necessary for the example.

Author

It is necessary for the example. I added it because the W3C test suite pages do not have the word "Example" but do have "Test" so it shows that the plugin is mutating the pages when viewing the test suite - I think that's a critical aspect to preserve. However, in the name of simplicity, I could remove the "Example"/example.com from the addon; I was trying to keep a nod to the original behavior but perhaps it just feels kludgy now.

  • Remove extraneous Example/example.com

Member

If you'd like to replace both, you could do /Example|Test/ to demonstrate how one can replace multiple words (and if you wish to prepend "WebExtension", use 'WebExtension $&'). If you don't know about $&, see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace#Specifying_a_string_as_a_parameter
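In code, that suggestion is a single call:

// $& inserts the matched text itself, so one pass handles both words.
let mutatedStr = fullStr.replace(/Example|Test/g, 'WebExtension $&');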

Author

Thanks - I think your first comment made me realize that I had a conflict between "make everyone happy" and "have a better-focused example", so I'm choosing the latter and keeping it simple. As venerable as example.com is, it gets enough press elsewhere.

filter.disconnect();
filter.onerror = e => {
try {
filter.close();
Member

Why do you want to explicitly handle errors here?

filter.onerror could occur when the response is redirected (when you've failed to check for status codes, as you did before),
and also repeatedly when an out-of-memory issue occurs - https://bugzilla.mozilla.org/show_bug.cgi?id=1404381

Author

Good question - this part of the error handling isn't particularly relevant to the heart of the example either.

  • Remove explicit error handling

// character - the stream parameter is critical for success as this
// method gets called multiple times.
let str = decoder.decode(e.data, {stream: true});
fullStr += str;
Member

Buffering everything before flushing is simple, but not user friendly.

Instead of buffering everything at once, you can consider flushing all but the last few bytes/characters. After all, if the string to replace is at most 7 characters, then you can safely flush everything except the last 6 characters, to handle this case:

[prefix this can be flushed]Exampl
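A sketch of that partial flush (assuming the filter/decoder/encoder setup from the earlier sketch; the 6-character holdback corresponds to the 7-character term "Example"):

// Hold back maxLen - 1 characters so a match can never straddle a flush.
const HOLDBACK = "Example".length - 1;
let pending = "";

filter.ondata = event => {
  pending += decoder.decode(event.data, {stream: true});
  let replaced = pending.replace(/Example/g, "WebExtension Example");
  // Flush all but the tail that might be the start of the next match.
  let flushUpTo = Math.max(0, replaced.length - HOLDBACK);
  filter.write(encoder.encode(replaced.slice(0, flushUpTo)));
  pending = replaced.slice(flushUpTo);
};
filter.onstop = () => {
  pending += decoder.decode();
  filter.write(encoder.encode(pending.replace(/Example/g, "WebExtension Example")));
  filter.close();
};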

Author

That's a good point. This is an artifact of the code I was porting this from (where I needed to do a regex search of unknown length) and can be improved. I think the best solution for this would be a streaming regex transformer - I don't suppose something along those lines exists without external library support? That would make my day. Otherwise, I will change it as requested.

  • Streaming replacement

Member

You don't need an external library for this; the logic can be written in a concise way. See the brief explanation below for the algorithm. It is definitely more involved than buffering everything and flushing at once, though. If the example becomes very unwieldy once this is implemented, then buffering may not be that bad an idea.

Instead of using replace with a regexp and a string, you can use the .exec method of a regular expression with the g flag to find matches. Flush the prefix, append the replacement, and continue (if you are cutting the string, reset lastIndex to 0 to resume from the start; if you aren't cutting the string, you can just call .exec again and it will continue searching from position lastIndex in the string).

When you don't find a match, and know that the regular expression has a maximum match length (such as 7 in the case of /Example|Test/g), then cut the remaining string in two parts: keep the last 6 characters (= maximum match length minus one) for the next time and flush the rest (this is what I explained in my previous comment).

Note: when your regular expression does not have a fixed match length (e.g. /A{1,3}/g or /[0-9]+/g), then you need extra logic: to make sure that as much is matched as possible, when the regular expression matches at the end of the string collected so far, you should not flush, and should keep the string for the next iteration, in case the regexp is going to match more.
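That algorithm, roughly (an illustrative sketch with assumed names; MAX_MATCH_LEN is 7 for /Example|Test/):

const PATTERN = /Example|Test/g;
const MAX_MATCH_LEN = 7; // "Example".length
let pending = "";

// Returns the text that is safe to flush for this chunk; call with
// isLast = true from .onstop to drain whatever is left.
function processChunk(chunk, isLast) {
  pending += chunk;
  let out = "";
  let lastFlushed = 0;
  let match;
  while ((match = PATTERN.exec(pending)) !== null) {
    out += pending.slice(lastFlushed, match.index) + "WebExtension " + match[0];
    lastFlushed = PATTERN.lastIndex;
  }
  if (isLast) {
    out += pending.slice(lastFlushed);
    pending = "";
  } else {
    // Keep the last MAX_MATCH_LEN - 1 characters; they may be the start
    // of a match that only completes in the next chunk.
    let keepFrom = Math.max(lastFlushed, pending.length - (MAX_MATCH_LEN - 1));
    out += pending.slice(lastFlushed, keepFrom);
    pending = pending.slice(keepFrom);
    PATTERN.lastIndex = 0; // the string was cut, so restart from its beginning
  }
  return out;
}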

Author

I opted to go for the simpler static string streaming replacement, and added comments around its limits.

// 1) Detects the charset for the TextDecoder so that bytes are properly turned into strings
// 2) Ensures the output Content-Type is UTF-8 because that is what TextEncoder supports
// 3) Returns the decoder/encoder pair
function detectCharsetAndSetupDecoderEncoder(details) {
Member

"detect and set up decoder/encoder" suggests that it just does detection and creating new instances of the decoder/encoder.

The method however also mutates the response headers.

Author

Yes, I don't like it either. I believe that the real issue here is that TextEncoder does not support anything other than UTF-8, so we must force the content type on the output, which is awkward. As noted in the PR description, the code would benefit from having TextEncoder support more types to avoid this hack. Can you think of a way to preserve the original content encoding without the TextEncoder polyfill? I'd rather bypass the mutation altogether because it seems a bit treacherous.

Member

For search keywords that are compatible with the used encodings (e.g. letters from the latin alphabet have the same bytes in utf-8, iso-8859-2, ..), you don't need to worry about using TextEncoder. For fixed-length search terms, you could convert the string to bytes and search and replace on typed arrays instead of strings. This may be even more performant than converting to strings and back.

But for the example, I think that it is fine to convert to UTF-8. First, it shows the significance of character encodings.
For those who are not that well versed with character encodings, it is a safe default to choose, because utf-8 is very common.
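A sketch of that byte-level approach (it assumes the needle's byte sequence is encoding-stable, as ASCII is in utf-8 and the iso-8859 family; matches straddling chunk boundaries are ignored for brevity):

// Replace every occurrence of `needle` in `haystack` (both Uint8Array)
// with `replacement`, without decoding to a string at all.
function replaceBytes(haystack, needle, replacement) {
  let parts = [];
  let start = 0;
  for (let i = 0; i <= haystack.length - needle.length; i++) {
    let hit = true;
    for (let j = 0; j < needle.length; j++) {
      if (haystack[i + j] !== needle[j]) { hit = false; break; }
    }
    if (hit) {
      parts.push(haystack.subarray(start, i), replacement);
      start = i + needle.length;
      i += needle.length - 1; // continue after the match
    }
  }
  parts.push(haystack.subarray(start));
  let out = new Uint8Array(parts.reduce((n, p) => n + p.length, 0));
  let offset = 0;
  for (let p of parts) { out.set(p, offset); offset += p.length; }
  return out;
}

// e.g. inside .ondata:
//   filter.write(replaceBytes(new Uint8Array(event.data),
//       new TextEncoder().encode("Test"),
//       new TextEncoder().encode("WebExtension Test")));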

}

// Detect the charset from Content-Type
function detectCharset(contentType) {
Member

The charset detection logic as implemented below is incorrect (e.g. when multiple charset parameters are set, the last one should be used; when multiple MIME types are set, the previously set charset should be forgotten). Let's keep the example simple, however, and make it more obvious that the implementation is incomplete.

Here is an example of a more complete implementation that also covers some edge cases: Rob--W/open-in-browser@a6b926e

As you can see, it's quite involved - do not put the code in the PR. You can link to it if you wish though.
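For reference, a deliberately naive shape of such a function, with the incompleteness called out in a comment (an illustrative sketch, not the PR's final code):

// Naive charset extraction from a Content-Type value. Intentionally
// incomplete: it ignores quoting, multiple charset parameters, multiple
// MIME types, etc. - see the commit linked above for a fuller treatment.
function detectCharset(contentType) {
  let match = /charset=([^;]+)/i.exec(contentType);
  return match ? match[1].trim().toLowerCase() : "utf-8"; // assumed default
}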

Author

There's wisdom here; thanks for the link.
I've been thinking about your comment about an obviously incomplete implementation. While the idea of a complete solution here is appealing, the simplified solution will likely cover 99% of the cases and the comment will make the careful reader aware of the problem. Lack of awareness of this problem - albeit obvious in retrospect - is why I'm wanting to contribute back a more robust example in the first place; a user reported a bug to me in a review of my addon. I'd like others to avoid the same fate, as their users may not be as considerate as mine was.

  • Make incompleteness obvious

Given both the complexity of determining the charset as well as the need for it to be accurate in order to filter text properly, do you think the browser-predicted charset would ever be a candidate to expose on the details object? It just seems like right now it would be incredibly difficult to create a truly correct implementation - do you know of any extant addons that are actually correct in this regard?

Member

Given both the complexity of determining the charset as well as the need for it to be accurate in order to filter text properly, do you think the browser-predicted charset would ever be a candidate to expose on the details object?

What a good idea. I have thought of it before and posted a feature request at https://bugzilla.mozilla.org/show_bug.cgi?id=1425479
Unfortunately the feature request was denied. If you post a new bug that shows use cases and why alternatives aren't a good fit, the feature request can be re-evaluated.

It just seems like right now it would be incredibly difficult to create a truly correct implementation - do you know of any extant addons that are actually correct in this regard?

The Open in Browser add-on that I linked before was authored by me, and modelled after Firefox's implementation (which includes non-standard behavior).

// Historically, detecting character encoding has been a tricky task
// taken on by the browser. Here, a simplified approach is taken
// and the complexity is hidden in a helper method.
let decoder, encoder;
Member

Keep the TextDecoder and TextEncoder constructors here as before to improve readability. You can declare a charset variable if you want to use something else instead of UTF-8.

Author

Well, except that I need to potentially modify the Content-Type output to be UTF-8 if the input type is not already; that would drive having something like detectCharset and forceCharset as separate calls. Would you find that organization to be preferable to the problematic detectCharsetAndSetupDecoderEncoder above?

Member

In the simplest case, where the Content-Type needs to be an exact match for something that looks like "text/html; charset=utf-8", you don't need to change the response header. But even if you do change the header, you can still improve the readability by keeping the main logic in the listener and using small helper functions to avoid too-large function bodies. For example:

function listener(details) {
  let {responseHeaders} = details;

  let ctHeader = responseHeaders.find(h => h.name.toLowerCase() == "content-type");
  // TODO: Verify that ctHeader is supported.
  // If the Content-Type header is not set, the browser is going to do
  // content-sniffing, and we should also return to avoid trouble
  // (e.g. breaking downloads, PDFs, videos, ...).
  if (!ctHeader || !/^text\/html\b/i.test(ctHeader.value)) { // illustrative check
    return;
  }

  let charset = detectCharset(ctHeader.value) || "utf-8";
  let decoder = new TextDecoder(charset);
  let encoder = new TextEncoder();
  // ... attach the stream filter, etc.
}

Author

I took another run at the code. I was able to clean it up in a few places as per your comments - keep the header object reference rather than index, return earlier/don't guess, etc. Still, a full half of the code is dedicated to character encoding alone; I can't say that the code is elegant.

@wingman-jr-addon
Author

Thanks @Rob--W for taking the time to look at this and provide helpful feedback, and thanks @caitmuenster for getting this going. I haven't gotten much feedback from the community on this aspect of using WebRequest and yet I think it is so vital to any addon trying to filter text. I'll take a look at your comments; I think you might have some better ideas on how to solve some of the problems I'm facing but it may take a round or two of feedback until I get into the right spot. Thanks!

@wingman-jr-addon
Author

@Rob--W I've taken another pass at the code and have closed out the open tasks I set up on the review. Ultimately, I still dislike the code, but I think it's an improvement and thanks for your help. Let me know if it passes muster.

I'd also like to promote to the top level the conversation we had going in the sub-thread about e.g. exposing the detected MIME type. We were saying:

Given both the complexity of determining the charset as well as the need for it to be accurate in order to filter text properly, do you think the browser-predicted charset would ever be a candidate to expose on the details object?

What a good idea. I have thought of it before and posted a feature request at https://bugzilla.mozilla.org/show_bug.cgi?id=1425479
Unfortunately the feature request was denied. If you post a new bug that shows use cases and why alternatives aren't a good fit, the feature request can be re-evaluated.

Also, the link to your original GitHub issue in there has quite valuable implementation details so I'd like link to it as well: https://github.com/Rob--W/open-in-browser/issues/5

I'd like to discuss this issue more. I think - assuming there is enough use of webRequest filtering in general - that there is a strong case for a bug because it is still nigh on impossible to do basic text replacement. I know just enough now of the complexities to understand also why it is such a pain for the API - basically the core question is: at what point is the final MIME type known? I think given the complexity of the process - and the fact that generally the browser wholly owns and controls it - having a scenario where one or more addons are trying to interact with it will be tricky. It's not quite the "halting problem" but I'm pretty sure something to do with MIME type detection could be Turing complete.

As I see it, there are many possible depths of solutions that could work to varying degrees:

  1. The status quo. Push all detection/re-encoding on to the addon. This seems bad because the logic is complicated and duplicates what the browser is already doing - poorly. Coverage: 0%
  2. Provide a solution that does detection solely based on headers - as far as the addon is concerned MIME type would be fixed after that. Also, extend TextEncoder so that it can re-encode to the detected MIME type to avoid requiring Content-Type mutations. This would allow for basically one-line implementations, but would be incorrect for situations where the MIME type is in the meta or is sniffed. Coverage: maybe 98%?
  3. Delay processing by the addon until the content type has been sniffed (up to the first 512 bytes). Unfortunately I think the addon will likely still want to change e.g. the Content-Type header based on this, so that sounds like a potential headache for implementation. Otherwise I think something like an onContentTypeSniffed as a new hook (or perhaps as an option for onHeadersReceived) could be tremendously valuable. Yes, it would also cause problems of a sort if multiple addons chained this. So be it. Coverage: maybe 99.5%?
  4. Expose the current MIME type as the browser continues detection as well as the ability to instruct the browser to change MIME type along the way. Complexity? High. Difficulty to implement? Also probably high. Coverage: maybe 99.9%?

So I'd probably lean towards something more in the vein of option 2 or option 3; I think option 3 has the potential to be more problematic from an implementation standpoint.

After pondering the problem for another 3 years, what thoughts have you had on the matter?

@wingman-jr-addon
Author

Also, I did a little research tonight. I searched GitHub for "TextEncoder" and "filterResponseData" and surveyed many of the results. Not exactly the most scientific - and I likely have errors in my hasty analysis - but here are my rough findings:

  • 83 source files were examined. I tried to not do too many duplicates, and avoided e.g. forks of FF unit tests.
  • 3 of the sources seemed to recognize and fix both decoding and encoding based on charset etc. (uBlock, Scriptlet Doctor, AdGuard)
  • 2 other sources had at least partial mitigations for both decoding and encoding. (LibreJS, Header Editor)
  • 2 other sources recognized the issue but did not attempt to fix it.
  • 18 or so were direct copies/modifications of the http-response example, or inherited the .disconnect logic in the first .ondata that is currently present in this example.
  • Many sources did not do HTTP status filtering.
  • Many sources did decode using {stream: true} but did not flush in e.g. .onstop

Those are rather abysmal results, but not surprising.
AnalyzeWebRequestUsage.zip

This indicates to me that there is a demonstrated need for awareness as well as mechanisms to make some level of correctness easier - character encoding seems like something that should be a minor footnote for the average addon developer.

@rebloor
Collaborator

rebloor commented Aug 11, 2023

@Rob--W I know this will be testing the memory a little but, based on your comment, are you happy with the example? If you are, I will test it to make sure it works as expected and then merge the changes.

@wingman-jr-addon
Author

wingman-jr-addon commented Aug 14, 2023

@rebloor @Rob--W I saw that activity popped up on this and thought I'd chime in a bit as I've gotten more experience with the issue as my addon has matured. Unfortunately it hasn't translated to the work here (yet?) as I thought this effort was dead. If there's interest I can take another crack at freshening this up.

In short, I now think that probably the best way to handle the issue is to provide a special linked pair of TextDecoder/TextEncoder that hide the detection altogether and allow the developer to just feed in the stream, get a string, and reencode the string. You still need the raw bytes in many cases but additionally for text handling a built-in like this would help greatly.

The big reason for this approach is primarily that I've now seen many of the different methods that can in theory occur actually get reported as bugs - including more difficult ones like the <meta> approach, which requires sniffing the stream. Handling the work to ensure the encoder and decoder stay in sync as the detected type changes is surely something better left to the browser.

(One further note/update: not having TextEncoder support encoding into all the encodings TextDecoder can support is a frustrating limitation as I can't simply sniff the charset, decode, and then encode cleanly - this is part of what would be nice with the TextEncoder pair being linked.)

I'd also be curious to hear how @Rob--W 's thoughts have evolved now over the past 6 years since his original feature request.

@Rob--W
Member

Rob--W commented Sep 30, 2023

@rebloor @Rob--W I saw that activity popped up on this and thought I'd chime in a bit as I've gotten more experience with the issue as my addon has matured. Unfortunately it hasn't translated to the work here (yet?) as I thought this effort was dead. If there's interest I can take another crack at freshening this up.

Oops, this fell through the cracks. I'll perform a review once this is freshened up.

(The rest of the text below is part of the ongoing discussion about a feature request for Firefox; you can stop reading if you are only interested in the PR)

In short, I now think that probably the best way to handle the issue is to provide a special linked pair of TextDecoder/TextEncoder that hide the detection altogether and allow the developer to just feed in the stream, get a string, and reencode the string. You still need the raw bytes in many cases but additionally for text handling a built-in like this would help greatly.

This sounds very convenient from the extension developer's point of view, but a huge burden to the extension framework. Due to content sniffing, the target encoding may sometimes be dependent on the data itself (e.g. binary vs something else). And sometimes there is no lossless way to encode one in another.

The big reason for this approach is primarily that I've now seen many of the different methods that can in theory occur actually get reported as bugs - including more difficult ones like the <meta> approach, which requires sniffing the stream. Handling the work to ensure the encoder and decoder stay in sync as the detected type changes is surely something better left to the browser.

I suppose that you are primarily interested in replacing specific parts of the content without changing the encoding. When the byte sequence is unambiguous (the keyword and replacement), you could also operate on the bytes directly, without using TextEncoder/TextDecoder.

(One further note/update: not having TextEncoder support encoding into all the encodings TextDecoder can support is a frustrating limitation as I can't simply sniff the charset, decode, and then encode cleanly - this is part of what would be nice with the TextEncoder pair being linked.)

Non-utf8 has little utility in web browsers, and having encoders for non-utf8 encodings would bloat the implementation and result in a larger application binary, as well as a larger attack surface. So while I understand the convenience to developers, I also recognize the issue with having this functionality natively. Note - if you really want to, there is of course nothing stopping you from bundling your own encoders, but that's probably not optimal.

I'd also be curious to hear how @Rob--W 's thoughts have evolved now over the past 6 years since his original feature request.

On https://bugzilla.mozilla.org/show_bug.cgi?id=1425479 ? When I filed that issue I was an external contributor with experience as an extension developer, and browser engineering in Firefox and Chromium (as a volunteer). Now I have been working for a few years as a browser engineer at Mozilla, and one thing that stands out is that there are significantly more ideas than time available to implement them all. Therefore even if an idea on its own seems useful, it may not make the cut due to prioritization. Especially if the feature is niche but requires considerable engineering effort and maintenance.

@rebloor
Collaborator

rebloor commented Oct 1, 2023

@wingman-jr-addon please ping Rob when you've had the opportunity to address his feedback and make any other changes.

@wingman-jr-addon force-pushed the make-http-response-example-more-robust branch from a0377db to 6a5b6da on October 27, 2023