Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: avoid cheerio in favor of regex #404

Merged
merged 43 commits into from May 17, 2024

Conversation

GalacticHypernova
Copy link
Contributor

Types of changes

  • Bug fix (a non-breaking change which fixes an issue)
  • New feature (a non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Description

This PR integrates native solution via regex to load and parse HTML, in order to improve runtime performance.
Closes #341

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes (if not applicable, please state why)

Copy link

vercel bot commented Mar 23, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
nuxt-security ✅ Ready (Inspect) Visit Preview 💬 Add feedback May 10, 2024 4:55pm

@vejja
Copy link
Collaborator

vejja commented Mar 25, 2024

Hey @GalacticHypernova @Baroshem

Here is a description of how it works.
The 3 main routines are 03-subresourceIntegrity, 04-cspSsgHashes and 99-cspSsrNonces. Each one is executed consecutively in this order

  • 03-subresourceIntegrity needs to find all external script tags and detect the name of the file. Integrity hashes were already pre-computed at build time so once the name is found, the integrity tag can be inserted immediately. The regex needs to make the difference between external scripts (<script ... src="filename.js" .../>) and inline scripts (<script ...>//some code</script>) because only external resources can carry integrity. Basically the endgame is to transform <script ... src="filename.js" .../> into <script ... src="filename.js" ... integrity="xxx"/>, so this modifies the HTML source. Same for external links, except that not all links are eligible so the regex should also parse the rel attribute of the link.

  • 04-cspSsgHashes is a bit of the contrary. Only inline scripts are hashed so the regex needs to do the opposite: only find inline elements and ignore external ones. The problem is that integrity hashes for external resources must also be added to the CSP header in the SSG case, so we still need to parse the external scripts for that purpose; However this is not extremely difficult because at the previous step all integrity tags were already inserted so it is easy to parse and find the value. Same for inline styles and links. In short, 04-cspSssgHashes does not modify the HTML, it only collects the hashes that need to be inserted in the CSP directives.

  • 99-cspSsrNonces does not need to make the difference between external resources and inline elements, because the nonce is added to all elements. The integrity hashes were already added in the first step, and they don't need to be repeated in the CSP header so we don't care about integrity tags at all here. 99-cspSsrNonces modifies the HTML source by inserting the nonce attribute to the relevant tags.

Now at a higher level, these three routines are called at the time the html is rendered, with the render:html hook. This hook has an html variable which is an object with keys such as head, body, etc. Normally the scripts should be in head but I found out that sometimes they can be in other sections, so for the sake of completeness we inspect all sections.

Cheerio is used to parse the content of each section and find out the relevant parts of the HTML that need to be modified.
Cheerio takes an HTML string as input, and transforms it into an AST tree of nodes. It then provides very convenient utilities to find elements by tag, make the difference between external and inline, inspect the attributes, modify the nodes, etc. But it is too slow.

This is why I added 02a-preprocessHtml and 99b-recombineHtml : it avoids parsing the same sections each time. So 02a-preprocessHtml starts by loading each of the sections in the cheerio parser, via cheerio.load. Each parsed section is saved under context.cheerios, which is just a temporary object to hold these results. Then 03-subresourceIntegrity, 04-cspSsgHashes and 99-cspSsrNonce can access the context.cheerios[section] temporary results, instead of having to cheerio.load the section over and over again. At the end we just recombine everything back into HTML with 99b-recombineHTML.

As you can see 02a and 99b were added afterwards, for performance reasons only.
Hope this helps

@GalacticHypernova
Copy link
Contributor Author

GalacticHypernova commented Mar 25, 2024

@vejja thanks for the explanation! This really does help a lot, and will be very useful for the PR. Does that mean no further modification needs to be made for sections in 02a (before the 3 routines)? No need to trim or anything like that? I was just wondering if it actually needs to be modified at all.
P.S. I already have an idea in mind to handle caching these so that there won't be a need to iterate over everything again.

@vejja
Copy link
Collaborator

vejja commented Mar 25, 2024

@vejja thanks for the explanation! This really does help a lot, and will be very useful for the PR. Does that mean no further modification needs to be made for sections in 02a (before the 3 routines)? No need to trim or anything like that? I was just wondering if it actually needs to be modified at all. P.S. I already have an idea in mind to handle caching these so that there won't be a need to iterate over everything again.

I think if you can get rid of 02a and 99b, it would be fantastic. Honestly they are a legacy of trying to quick-fix the cheerio issues

@GalacticHypernova
Copy link
Contributor Author

GalacticHypernova commented Mar 25, 2024

I think if you can get rid of 02a and 99b, it would be fantastic. Honestly they are a legacy of trying to quick-fix the cheerio issues

👍
I mean, I do have a caching strategy in mind which might require 02a, I'm not sure yet, but 99b I do believe will be unnecessary.

@GalacticHypernova
Copy link
Contributor Author

GalacticHypernova commented Mar 25, 2024

If I am correct (I haven't worked much with nitro's HTML hooks myself), the bodyAppend/prepend/etc return strings, right? They just return the stringified HTML? Because I see a $('script') inside a forEach so I want to make sure on what I actually need to do. In case it's an array within an array (like an array in each section) I would need to adjust the regex accordingly.

@vejja
Copy link
Collaborator

vejja commented Mar 25, 2024

@GalacticHypernova
Copy link
Contributor Author

Thanks! I remembered that vaguely but I also didn't get the chance to check it myself.

@GalacticHypernova
Copy link
Contributor Author

@vejja in 03, can external scripts have <script src=...></script> syntax, or is it guaranteed to be <script src=.../>

@vejja
Copy link
Collaborator

vejja commented Mar 26, 2024

@vejja in 03, can external scripts have <script src=...></script> syntax, or is it guaranteed to be <script src=.../>

I think in theory it should always be the first syntax as <script> is not supposed to be self-closing. So my mistake for suggesting the other way round.
However there used to be many issues around that, so I guess it’s still possible in theory to have <script …/>

Update:
Upon reasearching the topic, I think we should always assume it is going to be <script ...></script>.

Let's keep our lives simple and not support <script .../>. It is not HTML spec-compliant, and Nuxt does not use this syntax. Trying to have a regex that catches a useless scenario is likely to introduce more bugs than solve problems.

Thanks for bringing this up @GalacticHypernova

@vejja
Copy link
Collaborator

vejja commented Mar 26, 2024

@GalacticHypernova
Copy link
Contributor Author

GalacticHypernova commented Mar 26, 2024

@vejja in 03, can external scripts have <script src=...></script> syntax, or is it guaranteed to be <script src=.../>

I think in theory it should always be the first syntax as <script> is not supposed to be self-closing. So my mistake for suggesting the other way round. However there used to be many issues around that, so I guess it’s still possible in theory to have <script …/>

Update: Upon reasearching the topic, I think we should always assume it is going to be <script ...></script>.

Let's keep our lives simple and not support <script .../>. It is not HTML spec-compliant, and Nuxt does not use this syntax. Trying to have a regex that catches a useless scenario is likely to introduce more bugs than solve problems.

Thanks for bringing this up @GalacticHypernova

Thanks!
In theory I could still support that syntax relatively easily, it depends on whether nitro handles the html conversion to the spec compliant version or not, otherwise user-supplied scripts may not have the integrity if they do use the self-closing syntax.

Also if you're wondering what regex we used to have before Cheerio, you can check out: https://github.com/Baroshem/nuxt-security/blob/v1.0.0-rc.1/src/runtime/nitro/plugins/02-cspSsg.ts https://github.com/Baroshem/nuxt-security/blob/v1.0.0-rc.1/src/runtime/nitro/plugins/99-cspNonce.ts

Yea, I made the patterns a bit more specific so that it would be easier to handle, though we would likely want to test its performance.

@vejja
Copy link
Collaborator

vejja commented Mar 26, 2024

Thanks! In theory I could still support that syntax relatively easily, it depends on whether nitro handles the html conversion to the spec compliant version or not, otherwise user-supplied scripts may not have the integrity if they do use the self-closing syntax.

Up to you. You're absolutely right that otherwise they won't have integrity if they use the self-closing syntax

@@ -2,6 +2,8 @@ import { useStorage, defineNitroPlugin, getRouteRules } from '#imports'
import { isPrerendering } from '../utils'
import { type CheerioAPI } from 'cheerio'

const SCRIPT_RE = /<script((?=[^>]+src="[\w:.-\\/]+")(?![^>]+integrity="[\w-]+")[^>]+)\/>/g
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Progress report on the script regex:

const SCRIPT_RE = /<script((?=[^>]+src="[\w:.-\\/]+")(?![^>]+integrity="[\w-]+")[^>]+)(?:\/>|><\/script>)/g

const a =[
  `<script src="a.k"/><script src="b"></script>`,
  `<div>`,
  `<script>c</script>`,
  `<script src="a" integrity="a"></script> <script src="b"/>`
]
console.log(a.map(r=>r.replace(SCRIPT_RE,(match, mb)=>{return `<script integrity="a"${mb}></script>`})))

image

It works for both self-closing and non-self-closing script tags and automatically converts them to non-self-closing.

We could go out of the way to warn users during build to not use self-closing syntax, but I'll leave that up to you.

Copy link
Collaborator

@vejja vejja Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice !

Can you test also for

  • integrity attribute before src attribute
  • random attribute (e.g. data-xxx) before and after src attribute
  • no src attribute with self-closing syntax
  • many scripts in the same string

As well as edge cases such as :

  • <script>console.log('I like to mess up with javascript by printing </script>')</script>
  • <p> I like to mess up with pre tags such as <pre> a code example: <script></script> </pre> </p> or an even more evil version: <p> I like to mess up with pre tags such as <pre> a code example: <script> that is non-closing by mistake </pre> </p>
  • <script><!-- I like to mess up with HTML comments such as </script> --></script> or its variant: <script><!-- I like to mess up with HTML comments such as <script> --></script>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do and post pictures soon!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

Response for no src self-closing script:
image

Tested with this sample (same as in the PR):

const SCRIPT_RE = /<script((?=[^>]+src="([\w:.-\\/]+)")(?![^>]+integrity="[\w-]+")[^>]+)(?:\/>|><\/script>)/g
const sriHashes = {
  "a.k":"a",
  "b":"c",
  "script.js":"newIntegrity"
}
const a = [
  `<script src="a.k"/><script src="b"></script>`, //double scripts
  `<div>`, //non-script
  `<script>c</script>`, //inline script
  `<script src="a" integrity="a"></script> <script src="b"/>`, //mixed script (one with integrity, one without)
  `<script integrity="integrity" src="script.js"></script>`, //integrity before src
  `<script data-xxx src="script.js"></script>`, //random attr before
  `<script src="script.js" data-yyy></script>`, //random attr after
  `<script data-xxx src="script.js" data-yyy></script>`, //random attr before and after
  `<script src="script1.js"></script><script src="script2.js"></script>`, //multiple scripts in the same line
  `<script/>`, // self closing with no src
  `<script>console.log('I like to mess up with javascript by printing </script>')</script>`, //first edge case - print
  `<p> I like to mess up with pre tags such as <pre> a code example: <script></script> </pre> </p>`, // second edge case - <pre>
  `<p> I like to mess up with pre tags such as <pre> a code example: <script> that is non-closing by mistake </pre> </p>`, // second edge case - <pre> (evil version)
  `<script><!-- I like to mess up with HTML comments such as </script> --></script>`, // third edge case - comment inside script
  `<script><!-- I like to mess up with HTML comments such as <script> --></script>` // third edge case - comment inside script (variant)
];
console.log(a.map(r=>r.replace(SCRIPT_RE,(match, rest, src: Partial<keyof typeof sriHashes>)=>{
  const hash = sriHashes[src]
  if (hash) {
    return `<script integrity="${hash}"${rest}></script>`
  }
  return match
})))

Please let me know if anything got incorrect results!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing, thanks a lot !!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, one more thing. Would it be too much of an issue if "valid" external scripts would get integrity? I just tested with console.log(<script src="a.k"></script>) and it returned the the script with the integrity, but I don't really think that would cause any trouble, would it?

I am not too sure to be honest

  • On the one hand, in theory this is a problem that could lead to many issues. We are modifying user code by doing XSS ourselves here, which looks really bad for a security module
  • On the other hand, I don't even know how browsers are supposed to interpret this...

This might be the reason why the author of https://github.com/Baroshem/nuxt-security/blob/v1.0.0-rc.1/src/runtime/nitro/plugins/99-cspNonce.ts was doing some lookbehind search around quotes

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, unfortunately you can't win them all with regex.. But at least the edge cases and potential issues are covered, so that's something

Sorry, I was commenting on your idea of a WeakMap cache
Our comments on integrity injection crossed at the same time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, all good!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be the reason why the author of https://github.com/Baroshem/nuxt-security/blob/v1.0.0-rc.1/src/runtime/nitro/plugins/99-cspNonce.ts was doing some lookbehind search around quotes

Yea, but at the same time it could be something like:

hi'<script>console.log('hi')</script>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not that straight-forward unfortunately, it needs to be very thoroughly planned for it to be executed correctly because there are many things that could go wrong here. But I'm afraid that until that happens, we might want to focus on optimizing cheerio as much as possible.

@GalacticHypernova
Copy link
Contributor Author

Also what about links? Are they always self closing? Or should I treat them the same as script tags?

@vejja
Copy link
Collaborator

vejja commented Mar 26, 2024

Also what about links? Are they always self closing? Or should I treat them the same as script tags?

I think <link> is always empty. It is never self closing, and it has no closing </link> tag.
So it should always be like <link rel="stylesheet" href="example.css">

@GalacticHypernova
Copy link
Contributor Author

GalacticHypernova commented Mar 26, 2024

We might want to look into a hybrid approach, at least until a suitable solution is found for the edge cases that are harder to catch with regex. That is, for external scripts/links/etc, regex would be used, but for inline stuff cheerio. That way it would still increase performance (as it would be used less), but wouldn't compromise integrity entirely (again, as a temporay solution).
Would love to hear your thoughts on this

@vejja
Copy link
Collaborator

vejja commented Mar 26, 2024

We might want to look into a hybrid approach, at least until a suitable solution is found for the edge cases that are harder to catch with regex. That is, for external scripts/links/etc, regex would be used, but for inline stuff cheerio. That way it would still increase performance (as it would be used less), but wouldn't compromise integrity entirely (again, as a temporay solution). Would love to hear your thoughts on this

Isn't an inline script the same as an external script except that it doesn't have an src attribute ?
Let me dig into this, I need to read the HTML spec 🤮🤮🤮

@GalacticHypernova
Copy link
Contributor Author

Isn't an inline script the same as an external script except that it doesn't have an src attribute ? Let me dig into this, I need to read the HTML spec 🤮🤮🤮

Well, not exactly. Assuming the worst case scenario, that everyone using this module is trying to cause harm, there are a lot more ways for inline scripts to cause trouble.

@Baroshem
Copy link
Owner

Baroshem commented May 7, 2024

Hey folks, any progress on this? I would love to see this being added to the upcoming 2.0.0 release :)

@GalacticHypernova
Copy link
Contributor Author

Hey! Sorry for the slow progress, been very busy with real life matters. I'm currently trying to debug the errors, I believe those are the only blockers. If you have any idea why they may be popping up I could push a fix, maybe even today.

@GalacticHypernova
Copy link
Contributor Author

@vejja do you have any idea why the tests are failing? Nothing in this PR seems to be the cause of this. Is the issue in the unit tests?

@vejja vejja marked this pull request as ready for review May 10, 2024 16:56
Copy link
Collaborator

@vejja vejja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now
Only substantial modification was to make 'as' optional in LINK_RE when scanning ssgHashes

Thanks @GalacticHypernova for doing all the heavy-lifting here with regexes
This is a huge improvement that is expected to have a significant impact in edge runtimes
💚

@vejja vejja mentioned this pull request May 10, 2024
7 tasks
@GalacticHypernova
Copy link
Contributor Author

GalacticHypernova commented May 11, 2024

@vejja glad I could help! I just noticed I accidentally made as required when making the pattern, sorry about that and thanks for catching it! It's a bit hard to see that on phone 🥲
If you don't mind I'll run some tests on it to make sure there's no performance impact. I didn't make changes like removing the unused files and working directly with the html object yet because I wanted the PR to work first so there won't be too much work to backtrack when debugging.

@GalacticHypernova
Copy link
Contributor Author

GalacticHypernova commented May 13, 2024

I have 2 main candidates for the optional as pattern:
We can use /<link(?=[^>]+\brel="(stylesheet|preload|modulepreload)")(?=[^>]+\bintegrity="([\w\-+/=]+)")(?=[^>]+\bas="(\w+)"|)[^>]+>/
But we can also use the pattern @vejja suggested, which is /<link(?=[^>]+\brel="(stylesheet|preload|modulepreload)")(?=[^>]+\bintegrity="([\w\-+/=]+)")(?=(?:[^>]+\bas="(\w+)")?)[^>]+>/

Any preferences? The only difference is that in the first pattern the positive lookahead adds an empty alternative with |, whereas the second one makes the inner pattern optional, so it would essentially do the same and check for either the as attribute or an empty alternative.

In terms of performance, from what I tested they perform similarly, but I guess only time will tell with full-fledged websites.

@vejja
Copy link
Collaborator

vejja commented May 13, 2024

Can you give an example where the 2 patterns would give a different result ?

@GalacticHypernova
Copy link
Contributor Author

GalacticHypernova commented May 13, 2024

Can you give an example where the 2 patterns would give a different result ?

I can try to find some edge case but from what I have tested thus far its the same, only expressed differently. The first pattern essentially tells the regex engine to make sure there is either the as attribute or an empty alternative (which would always be true since it's a fallback), and the second pattern (the one you suggested) tells the engine to look for a potential match with the as attribute, meaning it will also accept nothing as it's only a potential match.

@vejja vejja changed the base branch from main to chore/2.0.0-rc.1 May 17, 2024 08:40
@vejja vejja merged commit 60ddf61 into Baroshem:chore/2.0.0-rc.1 May 17, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Performance optimization opportunity
3 participants