Handling escape symbols #1349
The general escape rule is that a backslash swallows the next character (Reference). Within names (or URLs), an escape cannot consume a newline, but within strings it can. In the above diagram, you’ll see exceptions for hex digits. Normally we could ignore this, as escape codes always begin words (or what the spec calls "identifiers") and hex digits are already covered by identifier rules, but the swallowing of whitespace afterward may be important.
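To make the hex-digit exception concrete, here is a minimal, simplified sketch of the escape-consuming rule described above (the helper names are made up and not taken from any existing tokenizer):

// Given the index of the character right after a backslash, return the index
// just past the escape. A hex escape may consume up to six hex digits plus one
// trailing whitespace character; any other escape consumes exactly one character.
// (A real implementation would also treat "\r\n" as a single whitespace.)
const isHex = c => c !== undefined && /[0-9a-fA-F]/.test(c)
const isWhitespace = c => c === ' ' || c === '\t' || c === '\n' || c === '\r' || c === '\f'

function consumeEscape (css, pos) {
  if (isHex(css[pos])) {
    let end = pos + 1
    while (end < pos + 6 && isHex(css[end])) end += 1
    if (isWhitespace(css[end])) end += 1 // the escape also swallows one following whitespace
    return end
  }
  return pos + 1 // the backslash swallows exactly one character
}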
Given the CSS
And it should be more like this:
I only have a small update. I’m also wrapping up school with the kids and need to start work, so please forgive typos or confusing statements. I was able to complete my first pass of the tokenizer this weekend. Here is the source: https://gist.github.com/jonathantneal/dd859523df584b012dfa1cc8795d7ca9

Here are some benchmark results.
And here is an example structure of a token:

// where css is "-.567800E-0012780em"
const tokenExample1 = [
/** @type {TokenType} */ "number",
/** @type {TokenValue} */ "-.567800E-0012780em",
/** @type {TokenStart} */ 0,
/** @type {TokenEnd} */ 19,
/** @type {TokenSplit} */ 17
]
// where css is "1138--thx"
const tokenExample2 = [
/** @type {TokenType} */ "number",
/** @type {TokenValue} */ "1138--thx",
/** @type {TokenStart} */ 0,
/** @type {TokenEnd} */ 9,
/** @type {TokenSplit} */ 4
]
// where css is "foo\:bar"
const tokenExample3 = [
/** @type {TokenType} */ "identifier",
/** @type {TokenValue} */ "foo\\:bar",
/** @type {TokenStart} */ 0,
/** @type {TokenEnd} */ 8,
/** @type {TokenSplit} */ 0
]
/** @typedef {"comment" | "delimiter" | "identifier" | "number" | "string" | "whitespace"} TokenType - Name of the token. */
/** @typedef {string} TokenValue - Value of the token. */
/** @typedef {number} TokenStart - Character index of the source where the token starts. */
/** @typedef {number} TokenEnd - Character index of the source where the token ends. */
/** @typedef {number} TokenSplit - Character index of the source where the token can be split. Splits occur between numbers and their units, and between identifier symbols (like `@` or `#`) and their identifier names. */
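To show how TokenSplit might be consumed, here is a small helper of my own (not part of the gist), assuming that a split of 0 means the token has no split point, as in tokenExample3:

function splitToken ([type, value, start, end, split]) {
  if (split === 0) return [value] // no split point, e.g. a plain identifier
  const at = split - start        // split is a source index, so offset it by the token start
  return [value.slice(0, at), value.slice(at)]
}

splitToken(tokenExample1) // ["-.567800E-0012780", "em"]
splitToken(tokenExample2) // ["1138", "--thx"]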
Thanks a ton for looking into this, @jonathantneal! Maybe @ai can provide some performance tips?
Do we really need to change it in the tokenizer if it affects performance? We can change the parser instead and avoid any performance investigations.
Having some time tonight, I’ve pushed the latest version of the tokenizer to https://github.com/csstools/tokenizer

It includes benchmark tests and the ability to create token JSON files, if that’s at all helpful. These benchmarks were averaged from my local machine:
I don’t know, because I don’t yet understand some parts of the current tokenizer. For instance, the current tokenizer produces a single token from CSS like

I think the answer to your question has a couple of dependencies. The first dependency is to find out the performance impact the current PostCSS parser adds to its tokenizer. If the currently combined tokenizer and parser are slower than this new tokenizer I’m working on, then I think it’s worth it to have a "slower" tokenizer but a faster overall product that handles CSS accurately.
Would you have time to investigate this? My earlier comment describes how escapes are assembled. I’ve also included a small demo of its tokenization on CodePen.
I want to provide more frequent updates here, but my free time is split between this and a significant upgrade to PostCSS Preset Env. I think the next thing I can do is plug the tokenizer into the PostCSS object model and see what the final performance impact is on Bootstrap’s CSS. Let me admit that I have trouble solving the small issue (escapes) without solving a bigger issue (CSS syntax compliance). I do believe solving the bigger issue will knock out the little issue, though. My hope is that I can at least push this far enough along that you and others can bring it across the finish line.
Please let me know if I should post less frequently. Here are my updates:

In Progress: Combine the Experimental Tokenizer with the PostCSS Object Model

I have not yet integrated the experimental tokenizer into the PostCSS object model. I think I need a highly focused 4-hour window to accomplish this; time I have not yet had. Once this is done, I should be able to verify whether the escape symbols are properly handled with this combination. The weekend is soon.

Now: Compression May Improve Performance

I have added a compression step to the experimental tokenizer using terser, and the compressed tokenizer is regularly scoring better benchmarks than the uncompressed version. A unique element of the compression is that it replaces every character code variable with the actual character code number.

As of April 30, 2020, these benchmarks were averaged from my local machine:
Benchmarks are not an exact science for me. I have not changed anything about the experimental tokenizer, but it is more regularly parsing 1.8x slower, an improvement from 2x slower. Still, the compressed experimental tokenizer is benchmarking only 1.4x slower than the current PostCSS tokenizer, and it is picking up 1.2x the tokens.

Soon: Improving Syntax Tests

I would like to migrate some of the css-syntax tests from the web-platform-tests project into this development work. This is lower priority than integrating the Object Model, and I will fight the temptation to do this first. With new versions of these tests, PostCSS could protect itself against syntactically valid edge cases, and even trail-blaze work toward the Houdini Parser API. The WPT tests themselves are not entirely helpful, because browsers are limited to testing the serialization results of CSS.
I recommend copying https://github.com/postcss/postcss/blob/master/lib/tokenize.es6#L253-L254

JS is very slow when you are tokenizing symbol by symbol.
I’m feeling much better after analyzing those jumps. Let me share what I’ve learned; I’d like to know what you think.

I was able to implement the comment jump. There was no perceivable gain. This is what it looked like:

// comment
case FS:
  if (css.charCodeAt(pos + 1) === STAR) {
    typ = 'comment'
    // jump straight to the closing "*/" instead of stepping character by character
    pos = css.indexOf('*/', pos + 2) + 1
    if (pos === 0) {
      // indexOf returned -1: there is no closing "*/", so consume to the end of input
      pos = len
    } else {
      ++pos
    }
  } else {
    ++pos
  }
  break

I was able to implement the string jump. There was no perceivable improvement. This is what it looked like:

// string
case DBLQ:
case SNGQ:
  var quote = cp0 === DBLQ ? '"' : "'"
  var escaped
  do {
    escaped = false
    // jump straight to the next quote of the same kind
    pos = css.indexOf(quote, pos + 1) + 1
    if (pos === 0) {
      // no closing quote: consume to the end of input
      pos = len
      break
    }
    // count the backslashes immediately before the quote;
    // an odd number means the quote itself is escaped, so keep searching
    escapePos = pos - 1
    while (css.charCodeAt(escapePos - 1) === BS) {
      escapePos -= 1
      escaped = !escaped
    }
  } while (escaped)
break So far, it seems those jumps are not beating symbol checks. However, after looking at the next two jumps — I think I see where some of the performance differences are coming from. However, I’m not sure I should implement those jumps. I did not want to implement the parentheses jumps because they skip tokenizing the contents between the parentheses. This is probably one of the reasons the current tokenizer is picking up 49,548 tokens from Put another way; the experimental tokenizer is 1.4 times slower but it is also picking up 1.2 times more tokens, and it is accurately handling the escapes. Up Next: I think I have time tonight to wire in this new tokenizer to the current object model. EDIT: I have added a very crude implementation of a PostCSS-like Object Model. The code is pushed to the same repo.
Yeap, we need to count PostCSS core + PostCSS Value parser time. If a new tokenizer prepares tokens for value/selector parsing, a little slowdown will bring a big speedup in value/selector parsing.
I like this metric :D
I haven’t had much time, but I have added another test that runs PostCSS with Selector and Value parsing. The functionality looks like this, which I hope we’ll find generous:

parsed.root.walk(node => {
  switch (node.type) {
    case 'decl':
      postcssValuesParser(node.value)
      break
    case 'rule':
      postcssSelectorParser.processSync(node.selector)
      break
  }
})

In reality, I think most plugins would run those parsers in separate walks (see the sketch at the end of this comment), and most plugin packs would run those parsers multiple times. Still, here are the updated parse times:
https://github.com/csstools/tokenizer#collecting-postcss-parser-benchmarks
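For context, the separate-walk pattern mentioned above would look roughly like this. This is only a sketch; walkDecls and walkRules are PostCSS’s standard walk helpers, and the two parsers are the same ones used in the snippet above:

parsed.root.walkDecls(decl => {
  postcssValuesParser(decl.value)
})

parsed.root.walkRules(rule => {
  postcssSelectorParser.processSync(rule.selector)
})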
Andrey, please let me know if I’m pestering with these updates or if I can make them more helpful. While analyzing the "slower" parts of the tokenizer, I found that eagerly checking the character ahead improves overall performance, and I have rewritten the tokenizer to take advantage of this. I have also added 2 fields to a token: the opening and closing distances between the meaningful value of a token and its delimiters. Although “delimiter” is a poor term, this refers to the split between things like a number and its unit, or an identifier symbol and its identifier name. Anyway, I think you’ll really like these results!
— From https://github.com/csstools/tokenizer#collecting-postcss-parser-benchmarks
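Purely as a hypothetical illustration of those two distance fields (the real field order and semantics in the experimental tokenizer may differ; this is only my reading of the description above):

// Hypothetical: an at-keyword token for the css "@media".
// The meaningful value is "media"; an opening distance of 1 accounts for the
// leading "@" delimiter, and a closing distance of 0 means there is no
// trailing delimiter (as a unit would be for a number token).
const tokenWithDistances = [
  /** @type {TokenType}  */ "identifier",
  /** @type {TokenValue} */ "@media",
  /** @type {TokenStart} */ 0,
  /** @type {TokenEnd}   */ 6,
  /* opening distance */    1,
  /* closing distance */    0
]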
A new performance breakthrough is awesome 😍. Let’s change the tokenizer in 8.0. Is it possible to use it in the safe parser or SCSS parser without changes? I forked the current one and can fork a new one too if it is impossible to customise it.
@ai what is the status of this feature for version 8? Postponed?
@evilebottnawi yeap, we decided to do it after 8.0, since a new parser would make the 8.0 release too big and force us to wait too long.
Sounds good 👍
Took me a while to analyse this issue because initially I interpreted it as "PostCSS has a bug with escaped symbols". But this is not a bug in PostCSS. As far as I can tell, PostCSS produces the correct AST with or without escaped symbols. It produces a few words too many at the tokenizer level, but this goes away in the parser. The original issue was that a downstream package, which uses the same tokenizer, did not produce the correct AST. Changing the tokenizer here would produce a different output there and might have fixed some issues. From a different perspective, this sounds very risky to me. That the downstream package uses internals from PostCSS has also caused other issues, and most plugins no longer use this dependency because of that. I think this has a lot of overlap with #1145 (comment).
From https://acss.io/#pseudo-classes:
Each line should parse as one word (i.e. one identifier):
This can be verified with https://rawgit.com/tabatkins/parse-css/master/example.html
This issue continues from shellscape/postcss-values-parser#93
With thanks and credit to @nex3 and @ai for identifying this issue.
Update: I have started work on updating the Tokenizer, but I may need assistance as I integrate or abandon the current multichar tokens. I don’t necessarily see how those tokens benefit speed. Any risk of inaccuracy seems too steep a price to pay.
@ai, thank you for the wonderful documentation @ https://github.com/postcss/postcss/blob/7.0.27/docs/architecture.md#tokenizer--libtokenizees6-