New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix listblock preformatted overlap #4055
base: master
Are you sure you want to change the base?
Fix listblock preformatted overlap #4055
Conversation
Some blocks (e.g. listblocks and preformatted) consume newlines in their exit pattern. Some blocks (e.g. listblocks and preformatted) need newlines in their *entry* patterns, but that newline was consumed by the previous block! This updates the lexer to put the newline back when exiting a block consumes the newline.
The entry pattern for preformatted blocks understand and respect listblocks, but the mid-block pattern does not. The syntax is the same, so the pattern match should be the same. Otherwise, preformatted blocks will accidentally turn subsequent listblock lines into preformatted!
We have unit tests for the different syntaxes, here these for tables and preformatted blocks: |
And nice to see the results of the puzzle, thanks for the effort! |
Umm... You imply that I actually understand how any of this actually works... :) A quick glance of the unit tests seem to show that you supply a piece of sample markup source text, then a series of array elements that my guess is are the resulting steps through the tokenization state. Makes sense -- but I don't have a way of getting that state graph. Do you have a suggestion on how I can get that list? If so, I can probably fake my way through the process and create unit tests for various combinations of interleaved preformatted/listblock/table blocks. After all, I don't want this broken in the future, either! :) |
While I'm waiting to find out how to get the graph, here's what I would propose for unit tests:
|
inc/Parsing/Lexer/Lexer.php
Outdated
// If we are closing a block and the last character consumed by our matched | ||
// string is a newline, put it back. See the following for details: | ||
// https://github.com/dokuwiki/dokuwiki/issues/4054 | ||
if ($mode == "__exit" && substr($matched, -1) == "\n") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am by no means an expert on the DW Lexer — but should this not be:
if ($this->isModeEnd($mode) && substr($matched, -1) == "\n") {
Or better yet:
if ($this->isModeEnd($mode) && str_ends_with($matched, "\n")) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AIUI, str_ends_with is PHP8, and DokuWiki still supports PHP 7.
As for isModeEnd: you could be right; I too am not an expert on the lexer. A quick look at the source shows that the code for that function is: return ($mode === "__exit");
so it seems that they will be functionally equivalent. For a person unfamiliar with the lexer, $mode
was right there... :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recently, in the development version polyfills for str_ends_with() and likes are added e.g.
dokuwiki/inc/compatibility.php
Lines 116 to 117 in 2c6bdf5
if (!function_exists('str_ends_with')) { | |
function str_ends_with(string $haystack, string $needle) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No a polyfill has been added recently. So str_*_with functions can now be used.
See
dokuwiki/inc/compatibility.php
Lines 90 to 121 in 2c6bdf5
/** | |
* polyfill for PHP < 8 | |
* @see https://www.php.net/manual/en/function.str-starts-with | |
*/ | |
if (!function_exists('str_starts_with')) { | |
function str_starts_with(string $haystack, string $needle) | |
{ | |
return empty($needle) || strpos($haystack, $needle) === 0; | |
} | |
} | |
/** | |
* polyfill for PHP < 8 | |
* @see https://www.php.net/manual/en/function.str-contains | |
*/ | |
if (!function_exists('str_contains')) { | |
function str_contains(string $haystack, string $needle) | |
{ | |
return empty($needle) || strpos($haystack, $needle) !== false; | |
} | |
} | |
/** | |
* polyfill for PHP < 8 | |
* @see https://www.php.net/manual/en/function.str-ends-with | |
*/ | |
if (!function_exists('str_ends_with')) { | |
function str_ends_with(string $haystack, string $needle) | |
{ | |
return empty($needle) || substr($haystack, -strlen($needle)) === $needle; | |
} | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it doesn't work on my system... I'm running 2023-04-04a "Jack Jackrum", not a development snapshot. I'm assuming that functionality came post-Jack? (And I really don't want to set up a development snapshot at this time: in fact, I'm just manually patching this stuff in the GitHub web UI...) But if the polyfill capability is in the post-Jack snapshot feel free to make that change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel out of my depth regarding the unit tests. I took a look at https://github.com/dokuwiki/dokuwiki/blob/master/_test/tests/inc/parser/lexer.test.php but saw no easy way to add tests for this PR. But maybe I'm looking in the wrong place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how to add a test for the lexer, but we can add tests for the parse handers for listblock, preformatted and tables. Look above for the comment from @Klap-in with proper paths. In that case, we can give it a chunk of markup source, and then compare our statically-added state graph against the graph generated by the running code. I replied with proposed markup that covers all of the updated transitions above in this issue. I also took that same markup and replaced the comments with heading tags and put it on my production server with my PR applied. You can see the result here: https://www.fluidtechservices.com/dokuwiki/blocktest
It shows that the markup is being processed correctly. As opposed to what happens when you paste it into the DokuWiki sandbox, which I've also done: https://www.dokuwiki.org/sandbox:playground
You can see on my server each type of markup results in exactly the right type of HTML. You can see on the sandbox it does not.
Also, I searched for some doc on unit testing and found this: https://www.dokuwiki.org/devel:unittesting I can tell you this, I do not have the environment for setting this up -- I'm way over my time budget for this. But if someone can give me a straightforward way to get the state graph, I can use the existing unit tests as a pattern and create new ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the Table to preformatted transition
is correct on your wiki? I don't see any preformatted text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct: I just noticed the same thing. It seems the lexer is going into the preformatted block and grabbing the text, but not outputting the text as HTML.
After spending the last half-hour looking into it, I checked the sandbox: this bug is in the current lexer as well. So I don't know where that text is going, or why, but it's not related to this patch. That problem already exists...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general markup level makes sense for me. However, as this feels as a rather general change, probably adding tests for the lexer as well might make sense for how it should consume/not consume new lines between different parsermodes. Of course, these tests are less trivial.
On short notice, I have not much time available. I hope I can find some time in a week/2 weeks.
Everybody who can create unit tests is invited to add them, I think especially for markup these are quite doable :-)
The unit tests are part of the development snapshot. I assume they are not delivered with a stable download. More background is available at https://www.dokuwiki.org/devel:unittesting |
The latest commit did also trigger the execution of the unit tests on GitHub. |
First of all, kudos to you for tackling this and coming up with a workable solution! I am still wondering if lookahead/lookbehinds wouldn't be a cleaner, more controlled approach. But that will be easy to test once we have the appropriate unit tests. I am also unsure if I agree with your proposed change in how list syntax is treated special in codeblocks. What would you expect the output to be for the following? Two code blocks and a list item inbetween?
That "graph" is the list of instructions (somewhat cleaned up). It is the output of the parser.
The way I usually do it, is to actually do manually create the expected output as good as I can, then see if and how it fails. The comparison is useful to question my assumptions on the output. I then can decide if my code or my test data needs to be adjusted.
Uhm. I guess you don't need me to explain how this is a terrible idea?
Welcome to Open Source. |
Well, it's your project, but it's been literally 18 years and no one's done lookahead. I'm not fond of special casing things, either, but this addresses the exact weakness at the exact point where the weakness is created. Perfect is the enemy of the good.
There is no "special" treatment. I used the exact same pattern match to start a block to continue a block. In fact, the original code was the special treatment: why would we expect the syntax to change for the second line than it was for the first line? That is the source of the frustration for users -- that is contrary to their expectations! If the user expects that markup they are creating is going to assume a multi-line code block and ignore legitimate markup for more than just this current line, they should use the multi-line code tag. They are intentionally using a single-line markup (double-space). It's literally part of the syntax spec: it only takes effect for that single line. It's nice that multiple lines are coalesced into the same block where possible, certainly. But why would we forcibly assume that they really wanted it to be aggressively multi-line, to the point that we actually ignore valid markup, as if we knew better than the user?
Only if they used double-space tags, I would expect a code block, a list block and a code block. After all, it's exactly what the user typed. If you're requiring muti-line code blocks that automatically ignore markup, use the <code> tags -- just like you did when you typed your example! (Just in case this isn't clear: the syntax change is ONLY for the double-space tag.) If someone is just blindly pasting a bunch of lines that you want to be code, why would you be using the double space tags in the first place? If you are using the double-space tags, it's likely because you're typing (or pasting them) one at a time, and it does not seem unreasonable to see if you're creating a list block. If that's not what you want, then use the <code> tags. Right now, it's impossible to properly do interleaved lines without breaking things up vertically. My code fixes that. And it's still possible to do multiple lines easily, again with zero vertical breakup.
Well, you have familiarity with the states of the lexer and the transitions they make. I do not.
No, no I do not, but thank you for the offer? I'm not a developer. I did not stay in a Holiday Inn. I had a problem. I solved the problem. I did my best to do it as cleanly and reasonably as possible, within the framework of the project, and to go as far as I could in providing that information upstream. I currently do not have ether the environment for extensive development nor the deep understanding to accurately guess what the system is going to do internally with a particular set of markup in order to create unit tests. I have created sample markup text that exercises this area, and compared the output from the unpatched system with the output of the patched system. The output matches exactly what I would expect it to -- except in the transition from table to preformatted, where the output matches exactly the same as the output before the change. (There seems to be an unrelated bug that mangles that output in both cases.) In any case, please feel free to move forward with this as you see fit. My system is working to my satisfaction at this time. I have made the fix available to others so they too can solve this bug. If I have to carry along a 5-line patch, so be it -- patching PHP is easy enough... I want to do what I can to upstream this as effectively as possible, with as little friction for everyone as possible. But if that requires me to be able to generate lexer output solely in my head... Well, then, I guess I keep my out-of-tree patch. Having said that, I am very willing to do what I am able to do to assist with this. Please let me know if there are other things that would help. |
I don't think look behinds work well in the lexer. There is not guaranteed to be anything to look behind. It is consuming each match before it looks for the next match. Conceivably, you could use a look behind for your end pattern, but for start/single patterns, they would need to be effectively optional. |
Issue: #4054
This PR updates the lexer to put back a newline that is consumed by the exit of a block, as well as updating the Preformatted addPattern to match addEntryPattern, so it doesn't consume subsequent listblock lines.