Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix listblock preformatted overlap #4055

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

fts-tmassey
Copy link

Issue: #4054

This PR updates the lexer to put back a newline that is consumed by the exit of a block, as well as updating the Preformatted addPattern to match addEntryPattern, so it doesn't consume subsequent listblock lines.

Some blocks (e.g. listblocks and preformatted) consume newlines in their exit pattern.  Some blocks (e.g. listblocks and preformatted) need newlines in their *entry* patterns, but that newline was consumed by the previous block!  This updates the lexer to put the newline back when exiting a block consumes the newline.
The entry pattern for preformatted blocks understand and respect listblocks, but the mid-block pattern does not.  The syntax is the same, so the pattern match should be the same.  Otherwise, preformatted blocks will accidentally turn subsequent listblock lines into preformatted!
@Klap-in
Copy link
Collaborator

Klap-in commented Sep 11, 2023

We have unit tests for the different syntaxes, here these for tables and preformatted blocks:
https://github.com/dokuwiki/dokuwiki/blob/master/_test/tests/inc/parser/parser_table.test.php
https://github.com/dokuwiki/dokuwiki/blob/master/_test/tests/inc/parser/parser_preformatted.test.php
Could you add tests as well for the new cases? that might prevent that this functionality will break in the future by other changes.

@Klap-in
Copy link
Collaborator

Klap-in commented Sep 11, 2023

And nice to see the results of the puzzle, thanks for the effort!

@fts-tmassey
Copy link
Author

fts-tmassey commented Sep 11, 2023

We have unit tests for the different syntaxes, here these for tables and preformatted blocks: https://github.com/dokuwiki/dokuwiki/blob/master/_test/tests/inc/parser/parser_table.test.php https://github.com/dokuwiki/dokuwiki/blob/master/_test/tests/inc/parser/parser_preformatted.test.php Could you add tests as well for the new cases? that might prevent that this functionality will break in the future by other changes.

Umm... You imply that I actually understand how any of this actually works... :) A quick glance of the unit tests seem to show that you supply a piece of sample markup source text, then a series of array elements that my guess is are the resulting steps through the tokenization state. Makes sense -- but I don't have a way of getting that state graph.

Do you have a suggestion on how I can get that list? If so, I can probably fake my way through the process and create unit tests for various combinations of interleaved preformatted/listblock/table blocks. After all, I don't want this broken in the future, either! :)

@fts-tmassey
Copy link
Author

While I'm waiting to find out how to get the graph, here's what I would propose for unit tests:

// Preformatted block unit tests

// Single-line preformatted to list transition
Unformatted text 0
  Preformatted 0
  * List 0

// Multiple single-line preformatted to list transition
Unformatted text 0
  Preformatted 0
  Preformatted 1
  * List 0

// Single-line preformatted to table transition
Unformatted text 0
  Preformatted 0
| Table Row 0 Col 0    | Row 0 Col 1     | Row 0 Col 2        |

// Multiple single-line preformatted to table transition
Unformatted text 0
  Preformatted 0
  Preformatted 1
| Table Row 0 Col 0    | Row 0 Col 1     | Row 0 Col 2        |

// Listblock unit tests

// Listblock to preformatted transition
Unformatted text 0
  * List 0
  Preformatted 0

// listblock to table transition
Unformatted text 0
  * List 0
| Table Row 0 Col 0    | Row 0 Col 1     | Row 0 Col 2        |

// Table unit tests

// Table to preformatted transition
Unformatted text 0
| Table Row 0 Col 0    | Row 0 Col 1     | Row 0 Col 2        |
  Preformatted 0

// Table to listblock transition
Unformatted text 0
| Table Row 0 Col 0    | Row 0 Col 1     | Row 0 Col 2        |
  * List 0

// If we are closing a block and the last character consumed by our matched
// string is a newline, put it back. See the following for details:
// https://github.com/dokuwiki/dokuwiki/issues/4054
if ($mode == "__exit" && substr($matched, -1) == "\n") {
Copy link
Collaborator

@fiwswe fiwswe Sep 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am by no means an expert on the DW Lexer — but should this not be:
if ($this->isModeEnd($mode) && substr($matched, -1) == "\n") {

Or better yet:
if ($this->isModeEnd($mode) && str_ends_with($matched, "\n")) {

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AIUI, str_ends_with is PHP8, and DokuWiki still supports PHP 7.

As for isModeEnd: you could be right; I too am not an expert on the lexer. A quick look at the source shows that the code for that function is: return ($mode === "__exit"); so it seems that they will be functionally equivalent. For a person unfamiliar with the lexer, $mode was right there... :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recently, in the development version polyfills for str_ends_with() and likes are added e.g.

if (!function_exists('str_ends_with')) {
function str_ends_with(string $haystack, string $needle)
so it fine to use them already.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No a polyfill has been added recently. So str_*_with functions can now be used.
See

/**
* polyfill for PHP < 8
* @see https://www.php.net/manual/en/function.str-starts-with
*/
if (!function_exists('str_starts_with')) {
function str_starts_with(string $haystack, string $needle)
{
return empty($needle) || strpos($haystack, $needle) === 0;
}
}
/**
* polyfill for PHP < 8
* @see https://www.php.net/manual/en/function.str-contains
*/
if (!function_exists('str_contains')) {
function str_contains(string $haystack, string $needle)
{
return empty($needle) || strpos($haystack, $needle) !== false;
}
}
/**
* polyfill for PHP < 8
* @see https://www.php.net/manual/en/function.str-ends-with
*/
if (!function_exists('str_ends_with')) {
function str_ends_with(string $haystack, string $needle)
{
return empty($needle) || substr($haystack, -strlen($needle)) === $needle;
}
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it doesn't work on my system... I'm running 2023-04-04a "Jack Jackrum", not a development snapshot. I'm assuming that functionality came post-Jack? (And I really don't want to set up a development snapshot at this time: in fact, I'm just manually patching this stuff in the GitHub web UI...) But if the polyfill capability is in the post-Jack snapshot feel free to make that change.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel out of my depth regarding the unit tests. I took a look at https://github.com/dokuwiki/dokuwiki/blob/master/_test/tests/inc/parser/lexer.test.php but saw no easy way to add tests for this PR. But maybe I'm looking in the wrong place?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how to add a test for the lexer, but we can add tests for the parse handers for listblock, preformatted and tables. Look above for the comment from @Klap-in with proper paths. In that case, we can give it a chunk of markup source, and then compare our statically-added state graph against the graph generated by the running code. I replied with proposed markup that covers all of the updated transitions above in this issue. I also took that same markup and replaced the comments with heading tags and put it on my production server with my PR applied. You can see the result here: https://www.fluidtechservices.com/dokuwiki/blocktest

It shows that the markup is being processed correctly. As opposed to what happens when you paste it into the DokuWiki sandbox, which I've also done: https://www.dokuwiki.org/sandbox:playground

You can see on my server each type of markup results in exactly the right type of HTML. You can see on the sandbox it does not.

Also, I searched for some doc on unit testing and found this: https://www.dokuwiki.org/devel:unittesting I can tell you this, I do not have the environment for setting this up -- I'm way over my time budget for this. But if someone can give me a straightforward way to get the state graph, I can use the existing unit tests as a pattern and create new ones.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the Table to preformatted transition is correct on your wiki? I don't see any preformatted text.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct: I just noticed the same thing. It seems the lexer is going into the preformatted block and grabbing the text, but not outputting the text as HTML.

After spending the last half-hour looking into it, I checked the sandbox: this bug is in the current lexer as well. So I don't know where that text is going, or why, but it's not related to this patch. That problem already exists...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general markup level makes sense for me. However, as this feels as a rather general change, probably adding tests for the lexer as well might make sense for how it should consume/not consume new lines between different parsermodes. Of course, these tests are less trivial.

On short notice, I have not much time available. I hope I can find some time in a week/2 weeks.

Everybody who can create unit tests is invited to add them, I think especially for markup these are quite doable :-)

@Klap-in
Copy link
Collaborator

Klap-in commented Sep 11, 2023

The unit tests are part of the development snapshot. I assume they are not delivered with a stable download. More background is available at https://www.dokuwiki.org/devel:unittesting
Ideally unit tests are created upfront, but in practice I guess most people create them afterward. What I do often myself to create more quickly the needed output is copying existing cases, or by running (similar) examples and modify them to the expected outcome.

@Klap-in
Copy link
Collaborator

Klap-in commented Sep 11, 2023

The latest commit did also trigger the execution of the unit tests on GitHub.
See for example the php8.0 run for the results.
It shows that we have about 32 failures, in a quick scan I saw mainly additions or removals of new lines.
For each of them we must check if a failure needs further improvement, or that it is a desired change of output.

@splitbrain
Copy link
Collaborator

First of all, kudos to you for tackling this and coming up with a workable solution!

I am still wondering if lookahead/lookbehinds wouldn't be a cleaner, more controlled approach. But that will be easy to test once we have the appropriate unit tests.

I am also unsure if I agree with your proposed change in how list syntax is treated special in codeblocks.

What would you expect the output to be for the following? Two code blocks and a list item inbetween?

  $foo = 7;
  --$foo;
  echo $foo;

A quick glance of the unit tests seem to show that you supply a piece of sample markup source text, then a series of array elements that my guess is are the resulting steps through the tokenization state. Makes sense -- but I don't have a way of getting that state graph.

That "graph" is the list of instructions (somewhat cleaned up). It is the output of the parser.

how to create the expected state graph. I can't do it by hand, and I'm sure no one else does it by hand either.

The way I usually do it, is to actually do manually create the expected output as good as I can, then see if and how it fails. The comparison is useful to question my assumptions on the output. I then can decide if my code or my test data needs to be adjusted.

I'm doing this on a production system

Uhm. I guess you don't need me to explain how this is a terrible idea?

I'm way over my time budget for this.

Welcome to Open Source.

@fts-tmassey
Copy link
Author

fts-tmassey commented Sep 12, 2023

I am still wondering if lookahead/lookbehinds wouldn't be a cleaner, more controlled approach. But that will be easy to test once we have the appropriate unit tests.

Well, it's your project, but it's been literally 18 years and no one's done lookahead. I'm not fond of special casing things, either, but this addresses the exact weakness at the exact point where the weakness is created. Perfect is the enemy of the good.

I am also unsure if I agree with your proposed change in how list syntax is treated special in codeblocks.

There is no "special" treatment. I used the exact same pattern match to start a block to continue a block. In fact, the original code was the special treatment: why would we expect the syntax to change for the second line than it was for the first line? That is the source of the frustration for users -- that is contrary to their expectations!

If the user expects that markup they are creating is going to assume a multi-line code block and ignore legitimate markup for more than just this current line, they should use the multi-line code tag. They are intentionally using a single-line markup (double-space). It's literally part of the syntax spec: it only takes effect for that single line. It's nice that multiple lines are coalesced into the same block where possible, certainly. But why would we forcibly assume that they really wanted it to be aggressively multi-line, to the point that we actually ignore valid markup, as if we knew better than the user?

What would you expect the output to be for the following? Two code blocks and a list item inbetween?

  $foo = 7;
  --$foo;
  echo $foo;

Only if they used double-space tags, I would expect a code block, a list block and a code block. After all, it's exactly what the user typed. If you're requiring muti-line code blocks that automatically ignore markup, use the <code> tags -- just like you did when you typed your example! (Just in case this isn't clear: the syntax change is ONLY for the double-space tag.) If someone is just blindly pasting a bunch of lines that you want to be code, why would you be using the double space tags in the first place? If you are using the double-space tags, it's likely because you're typing (or pasting them) one at a time, and it does not seem unreasonable to see if you're creating a list block. If that's not what you want, then use the <code> tags. Right now, it's impossible to properly do interleaved lines without breaking things up vertically. My code fixes that. And it's still possible to do multiple lines easily, again with zero vertical breakup.

The way I usually do it, is to actually do manually create the expected output as good as I can, then see if and how it fails. The comparison is useful to question my assumptions on the output. I then can decide if my code or my test data needs to be adjusted.

Well, you have familiarity with the states of the lexer and the transitions they make. I do not.

I'm doing this on a production system

Uhm. I guess you don't need me to explain how this is a terrible idea?

No, no I do not, but thank you for the offer?

I'm not a developer. I did not stay in a Holiday Inn. I had a problem. I solved the problem. I did my best to do it as cleanly and reasonably as possible, within the framework of the project, and to go as far as I could in providing that information upstream. I currently do not have ether the environment for extensive development nor the deep understanding to accurately guess what the system is going to do internally with a particular set of markup in order to create unit tests. I have created sample markup text that exercises this area, and compared the output from the unpatched system with the output of the patched system. The output matches exactly what I would expect it to -- except in the transition from table to preformatted, where the output matches exactly the same as the output before the change. (There seems to be an unrelated bug that mangles that output in both cases.)

In any case, please feel free to move forward with this as you see fit. My system is working to my satisfaction at this time. I have made the fix available to others so they too can solve this bug. If I have to carry along a 5-line patch, so be it -- patching PHP is easy enough... I want to do what I can to upstream this as effectively as possible, with as little friction for everyone as possible. But if that requires me to be able to generate lexer output solely in my head... Well, then, I guess I keep my out-of-tree patch.

Having said that, I am very willing to do what I am able to do to assist with this. Please let me know if there are other things that would help.

@Chris--S
Copy link
Collaborator

I don't think look behinds work well in the lexer. There is not guaranteed to be anything to look behind. It is consuming each match before it looks for the next match. Conceivably, you could use a look behind for your end pattern, but for start/single patterns, they would need to be effectively optional.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants