Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New-style UNL Regex Wrongly Accepts Text After A Trailing Space #3667

Open
tbpassin opened this issue Nov 26, 2023 · 20 comments
Open

New-style UNL Regex Wrongly Accepts Text After A Trailing Space #3667

tbpassin opened this issue Nov 26, 2023 · 20 comments
Assignees
Labels
Bug Delegated Delegatated to others
Milestone

Comments

@tbpassin
Copy link
Contributor

tbpassin commented Nov 26, 2023

g.unl_regex accepts a unl that contains text after a trailing space, leading to a colorized string that CTRL-Click cannot navigate to. Example:

unl:gnx://workbook.leo#tom.20231125211832.1 the cross reference

The entire line above gets colorized, and a CTRL-Click on it fails to navigate even though the actual UNL part is a valid UNL. The error message:

Not found: 'unl:gnx://workbook.leo#tom.20231125211832.1 the cross reference'

Note that any fix must allow for the possibility that a file name may contain spaces. Presumably they are already converted to %20 but I have not checked this.

The use case for correcting this behavior is a desire to insert a cross reference to another node along with a descriptive string such as the node's headline.

I realize that making this work with both old-style and new-style UNLs could be a challenge. So maybe as a compromise only the new-style UNL behavior could be fixed.

Another but minor bug is that the error message that the unl cannot be found goes only to the console and not to the log pane.

@tbpassin tbpassin added the Bug label Nov 26, 2023
@tbpassin tbpassin added this to the 6.7.6 milestone Nov 26, 2023
@tbpassin tbpassin changed the title New-style UNL Regex Wrongly Accepts Text After A Training Space New-style UNL Regex Wrongly Accepts Text After A Trailing Space Nov 26, 2023
@edreamleo edreamleo self-assigned this Nov 27, 2023
@edreamleo
Copy link
Member

@tbpassin The workaround would seem to be changing:

unl:gnx://workbook.leo#tom.20231125211832.1 the cross reference

to:

the cross reference: unl:gnx://workbook.leo#tom.20231125211832.1

Is that feasible for you?

@edreamleo edreamleo removed the First label Nov 27, 2023
@tbpassin
Copy link
Contributor Author

Well, of course I could do that (and I tried it out yesterday) but many times I'd rather read it the other way. It also got me concerned about paths with other (non-space but unusual) characters. I discovered that a node whose headline includes an apostrophe (e.g., Abe's Recipe) cannot be completely selected by the current regex. This makes the node inaccessible by UNL navigation. There are probably other characters like that that i haven't checked yet.

UNL navigation should work for any characters that can appear in a headline, don't you think?

@edreamleo edreamleo modified the milestones: 6.7.6, 6.7.7 Nov 30, 2023
@edreamleo
Copy link
Member

@tbpassin Thanks for this report. I'm going to assign this to you. You will be the best judge of what works and you have the skills to create a PR.

@edreamleo edreamleo added the Delegated Delegatated to others label Nov 30, 2023
@tbpassin
Copy link
Contributor Author

The challenge is to accept spaces in file paths, yet to stop recognizing the UNL even if it's not at the end of a line. The URL way is to replace spaces by "%20". I don't know how feasible this would be to implement with new-style UNLs. I seem to remember that it would have worked with the old ones, do you recall if that's right?

If we did start replacing spaces with '%20" I think that it would be backward compatible. Of course I would check that.

@boltex
Copy link
Contributor

boltex commented Dec 28, 2023

Here are some notes from some experiments I made with UNLs in Leo (I've never used those before) when trying to make them work somewhat successfully in LeoJS and LeoInteg this evening:

I'm now working on UNL and URL links in LeoJS and LeoInteg. I've come up with custom regex (with the help of chatGpt who is excellent with regex) because the generic regex for finding UNLs in leoGlobals.py is not specific enough:

unl_regex = re.compile(r"""\bunl:[^`'"]+""")

For instance, the following excerpt from tutorial-tips.html from Leo's documentation shows this next sample, but the regex stops after the apostrophe in the Leo's Documentation part of the headline:

For example, the UNL for the parent @rst of this node in LeoDocs.leo is:

    ``unl://C:/leo.repo/leo-editor/leo/doc/LeoDocs.leo#Leo's Documentation-->Tutorials-->@rst html\tutorial-tips.html``

So far, here are the regex I've come up with. (Two separate for both kinds of UNLs)

    const gnxUnlRegex = /\bunl:gnx:\/\/[^\s]+/g;
    const headlineUnlRegex = /\bunl:\/\/(?:[^\s#]*#)?(?:[^->]+(?:-->[^->]+)*)/g;

@boltex
Copy link
Contributor

boltex commented Dec 28, 2023

@tbpassin I'm making some progress, perhaps it could help also in Leo. That headlineUnlRegex was still not satisfactory in LeoJS, I had to modify it again to :

const headlineUnlRegex = /\bunl:\/\/(?:[^\s#]*#)?[^>]*(?:-->[^>]+)*[^>]*$/g;

@tbpassin
Copy link
Contributor Author

I didn't even know there was such a thing as a UNL regex for headlines.

For example, the UNL for the parent @rst of this node in LeoDocs.leo is:
unl://C:/leo.repo/leo-editor/leo/doc/LeoDocs.leo#Leo's Documentation-->Tutorials-->@rst html\tutorial-tips.html

At least in Leo-Python, we don't want (or can't have) backslashes in the UNL. There is actually a problem because for UNL's spaces are not escaped, unlike in URLs. So a regex cannot tell where the end of the UNL is, if it is followed by some text. This means in practice that there cannot be any text on a line following a UNL.

This makes me unhappy but I don't see how it can be changed for reasons of backwards compatibility.

@boltex
Copy link
Contributor

boltex commented Dec 28, 2023

@tbpassin You didn't know because there were none. hehe 😄 !

I've come up with those two different regex myself yesterday (one for gnx based UNLs and one for the older headline based UNLs) in LeoJS because the single regex used in Leo is insufficient and misses some matches. (As I've explained above) I just thought I'd pop in here

Thanks for explaining that they cannot have backslashes and cannot have characters on the same line after the unl. This information will be useful for fixing the regex used in Leo that I mentioned earlier:
unl_regex = re.compile(r"""\bunl:[^'"]+""")`

I will also adjust the visual rendering, and regex detection in LeoJS with those subtle details that you've just explained. So Thanks for that! 😁

Like I explained above, It stops at an apostrophe in a headline name. This means that the example UNLs given in the official documentation are not valid. (or that the regex to detect them is broken.)

OH! I've just realized that the failing example I had given (with the apostrophe in the headline name list and the backslashes) is from the OLD WEBSITE and is not to be found in present on the new site: (See adress bar in Screenshot )

image

Either way, the UNL regex and its features and documentation in Leo need some small bit of work and cleanup...(but not much really) The UNLs are a great feature and deserve to be polished and well documented.

For now, I'll do more testing and experiments in both Leo and LeoJS with UNLs to make sure they works properly, (I've yet to test the two kinds of UNLs without the optional 'outline-path' part to be used on the currently opened outline)

So I'll be back later this evening and I'll report my findings and suggestions about UNL handling in this issue's thread.

@boltex
Copy link
Contributor

boltex commented Dec 28, 2023

@tbpassin @edreamleo
To be more concise, a screenshot is always a good idea hehe ! :)
image

Here's what I meant by 'the regex does not capture the UNL properly'. You can see the color stoping at the apostrophe. (Thomas said they are not expected to have backslashes nor to have more content on the same line, so the check to end the UNL with apostrophes in the regex is (maybe?) extraneous... ? (perhaps the 2 regex I've come up with in LeoJS could help?)

Anyways, this is just my first findings..l Like I said, I'll spend the next hours doing more experiments and tests in Leo and LeoJS with all the possibles UNL forms and I'll report back in here with my thoughts and suggestions.

@boltex
Copy link
Contributor

boltex commented Dec 29, 2023

Good news! Separating the regex into two regex allows for @tbpassin vision to become reality: that is, to allow spaces and text after gnx based UNLs. Here is what I use in LeoJS (regex are a pain to work and figure out by the way! Thank god I had chatGPT for that! Here a "share" link to the conversation I had with it in order to come up with those two specific regexes: https://chat.openai.com/share/02f8070b-e5a9-48fd-8e6c-c1f14717a33e )

(Note: see other post reply below for even better regex than those two.)

    const gnxUnlRegex = /\bunl:gnx:[^\r\n#]*#\S*/g; // This allows spaces and content after the UNL.
    const headlineUnlRegex = /\bunl:(?!gnx:)[^\r\n#]*#[^\r\n]*\S/g; // This goes on until end of line.

So perhaps in Leo the unl_regex in leoGlobals.py could be replaced by those two, (and colorized + detected with those)

gnx_unl_regex = re.compile(r"""\bunl:gnx:[^\r\n#]*#\S*""")
headline_unl_regex = re.compile(r"""\bunl:(?!gnx:)[^\r\n#]*#[^\r\n]*\S""")

I've tested those in Leo. Those all work, (to be able to have leo open unopened files it has to have the full path if not alreaday in the current Commander's directory) And even with backslashes instead of slashes ! :)

unl:gnx://C:\prog\testmarkdown\testmd1.leo#felix.20231224140524.1
unl:gnx://C:/prog/testmarkdown/testmd1.leo#felix.20231224140524.1

unl://C:/prog/testmarkdown/testmd1.leo#@button test2
unl://C:\prog\testmarkdown\testmd1.leo#@button test2

But the only problem I had ( perhaps @edreamleo could shed light on this) Is that the documentation states that the 'path outline' part of the UNL is optional to navigate into the currently opened outline, Here is the documentation as per leoGlobals.py in << About clickable links >>

2. New in Leo 6.7.4: UNLs based on gnx's (global node indices):

   Links of the form `unl:gnx:` + `//{outline}#{gnx}` open the given
   outline and select the first outline node with the given gnx. These UNLs
   will work as long as the node exists anywhere in the outline.

   For example, the link: `unl:gnx://#ekr.20031218072017.2406` refers to this
   outline's "Code" node. Try it. The link works in this outline.

   *Note*: `{outline}` is optional. It can be an absolute path name or a relative
   path name resolved using `@data unl-path-prefixes`.

3. Leo's headline-based UNLs, as shown in the status pane:

   Headline-based UNLs consist of `unl://` + `//{outline}#{headline_list}`
   where headline_list is list of headlines separated by `-->`.

   This link works: `unl://#Code-->About this file`.

But those links without the 'outline' part failed to work when I tried them.
Should those work as per docum in leoGlobals.py?

unl://#@button test2
unl:gnx://#felix.20231224140524.1

Lastly, I thought that the p.get_UNL and related methods section of the Position class in leoNodes.py should have creation methods for those very small UNLs without the optional 'outline' part.

@boltex
Copy link
Contributor

boltex commented Dec 29, 2023

@edreamleo Maybe I should open a separate issue for the UNLs without the 'outline' part ? I'm guessing that this fix should be trivial since it's for the currently opened 'c' Commander... but I may be wrong of course! 😆

@boltex
Copy link
Contributor

boltex commented Dec 29, 2023

I forgot to add:

It seems that the double slash is required for valid UNLs (as per this comment in p.get_UNL and related methods in leoNodes.py)

# All unls must contain a file part: f"//{file-name}#"
# The file-name may be empty.

So I guess the regex should instead require this double slash too, like so:

gnx_unl_regex = re.compile(r"""\bunl:gnx:\/\/[^\r\n#]*#\S*""")
headline_unl_regex = re.compile(r"""\bunl:\/\/[^\r\n#]*#[^\r\n]*\S""")

@tbpassin
Copy link
Contributor Author

Apparently I had not tested enough cases! I tried a selection some of which had failed previously and they worked. But nit enough coverage, I suppose.

So we check if a UNL has the "gnx" part, and if it doesn't we check if it's an old style UNL (what you have been calling a "headline" UNL). The terminology is a little misleading because the actual old style UNLS were not exactly the same as the new "old-style" ones.

For the new-style UNLs that contain gnxes, they cannot contain spaces so the UNL can be followed by more text on the same line. The regex needs to recognize that these new-style UNLs will terminate when they encounter a string or end of line. In addition, the filename part might contain spaces so the regex has to accept that.

The "old-style" UNLs unfortunately can contain spaces if their headlines do. So they can be terminated only be a newline or end of the tex

@edreamleo
Copy link
Member

@boltex @tbpassin test.leo contains the node --- tests of UNL. Some of this nodes children contain various unls that should work (unless indicated otherwise).

Before creating another issue, let me re-test these unls in devel and ekr-3736-unls.

@boltex
Copy link
Contributor

boltex commented Dec 29, 2023

@tbpassin thanks for your comment

For the new-style UNLs that contain gnxes, they cannot contain spaces so the UNL can be followed by more text on the same line.

Note that one of the new regex that I proposed above for the gnx based UNLs

gnx_unl_regex = re.compile(r"""\bunl:gnx:\/\/[^\r\n#]*#\S*""")

makes a careful selection discrimination to support spaces only in the outline file path, but not after the # sign. Therefore enabling you with this 'trick' to have spaces and more content after those 'gnx' based UNLs even when they are used with the full path which may contain spaces, because that regex knows it's past the gnx part and spaces are not allowed past that point 😉

So you have the best of both worlds with this regex 😄 (or your cake and eat it too - english is not my first language sorry hehe!)

@edreamleo
Copy link
Member

@boltex PR #3737 fixes all aspects of unls that I know of, except maybe this issue.

@edreamleo
Copy link
Member

@boltex @tbpassin A quick check verifies that the unls in test.leo do not work if any text follows them.

So let the brainstorming continue.

@edreamleo
Copy link
Member

@boltex @tbpassin It will be easy to update the new unit test in PR #3737 to handle trailing text once the smoke clears on this issue.

I would appreciate your favorable review of PR #3737 soon.

@tbpassin
Copy link
Contributor Author

So after saying that new "old-style" (headline-path-based) don't contain the full path to their file, I remembered that we need to support full paths for backwards compatibility. We don't need to generate such paths any more but we should continue to support them.

And we always need to support potential spaces in file names.

@edreamleo
Copy link
Member

@tbpassin Say rather that Leo might not show paths in the status are. See the summary below.

After about 20 minutes of searching I discovered the relevant code:

LeoTree.set_status_line (a base class method, shared by all guis) contains:

method = p.get_legacy_UNL if kind.lower() == 'legacy' else p.get_UNL
c.frame.putStatusLine(method())

Here is p.get_legacy_UNL :

def get_legacy_UNL(self) -> str:
    """
    Return a headline-oriented UNL, as in legacy versions of p.get_UNL.

    @bool full-unl-paths determines the size of the file part.

    LeoTree.set_status_line will call this method if legacy unls are in effect.
    """
    p = self
    c = p.v.context
    path_part = '-->'.join(list(reversed([z.h for z in self.self_and_parents()])))
    full = c.config.getBool('full-unl-paths', default=False)
    file_part = c.fileName() if full else os.path.basename(c.fileName())
    return 'unl:' + f"//{file_part}#{path_part}"

and here is p.get_UNL:

def get_UNL(self) -> str:
    """
    Return a gnx-oriented UNL.

    Breaking change to Leo's API: returned a path-oriented UNL previously.

    @bool full-unl-paths determines the size of the file part.

    LeoTree.set_status_line calls this method if gnx-based unls are in effect.
    """
    p = self
    c = p.v.context
    full = c.config.getBool('full-unl-paths', default=False)
    file_part = c.fileName() if full else os.path.basename(c.fileName())
    return 'unl:gnx:' + f"//{file_part}#{self.gnx}"

Summary

@bool 'full-unl-paths determines whether unls of any kind contain a full path.

The code, not our memory, determines what happens :-)

@edreamleo edreamleo modified the milestones: 6.7.7, 6.7.8 Jan 13, 2024
@edreamleo edreamleo modified the milestones: 6.7.8, 6.7.9 Jan 29, 2024
@edreamleo edreamleo modified the milestones: 6.7.9, 6.7.10 Apr 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Delegated Delegatated to others
Projects
None yet
Development

No branches or pull requests

3 participants