Get marker delimitation #96

Open
jokteur opened this issue Dec 27, 2023 · 4 comments

@jokteur

jokteur commented Dec 27, 2023

Hello,

I am writing a WYSIWYG Markdown editor focused on math and science, and I want to use Markdown as the base format. The problem I am going to describe is present in many other Markdown parsers. As a result, I decided to write a new parser from scratch (in C++) and make some modifications to the Markdown standard to fit my own needs (this is the result).

The prototype I wrote was working okay, but I have now decided to rewrite the whole application in Rust and to stop maintaining my own parser, which is much more prone to bugs and crashes.

The marker delimitation problem

I am rewriting what I wrote here: https://github.com/jokteur/ab-parser#the-delimitation-marker-problem.

For my WYSIWYG application, I need to know where the markers of a specific block / span are, so that I can temporarily display them to the user, as in this demo: https://github.com/wooorm/markdown-rs/assets/25845695/420c1496-7306-4c69-b7ca-74059ec95886

Let's say that we have the following Markdown example:

- >> [abc
  >> def](example.com)

This example would generate an abstract syntax tree (AST) like:

DOC
  UL
    LI
      QUOTE
        QUOTE
          P
            URL
              TEXT

How do we attribute each non-text marker (like -, >, [, ...) to the correct block / span?

My parser was created to solve this specific problem while keeping reasonable performance. To do this, each object (BLOCK or SPAN) is represented by a vector of boundaries. A boundary is defined as follows:

struct Boundary {
    line_number: usize,
    pre: usize,
    beg: usize,
    end: usize,
    post: usize,
}

This struct designates offsets in the raw text which form its structure. line_number is the line number in the raw text on which the boundary is currently operating. Offsets between pre and beg are the pre-delimiters, and offsets between end and post are the post-delimiters. Everything between beg and end is the content of the block / span.

Here is a simple example. Suppose we have the text _italic_, starting at line 0 and offset 0. The boundary struct would then be {0, 0, 1, 7, 8}.
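To make the _italic_ example concrete, here is a minimal sketch of how such a boundary could be used to slice the raw text. The accessor names (pre_delimiter, content, post_delimiter) are hypothetical and not taken from ab-parser, and the sketch assumes the offsets are half-open byte ranges into the raw text:

```rust
/// One boundary of a block or span, with the fields described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Boundary {
    pub line_number: usize,
    pub pre: usize,
    pub beg: usize,
    pub end: usize,
    pub post: usize,
}

impl Boundary {
    /// Pre-delimiter: offsets `pre..beg` (assumed half-open byte range).
    pub fn pre_delimiter<'a>(&self, raw: &'a str) -> &'a str {
        &raw[self.pre..self.beg]
    }

    /// Content: offsets `beg..end`.
    pub fn content<'a>(&self, raw: &'a str) -> &'a str {
        &raw[self.beg..self.end]
    }

    /// Post-delimiter: offsets `end..post`.
    pub fn post_delimiter<'a>(&self, raw: &'a str) -> &'a str {
        &raw[self.end..self.post]
    }
}
```

With the boundary {0, 0, 1, 7, 8} on the text _italic_, the pre-delimiter and post-delimiter are each a single underscore and the content is italic.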

Going back to the first example, we now use the following notation to illustrate ownership of markers: x indicates a delimiter, _ indicates content, and . indicates a character outside the boundary. Here is the ownership for each block and span:

- >> [abc
  >> def](example.com)

UL:
_________
______________________

LI:
xx_______
xx____________________

QUOTE (1st):
..x______
..x___________________

QUOTE (2nd):
...xx____
...xx_________________

P:
.....____
....._________________

URL:
.....x___
.....___xxxxxxxxxxxxxx

TEXT:
......___
.....___..............
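The notation above can be generated mechanically from a vector of boundaries. The following is only a sketch: the function name ownership_mask is hypothetical, and it assumes, for simplicity, that offsets are counted from the start of each line and that the text is ASCII, so that one byte is one column:

```rust
/// One boundary of a block or span, with the fields described above.
#[derive(Debug, Clone, Copy)]
pub struct Boundary {
    pub line_number: usize,
    pub pre: usize,
    pub beg: usize,
    pub end: usize,
    pub post: usize,
}

/// Renders the ownership notation for one line:
/// `x` = delimiter, `_` = content, `.` = not in boundary.
/// Assumes offsets are relative to the start of the line.
pub fn ownership_mask(line_number: usize, line_len: usize, boundaries: &[Boundary]) -> String {
    // Pick the boundary that applies to this line, if any.
    let b = boundaries.iter().find(|b| b.line_number == line_number);
    (0..line_len)
        .map(|i| match b {
            Some(b) if (b.pre..b.beg).contains(&i) || (b.end..b.post).contains(&i) => 'x',
            Some(b) if (b.beg..b.end).contains(&i) => '_',
            _ => '.',
        })
        .collect()
}
```

Under these assumptions, the URL span of the example would carry the boundaries {0, 5, 6, 9, 9} and {1, 5, 5, 8, 22}, which reproduce the two URL lines shown above.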

Is there any simple way to retrieve this kind of information?

Currently, markdown-rs provides positional information like this:

Text { value: "abc\ndef", position: Some(1:7-2:10 (6-19)) }

I may have a workaround to retrieve this kind of information: after parsing, start from the leaf nodes, compare each node's text with the raw text, check which characters are part of the node and which are not, and attribute the remaining ones to the parent. This workaround may be slow, but that is acceptable for my use case because I only need marker delimitation information where the cursor is, not for the whole document.
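A rough sketch of the core step of that workaround, under stated assumptions: the byte range of a parent node and the byte ranges of its children are already known (for instance from offsets like the pair shown in parentheses in the position output above), and the children are sorted and do not overlap. The function name marker_ranges is hypothetical. Whatever part of the parent is not covered by a child is treated as a marker of the parent:

```rust
use std::ops::Range;

/// Returns the parts of `parent` that no child covers.
/// Assumes `children` are sorted by start offset and lie inside `parent`.
pub fn marker_ranges(parent: Range<usize>, children: &[Range<usize>]) -> Vec<Range<usize>> {
    let mut gaps = Vec::new();
    let mut cursor = parent.start;
    for child in children {
        // Anything between the cursor and the next child belongs to the parent.
        if child.start > cursor {
            gaps.push(cursor..child.start);
        }
        cursor = cursor.max(child.end);
    }
    // Trailing part of the parent after the last child.
    if cursor < parent.end {
        gaps.push(cursor..parent.end);
    }
    gaps
}
```

For the single-line link [abc](example.com), with the link spanning bytes 0..18 and its text child spanning 1..4, this yields 0..1 and 4..18, i.e. the opening bracket and the closing bracket plus destination. In the multi-line example, the container prefixes on continuation lines fall inside the reported text range, so they would need to be attributed to the enclosing blocks in a separate pass.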

I don't really know how markdown-rs works internally. How difficult would it be to have this information built into the parser?

@ChristianMurphy
Collaborator

Welcome @jokteur! 👋
The overview of the project is a good starting point. https://github.com/wooorm/markdown-rs#overview
The process to parse markdown looks like this:

                    markdown-rs
+-------------------------------------------------+
|            +-------+         +---------+--html- |
| -markdown->+ parse +-events->+ compile +        |
|            +-------+         +---------+-mdast- |
+-------------------------------------------------+

If you want to work with raw events/tokens, rather than the AST, use the parse file/function.

@jokteur
Author

jokteur commented Jan 1, 2024

And would there be any way to use the parse file/function without forking the project? Currently this API is private, which doesn't allow me to implement my own compiler on top of markdown-rs.

@ChristianMurphy
Collaborator

ChristianMurphy commented Jan 2, 2024

There is also a JavaScript version of this project. On the JS side, there is a lower-level package, micromark, that exposes this: https://github.com/micromark/micromark
@wooorm may be able to comment on the intent on the Rust side.

@wooorm
Owner

wooorm commented Jan 5, 2024

No, it’s not exposed yet. This project is currently at the state where it has to get some traction IMO before all the internals are exposed, to figure out how to expose things, and whether to expose things.
