Get marker delimitation #96

jokteur · 2023-12-27T11:58:21Z

Hello,

I am writing a WYSIWYG Markdown editor focused on math and science, and I want to use Markdown as the base format. The problem I am going to describe is present in many other Markdown parsers, so I decided to write a new parser from scratch (in C++) and to make some modifications to the Markdown standard to fit my own needs (this is the result).

The prototype I wrote worked okay, but I have now decided to rewrite the whole application in Rust and to stop maintaining my own parser, which is much more prone to bugs and crashes.

The marker delimitation problem

I am restating what I wrote here: https://github.com/jokteur/ab-parser#the-delimitation-marker-problem.

For my WYSIWYG application, I need to know where the markers of a specific block / span are, so that I can temporarily display them to the user, as in this demo: https://github.com/wooorm/markdown-rs/assets/25845695/420c1496-7306-4c69-b7ca-74059ec95886

Let's say that we have the following Markdown example:

- >> [abc
  >> def](example.com)

This example would generate an abstract syntax tree (AST) like:

DOC
  UL
    LI
      QUOTE
        QUOTE
          P
            URL
              TEXT

How do we attribute each non-text marker (like -, >, [, ...) to the correct block / span?

My parser was created to solve this specific problem while keeping reasonable performance. To do this, each object (BLOCK or SPAN) is represented by a vector of boundaries, one per line that the object touches. A boundary is defined as follows:

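struct Boundary {
    line_number: usize, // line in the raw text this boundary applies to
    pre: usize,         // start of the pre-delimiter
    beg: usize,         // start of the content (end of the pre-delimiter)
    end: usize,         // end of the content (start of the post-delimiter)
    post: usize,        // end of the post-delimiter
}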

This struct holds the offsets in the raw text that delimit the object's structure on one line. line_number is the line in the raw text to which the boundary applies. Offsets between pre and beg are the pre-delimiters, and offsets between end and post are the post-delimiters. Everything between beg and end is the content of the block / span.

Here is a simple example. Suppose we have the text _italic_, starting at line 0 and offset 0. The boundary struct would then be {0, 0, 1, 7, 8}, in the field order line_number, pre, beg, end, post.

Going back to the first example, we now use the following notation to illustrate ownership of markers: x indicates a delimiter, _ indicates content, and . indicates a position outside the boundary. Here is the ownership for each block and span:
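To make the _italic_ example concrete, here is a minimal Rust sketch that slices a line into its three parts using a boundary. The split method is a hypothetical helper written for this illustration, and it assumes the offsets are byte offsets that fall on character boundaries.

```rust
/// Offsets into one line of raw text, as described above.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Boundary {
    line_number: usize,
    pre: usize,
    beg: usize,
    end: usize,
    post: usize,
}

impl Boundary {
    /// Hypothetical helper: splits `line` into
    /// (pre-delimiter, content, post-delimiter).
    /// Assumes pre <= beg <= end <= post <= line.len() and that all
    /// offsets are byte offsets on char boundaries.
    fn split<'a>(&self, line: &'a str) -> (&'a str, &'a str, &'a str) {
        (
            &line[self.pre..self.beg],
            &line[self.beg..self.end],
            &line[self.end..self.post],
        )
    }
}

fn main() {
    // The boundary {0, 0, 1, 7, 8} from the example above.
    let b = Boundary { line_number: 0, pre: 0, beg: 1, end: 7, post: 8 };
    let (pre, content, post) = b.split("_italic_");
    println!("line {}: pre={pre:?} content={content:?} post={post:?}", b.line_number);
}
```

With these offsets, both delimiters are the single underscore and the content is the word between them.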

- >> [abc
  >> def](example.com)

UL:
_________
______________________

LI:
xx_______
xx____________________

QUOTE (1st):
..x______
..x___________________

QUOTE (2nd):
...xx____
...xx_________________

P:
.....____
....._________________

URL:
.....x___
.....___xxxxxxxxxxxxxx

TEXT:
......___
.....___..............
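
The ownership rows above can be derived mechanically from boundaries. The following Rust sketch does this for the URL span. The ownership_mask function is a hypothetical helper, the boundary values are read off the diagram by hand, and the sketch assumes ASCII input so that byte offsets and columns coincide.

```rust
/// Offsets into one line of raw text (same fields as the struct above).
#[derive(Debug, Clone, Copy)]
struct Boundary {
    line_number: usize,
    pre: usize,
    beg: usize,
    end: usize,
    post: usize,
}

/// Hypothetical helper: renders one diagram row for `line`.
/// '_' marks content, 'x' marks a delimiter, '.' marks a position
/// outside the boundary. Assumes ASCII input (one byte per column).
fn ownership_mask(line: &str, b: &Boundary) -> String {
    (0..line.len())
        .map(|i| {
            if i >= b.beg && i < b.end {
                '_'
            } else if i >= b.pre && i < b.post {
                'x'
            } else {
                '.'
            }
        })
        .collect()
}

fn main() {
    let lines = ["- >> [abc", "  >> def](example.com)"];
    // Boundaries of the URL span, one per line (values read off the diagram).
    let url = [
        Boundary { line_number: 0, pre: 5, beg: 6, end: 9, post: 9 },
        Boundary { line_number: 1, pre: 5, beg: 5, end: 8, post: 22 },
    ];
    for b in &url {
        println!("{}", ownership_mask(lines[b.line_number], b));
    }
}
```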

Is there any simple way to retrieve this kind of information?

Currently, markdown-rs provides positional information like this:

Text { value: "abc\ndef", position: Some(1:7-2:10 (6-19)) }

I may have a workaround to recover this kind of information: after parsing, start from the leaf nodes, compare each node's text with the raw text, check which characters are part of the node and which are not, and attribute the latter to the parent. This workaround may be slow, but that is okay for my usage because I only need marker delimitation information where the cursor is (not for the whole document).

I don't really know how markdown-rs works internally. How difficult would it be to have this information built into the parser?
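
A rough Rust sketch of the comparison step in that workaround follows. It does not call markdown-rs: the function non_content_offsets is hypothetical, the byte range of the text node is computed by hand for the literal string in the sketch (not taken from the parser output above), and the greedy matching is an assumption that can misalign when the node's value differs from the source in other ways, for example through escapes or character references.

```rust
/// Hypothetical helper: walks the raw source slice covered by a leaf node
/// and returns the absolute byte offsets of characters that do not appear
/// in the node's `value`. Those offsets are candidates to attribute to an
/// ancestor (for example block quote markers on continuation lines).
/// `start` is the byte offset of `raw` within the full source.
fn non_content_offsets(raw: &str, start: usize, value: &str) -> Vec<usize> {
    let mut expected = value.chars().peekable();
    let mut extra = Vec::new();
    for (i, c) in raw.char_indices() {
        if expected.peek() == Some(&c) {
            // This character is part of the node's own text.
            expected.next();
        } else {
            // Not part of the text: likely a marker or indentation.
            extra.push(start + i);
        }
    }
    extra
}

fn main() {
    let source = "- >> [abc\n  >> def](example.com)";
    // Byte range of the text node in this literal string, found by hand.
    let (start, end) = (6, 18);
    let extra = non_content_offsets(&source[start..end], start, "abc\ndef");
    for off in &extra {
        println!("offset {off}: {:?}", &source[*off..*off + 1]);
    }
}
```

The offsets reported here are the indentation and the block quote markers on the second line, which would then be handed to the enclosing list item and block quotes.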


ChristianMurphy · 2023-12-31T18:29:41Z

Welcome @jokteur! 👋
The overview of the project is a good starting point. https://github.com/wooorm/markdown-rs#overview
The process to parse markdown looks like this:

                    markdown-rs
+-------------------------------------------------+
|            +-------+         +---------+--html- |
| -markdown->+ parse +-events->+ compile +        |
|            +-------+         +---------+-mdast- |
+-------------------------------------------------+

If you want to work with raw events/tokens, rather than the AST, use the parse file/function.

jokteur · 2024-01-01T15:09:59Z

And would there be any way to use the parse file/function without forking the project? Currently this API is private, which doesn't allow me to implement my own compiler on top of markdown-rs.

ChristianMurphy · 2024-01-02T16:14:19Z

There is also a JavaScript version of this project. On the JS side there is a lower-level package, micromark, that exposes this: https://github.com/micromark/micromark
@wooorm may be able to comment on the intent on the Rust side.

wooorm · 2024-01-05T11:52:11Z

No, it’s not exposed yet. This project is currently at a stage where it has to get some traction IMO before all the internals are exposed, to figure out how to expose things, and whether to expose things.
