WIP: Fix blocks (Lexer/Parser rework) #104

GuiltyDolphin · 2021-06-29T14:16:07Z

WORK IN PROGRESS

Fixes No inline markup in blocks #16
Fixes newline breaks italic markup #34
Fixes Markup Not Preserved Across Linebreaks #80
Fixes Cannot Nest Rich Text Formatting #81
Fixes table, table.row and table.cell start at column 2, should be column 1 #84
Resolves Update blocks to better match the spec #101

Outline

in addition to changes from Fix unclosed property drawers causing a crash #102 ...
fixes unclosed property drawers being omitted from AST
fixes unopened property drawer ends being omitted from AST
fixes some type inconsistencies that allowed for runtime errors
refactors all token tests to make them easier and faster to extend
adds support for parser-combinator style parsing
reworks the lexer and parser for ease of extension
fixes priority cookies not being parsed in headings
adds support for multi-line text markup (this is bounded by a search over 2-3 lines, so shouldn't significantly affect performance)
adds support for markup inside text markup (per the spec)
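
For readers unfamiliar with the parser-combinator style mentioned in the outline, here is a minimal sketch. The names (`Token`, `Parser`, `token`, `seq`, `many`) and the `sep`/`cell` token types are hypothetical and chosen for illustration; they are not taken from the PR's actual code.

```typescript
// Hypothetical token shape; the real lexer's tokens likely carry positions too.
interface Token {
  type: string;
  value: string;
}

// A parser reads tokens from `pos` and either fails (null) or returns a
// value plus the index of the next unread token.
type ParseResult<T> = { value: T; next: number } | null;
type Parser<T> = (tokens: Token[], pos: number) => ParseResult<T>;

// Match exactly one token of the given type.
function token(type: string): Parser<Token> {
  return (tokens, pos) => {
    const t = tokens[pos];
    return t !== undefined && t.type === type ? { value: t, next: pos + 1 } : null;
  };
}

// Run two parsers one after the other and pair their results.
function seq<A, B>(pa: Parser<A>, pb: Parser<B>): Parser<[A, B]> {
  return (tokens, pos) => {
    const a = pa(tokens, pos);
    if (a === null) return null;
    const b = pb(tokens, a.next);
    if (b === null) return null;
    const value: [A, B] = [a.value, b.value];
    return { value, next: b.next };
  };
}

// Apply a parser zero or more times; never fails, stops on no progress.
function many<A>(p: Parser<A>): Parser<A[]> {
  return (tokens, pos) => {
    const values: A[] = [];
    let next = pos;
    while (true) {
      const r = p(tokens, next);
      if (r === null || r.next === next) break;
      values.push(r.value);
      next = r.next;
    }
    return { value: values, next };
  };
}

// A table row as repeated (separator, cell) pairs; the trailing separator
// is left unconsumed in this sketch.
const row = many(seq(token("sep"), token("cell")));
const parsedRow = row(
  [
    { type: "sep", value: "|" },
    { type: "cell", value: "a" },
    { type: "sep", value: "|" },
    { type: "cell", value: "b" },
    { type: "sep", value: "|" },
  ],
  0
);
```

The appeal of this style is that small parsers compose into larger ones without each rule having to manage the token index by hand.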

Treatment of blocks

The syntax draft implies that greater blocks can be nested (e.g., quotes inside quotes), but org-element-parse-buffer parses the inner block as a paragraph (with text like #+BEGIN_QUOTE) inside a quote block. If you try to HTML-export the document, Org produces only text, with no blocks. Which option do we want here? I think allowing quote blocks inside quote blocks might be the best option; the user can then decide whether to render them as text or not.
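
To make the nested option concrete, the sketch below builds a quote-inside-quote tree and shows how a consumer could still choose to render the inner block as literal text. The node shapes (`Text`, `Paragraph`, `QuoteBlock`) and the `toPlainLines` helper are simplified assumptions for illustration, not orga's actual AST types.

```typescript
// Simplified, hypothetical node shapes.
interface Text {
  type: "text";
  value: string;
}
interface Paragraph {
  type: "paragraph";
  children: Text[];
}
interface QuoteBlock {
  type: "block";
  name: "QUOTE";
  children: Array<QuoteBlock | Paragraph>;
}

// Nested option: the inner quote is a real block node in the tree.
const nested: QuoteBlock = {
  type: "block",
  name: "QUOTE",
  children: [
    {
      type: "block",
      name: "QUOTE",
      children: [{ type: "paragraph", children: [{ type: "text", value: "inner" }] }],
    },
  ],
};

// A consumer that prefers the text rendering can flatten any block back
// into its literal source lines.
function toPlainLines(node: QuoteBlock | Paragraph): string[] {
  if (node.type === "paragraph") {
    return [node.children.map((t) => t.value).join("")];
  }
  const lines: string[] = ["#+BEGIN_QUOTE"];
  for (const child of node.children) {
    lines.push(...toPlainLines(child));
  }
  lines.push("#+END_QUOTE");
  return lines;
}

// Render only the inner block as text, keeping the outer one as a block.
const innerAsText = toPlainLines(nested.children[0]);
```

Going from the nested tree to text is a simple traversal, whereas recovering nesting from a text paragraph would require re-parsing, which is one argument for producing the nested form.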

Todo

make sure everything works with @types/unist version 2.0.5
list bullet/checkbox to be included in AST
robust parsing of objects inside verse blocks
update documentation + examples to reflect changes
update other orgajs packages to account for changes

GuiltyDolphin · 2021-06-29T16:20:34Z

I'm finding it difficult to work with unist's {[k: string]: unknown} catch-all index signature for Node. It wipes out a lot of typechecking.

Using a version of Node without the catch all really helps for internal typechecking - but anything that uses unist then needs to have types switched to e.g., Document & unist.Node, which is a bit annoying.

Trying to figure out the nicest way to handle this that gives us good type safety but doesn't require external API changes.

Specifically, the issue arises when using functions from unified which expect e.g., Document to support arbitrary keys. Not sure why it's been handled like this, as the functions in unified shouldn't be using any keys they haven't verified exist through typing (and thus the typing in unist shouldn't need to require support for arbitrary keys).
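
A small sketch of the problem described above. The interfaces here are local stand-ins written for this example (`LooseNode` mimics a node type with the catch-all index signature, `StrictNode` omits it); they are not imported from @types/unist, and `countKeys` is a hypothetical placeholder for an external function typed against the loose node.

```typescript
// Stand-in for a node type WITH the catch-all index signature.
interface LooseNode {
  type: string;
  [k: string]: unknown;
}

// Stand-in for a node type WITHOUT it.
interface StrictNode {
  type: string;
}

interface LooseDocument extends LooseNode {
  type: "document";
  children: LooseNode[];
}

interface StrictDocument extends StrictNode {
  type: "document";
  children: StrictNode[];
}

const loose: LooseDocument = { type: "document", children: [] };
const strict: StrictDocument = { type: "document", children: [] };

// The misspelled key compiles: the index signature accepts any string key
// and types the result as `unknown`.
const looseTypo: unknown = loose.chidren;

// @ts-expect-error: without the index signature the compiler rejects the typo.
const strictTypo: unknown = strict.chidren;

// Hypothetical external function that expects the loose node type.
function countKeys(node: LooseNode): number {
  return Object.keys(node).length;
}

// Bridging at the boundary: intersect the strict type with the loose node,
// as in the `Document & unist.Node` workaround mentioned above.
const bridged: StrictDocument & LooseNode = { type: "document", children: [] };
const keyCount = countKeys(bridged);
```

The intersection keeps the strict shape for internal code while satisfying callers that insist on the index signature, at the cost of spelling it out at every boundary.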

GuiltyDolphin · 2021-06-30T10:50:23Z

Re Node, I've opened a discussion, but I don't know whether the changes will be agreed or not.

For now, I'll keep thinking about how we can have good type safety internally, whilst allowing users to choose the less safe variant of Node from @types/unist when they prefer. I'll do this in a separate PR to make it easier to review, then either split this PR to not include the changes, or rebase it onto that PR @xiaoxinghu.

GuiltyDolphin · 2021-06-30T13:14:09Z

~~@xiaoxinghu Okay, I have a good solution (at least internally). (Edit: still trying to fix)~~

~~As before, internally make sure to import Node and Parent from types.ts rather than unist.~~

~~Now, when exporting in index.ts we just add the index signature to Node (see here), and done! That's all we need to maintain compatibility 😄~~

Edit: maybe I spoke too soon---this seems to be causing issues too (needed to clean and recompile). Gonna see if I can get the re-defining of the interface to work

Edit: the fix didn't work - propagated too quickly. However, discussions are going well so we might be able to use Node from unist if it gets updated :)

Edit: unist is updated! Hopefully will be able to propagate this to the other orga packages.

GuiltyDolphin · 2021-07-05T12:05:50Z

@xiaoxinghu update on block parsing.

There are three main cases of block contents we need to watch out for:

"greater" and "special" blocks, which can contain other elements
"element" blocks, which can contain only text
"verse" blocks (a special case of "element" blocks), which can contain objects (but not elements)

I've updated the parser/lexer so it can now handle "greater" and "element" blocks, and to a degree it can handle "verse" blocks.
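
The three cases can be summarised as a small classification. This is a sketch only: `BlockKind`, `blockKind`, and `contentPolicy` are hypothetical names, and the mapping from block names to kinds follows the Org syntax draft's lists of lesser blocks (COMMENT, EXAMPLE, EXPORT, SRC, VERSE) rather than anything in this PR.

```typescript
type BlockKind = "greater" | "element" | "verse";

// Classify a block by the name after #+BEGIN_ (case-insensitive).
function blockKind(name: string): BlockKind {
  switch (name.toUpperCase()) {
    case "COMMENT":
    case "EXAMPLE":
    case "EXPORT":
    case "SRC":
      return "element"; // contents are raw text
    case "VERSE":
      return "verse"; // contents may include objects, but not elements
    default:
      return "greater"; // CENTER, QUOTE, and special blocks
  }
}

interface ContentPolicy {
  allowsElements: boolean;
  allowsObjects: boolean;
}

// What the parser may look for inside each kind of block.
function contentPolicy(kind: BlockKind): ContentPolicy {
  switch (kind) {
    case "greater":
      // Objects appear inside the nested elements (e.g. paragraphs).
      return { allowsElements: true, allowsObjects: true };
    case "verse":
      return { allowsElements: false, allowsObjects: true };
    case "element":
      return { allowsElements: false, allowsObjects: false };
  }
}
```

For example, `contentPolicy(blockKind("verse"))` reports that objects are allowed but elements are not, which is the case the verse-block exceptions are needed for.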

However, as it currently stands, I need to add exceptions for lexing objects in verse blocks. We will likely need to apply the same approach in other areas, but in its current form it isn't very maintainable.

As a result, I'm looking into one of these two methods:

1. rewriting parts of the lexer so that we can have a more state/context-based system - i.e., we only lex certain tokens when we know they have correct meaning (can only lex object tokens in verse block, for example)
2. simplifying the lexer to just look for simple tokens, and do more of the work in the parser

Point 1 has the advantage of simplifying some of the parsing, and means we don't have to worry about converting tokens in the parser.

Point 2 has the advantage of simplifying lexing to be more clearly token-based without too much lookahead, but means that we may need to do some re-parsing depending on how we handle the lexer.
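
As a rough illustration of the first method (context-based lexing), the sketch below only emits a markup token when the current context permits objects. Everything here is hypothetical: `Context`, `Tok`, and `lexLine` are made-up names, and the bold-matching regex is a deliberate simplification of Org's real markup rules.

```typescript
// Where the lexer currently is; only some contexts permit objects.
type Context = "paragraph" | "verse" | "src";

interface Tok {
  type: "text" | "bold";
  value: string;
}

// Lex a single line. In a "src" context the whole line is plain text;
// in "paragraph" and "verse" contexts *bold* spans become their own tokens.
function lexLine(line: string, ctx: Context): Tok[] {
  if (ctx === "src") {
    return [{ type: "text", value: line }];
  }
  const tokens: Tok[] = [];
  // Simplified: an asterisk pair whose contents neither start nor end with
  // whitespace. Org's actual rules also constrain surrounding characters.
  const bold = /\*([^*\s](?:[^*]*[^*\s])?)\*/g;
  let last = 0;
  let m: RegExpExecArray | null;
  while ((m = bold.exec(line)) !== null) {
    if (m.index > last) {
      tokens.push({ type: "text", value: line.slice(last, m.index) });
    }
    tokens.push({ type: "bold", value: m[1] ?? "" });
    last = m.index + m[0].length;
  }
  if (last < line.length) {
    tokens.push({ type: "text", value: line.slice(last) });
  }
  return tokens;
}

const inVerse = lexLine("a *b* c", "verse");
const inSrc = lexLine("a *b* c", "src");
```

Under the second method, the lexer would instead always emit simple tokens (such as a bare `*`) and leave it to the parser to decide, based on the enclosing block, whether they form markup.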

I think point 2 currently feels a bit more daunting because the lexer needs to be more atomic; once the lexer is more atomic, it should actually be easier.

I'll look into point 2 and let you know how it goes - it should make life a lot easier in updating parsing in the future.

subthedubdub · 2021-07-21T02:45:10Z

👍 Wow, what a refactor!

GuiltyDolphin · 2021-07-21T08:11:09Z

> 👍 Wow, what a refactor!

Yep. I'm trying to be as comprehensive with tests as possible to make it easier to review, but it's always gonna be a bit of a beast.

Once the backend stuff is done w/ the lexer and parser, I'll need to make sure that all the frontend packages work correctly with this, which might be harder.

xiaoxinghu · 2021-08-08T22:58:00Z

@GuiltyDolphin, this looks dangerous to me. I suggest that we break these changes down into smaller pieces, because I am doing some pretty major refactoring work myself. I am also changing the way tests are written, so I think it'd be easier to do these tests after mine. I will try my best to get it out of my computer.

GuiltyDolphin · 2021-08-09T13:00:34Z

@xiaoxinghu Alright! I'll close this for now until your changes are done, and then see how things look after that.

GuiltyDolphin force-pushed the gd/fix-blocks branch from 08c69f3 to a94a406 Compare July 8, 2021 13:14

GuiltyDolphin changed the title ~~WIP: Fix blocks~~ WIP: Fix blocks (Lexer/Parser rework) Jul 8, 2021

GuiltyDolphin force-pushed the gd/fix-blocks branch from a94a406 to a092485 Compare July 12, 2021 07:46

GuiltyDolphin added 23 commits July 17, 2021 13:32

Enable noImplicitAny and strictNullChecks

b3ba38b

Nit some formatting

60fe5f5

Add some tests for numberOfLines

454fbaf

Switch to using Point and Position from unist

6ab1777

Add jest configuration

6702dbf

Add some more numberOfLines tests

d30f0eb

Fix some bugs with toIndex

1817b5d

Added some documentation too.

Ensure number of lines of empty string is 0

4161f66

Remove redundant type assertions

d3a2ef9

Rework location

2545e81

Add tests for shift

95a372c

Add docstring for shift

a216dc3

Add some tests for linePosition

26d6b77

Add lengthOfLine helper

1836527

Fix up substring

3ffac1e

Fix match

735eeab

Add TextKit type

283b817

Rewrite match/substring to not include end

a5a55b5

Add test for location inverse of toIndex

46f93b6

Add eof helper

9e8c48e

Add eol helper

80ef3f2

Add lastNonEOL helper

93e305f

Fix incorrect linePosition calculation

c26b8a7

GuiltyDolphin added 19 commits July 17, 2021 14:11

Fix positions being elided from lexer results

d31c78f

Fix missing positions in parser debugger

bd37570

Fix positions of table cells in parsing

ca7409c

Add some parsing combinators + switch over table

3fd4df9

Fix an implicit use of any

34d11d6

Switch orga to use strictNullChecks

5a2d42e

Much safer this way, managed to catch a bunch of potential crashes.

Update drawer parsing to use combinators

de3d72c

Improve type safety with eat

ac41666

Update headline parser to use combinators

32566ad

Ensure priority is parsed in heading

60030e5

Move timestamp lexer into tokenize directory

5d44a1e

Add todo

3d0e816

Remove timestamp parser from global exports

711ebed

Update getChar to return a Char

3b3910d

Ensure POST is wrapped as a group

52d7fe4

Update some dependencies

6b019a6

Nit

eb6ecb5

Add initial support for multi-line markup

a8b8d36

Add support for objects in certain text markup

d39ae90

GuiltyDolphin force-pushed the gd/fix-blocks branch from a092485 to d39ae90 Compare July 18, 2021 11:54

GuiltyDolphin closed this Aug 9, 2021
