Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Fix blocks (Lexer/Parser rework) #104

Closed
wants to merge 145 commits into from

Conversation

GuiltyDolphin
Copy link
Contributor

@GuiltyDolphin GuiltyDolphin commented Jun 29, 2021

WORK IN PROGRESS

Outline

  • in addition to changes from Fix unclosed property drawers causing a crash #102 ...
  • fixes unclosed property drawers being omitted from AST
  • fixes unopened property drawer ends being omitted from AST
  • fixes some type inconsistencies that allowed for runtime errors
  • refactors all token tests to make them easier and faster to extend
  • adds support for parser-combinator style parsing
  • reworks the lexer and parser for ease of extension
  • fixes priority cookies not being parsed in headings
  • adds support for multi-line text markup (this is bounded by a search over 2-3 lines, so shouldn't significantly affect performance)
  • adds support for markup inside text markup (per the spec)

Treatment of blocks

  • the syntax draft implies that you can nest greater blocks (e.g., quotes inside quotes), but org-element-parse-buffer parses this as a paragraph (with text like #+BEGIN_QUOTE) inside a quote block - if you try to HTML export the document, Org produces entirely text in no blocks. Which option do we want here? I think allowing quote blocks inside quote blocks might be the best option, then the user can decide whether to render these as text or not

Todo

  • make sure everything works with @types/unist version 2.0.5
  • list bullet/checkbox to be included in AST
  • robust parsing of objects inside verse blocks
  • update documentation + examples to reflect changes
  • update other orgajs packages to account for changes

@GuiltyDolphin
Copy link
Contributor Author

GuiltyDolphin commented Jun 29, 2021

I'm finding it difficult to work with unist's {[k: string]: unknown} catch all index for Node. Wipes out a lot of typechecking.

Using a version of Node without the catch all really helps for internal typechecking - but anything that uses unist then needs to have types switched to e.g., Document & unist.Node, which is a bit annoying.

Trying to figure out the nicest way to handle this that gives us good type safety but doesn't require external API changes.

Specifically, the issue arises when using functions from unified which expect e.g., Document to support arbitrary keys. Not sure why it's been handled like this, as the functions in unified shouldn't be using any keys they haven't verified exist through typing (and thus the typing in unist shouldn't need to require support for arbitrary keys).

@GuiltyDolphin
Copy link
Contributor Author

Re Node, I've opened a discussion, but I don't know whether the changes will be agreed or not.

For now, I'll keep thinking about how we can have good type safety internally, whilst allowing users to choose the less safe variant of Node from @types/unist when they prefer. I'll do this in a separate PR to make it easier to review, then either split this PR to not include the changes, or rebase it onto that PR @xiaoxinghu.

@GuiltyDolphin
Copy link
Contributor Author

GuiltyDolphin commented Jun 30, 2021

@xiaoxinghu Okay, I have a good solution (at least internally). (Edit: still trying to fix)

As before, internally make sure to import Node and Parent from types.ts rather than unist.

Now, when exporting in index.ts we just add the index signature to Node (see here), and done! That's all we need to maintain compatability 😄

Edit: maybe I spoke too soon---this seems to be causing issues too (needed to clean and recompile). Gonna see if I can get the re-defining of the interface to work

Edit: the fix didn't work - propagated too quickly. However, discussions are going well so we might be able to use Node from unist if it gets updated :)

Edit: unist is updated! Hopefully will be able to propagate this to the other orga packages.

@GuiltyDolphin
Copy link
Contributor Author

GuiltyDolphin commented Jul 5, 2021

@xiaoxinghu update on block parsing.

There are three main cases of block contents we need to watch out for:

  • "greater" and "special" blocks, which can contain other elements
  • "element" blocks, which can contain only text
  • "verse" blocks (a special case of "element" blocks), which can contain objects (but not elements)

I've updated the parser/lexer so it can now handle "greater" and "element" blocks, and to a degree it can handle "verse" blocks.

However, as it currently stands, I'm needing to add exceptions for lexing objects for the verse block in a way that we will likely need to be able to apply in other areas, but currently isn't very maintainable.

As a result, I'm looking into one of these two methods:

  1. rewriting parts of the lexer so that we can have a more state/context-based system - i.e., we only lex certain tokens when we know they have correct meaning (can only lex object tokens in verse block, for example)
  2. simplifying the lexer to just look for simple tokens, and do more of the work in the parser

Point 1 has the advantage of simplifying some of the parsing, and means we don't have to worry about converting tokens in the parser.

Point 2 has the advantage of simplifying lexing to be more clearly token-based without too much lookahead, but means that we may need to do some re-parsing depending on how we handle the lexer.

I think point 2 currently feels a bit more daunting beacuse the lexer needs to be more atomic - once the lexer is more atomic then it should actually be easier.

I'll look into point 2 and let you know how it goes - it should make life a lot easier in updating parsing in the future.

@GuiltyDolphin GuiltyDolphin changed the title WIP: Fix blocks WIP: Fix blocks (Lexer/Parser rework) Jul 8, 2021
@subthedubdub
Copy link

👍 Wow, what a refactor!

@GuiltyDolphin
Copy link
Contributor Author

👍 Wow, what a refactor!

Yep. I'm trying to be as comprehensive with tests as possible to make it easier to review, but it's always gonna be a bit of a beast.

Once the backend stuff is done w/ the lexer and parser, I'll need to make sure that all the frontend packages work correctly with this, which might be harder.

@xiaoxinghu
Copy link
Collaborator

xiaoxinghu commented Aug 8, 2021

@GuiltyDolphin , this looks dangerous to me, I suggest that we break these changes down into smaller pieces, because I am doing some refactor work myself which is pretty major. Also, I am changing the way tests are written, so I think it'd be easier to do these tests after mine, I will try my best to get it out of my computer.

@GuiltyDolphin
Copy link
Contributor Author

@xiaoxinghu Alright! I'll close this for now until your changes are done, and then see how things look after that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants