Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use external parsers fparser / lfortran #85

Open
gnikit opened this issue Mar 28, 2022 · 24 comments
Open

Use external parsers fparser / lfortran #85

gnikit opened this issue Mar 28, 2022 · 24 comments
Assignees
Labels
help wanted Extra attention is needed investigate Issue requires research to see if it can be implemented

Comments

@gnikit
Copy link
Member

gnikit commented Mar 28, 2022

The current parser for fortls is flimsy and commonly runs into problems when it comes to finding implementations of overloaded methods and preprocessor definitions. It would be interesting to see if fparser could be used to do the parsing of the source files instead.

Potential problems:

  1. Does not support f2018 and newer standards
  2. Parsing might not work in parallel
@gnikit gnikit self-assigned this Mar 28, 2022
@gnikit gnikit changed the title Use fparse as parser Use fparser as parser Mar 28, 2022
@ZedThree
Copy link

ZedThree commented Apr 6, 2022

I'm also looking at using fparser for ford -- I think that would help with Fortran-FOSS-Programmers/ford#405 too.

There's still some f2008 features it doesn't support yet either, but I'm already looking at implementing some of those

@gnikit
Copy link
Member Author

gnikit commented Apr 6, 2022

The problem with fparser is that it is very slow compared to the fortls parser or better yet with compiler parsers that give AST/ASR trees.
@certik and I have had some discussions on how lfortran (or lpython) could be used to parse source files since it is so lightweight and performant. In theory, one could use the generated AST/ASRs to create a language server.

However, in practice this is much more involved since the implementation of a server's request is usually very specific to the parser used. So, although it would be possible to swap fortl's parser with fparser or lfortran it would require a lot of work.

Fingers crossed this does not apply to ford!

@ZedThree
Copy link

ZedThree commented Apr 6, 2022

Unfortunately the parser in ford is very tightly coupled to the other parts, so swapping it out for something else would require some work. But I do think it would be worth doing so. I'm very interested in helping support some nature of common parsing library, it would take a bunch of the maintenance burden off ford!

The original author of ford had looked into using flang instead: hansec/fortran-language-server#205 (comment)
It's very appealing because of the other sorts of tools one could build off of it -- think about all the tooling that's available for C and C++ from clang.

@gnikit
Copy link
Member Author

gnikit commented Apr 6, 2022

it would take a bunch of the maintenance burden off ford!

@ZedThree I wholeheartedly understand.

I remember that post and I would be more than glad to also be part of it. However, I think we are running into a "show-stopper"; creating an abstract enough parser that can be used for fortls, ford, etc. is a long term project.

Specifically for fortls, if we decided to switch to fparser today, I believe it would take minimum 1 month fulltime to have an alpha version of fortls with the new parser, I suspect ford will also be along those lines. Logically, as cmacmackin said, someone/something needs to fund this work, because it is not a sustainable model if the entire project relies on volunteers.

I know lfortan has Summer of Code to sponsor some of this work but I suspect it's going to be a challenging set of projects, especially for students.

In general, I think a user friendly, robust and performant Fortran parser is a project that would require at least a couple people working on it continuously part-time to ensure timely bug fixes and quick implementation of new standards. Unfortunately, I don't see that happening anytime soon.

From what I can tell this is one of the objectives for lfortran and lpython, but I am not sure at what stage the project currently is. Maybe @certik can shed some light as to how would one go about obtaining funding for creating/utilising an existing parser and making an API for other tools to use and if lfortran is working towards something like this.

@certik
Copy link
Member

certik commented Apr 6, 2022

LFortran has a Fortran 2018 parser and AST that is very fast to parse and as far as we know complete. If you find a bug, we'll try to fix it quickly. If you are interested in using it, I am happy to help out.

@gnikit
Copy link
Member Author

gnikit commented Apr 6, 2022

@certik I was wondering, does lfortran have an API? I was looking through the dev docs and I didn't see anything being mentioned.

@certik
Copy link
Member

certik commented Apr 6, 2022

Yes, there is an API that you can use. We don't have a documentation for it, but there is an API that LFortran itself uses to walk and query the AST.

@gnikit
Copy link
Member Author

gnikit commented Apr 6, 2022

I found this: https://github.com/lfortran/lfortran/blob/master/src/lfortran/parser/parser.h
which I suspect is one of the ways to interact with your parser. I was wondering is there an equivalent Python version, or something alike?

EDIT: Is this the Python parser (and its usage comments) I should be looking?
https://gitlab.com/lfortran/lfortran.py/-/blob/master/lfortran/parser/parser.py

BTW To test your parser I would also suggest you try parsing some of the examples in https://github.com/gnikit/fortls/tree/master/test/test_source

They contain a lot of edge cases (unfortunately, they are mixed along with more traditional examples.

Another test, which fortls's parser currently fails is #72:

#define subroutine val

subroutine foo(val)
real :: val
end subroutine foo

@gnikit
Copy link
Member Author

gnikit commented Apr 30, 2022

@ZedThree you might want to have a look at https://github.com/pyparsing/pyparsing. One of the things I wanted to investigate was to create a rudimentary fortran parser with pyparsing but I have been quite busy this last month.

@certik
Copy link
Member

certik commented Apr 30, 2022

Yes, the parser is in src/lfortran/parser/parser.h. The very old Python version is in the lfortran.py repository, don't use that. If you are interested in using the new parser from Python, we can write Python wrappers for it.

You can test it like this:

$ lfortran --show-ast test_submodule.f90    
(TranslationUnit [(Submodule foo_module () submodule1 () [] [(ImplicitNone [] ())] [] [(Procedure foo1 [] [(SimpleAttribute AttrModule)] () [] [] [] [] [(Write 0 [(()) ((String "(A)"))] [] [(StrOp (StrOp (StrOp (String "testing :: ") Concat (FuncCallOrArray trim [] [(() a ())] [] [])) Concat (String "::")) Concat (FuncCallOrArray trim [] [(() b ())] [] []))] ())] [])])])

All the files seem to work, except test_inc.f90, I think we don't have include implemented yet (I created an issue for it: https://gitlab.com/lfortran/lfortran/-/issues/687).

@gnikit gnikit changed the title Use fparser as parser Use external parsers fparser / lfortran May 10, 2022
@gnikit gnikit added help wanted Extra attention is needed investigate Issue requires research to see if it can be implemented labels May 10, 2022
@gnikit
Copy link
Member Author

gnikit commented Jun 30, 2022

Yes, @certik Python wrappers would be great, basically something that returns the AST and/or ASR trees for a URI, but I know that you are super busy so I don't want to add more things to your plate!

@gnikit
Copy link
Member Author

gnikit commented Nov 23, 2022

Re: #235 (comment)

@certik

A language server's parser is not meant to be a strict implementation of the standard, like that of a compiler's. It is actually meant to be the opposite; permissive and error resistant. How is a compiler going to parse incomplete syntax? This for example breaks lfortran (and every other compiler) while fortls is capable of parsing it and giving completion suggestions at both inte to integer and print*, v to val or verify (intrinsic)

program name
  implicit none
  inte
  integer :: val
  print*, v
end program name

You can't rely on cached valid states for dynamic features like completion, hover and syntax highlighting simply because there is absolutely no guarantee that such state existed or will exist while the user types.

Moreover, a cached state is not useful for fetching info from the ASR if the underling ASR has changed, that will simply result in the wrong data being fetched. Cached states are good for improving performance of very few and specific LSP requests like like onSave and resolving includes (#include, include, use, import, etc.) without having to descend the dependency graph of the entire project.

Potentially, you can turn a compiler parser to something that a language server can use by removing all failure checks that are present and instead carrying on, attempting to parse the rest of the code. That will almost definitely come with a performance hit, since you will have to check for more types of AST nodes while also screening for false-positives, remember source code is incomplete e.g.

program main
implicit none
integer :: int = 1
int = int! while typing the second `int`, can you tell if it's a variable or an intrinsic function?
end program main

In summary, a compiler's parser and a language server's parser are meant to handle two very different states of source code. Caching valid states is not a valid strategy since those rarely exist and unless you can convert the compiler's parser to be more error resistant for invalid syntax, using a compiler's parser in a language server would not work.

@certik
Copy link
Member

certik commented Nov 23, 2022

There are three ways forward, each with pros / cons:

  • using a dedicated parser and semantics just for the language server (pro: you can do exactly what you want for a langauge server; con: you have to reimplement and maintain everything separately, duplicating the work that we need to do in fortran-lang especially in maintaining it)
  • extending LFortran's parser+semantics to be able to handle invalid syntax and invalid semantics (pro: we'll maintain just one parser and one semantics and we can collaborate on the same project; con: we have to figure out if it is possible to do this with high performance)
  • using existing ("strict") LFortran parser+semantics and using the ASR from the last valid state to query auto completion (pro: we already have the full F2018 parser and semantics is rapidly improving; con: it requires a valid state)

I encourage you to try to figure out from your end how we can collaborate instead of giving up on collaboration simply because there are slightly different requirements for a language server compared to a compiler. For example the C# compiler works as a language server. How does the C++ clang language server handle this? Do they really maintain a separate parser and semantics for clangd of all of C++ (in https://clangd.llvm.org/ they say "clangd is based on the Clang C++ compiler, and is part of the LLVM project")? I think using the last valid state is not a bad strategy, I think it can work very well in practice. As well as figuring out to extend the parser/semantics to account more for errors (we use Bison, which does have some support: https://www.gnu.org/software/bison/manual/html_node/Error-Recovery.html, but I haven't used it yet --- there could be a compromise, there might be a way to handle errors on a statement line inside a function, the page talks about that; without losing any performance).

If we can collaborate on a single project, we both win big time, as well as the whole community, that will not be fragmented maintaining two separate compiler frontends. The Fortran community is small, we need to figure out how to reuse each others work, instead of each of us working on our own project and not wanting to collaborate. I don't need to use LFortran's parser+semantics, I would be happy to use yours if it can be as fast and can handle as much Fortran as LFortran can. It seems LFortran's parser + semantics is closer to the "common goal", that's why I recommend figuring out how we can use it.

@gnikit
Copy link
Member Author

gnikit commented Nov 24, 2022

For me the only viable solution is no. 2. I know that no. 3 seems that it would work but me personally, based completely on the knowledge I have so far about parsers and language servers cannot see how that would work robustly. LSP requests are served on the latest version of your doc, how could you query your AST for param X at line Y and column Z when that parameter did not exist in the cached AST.

For clangd, I don't know how they do it. I will have to sit down and carefully go through their project or speak with someone from the project directly. Unfortunately, I don't have the time right now.

If we can collaborate on a single project, we both win big time, as well as the whole community, that will not be fragmented maintaining two separate compiler frontends.

I agree, hence this issue, we just need to allocate some time to work on this and set some targets. At least for me the problem right now is that I don't have much free time for extra work.

I don't need to use LFortran's parser+semantics, I would be happy to use yours if it can be as fast and can handle as much Fortran as LFortran can. It seems LFortran's parser + semantics is closer to the "common goal", that's why I recommend figuring out how we can use it.

I would say that it should be the opposite, use a better parser (LFortran) and potentially attach it to an existing LSP frontend like fortls OR copy the algorithms from fortls into your own frontend. fortls will become way more reusable once pygls releases v1.0.0 but I understand if you want to develop your own frontend.

There is a lot more work to be done to make a good language server, be that fortls or a new one under LFortran. The parser is just one of the many things needed. In my personal view implementing some of the LSP requests is noticeably harder than writing a parser.

@certik
Copy link
Member

certik commented Nov 24, 2022

Ok. Let's experiment with 2. and see if we can satisfy the LSP requirements. I'll start with incomplete statements, that would be the most useful and then we'll go from there.

How should we represent this in AST --- InvalidStatement AST node?

Is the idea that for a compiler application, the compiler might decide to just stop the processing until all syntax errors are fixed (that's the current behavior), while for the LSP application you will then simply ignore InvalidStatement and try to construct ASR as best as you can, ignoring/recovering from whatever semantic errors you encounter? So for an LSP application you want to construct an ASR as best as you can for as large invalid source codes as you can?

@certik
Copy link
Member

certik commented Dec 6, 2022

@gnikit let's continue the discussion. As fortran-lang, we should not be developing two parsers and semantics, the ball is in your court to lead this change.

@gnikit
Copy link
Member Author

gnikit commented Dec 7, 2022

We have to distinguish between the usage of the 2 parsers. Importantly there are multiple fundamental differences between fortls and LFortran, that make replacing the parser of the former with the latter a difficult task.

  • Fortls is stable while LFortran is in beta.
  • Fortls is platform independent while LFortran's parser requires compilation and external libraries.
  • Equally important, fortls is self-contained and we control the parser. LFortran is very actively developed with multiple people making changes to the parser. We can't have people altering the behaviour of the parser and breaking the language server.
  • I am also not familiar with how to contribute, extend and test the LFortran parser.
  • finally, there's the whole "error-resistant" part of the parser which I frankly have no idea how to address.

Just because the fortls parser cannot parse every single semantic or definition of the standard does not make it inferior. As I demoed in the linked issue, it's equally easy to break any compilers' parser as it is to break fortls'.


With regards to me making this happen. I am afraid it will take more than just myself. There are multiple components in Fortls and LFortran that we would have to redesign and/or create. I know some of them but I would have to spend a nontrivial amount of time researching the possible solutions to others, something I currently can't do. The tasks I see as needing research to be able to make informed design decisions are:

  • C++ to/from Python interfacing that's async and thread safe.
  • Easy and portable integration of LFortran's parser into fortls via PyPi, no external dependencies
  • a mechanism to ensure that LFortran developers don't break fortls when they make changes to the parser (potentially new testing suite)

Finally, I would like to emphasize that we can't break fortls for the sake of merging the two parsers. Fortls has had a pretty good track record of not breaking the language server. This endeavour needs to be realistic, the merge should add features to the existing capabilities of the language server, not remove any. If it's deemed easier to create a second language server in LFortran we should do that instead, and I would be happy to be part of it.

@gnikit
Copy link
Member Author

gnikit commented Dec 7, 2022

I will set aside some time (~2h) by Friday to better understand what clangd is doing, that should provide some guidance as to the "error-resistant" strategy for the parser.

@certik
Copy link
Member

certik commented Dec 8, 2022

You have to first define the requirements on the parser. For example if we said the requirements is to only parse arithmetic expressions and nothing else, then indeed it's easy to make a stable parser.

I propose the requirements are that it can parse all valid Fortran code. And possibly some invalid code, to be determined.

LFortran can parse all valid Fortran code, as far as we know. Fortls cannot. Consequently, Fortls is either a prototype or alpha, but it cannot be beta or production (stable). LFortran's parser on the other hand is beta. Much more stable and further along.

Consequently, we should switch to the more advanced parser as soon as we can, not wait, as it will get harder and harder the further fortls is developed.

@Sean1708
Copy link

@gnikit Are you aware of tree-sitter? It was designed specifically for language servers and similar tools, so syntax errors don't prevent the rest of the file from being parsed. E.g.:

$ cat test.f90 && echo && tree-sitter generate && tree-sitter parse test.f90
module foo
  jkljfskla$$$
end module

program main
  sjkldsal@!@
end

(translation_unit [0, 0] - [7, 0]
  (module_unit [0, 0] - [2, 10]
    name: (identifier [0, 7] - [0, 10])
    (ERROR [1, 2] - [1, 14]
      (ERROR [1, 11] - [1, 14])))
  (program_unit [4, 0] - [6, 3]
    name: (identifier [4, 8] - [4, 12])
    (ERROR [5, 2] - [5, 11]
      (ERROR [5, 10] - [5, 11]))
    (comment [5, 11] - [5, 13])))

I unfortunately can't say whether it would be easier to update LFortran's parser or write a tree-sitter grammar from scratch, but it is another option that you might want to look into.

@ZedThree
Copy link

ZedThree commented Feb 6, 2023

Tree-sitter certainly seems like it has significant upsides, as it's designed to be very fast and fault tolerant -- ideal for things like syntax highlighting and LSP.

The downside might be that it's designed for parsing single files at a time, and Fortran often requires information from used modules at the parsing stage. I'm not sure how much of a limitation that is in practice.

@freevryheid
Copy link
Contributor

The new helix editor uses tree-sitter although not for fortran as yet , although fortls is implemented. Perhaps this?

@ZedThree
Copy link

ZedThree commented Feb 6, 2023

Ah, I didn't realise there was a Fortran parser already implemented, thanks @freevryheid! While it's not as complete as any of the parsers discussed here (fortls, ford, fparser or lfortran), it is blazing fast (0.6 s vs 3.8 s for fortls on a benchmark problem).

Tree-sitter also comes with a query language which is supposed to be very fast. I've not had a play with that yet, but it looks promising.

I'll try and find some time to experiment with getting into ford

@gnikit
Copy link
Member Author

gnikit commented Feb 6, 2023

Thanks for the link @Sean1708, tree-sitter looks interesting, but I don't think we will be making a push for it anytime soon.
We have decided that probably the best thing would be to get LFortran to act as an error tolerant parser (same way clang and friends operate), output the AST to JSON (something already possible with LFortran dev) and parse it in fortls, or any other program. In that way, we will combine our efforts and prevent work duplication.

This functionality will be in addition to the existing Python parser which is relatively error tolerant but slow as hell.

For fortls currently the priorities are focused on the LSP side of things and also ironing out some bugs before the major v3.0.0 release (which I have postponed since October)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed investigate Issue requires research to see if it can be implemented
Projects
None yet
Development

No branches or pull requests

5 participants