Clang JSON parser #120

Open
tobiashienzsch opened this issue Apr 24, 2021 · 4 comments

@tobiashienzsch

tobiashienzsch commented Apr 24, 2021

Hi Jonathan,

I've been looking into the clang JSON output you mentioned in the standardese issue #195 for the last couple of days. Since I have no clue how deep your own research into this topic has gone, I just wanted to share what I've found so far.

Limitations

Preprocessor

The clang JSON output is produced after the preprocessor has run, so unfortunately we won't be able to create entities like cpp_macro_definition, cpp_macro_parameter and cpp_include_directive. I tried to find a tool in the LLVM ecosystem that exposes more information about the preprocessor, but was unable to find anything useful. pp-trace looked promising, but running pp-trace-12 --callbacks='*' /path/to/file.hpp did not generate anything useful.

Nested Namespaces

Each namespace shows up as its own entity in the JSON document, so we would need to match the line numbers to see whether it is part of a nested namespace declaration.
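
A minimal sketch of that line-matching idea, assuming the location data has already been pulled out of the JSON. The `namespace_node` struct, its fields, and both function names are hypothetical and not part of cppast or clang. The additional closing-brace check is an assumption meant to tell `namespace a::b {}` apart from `namespace a { namespace b {} }` written on one line, and would need to be verified against real dumps.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified view of a namespace entity from the JSON dump.
// The caller is assumed to fill these fields from the node's location data.
struct namespace_node
{
    std::string name;
    unsigned begin_line = 0; // line on which the declaration starts
    unsigned end_offset = 0; // file offset of the closing brace
    std::vector<namespace_node> children; // directly nested namespaces only
};

// Heuristic (assumption): the inner namespace belongs to the same
// `namespace outer::inner { ... }` declaration if it starts on the same
// line as the outer one and both end at the same closing brace.
bool is_nested_declaration(const namespace_node& outer, const namespace_node& inner)
{
    return outer.begin_line == inner.begin_line
           && outer.end_offset == inner.end_offset;
}

// Follows a chain of single nested children and joins their names,
// e.g. "a::b" for `namespace a::b { ... }`.
std::string nested_name(const namespace_node& ns)
{
    std::string result = ns.name;
    const namespace_node* current = &ns;
    while (current->children.size() == 1
           && is_nested_declaration(*current, current->children.front()))
    {
        current = &current->children.front();
        result += "::" + current->name;
    }
    return result;
}
```

Calling `nested_name` on the outermost node yields the joined name only when every level passes the heuristic; otherwise the chain stops at the first namespace that was declared separately.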

Semantic Parent

Similar to the namespace above, we would need to track this manually. I'm not familiar enough with libclang yet to tell if this is going to be a minor or major challenge for the JSON parser.

JSON conformance

Both Boost.JSON and simdjson follow the JSON standard strictly. The standard permits a parser to cap the nesting depth, and Boost.JSON defaults to a limit of 32 levels. Boost.JSON has parsing options to change this behavior; I haven't found the equivalent setting for simdjson yet, so I'm not sure whether it can be changed there. I've created a couple of test source files to see how big the resulting .json files get. When you include all STL headers and run clang++ -Xclang -ast-dump=json -std=c++17 stl_headers.hpp > test.json, the resulting file is 550 MB. To successfully parse the etl/string.hpp header that is mentioned in the standardese issues, I had to raise the nesting limit before parsing.
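
To pick a limit that is high enough, the depth of a dump can be measured before handing it to the real parser. The following is a sketch using only the standard library; `max_nesting_depth` is a hypothetical helper, not an API of Boost.JSON or simdjson, and it does not validate the input. How each library counts levels may differ slightly, so some headroom on top of the measured value is advisable.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Returns the deepest level of object/array nesting found in `json`.
// Brackets inside string literals are ignored. The text is not validated.
std::size_t max_nesting_depth(std::string_view json)
{
    std::size_t depth   = 0;
    std::size_t deepest = 0;
    bool in_string = false;
    bool escaped   = false;

    for (char c : json)
    {
        if (in_string)
        {
            if (escaped)
                escaped = false;   // character after a backslash is skipped
            else if (c == '\\')
                escaped = true;
            else if (c == '"')
                in_string = false; // closing quote
            continue;
        }

        switch (c)
        {
        case '"':
            in_string = true;
            break;
        case '{':
        case '[':
            ++depth;
            deepest = std::max(deepest, depth);
            break;
        case '}':
        case ']':
            if (depth > 0)
                --depth;
            break;
        default:
            break;
        }
    }
    return deepest;
}
```

For files in the hundreds of megabytes, the same loop can be fed chunk by chunk instead of loading the whole dump, as long as the four state variables are kept between chunks.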

Results So Far

I've successfully parsed cpp_enum & cpp_enum_value entities using the Boost.JSON parser. I definitely need to do more tests before I can tell how much work the complete parser is going to be. Switching to simdjson once I really start working on pull requests should not be an issue, since both libraries seem to provide similar APIs. I used Boost for the testing because I've been working with it for the last couple of months. I think simdjson is a very valid option as well. I don't really have a preference, but I think the selection should be based on performance metrics first, since the files that need to be parsed can get huge.

Toby

@foonathan

Thanks for looking into it; I plan on starting it in May.

> Preprocessor

Does it support comments?

> JSON conformance

simdjson/simdjson#8 seems to indicate that you can specify it for simdjson.

> Switching to simdjson once I really start working on pull requests should not be an issue.

If you could do PRs, that would be great. I think it would be best if I do the necessary infrastructure changes first, and then we can both work on adding parsers for specific declarations?

@tobiashienzsch

> Does it support comments?

Yes, the following code produces this hierarchy:

  • FullComment
    • ParagraphComment
    • BlockCommandComment
      • ParagraphComment
        • TextComment
        • ...

// Some normal comment. Not included in the JSON AST

/// \brief Does foo
/// Some more details
void foo();

> If you could do PRs, that would be great. I think it would be best if I do the necessary infrastructure changes first, and then we can both work on adding parsers for specific declarations?

Exactly what I was thinking.

@foonathan

foonathan commented May 14, 2021

I have implemented the basics of a cppast::astdump_parser on the feature/json branch. It currently just creates cpp_unexposed_entity. The tool and the tests are currently hacked to use this new parser unconditionally (so everything fails).

Over the next weeks I will add more and more implementations; let me know what you want to tackle or if you have any questions.

@foonathan

foonathan commented Jun 22, 2021

Current status:

  • cpp_alias_template.hpp
  • cpp_array_type.hpp
  • cpp_attribute.hpp
  • cpp_class.hpp
  • cpp_class_template.hpp
  • cpp_decltype_type.hpp
  • cpp_enum.hpp
  • cpp_expression.hpp
  • cpp_file.hpp
  • cpp_friend.hpp
  • cpp_function.hpp
  • cpp_function_template.hpp
  • cpp_function_type.hpp
  • cpp_language_linkage.hpp
  • cpp_member_function.hpp
  • cpp_member_variable.hpp
  • cpp_namespace.hpp
  • cpp_preprocessor.hpp
  • cpp_static_assert.hpp
  • cpp_template_parameter.hpp
  • cpp_type.hpp
  • cpp_type_alias.hpp
  • cpp_variable.hpp
  • cpp_variable_template.hpp

(Checked doesn't mean "works 100%", but "mostly works").
