SimoneAncona/xparser

Xparser

Introduction

Xparser is a C++ parsing library that turns a sequence of characters into a syntax tree, driven by a grammar that you describe in a JSON file.

Installation

You can install Xparser by downloading the source code and adding it to your project.

Getting Started

Define a grammar

In order to use Xparser, you need to define your grammar using a simple JSON file.

A grammar allows Xparser to transform a sequence of characters into a syntax tree.

All JSON Xparser grammar files must have the following structure:

{
    "name": "nameOfYourGrammar",
    "terminals": [
        {
            "name": "nameOfTerminalRule",
            "regex": "ECMAScript regular expression"
        }
    ],
    "rules": [
        {
            "name": "ruleName",
            "expressions": [
                "[b]def<identifier>():"
            ]
        }
    ]
}
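The structure above maps onto a few plain C++ types. The following is a minimal sketch and not part of Xparser: the names Terminal, Rule, Grammar and invalid_terminals are hypothetical, and std::regex (whose default syntax is a variant of ECMAScript) is used only as a stand-in to check that every terminal's regular expression compiles before the grammar is handed to the parser.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical in-memory mirror of the JSON grammar layout (not Xparser types).
struct Terminal {
    std::string name;
    std::string regex;  // ECMAScript regular expression
};

struct Rule {
    std::string name;
    std::vector<std::string> expressions;
};

struct Grammar {
    std::string name;
    std::vector<Terminal> terminals;
    std::vector<Rule> rules;
};

// Returns the names of the terminals whose regex fails to compile with
// std::regex, which uses an ECMAScript-style syntax by default.
std::vector<std::string> invalid_terminals(const Grammar& grammar)
{
    std::vector<std::string> invalid;
    for (const Terminal& terminal : grammar.terminals) {
        try {
            std::regex compiled(terminal.regex);
        } catch (const std::regex_error&) {
            invalid.push_back(terminal.name);
        }
    }
    return invalid;
}
```

An empty result means every terminal compiled. It does not guarantee that Xparser accepts the grammar, since its own checks may differ.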

You can also reference the JSON schema, which lets schema-aware editors catch mistakes in the grammar file:

{
    "$schema": "https://raw.githubusercontent.com/SimoneAncona/xparser/main/schemas/schema.json"
}

For more information click here

Using Grammars

As mentioned above, grammars are used to generate an abstract syntax tree (AST). You can do this in your C++ project:

#include "xparser.hh"
#include <fstream>
#include <sstream>
#include <string>
#include <stdexcept>
#include <iostream>

std::string read_json_file(std::string filename);

int main(int argc, char** argv)
{
    Xpp::Parser parser(read_json_file("myGrammar.json"));       // import the grammar file
    Xpp::AST ast = parser.generate_ast("parse this string");    // parse a string and generate the AST

    std::cout << ast.to_json().to_string() << std::endl;  // see the JSON string representation of the AST
    return 0;
}

std::string read_json_file(std::string filename)
{
    std::ifstream file;
    std::stringstream buff;
    file.open(filename);
    if (file.fail())
        throw std::runtime_error("Cannot open the file: " + filename);
    buff << file.rdbuf();
    return buff.str();
}

Grammars

Terminal Values

A terminal is always a final node (a leaf) in the AST. A terminal value can be, for example, a literal number, a literal string or an identifier. There are 3 types of terminal values:

  • Predefined: terminals that are built-in such as integer or identifier
  • User-defined: terminals that are defined in the terminals property of the grammar JSON file.
  • Constant: terminals that are written directly in rule expressions (see Constant Terminals below).

Every terminal is defined by a name and a regular expression, except for constant terminals.

Predefined Terminal Values

There are 12 built-in terminals:

  • integer: equivalent to [-|+]?\d+.
  • identifier: equivalent to [_a-zA-Z][_a-zA-Z0-9]*.
  • real: equivalent to [+|-]?\d+(\.\d+)?.
  • alpha: equivalent to [a-zA-Z].
  • alnum: equivalent to [a-zA-Z0-9].
  • digit: equivalent to [0-9].
  • hexDigit: equivalent to [0-9a-fA-F].
  • octalDigit: equivalent to [0-7].
  • space: equivalent to [^\S\r\n].
  • newLine: equivalent to \r?\n.
  • any: equivalent to . (a single dot).
  • eof: end of file.
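To get a feel for what some of these patterns accept, they can be tried out with std::regex, which uses an ECMAScript-style syntax by default. This is only a sketch: builtin_patterns and matches_builtin are hypothetical names, the patterns are copied from the list above, and Xparser's own matching may differ in details.

```cpp
#include <map>
#include <regex>
#include <string>

// Some of the built-in terminal patterns, copied from the list above.
const std::map<std::string, std::string> builtin_patterns = {
    {"integer", R"([-|+]?\d+)"},
    {"identifier", R"([_a-zA-Z][_a-zA-Z0-9]*)"},
    {"hexDigit", R"([0-9a-fA-F])"},
    {"octalDigit", R"([0-7])"},
    {"newLine", R"(\r?\n)"},
};

// True when the whole text matches the named pattern.
// Throws std::out_of_range for a name that is not in the table.
bool matches_builtin(const std::string& terminal, const std::string& text)
{
    const std::regex pattern(builtin_patterns.at(terminal));
    return std::regex_match(text, pattern);
}
```

For instance, matches_builtin("identifier", "_tmp1") holds, while matches_builtin("identifier", "1tmp") does not, because an identifier cannot start with a digit.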

User-defined Terminal

User-defined terminals are defined in the JSON grammar file under the terminals property. A terminal is defined by specifying the name and the ECMAScript regular expression.

NOTE: regular expressions are written as JSON strings, so every backslash must be doubled. To represent the expression /[^\S\r\n]/ you must write "[^\\S\\r\\n]".
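If the grammar is embedded in C++ source instead of being read from a file, the C++ compiler adds a second level of escaping. The sketch below (the variable names are hypothetical) shows the same JSON value written twice: a raw string literal keeps the text exactly as it would appear in the JSON file, while an ordinary literal needs every backslash and quote escaped again.

```cpp
#include <string>

// The JSON value for /[^\S\r\n]/ written as a raw string literal:
// the characters between R"( and )" are taken verbatim.
const std::string json_value_raw = R"("[^\\S\\r\\n]")";

// The same JSON value as an ordinary literal: each backslash of the JSON
// text becomes two in C++, and the quotes must be escaped too.
const std::string json_value_plain = "\"[^\\\\S\\\\r\\\\n]\"";
```

Both variables hold the same 14 characters, so either form can be used. The raw literal is usually easier to compare against the grammar file.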

For example, a user-defined terminal can look like this:

{
    "terminals": [
        {
            "name": "binaryNumber",
            "regex": "[01]+"
        }
    ]
}

NOTE: The order in which terminals are placed in the array defines their priority: the topmost terminals are parsed first.
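The effect of this ordering can be illustrated with a small first-match-wins lookup. This is a sketch under the assumption that the first terminal in the array that matches is the one chosen. It is not Xparser's implementation, and TerminalList and first_matching_terminal are hypothetical names.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// (name, regex) pairs in the same order as the "terminals" array.
using TerminalList = std::vector<std::pair<std::string, std::string>>;

// Returns the name of the first terminal whose regex matches at the very
// start of `input`, or an empty string when none does.
std::string first_matching_terminal(const TerminalList& terminals,
                                    const std::string& input)
{
    for (const auto& terminal : terminals) {
        const std::regex pattern(terminal.second);
        // match_continuous anchors the search at the first character.
        if (std::regex_search(input, pattern,
                              std::regex_constants::match_continuous))
            return terminal.first;
    }
    return "";
}
```

With a list that places a binaryNumber terminal ([01]+) before a decimalNumber terminal ([0-9]+), the input 101 is classified as binaryNumber. Swapping the two entries classifies the same input as decimalNumber.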

Rules

A rule defines the syntax of the language and specifies how elements of the language are combined. Rules are defined under the rules property in the JSON grammar.
Each rule has a name and a set of expressions that specify its syntax.

{
    "rules": [
        {
            "name": "variableDeclaration",
            "expressions": [
                "[b]var<identifier><newLine|eof>"
            ]
        }
    ]
}

NOTE: The order in which rules are placed in the array defines a reverse priority: the rules at the bottom are parsed first.

Rule Expression Language

The rule expression language allows you to specify the syntax of a rule. It has 3 kinds of elements:

  • Constant terminals: strings or sequences of characters that must match exactly in order to form a valid expression or sentence.
  • References: references to other rules or terminals, delimited by <>.
  • Flags: always specified at the beginning of the expression and delimited by [].

Constant Terminals

As mentioned above, constant terminals tell the parser to match a character sequence exactly. For example:

"[b]if<space*>(<condition>)"

In this expression, if is a constant terminal and tells the parser to match exactly the string "if".

To use <, [, | and other characters that have a special meaning in the Rule Expression Language inside a constant terminal, escape them with the \ character.

NOTE: in the JSON file, the escape character must itself be escaped and written as \\. Example:
❌ "[s]def \< <identifier> \>"
✔️ "[s]def \\< <identifier> \\>"

References

A reference points to another rule or terminal and tells the parser to match a string that follows the referenced rule or terminal.
A rule can reference itself, provided that its expression array contains at least one expression made only of terminal references or constant terminals.
Using the previous example:

"[b]if<space*>(<condition>)"

<condition> is a reference to a rule called condition.

Quantifiers

A reference can be quantified. There are 5 quantifiers:

  • ?: zero or 1.
  • *: zero or more.
  • +: 1 or more.
  • {x}: exactly x occurrences.
  • {x:y}: from x to y occurrences (inclusive).

Quantifiers are placed at the end of the reference like this:

"4letters:<alpha{4}>"

The example above matches a string that starts with "4letters:" followed by exactly 4 alphabetic characters.
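Readers who know regular expressions may find it helpful to see the quantifiers side by side with their regex counterparts. In the sketch below (hypothetical helper names, not Xparser code), the only notation that changes is {x:y}, which a regex writes as {x,y}. The alpha terminal is expanded by hand to its documented pattern [a-zA-Z].

```cpp
#include <algorithm>
#include <regex>
#include <string>

// Translates a quantifier of the rule expression language into the
// corresponding regex quantifier. Only {x:y} differs: regex writes {x,y}.
std::string to_regex_quantifier(std::string quantifier)
{
    std::replace(quantifier.begin(), quantifier.end(), ':', ',');
    return quantifier;
}

// A regex approximating "4letters:<alpha{4}>", with alpha written out
// as its documented pattern [a-zA-Z].
std::regex four_letters_regex()
{
    return std::regex("4letters:[a-zA-Z]" + to_regex_quantifier("{4}"));
}
```

With std::regex_match, this regex accepts 4letters:abcd and rejects both 4letters:abc and 4letters:abcde.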

Alternate

References can be alternated. Alternatives are separated by the | character, and each alternative represents a different way to match that part of the expression. For example:

"4letters_or_5num:<alpha{4}|digit{5}>"

This example matches every string that starts with "4letters_or_5num:" followed by either 4 alphabetic characters or 5 decimal digits.
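The text between < and > can be taken apart mechanically: split on |, then separate each name from its trailing quantifier. The following is a minimal sketch of that idea. Alternative and parse_reference are hypothetical names, escaped characters are not handled, and Xparser's real parser may work differently.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One alternative of a reference: a rule or terminal name plus its
// quantifier text ("" when the alternative is not quantified).
struct Alternative {
    std::string name;
    std::string quantifier;
};

// Splits the text found between '<' and '>' on '|' and separates each name
// from a trailing quantifier (?, *, + or {...}).
std::vector<Alternative> parse_reference(const std::string& inner)
{
    std::vector<Alternative> alternatives;
    std::size_t start = 0;
    while (start <= inner.size()) {
        std::size_t bar = inner.find('|', start);
        if (bar == std::string::npos)
            bar = inner.size();
        const std::string part = inner.substr(start, bar - start);
        const std::size_t split = part.find_first_of("?*+{");
        if (split == std::string::npos)
            alternatives.push_back({part, ""});
        else
            alternatives.push_back({part.substr(0, split), part.substr(split)});
        start = bar + 1;
    }
    return alternatives;
}
```

Called with "alpha{4}|digit{5}", it yields two alternatives: alpha with quantifier {4} and digit with quantifier {5}.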

Flags

Flags are specified at the beginning of the expression and can change how the expression is evaluated.
There are 4 flags:

  • s for ignore spaces: if this flag is set, spaces between adjacent elements of the expression (constant terminals, terminal references and rule references) are ignored instead of being evaluated as constant terminals.
  • b for boundary: this flag requires at least 1 space between adjacent terminals or rules that have the same expressions or regular expressions.
  • i for case-insensitive: all constant terminals are matched case-insensitively.
  • I for uniform case-insensitive: each constant terminal must be written entirely in lower case or entirely in upper case, not a mix.

NOTE: you cannot specify both i and I flags.
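Separating the flags from the rest of an expression and enforcing the restriction above could look like the following sketch. split_flags is a hypothetical helper, not an Xparser function, and it assumes that an expression starting with an unescaped [ always begins with a flag list.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Splits an expression such as "[sb]hello" into its flags ("sb") and its
// body ("hello"). Throws std::invalid_argument on a malformed flag list.
std::pair<std::string, std::string> split_flags(const std::string& expression)
{
    if (expression.empty() || expression[0] != '[')
        return {"", expression};  // no flag list at all
    const std::size_t close = expression.find(']');
    if (close == std::string::npos)
        throw std::invalid_argument("unterminated flag list");
    const std::string flags = expression.substr(1, close - 1);
    if (flags.find_first_not_of("sbiI") != std::string::npos)
        throw std::invalid_argument("unknown flag");
    if (flags.find('i') != std::string::npos &&
        flags.find('I') != std::string::npos)
        throw std::invalid_argument("flags i and I cannot be combined");
    return {flags, expression.substr(close + 1)};
}
```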

Example:

"[Isb]foreach<space*>(<identifier> in <identifier>)"

That expression can match:

  • FOREACH (el in els).
  • foreach( el IN els).

That expression does not match:

  • Foreach(el in els).
  • foreach(elinels).
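The difference between the i and I flags comes down to how a single constant terminal is compared with the input. The sketch below illustrates that comparison for ASCII text. The function names are hypothetical and the code is an interpretation of the flag descriptions, not Xparser's implementation.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Lower-case copy of an ASCII string.
std::string to_lower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Upper-case copy of an ASCII string.
std::string to_upper(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::toupper(c));
    });
    return s;
}

// Flag i: any mix of upper and lower case is accepted.
bool matches_any_case(const std::string& constant, const std::string& text)
{
    return to_lower(constant) == to_lower(text);
}

// Flag I: the text must be entirely lower case or entirely upper case.
bool matches_uniform_case(const std::string& constant, const std::string& text)
{
    return text == to_lower(constant) || text == to_upper(constant);
}
```

Under this reading, FOREACH and foreach both satisfy the I flag for the constant foreach, while Foreach satisfies only the i flag. The check is applied per constant terminal, which is why foreach( el IN els) is accepted in the example above.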

Spaces in Rule Expression Language

Depending on the flags, spaces in an expression are either evaluated as constant terminals or ignored. Let's see the difference:

"hello world<letter{4}> <number{6}>"
      ┃                ┃
      ┃                ┃
      ┗━━━━━━━━━━━━━━━━┻━━ These spaces are constant terminals.

"[s]hello world<letter{4}> <number{6}>"
         ┃                ┃
         ┃                ┗ This space will be ignored.
         ┗ This space is part of the constant terminal.

"[s]hello <identifier> world"
         ┃            ┃
         ┃            ┃
         ┗━━━━━━━━━━━━┻━━ These spaces will be ignored
         You must add <space> because 'hello' and
         'world' can be seen as identifiers, and the
         's' flag does not guarantee that there are
         spaces between them.

"[sb]hello <identifier> world"
          ┃            ┃
          ┃            ┃
          ┗━━━━━━━━━━━━┻━━ These spaces will be ignored
          However the 'b' flag ensures that there is at 
          least one space between constant terminals and
          references.

"[b]hello <identifier> world"
         ┃            ┃
         ┃            ┃
         ┗━━━━━━━━━━━━┻━━ These spaces will not be ignored
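The first two diagrams can be reproduced with a small splitter that cuts an expression body into constant terminals and references. This is a sketch of one possible reading of the s flag, in which spaces at the edges of a constant terminal are dropped and spaces inside it are kept. trim and split_elements are hypothetical names, the flag list is assumed to have been removed already, and escaped characters such as \< are not handled.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Removes leading and trailing spaces.
std::string trim(const std::string& s)
{
    const std::size_t first = s.find_first_not_of(' ');
    if (first == std::string::npos)
        return "";
    const std::size_t last = s.find_last_not_of(' ');
    return s.substr(first, last - first + 1);
}

// Splits an expression body into constant terminals and references.
// With ignore_spaces set (the s flag), spaces around constant terminals are
// dropped and constants made only of spaces disappear.
std::vector<std::string> split_elements(const std::string& body,
                                        bool ignore_spaces)
{
    std::vector<std::string> elements;
    std::size_t pos = 0;
    while (pos < body.size()) {
        std::size_t end;
        std::string element;
        if (body[pos] == '<') {  // a reference: read up to the closing '>'
            end = body.find('>', pos);
            end = (end == std::string::npos) ? body.size() : end + 1;
            element = body.substr(pos, end - pos);
        } else {                 // a constant: read up to the next '<'
            end = body.find('<', pos);
            if (end == std::string::npos)
                end = body.size();
            element = body.substr(pos, end - pos);
            if (ignore_spaces)
                element = trim(element);
        }
        if (!element.empty())
            elements.push_back(element);
        pos = end;
    }
    return elements;
}
```

Without the s flag, "hello world<letter{4}> <number{6}>" splits into four elements, including the single space between the two references. With the s flag that space disappears, while the space inside hello world stays part of the constant terminal.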