Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partial match #645

Open
Seb35 opened this issue Apr 2, 2020 · 1 comment
Open

Partial match #645

Seb35 opened this issue Apr 2, 2020 · 1 comment

Comments

@Seb35
Copy link

Seb35 commented Apr 2, 2020

Issue type

  • Bug Report: no
  • Feature Request: yes
  • Question: no
  • Not an issue: no

Prerequisites

  • Can you reproduce the issue?: yes
  • Did you search the repository issues?: yes
  • Did you check the forums?: no
  • Did you perform a web search (google, yahoo, etc)?: yes

Description

Context: My use case is a have a PEG.js grammar to recognise small expressions in a large text, and I want to pick them out of the whole text without surrounding text. Specifically I know quite well how the expression starts, so I launch the grammar on these “candidate” expressions, but I don’t know where the expressions end: it ends when the grammar cannot recognise the next character.

To fix the ideas, my PEG.js grammar deals with French laws texts and recognises potential links to other articles: expressions like "article 3 du code civil" or "au I du troisième alinéa de l’article 7 bis A de la loi n° 78-17 du 6 janvier 1978" and these expressions could be anywhere in the texts, but they start always more or less with the word "article" or some other word in a whitelist (details).

Polyfill 1: It is possible and it works to write the entry rule as entryPartial = e:myRealEntryRule .* { return e; } but this is highly inefficient for long texts. This is what I implemented (details) with a speed up to capture no more than the current line (details).

Polyfill 2: It is also possible to launch myGrammar.parse( text ) in a try, then catch the location of the error e.location.start.offset and relaunch myGrammar.parse( text.substr( 0, e.location.start.offset ) ) and accept the result if the second launch succeed.

Proposition: Add an option partialMatch in order not to fail after the entry rule is recognised. This means that, on this line, the second condition peg$currPos === input.length is removed when options.partialMatch is active. A variant would be to accept even if the entry rule fails, but the syntax writer could also add a "?" at the end of the entry rule if s/he wants, so I’m not sure it is a use-case for this issue.

We can choose if a partial match is activated: 1. either as an option during grammar generation (so that the JS code is fixed with or without the feature), or 2. either as an option during grammar execution.

This feature request is linked to #607, but the other one is about debugging grammars, here it is about accepting smaller answers than the entire text.

On another PEG parser

Before porting it to PEG.js I wrote my PEG in Python with Parsimonious, and Parsimonious has two different methods:

  • parse to parse the entire string and return an exception if there are remaining characters, similar to the current behaviour of PEG.js (details)
  • match to parse the string and fail only if the entry rule is not recognised, similar to the proposed behaviour of partial match (details)

Steps to Reproduce

Example code:

const pegjs = require( "pegjs" );
const grammar = pegjs.generate( 'article = "article" " "+ [0-9-]+' );
match = grammar.parse( "article 7 du code civil", { partialMatch: true } )

Expected behavior:

[ 'article', [ ' ' ], [ '2', '2', '5', '-', '1' ] ]

Actual behavior:

SyntaxError: Expected [0-9,\\-] or end of input but " " found.

Performance comparison

When I try the simple grammar above on a long text like this one (warning: 10.51MB) on a candidate link at index 1618 (which is "article 225-1") and it remains 10,642,912 characters after that, I obtain the following performance results on 20 runs for each category.

  • Polyfill 1: mean = 2354.872 ms ; stddev = 87.320 ms
  • Polyfill 2: mean = 0.307 ms ; stddev = 0.125 ms
  • Proposed feature: mean = 0.102 ms ; stddev = 0.019 ms

Software

  • PEG.js: 0.10.0 or master (future 0.11.0)
@Seb35
Copy link
Author

Seb35 commented Apr 2, 2020

The linked PR is what I used for the performance comparison (although based on an older master).

The PR fixes the behaviour at generation-time. Finally I find it would be better to allow changing it at execution-time: if this behaviour is shared by others, the PR could be adapted.

@Seb35 Seb35 mentioned this issue Apr 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant