You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Did you perform a web search (google, yahoo, etc)?:yes
Description
Context: My use case is a have a PEG.js grammar to recognise small expressions in a large text, and I want to pick them out of the whole text without surrounding text. Specifically I know quite well how the expression starts, so I launch the grammar on these “candidate” expressions, but I don’t know where the expressions end: it ends when the grammar cannot recognise the next character.
To fix the ideas, my PEG.js grammar deals with French laws texts and recognises potential links to other articles: expressions like "article 3 du code civil" or "au I du troisième alinéa de l’article 7 bis A de la loi n° 78-17 du 6 janvier 1978" and these expressions could be anywhere in the texts, but they start always more or less with the word "article" or some other word in a whitelist (details).
Polyfill 1: It is possible and it works to write the entry rule as entryPartial = e:myRealEntryRule .* { return e; } but this is highly inefficient for long texts. This is what I implemented (details) with a speed up to capture no more than the current line (details).
Polyfill 2: It is also possible to launch myGrammar.parse( text ) in a try, then catch the location of the error e.location.start.offset and relaunch myGrammar.parse( text.substr( 0, e.location.start.offset ) ) and accept the result if the second launch succeed.
Proposition: Add an option partialMatch in order not to fail after the entry rule is recognised. This means that, on this line, the second condition peg$currPos === input.length is removed when options.partialMatch is active. A variant would be to accept even if the entry rule fails, but the syntax writer could also add a "?" at the end of the entry rule if s/he wants, so I’m not sure it is a use-case for this issue.
We can choose if a partial match is activated: 1. either as an option during grammar generation (so that the JS code is fixed with or without the feature), or 2. either as an option during grammar execution.
This feature request is linked to #607, but the other one is about debugging grammars, here it is about accepting smaller answers than the entire text.
On another PEG parser
Before porting it to PEG.js I wrote my PEG in Python with Parsimonious, and Parsimonious has two different methods:
parse to parse the entire string and return an exception if there are remaining characters, similar to the current behaviour of PEG.js (details)
match to parse the string and fail only if the entry rule is not recognised, similar to the proposed behaviour of partial match (details)
SyntaxError: Expected [0-9,\\-] or end of input but " " found.
Performance comparison
When I try the simple grammar above on a long text like this one (warning: 10.51MB) on a candidate link at index 1618 (which is "article 225-1") and it remains 10,642,912 characters after that, I obtain the following performance results on 20 runs for each category.
Polyfill 1: mean = 2354.872 ms ; stddev = 87.320 ms
Polyfill 2: mean = 0.307 ms ; stddev = 0.125 ms
Proposed feature: mean = 0.102 ms ; stddev = 0.019 ms
Software
PEG.js: 0.10.0 or master (future 0.11.0)
The text was updated successfully, but these errors were encountered:
The linked PR is what I used for the performance comparison (although based on an older master).
The PR fixes the behaviour at generation-time. Finally I find it would be better to allow changing it at execution-time: if this behaviour is shared by others, the PR could be adapted.
Issue type
Prerequisites
Description
Context: My use case is a have a PEG.js grammar to recognise small expressions in a large text, and I want to pick them out of the whole text without surrounding text. Specifically I know quite well how the expression starts, so I launch the grammar on these “candidate” expressions, but I don’t know where the expressions end: it ends when the grammar cannot recognise the next character.
To fix the ideas, my PEG.js grammar deals with French laws texts and recognises potential links to other articles: expressions like "article 3 du code civil" or "au I du troisième alinéa de l’article 7 bis A de la loi n° 78-17 du 6 janvier 1978" and these expressions could be anywhere in the texts, but they start always more or less with the word "article" or some other word in a whitelist (details).
Polyfill 1: It is possible and it works to write the entry rule as
entryPartial = e:myRealEntryRule .* { return e; }
but this is highly inefficient for long texts. This is what I implemented (details) with a speed up to capture no more than the current line (details).Polyfill 2: It is also possible to launch
myGrammar.parse( text )
in atry
, then catch the location of the errore.location.start.offset
and relaunchmyGrammar.parse( text.substr( 0, e.location.start.offset ) )
and accept the result if the second launch succeed.Proposition: Add an option
partialMatch
in order not to fail after the entry rule is recognised. This means that, on this line, the second conditionpeg$currPos === input.length
is removed whenoptions.partialMatch
is active. A variant would be to accept even if the entry rule fails, but the syntax writer could also add a "?" at the end of the entry rule if s/he wants, so I’m not sure it is a use-case for this issue.We can choose if a partial match is activated: 1. either as an option during grammar generation (so that the JS code is fixed with or without the feature), or 2. either as an option during grammar execution.
This feature request is linked to #607, but the other one is about debugging grammars, here it is about accepting smaller answers than the entire text.
On another PEG parser
Before porting it to PEG.js I wrote my PEG in Python with Parsimonious, and Parsimonious has two different methods:
parse
to parse the entire string and return an exception if there are remaining characters, similar to the current behaviour of PEG.js (details)match
to parse the string and fail only if the entry rule is not recognised, similar to the proposed behaviour of partial match (details)Steps to Reproduce
Example code:
Expected behavior:
Actual behavior:
SyntaxError:
Expected [0-9,\\-] or end of input but " " found.
Performance comparison
When I try the simple grammar above on a long text like this one (warning: 10.51MB) on a candidate link at index 1618 (which is "article 225-1") and it remains 10,642,912 characters after that, I obtain the following performance results on 20 runs for each category.
Software
The text was updated successfully, but these errors were encountered: