Add support for incremental parsers #507

Open
kwesibrunee opened this issue Jun 7, 2017 · 33 comments

@kwesibrunee commented Jun 7, 2017

I am building a streaming XML parser, and I had a rough time getting location information due to the nature of the parser.

Here is what I did to make it work.

I added the following two pieces to my initializer:

peg$posDetailsCache = options.peg$posDetailsCache;

function peg$computeLocation(startPos, endPos) {
  var startPosDetails = peg$computePosDetails(startPos),
      endPosDetails   = peg$computePosDetails(endPos);

  return {
    start: {
      offset: options.offset + startPos,
      line:   startPosDetails.line,
      column: startPosDetails.column
    },
    end: {
      offset: options.offset + endPos,
      line:   endPosDetails.line,
      column: endPosDetails.column
    }
  };
}

Then after each parse (which consumes one item only), outside of the PEG.js grammar, I update options.offset with the ending offset and options.peg$posDetailsCache with an object like [{ line: 4, column: 5 }].
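
To make that workflow concrete, here is a minimal sketch of the driver loop, under stated assumptions: parser is a PEG.js parser generated with the modified initializer above, and parseOneItem is a hypothetical helper that parses a single item from the front of the buffer (returning null when no complete item is available yet) and reports its location() and the number of characters it consumed.

var parser = require("./xml-item-parser"); // hypothetical generated parser

var buffer = "";
var options = {
  offset: 0,
  peg$posDetailsCache: [{ line: 1, column: 1 }]
};

function feed(chunk) {
  buffer += chunk;
  var item;
  while ((item = parseOneItem(parser, buffer, options)) !== null) {
    // Carry the absolute end position into the next parse so that
    // location() keeps reporting positions in the original document.
    options.offset = item.location.end.offset;
    options.peg$posDetailsCache = [{
      line:   item.location.end.line,
      column: item.location.end.column
    }];
    buffer = buffer.slice(item.consumedLength); // drop what was parsed
  }
}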

It would be nice if the parser supported passing in the starting offset, line number, and column, to better support incremental parsing.

Related: #281 Allow an option to specify a start offset

@futagoza changed the title from "Modify parser.options to support incremental parsers" to "Add support for incremental parsers" Jun 9, 2017
@futagoza added this to the post-1.0.0 milestone Jun 9, 2017
@futagoza (Member) commented Jun 9, 2017

I'm assuming you want support for parsing from an index other than 0, yes?

@kwesibrunee (Author) commented Jun 10, 2017

No, actually: just support for modifying the location that is reported in the location() object.

The parser I am writing parses a chunk of data at a time (e.g. from Node streams) and parses as much as it can.

So, given the following document:

<root>
<child> text</child>
</root>

and suppose it arrives in two chunks:

1:
<root>
<child> te

2:
xt</child>
</root>

During the first parse it would parse <root> and <child>, and it would throw out "te".

During the next parse it would parse "text", </child>, and </root>.

During the second parse, I want it to reflect the location in the original document, not in the chunk it has been presented with.

@futagoza (Member)

Sorry, just trying to understand what you want here 😄

Basically you want the location data to reflect what's passed via the options rather than the starting index used by the internal parser functions? Wouldn't this return incorrect location data for what was parsed in the first parse (i.e. "te" from "text")?

Since you're parsing chunks of data at a time, wouldn't it be more optimal to just return the range:

function range() {
  return [ peg$savedPos, peg$currPos ];
  /* or */
  // return { start: peg$savedPos, end: peg$currPos };
}
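
As a hedged sketch of how that could be combined with the original post's scheme (options.offset is the caller-maintained offset from above, not a built-in PEG.js option), a grammar action could rebase a chunk-relative range onto the whole document:

// Inside a grammar action. options.offset is assumed to be maintained
// by the caller between parses, as in the original post.
var r = range();        // e.g. [3, 7], relative to the current chunk
var absolute = [ options.offset + r[0], options.offset + r[1] ];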

@kwesibrunee (Author)

The parser never has access to the whole document at once, only to one chunk at a time. Since it doesn't have access to the whole document, the location() function as it stands does not give very relevant information.

In the preceding example, the first chunk would have correct location info, but the second chunk's would be very incorrect: the string "te" would be prepended to the second chunk and recognized as the full string "text". Without changes to the location function, the start of the string "text" would be reported as offset 0, line 1, column 1, instead of the correct values of offset 13, line 2, column 8.

This would also be useful if, say, you only wanted to parse part of a document.

@futagoza (Member)

In the preceding example, the first chunk would have correct location info, but the second chunk's would be very incorrect: the string "te" would be prepended to the second chunk and recognized as the full string "text". Without changes to the location function, the start of the string "text" would be reported as offset 0, line 1, column 1, instead of the correct values of offset 13, line 2, column 8.

Like I asked above: basically you want the location data to reflect what's passed via the options rather than the starting index used by the internal parser functions, which, as you showed above, can be achieved by overwriting peg$posDetailsCache and peg$computeLocation.

Currently, the parsers generated by PEG.js don't keep their state between parsing sessions, so there is no way to do what you want automatically; manually overwriting peg$posDetailsCache and peg$computeLocation is the only way.

This would also be useful if say you only wanted to parse part of a document.

This can be a different issue depending on the source input given to the generated parser:

  1. If using streams, the current method you are employing is best.
  2. If parsing a whole document, then manually setting the values for peg$savedPos and peg$currPos might do the trick. To be honest, I've never tried this, though I have been pondering whether to implement it as a feature (see the sketch after this list).
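
A rough illustration of option 2 (untested, as noted; peg$savedPos and peg$currPos are internals of the generated parser, and the startIndex option is purely hypothetical):

// Hypothetical grammar initializer: start parsing partway into the input.
// This pokes at parser internals and is untested; treat it as a sketch.
{
  if (options.startIndex !== undefined) {
    peg$currPos  = options.startIndex;
    peg$savedPos = options.startIndex;
  }
}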

@futagoza modified the milestones: post-1.0.0, 1.0.0 Mar 16, 2018
@futagoza (Member)

I have decided to implement this before v1, but I will need to think of a concrete direction to take based on the OP's request and my post above.

@futagoza added the task label Mar 16, 2018
@StoneCypher

This basically isn't possible. Unfortunately, the current maintainer is researching impossible promises instead of fixing easy things users actually need.

The way a packrat parser works is the parsing equivalent of A*: it caches its old matching attempts. The reason it's able to match so liberally is that it has all the forward stems, which prior to caching seemed unrealistic performance-wise. That's also why it's so fast.

The problem is, in order to do incremental parsing, you'd either need a copy of the old cache, which is typically thousands to hundreds of thousands of times the size of the document, or you'd need to keep the forward piece of the old document and just re-parse it.

This issue should be closed as not possible. This isn't how packrats work. (Also, they're fast enough that I can't think of a reason this would be necessary or desirable. I can usually parse ~10 MB documents in under a single screen frame.)

Most likely it's faster to re-parse from scratch than it would be to load the old stems from disk.

Just track your location in the first parse and re-seek there in the second, with a unique XPath. XML has no deletion; it's guaranteed that the stream position you were at is still there.

@StoneCypher

There is something called an incremental packrat parser, and there are even openly available JavaScript implementations by the Ohm people.

But if you get down to the nitty-gritty and look, you'll realize that they're about as related as a hashmap and a vector (because one's a sparse array and one's a dense array, and the only things they really share are words in the title).

@Mingun (Contributor) commented Feb 3, 2020

This issue should be closed as not possible.

I don't think so. Yes, there are grammars for which incremental parsing will give nothing, but there are also those for which it can be done. The problem is to separate these two classes and report the result at the compile stage, so as not to generate non-working code.

@StoneCypher

You know this software better than I do, and I could be wrong; if you're certain, I'll shut up. But:

Do you really believe that peg.js can be practically made to do this without a serious overhaul?

@StoneCypher

Like, the Ohm people's approach requires a kind of lazy memoization that isn't in peg at all as far as I know, and I don't see how to stuff it in (maybe I'm missing the boat; that happens sometimes).

@Mingun (Contributor) commented Feb 3, 2020

Do you really believe that peg.js can be practically made to do this without a serious overhaul?

I had some ideas in the past, but as the project is dead I never implemented them... It's just a challenge, and at this point I don't need that functionality; doing something nobody needs is boring.

@StoneCypher

This project isn't dead. It has 200,000 downloads a week.

It just needs a maintainer who cares. I'd love to resurrect this and start fixing the problems. I'll just go through the existing worklog, cherry-pick, add tests, and release.

I'm fixing my peg grammars with bash scripts.

There's no reason this project should die. It's the best JS parser out there.

@Mingun (Contributor) commented Feb 3, 2020

I understand. Maybe the best option is to make a fork and develop it according to your requirements.

@StoneCypher

I did that three years ago.

No, I'm not going to fork a library in order to take it away from its current maintainers, and bring it back to life. That's not what forks are, or how they work.

My explicit goal is to remove the team that spent three years making 0.11.0, then threw it away and said "we're going to write a new library from scratch, and you can't contribute to or see the new one, and this is now my personal hobby project."

As I understand it, that team is ryuu.

This is an extremely important library that is heavily used by many packages system-wide. It's time for someone to bring it back to a healthy status.

Forking it will not achieve that goal.

My goal is to fork 0.10.0 and start back-porting 0.11.0 merges into 0.12.0, but increasing testing as I go, and actually releasing the library

There are six different issues in this repo that say "I've given up on peg because it hasn't released in years," or "because it hasn't released since David left."

Nobody cares about my fork. If I forked it, nothing would be saved.

I'm going to save the real thing now, if David lets me.

I'm sorry that you want me to let the real thing die, but I'm not willing to do that. I also don't know why you'd want that. Your PRs are dying with 0.11.0. I would retain them.

At any rate, I wasn't asking for advice. I already told David what I want, and I still want it even if you want me to pull a Ryuu, and start over in a different repo that nobody can see or will ever care about, and continue leaving the users stranded.

My goal isn't to save myself. My goal is to save the library.

It's time for the current developers to let someone with library experience step in and create a healthy develop and release process, and save the library.

Your opinion of my desire is noted.

I still want David to let me undo the three years of damage that have been done since he left, by a group who accepted the responsibility of maintenance, then never actually did it.

The current maintainer has literally never maintained the library a single time, and every single promise he made to David back in 2017 to acquire this repo is faulted.

It was time to let go in 2018

I want David to let me fix this. Peg is the best parser around

Its own developers (you) are saying "I gave up on this project, it's dead"

Why you also want me to not save it and work on my own private fork, when all the developers have given up on the project as a dead project, is beyond me

I'm sorry you've given up on peg.js

I have not.

@Mingun (Contributor) commented Feb 3, 2020

I would retain them.

I just left them in my fork. They haven't died completely; you can still resurrect them. I'm just not interested in doing that.

Nobody cares about my fork. If I forked it, nothing would be saved.

For five years I have been adapting the implementation of ranges to all the changing formatting requirements, and I added three new features as the initial PR evolved (delimiters, variable boundaries, function boundaries), but it was never even seen by anyone in the community. This suggests that there is actually no community. In fact, it was said to my face that I am not the community and my opinion doesn't matter, even though there was no one else.

So I think "the community" will get along quite well without your sacrifice :(

My goal isn't to save myself. My goal is to save the library.

Only a fork, IMHO. Sad, but true. Maybe the example of io.js will teach something.

work on my own private fork

Why private? Make a public fork, do good things, and the community (if it exists) will find you.

@StoneCypher

Why private? Make a public fork

I'm here to save the library, not make a new one.

Please stop asking me for this now. I've been clear that the answer is no.

.

For five years I have been adapting the implementation of ranges

I know. If I'm made maintainer, ranges will go in using your code in 2020.

These problems come from the lack of a maintainer. Forking won't fix them; it'll make them worse.

.

This suggests that there is actually no community.

There used to be. There can be again.

I started filing tickets less than 24 hours ago and I've already had conversations with nine people.

Quit being fatalistic. This is easily fixable. All that needs to happen is for the keys to be put in the hands of an experienced maintainer who has a focus on releases.

.

So I think "the community" will get along quite well without your sacrifice :(

I'm not making a sacrifice here.

.

it was never even seen by anyone in the community

I saw that for the first time last night, when I started reading through the issue tracker. (I'm going to read the whole thing, open and closed.)

I also saw #209.

I am sympathetic to both positions here.

On the one hand, this is a feature that really ought to go in, that a lot of people want, that other libraries support, and that the underlying tools are ready to support.

On the other hand, the conversation wasn't actually finished.

If I was allowed to be maintainer, here's what I would do:

  1. I would make a new ticket and tag everyone I thought had been involved in the past. I would also ask them to tag anyone they knew I missed.
  2. I would say "hey, can we talk this over and make a decision? I generally like Mingun's code, and I want to find out whether we can either merge it, or slightly modify it to make it mergeable."
  3. If the discussion wasn't brief and succinct, say three or four days, I would say "okay, this library is trying to angle towards new progress, so I'd like to set a one week timer from today," and re-tag everyone
  4. I would also have pretty high standards for testing
  5. Then, in the vernacular, it's time to merge. This is a good feature and, with more testing, it's also a good implementation.
    1. I would want the community to be able to discuss which characters are best for the sigil, but five years is nonsense, and it's time to pull the trigger

.

My goal

IMHO.

Hi, you've pushed that opinion four times now, and I've said no every time.

Stop telling me your opinion of my goal. My goal isn't changing.

.

Why private?

Because I did it three years ago, when this was still a healthy library, and the changes I was making were things @dmajda had said no to.

So I figured I'd fix it locally, and when the community fixed it properly, I'd convert my grammar.

Three years later, the developers have given up on this project as dead, but are also trying their very hardest to get me to not fix that even after I've clearly told them to stop, which is bewildering and inappropriate.

If you make any more comments about my needing to make a fork, I will simply cease to respond. I have no interest in your hope of keeping peg.js dead.

The answer is a hard no. If you want peg 0.10.0, pin your package. peg 0.12.0 needs to happen, and I don't really understand or care why you want to get in the way of that.

Everything you merged into 0.11.0 is lost right now. Ryuu said it's all being thrown away. If you think you're keeping your merged work in, understand that my maintenance will accomplish that, and Ryuu has explicitly stated that he's starting over from a custom codebase from scratch.

Ryuu has also started three other PEG compilers since he took over maintenance.

This library only died because

  1. The new maintainer won't maintain, and
  2. The new developers aren't angry

Hi, I'm one of the old developers, and I'm angry

If you're not willing to help save this library, at least stop telling me not to.

This library needs to be maintained. It's a critical piece of infrastructure. A fork isn't maintenance. A fork doesn't solve any of the problems.

.

Maybe an example of io.js will teach something.

io.js failed, despite buy-in from the creator of the original language and all five of its largest contributors. When they discarded it, the only reason they pretended it wasn't a total loss was to be able to re-hire two of those five people.

Yes, io.js does teach me something. Namely, "forks don't work, and nobody's problems got fixed until node was updated later."

@Mingun (Contributor) commented Feb 3, 2020

I'm not trying to dissuade you. If you do it, I will only be happy, and may even complete some of my experiments and bring them to a mergeable state. It's just that, looking at the past, somehow I can't believe it.

The criticality is strongly exaggerated. If it were as you say, someone would have already made a fork and developed it. Or this library would have been developed rather than frozen for five years.

@StoneCypher

If you do it, I will only be happy, and may even complete some of my experiments and bring them to a mergeable state.

That would be fantastic. Many of your PRs seem really powerful to me, and with some added testing, it seems to me like they should be merged.

.

The criticality is strongly exaggerated. If it were as you say, someone would have already made a fork and developed it.

[screenshot: Screen Shot 2020-02-03 at 10 28 38 AM]

The current maintainer alone has made three or four distinct forks (pegx, epeg, meg; arguably pegjs-dev, since he did that on his own before he was a maintainer) as well as an openpeg fork. At this point I would argue that 0.11.0 is also a fork, since he's declared it's never being merged.

In my last 48 hours, I've talked to about ten people in here. Half of them have forks.

polkovnikov-ph has an entire meta-peg that puts pegs into other pegs.

I have two forks myself (one of which doesn't show in the fork list): the contrarian one I've maintained for years, and the new one I made the other day, before I learned 0.11.0 was dead, in an effort to implement fixes for densification #638, ES6 modules #627, and TypeScript modules #597.

Two of my friends who have not, as far as I can tell, ever been in this issue tracker have forks.

That people have kept working on 0.11.0 for three years even though nobody's releasing tells me everything I need to know about how critical this is.

.

Or this library developed rather than frozen in development for 5 years.

This library was in fact actively developed until May 2017, when the maintainer change happened.

That day was the day the library froze.

That's why I believe a new maintainer change can un-freeze it.

@Mingun (Contributor) commented Feb 3, 2020

[screenshot: Screen Shot 2020-02-03 at 10 28 38 AM]

[screenshot: forks] This is a more relevant image.

This library was in fact actively developed until May 2017

I would say that until then there had been code shuffling, but no new features were implemented. Maybe a couple of bugs were fixed, and even those had to be fought for. Actual development froze long before @dmajda left.

@StoneCypher

You know what I see there?

I see a library that's been unmaintained for three years and still has six active forks over the last couple months.

Anyway, look, paint the line wherever you want in the sand. Maybe there's a better place to draw it; I haven't gone on an archaeological dig.

The point is, it's still a line in the sand. Let's just use our feet to fuck the sand up now.

0.12.0 is entirely possible if we can get people to say "hey, let's let it happen."

This is the best parser out there after half a decade of inactivity.

Do you realize how great it would become if it became an accessible open-source project?

Let's make a clean 0.12.0, then start cherry-picking the pieces of 0.11.0 that are trustworthy. After that, we can clean up and start fixing the remainder.

Lots of these are super difficult, sure.

But lots of them aren't. I get ES6 module support from a one-line bash script. That's silly and easily fixed.

@StoneCypher

Forward movement doesn't have to be perfect; it just has to be non-trivial.

Help me get David on board? Please.

@StoneCypher

Like seriously.

Imagine what would happen if 0.12.0 came out, and other than fixing the README and some other trivial shit, it was just 0.10.0.

And then what if the next week, ES module support came out

And what if the following week, the webpage got a major update

And what if the following week, we released simple typescript support

Those are all things I've already done. Those are all like two-hour jobs. That's a relaxed and easy cadence.

But it's so different from what users are used to that it would make clear that something has changed.

I am quite confident that this library has the user community and defect base to get to meaningful sub-monthly releases, and I am quite confident that I could make that happen.

@STRd6 commented Feb 3, 2020

I've got my own fork as well: a single-file, zero-dependency implementation based on the original academic paper. I had originally tried using the 0.10.0 codebase, but even that was too bloated for my tastes.

My feeling is that anyone for whom this is mission-critical has already forked, because dealing with "open source" drama is more draining than spending a couple of weekends learning and implementing a solution to one's specific needs. Seriously, the paper is an easy read, and in under 100 hours I was able to get a parser that was 4x faster and 100x smaller.

There's always the fundamental tradeoff between wanting to use an existing solution and having a perfect match for your own problem. Parser communities are in a difficult valley where, once someone has learned enough to use a parsing tool effectively, it's just a small amount of additional work to write their own entire parser.

The biggest opportunity cost of open source is spiritual.

Namaste 🙏

P.S. Don't get me wrong, I'm really enjoying these threads; it's great entertainment! I just don't expect much productive value from GH or JS communities these days.

@StoneCypher

It's an understandable viewpoint.

Given the current status of things, I'm hoping I can get people to say "well, let an attempt happen and let's see"

@michael-brade

@StoneCypher I like the energy you seem to want to put into this lib, and I agree with your general ideas (and I already would have after the first time you wrote them; no need to say the same thing five times 😁). I also don't like a fork very much, but without one we are dependent on exactly one person: futagoza. What if he doesn't respond? Then we are stuck, and there is nothing we can do about it. dmajda already said he has no rights anymore, so there is nothing he can do either.

Here is the thing I don't agree with: "in a different repo that nobody can see or will ever care about". That has a simple solution: open a new issue in all caps saying something like "PROJECT DEAD - CONTINUED HERE" and then write all the necessary information there. I would be the first person to switch my project over, and I am sure many others would. And this would not be the first project this has happened to; I have already followed a few other projects down that route. No big deal, just switch. As a user, I don't really care where I get my code from as long as it's a place people care about.

So I would give @futagoza a couple more weeks to see if he responds and what his opinion is. If there is no sign of life and no response, go ahead and recreate this project properly.

@michael-brade

Actually, I just saw that @michel-kraemer is also a member of pegjs, so maybe there's even more hope 😀

@StoneCypher

@michael-brade - Forks don't work. I'm going to be very direct: I do not want you to tell me to do something different again.

There are already four forks attempting to save this repo. You don't know what they are. You aren't looking for them when saying "fix it that way."

It doesn't matter. A fork won't fix anything. Existing downstream consumers do not receive bugfixes in a fork. PRs are lost in a fork. Issues are lost in a fork.

The answer is a hard no.

@futagoza must stop blocking peg immediately.

@StoneCypher

This library will not die because someone promised to be a maintainer, refused to maintain, and decided to keep it for themselves.

Morally, @futagoza has no right to promise to be a maintainer, then destroy the library.

@michael-brade

There are already four forks attempting to save this repo.

How do you know that they are? Do you have any links? The only kind of active fork I see is meisl/pegjs; none of the others is active at all. And even the meisl fork doesn't look like it's an attempt to save this repo.

You aren't looking for them when saying "fix it that way."

Maybe you didn't read my suggestion: if those forks did intend to save this repo, then they did it wrong because they did not create an obvious issue telling the community where to go. I agree with everything you say until this marker has been created. Then the rules change.

@futagoza must stop blocking peg immediately.

Right. And how do you intend to enforce that?

@StoneCypher

How do you know that they are?

Because I went looking for bugfixes, years ago.

.

Do you have any links?

Yes. However, if forks worked, I wouldn't have to give them; you'd have them too. This is my explanation of why I will not consider a fork, or entertain any further wasted time mollycoddling the fake maintainer.

.

if those forks did intend to save this repo, then they did it wrong because they did not create an obvious issue telling the community where to go.

Actually, they did. It was deleted.

In one sense, you are right: they did it wrong, by creating a fork.

Most people get libraries from npm, where this library doesn't even have a readme. They'll never see the issue.

I am being radically clear when I say "I do not want to have any further discussions about forks."

Further attempts to discuss this will be ignored.

.

I agree with everything you say until this marker has been created.

It has been, several times.

@michael-brade

if those forks did intend to save this repo, then they did it wrong because they did not create an obvious issue telling the community where to go.

Actually, they did. It was deleted.

Ok, I have no way to check this. But assuming you are right, then it should be obvious to you that the person who deleted this issue has no intention of giving up power over this repo. Under those circumstances, how do you think you can change anything without a fork?

All I can say is I wish you luck. And that I will follow to wherever active development takes place.

@StoneCypher

how do you think you can change anything without a fork?

Because eventually futagoza will come back, and he'll have to make this decision again.

Those were in 2018. It's 2020. He's started at least four entirely distinct PEG parsers since then, one in a different base language, as well as two programming languages.

Back then, he hadn't thrown away three years of work as not trustworthy. Now he has.

Back then, he was only six months into it. That's still a "well give me a chance" timeline, barely. Now it's three years.

Back then, he could still say he was committed to this, and that this was his focus.

Things have changed.
