NoMethodError raised when parsing invalid wikitext #81

Open
robfors opened this issue Feb 15, 2020 · 5 comments

robfors commented Feb 15, 2020

When I try to parse the English Wikipedia pages for 2012 BDO World Darts Championship and Progress Wrestling, Infoboxer currently raises NoMethodError.

This can currently be reproduced with

Infoboxer.wikipedia.get('2012 BDO World Darts Championship') #=> NoMethodError: undefined method `named?' for #<Heading(level: 3): Women>
Infoboxer.wikipedia.get('Progress Wrestling') #=> NoMethodError: undefined method `named?' for #<Heading(level: 3): Progress Proteus Championship>


I took the 2012 BDO World Darts Championship wikitext and narrowed the problem to

{{
|a
<!---->
}}

<!---->

{|
|-
{{}}

=B=

which will raise NoMethodError: undefined method `named?' for #<Paragraph: =B=>.
I cannot reduce it any further: removing anything else makes the error disappear.

I am using Infoboxer 7c81182.

zverok closed this as completed in db52d58 Feb 15, 2020
Contributor

zverok commented Feb 15, 2020

Thanks for reporting!

Surprisingly enough, this example uncovered a whole clusterbug of several inconsistencies:

  • how an empty comment <!----> is processed
  • how a template inside a table is processed
  • how a heading after a non-closed table is processed

All of them should be fixed in the current master.

Unfortunately, the Progress Wrestling page still fails to parse, though at least the error is now clearly diagnosed: it has an unclosed template, which looks like this:

|{{sort|{{age in days nts|month1=09|day1=15|year1=2019}}+
|-

The {{sort| template is opened here but never closed. I am not sure how to handle that -- MediaWiki itself seems to implicitly close the template at the next line of the table, but I am not sure what a robust solution would be. One may suggest just fixing the Wikipedia page source, which is easy and would be a good thing anyway :)
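
For anyone who wants to find such spots before parsing, here is a minimal sketch of a pre-check. It is not part of Infoboxer; `unclosed_templates_at_rows` is a hypothetical helper, and the rule it applies (a `|-` row separator reached while a `{{` is still open implies a forgotten `}}`) is only an assumption based on the behaviour described above, not MediaWiki's actual rules. The brace counting is deliberately naive.

```ruby
# Hypothetical helper, not part of Infoboxer: returns the line numbers of
# table-row separators ("|-") that are reached while a "{{" is still open.
def unclosed_templates_at_rows(wikitext)
  depth = 0
  suspects = []
  wikitext.each_line.with_index(1) do |line, number|
    # A row separator seen while depth > 0 suggests a forgotten "}}".
    if line.strip.start_with?('|-') && depth.positive?
      suspects << number
      depth = 0 # assumption: treat the row separator as an implicit close
    end
    # Naive counting: ignores <nowiki>, comments and "{{{param}}}" syntax.
    depth += line.scan('{{').size - line.scan('}}').size
    depth = 0 if depth.negative? # a stray "}}" is simply ignored
  end
  suspects
end

source = <<~WIKI
  |{{sort|{{age in days nts|month1=09|day1=15|year1=2019}}+
  |-
WIKI

p unclosed_templates_at_rows(source) # prints [2]
```

A check like this only flags candidates for a manual fix of the page source; it does not decide how the parser itself should recover.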

zverok reopened this Feb 15, 2020
Author

robfors commented Feb 15, 2020

Thanks for the fix again! This is a really great project.

I have been building a system that parses every English Wikipedia page in order to discover links and find paths between pages.

I think keeping a strict parser is a good idea. I am mostly just concerned with unexpected behavior such as hanging and errors other than ParsingError.
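
To make the concern concrete, this is roughly the kind of guard a crawler could wrap around each page. It is only a sketch under assumptions: `guarded_parse` is a hypothetical helper, the `ParsingError` class defined here is a local stand-in for the library's own error class, and the block stands in for the real parsing call. `Timeout` is from Ruby's standard library.

```ruby
require 'timeout'

# Local stand-in (assumption) for the parser's own "bad markup" error class.
class ParsingError < StandardError; end

# Runs the block for one page and classifies the outcome, so that a single
# bad page cannot stop a whole crawl.
def guarded_parse(title, limit: 10)
  result = Timeout.timeout(limit) { yield title }
  [:ok, result]
rescue ParsingError => e
  [:malformed, e.message] # expected: the markup itself is broken
rescue Timeout::Error
  [:timeout, nil] # the parser appears to hang on this page
rescue StandardError => e
  [:bug, "#{e.class}: #{e.message}"] # e.g. NoMethodError, worth reporting
end

p guarded_parse('Some page') { |title| "parsed #{title}" }
p guarded_parse('Broken page') { |_title| raise ParsingError, 'unclosed template' }
p guarded_parse('Buggy page') { |_title| nil.named? }.first
```

Only the `:bug` and `:timeout` outcomes would need a report here; `:malformed` is the case a strict parser is expected to produce.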

Contributor

zverok commented Feb 16, 2020

> I have been building a system that parses every English Wikipedia page in order to discover links and find paths between pages.

Oh, that's really interesting! Is it (or will it be) a public project? I am so happy that somebody is really using the library. I put a lot of work into it, and TBH I am somewhat proud of the outcome, but at some point it became evident that the Ruby community is not really into this kind of stuff :)

> I think keeping a strict parser is a good idea. I am mostly just concerned with unexpected behavior such as hanging and errors other than ParsingError.

Unfortunately, we can't make the parser strict: otherwise, it would break on a third of the real pages in Wikipedia. The MediaWiki software allows (and successfully displays) all kinds of incomplete and inconsistent markup, and my goal is "if Wikipedia allowed it, it should be parseable". If not for this fact, the whole of Infoboxer could've been just a formal wikitext grammar definition (a lot of other MediaWiki parsers do exactly this, and they quickly become unusable with "your markup is wrong, dude").

So, ideally, even this case should probably be parsed properly; it just requires some time and experimenting to find out "what exactly are the rules for implicitly closing a template with a forgotten }}". But if it is a really rare case, sometimes it is easier to just fix the markup :)

Author

robfors commented Feb 24, 2020

> Is it (or will it be) a public project?

Yes, I am in the process of building a website that displays the shortest path between any two specified pages. I'll post an update when I have it working.
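
The path-finding part is essentially a breadth-first search over the links extracted from each page. A minimal sketch of the idea follows, assuming the links are already available as a hash from page title to linked titles; `shortest_path` and the toy graph are purely illustrative, not the real site code or real Wikipedia data.

```ruby
# Breadth-first search over a hash of page title => array of linked titles.
# Returns the list of titles on a shortest path, or nil if none exists.
def shortest_path(links, from, to)
  return [from] if from == to

  previous = { from => nil } # how each visited page was first reached
  queue = [from]
  until queue.empty?
    page = queue.shift
    links.fetch(page, []).each do |neighbor|
      next if previous.key?(neighbor) # already visited

      previous[neighbor] = page
      if neighbor == to
        # Walk back through `previous` to rebuild the path.
        path = [to]
        path.unshift(previous[path.first]) while previous[path.first]
        return path
      end
      queue << neighbor
    end
  end
  nil
end

# Toy graph with made-up page names.
links = { 'A' => %w[B C], 'B' => %w[D], 'C' => %w[D], 'D' => [] }
p shortest_path(links, 'A', 'D') # prints ["A", "B", "D"]
```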

> it would break on a third of the real pages in Wikipedia.

Wow, I had no idea so many pages were malformed. I was considering setting up a bot to look for article updates that were malformed and then fix them manually or send an automated message to the poster. However, it seems that attempting to fix so many posts retroactively would not be a very good solution. Ideally, I think the problem should be solved in the MediaWiki software itself. Maybe it could warn the poster about malformed wikitext and then use its non-strict parser to repair the wikitext every time they preview their changes. Have you heard of any plans for them to do something like that?

Contributor

zverok commented Feb 24, 2020

> Yes, I am in the process of building a website that displays the shortest path between any two specified pages. I'll post an update when I have it working.

Cool! Looking forward to it :) And happy this little library has some real-world use.

About "malformed" wikitext: I believe it is not a (conceptual) bug, but a (conceptual) feature. It is kinda the same as it was with HTML back in the 90s: if only perfectly well-formed HTML had worked, it wouldn't have gained such wide adoption back then, before CMSs, IDEs, and validators. The "magic" was that anything you could scribble in Notepad with some <tag>s worked somehow, and it felt empowering and cool.

The same goes for wikitext: the initial idea was "it is just text", then some formatting was added (but it still didn't have to be "well-formed" in any way: if you missed a ] in your link, well, you'd just see foo [[bar] instead of text with a link, and you'd understand and fix it -- or a page moderator would fix it later). This allows the widest adoption of Wikipedia editing by all kinds of non-tech people, to whom you can't even properly explain the idea of "well-formed" markup: they could be very knowledgeable in the topic they are writing about, and then try to make links and tables just by copy-pasting examples... and save whatever they see as "remotely normally readable".

Eventually, bots & moderators & other users fix every bit of malformed syntax, but "eventually" is a very important word here: at any point in a page's lifecycle it can contain non-critical markup problems and still be useful; an 80-year-old jazz critic or a 14-year-old schoolgirl on her first "editing marathon" probably wouldn't be educated by a "you did it wrong" notification, but rather repulsed...

That's why Wikipedia parsing is so hard: there is never a formal definition of "what's right", just "let's try our best to do what MediaWiki does" :)
(Another reason, though, is the design of the "templates" feature; I could say a lot about it, and they would not be kind words...)
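
To illustrate the foo [[bar] example from above, here is a toy contrast between a lenient and a strict renderer. This is a simplified sketch, not how MediaWiki or Infoboxer actually work; `LINK`, `render_lenient` and `render_strict` are made-up names, and the `<link:...>` marker is just a placeholder for real link output.

```ruby
# Matches only complete [[target]] links (no pipes or nested brackets).
LINK = /\[\[([^\[\]|]+)\]\]/

# Lenient: complete links become markers, anything else stays plain text.
def render_lenient(text)
  text.gsub(LINK) { "<link:#{Regexp.last_match(1)}>" }
end

# Strict: refuse the whole text if any bracket is left over after
# removing the complete links.
def render_strict(text)
  leftover = text.gsub(LINK, '')
  raise ArgumentError, 'malformed link markup' if leftover.match?(/[\[\]]/)

  render_lenient(text)
end

puts render_lenient('foo [[bar]] and [[baz]') # prints: foo <link:bar> and [[baz]
```

With the lenient version the author sees the stray [[baz] in the output and can fix it; `render_strict` would reject the same input outright.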
