
duplicate matches in one sentence #300

Open
Istador opened this issue Aug 22, 2019 · 8 comments

Comments

@Istador
Contributor

Istador commented Aug 22, 2019

Example 1:

wtf('[[Page 1]], [[Page 1]]').sentences(0).html()

Result:

<span class="sentence"><a class="link" href="./Page_1"><a class="link" href="./Page_1">Page 1</a></a>, Page 1</span>

Expected Result:

<span class="sentence"><a class="link" href="./Page_1">Page 1</a>, <a class="link" href="./Page_1">Page 1</a></span>

Example 2:

wtf('Just click on here: [[Page 1|here]] or [[Page 2|here]]').sentences(0).html()

Result:

<span class="sentence">Just click on <a class="link" href="./Page_1"><a class="link" href="./Page_2">here</a></a>: here or here</span>

Expected Result:

<span class="sentence">Just click on here: <a class="link" href="./Page_1">here</a> or <a class="link" href="./Page_2">here</a></span>
@spencermountain
Owner

spencermountain commented Aug 23, 2019

whoa! thank you Robin. That certainly looks like a bug!

I'll look at fixing it this week.
thank you for pointing it out.

same thing for markdown and latex output, too. 😬

@spencermountain spencermountain changed the title .html() fails to generate separate links with the same textual representation nested links in outputs Aug 23, 2019
@Istador
Contributor Author

Istador commented Aug 23, 2019

Looking at the code that generates the html output, I think the <b> and <i> tags are also affected, because it uses the same smartReplace function, which just replaces the first occurrence of the sub-string in the text.

While parsing the input string, the parser needs to remember where in the text the link, bold or italic is located, so it can later do the replacement at the right position in the output.
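A minimal sketch of that failure mode (smartReplace here is just an illustrative stand-in, not the library's actual implementation):

```javascript
// Illustrative stand-in for a first-occurrence replace: String.replace with
// a string needle only ever swaps the first match.
const smartReplace = (text, needle, replacement) => text.replace(needle, replacement)

let sentence = 'Page 1, Page 1'
const links = [
  { text: 'Page 1', href: './Page_1' },
  { text: 'Page 1', href: './Page_1' },
]
for (const link of links) {
  sentence = smartReplace(sentence, link.text, `<a class="link" href="${link.href}">${link.text}</a>`)
}

// The second pass matches the 'Page 1' inside the first anchor's text,
// producing the nested <a><a>…</a></a> output from Example 1.
console.log(sentence)
// <a class="link" href="./Page_1"><a class="link" href="./Page_1">Page 1</a></a>, Page 1
```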

@spencermountain
Owner

right. I had avoided using character offsets, because the text keeps changing, as it gets parsed.
Yuck!
Maybe we can add a simple check for # of matches, immediately. Not rendering a link, (or a bold!) is better than double-rendering it, IMO. ... Or maybe we can check for dupes when we parse a link, and add an integer to it or something.

But yeah, it's clear that some design changes need to come. I'd like to separate the html/markdown/latex outputs into separate projects altogether. My initial goal with this library was just to get json out of a page, but it's become obvious that rendering stuff is a big part of the project, and needs to be done properly.

You're free to pursue any solution for this. I'll merge anything!
Thanks. ;)

@Istador
Contributor Author

Istador commented Aug 26, 2019

Yep, it looks complicated.
Lots of separate regular expressions replacing text one after another in a chain, while doing feature extraction (semantic analysis) at the same time.

If we continue with this approach, each step would need to fix up all the offsets generated by the previous steps whenever it changes the text.

An alternative approach would be to parse the input string by generating an abstract syntax tree. And then to define the transformations from the AST to other data formats (text, json, html, markdown, etc).


Checking for duplicates and recording which nth occurrence it is will probably not work, because there are too many edge cases with the other transformation steps. To implement that correctly you'd need access to the already fully transformed string before the current match.

Not rendering seems better than rendering the wrong things, but the same condition holds: to implement this correctly you'd need access to the already fully transformed string before the match. Otherwise there will be over-blocking (not rendering things that are actually fine), which might still be better than the current situation.

Counter-Example: <i>abi</i>, <b>i</b>, [[a|b]], [[b|a]]


Another issue, that just came to my mind while thinking about the current order of transformations (links, bold, italic):

wtf("[[Page 1|'''link''']]").sentences(0).html()

Produces:

<span class="sentence"><b>link</b></span>

Instead of:

<span class="sentence"><a class="link" href="./Page_1"><b>link</b></a></span>

Because the link parser thinks the final text still contains the ''', which it doesn't, because the bold parser strips it after the link parser has run.
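A toy reproduction of that ordering problem (hypothetical code, not the library's actual flow): the bold pass rewrites the quotes before the link renderer's remembered needle gets substituted, so the link replacement silently finds nothing.

```javascript
const wikitext = "[[Page 1|'''link''']]"

// Pass 1: the link parser strips the [[target|display]] syntax and remembers
// the display text (still containing the ''' quotes) as its needle.
let text = wikitext.replace(/\[\[([^\]|]+)\|([^\]]+)\]\]/, (_, target, display) => display)
const linkNeedle = "'''link'''" // what the link renderer would look for later

// Pass 2: the bold parser rewrites the quotes first...
text = text.replace(/'''([^']+)'''/, '<b>$1</b>')

// ...so the link renderer's needle no longer exists, and no <a> is emitted.
console.log(text)                      // <b>link</b>
console.log(text.includes(linkNeedle)) // false
```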

@spencermountain
Owner

yeah, thanks Robin. I understand your point.
You're not the first person to suggest an AST. It certainly seems to make some sense.

I guess since my motivation for this library has been to get data out of wikipedia, and it's doing that fine in this case, I can't see myself dramatically re-writing things in such a way. If there were cases where we couldn't parse the document, without a proper AST interpreter, I'd probably be more keen.

Maybe I should review the concepts and state-of-the-art tooling, but I feel like if you generated an AST, you're still just doing all sorts of look-ahead/look-behind checks, loops, nested conditions, etc.

I mean, for parsing a table, you still have to know that a newline followed by '|-' creates a new row, but not if it's followed by a }, etc. Arguably, doing str.match(/\n\|-/) is still an easier way to handle this than climbing through a nested data-structure.

Maybe I gotta stew on this issue a little more. The real problem is having two identical things within a sentence. Maybe we should operate on a smaller unit. Maybe we should parse individual words, which would perhaps resolve this. Maybe having two identical links in one sentence is so niche that we can ignore it. Maybe html output is something we should put an asterisk on, admitting it can't be done very well.
¯\_(ツ)_/¯

There have been other clever proposals to add some structured token data back into the string, like __my-id-here__, so that the ordering of parsed components can be better maintained. That may help with this, too.
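Something like this, roughly (the `__LINK_n__` ids are made-up for illustration): every parsed component gets swapped for a unique placeholder up front, so duplicate display text can no longer collide when the placeholders are expanded later.

```javascript
const links = []

// Pass 1: replace every [[Page]] link with a unique placeholder id.
let text = '[[Page 1]], [[Page 1]]'.replace(/\[\[([^\]|]+)\]\]/g, (_, page) => {
  const id = `__LINK_${links.length}__`
  links.push({ id, page })
  return id
})
// text is now '__LINK_0__, __LINK_1__', and every needle is unique.

// Pass 2: expand each placeholder into its html. First-occurrence
// replacement is now safe, because no id appears twice.
const html = links.reduce(
  (out, { id, page }) =>
    out.replace(id, `<a class="link" href="./${page.replace(/ /g, '_')}">${page}</a>`),
  text
)
console.log(html)
// <a class="link" href="./Page_1">Page 1</a>, <a class="link" href="./Page_1">Page 1</a>
```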
What do you think about that idea?

cheers

@spencermountain
Owner

created another issue for the [[bold|'''bold''']] thing. Good find!

@Istador
Contributor Author

Istador commented Sep 4, 2019

Parsing individual words:

Is the '''b c''' inside of '''b c''' d considered a single word?


Speaking of parsing on a word level, the parsing on the sentence level isn't optimal either.

Here is a new bug for you :P

const parsed = wtf("'''A normal sentence. And a second sentence.'''")
parsed.sentences(0).html()
parsed.sentences(1).html()

Resulting in:

<span class="sentence">'''A normal sentence.</span>
<span class="sentence">And a second sentence.'''</span>

Expected (or similar):

<span class="sentence"><b>A normal sentence.</b></span>
<span class="sentence"><b>And a second sentence.'''</b></span>

Token structure proposal:

Generating a token structure (syntactic analysis) is the first step of parsing, so it can't hurt, but without a semantic analysis I think it won't help much with nested structures.

I haven't looked at issue #201 in detail, but there the tokens are concatenated into the text and need to be str_replaced, instead of living in a separate data structure. That might cause problems when the following steps need to operate inside the text of the tokens, or if they don't detect the tokens correctly and interpret them as text, or interpret text as tokens (the last point is very unlikely, since something like ___CITE_7238234792_ID_3___ is probably not actual article content, given the timestamp).


Just for fun, because I find this problem interesting and I like to program things, I hacked together a little parser that generates an AST data structure on a sentence level only (so it only supports anchors, bolds and italics).

It's far from perfect, doesn't support all the features that your parser offers, should be optimized, and will probably fail horribly on invalid input data, but it avoids a lot of the parsing issues that the ordered-replace approach runs into, and it has a lot of examples as unit tests:
https://github.com/Istador/wtf-wikipedia-ast-parser

It basically boils down to this regexp to tokenize the string:

/('''''|''''|'''|''|<\/?[bi]>|\[\[|\]\]|\|)/g

And the following grammar (EBNF):

Sentence      = Token, { Token } ;
Token         =  Text | Link | BoldAndItalic | Bold | Italic ;
Link          = "[[", { Token }, [ "|", { Token } ], "]]" ;
BoldAndItalic = "''''", {{ Token }}, "''''" ;
Bold          = BoldTag | BoldQuote ;
BoldTag       = "<b>", { Token }, "</b>" ;
BoldQuote     = "'''", {{ Token }}, "'''"  ;
Italic        = ItalicTag | ItalicQuote ;
ItalicTag     = "<i>", { Token }, "</i>" ;
ItalicQuote   = "''", {{ Token }}, "''" ;
Text          = ? all text tokens from tokenization step ? ;
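For reference, the tokenizer regexp above can be exercised with String.split: the capture group makes split keep the delimiters as tokens (a simplified version of what the linked parser does, without the position bookkeeping):

```javascript
// Split on the delimiter regexp; the capture group keeps the delimiters in
// the result, and empty strings between adjacent delimiters are filtered out.
const tokenize = (str) =>
  str.split(/('''''|''''|'''|''|<\/?[bi]>|\[\[|\]\]|\|)/g).filter(Boolean)

const tokens = tokenize("pre [[link|''italic'']] post")
console.log(tokens)
// [ 'pre ', '[[', 'link', '|', "''", 'italic', "''", ']]', ' post' ]
```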

Interestingly, toText(), toHTML(), toLatex() and toMarkdown() are pretty straightforward to implement. Only toJSON() gets complicated, because the resulting AST doesn't have the same internal data structure as the current wtf parser, and toJSON() is basically just a dump of that internal structure (with slight modifications).

Example:

const { compile, parse } = require('./parser')

const input  = "pre [[link|pre ''italic'' post]] post"
const tokens = compile(input)
const ast    = parse(tokens)

console.log(tokens)
console.log(ast)
console.log(ast.toString())
console.log(ast.toText())
console.log(ast.toHTML())
console.log(ast.toMarkdown())

Output:

[
  { type: 'text', text: 'pre ', pos: 0 },
  { type: '[[', text: '[[', pos: 4 },
  { type: 'text', text: 'link', pos: 6 },
  { type: '|', text: '|', pos: 10 },
  { type: 'text', text: 'pre ', pos: 11 },
  { type: "''", text: "''", pos: 15 },
  { type: 'text', text: 'italic', pos: 17 },
  { type: "''", text: "''", pos: 23 },
  { type: 'text', text: ' post', pos: 25 },
  { type: ']]', text: ']]', pos: 30 },
  { type: 'text', text: ' post', pos: 32 }
]
Sentence {
  childs: [
    Text { text: 'pre ' },
    Link { childs: [Array], target: 'link' },
    Text { text: ' post' }
  ]
}
Sentence(Text("pre "), Link("link", Text("pre "), Italic(Text("italic")), Text(" post")), Text(" post"))
pre pre italic post post
<span class="sentence">pre <a class="link" href="./Link">pre <i>italic</i> post</a> post</span>
pre [pre *italic* post](./Link) post

@spencermountain
Owner

hey cool!
thank you for making that. Sometimes it's so much easier just to check out a javascript file and be like 'ooh, right'.
Yeah, i share your concerns about the __CITE_98732__ thing. I'm a little swamped right now, but am planning on taking some big swipes at this library this fall/winter. Wanna join me?
