
duplicate matches in one sentence #300

Open
Istador opened this issue Aug 22, 2019 · 8 comments

Comments

@Istador
Contributor

Istador commented Aug 22, 2019

Example 1:

wtf('[[Page 1]], [[Page 1]]').sentences(0).html()

Result:

<span class="sentence"><a class="link" href="./Page_1"><a class="link" href="./Page_1">Page 1</a></a>, Page 1</span>

Expected Result:

<span class="sentence"><a class="link" href="./Page_1">Page 1</a>, <a class="link" href="./Page_1">Page 1</a></span>

Example 2:

wtf('Just click on here: [[Page 1|here]] or [[Page 2|here]]').sentences(0).html()

Result:

<span class="sentence">Just click on <a class="link" href="./Page_1"><a class="link" href="./Page_2">here</a></a>: here or here</span>

Expected Result:

<span class="sentence">Just click on here: <a class="link" href="./Page_1">here</a> or <a class="link" href="./Page_2">here</a></span>
@spencermountain
Owner

spencermountain commented Aug 23, 2019

whoa! thank you Robin. That certainly looks like a bug!

I'll look at fixing it this week.
thank you for pointing it out.

same thing for markdown and latex output, too. 😬

@spencermountain spencermountain changed the title .html() fails to generate separate links with the same textual representation nested links in outputs Aug 23, 2019
@Istador
Contributor Author

Istador commented Aug 23, 2019

Looking at the code that generates the html output, I think the <b> and <i> tags are also affected, because it uses the same smartReplace function, which just replaces the first occurrence of the sub-string in the text.

While parsing the input string, the parser needs to remember where in the text the link, bold or italic is located, so it can later do the replacement at the right position in the output.
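A minimal sketch of that failure mode (smartReplace here is just an illustrative stand-in, not the library's actual implementation):

```javascript
// Illustrative stand-in for a first-occurrence replace: String.replace with
// a string needle only ever swaps the first match.
const smartReplace = (text, needle, replacement) => text.replace(needle, replacement)

let sentence = 'Page 1, Page 1'
const links = [
  { text: 'Page 1', href: './Page_1' },
  { text: 'Page 1', href: './Page_1' },
]
for (const link of links) {
  sentence = smartReplace(sentence, link.text, `<a class="link" href="${link.href}">${link.text}</a>`)
}

// The second pass matches the 'Page 1' inside the first anchor's text,
// producing the nested <a><a>…</a></a> output from Example 1.
console.log(sentence)
// <a class="link" href="./Page_1"><a class="link" href="./Page_1">Page 1</a></a>, Page 1
```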

@spencermountain
Owner

right. I had avoided using character offsets, because the text keeps changing, as it gets parsed.
Yuck!
Maybe we can add a simple check for # of matches, immediately. Not rendering a link, (or a bold!) is better than double-rendering it, IMO. ... Or maybe we can check for dupes when we parse a link, and add an integer to it or something.

But yeah, it's clear that some design changes need to come. I'd like to separate the html/markdown/latex outputs into separate projects altogether. My initial goal with this library was just to get json out of a page, but it's become obvious that rendering stuff is a big part of the project, and needs to be done properly.

You're free to pursue any solution for this. I'll merge anything!
Thanks. ;)

@Istador
Contributor Author

Istador commented Aug 26, 2019

Yep, it looks complicated.
Lots of separate regular expressions replacing text one after another in a chain, while doing feature extraction (semantic analysis) at the same time.

If we continue with this approach, each step would need to fix up all the offsets generated by the previous steps whenever it changes the text.

An alternative approach would be to parse the input string by generating an abstract syntax tree. And then to define the transformations from the AST to other data formats (text, json, html, markdown, etc).


Checking for duplicates and recording which nth occurrence it is will probably not work, because there are too many edge cases with the other transformation steps. To implement that correctly you'd need access to the already fully transformed string before the current match.

Not rendering seems better than rendering the wrong things, but the same condition holds: to implement this correctly you'd need access to the already fully transformed string before the match. Otherwise there will be over-blocking (not rendering things that are actually fine), which might still be better than the current situation.

Counter-Example: <i>abi</i>, <b>i</b>, [[a|b]], [[b|a]]


Another issue, that just came to my mind while thinking about the current order of transformations (links, bold, italic):

wtf("[[Page 1|'''link''']]").sentences(0).html()

Produces:

<span class="sentence"><b>link</b></span>

Instead of:

<span class="sentence"><a class="link" href="./Page_1"><b>link</b></a></span>

Because the link parser thinks the final text still contains the ''', which it doesn't, because the bold parser strips it after the link parser has run.
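A toy reproduction of that ordering problem (hypothetical code, not the library's actual flow): the bold pass rewrites the quotes before the link renderer's remembered needle gets substituted, so the link replacement silently finds nothing.

```javascript
const wikitext = "[[Page 1|'''link''']]"

// Pass 1: the link parser strips the [[target|display]] syntax and remembers
// the display text (still containing the ''' quotes) as its needle.
let text = wikitext.replace(/\[\[([^\]|]+)\|([^\]]+)\]\]/, (_, target, display) => display)
const linkNeedle = "'''link'''" // what the link renderer would look for later

// Pass 2: the bold parser rewrites the quotes first...
text = text.replace(/'''([^']+)'''/, '<b>$1</b>')

// ...so the link renderer's needle no longer exists, and no <a> is emitted.
console.log(text)                      // <b>link</b>
console.log(text.includes(linkNeedle)) // false
```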

@spencermountain
Owner

yeah, thanks Robin. I understand your point.
You're not the first person to suggest an AST. It certainly seems to make some sense.

I guess since my motivation for this library has been to get data out of wikipedia, and it's doing that fine in this case, I can't see myself dramatically re-writing things in such a way. If there were cases where we couldn't parse the document, without a proper AST interpreter, I'd probably be more keen.

Maybe I should review the concepts and state-of-the-art tooling, but I feel like if you generated an AST, you're still just doing all sorts of look-ahead/look-behind checks, loops, nested conditions, etc.

I mean, for parsing a table, you still have to know that a newline followed by '|-' creates a new row, but not if it's followed by a }, etc. Arguably, doing str.match(/\n\|-/) is still an easier way to handle this than climbing through a nested data-structure.

Maybe I gotta stew on this issue a little more. The real problem is having two identical things within a sentence. Maybe we should operate on a smaller unit. Maybe we should parse individual words, which would perhaps resolve this. Maybe having two identical links in one sentence is so niche that we can ignore it. Maybe html output is something we should put an asterisk on, admitting it can't be done very well.
¯\_(ツ)_/¯

There have been other clever proposals to add some structured token data back into the string, like __my-id-here__, so that the ordering of parsed components can be better maintained. That may help with this, too.
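Something like this, roughly (the `__LINK_n__` ids are made-up for illustration): every parsed component gets swapped for a unique placeholder up front, so duplicate display text can no longer collide when the placeholders are expanded later.

```javascript
const links = []

// Pass 1: replace every [[Page]] link with a unique placeholder id.
let text = '[[Page 1]], [[Page 1]]'.replace(/\[\[([^\]|]+)\]\]/g, (_, page) => {
  const id = `__LINK_${links.length}__`
  links.push({ id, page })
  return id
})
// text is now '__LINK_0__, __LINK_1__', and every needle is unique.

// Pass 2: expand each placeholder into its html. First-occurrence
// replacement is now safe, because no id appears twice.
const html = links.reduce(
  (out, { id, page }) =>
    out.replace(id, `<a class="link" href="./${page.replace(/ /g, '_')}">${page}</a>`),
  text
)
console.log(html)
// <a class="link" href="./Page_1">Page 1</a>, <a class="link" href="./Page_1">Page 1</a>
```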
What do you think about that idea?

cheers

@spencermountain
Owner

created another issue for the [[bold|'''bold''']] thing. Good find!

@Istador
Contributor Author

Istador commented Sep 4, 2019

Parsing individual words:

Is the '''b c''' inside of '''b c''' d considered a single word?


Speaking of parsing on a word level, the parsing on the sentence level isn't optimal either.

Here is a new bug for you :P

const parsed = wtf("'''A normal sentence. And a second sentence.'''")
parsed.sentences(0).html()
parsed.sentences(1).html()

Resulting in:

<span class="sentence">'''A normal sentence.</span>
<span class="sentence">And a second sentence.'''</span>

Expected (or similar):

<span class="sentence"><b>A normal sentence.</b></span>
<span class="sentence"><b>And a second sentence.'''</b></span>

Token structure proposal:

Generating a token structure (syntactic analysis) is the first step of parsing, so it can't hurt, but without a semantic analysis I think it won't help much with nested structures.

I haven't looked at issue #201 in detail, but there the tokens are concatenated into the text and need to be str_replaced, instead of living in a separate data structure. That might cause problems when the following steps need to operate inside the text of the tokens, or if they don't detect the tokens correctly and interpret them as text, or interpret text as tokens (the last point is very unlikely, since something like ___CITE_7238234792_ID_3___ is probably not actual article content, given the timestamp).


Just for fun, because I find this problem interesting and I like to program things, I hacked together a little parser that generates an AST data structure on a sentence level only (so it only supports anchors, bolds and italics).

It's far from perfect, doesn't support all the features that your parser offers, should be optimized, and will probably fail horribly on invalid input data, but it avoids a lot of the parsing issues that the ordered-replace approach runs into, and it has a lot of examples as unit tests:
https://github.com/Istador/wtf-wikipedia-ast-parser

It basically boils down to this regexp to tokenize the string:

/('''''|''''|'''|''|<\/?[bi]>|\[\[|\]\]|\|)/g

And the following grammar (EBNF):

Sentence      = Token, { Token } ;
Token         =  Text | Link | BoldAndItalic | Bold | Italic ;
Link          = "[[", { Token }, [ "|", { Token } ], "]]" ;
BoldAndItalic = "''''", {{ Token }}, "''''" ;
Bold          = BoldTag | BoldQuote ;
BoldTag       = "<b>", { Token }, "</b>" ;
BoldQuote     = "'''", {{ Token }}, "'''"  ;
Italic        = ItalicTag | ItalicQuote ;
ItalicTag     = "<i>", { Token }, "</i>" ;
ItalicQuote   = "''", {{ Token }}, "''" ;
Text          = ? all text tokens from tokenization step ? ;
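For reference, the tokenizer regexp above can be exercised with String.split: the capture group makes split keep the delimiters as tokens (a simplified version of what the linked parser does, without the position bookkeeping):

```javascript
// Split on the delimiter regexp; the capture group keeps the delimiters in
// the result, and empty strings between adjacent delimiters are filtered out.
const tokenize = (str) =>
  str.split(/('''''|''''|'''|''|<\/?[bi]>|\[\[|\]\]|\|)/g).filter(Boolean)

const tokens = tokenize("pre [[link|''italic'']] post")
console.log(tokens)
// [ 'pre ', '[[', 'link', '|', "''", 'italic', "''", ']]', ' post' ]
```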

Interestingly, toText(), toHTML(), toLatex() and toMarkdown() are pretty straightforward to implement. Only toJSON() gets complicated, because the resulting AST doesn't have the same internal data structure as the current wtf parser, and toJSON() is basically just a dump of that internal structure (with slight modifications).

Example:

const { compile, parse } = require('./parser')

const input  = "pre [[link|pre ''italic'' post]] post"
const tokens = compile(input)
const ast    = parse(tokens)

console.log(tokens)
console.log(ast)
console.log(ast.toString())
console.log(ast.toText())
console.log(ast.toHTML())
console.log(ast.toMarkdown())

Output:

[
  { type: 'text', text: 'pre ', pos: 0 },
  { type: '[[', text: '[[', pos: 4 },
  { type: 'text', text: 'link', pos: 6 },
  { type: '|', text: '|', pos: 10 },
  { type: 'text', text: 'pre ', pos: 11 },
  { type: "''", text: "''", pos: 15 },
  { type: 'text', text: 'italic', pos: 17 },
  { type: "''", text: "''", pos: 23 },
  { type: 'text', text: ' post', pos: 25 },
  { type: ']]', text: ']]', pos: 30 },
  { type: 'text', text: ' post', pos: 32 }
]
Sentence {
  childs: [
    Text { text: 'pre ' },
    Link { childs: [Array], target: 'link' },
    Text { text: ' post' }
  ]
}
Sentence(Text("pre "), Link("link", Text("pre "), Italic(Text("italic")), Text(" post")), Text(" post"))
pre pre italic post post
<span class="sentence">pre <a class="link" href="./Link">pre <i>italic</i> post</a> post</span>
pre [pre *italic* post](./Link) post

@spencermountain
Owner

hey cool!
thank you for making that. Sometimes it's so much easier just to check out a javascript file and be like 'ooh, right'.
Yeah, i share your concerns about the __CITE_98732__ thing. I'm a little swamped right now, but am planning on taking some big swipes at this library this fall/winter. Wanna join me?
