Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Table attributes are parsed as Text #285

Open
Eccenux opened this issue Jun 5, 2022 · 5 comments
Open

Bug: Table attributes are parsed as Text #285

Eccenux opened this issue Jun 5, 2022 · 5 comments

Comments

@Eccenux
Copy link

Eccenux commented Jun 5, 2022

There seem to be something weird going on when parsing Tables. I get attributes as Text nodes.

Let's consider this example:

import mwparserfromhell
text = """
{| class="wikitable"
|-
! style="width:101px" | heading1
! style="width:102px" | heading2
|- data-test="whatever"
| style="width:201px" | testing
|}
""".strip()
code = mwparserfromhell.parse(text)
# code.filter_tags(matches=lambda node: node.tag == "table")
# print (code.get_tree())

for node in code.filter():
	if isinstance(node, mwparserfromhell.nodes.tag.Tag):
		print(f'[{node.__class__.__name__}]({node.tag})\n"""\n', node, '\n""""')
	else:
		print(f'[{node.__class__.__name__}]\n"""\n', node, '\n""""')

This how the text is rendered by MW:

<table class="wikitable">
	<tr>
		<th style="width:101px">heading1</th>
		<th style="width:102px">heading2</th>
	</tr>
	<tr data-test="whatever">
		<td style="width:201px">testing</td>
	</tr>
</table>

So there are 3 cells here. In HTML textContent would be heading1, heading2, testing.

But in MWPFH I get Text nodes for attributes. For example:

[Tag](td)
"""
 | style="width:201px" | testing

"""

[Text]
"""
 style
"""

[Text]
"""
 width:201px
"""

[Text]
"""
  testing

"""

I would expect those attributes wouldn't appear as Text nodes at all. They are already available node.attributes so that should be enough.

So when traversing above cell, it should only show:

[Tag](td)
"""
 | style="width:201px" | testing

"""

[Text]
"""
  testing

"""
@lahwaacz
Copy link
Contributor

lahwaacz commented Jun 27, 2022

You are doing a recursive filter on the wikicode. Those text nodes are nested inside the tag nodes. The tree of your snippet looks correct:

>>> print(code.get_tree())
<
      table
      class
    = wikitable
>
      <
            tr
      >
            <
                  th
                  style
                = width:101px
            >
                   heading1\n
            </
                  th
            >
            <
                  th
                  style
                = width:102px
            >
                   heading2\n
            </
                  th
            >
      </
            tr
      >
      <
            tr
            data-test
          = whatever
      >
            <
                  td
                  style
                = width:201px
            >
                   testing\n
            </
                  td
            >
      </
            tr
      >
</
      table
>

The table attributes are parsed as text because MediaWiki parses table attributes before producing the table. For example, table attributes can contain templates and other wikicode. So I would say this is not a bug in mwparserfromhell.

@Eccenux
Copy link
Author

Eccenux commented Jun 27, 2022

The table attributes are parsed as text because MediaWiki parses table attributes before producing the table. For example, table attributes can contain templates and other wikicode. So I would say this is not a bug in mwparserfromhell.

I don't insist in calling it a bug, but still it doesn't make sense when parsing stuff with a bot. When I want change something in texts I explicitly don't want to mess with table attributes.

@lahwaacz
Copy link
Contributor

I don't insist in calling it a bug, but still it doesn't make sense when parsing stuff with a bot. When I want change something in texts I explicitly don't want to mess with table attributes.

Then you need to adjust your recursive iteration. For example:

for node in code.filter():
    parent = code.get_parent(node)
    if isinstance(parent, mwparserfromhell.nodes.tag.Tag) and not parent.contents.contains(node):
        print(f"parent of '{node}' is a tag and the node is not in the tag's contents, skipped")
        continue
    if isinstance(node, mwparserfromhell.nodes.tag.Tag):
        print(f'[{node.__class__.__name__}]({node.tag})\n"""\n', node, '\n""""')
    else:
        print(f'[{node.__class__.__name__}]\n"""\n', node, '\n""""')

@earwig
Copy link
Owner

earwig commented Jun 30, 2022

@lahwaacz is right, but maybe there's room here for improvement. We might want to distinguish between Text nodes that are "visible"/rendered and Text nodes that are purely internal like template parameter names and tag attributes.

It's basically the problem being solved by strip_code but generalized for other use cases, to allow a way to filter VisibleText nodes or something.

@lahwaacz
Copy link
Contributor

@earwig, 👍 because the loop like

for node in code.filter():
    parent = code.get_parent(node)

that is needed for this is not efficient as it basically iterates again for each node, c.f. #195.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants