Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mojo::DOM treats "< foo" as start tag, phantom elements ensue #2031

Open
mauke opened this issue Jan 29, 2023 · 7 comments
Open

Mojo::DOM treats "< foo" as start tag, phantom elements ensue #2031

mauke opened this issue Jan 29, 2023 · 7 comments

Comments

@mauke
Copy link
Contributor

mauke commented Jan 29, 2023

  • Mojolicious version: 9.31
  • Perl version: v5.36.0
  • Operating system: Ubuntu 22.04.1 LTS

Steps to reproduce the behavior

#!/usr/bin/env perl
use v5.12.0;
use warnings;
use Test::More;
use Mojo::DOM;

my $dom = Mojo::DOM->new('if a < script then="<!--"> </script> <p>FAIL</p>-->');

is_deeply
    $dom->find('script, p')->map(sub { $_->to_string })->to_array,
    [],
    'fragment contains no tags, just a comment';

done_testing;

Expected behavior

Test passes.

After seeing <, we are in the tag open state. The semantically relevant characters that can follow are !, /, ASCII alpha, ?, and EOF. Anything else (including spaces) triggers an invalid-first-character-of-tag-name error. If the parser doesn't abort, it should treat the < character literally, as if &lt; had been seen.

Actual behavior

not ok 1 - fragment contains no tags, just a comment
#   Failed test 'fragment contains no tags, just a comment'
#   at mojo-dom-bug-9.pl line 10.
#     Structures begin differing at:
#          $got->[0] = '<script then="&lt;!--"> </script>'
#     $expected->[0] = Does not exist
1..1
# Looks like you failed 1 test of 1.
@mauke
Copy link
Contributor Author

mauke commented Feb 1, 2023

The following change would implement the correct (at least for HTML) behavior:

diff --git lib/Mojo/DOM/HTML.pm lib/Mojo/DOM/HTML.pm
index e10b1532d..81f54c014 100644
--- lib/Mojo/DOM/HTML.pm
+++ lib/Mojo/DOM/HTML.pm
@@ -36,8 +36,10 @@ my $TOKEN_RE = qr/
     |
       \?(.*?)\?                                                                # Processing Instruction
     |
-      \s*((?:\/\s*)?[^<>\s\/0-9.\-][^<>\s\/]*\s*(?:(?:$ATTR_RE){0,32766})*+)   # Tag
+      (\/?[^<>\s\/0-9.\-][^<>\s\/]*\s*(?:(?:$ATTR_RE){0,32766})*+)             # Tag
     )>
+  |
+    <\/ (?![a-z]) ([^>]*) >                                                    # Invalid-first-character-of-tag-name error (bogus comment)
   |
     (<)                                                                        # Runaway "<"
   )??
@@ -101,12 +103,15 @@ sub parse {
   my $xml     = $self->xml;
   my $current = my $tree = ['root'];
   while ($html =~ /\G$TOKEN_RE/gcso) {
-    my ($text, $doctype, $comment, $cdata, $pi, $tag, $runaway) = ($1, $2, $3, $4, $5, $6, $11);
+    my ($text, $doctype, $comment, $cdata, $pi, $tag, $bogus_comment, $runaway) = ($1, $2, $3, $4, $5, $6, $11, $12);
 
     # Text (and runaway "<")
     $text .= '<'                                 if defined $runaway;
     _node($current, 'text', html_unescape $text) if defined $text;
 
+    # Malformed end tag
+    $comment = $bogus_comment if length $bogus_comment;
+
     # Tag
     if (defined $tag) {
 

That is:

  • < foo> treated as &lt; foo&gt; (invalid-first-character-of-tag-name)
  • </ foo> treated as <!-- foo--> (invalid-first-character-of-tag-name)
  • </> ignored (missing-end-tag-name)

The problem is all the tests in t/mojo/dom.t that rely on the previous behavior. There's so many, I'm not sure what to do about them.

@kraih
Copy link
Member

kraih commented Feb 1, 2023

If tests are wrong then they should be fixed.

@mauke
Copy link
Contributor Author

mauke commented Feb 2, 2023

The problem is stuff like this:

subtest 'XML name characters' => sub {
  my $dom = Mojo::DOM->new->xml(1)->parse('<Foo><1a>foo</1a></Foo>');
  is $dom->at('Foo')->text, '<1a>foo</1a>',                        'right text';
  is "$dom",                '<Foo>&lt;1a&gt;foo&lt;/1a&gt;</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><.a>foo</.a></Foo>');
  is $dom->at('Foo')->text, '<.a>foo</.a>',                        'right text';
  is "$dom",                '<Foo>&lt;.a&gt;foo&lt;/.a&gt;</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><.>foo</.></Foo>');
  is $dom->at('Foo')->text, '<.>foo</.>',                        'right text';
  is "$dom",                '<Foo>&lt;.&gt;foo&lt;/.&gt;</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><-a>foo</-a></Foo>');
  is $dom->at('Foo')->text, '<-a>foo</-a>',                        'right text';
  is "$dom",                '<Foo>&lt;-a&gt;foo&lt;/-a&gt;</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><a1>foo</a1></Foo>');
  is $dom->at('Foo a1')->text, 'foo',                     'right text';
  is "$dom",                   '<Foo><a1>foo</a1></Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><a .b -c 1>foo</a></Foo>');
  is $dom->at('Foo')->text, '<a .b -c 1>foo',                  'right text';
  is "$dom",                '<Foo>&lt;a .b -c 1&gt;foo</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<😄 😄="😄">foo</😄>');
  is $dom->at('😄')->text, 'foo',              'right text';
  is "$dom",              '<😄 😄="😄">foo</😄>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<こんにちは こんにちは="こんにちは">foo</こんにちは>');
  is $dom->at('こんにちは')->text, 'foo',                              'right text';
  is "$dom",                  '<こんにちは こんにちは="こんにちは">foo</こんにちは>', 'right result';
};

It specifically tests for "incorrect" (for HTML) behavior of the parser. I don't know enough about XML to say whether this is correct for XML, but if so, you might need different tokenizers for HTML and XML. :-/


PS: In HTML mode, the correct parse for

<😄 😄="😄">foo</😄>

would be

lt;😄 😄="😄"&gt;foo<!--😄-->

@kraih
Copy link
Member

kraih commented Feb 2, 2023

I don't see the relation between < foo and <😄 😄="😄">. And i'm pretty sure the latter is valid XML, since @mojojs/dom is more strict about what code points to allow for names (based on the XML spec). A separate tokenizer is out of the question, but i don't see the harm in being a bit more relaxed with names anyway. Using the same character ranges as @mojojs/dom would of course be an option.

@mauke
Copy link
Contributor Author

mauke commented Feb 2, 2023

Recall:

After seeing <, we are in the tag open state. The semantically relevant characters that can follow are !, /, ASCII alpha, ?, and EOF. Anything else (including spaces) triggers an invalid-first-character-of-tag-name error.

Like space, 😄 is not !, /, ?, or ASCII alpha, so it is a parse error.

Consider this example:

<😄 title="<script>console.log('hi');</script>"></😄>

According to the HTML5 spec, this contains a script element because it parses like

&lt;😄 title=&quot;<script>console.log('hi');</script>&quot;&gt;<!--😄-->

But Mojo::DOM doesn't see the <script>.

This is essentially unfixable: Browsers will always see a different document structure than Mojo::DOM as long as tag names can start with non-ascii-alpha characters.

@kraih
Copy link
Member

kraih commented Feb 2, 2023

That is certainly an interesting case. 🤔

@kraih
Copy link
Member

kraih commented Feb 2, 2023

I think the HTML/XML overlap is where we draw the line with correctness, and this will remain a case we handle like it was XML. But we can still make other cases that do not conflict with XML more strict.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants