CDATA inside a title tag is not handled in Mochiweb parser #448

Open
mdg opened this issue Mar 2, 2023 · 0 comments

Description

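First, it's not clear to me whether this is a bug or just a difference in expected behavior, since I'm using Floki to parse XML rather than HTML.

The issue is that CDATA inside a <title> tag is not unwrapped: the raw <![CDATA[...]]> markup comes back as the text node, while the same content inside another tag is returned as plain text.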

To Reproduce

Steps to reproduce the behavior:

  • Using Floki v0.34.1
  • Using Elixir 1.14.0 (compiled with Erlang/OTP 24)
  • Using Erlang/OTP 24 [erts-12.2.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit]
  • With this code:
    # An example to reproduce the problem
    iex(44)> Floki.parse_document!("<title><![CDATA[handle CDATA]]></title>")
    [{"title", [], ["<![CDATA[handle CDATA]]>"]}]
    iex(45)> Floki.parse_document!("<tacos><![CDATA[handle CDATA]]></tacos>")
    [{"tacos", [], ["handle CDATA"]}]

Expected behavior

I would expect <title> to behave like other tags and return only the text inside the CDATA section.

# Expected result
iex(44)> Floki.parse_document!("<title><![CDATA[handle CDATA]]></title>")
[{"title", [], ["handle CDATA"]}]

I recognize that this may be an artifact of using the library for the wrong purpose (parsing XML), so no problem if you want to close this as "won't fix".
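
Until the parser changes, the expected result can be approximated by post-processing the parsed tree. What follows is only a minimal sketch, assuming the tree shape shown in the reproduction above ({tag, attrs, children} tuples with binary text nodes). The module name CDATAUnwrap is hypothetical and not part of Floki, and the sketch only handles a text node that consists entirely of a single CDATA section.

```elixir
defmodule CDATAUnwrap do
  # Hypothetical helper (not part of Floki): walks a parsed tree and
  # strips the <![CDATA[ ... ]]> wrapper from text nodes.

  # A list of nodes: process each one.
  def unwrap(nodes) when is_list(nodes), do: Enum.map(nodes, &unwrap/1)

  # An element node: keep tag and attributes, recurse into children.
  def unwrap({tag, attrs, children})
      when is_binary(tag) and is_list(attrs) and is_list(children) do
    {tag, attrs, unwrap(children)}
  end

  # A text node that starts with the CDATA opener: drop the wrapper
  # only if the node also ends with the CDATA closer.
  def unwrap("<![CDATA[" <> rest = text) do
    if String.ends_with?(rest, "]]>") do
      binary_part(rest, 0, byte_size(rest) - 3)
    else
      text
    end
  end

  # Any other node (plain text, comments, doctype, ...) passes through.
  def unwrap(other), do: other
end
```

Piping the output of Floki.parse_document!/1 through CDATAUnwrap.unwrap/1 would make the <title> case match the <tacos> case, without touching the tokenizer.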

Patch

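If you do want to fix it, here's a patch that appears to do it. It removes the special-case tokenizing of <title> (the title branch in tokens/3 and parse_flag/1, plus tokenize_title itself), so <title> content goes through the same path as any other tag.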

diff --git a/src/floki_mochi_html.erl b/src/floki_mochi_html.erl
index d4e3337..4096161 100644
--- a/src/floki_mochi_html.erl
+++ b/src/floki_mochi_html.erl
@@ -301,13 +301,10 @@ tokens(B, S=#decoder{offset=O}, Acc) ->
                     {Tag2, S2} = tokenize_script(B, S1),
                     tokens(B, S2, [Tag2, Tag | Acc]);
                 style ->
                     {Tag2, S2} = tokenize_style(B, S1),
                     tokens(B, S2, [Tag2, Tag | Acc]);
-                title ->
-                    {Tag2, S2} = tokenize_title(B, S1),
-                    tokens(B, S2, [Tag2, Tag | Acc]);
                 textarea ->
                     {Tag2, S2} = tokenize_textarea(B, S1),
                     tokens(B, S2, [Tag2, Tag | Acc]);
                 none ->
                     tokens(B, S1, [Tag | Acc])
@@ -318,12 +315,10 @@ parse_flag({start_tag, B, _, false}) ->
     case string:to_lower(binary_to_list(B)) of
         "script" ->
             script;
         "style" ->
             style;
-        "title" ->
-            title;
         "textarea" ->
             textarea;
         _ ->
             none
     end;
@@ -822,32 +817,10 @@ tokenize_style(Bin, S=#decoder{offset=O}, Start) ->
             tokenize_style(Bin, ?INC_CHAR(S, C), Start);
         <<_:Start/binary, Raw/binary>> ->
             {{data, Raw, false}, S}
     end.
 
-tokenize_title(Bin, S=#decoder{offset=O}) ->
-    tokenize_title(Bin, S, O).
-
-tokenize_title(Bin, S=#decoder{offset=O}, Start) ->
-    case Bin of
-        %% Just a look-ahead, we want the end_tag separately
-        <<_:O/binary, $<, $/, TT, II, TT2, LL, EE, ZZ, _/binary>>
-        when (TT=:= $t orelse TT =:= $T) andalso
-             (II=:= $i orelse II =:= $I) andalso
-             (TT2=:= $t orelse TT2 =:= $T) andalso
-             (LL=:= $l orelse LL =:= $L) andalso
-             (EE=:= $e orelse EE =:= $E) andalso
-             ?PROBABLE_CLOSE(ZZ) ->
-            Len = O - Start,
-            <<_:Start/binary, Raw:Len/binary, _/binary>> = Bin,
-            {{data, Raw, false}, S};
-        <<_:O/binary, C, _/binary>> ->
-            tokenize_title(Bin, ?INC_CHAR(S, C), Start);
-        <<_:Start/binary, Raw/binary>> ->
-            {{data, Raw, false}, S}
-    end.
-
 tokenize_textarea(Bin, S=#decoder{offset=O}) ->
     tokenize_textarea(Bin, S, O).
 
 tokenize_textarea(Bin, S=#decoder{offset=O}, Start) ->
     case Bin of

mdg added the Bug label Mar 2, 2023