parse_fragment does not parse whitespace in HTML (or XML) text properly #421

Open
calimeroteknik opened this issue Sep 18, 2022 · 0 comments

Description

parse_fragment does not handle whitespace in HTML (or XML) text properly: runs of spaces, tabs, and newlines are kept as-is when they should be collapsed.

To Reproduce

Steps to reproduce the behavior:

  • Using Floki v0.33.1
  • Using Elixir v1.13.2
  • Using Erlang OTP 24.3.2 [erts-12.3]
  • With this code:
      Floki.parse_document("<!DOCTYPE html>\n<html>\n\t<head>\n\t\t<title> \t&#110;&#111;&#116;&#104;&#105;&#110;&#103;\t\n\t\t\t &#116;&#111;\n&#115;&#101;&#101;  &#104;&#101;&#114;&#101;&#44;&#32;&#119;&#111;&#114;&#107;&#105;&#110;&#103;&#32;&#112;&#114;&#111;&#112;&#101;&#114;&#108;&#121; \n\n\t\t</title>\n\t</head>\n\t<body>\n\t</body>\n</html>\n")
        |> Rustic.Result.map_err(fn reason -> {:invalid_html, reason} end)
        |> Rustic.Result.and_then(fn doc ->
          doc
          |> Floki.find("head > title")
          |> Enum.take(1)
          |> Floki.text()
          |> Floki.HTMLParser.parse_fragment()
        end)
    I get the following output:
    {:ok, [" \tnothing\t\n\t\t\t to\nsee  here, working properly \n\n\t\t"]}

Expected behavior

The following output:

{:ok, [" nothing to see here, working properly "]}

(I think the leading and trailing whitespace must not be trimmed but, like the interior runs, must be collapsed to a single space; this might need triple-checking against the standards.)
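
For illustration, the expected collapsing can be sketched in plain Elixir on the caller's side. This is only a sketch of the behavior described above, not Floki's API: `WhitespaceSketch.collapse/1` is a hypothetical helper, and it assumes "whitespace" means the ASCII whitespace characters (space, tab, LF, FF, CR) and that the ends are collapsed rather than trimmed.

```elixir
defmodule WhitespaceSketch do
  # Hypothetical helper, not part of Floki.
  # Replaces every run of ASCII whitespace (space, tab, LF, FF, CR)
  # with a single space. Leading and trailing runs are collapsed,
  # not trimmed, matching the expected output in this report.
  def collapse(text) when is_binary(text) do
    String.replace(text, ~r/[ \t\n\f\r]+/, " ")
  end
end

# The text node currently returned for the <title> element above.
raw = " \tnothing\t\n\t\t\t to\nsee  here, working properly \n\n\t\t"

IO.inspect(WhitespaceSketch.collapse(raw))
# prints " nothing to see here, working properly "
```

Applied to the text returned by Floki.text(), this yields the string shown under "Expected behavior"; whether the library itself should do this is the question this issue raises.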

Test file (HTML): floki-test.html.txt
