Floki removes blank text nodes without option to avoid this #75

Eiji7 · 2016-12-15T21:48:03Z

Actual results:

Floki.parse("<span>5</span> <span>=</span> <span>5</span>")  
[{"span", [], ["5"]}, {"span", [], ["="]}, {"span", [], ["5"]}]

Expected results:

Floki.parse("<span>5</span> <span>=</span> <span>5</span>")  
# or
Floki.parse("<span>5</span> <span>=</span> <span>5</span>", remove_text_nodes_with_whitespaces: false)

[{"span", [], ["5"]}, " ", {"span", [], ["="]}, " ", {"span", [], ["5"]}]

Note: this is really important in some cases. For example, please try parsing html generated from github markup (code samples).

philss · 2016-12-18T15:37:20Z

Hi @Eiji7.

This issue is a known issue inside Mochiweb HTML parser, which is the parser behind Floki.

The suggested workaround would replace white-spaces between tags with a   or   before calling :mochiweb.parse/1.

I tried some expressions to replace and I came up with this:

"<span>5</span> <span>=</span> <span>5</span>"
|> String.replace(~r/>[ \n\r]+</, ">&#32;<")
|> Floki.parse

# => [{"span", [], ["5"]}, " ", {"span", [], ["="]}, " ", {"span", [], ["5"]}]

Not sure if this is safe to do inside Floki.parse/2, specially because this would affect code inside script tags.

Maybe the option you are suggesting could come as true by default only to preserve the current behavior that is broken, but is already known. What do you think? Do you have suggestions?
Thanks!

Eiji7 · 2016-12-19T15:28:56Z

Edit:
@philss: Looks better, but it should be already implemented, so developers not need to care about replace.

aphillipo · 2016-12-29T09:17:38Z

@Eiji7 Surely it's going to add a lot of overhead if you don't need this though? Maybe it should be Floki.preserve_inter_tag_spaces and documented to say that it will copy the entire string to do this.

Eiji7 · 2016-12-29T09:43:23Z

@aphillipo: yup, this way or #37.
btw. If you can explain me how parsers should work, like:

best way is to create regex, you can find sample/demo patterns at ...

then I could think about create an XML/HTML parser and validator in Elixir. I can start work on it now, but I don't worked on similar project, so I can't provide others that it will be really fast. As listed in 3rd point at #37 issue we need think about a good algorithm to avoid decrease of efficiency

aphillipo · 2016-12-29T15:11:19Z

Building HTML parsers is very difficult - I want to use Servo's for this eventually...

https://github.com/servo/html5ever

Eiji7 · 2016-12-29T19:34:00Z

@philss and @aphillipo: Did you know :xmerl and :xmerl_scan?
For example see erlang docs.
Maybe it's what we are looking for that we can easily extend?

Erlang/OTP 19 [erts-8.1] [source] [64-bit] [smp:4:4] [async-threads:10]

Interactive Elixir (1.3.4) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> :xmerl_scan.string('<p><span>5</span> <span>=</span> <span>5</span></p>')  
{{:xmlElement, :p, :p, [], {:xmlNamespace, [], []}, [], 1, [],
  [{:xmlElement, :span, :span, [], {:xmlNamespace, [], []}, [p: 1], 1, [],
    [{:xmlText, [span: 1, p: 1], 1, [], '5', :text}], [], '/home/eiji',
    :undeclared}, {:xmlText, [p: 1], 2, [], ' ', :text},
   {:xmlElement, :span, :span, [], {:xmlNamespace, [], []}, [p: 1], 3, [],
    [{:xmlText, [span: 3, p: 1], 1, [], '=', :text}], [], :undefined,
    :undeclared}, {:xmlText, [p: 1], 4, [], ' ', :text},
   {:xmlElement, :span, :span, [], {:xmlNamespace, [], []}, [p: 1], 5, [],
    [{:xmlText, [span: 5, p: 1], 1, [], '5', :text}], [], :undefined,
    :undeclared}], [], '/home/eiji', :undeclared}, []}

What do you think about it?

philss · 2016-12-29T21:07:38Z

@Eiji7 :xmerl won't work for us. This is because xmerl was made to parse XML documents, and HTML is very less strict than XML. Normally HTML parsers are more forgiven than XML parsers, due the reason that users may write invalid HTML that should be presented in the browsers.

Here are some references:

One of the possibilities to fix this is to fix mochiweb's HTML parser.

As @aphillipo said, HTML parsers are very difficult to build. I created that issue (#37) without understand that. Now I'm more inclined to the idea of reuse an existing HTML parser. Servo's html5ever is my favorite choice for now.

I did not started working on this, but help trying out this "bridge" between Elixir and Rust would be very welcome! Here is a project that is very promising: https://github.com/hansihe/Rustler.

Eiji7 · 2016-12-30T15:44:57Z

@philss: Ok, I see your point. I see that your way ("bridge") is faster to implement. I don't know how exactly will work. Will it be as fast (with that "bridge") as native Elixir parser? It may be interesting for me in further future.
Anyway I'm interested in :xmerl and I think that I will look at its source files and I will probably create a Elixir version of it that will work for more that one document format (XML, HTML, XHTML and maybe also HAML).

philss · 2017-01-12T04:37:59Z

@Eiji7 sorry for the delay.

I'm not sure how it will work. It does not seems to have a big performance gap when using, for example, an Erlang NIF written in C or Rust. The problems are more related to the security of code. If the C/Rust code crash, the entire Erlang VM can crash. This is one of the reasons I want to give Rust a try with html5ever (because Rust promises a more safe environment for your programs).

Cool! The idea to write a HTML parser entirely in Elixir is not dead yet! It is only very hard to do the "right way".

I will let this open as a "bug" and try to figure out how to "quick fix" this.

Thanks!

dariodf · 2018-03-04T10:02:24Z

I tried to use html5ever and i got too many errors when parsing.

So here's a (quite expensive) solution for this problem:

  defp replace_whitespaces(text, last_text \\ "")
  defp replace_whitespaces(text, last_text) when last_text == text, do: text
  defp replace_whitespaces(text, _) do
    text
    |> String.replace(~r/>((&#32;)*) ( *)</, ">\\g{1}&#32;\\g{3}<")
    |> replace_whitespaces(text)
  end

Then, you use this function before your find and text:

    body
    |> replace_whitespaces
    |> Floki.find(selector)
    |> Floki.text

lud-wj · 2024-02-05T15:06:04Z

Hello,

I am having the same problem. Now it seems that the mochiweb parser is distributed in floki, so the problem could be fixed directly instead of relying on third party (which does not exist anymore AFAICT).

I have a workaround with html5ever that works for small payloads, I'm gonna test it with huge, dirty HTML chunks. But it could be nice to have an option to keep all the text nodes regardless of their contents, don't you think?

philss added the Bug label Jan 12, 2017

philss mentioned this issue Jan 15, 2017

Add support for html5ever as an optional HTML parser #83

Closed

philss added this to Needs triage in Bugs Jun 7, 2020

philss moved this from Needs triage to Low priority in Bugs Jun 7, 2020

sionide21 mentioned this issue Aug 21, 2020

Retain whitespace when parsing pure links RobertDober/earmark_parser#25

Merged

ppatrzyk mentioned this issue Mar 6, 2021

traverse_and_update is not applied to text nodes #338

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Floki removes blank text nodes without option to avoid this #75

Floki removes blank text nodes without option to avoid this #75

Eiji7 commented Dec 15, 2016

philss commented Dec 18, 2016

Eiji7 commented Dec 19, 2016 •

edited

aphillipo commented Dec 29, 2016

Eiji7 commented Dec 29, 2016

aphillipo commented Dec 29, 2016

Eiji7 commented Dec 29, 2016 •

edited

philss commented Dec 29, 2016

Eiji7 commented Dec 30, 2016

philss commented Jan 12, 2017

dariodf commented Mar 4, 2018

lud-wj commented Feb 5, 2024

Floki removes blank text nodes without option to avoid this #75

Floki removes blank text nodes without option to avoid this #75

Comments

Eiji7 commented Dec 15, 2016

philss commented Dec 18, 2016

Eiji7 commented Dec 19, 2016 • edited

aphillipo commented Dec 29, 2016

Eiji7 commented Dec 29, 2016

aphillipo commented Dec 29, 2016

Eiji7 commented Dec 29, 2016 • edited

philss commented Dec 29, 2016

Eiji7 commented Dec 30, 2016

philss commented Jan 12, 2017

dariodf commented Mar 4, 2018

lud-wj commented Feb 5, 2024

Eiji7 commented Dec 19, 2016 •

edited

Eiji7 commented Dec 29, 2016 •

edited