
Create a built-in HTML parser #37

Open · philss opened this issue Oct 29, 2015 · 16 comments

philss (Owner) commented Oct 29, 2015

Floki needs a built-in HTML parser in order to remove the mochiweb dependency. This will enable more flexibility and better control over the parsing step.

The parser goals are:

  • support HTML5;
  • support HTML snippets;
  • be able to parse large files, like 15 MB;
  • be easy to traverse;
  • be somewhat tolerant of errors, like missing closing tags (see the sketch below).
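
As a rough illustration of that last goal (the input and expected output here are made up for illustration, not something Floki produces today): an HTML5-compliant parser should close the unterminated <li> elements itself and still return a sensible tree in Floki's tuple format.

html = "<ul><li>first<li>second</ul>"
# desired result from a tolerant parser:
# {"ul", [], [{"li", [], ["first"]}, {"li", [], ["second"]}]}
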
philss added the Feature label Oct 29, 2015
philss added this to the 1.0 milestone Oct 29, 2015
philss (Owner) commented Dec 9, 2015

Here is a test case with an example of an error that Floki does not handle today: henrik/sipper@49a4c09

Thanks @henrik for the example!

gmile (Contributor) commented Jun 7, 2016

@philss creating an HTML parser from scratch sounds like a huge amount of work. Have you thought about depending on a C library instead, such as https://github.com/google/gumbo-parser?

philss (Owner) commented Jun 9, 2016

@gmile yeah, I thought about that, but what I want is to avoid depending on an external library.
This came from a bit of frustration with the Nokogiri Ruby gem. It uses libxml2 and FFI to make the bridge, and it failed to compile for me so many times that I didn't like the experience.

But this is not discarded. I also think Servo's HTML parser is a good option.

gmile (Contributor) commented Jun 9, 2016

> But this is not discarded

@philss that said, are you specifically planning to go with Servo's HTML implementation? Otherwise, I could play with a gumbo-parser integration and see how it goes.

philss (Owner) commented Jun 9, 2016

@gmile I'm not looking into this right now. So, please go for it. 👍

baron commented Jul 12, 2016

I was wondering what the expected behavior of a native HTML parser would be. Right now :mochiweb_html.parse always returns empty lists either in the middle or at the end of the tuples (depending on the level of nesting the HTML has). I'm not sure if this is a bug or a feature, but it was confusing when I first started using the library, because I was hoping for some kind of "to_hash"-like function as in Ruby.

iex(33)> htm = """
...(33)> <ul>
...(33)> <li>fooo</li>
...(33)> <li>bar</li>
...(33)> </ul>
...(33)> """
"<ul>\n<li>fooo</li>\n<li>bar</li>\n</ul>\n"
iex(34)> :mochiweb_html.parse(htm)
{"ul", [], [{"li", [], ["fooo"]}, {"li", [], ["bar"]}]}

Would a replacement parser recreate this behavior for backwards compatibility, or break the API?
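
For reference, here is a tiny sketch of the kind of "to_hash"-like helper mentioned above, written against the {tag, attributes, children} tuples that mochiweb (and Floki) returns. The module is made up for illustration and is not part of Floki:

defmodule TupleTreeToMap do
  # Text nodes are plain binaries; keep them as they are.
  def convert(text) when is_binary(text), do: text

  # Element nodes are {tag, attributes, children} tuples.
  def convert({tag, attrs, children}) do
    %{tag: tag, attrs: Map.new(attrs), children: Enum.map(children, &convert/1)}
  end
end

# TupleTreeToMap.convert({"ul", [], [{"li", [], ["fooo"]}, {"li", [], ["bar"]}]})
# => %{tag: "ul", attrs: %{}, children: [%{tag: "li", ...}, %{tag: "li", ...}]}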

BTW, thanks for the awesome library!

Eiji7 commented Dec 20, 2016

It would be awesome to have something like this:

%Floki.Leaf.Comment{content: "comment content"}
%Floki.Leaf.Node{attributes: [], children: [], events: [], name: "p", styles: []}
# events and styles are optional (I was thinking about something like a browser inspector)
%Floki.Leaf.TextNode{content: "content"}

instead of:

{"p", [], []}
"content"
{:comment, "content"}

I was also thinking about:

Floki.DocType.parse() # returns struct like:
%Floki.Document.HTML5{dom_tree: nil, lang: "en"}
Floki.DocumentParser # protocol for document structs

Features:

  • support all CSS3 (CSS4?) selectors
  • support XPath
  • log warnings when parsing + add an option to raise on warnings
  • add an option to strip blank text nodes (default: false), see the sketch after these lists
  • add an option to strip comment content (default: true)
  • use Stream where possible
  • tag names and attribute names are always lowercase, like "my-custom-tag" and "my-custom-data"
  • support encoding detection
  • allow a validate-only mode
  • support fetching parent(s) and sibling(s) from a leaf struct ...
  • debug logs - for example: "missing title", "missing favicon" ...

Optional features:

  • method to collect styles for a node (with priority, source file, line ...)
  • method to collect events for a node
  • extra jQuery selectors, see docs
  • CSS validator with warnings/errors, e.g. for:
<div style='fontt-color: white;'></div>
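
As a rough sketch of the "strip blank text nodes" option from the feature list above, a post-processing pass over the current tuple representation could look like this (the module is hypothetical, not part of Floki):

defmodule StripBlankText do
  # Drops whitespace-only text nodes from a {tag, attrs, children} tree.
  def strip({tag, attrs, children}) do
    cleaned =
      children
      |> Enum.reject(fn child -> is_binary(child) and String.trim(child) == "" end)
      |> Enum.map(&strip/1)

    {tag, attrs, cleaned}
  end

  # Text nodes, comments and anything else pass through unchanged.
  def strip(other), do: other
end

# StripBlankText.strip({"ul", [], ["\n", {"li", [], ["fooo"]}, "\n"]})
# => {"ul", [], [{"li", [], ["fooo"]}]}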

ghost commented Jan 14, 2017

Yeah, XPath would be awesome, especially when scraping data from a website. Chrome can automatically generate XPath expressions for you to grab specific tags, which would save me a lot of pattern matching...

As for html5ever, check out https://github.com/hansihe/Rustler
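
For context, the Elixir side of a Rustler-based NIF is just a module that names the native crate and stubs out the functions it exports. The module, OTP app, crate and function names below are made up for illustration:

defmodule MyApp.Html5everNif do
  use Rustler, otp_app: :my_app, crate: "html5ever_nif"

  # Replaced by the native implementation once the NIF is loaded.
  def parse(_html), do: :erlang.nif_error(:nif_not_loaded)
end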

philss (Owner) commented Mar 14, 2017

@mhsjlw I agree. Please follow this issue for more details: #94 (sorry for the delay 😅 ).

philss (Owner) commented Mar 14, 2017

@gmile I totally forgot to update you, but it is now possible to use Servo's HTML parser with Floki!

Please follow these instructions: https://github.com/philss/floki#optional---using-http5ever-as-the-html-parser
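
Roughly, the setup described in that link boils down to adding the html5ever package as a dependency of your project and pointing Floki at the html5ever-backed parser module; double-check the README above for the exact package version and module name:

# mix.exs: add {:html5ever, "..."} next to :floki in deps/0

# config/config.exs
config :floki, :html_parser, Floki.HTMLParser.Html5ever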

gmile (Contributor) commented Mar 14, 2017

@philss wow, that's awesome! Thanks!

liveresume commented

Rust NIFs anyone?

https://github.com/servo/html5ever

;)

ghost commented Mar 22, 2017

@liveresume this was mentioned twice; see #37 (comment) and #37 (comment)

f34nk commented Feb 21, 2018

Please have a look at:
https://github.com/Overbryd/myhtmlex

Based on Alexander Borisov's myhtml (https://github.com/lexborisov/myhtml), this binding has the advantage of being HTML-spec compliant and very fast.

@Overbryd gave a talk about it in Berlin.
I would love to see this come together!

Overbryd commented

@f34nk Happy to help on this one.

I also wrote https://github.com/Overbryd/nodex, which can be used to run code on a separate (C-)node and get the best trade-off between performance and safety.

I would refrain from using myhtmlex widely as a NIF without explicitly checking the crash-safety requirements of the application that needs it. So providing two modes of operation (NIF and C-Node) might be the best way to go for a widely used package.
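
Purely as an illustration of those two modes, an application could select a backend via configuration. The option key and module names here are hypothetical, not myhtmlex's actual API:

# config/config.exs
config :my_app, :html_backend, MyApp.Parser.Nif
# fastest, but a crash in native code takes the whole VM down

# config :my_app, :html_backend, MyApp.Parser.CNode
# the parser runs on a separate C-Node, isolated from the VM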

philss (Owner) commented Feb 22, 2018

I didn't know we had bindings for myhtml. That's great! Thank you for the work on that, @Overbryd!

We could certainly write an adapter like we did for the html5ever parser. I don't know yet how we would enable the configuration of a C-Node, or whether that is needed for the adapter. We can develop those ideas further from here.
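
A minimal sketch of what such an adapter contract could look like, wrapping the parser we already ship with (the module names and callback below are illustrative, not Floki's actual API):

defmodule HTMLParserAdapter do
  # Contract that every parser backend (mochiweb, html5ever, myhtml, ...) would implement.
  @callback parse(html :: binary()) :: {:ok, html_tree :: tuple() | [any()]} | {:error, term()}
end

defmodule MochiwebAdapter do
  @behaviour HTMLParserAdapter

  @impl true
  def parse(html), do: {:ok, :mochiweb_html.parse(html)}
end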

Thank you for letting us know, @f34nk! Can you open a new issue with the proposal?

philss added a commit that referenced this issue Jun 12, 2021
This is part of a bigger effort to write a compliant HTML parser in
Elixir.

The implementation follows the WHATWG specification, which is the living
standard of HTML, but parts of the tokenizer are still missing, such as the
handling of parse errors and some states. Those missing parts are not
essential for most documents.

You can see details about the HTML specification here:
https://html.spec.whatwg.org/multipage/

This commit contains a lot of files. The most important one is
`lib/floki/html/tokenizer.ex`. We added a lot of test files that were
generated from html5lib-tests - a project that aims to provide
test cases based on the WHATWG specs.
See: https://github.com/html5lib/html5lib-tests
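
To give an idea of what those generated tests are based on: a tokenizer entry in html5lib-tests (the .test files are JSON) looks roughly like the map below, shown here in Elixir syntax with illustrative values:

%{
  "description" => "simple start tag followed by text",
  "input" => "<p>hello",
  "output" => [
    ["StartTag", "p", %{}],
    ["Character", "hello"]
  ]
}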

This tokenizer was written based on the specs as seen around September
2019. Most of the parser development progress is being tracked at
https://github.com/philss/floki/projects/2

For now it will remain "private", and no other module is using it.

This is related to #37 :)