HTML validation - Feature PR #180

diervo · 2017-02-10T16:39:44Z

TL;DR: Would the owners of this repo be open to introducing a new API to validate a given HTML page or fragment?

Today the parser internally fixes the tree for you (incorrect self-closing tags, missing tags, etc.), giving you the already-fixed tree.

I've been trying to find a good HTML validator, but the only one that is spec compliant is the one from W3C, which is written in Java and available only as a service, which is very inconvenient for most uses.

I believe that, given that this is the most used and most compliant HTML parser, it should be pretty straightforward to add HTML validation.

Rather than creating a fork, I would gladly do a PR if there is no opposition to this feature.

Thoughts?

RReverser · 2017-02-10T19:02:49Z

It doesn't "fix" HTML; it parses it in accordance with the spec. This is not a fixing mechanism separate from the rest of parsing, but the normal parsing flow, in which some tags are implicit. Having implicit tags doesn't make HTML invalid according to the HTML5 spec; on the contrary, such documents are still totally valid.

diervo · 2017-02-10T20:19:34Z

What does "valid" mean? For example, per the spec:
https://www.w3.org/TR/html/syntax.html#syntax-elements

Tags are used to delimit the start and end of elements in the markup. Raw text, escapable raw text, and normal elements have a start tag to indicate where they begin, and an end tag to indicate where they end. The start and end tags of certain normal elements can be omitted, as described below in the section on optional tags. Those that cannot be omitted must not be omitted

So in this case it's saying that <p/> is not valid HTML but, as you pointed out, the HTML5 spec defines how to parse such cases.

So is it just that we are talking about different HTML spec versions?

My ask is to add validation as per the strict HTML spec, so I wanted to leverage your parser to detect such cases.
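
To make the <p/> case concrete, here is a minimal sketch of flagging self-closing syntax on non-void elements, which the current HTML Standard lists as the parse error non-void-html-element-start-tag-with-trailing-solidus. The helper name and the issue shape are hypothetical, and the scan is a plain regex rather than a real tokenizer, so it is an illustration under those assumptions and not how parse5 works.

```typescript
// Void elements as listed in the current HTML Standard; these have no end tag.
const VOID_ELEMENTS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img", "input",
  "link", "meta", "source", "track", "wbr",
]);

// Hypothetical result shape, for illustration only.
interface SelfClosingIssue {
  tagName: string;
  offset: number; // index of "<" in the input string
}

// Hypothetical helper: a regex scan, NOT a spec-compliant tokenizer.
// It ignores comments, <script>/<style> contents, foreign (SVG/MathML)
// content, and attribute values that contain ">".
function findNonVoidSelfClosingTags(html: string): SelfClosingIssue[] {
  const issues: SelfClosingIssue[] = [];
  const startTag = /<([a-zA-Z][a-zA-Z0-9-]*)(?:\s[^<>]*?)?\/>/g;
  let match: RegExpExecArray | null;
  while ((match = startTag.exec(html)) !== null) {
    const tagName = (match[1] ?? "").toLowerCase();
    if (!VOID_ELEMENTS.has(tagName)) {
      issues.push({ tagName, offset: match.index });
    }
  }
  return issues;
}
```

For example, calling findNonVoidSelfClosingTags('<p/><br/>') reports the <p/> at offset 0 and skips <br/>, because br is a void element.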

inikulin · 2017-02-10T20:48:39Z

Well, any HTML is valid; however, it can be non-conforming, in which case the spec says to report a parse error. I believe that having a validator is a good thing for some use cases: for example, knowing that HTML is conforming justifies treating it as safe for parse-serialize round trips, which in turn makes HTML instrumentation safe as well. We had this discussion before: #55. And it is still blocked on whatwg/html#1339. I'll keep it open as there is demand for the feature; however, I wouldn't expect it to be implemented soon.

diervo · 2017-02-10T20:58:26Z

We would be open to helping with the standardization and implementing the changes if no one has a strong argument against it.

We are already working with @caridy and @domenic on some other HTML spec questions.
Let me test the waters :)

inikulin · 2017-02-10T21:01:55Z

@diervo It would be great!

caridy · 2017-02-10T21:48:05Z

I think @inikulin has the right intuition here: it is not about validation but about conformance. If the parser can provide a report about the conformance of the parsed document, that should be sufficient for developers to:

- be confident that a conforming document was parsed correctly.
- use the report as a feedback loop to the end user (e.g., show error/warning messages in a linter/IDE).
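
As a rough illustration of the two uses above, a conformance report could look something like the following sketch. Every name here (ConformanceError, ConformanceReport, buildReport, toLinterMessages) is hypothetical and is not an API of parse5 or any other parser; the error code in the usage note is one of the parse error names defined in the HTML Standard.

```typescript
// Hypothetical report shape; not an API of parse5 or any other parser.
interface ConformanceError {
  code: string;   // a parse error name, e.g. one defined in the HTML Standard
  line: number;   // 1-based line of the error
  column: number; // 1-based column of the error
}

interface ConformanceReport {
  conforming: boolean; // true when no errors were collected
  errors: ConformanceError[];
}

// Use 1: a single flag saying whether the document was conforming.
function buildReport(errors: ConformanceError[]): ConformanceReport {
  return { conforming: errors.length === 0, errors };
}

// Use 2: turn the report into messages a linter or IDE could display.
function toLinterMessages(file: string, report: ConformanceReport): string[] {
  return report.errors.map(
    (e) => `${file}:${e.line}:${e.column} error ${e.code}`,
  );
}
```

With a single collected error { code: "missing-end-tag-name", line: 3, column: 7 } and the file name "index.html", toLinterMessages yields the message "index.html:3:7 error missing-end-tag-name".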

RReverser · 2017-02-10T23:33:18Z

Personally, conformance checkers feel like a thing of the past to me, from the days when we had to check our HTML with the online W3C tool to be sure that it would be parsed correctly (or parsed at all) by all the different browsers. Now that they all follow the same spec (apart from temporary bugs), that feels less useful, but I certainly don't oppose it if there are valid use cases.

inikulin · 2017-02-11T11:58:40Z

@RReverser you're right that cross-browser compatibility is not an issue anymore (kinda), but it might be useful to ensure that the provided markup will be interpreted as intended, because in some cases auto-corrections may screw things up.

tmpfs · 2017-03-04T04:16:27Z

@diervo FYI, you can run the Nu validator locally using java -jar; it just requires that you have Java 1.8 installed. I would like it if parse5 generated a report that would help detect conformance errors, but in my experience the W3C validator is excellent and well maintained.

diervo · 2017-03-04T16:37:05Z

Yes, we have been poking around with it; the fundamental problem is integration. The fact that it is in Java adds some complexity to our integration scenarios. We will soon start working on the very first step of adding validation to parse5; hopefully we can do it in incremental steps, given all of the possible parse errors and nuances.

WilcoFiers · 2018-09-19T09:00:27Z

@diervo @inikulin It's been a while. Has any progress been made on this feature?

diervo · 2018-09-24T06:00:37Z

We have done a lot of work in the past on the HTML spec and added the proper error names and descriptions. However, we never finished porting those to parse5 and adding the optional validation discussed in this issue.

@inikulin I haven't forgotten that we still owe you a bunch of work :)
Maybe we should talk about when would be a good time to collaborate again?

inikulin · 2018-09-24T12:29:47Z

@diervo Sure, I'd love to finish this work. I've also started to do some spec work for the tree construction stage on my own. Hit me up by email and we'll try to figure out the best time to finish it.

blizzardengle · 2024-02-17T16:27:38Z

It seems this feature may be dead, but I would like to resurrect it to say that I personally have been searching for an HTML parser that can perform basic validation/conformance checking.

I teach at a university and have built a tool to auto-grade programming projects. It detects and provides feedback on errors/issues in students' code and, of all things, HTML has been the hardest to find a parser for. The ideal would be every node being marked with a Boolean of true (valid/conforming) or false (invalid/nonconforming), but I would be happy with a general "appears to be valid" option set on the root node.

The closest I have been able to find is htmlparser2, but the drawbacks there are:

- Not truly spec compliant.
- The validate method does not also return the DOM tree, so you have to parse HTML documents twice.
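
The per-node Boolean and the root-level "appears to be valid" flag described above can be related with a small sketch. The CheckedNode shape and both function names are hypothetical and only illustrate the idea; they do not match the tree format of parse5 or htmlparser2.

```typescript
// Hypothetical node shape, for illustration only.
interface CheckedNode {
  name: string;
  conforming: boolean; // true if this node itself raised no conformance error
  children: CheckedNode[];
}

// Root-level "appears to be valid": true only if every node in the
// subtree is flagged as conforming.
function subtreeConforms(node: CheckedNode): boolean {
  return node.conforming && node.children.every((child) => subtreeConforms(child));
}

// Names of the non-conforming nodes in document order, e.g. for
// per-node feedback in a grading tool.
function nonConformingNames(node: CheckedNode): string[] {
  const own: string[] = node.conforming ? [] : [node.name];
  return own.concat(...node.children.map((child) => nonConformingNames(child)));
}
```

Because the tree and its flags live in one structure, a single parse would serve both the grading feedback and the overall verdict, which is the property missing from the second drawback listed above.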

inikulin added the enhancement label Feb 10, 2017

inikulin mentioned this issue Apr 11, 2017

complain/throw on invalid html #193

Closed

stephanwlee mentioned this issue Dec 11, 2018

Removed unused import and fixed unbalanced tags tensorflow/tensorboard#1687

Merged

c0ncentus mentioned this issue May 14, 2021

tr HTML tag is disapear on specefic condition #345

Closed

jmsjtu mentioned this issue Oct 21, 2021

fix(template-compiler): Change error thrown on missing sourceCodeLocation to warning salesforce/lwc#2538

Merged

fictitious mentioned this issue Jan 23, 2022

Inconsistent handling of nested forms #387

Closed
