HTML validation - Feature PR #180

Open
diervo opened this issue Feb 10, 2017 · 14 comments

Comments

@diervo
Contributor

diervo commented Feb 10, 2017

TL;DR: would the owners of this repo be open to introducing a new API to validate a given HTML page or fragment?

Today the parser internally fixes the tree for you (incorrect self-closing tags, missing tags, etc.) and gives you the already-fixed tree.

I've been trying to find a good HTML validator, but the only one that is spec compliant is the one from W3C, which is written in Java and available only as a service, which is very inconvenient for most uses.

I believe that, given that this is the most used/compliant HTML parser, it should be pretty straightforward to add HTML validation.

Rather than creating a fork I would gladly do a PR if there is no opposition to this feature.

Thoughts?

@RReverser
Collaborator

RReverser commented Feb 10, 2017

It doesn't "fix" HTML; it parses it in accordance with the spec. This is not a fixing mechanism separate from the rest of parsing, but the normal parsing flow, in which some tags are implicit. Having implicit tags doesn't make HTML invalid according to the HTML5 spec; on the contrary, such documents are still totally valid.

@diervo
Contributor Author

diervo commented Feb 10, 2017

What does "valid" mean? For example, per the spec:
https://www.w3.org/TR/html/syntax.html#syntax-elements

Tags are used to delimit the start and end of elements in the markup. Raw text, escapable raw text, and normal elements have a start tag to indicate where they begin, and an end tag to indicate where they end. The start and end tags of certain normal elements can be omitted, as described below in the section on optional tags. Those that cannot be omitted must not be omitted

So in this case it's saying <p/> is not valid HTML but, as you pointed out, the HTML5 spec defines how to parse such cases.

So is it just that we are talking about different HTML spec versions?

My ask is to add validation as per the strict HTML spec, so I wanted to leverage your parser to detect such cases.
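To make the <p/> case concrete, here is a minimal sketch of the kind of check being asked for. It is not parse5 code: the helper name `findSelfClosedNonVoid`, the regular expression, and the hard-coded void-element list are all assumptions made for illustration. A plain text scan like this ignores comments, quoted or unquoted attribute values that contain `/` or `>`, and foreign content such as SVG (where self-closing syntax is allowed), which is exactly why a spec-compliant tokenizer would be the better place to detect it.

```typescript
// Elements that never have an end tag; a trailing "/" on these is tolerated.
// This list is an assumption of the sketch, not read from any library.
const VOID_ELEMENTS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr",
]);

// Hypothetical helper: returns the lowercased names of non-void elements
// that were written with self-closing syntax, in document order.
function findSelfClosedNonVoid(html: string): string[] {
  // Naive start-tag pattern: "<", a tag name, anything but "<" or ">", then "/>".
  const startTag = /<([A-Za-z][^\s\/<>]*)[^<>]*\/>/g;
  const offenders: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = startTag.exec(html)) !== null) {
    const name = match[1].toLowerCase();
    if (!VOID_ELEMENTS.has(name)) offenders.push(name);
  }
  return offenders;
}

// <p/> and <div .../> are flagged; <br/> is a void element and is not.
const offenders = findSelfClosedNonVoid('<p/><br/><div class="a"/><span></span>');
console.log(offenders);
```

A parser-level report would carry the same information, plus a source location, without needing a second pass over the text.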

@inikulin
Owner

Well, any HTML is valid; however, it can be non-conforming, in which case the spec says to report a parse error. I believe that having a validator is a good thing for some use cases, e.g. knowing that HTML is conforming assures you that it is safe for parse-serialize round trips, which in turn makes HTML instrumentation safe as well. We had this discussion before: #55. And it is still blocked on whatwg/html#1339. I'll keep it open as there is demand for the feature; however, I wouldn't expect it to be implemented soon.

@diervo
Contributor Author

diervo commented Feb 10, 2017

We would be open to helping with the standardization and implementing the changes if no one has a strong argument against it.

We are already working with @caridy and @domenic on some other HTML spec questions.
Let me test the waters :)

@inikulin
Owner

@diervo It would be great!

@caridy

caridy commented Feb 10, 2017

I think @inikulin has the right intuition here: it is not about validation but about conformance, and if the parser can provide a report about the conformance of the parsed document, that should be sufficient for developers to:

  1. be confident that a conforming document was parsed correctly.
  2. use the report as a feedback loop to the end user (e.g.: show error/warning messages in a linter/IDE).
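The two uses above could be served by a report shaped roughly like the sketch below. Everything here is hypothetical: `ConformanceError`, `ParseReport`, `isConforming`, and `toLintMessages` are illustrative names, not an existing parse5 API, and the error code in the sample is only there to show the shape of an entry.

```typescript
// Hypothetical shape of one reported conformance problem.
interface ConformanceError {
  code: string; // machine-readable error name
  line: number; // 1-based line in the source
  col: number;  // 1-based column in the source
}

// Hypothetical result of a single parse: the tree plus everything reported.
interface ParseReport<T> {
  tree: T;
  errors: ConformanceError[];
}

// Use 1: a document is conforming when nothing was reported while parsing it.
function isConforming<T>(report: ParseReport<T>): boolean {
  return report.errors.length === 0;
}

// Use 2: turn the report into messages a linter or IDE could display.
function toLintMessages<T>(report: ParseReport<T>, file: string): string[] {
  return report.errors.map((e) => `${file}:${e.line}:${e.col} error ${e.code}`);
}

// Hand-written sample standing in for real parser output.
const sample: ParseReport<string> = {
  tree: "(tree placeholder)",
  errors: [
    { code: "non-void-html-element-start-tag-with-trailing-solidus", line: 1, col: 1 },
  ],
};

console.log(isConforming(sample));
console.log(toLintMessages(sample, "index.html"));
```

Keeping the tree and the error list in one result means a caller gets both from a single parse.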

@RReverser
Collaborator

Personally, conformance checkers feel like a thing of the past to me, from the days when we had to check our HTML with the online W3C tool to be sure that it would be parsed correctly (or parsed at all) by all the different browsers. Now that they all follow the same spec (apart from temporary bugs), that feels less useful, but I certainly don't oppose it if there are valid use cases.

@inikulin
Owner

@RReverser you're right that cross-browser compatibility is not an issue anymore (kinda), but it might be useful to ensure that the provided markup will be interpreted as intended, because in some cases auto-corrections may screw things up.

@tmpfs

tmpfs commented Mar 4, 2017

@diervo FYI, you can run the Nu validator locally using java -jar; it just requires that you have Java 1.8 installed. I would like parse5 to generate a report that would help detect conformance errors, but in my experience the W3C validator is excellent and well maintained.

@diervo
Contributor Author

diervo commented Mar 4, 2017

Yes, we have been poking around with it; the fundamental problem is integration. The fact that it is in Java adds some complexity to our integration scenarios. We will start working soon on the very first step of adding validation to parse5; hopefully we can do it in incremental steps, given all of the possible parse errors and nuances.

@WilcoFiers

@diervo @inikulin It's been a while. Has any progress been made on this feature?

@diervo
Contributor Author

diervo commented Sep 24, 2018

We have done a lot of work in the past on the HTML spec and added the proper error names and descriptions. However, we never finished porting those to parse5 and optionally validating against these issues.

@inikulin I haven't forgotten that we still owe you a bunch of work :)
Maybe we should talk about when would be a good time to collaborate again?

@inikulin
Owner

@diervo Sure, I'd love to finish this work. I've also started to do some spec work for the tree construction stage on my own. Hit me up by email and we'll try to figure out the best time to finish it.

@blizzardengle

It seems this feature may be dead, but I would like to resurrect it to say that I personally have been searching for an HTML parser that can perform basic validation/conformance checking.

I teach at a university and have built a tool to auto-grade programming projects. It detects and provides feedback on errors/issues in students' code and, of all things, HTML has been the hardest to find a parser for. The ideal would be every node being marked with a Boolean of true (valid/conforming) or false (invalid/nonconforming), but I would be happy with a general "appears to be valid" option set on the root node.

The closest I have been able to find is htmlparser2, but the drawbacks there are:

  1. Not truly spec compliant.
  2. The validate method does not also return the DOM tree, so you have to parse HTML documents twice.
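The per-node Boolean described above could be derived in one pass once a parser attaches its conformance errors to the nodes they concern. The sketch below assumes such a tree already exists: `SketchNode`, `MarkedNode`, and `markTree` are hypothetical names, and the input is hand-built rather than produced by any real parser. One design choice to note is that a node is marked nonconforming when it or any of its descendants has an error, so the flag on the root doubles as the general "appears to be valid" answer.

```typescript
// Hypothetical input: a parsed tree whose nodes carry the errors attributed to them.
interface SketchNode {
  name: string;
  errors: string[];
  children: SketchNode[];
}

// Hypothetical output: the same tree with a Boolean on every node.
interface MarkedNode {
  name: string;
  conforming: boolean; // false if this node or any descendant has an error
  children: MarkedNode[];
}

// Marks children first, then folds their flags into the parent.
function markTree(node: SketchNode): MarkedNode {
  const children = node.children.map(markTree);
  const conforming =
    node.errors.length === 0 && children.every((child) => child.conforming);
  return { name: node.name, conforming, children };
}

// Hand-built example: one <p> with an error next to a clean <div>.
const input: SketchNode = {
  name: "body",
  errors: [],
  children: [
    { name: "p", errors: ["illustrative-error"], children: [] },
    { name: "div", errors: [], children: [] },
  ],
};

const marked = markTree(input);
console.log(marked.conforming, marked.children.map((c) => c.conforming));
```

Because the marked tree is itself the DOM-like tree, a grader would not need to parse the document a second time to get both the structure and the verdict.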

7 participants