Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Certain characters in attribute values are not being escaped resulting in invalid HTML #198

Open
jasonkres opened this issue Jul 31, 2015 · 9 comments

Comments

@jasonkres
Copy link

Characters that should be escaped are not escaped when they occur in attribute values.
That is, a valid HTML5 document (e.g., per https://html5.validator.nu/) can be passed to CsQuery and come out as an invalid HTML5 document.

I have tested this 1.3.3249 (NuGet) and CsQuery master. The problem reproduces in both.

Not sure, but suspect that because this affects escaping, this may be a security problem where undesired HTML can be injected.

Program:

using System;
using CsQuery;

namespace CsQueryTestApp
{
    class Program
    {
        static void Main(string[] args)
        {
            string h = "<!DOCTYPE html><html><head><meta charset=\"UTF-8\"><title>test</title></head><body><a href=\"http://www.example.com?a=1&amp;b=2\">Hello &amp; World</a></body></html>";
            Console.WriteLine("Valid per https://html5.validator.nu/:");
            Console.WriteLine(h);
            Console.WriteLine("Invalid per https://html5.validator.nu/:");
            Console.WriteLine(CQ.CreateDocument(h).Render());
        }
    }
}

Output:

Valid per https://html5.validator.nu/:
<!DOCTYPE html><html><head><meta charset="UTF-8"><title>test</title></head><body><a href="http://www.example.com?a=1&amp;b=2">Hello &amp; World</a></body></html>
Invalid per https://html5.validator.nu/:
<!DOCTYPE html><html><head><meta charset="UTF-8"><title>test</title></head><body><a href="http://www.example.com?a=1&b=2">Hello &amp; World</a></body></html>

Message from validator:

Error: & did not start a character reference. (& probably should have been escaped as &amp;.)
At line 1, column 117
example.com?a=1&b=2">Hello &am
@schmalls
Copy link

I have also encountered this issue with quotes and angle brackets.

Before:

<span class="email" data-href="mailto:&quot;John Doe&quot; &lt;john.doe@example.com&gt;">

After:

<span class="email" data-href='mailto:"John Doe" <john.doe@example.com>'>

As you can see, the entity references disappeared and the quote style switched. The angle brackets in the attribute cause it to no longer be valid.

@jamietre
Copy link
Owner

The angle brackets are quoted, they don't need to transformed. That validates fine. The & is a problem.

@schmalls
Copy link

I am using Prince to convert from XHTML to PDF and it gives me an error saying the Unescaped '<' not allowed in attribute values. So, it may be they are more strict than the validator. My question is, why are the entities being converted in the first place? If they weren't, it would fix both issues.

@jamietre
Copy link
Owner

CsQuery is a parser.. Though this is really HtmlParser at work here for the
input When it loads a document it stores the actual value and not whatever
representation created it from the source. In a pre tag for example an & is
really an &. If we just stored the ASCII that came from the source then it
would be impossible to treat text data as something that is usable in any
context.

But HtmlParser is not broken. The problem is just in how I render it. &
needs to be escaped in attribute values always.

I am using Prince http://www.princexml.com to convert from XHTML to PDF
and it gives me an error saying the Unescaped '<' not allowed in attribute
values
. So, it may be they are more strict than the validator. My question
is, why are the entities being converted in the first place? If they
weren't, it would fix both issues.


Reply to this email directly or view it on GitHub
#198 (comment).

@jasonkres
Copy link
Author

So, it may be they are more strict than the validator.

Ideally, should be pluggable how these are handled when they occur in attributes.
?? Similar to how you handle other encoding issues with the HtmlEncoders ??
For this to be most useful, the encoder routine needs some context (especially: whether ' or " is the currently expected attribute delimiter).

@jamietre
Copy link
Owner

The code is here

public static string AttributeEncode(string text, bool alwaysQuote, out string quoteChar)...

It handles the quote context, but not the ampersand. The ampersand is the only character that requires encoding, and technically it only needs to be encoded if it is ambiguous -- so I'm actually surprised that validator.nu fails the example above.

I agree, there should be some control over how the attributes are generated. It might be desirable to use escaping for even legitimate characters for the same reason as elsewhere in the document (e.g. just for readability of unprintable characters in the source code). I can't think of a reason why the same translation used for the rest of the HTML wouldn't work in the attribute context.

There handling chooses apostrophe or double-quote in order to minimize the need for escaping quotes, but seems like this should be choosable (e.g. single quote, double quote, or "best fit") anyway.

@jasonkres
Copy link
Author

and technically it only needs to be encoded if it is ambiguous

Of course, if the document is XHTML, then there is no flexibility on ampersands. In the case of working with the XML serialization of HTML5, you can't even tell from the doctype whether it is supposed to be XML, so I guess if you have you have to either play safe: look for the http://www.w3.org/1999/xhtml namespace, or provide options for the user to tell you want they want to do. So flexibility on this is good...

@schmalls
Copy link

There are three characters that have to be escaped for attributes in XHTML it looks like from the XML specification. They are <, &, and ".

@jasonkres
Copy link
Author

Yes, escaping < is actually a must in XML spec; & and whichever of " or ' is not the delimiter, (of course.)
The relevant production from the spec is (where ^ indicates negation):

AttValue ::= '"' ([^<&"] | Reference)* '"'  
             | "'" ([^<&'] | Reference)* "'" 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants