HTML Entity Name munging in XML listings #103

Open
benjwadams opened this issue May 22, 2023 · 1 comment

@benjwadams

ERDDAP does some bizarre name munging to HTML entities in XML listings.

For example, in https://gcoos4.tamu.edu/erddap/metadata/iso19115/xml/ there are numerous href values like this:
2004JuvenileSportfishNOAA_DATA_Mean_v0_0_iso19115.xml

Most browsers will transform this, but I have had issues with following links in some Python libraries if these HTML entities aren't explicitly unescaped beforehand. It's also a pretty odd way to represent simple characters like periods and underscores where the usual characters would suffice. Is there any reason why these characters shouldn't be used directly instead of being encoded as HTML entities?
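As a workaround on the client side, the entities can be decoded before the link is resolved. The following is only a minimal sketch: `resolve_href` and `BASE_URL` are hypothetical names, and the exact entity spelling in `raw` (`&#x5f;` for the underscore, `&#x2e;` for the period) is an assumption about what the listing's source contains, not something copied from it.

```python
from html import unescape
from urllib.parse import urljoin

# Directory listing mentioned above; used as the base for relative hrefs.
BASE_URL = "https://gcoos4.tamu.edu/erddap/metadata/iso19115/xml/"


def resolve_href(raw_href: str, base_url: str = BASE_URL) -> str:
    """Decode HTML character references in an href, then make it absolute."""
    # unescape() turns numeric references such as &#x5f; back into "_".
    return urljoin(base_url, unescape(raw_href))


# Assumed raw attribute value as it might appear in the page source.
raw = "2004JuvenileSportfishNOAA&#x5f;DATA&#x5f;Mean&#x5f;v0&#x5f;0&#x5f;iso19115&#x2e;xml"
print(resolve_href(raw))
```

A parser that follows the HTML or XML rules for attribute values should do this decoding on its own; the explicit `unescape` call is only needed when the href is pulled out with a regex or plain string handling.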

@BobSimons
Collaborator

It is the attributes of HTML and XML tags that must be strongly encoded, for security reasons. The code that does this is in com/cohort/util/XML.java in the method called encodeAsHTMLAttribute. The JavaDoc for that method explains:

 * For security reasons, for text that will be used as an HTML or XML attribute, 
 * this replaces non-alphanumeric characters with HTML Entity &#xHHHH; format.
 * See HTML Attribute Encoding at
 * https://owasp.org/www-pdf-archive/OWASP_Cheatsheets_Book.pdf
 * pg 188, section 25.4 
 * "Encoding Type: HTML Attribute Encoding
 * Encoding Mechanism: 
 * Except for alphanumeric characters, escape all characters with the HTML Entity &#xHH;
 * format, including spaces. (HH = Hex Value)".
 * On the need to escape HTML attributes: http://wonko.com/post/html-escaping

Both of the links there are interesting reading.

One might argue that in some circumstances this strict encoding is not necessary. Perhaps. Perhaps not. The problem is that trying to make that determination is very time-consuming (even if we assume the programmer has 100% understanding of the situation) and error-prone. It is vastly simpler and (more importantly) vastly safer to just routinely encode all attributes in the safe and recommended way.
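To make the rule quoted from the JavaDoc concrete, here is a rough Python sketch of it. This is not the code in com/cohort/util/XML.java; the function name, the lowercase unpadded hex formatting, and the choice to treat only ASCII letters and digits as alphanumeric are all assumptions made for illustration.

```python
from html import unescape


def encode_as_html_attribute(text: str) -> str:
    """Replace every character except ASCII letters and digits with &#xHH;."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isalnum():
            out.append(ch)  # alphanumerics pass through unchanged
        else:
            out.append(f"&#x{ord(ch):x};")  # everything else becomes a hex entity
    return "".join(out)


encoded = encode_as_html_attribute("v0_0_iso19115.xml")
print(encoded)
# Any conforming HTML/XML parser decodes the value back to the original text.
print(unescape(encoded))
# A quote can no longer close the attribute early and inject a new one.
print(encode_as_html_attribute('x" onmouseover="alert(1)'))
```

Because there is no per-character decision about which punctuation is safe, the same function works for every attribute, which is the simplicity argument made above.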
