Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Translation tooltip images have extra encoding in paths #81

Open
turnbullerin opened this issue Jul 21, 2022 · 3 comments
Open

Translation tooltip images have extra encoding in paths #81

turnbullerin opened this issue Jul 21, 2022 · 3 comments

Comments

@turnbullerin
Copy link

turnbullerin commented Jul 21, 2022

Hi!

I'm setting up ERDDAP behind an nginx proxy and using the sub module to rewrite the HTML content to have the proper URLs for our host (mostly because I'm planning a federated install at some point in the future and will only have one host name, so I'm doing something like http://publichost/1/erddap --> http://erddap1:8080/erddap, http://publichost/2/erddap --> http://erddap2:8080/erddap).

One issue I've noticed with this is that some links are getting very encoded to the point where it's hard to make good rewrite rules (notably where there's a tooltip and an image) - this seems to be the doing of XML.encodeAsHTMLAttribute() which is converting everything that isn't a digit or letter into an entity. The result is a URL like http://erddap1.org:8080/erddap/index.html is actually http:&x2f;&x2f;erddap1&x2e;8080&x2f;erddap&x2f;index&x2e;png which makes using sub more challenging.

It would be nice for these URLs to be consistent with the rest of the URLs on the page (which are not encoded in this fashion). I think it is safe to include :/. in the characters that encodeAsHTMLAttribute() would not encode at least, and generally only &, quotation marks, and those characters not supported in your character encoding need encoding (so codepoints > 255)

@turnbullerin turnbullerin changed the title Translation tooltip image breaks with proxy Translation tooltip images have extra encoding in paths Jul 21, 2022
@BobSimons
Copy link
Collaborator

I admit that some of the text (e.g., urls) in HTML attributes are excessively encoded.

However, I used to have a method (I think it was called minimallyEncodeAsHTMLAttribute()) which encoded a few more characters than what you suggested to minimally encode text that was stored as HTML attributes. However, that is insufficient for security reasons, notably the danger or cross-site scripting (XSS). See
https://wonko.com/post/html-escaping and
https://www.owasp.org/index.php/XSS_%28Cross_Site_Scripting%29_Prevention_Cheat_Sheet#Output_Encoding_Rules_Summary [that's now out-of-date, perhaps you can find the newer reference].

You can say, "yeah, but surely you don't need to encode that aggressively". But I think the problem was that ERDDAP was being flagged by security scans because of the possibility of XSS vulnerabilities. The mere presence of certain characters (characters you wouldn't think to be trouble) was flagged.

So the situation is not a simple as one might expect and the safe thing was to aggressively encode the attributes (which has not caused any problems for ~8 years, until now). I suspect the happy medium is somewhere between what minimallyEncodeAsHTMLAttribute did and the current system, but it is exceedingly hard to know what that set of characters that need to be encoded (to avoid trouble with security scans) is.

So, I will try to less aggressively encode those particular urls so the problem goes away for you.

You say "it's hard to make good rewrite rules", but encodeAsHTMLAttributes just has one rule: if the character isn't a letter or a digit, then encode it as &#xhh; (where hh is the hexadecimal version of the character number). That is a legitimate encoding of the HTML attribute. So it seems like a solution for you is simply to decode those HTML attributes via the inverse of that rule. This is a subset of the standard way of encoding HTML attributes. You'll always have to deal with the possibility of some characters being encoded this way. I don't know what tools you have at your disposal in NGINX, but hopefully you can write the inverse of that rule if a built in function doesn't already exist.

I admit I don't quite understand the problem you face. After all, when the browser reads the HTML, one of the first things it does is decode the attributes. It's not like a browser is ever going to redirect anyone to http:&x2f;&x2f;erddap1&x2e;8080&x2f;erddap&x2f;index&x2e;html. User's get redirected to the decoded URL. So the problem is that your system (NGINX?) isn't decoding the attribute, which is something it needs to do in any case. The current encoding may be excessive, but it isn't illegal or incorrect.

@BobSimons
Copy link
Collaborator

You said "it's hard to make good rewrite rules", but you didn't say what tools you have at your disposal. Since you say it's hard,
I'm guessing it isn't a programming language and uses regular expressions (regex's) instead.
I'm guessing you're trying to match the stated URL and replace it with something else, so the real problem is finding the first part of the URL.
If I'm guessing correctly, can't you just use a capture group
(variant1|variant2|variant3) //but with any number of variants
to match the different forms of the first part of the URL?, e.g.,
(http://my.domain:8080/|http&x3a;&x2f;&x2f;my&x2e;domain&x3a;8080&x2f;)
?
I could easily be wrong about your situation. If so, please tell me more of the details of what you need to do and what tools you have (e.g., regex's).

@BobSimons
Copy link
Collaborator

I found the updated link for the OWASP Cheatsheets Book:
See https://owasp.org/www-pdf-archive/OWASP_Cheatsheets_Book.pdf
page 188, section 25.4, says
Encoding Type: HTML Attribute Encoding
Encoding Mechanism:
Except for alphanumeric characters, escape all characters with the HTML Entity &#xHH; format,
including spaces. (HH = Hex Value)"

So my current method is following the OWASP advice exactly.
I think the security scan report pointed me to this originally, as I wouldn't have found it on my own.
So I am disinclined to change the encodeAsHTMLAttribute() method in my code, as it is likely that this will just generate warnings/errors on the security scan report.

It would be great if we could solve the problem you face via a change in your NGINX configuration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants