PercentCodec performs destructive decoding of valid UTF-8 byte sequences #667

Open

Aakash4396 opened this issue Mar 15, 2022 · 32 comments

@Aakash4396

If special characters (测试) are present in the name of a file, then the ESAPI validator fails to match the input.
These characters are encoded before being passed to the validator, but internally the call goes to
ESAPI.validator().getValidInput("setHeader", strippedValue, "HTTPHeaderValue", sc.getIntProp("HttpUtilities.MaxHeaderValueSize"), false);

which internally calls
getValidInput(context, input, type, maxLength, allowNull, true);
Point to be noted: input is the string that has to be validated against the regex.

This in turn calls the StringValidationRule class's getValid(context, input); method.

The main problem occurs in the getValid(context, input); method.
Here the input that is passed in for validation is decoded with encoder.canonicalize(input); into data, to print in the log message,
but when these parameters are passed to the checkWhitelist(context, data, input); method they are passed in reverse order,
because internally checkWhitelist(context, data, input); uses data to match against the regex, instead of using input.

The actual call to checkWhitelist should be checkWhitelist(context, input, data);
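
Roughly, the flow I'm describing inside getValid looks like this (my paraphrase from reading StringValidationRule, not the exact source):

    // My paraphrase of what happens inside StringValidationRule.getValid
    // (illustrative only, not the actual ESAPI source):
    public String getValid(String context, String input) throws ValidationException {
        // the raw input is canonicalized first
        String data = encoder.canonicalize(input);

        // the canonicalized data is what gets matched against the regex;
        // the raw input is only carried along for exception/log messages
        checkWhitelist(context, data, input);

        return data; // (simplified)
    }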

@jeremiahjstacey
Collaborator

On the surface it would seem that the environment whitelist configuration needs to be adjusted to allow the canonicalized data in the format you're expecting.

From the documentation in the class, the workflow is implemented as intended, where the value of input passed to checkWhitelist method is the canonicalized form of the raw input received.

If you're using the default configurations distributed for ESAPI, then there may be an issue with one of the regexes in relation to the canonicalized form of this input.


Based on the documentation of StringInputValidator.checkWhitelist, the parameters are being passed in the intended order.

	/**
	 * checks input against whitelists.
	 * @param context The context to include in exception messages
	 * @param input the input to check
	 * @param orig A origional input to include in exception
	 *	messages. This is not included if it is the same as
	 *	input.
	 * @return input upon a successful check
	 * @throws ValidationException if the check fails.
	 */
	private String checkWhitelist(String context, String input, String orig) throws ValidationException

StringValidationRule.java

The value of input parameter in the workflow you're describing is the canonicalized form of the original input string. That is the value intended to be verified against the whitelist. That is evaluated on line 275 of the getValid method.

@Aakash4396
Author

In the previous version of the ESAPI jar, 2.2.1.1, the raw input was passed directly to checkWhitelist without canonicalization, but now in 2.3.0.0 the raw input is canonicalized and then passed to checkWhitelist. This change causes canonicalization of raw input that contains Unicode characters (测试) to fail: because canonicalize doesn't handle Unicode characters, the canonicalized form of the raw input contains garbage characters, which makes the pattern matching fail. This used to work with the previous version of the jar because the raw input was passed directly to checkWhitelist.
Now, while setting a header on the response, ESAPIUtilities.getHttpUtilities().setHeader(...); is called, which internally calls ESAPI.validator().getValidInput(...);, and getValidInput() sets the canonicalize flag to true, which results in a failure to add the header.

Is there any way to get around this?

@xeno6696
Collaborator

xeno6696 commented Apr 5, 2022

Well, yes. HTTP headers were never defined for anything other than ASCII. You have to encode characters represented with bytes larger than 127 (0x7F). Use percent encoding on non-ASCII data.

UTF-8 headers are informally supported by some browsers and web application servers, but it’s best practice not to let the environment make those decisions for you.

From ESAPI’s perspective we could discuss allowing a canonicalize-less version of that method, but you can solve your problem today by percent encoding and calling it done.

@kwwall
Contributor

kwwall commented Apr 5, 2022

First off, let me say that we knew that such "breakage" was possible. It was noted in ESAPI issue #588 (specifically, in my comment at #588 (comment)). IMO, it was a bug, and we ought not to be relying on a bug for backward compatibility. I believe the way it behaves now is correct (unless of course we botched something in the attempted fix; looking back on issue #588, the PR #611 which "closed" it appears to have only touched "test" files. @xeno6696 - could you take a look and provide an explanation? Thanks.)

@xeno6696 wrote:

From ESAPI’s perspective we could discuss allowing a canonicalize-less version of that method, but you can solve your problem today by percent encoding and calling it done.

Yes, we could do that. @planetlevel even mentioned something like that in issue #588... providing something like a getOriginalAndPossiblyUnsafeInput() method, that would do the regex validation, but skip the canonicalization. That is not generally recommended though.

@xeno6696
Collaborator

xeno6696 commented Apr 5, 2022

@kwwall the first commit, 92a96ec, in the sequence was the actual logic change.

There was some reverse-engineering in #588 breaking apart what I did and why, but the actual logic change was just adding the data parameter to the whitelist/blacklist calls, which somehow got missed back in '17. Everything else was testing, as you pointed out. Like this question, it was mostly a "behavior changed, is it the right behavior?" kind of question/analysis.

We are still doing the right thing here.

@kwwall
Contributor

kwwall commented Apr 5, 2022

@xeno6696 - My bad. Not sure how I missed that commit, especially when I was looking for it. I think I had false expectations that maybe you swapped the formal parameters 'input' and 'orig' in the private method definition rather than the actual parameter order in the call. Anyhow, thanks for the quick response. I knew it had to be there somewhere.

That said, I should have done a better job in the 2.2.3.0 release notes of calling attention to issue #588 and our fix for it, and to how that bug fix could potentially break "backward compatibility". I will take the blame for that.

Lastly, unless someone can point to a revised RFC or a W3C standards document that allows HTTP header values to be non-ASCII, I think that we should close this. If someone such as @Aakash4396 wants something like the aforementioned getOriginalAndPossiblyUnsafeInput() method added, they should create a new issue explicitly marked as an 'Enhancement' that requests this, and try to add a small paragraph or two on why they believe it is justified. But it is my understanding that HTTP header values are supposed to be ASCII (or safely encoded as such). If they are not, we are not going to be able to really address their validation using a regex approach.

I'll leave this issue open for a few more days to give it some more time for discussion, but unless someone can point out that non-ASCII header values (and names even??) are now permitted, I will opt to close this as 'wontfix'.

@xeno6696
Collaborator

xeno6696 commented Apr 5, 2022

I had some free time over lunch. I created the following unit test:

    public void testUnicodeCanonicalize() {
    	Encoder e = ESAPI.encoder();
    	String input = "测试";
    	String expected = "测试";
    	String output = e.canonicalize(input);
    	assertEquals(expected, output);
    }

At any rate, @Aakash4396, if the characters you're getting at the time canonicalize is called are 测试, then your issue is something else entirely, because this passes. If you don't believe me, toss that exact code in a JUnit test class and run it yourself.

If it fails, it's possible platform encoding is getting in the way. Windows 10 standardized on UTF-8 as the default byte encoding, but if you're running anything prior to that it's cp-1252, and that can transform bytes under your feet. The same error can happen in your browser or on the target web server, if some default somewhere is set to something other than UTF-8.
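
If you want to see that effect in isolation, here's a quick illustration (plain JDK, nothing ESAPI-specific):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class MojibakeDemo {
        public static void main(String[] args) {
            // UTF-8 bytes for "测试": E6 B5 8B E8 AF 95
            byte[] utf8 = "测试".getBytes(StandardCharsets.UTF_8);

            // decoding with the right charset round-trips cleanly
            System.out.println(new String(utf8, StandardCharsets.UTF_8));      // 测试

            // decoding the same bytes with a legacy single-byte charset mangles them
            System.out.println(new String(utf8, StandardCharsets.ISO_8859_1)); // mojibake, not 测试

            // the platform default is what silently gets used when no charset is specified
            System.out.println(Charset.defaultCharset());
        }
    }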

@Aakash4396
Author

Hi @xeno6696,
The characters I get after canonicalize are not the same as the ones passed in, because the characters are encoded before being passed.

The use case is:
The characters 测试 are encoded by passing them to the ESAPIUtilities.getEncoder().encodeForURL(...) function.
The characters get converted to %E6%B5%8B%E8%AF%95 and are passed as input.
Later, when this input is passed through canonicalize, it returns �� characters, which fail to match the provided pattern.
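
Trimmed down, the code path I'm exercising looks like this (the values in the comments are what I observe with 2.3.0.0):

    import org.owasp.esapi.ESAPI;
    import org.owasp.esapi.Encoder;
    import org.owasp.esapi.errors.EncodingException;

    public class HeaderValueRepro {
        public static void main(String[] args) throws EncodingException {
            Encoder encoder = ESAPI.encoder();

            String original = "测试";
            String encoded = encoder.encodeForURL(original);  // "%E6%B5%8B%E8%AF%95"
            String decoded = encoder.canonicalize(encoded);   // expected "测试", but garbage comes back

            System.out.println(encoded);
            System.out.println(decoded);  // this value then fails the HTTPHeaderValue pattern match
        }
    }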

@xeno6696
Collaborator

xeno6696 commented Apr 6, 2022

Ah, see, you needed to LEAD with that information. You would have saved a ton of time and energy if you had said:
"My input is %E6%B5%8B%E8%AF%95 and my expected output is 测试 but what I'm getting is æµ�è¯�."

At any rate, I should have a solution ready quickly. Here's the culprit:

[screenshot]

When I converted the API over to be non-destructive to higher-order bytes, I only converted the HTMLEntityCodec, leaving the rest as an exercise for the future. The future is now. lol

@kwwall
Contributor

kwwall commented Apr 6, 2022

@xeno6696 - Now that you have a better grasp of the issue, do you think you could maybe suggest a revision of the issue title? I think what is listed here is way too broad, as seen by the rabbit trail that we went down.

@xeno6696 changed the title from "DefaultHTTPUtilities.java fails to set header, ESAPI.validator() method call fails to validate the header value" to "PercentCodec performs destructive decoding of valid UTF-8 byte sequences" on Apr 6, 2022
@Aakash4396
Author

Hi @kwwall , @xeno6696
Thanks for the update.
Is there any timeline for this to be fixed, and is there any workaround until then?

@kwwall
Contributor

kwwall commented Apr 28, 2022

@Aakash4396 -- only partly in jest, one thing that would probably speed up a fix for this is if you could submit a PR to fix it. We'd credit you in the next release notes for your assistance. And if you can't do that, at least maybe contribute a few other cases where it is failing. But @xeno6696 would have to be the one who gives you an estimate of when it could be fixed, since I don't know his schedule, and I'm not sure what is wrong or how to fix this. I don't think I understand enough about code points to know how to fix it, and I expect that is partly what is involved here.

@planetlevel

https://stackoverflow.com/questions/155892/unicode-url-decoding

@xeno6696
Collaborator

xeno6696 commented Apr 28, 2022

I started working on it, but timelines when dealing with something that isn't a job are funky things. It should be in the next release, hopefully within a couple of months, but honestly, if I convert one codec, I should convert them all. PercentCodec has a thorn in that, IIRC, it's also based on Latin-1, so to do this correctly I should create a version of each, and we should consider moving the default encoding to purposefully shake some things up. =-)

@xeno6696
Collaborator

All of that is to say, I'll get to it when I get to it; if you want it done faster, we accept MOST PRs. I could share a patch for my halfway conversion as it sits now if you want to run with it.

@xeno6696
Collaborator

I missed @planetlevel 's link, but yes, I intend to ensure we support UTF-8 encoding/decoding.

@kwwall
Contributor

kwwall commented Apr 28, 2022 via email

@xeno6696
Collaborator

xeno6696 commented Apr 28, 2022

It's not as helpful as you might think--they do so without any of the decoding baggage.

Which is exactly what's broken here. The decoding.

Encoding is simple: have a codePoint/byte, get an encoded URI string. Because Java is UTF-16 under the hood, characters that can be represented by integer code points are trivial to convert into UTF-8 URI strings.

On decode we need to detect the correct byte-string representation, which is why their code isn't super useful to us. This is also--I suspect--a key reason they chose only to encode.

That entire problem goes away.

But at any rate, the decode logic needs to be able to detect %E6%B5%8B and correctly output 测.

So the easy part is converting the existing codec to do Unicode; the harder part is ensuring we properly detect and handle these URL-to-byte sequences.

It's not super hard, but like I said I need the better part of a day to get in the flow.
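
To make the detection concrete, here is roughly the shape of what I mean (a throwaway sketch, not what will ship in PercentCodec; validation of malformed escapes is omitted):

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;

    public class Utf8PercentDecodeSketch {

        // Collect consecutive %XX escapes into a byte buffer and decode the buffer as UTF-8,
        // instead of mapping each %XX to a character on its own (which is what destroys
        // multi-byte sequences today).
        static String decode(String s) {
            ByteArrayOutputStream pending = new ByteArrayOutputStream();
            StringBuilder out = new StringBuilder();
            int i = 0;
            while (i < s.length()) {
                char c = s.charAt(i);
                if (c == '%' && i + 2 < s.length()) {
                    // accumulate the raw byte; a real codec must also validate the hex digits
                    pending.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                    i += 3;
                } else {
                    flush(pending, out);
                    out.append(c);
                    i++;
                }
            }
            flush(pending, out);
            return out.toString();
        }

        private static void flush(ByteArrayOutputStream pending, StringBuilder out) {
            if (pending.size() > 0) {
                out.append(new String(pending.toByteArray(), StandardCharsets.UTF_8));
                pending.reset();
            }
        }

        public static void main(String[] args) {
            System.out.println(decode("%E6%B5%8B%E8%AF%95")); // 测试
            System.out.println(decode("a%20b"));               // a b
        }
    }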

@kwwall
Contributor

kwwall commented Apr 28, 2022

@xeno6696 - Tell you what. I'll give you the weekend off to work on it! 😂

@xeno6696
Collaborator

[image]

@kwwall
Contributor

kwwall commented Apr 30, 2022

@xeno6696 - Note that when you extend the various codecs to handle Unicode, there is a Javadoc comment on the Encoder.canonicalize(String input, boolean restrictMultiple, boolean restrictMixed) interface method that states:

| Note that canonicalize doesn't handle Unicode issues, it focuses on higher level encoding and escaping schemes.

That should be corrected or removed.

@xeno6696
Collaborator

Right. The issue here is that our encoding/decoding only handles Latin-1 byte renderings. ‘%E6%B5%8B’ is rendering into incompatible and unrecoverable gibberish.

@xeno6696
Collaborator

Oh wait… you’re saying the doc is out of step.

@kwwall
Contributor

kwwall commented Apr 30, 2022 via email

@xeno6696
Collaborator

xeno6696 commented Apr 30, 2022

So @kwwall, that comment can stay. The mistakes in the current implementation actually break that contract. Not blaming whoever came before, but their mindset was clearly from before UTF-8's dominance.

I would actually be interested in doing a performance test of the Encoder project vs. ESAPI once this is finished. On paper theirs should be faster, but the change I'm working on has us treating all the underlying data as raw ints, which should give significant memory and speed advantages.

@xeno6696
Collaborator

Just so you know @kwwall, I took some time to double-check my thinking here, and I'm questioning the approach I took when I rewrote this part of the library back in '17. Yes, I made a ton of bugs go away that relate to encoding issues, but something doesn't smell right.

I double-checked the char data type, and it's supposed to be good for the full range \u0000-\uffff. That should handle everything Unicode, right? OKAY... this is coming back to me. Unicode characters actually span the range U+0000..U+10FFFF.

So... I think that's what forced me into jumping to integers. From the Java language docs: char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).

Integer was the smallest data type available for me to be able to capture sequences like %E6%B5%8B and ultimately register them as 0x00e6b58b.

I wonder, before I go too far on this one, whether I shouldn't set up a series of boundary tests and consider confining my implementation more firmly to the 21-bit range that character encodings actually encompass.

This is where I wish I took better notes sometimes.
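
For my own notes, the char/int boundary I'm talking about looks like this in plain JDK terms (nothing ESAPI-specific):

    public class CodePointNotes {
        public static void main(String[] args) {
            // "测" (U+6D4B) sits in the Basic Multilingual Plane: one char is enough
            String bmp = "测";
            System.out.println(bmp.length());              // 1

            // U+1F600 is outside the BMP, so it needs a surrogate pair (two chars)
            String astral = new String(Character.toChars(0x1F600));
            System.out.println(astral.length());           // 2
            System.out.println(astral.codePointAt(0));     // 128512 -- only an int can hold the code point

            // the full Unicode range tops out at U+10FFFF, i.e. 21 bits
            System.out.printf("0x%X%n", Character.MAX_CODE_POINT); // 0x10FFFF
        }
    }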

@ghost

ghost commented Jun 30, 2022

Hi, is there any further progress on this ticket? We are impacted by this issue.

@xeno6696
Collaborator

xeno6696 commented Jul 8, 2022

I'm sure you're impacted, @dlgma0, but you're dealing with volunteer hours.

I won't be able to make next week's release. The reason I didn't code this back in '17 is that it dawned on me--like it did this morning--that handling the kinds of characters you're talking about requires the percent codec to have UTF-8 encoding detection for values > 0x00ff.

In short, the ASCII case is a straightforward swap, because the default implementation of this WAS ASCII-only.

But handling %E6%B5%8B%E8%AF%95 requires more sophisticated detection logic (and testing). I have to write logic to detect %E6%B5%8B, for example, and get that translated to the integer 0x00e6b58b. I have to plan that logic carefully and test against every possible character translation. For @kwwall's edification, given my current trajectory, I should actually be done with this in late August or early September. Maybe sooner if I get lucky.
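
The shape of the round-trip test this fix has to satisfy is roughly the following (a sketch in the same JUnit style as my earlier snippet; the expected encoding is the one reported above, and the test name is a placeholder, not final API):

    public void testPercentCodecUtf8RoundTrip() throws EncodingException {
        Encoder e = ESAPI.encoder();

        String original = "测试";
        String encoded = e.encodeForURL(original);   // "%E6%B5%8B%E8%AF%95", per the report above
        String decoded = e.canonicalize(encoded);

        // once the percent codec is UTF-8 aware, canonicalize must restore the original characters
        assertEquals(original, decoded);
    }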

@kwwall
Contributor

kwwall commented Jul 9, 2022

That's okay, Matt. We're all volunteers here. Anyone who needs this urgently could help by submitting a PR that includes suitable tests; we will do a code review on it, and if it all looks good, merge it to our 'develop' branch, as well as being sure to give them proper credit in the next release notes.

@ghost

ghost commented Jul 11, 2022

@xeno6696 Thanks for your response! Understood. Could you please add a comment to this thread noting which release the fix lands in, once it is fixed? :)

@xeno6696
Collaborator

Oh trust me, I will. Hopefully future codec updaters can build from my notes.

@kwwall
Contributor

kwwall commented Sep 2, 2022

Seemingly somewhat related to issue #377, but I'm not the expert here. I dropped a link to a technical report on "Unicode Security Considerations" there that may be useful here as well.
