PercentCodec performs destructive decoding of valid UTF-8 byte sequences #667

Open

Aakash4396 opened this issue Mar 15, 2022 · 32 comments

@Aakash4396

If special characters (测试) are present in the name of a file, then the ESAPI validator fails to match the input.
These characters are encoded before being passed to the validator, but internally the call goes to
ESAPI.validator().getValidInput("setHeader", strippedValue, "HTTPHeaderValue", sc.getIntProp("HttpUtilities.MaxHeaderValueSize"), false);

which internally calls
getValidInput(context, input, type, maxLength, allowNull, true);
Point to be noted: input is the string that has to be validated against the regex.

This in turn calls the StringValidationRule class's getValid(context, input); method.

The main problem occurs in the getValid(context, input); method.
Here the input that is passed in for validation is decoded with encoder.canonicalize(input); into data, to print in the log message,
but when these parameters are passed to the checkWhitelist(context, data, input); method they are passed in reverse order,
because internally checkWhitelist(context, data, input); uses data to match against the regex, instead of using input.

The actual call to checkWhitelist should be checkWhitelist(context, input, data);
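
Roughly, the flow I'm describing inside getValid looks like this (my paraphrase from reading StringValidationRule, not the exact source):

    // My paraphrase of what happens inside StringValidationRule.getValid
    // (illustrative only, not the actual ESAPI source):
    public String getValid(String context, String input) throws ValidationException {
        // the raw input is canonicalized first
        String data = encoder.canonicalize(input);

        // the canonicalized data is what gets matched against the regex;
        // the raw input is only carried along for exception/log messages
        checkWhitelist(context, data, input);

        return data; // (simplified)
    }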

@jeremiahjstacey
Collaborator

On the surface it would seem that the environment whitelist configuration needs to be adjusted to allow the canonicalized data in the format you're expecting.

From the documentation in the class, the workflow is implemented as intended, where the value of input passed to checkWhitelist method is the canonicalized form of the raw input received.

If you're using the default configurations distributed for ESAPI, then there may be an issue with one of the regexes in relation to the canonicalized form of this input.


Based on the documentation of StringInputValidator.checkWhitelist, the parameters are being passed in the intended order.

	/**
	 * checks input against whitelists.
	 * @param context The context to include in exception messages
	 * @param input the input to check
	 * @param orig A origional input to include in exception
	 *	messages. This is not included if it is the same as
	 *	input.
	 * @return input upon a successful check
	 * @throws ValidationException if the check fails.
	 */
	private String checkWhitelist(String context, String input, String orig) throws ValidationException

StringValidationRule.java

The value of input parameter in the workflow you're describing is the canonicalized form of the original input string. That is the value intended to be verified against the whitelist. That is evaluated on line 275 of the getValid method.

@Aakash4396
Author

In the previous version of the ESAPI jar, 2.2.1.1, the raw input was passed directly to checkWhitelist without canonicalization, but now in 2.3.0.0 the raw input is canonicalized and then passed to checkWhitelist. This change causes canonicalization of raw input that contains Unicode characters (测试) to fail: because canonicalize doesn't handle Unicode characters, the canonicalized form of the raw input contains garbage characters, which makes the pattern matching fail. This used to work with the previous version of the jar because the raw input was passed directly to checkWhitelist.
Now, while setting a header on the response, ESAPIUtilities.getHttpUtilities().setHeader(...); is called, which internally calls ESAPI.validator().getValidInput(...);, and getValidInput() sets the canonicalize flag to true, which results in a failure to add the header.

Is there any way to get around this?

@xeno6696
Collaborator

xeno6696 commented Apr 5, 2022

Well, yes. HTTP headers were never defined for anything other than ASCII. You have to encode characters represented with bytes larger than 127 (0x7F). Use percent encoding on non-ASCII data.

UTF-8 headers are informally supported by some browsers and web application servers, but it’s best practice not to let the environment make those decisions for you.

From ESAPI’s perspective we could discuss allowing a canonicalize-less version of that method, but you can solve your problem today by percent encoding and calling it done.

@kwwall
Contributor

kwwall commented Apr 5, 2022

First off, let me say that we knew that such "breakage" was possible. It was noted in ESAPI issue #588 (specifically, in my comment at #588 (comment)). IMO, it was a bug, and we ought not to be relying on a bug for backward compatibility. I believe the way it behaves now is correct (unless of course we botched something in the attempted fix; looking back on issue #588, the PR #611 which "closed" it appears to have only touched "test" files. @xeno6696 - could you take a look and provide an explanation? Thanks.)

@xeno6696 wrote:

From ESAPI’s perspective we could discuss allowing a canonicalize-less version of that method, but you can solve your problem today by percent encoding and calling it done.

Yes, we could do that. @planetlevel even mentioned something like that in issue #588... providing something like a getOriginalAndPossiblyUnsafeInput() method, that would do the regex validation, but skip the canonicalization. That is not generally recommended though.

@xeno6696
Collaborator

xeno6696 commented Apr 5, 2022

@kwwall the first commit, 92a96ec, in the sequence was the actual logic change.

There was some reverse-engineering in #588 breaking apart what I did and why, but the actual logic change was just adding the data parameter to the whitelist/blacklist calls, which somehow got missed back in '17. Everything else was testing, as you pointed out. Like this question, it was mostly a "behavior changed, is it the right behavior?" kind of question/analysis.

We are still doing the right thing here.

@kwwall
Contributor

kwwall commented Apr 5, 2022

@xeno6696 - My bad. Not sure how I missed that commit, especially when I was looking for it. I think I had false expectations that maybe you swapped the formal parameters 'input' and 'orig' in the private method definition rather than the actual parameter order in the call. Anyhow, thanks for the quick response. I knew it had to be there somewhere.

That said, I should have done a better job in the 2.2.3.0 release notes of calling attention to issue #588 and our fix for it, and to how that bug fix could potentially break "backward compatibility". I will take the blame for that.

Lastly, unless someone can point to a revised RFC or a W3C standards document that allows HTTP header values to be non-ASCII, I think that we should close this. If someone such as @Aakash4396 wants something like the aforementioned getOriginalAndPossiblyUnsafeInput() method added, they should create a new issue explicitly marked as an 'Enhancement' that requests this, and try to add a small paragraph or two on why they believe it is justified. But it is my understanding that HTTP header values are supposed to be ASCII (or safely encoded as such). If they are not, we are not going to be able to really address their validation using a regex approach.

I'll leave this issue open for a few more days to give it some more time for discussion, but unless someone can point out that non-ASCII header values (and names even??) are now permitted, I will opt to close this as 'wontfix'.

@xeno6696
Collaborator

xeno6696 commented Apr 5, 2022

I had some free time over lunch. I created the following unit test:

    public void testUnicodeCanonicalize() {
    	Encoder e = ESAPI.encoder();
    	String input = "测试";
    	String expected = "测试";
    	String output = e.canonicalize(input);
    	assertEquals(expected, output);
    }

At any rate, @Aakash4396, if the characters you're getting at the time canonicalize is called are 测试, then your issue is something else entirely, because this passes. If you don't believe me, toss that exact code in a JUnit test class and run it yourself.

If it fails, it's possible platform encoding is getting in the way. Windows 10 standardized on UTF-8 as the default byte encoding, but if you're running anything prior to that it's cp-1252, and that can transform bytes under your feet. The same error can happen in your browser or on the target web server, if some default somewhere is set to something other than UTF-8.
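
If you want to see that effect in isolation, here's a quick illustration (plain JDK, nothing ESAPI-specific):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class MojibakeDemo {
        public static void main(String[] args) {
            // UTF-8 bytes for "测试": E6 B5 8B E8 AF 95
            byte[] utf8 = "测试".getBytes(StandardCharsets.UTF_8);

            // decoding with the right charset round-trips cleanly
            System.out.println(new String(utf8, StandardCharsets.UTF_8));      // 测试

            // decoding the same bytes with a legacy single-byte charset mangles them
            System.out.println(new String(utf8, StandardCharsets.ISO_8859_1)); // mojibake, not 测试

            // the platform default is what silently gets used when no charset is specified
            System.out.println(Charset.defaultCharset());
        }
    }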

@Aakash4396
Author

Hi @xeno6696,
The characters I get after canonicalize are not the same as the ones passed in, because the characters are encoded before being passed.

The use case is:
The characters 测试 are encoded by passing them to the ESAPIUtilities.getEncoder().encodeForURL(...) function.
The characters get converted to %E6%B5%8B%E8%AF%95 and are passed as input.
Later, when this input is passed through canonicalize, it returns �� characters, which fail to match the provided pattern.
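
Trimmed down, the code path I'm exercising looks like this (the values in the comments are what I observe with 2.3.0.0):

    import org.owasp.esapi.ESAPI;
    import org.owasp.esapi.Encoder;
    import org.owasp.esapi.errors.EncodingException;

    public class HeaderValueRepro {
        public static void main(String[] args) throws EncodingException {
            Encoder encoder = ESAPI.encoder();

            String original = "测试";
            String encoded = encoder.encodeForURL(original);  // "%E6%B5%8B%E8%AF%95"
            String decoded = encoder.canonicalize(encoded);   // expected "测试", but garbage comes back

            System.out.println(encoded);
            System.out.println(decoded);  // this value then fails the HTTPHeaderValue pattern match
        }
    }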

@xeno6696
Collaborator

xeno6696 commented Apr 6, 2022

Ah, see, you needed to LEAD with that information. You would have saved a ton of time and energy if you had said:
"My input is %E6%B5%8B%E8%AF%95 and my expected output is 测试 but what I'm getting is æµ�è¯�."

At any rate, I should have a solution ready quickly. Here's the culprit:

[screenshot]

When I converted the API over to be non-destructive to higher-order bytes, I only converted the HTMLEntityCodec, leaving the rest as an exercise for the future. The future is now. lol

@kwwall
Contributor

kwwall commented Apr 6, 2022

@xeno6696 - Now that you have a better grasp of the issue, do you think you could maybe suggest a revision of the issue title? I think what is listed here is way too broad, as seen by the rabbit trail that we went down.

@xeno6696 changed the title from "DefaultHTTPUtilities.java fails to set header, ESAPI.validator() method call fails to validate the header value" to "PercentCodec performs destructive decoding of valid UTF-8 byte sequences" on Apr 6, 2022
@Aakash4396
Author

Hi @kwwall , @xeno6696
Thanks for the update.
Is there any timeline for this to be fixed, and is there any workaround until then?

@kwwall
Contributor

kwwall commented Apr 28, 2022

@Aakash4396 -- only partly in jest, one thing that would probably speed up a fix for this is if you could submit a PR to fix it. We'd credit you in the next release notes for your assistance. And if you can't do that, at least maybe contribute a few other cases where it is failing. But @xeno6696 would have to be the one who gives you an estimate of when it could be fixed, since I don't know his schedule, and I'm not sure what is wrong or how to fix this. I don't think I understand enough about code points to know how to fix it, and I expect that is partly what is involved here.

@planetlevel

https://stackoverflow.com/questions/155892/unicode-url-decoding

@xeno6696
Collaborator

xeno6696 commented Apr 28, 2022

I started working on it, but timelines when dealing with something that isn't a job are funky things. It should be in the next release, hopefully within a couple of months, but honestly, if I convert one codec, I should convert them all. PercentCodec has a thorn in that, IIRC, it's also based on Latin-1, so to do this correctly I should create a version of each, and we should consider moving the default encoding to purposefully shake some things up. =-)

@xeno6696
Collaborator

All of that is to say, I'll get to it when I get to it; if you want it done faster, we accept MOST PRs. I could share a patch for my halfway conversion as it sits now if you want to run with it.

@xeno6696
Collaborator

I missed @planetlevel 's link, but yes, I intend to ensure we support UTF-8 encoding/decoding.

@kwwall
Contributor

kwwall commented Apr 28, 2022 via email

@xeno6696
Collaborator

xeno6696 commented Apr 28, 2022

It's not as helpful as you might think--they do so without any of the decoding baggage.

Which is exactly what's broken here. The decoding.

Encoding is simple: have a codePoint/byte, get an encoded URI string. Because Java is UTF-16 under the hood, characters that can be represented by integer code points are trivial to convert into UTF-8 URI strings.

On decode we need to detect the correct byte-string representation, which is why their code isn't super useful to us. This is also--I suspect--a key reason they chose only to encode.

That entire problem goes away.

But at any rate, the decode logic needs to be able to detect %E6%B5%8B and correctly output 测.

So the easy part is converting the existing codec to do Unicode; the harder part is ensuring we properly detect and handle these URL-to-byte sequences.

It's not super hard, but like I said I need the better part of a day to get in the flow.
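
To make the detection concrete, here is roughly the shape of what I mean (a throwaway sketch, not what will ship in PercentCodec; validation of malformed escapes is omitted):

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;

    public class Utf8PercentDecodeSketch {

        // Collect consecutive %XX escapes into a byte buffer and decode the buffer as UTF-8,
        // instead of mapping each %XX to a character on its own (which is what destroys
        // multi-byte sequences today).
        static String decode(String s) {
            ByteArrayOutputStream pending = new ByteArrayOutputStream();
            StringBuilder out = new StringBuilder();
            int i = 0;
            while (i < s.length()) {
                char c = s.charAt(i);
                if (c == '%' && i + 2 < s.length()) {
                    // accumulate the raw byte; a real codec must also validate the hex digits
                    pending.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                    i += 3;
                } else {
                    flush(pending, out);
                    out.append(c);
                    i++;
                }
            }
            flush(pending, out);
            return out.toString();
        }

        private static void flush(ByteArrayOutputStream pending, StringBuilder out) {
            if (pending.size() > 0) {
                out.append(new String(pending.toByteArray(), StandardCharsets.UTF_8));
                pending.reset();
            }
        }

        public static void main(String[] args) {
            System.out.println(decode("%E6%B5%8B%E8%AF%95")); // 测试
            System.out.println(decode("a%20b"));               // a b
        }
    }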

@kwwall
Contributor

kwwall commented Apr 28, 2022

@xeno6696 - Tell you what. I'll give you the weekend off to work on it! 😂

@xeno6696
Collaborator

[image]

@kwwall
Contributor

kwwall commented Apr 30, 2022

@xeno6696 - Note that when you extend the various codecs to handle Unicode, there is a Javadoc comment on the Encoder.canonicalize(String input, boolean restrictMultiple, boolean restrictMixed) interface method that states:

| Note that canonicalize doesn't handle Unicode issues, it focuses on higher level encoding and escaping schemes.

That should be corrected or removed.

@xeno6696
Collaborator

Right. The issue here is that our encoding/decoding only handles Latin-1 byte renderings. ‘%E6%B5%8B’ is rendering into incompatible and unrecoverable gibberish.

@xeno6696
Collaborator

Oh wait… you’re saying the doc is out of step.

@kwwall
Contributor

kwwall commented Apr 30, 2022 via email

@xeno6696
Collaborator

xeno6696 commented Apr 30, 2022

So @kwwall, that comment can stay. The mistakes in the current implementation actually break that contract. Not blaming whoever came before, but their mindset was clearly from before UTF-8's dominance.

I would actually be interested in doing a performance test of the Encoder project vs. ESAPI once this is finished. On paper theirs should be faster, but the change I'm working on has us treating all the underlying data as raw ints, which should give significant memory and speed advantages.

@xeno6696
Collaborator

Just so you know @kwwall, I took some time to double-check my thinking here, and I'm questioning the approach I took when I rewrote this part of the library back in '17. Yes, I made a ton of bugs go away that relate to encoding issues, but something doesn't smell right.

I double-checked the char data type, and it's supposed to be good for the full range \u0000-\uffff. That should handle everything Unicode, right? OKAY... this is coming back to me. Unicode characters actually span the range U+0000..U+10FFFF.

So... I think that's what forced me into jumping to integers. From the Java language docs: char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).

Integer was the smallest data type available for me to be able to capture sequences like %E6%B5%8B and ultimately register them as 0x00e6b58b.

I wonder, before I go too far on this one, whether I shouldn't set up a series of boundary tests and consider confining my implementation more firmly to the 21-bit range that character encodings actually encompass.

This is where I wish I took better notes sometimes.
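
For my own notes, the char/int boundary I'm talking about looks like this in plain JDK terms (nothing ESAPI-specific):

    public class CodePointNotes {
        public static void main(String[] args) {
            // "测" (U+6D4B) sits in the Basic Multilingual Plane: one char is enough
            String bmp = "测";
            System.out.println(bmp.length());              // 1

            // U+1F600 is outside the BMP, so it needs a surrogate pair (two chars)
            String astral = new String(Character.toChars(0x1F600));
            System.out.println(astral.length());           // 2
            System.out.println(astral.codePointAt(0));     // 128512 -- only an int can hold the code point

            // the full Unicode range tops out at U+10FFFF, i.e. 21 bits
            System.out.printf("0x%X%n", Character.MAX_CODE_POINT); // 0x10FFFF
        }
    }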

@ghost

ghost commented Jun 30, 2022

Hi, is there any further progress on this ticket? We are impacted by this issue.

@xeno6696
Collaborator

xeno6696 commented Jul 8, 2022

I'm sure you're impacted, @dlgma0, but you're dealing with volunteer hours.

I won't be able to make next week's release. The reason I didn't code this back in '17 is that it dawned on me--like it did this morning--that handling the kinds of characters you're talking about requires the percent codec to have UTF-8 encoding detection for values > 0x00ff.

In short, the ASCII case is a straightforward swap, because the default implementation of this WAS ASCII-only.

But handling %E6%B5%8B%E8%AF%95 requires more sophisticated detection logic (and testing). I have to write logic to detect %E6%B5%8B, for example, and get that translated to the integer 0x00e6b58b. I have to plan that logic carefully and test against every possible character translation. For @kwwall's edification, given my current trajectory, I should actually be done with this in late August or early September. Maybe sooner if I get lucky.
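
The shape of the round-trip test this fix has to satisfy is roughly the following (a sketch in the same JUnit style as my earlier snippet; the expected encoding is the one reported above, and the test name is a placeholder, not final API):

    public void testPercentCodecUtf8RoundTrip() throws EncodingException {
        Encoder e = ESAPI.encoder();

        String original = "测试";
        String encoded = e.encodeForURL(original);   // "%E6%B5%8B%E8%AF%95", per the report above
        String decoded = e.canonicalize(encoded);

        // once the percent codec is UTF-8 aware, canonicalize must restore the original characters
        assertEquals(original, decoded);
    }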

@kwwall
Contributor

kwwall commented Jul 9, 2022

That's okay, Matt. We're all volunteers here. Anyone who needs this urgently could help by submitting a PR that includes suitable tests; we will do a code review on it, and if it all looks good, merge it to our 'develop' branch, as well as being sure to give them proper credit in the next release notes.

@ghost

ghost commented Jul 11, 2022

@xeno6696 Thanks for your response! Understood. Could you please add a comment to this thread noting which release the fix lands in, once it is fixed? :)

@xeno6696
Collaborator

Oh trust me, I will. Hopefully future codec updaters can build from my notes.

@kwwall
Contributor

kwwall commented Sep 2, 2022

Seemingly somewhat related to issue #377, but I'm not the expert here. I dropped a link to a technical report on "Unicode Security Considerations" there that may be useful here as well.
