Implement Custom UTF-8 Decoder #885

Arker123 · 2023-09-26T06:51:41Z

This PR introduces a custom UTF-8 string decoder to replace the existing binary2strings conversion.
Reference: https://en.wikipedia.org/wiki/UTF-8
Fixes: #867

williballenthin · 2023-09-26T08:51:03Z

@Arker123 would you include some performance benchmarks as you propose these changes? eg. does FLOSS performance noticeably change when using this implementation?

don't have to address this immediately, just before we consider it for merging

Arker123 · 2023-09-27T04:22:12Z

For rust-hello64.exe, the current coverage has dropped from 96% to 71%, which I'm investigating. That's the reason this PR is marked as a draft. On the plus side, it correctly extracts the string 'invalid args' as shown in issue #867.

Arker123 · 2023-09-27T04:26:33Z

Additionally, we still need to address the consideration of 'characters' in this implementation.

Source: https://en.wikipedia.org/wiki/UTF-8

Arker123 · 2023-10-02T03:42:15Z

Presently, we have achieved coverage levels of >= 96% in 5 out of 6 test files. However, one file, specifically https://github.com/mandiant/flare-floss-testfiles/blob/master/language/rust/rust-unknown-binaries/bin/1.69.0/i386/200c308a793630e4f3686dd846f0d55b6368834a859875970b4135f3ca487f46, is currently only at 86% coverage. :(

Arker123 · 2023-10-02T05:15:46Z

After conducting further research, I have discovered additional xrefs within the .rdata section of the i386 architecture. The current coverage for the aforementioned file stands at 91%.

You can view the findings here:

mr-tz · 2023-10-09T15:55:34Z

Let's not mix updates to the extraction algorithm and the custom UTF-8 string finder in this PR.

For performance here, let's focus on the time it takes to find just all the UTF-8 strings from files.

Arker123 · 2023-11-10T09:48:10Z

Here are some performance statistics:

File name: 1.59.0_amd64_4d41c02164ee4aeacd5e3fd67ecdc097233b8c6a725fab588d1cdde0461fd8cc
File size: 114.84571838378906 MB
perf: 8.53998875617981 s

File name: 1.49.0_i386_13af976c769e583e63665cc29ddeb6dc1a81c447ce7ea87bc10e01941576318b
File size: 112.57718658447266 MB
perf: 11.280088901519775 s

File name: 1.67.1_amd64_286164c8db3976443ca605209c182e93d2fee5854504d5e9fc32a0b701b78aa8
File size: 125.58065795898438 MB
perf: 20.27586317062378 s

File name: 1.53.0_amd64_6b4292500c9603d0f9647ae53adaa4cbf2840ae243d9efa3f14048f079e2d32a
File size: 65.56964874267578 MB
perf: 4.034202814102173 s

File name: 1.53.0_amd64_63fa08f342be3c2a9c13f3d4c02f5a808913b291d0dc2691424fbb6e50fadb0d
File size: 65.56991577148438 MB
perf: 4.46514892578125 s

File name: 1.65.0_amd64_6f0a952ebd0fd544bf27a13686a0f3494e9be102654a58996bedfe3bcc6f61ab
File size: 83.82809448242188 MB
perf: 9.418390274047852 s

File name: 1.53.0_amd64_bcca0fd4d94764ef5a1ade5e81e749b11d87ab1c6c490d9065040e20da9dcf73
File size: 65.56964874267578 MB
perf: 4.185543060302734 s

File name: 1.59.0_amd64_a09acc46cc64b945ba7e5b1fcd449b6cad435b31d45c7cdc9a4b8dc3c5d4aa04
File size: 115.15755462646484 MB
perf: 8.811774015426636 s

File name: 1.53.0_amd64_15d603b12906f96d4891164febef7b587667363b7c91011ee100c81cad3a0472
File size: 65.56964874267578 MB
perf: 4.438174724578857 s

File name: 1.64.0_amd64_8b4a259da875e73e61b72c87191b8e7ba35c29da6db4fc1f7805c5f5ef93d42e
File size: 74.84070587158203 MB
perf: 5.529693603515625 s

File name: 1.69.0_amd64_c5f8c0a07123fc97c7b673b22b5f54cf415d66454022827f849a76dbe7ab2853
File size: 17.7099609375 MB
perf: 2.7266366481781006 s

File name: 1.69.0_amd64_0eba113dde8ba85afd96116d39444fc4fcc7a280b45d0d15544642f290a5e72a
File size: 31.06982421875 MB
perf: 0.1291041374206543 s

File name: 1.67.1_amd64_a585d992715c9d482fea1d047ea938d884f2888fa5d037aeca8c14ee6f7636d1
File size: 125.58039093017578 MB
perf: 19.494038581848145 s

mr-tz · 2023-11-10T10:05:00Z

Are these the performance stats just for running the UTF-8 extraction code?

Can you clean up the PR to only include those changes, please?

Arker123 · 2023-11-10T11:28:59Z

Are these the performance stats just for running the UTF-8 extraction code?

Yes

Can you clean up the PR to only include those changes, please?

Okay😄

mr-tz · 2023-11-10T14:31:46Z

floss/language/rust/decode_utf8.py

+    raise ValueError("no .rdata section found")
+
+
+def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[Tuple[str, int, int]]:


can you add a function that's signature is analog to extract_ascii_strings, i.e. operate on buffers?
this can then be wrapped by this or a similar function.

mr-tz · 2023-11-10T14:32:17Z

floss/language/rust/decode_utf8.py

+
+    for i in range(0, len(strings)):
+        # for 1 byte
+        if strings[i] & 0x80 == 0x00:


please add tests for various encoded strings to show they're properly decoded

Arker123 added 2 commits September 26, 2023 12:17

Implement UTF-8 Decoder

cfeb127

Tweaks

e083376

Minor changes

4a54532

Arker123 added 2 commits October 2, 2023 10:42

Discovered more i386 xrefs

775f1ce

Enhance percentage extraction to 91% 😄

18e6080

Arker123 mentioned this pull request Nov 9, 2023

More i386 xrefs #899

Open

Merge branch 'master' into utf8_decoder

6d8e314

Clean up

8515899

Arker123 marked this pull request as ready for review November 10, 2023 13:08

mr-tz reviewed Nov 10, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement Custom UTF-8 Decoder #885

Implement Custom UTF-8 Decoder #885

Arker123 commented Sep 26, 2023

williballenthin commented Sep 26, 2023

Arker123 commented Sep 27, 2023 •

edited

Arker123 commented Sep 27, 2023

Arker123 commented Oct 2, 2023

Arker123 commented Oct 2, 2023

mr-tz commented Oct 9, 2023

Arker123 commented Nov 10, 2023

mr-tz commented Nov 10, 2023

Arker123 commented Nov 10, 2023

mr-tz Nov 10, 2023

mr-tz Nov 10, 2023

		raise ValueError("no .rdata section found")


		def extract_utf8_strings(pe: pefile.PE, min_length=MIN_STR_LEN) -> List[Tuple[str, int, int]]:

Implement Custom UTF-8 Decoder #885

Are you sure you want to change the base?

Implement Custom UTF-8 Decoder #885

Conversation

Arker123 commented Sep 26, 2023

williballenthin commented Sep 26, 2023

Arker123 commented Sep 27, 2023 • edited

Arker123 commented Sep 27, 2023

Arker123 commented Oct 2, 2023

Arker123 commented Oct 2, 2023

mr-tz commented Oct 9, 2023

Arker123 commented Nov 10, 2023

Here are some performance statistics:

mr-tz commented Nov 10, 2023

Arker123 commented Nov 10, 2023

mr-tz Nov 10, 2023

Choose a reason for hiding this comment

mr-tz Nov 10, 2023

Choose a reason for hiding this comment

Arker123 commented Sep 27, 2023 •

edited