Recognize txt #836

christian-intra2net · 2023-12-28T09:49:45Z

olevba's heuristic for detecting plain text (no \x00 in the binary data) fails for many Unicode encodings such as UTF-16. This PR improves the heuristic and moves it to ftguess.py, so that we can at least handle harmless text encoded as UTF-8, Latin-1, or UTF-16 (with or without a BOM). This is far from perfect and ignores popular Asian encodings, but according to Wikipedia, UTF-8 is by far the most widely used encoding in software. If we need something better, I'd recommend not reinventing the wheel and using libmagic or another specialized library instead.
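To make the idea concrete, here is a minimal sketch of such a heuristic. It is not the code added to ftguess.py: the names `guess_text_encoding` and `_decodes_to_plausible_text` are hypothetical, and the rule "reject control, unassigned, private-use and surrogate characters" is an assumption chosen for illustration. UTF-32 and Asian encodings are deliberately not handled.

```python
import codecs
import unicodedata
from typing import Optional

# Hypothetical helper names; this is an illustration, not the ftguess.py code.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)
_ALLOWED_CONTROLS = "\t\n\r\f"
# Assumption: control, unassigned, private-use and surrogate code points
# do not occur in harmless text.
_REJECTED_CATEGORIES = {"Cc", "Cn", "Co", "Cs"}


def _decodes_to_plausible_text(data: bytes, encoding: str) -> bool:
    """True if data decodes strictly and contains no rejected characters."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return all(
        ch in _ALLOWED_CONTROLS
        or unicodedata.category(ch) not in _REJECTED_CATEGORIES
        for ch in text
    )


def guess_text_encoding(data: bytes) -> Optional[str]:
    """Return a codec name if data looks like text, else None."""
    if not data:
        return None
    # 1. A BOM settles the encoding; the data must still decode cleanly.
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding if _decodes_to_plausible_text(data, encoding) else None
    # 2. No NUL bytes: try strict UTF-8 first, then Latin-1.
    if b"\x00" not in data:
        for encoding in ("utf-8", "latin-1"):
            if _decodes_to_plausible_text(data, encoding):
                return encoding
        return None
    # 3. NUL bytes present: assume BOM-less UTF-16 and guess the byte order
    #    from where the zero bytes sit (works for mostly Latin-script text).
    zeros_at_even = data[0::2].count(0)
    zeros_at_odd = data[1::2].count(0)
    encoding = "utf-16-be" if zeros_at_even >= zeros_at_odd else "utf-16-le"
    return encoding if _decodes_to_plausible_text(data, encoding) else None
```

Latin-1 decoding never fails, so the character-category check is what keeps binary data from being accepted in that branch. The byte-order guess in step 3 relies on the high byte being zero for most characters, which is why it only suits Latin-script text.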

I created sample files for all the encodings used, plus unit tests to check them.

Test-driven development: we want ftguess to detect these samples as text. The tests already use the future ftguess text type. While at it, slightly improve the unit test output.

This is not so simple, since various text encodings can look rather "binary", but a few simple heuristics handle many text types (at least those encountered here in Europe). Of course, all XML is text as well, so run the "is this text" checks only after more specialized tests like "is this XML".
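The ordering point can be sketched as follows. Everything here is hypothetical and does not reflect the actual ftguess API: `looks_like_xml` and `looks_like_text` are crude stand-ins for the real checks, and `guess_type` simply returns the first matching name.

```python
from typing import Callable, List, Tuple


def looks_like_xml(data: bytes) -> bool:
    """Crude stand-in: optional UTF-8 BOM and whitespace, then '<?xml'."""
    head = data[:64]
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    return head.lstrip().startswith(b"<?xml")


def looks_like_text(data: bytes) -> bool:
    """Crude stand-in: non-empty, NUL-free, valid UTF-8."""
    if not data or b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Order matters: specialized checks come before the generic text check,
# because XML would also pass looks_like_text().
CHECKS: List[Tuple[str, Callable[[bytes], bool]]] = [
    ("xml", looks_like_xml),
    ("text", looks_like_text),
]


def guess_type(data: bytes) -> str:
    """Return the name of the first matching check, or 'unknown'."""
    for name, check in CHECKS:
        if check(data):
            return name
    return "unknown"
```

With the two entries swapped, an XML document would be reported as generic text, which is the failure mode the ordering avoids.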

christian-intra2net added 3 commits December 22, 2023 14:49

Add test samples with various text encodings

7eb14b4


Decode text in olevba before analyzing it

929d2c0

decalage2 self-requested a review December 28, 2023 10:25

decalage2 self-assigned this Dec 28, 2023

decalage2 added 👍 enhancement olevba proposal ftguess labels Dec 28, 2023
