Unicode characters in YARA

Victor M. Alvarez edited this page Mar 11, 2021 · 1 revision

As you may have noticed, YARA 4.1 has started complaining about non-ASCII characters in some existing rules that compiled without errors in previous versions. Here I'll try to explain why this backward-incompatible change was necessary.

Until YARA 4.0.x, literal text strings were interpreted exactly as they appeared in the source file. All kinds of characters were accepted, and no validation was performed to make sure that the strings were valid ASCII, UTF-8, or any other particular encoding. This means that YARA searched for the string exactly as it was encoded by the text editor used for writing the rule. If the text editor encoded the source file as UTF-8, YARA searched for the UTF-8 representation of the string; if the text editor used Latin-1, YARA searched for the Latin-1 representation of the string.
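To make this concrete, here is a small Python sketch (Python is used only for illustration, it is not part of YARA) showing how one literal produces different bytes under the two encodings mentioned above. The sample word is an arbitrary choice, not something taken from a real rule.

```python
# The same literal text yields different bytes depending on the encoding
# the text editor used when saving the rule file.
text = "caf\u00e9"  # "café"; an arbitrary example word (assumption)

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

print(utf8_bytes.hex(" "))    # 63 61 66 c3 a9
print(latin1_bytes.hex(" "))  # 63 61 66 e9
```

A pre-4.1 YARA would have searched for whichever of those two byte sequences happened to be in the saved file.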

This behavior is troublesome because the same literal string can be encoded in different ways depending on factors like your text editor, your operating system, the default code page that your operating system uses, etc. So you don't fully control the actual sequence of bytes that YARA is searching for, unless you are completely aware of the ins and outs of code pages, Unicode and encoding formats. As I'm writing these lines I can't tell for sure what the file is going to look like in binary form once I hit the "save" button, and most YARA users are like me. Text encoding is a hairy issue, unless we limit ourselves to the old, plain, limited but uncomplicated ASCII format. This is the approach that YARA is taking.

By limiting text strings to characters in the printable ASCII range only, the intention is to prevent subtle issues caused by strings that were encoded by the text editor in ways the user didn't intend. Let me illustrate with an example:

rule space_1 {
    strings:
       $a = "⠀"
    condition:
       $a
}

Do you see something strange in the rule above? It looks OK, right? A simple rule for detecting space characters. Now copy & paste it into your text editor, save it, and use YARA to scan any text file that you know contains spaces. What happens?

I'll save you the burden of doing all that: nothing happens. YARA won't produce matches even if the scanned file is full of spaces. That's because the space you are seeing in that rule is not THE space, it's simply another kind of space. It's the Braille Pattern Blank character (U+2800), which is encoded in UTF-8 as three bytes (0xE2 0xA0 0x80) and has nothing to do with the ASCII space (0x20). So YARA is not looking for 0x20, it's looking for the sequence 0xE2 0xA0 0x80. The rule above is equivalent to this one:

rule space_1 {
    strings:
       $a = "\xe2\xa0\x80"
    condition:
       $a
}
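If you want to check the equivalence yourself, a few lines of Python are enough. This is only an illustrative sketch using the standard `unicodedata` module; the variable names are mine.

```python
import unicodedata

blank = "\u2800"  # the character between the quotes in the first rule
space = " "       # the ordinary ASCII space

print(unicodedata.name(blank))         # BRAILLE PATTERN BLANK
print(blank.encode("utf-8").hex(" "))  # e2 a0 80
print(space.encode("utf-8").hex(" "))  # 20
```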

In fact, there's a minor detail that could hint to the attentive user that something weird was happening with the first rule. It has to do with the infamous "$a is slowing down scanning" warning. I'll leave that as an exercise for the reader.

The example above may look artificial, because I actually made it up in order to illustrate the issue. But I've seen real-life cases where something similar occurred. For example, I've found YARA rules with strings that contained the Horizontal Ellipsis character (U+2026). This is because some text editors have the nasty (at least to me) behavior of replacing the good old ASCII characters you type in with some fancy Unicode characters. ASCII double quotation marks (0x22) are often replaced with a pair of left and right quotation marks (U+201C and U+201D), and three consecutive dots are replaced by the ellipsis Unicode character. The same happens with the ASCII single quote or apostrophe (0x27), which can be replaced by U+2019. All these cases have been found in actual YARA rules.
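Substitutions like these are easy to spot mechanically. The sketch below is a hypothetical helper (`find_non_ascii` is my own name, not a YARA feature) that lists every non-ASCII character in a rule source together with its position and Unicode name. As a simplification it flags the whole file, so it will also report characters in comments and metadata, where YARA 4.1 still accepts them.

```python
import unicodedata

def find_non_ascii(source: str):
    """Return (line, column, code point, name) for each non-ASCII character."""
    hits = []
    # Split on "\n" only, so unusual Unicode line separators are reported
    # as characters instead of being swallowed as line breaks.
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits

# A made-up rule in which an editor "improved" the quotes and the three dots.
rule = (
    "rule demo {\n"
    "  strings:\n"
    "    $a = \u201cLoading\u2026\u201d\n"
    "  condition:\n"
    "    $a\n"
    "}\n"
)

hits = find_non_ascii(rule)
for hit in hits:
    print(hit)
```

Reading the source as text assumes you already know the file's encoding; for a file of unknown encoding, scanning the raw bytes for values above 0x7F is the safer variant.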

Another source of trouble is the Unit Separator (US) character. This is not a Unicode-related issue: the Unit Separator is actually an ASCII character (0x1F), but one in the non-printable range. Some text editors introduce such characters in the text, and they may be invisible, or look like a narrow space that is very hard to distinguish from a regular one.
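Since these characters are invisible, the only reliable way to find them is to look at the bytes. Here is a minimal sketch; `find_control_chars` is a hypothetical helper of mine, and treating tab, line feed and carriage return as acceptable is my assumption about what a rule file normally contains.

```python
def find_control_chars(data: bytes):
    """Return (offset, byte) for ASCII control bytes other than tab, LF and CR."""
    allowed = {0x09, 0x0A, 0x0D}
    return [
        (offset, byte)
        for offset, byte in enumerate(data)
        if (byte < 0x20 or byte == 0x7F) and byte not in allowed
    ]

# A made-up line with a Unit Separator hidden between "foo" and "bar".
sample = b'$a = "foo\x1fbar"\n'
found = find_control_chars(sample)
print(found)
```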

As you can see, there are a number of reasons for making sure that strings in YARA are limited to the printable ASCII range. I'm aware this may affect some of your rules, especially those containing strings in non-Latin alphabets like Greek or Cyrillic, but the truth is that the apparent support that previous versions of YARA had for non-ASCII characters was completely illusory. When you have the string "ЯAPA" in a rule, YARA ends up searching for the byte sequence 0xDF 0xC0 0xD0 0xC0 if your text editor is using the Windows-1251 code page, or 0xD0 0xAF 0xD0 0x90 0xD0 0xA0 0xD0 0x90 if your text editor is using UTF-8. If your intention is to find the "ЯAPA" string in some Windows PE file, it is completely hopeless, as Microsoft compilers encode Unicode strings in UTF-16, and most text editors these days will encode your rules in either UTF-8 or your local code page, but not in UTF-16.
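The byte sequences above can be reproduced with Python's built-in codecs. In this sketch the word is written with explicit escapes so that all four letters are guaranteed to be Cyrillic, and I assume little-endian UTF-16, which is the byte order used on Windows.

```python
# Cyrillic YA, A, ER, A written as escapes to rule out Latin look-alikes.
word = "\u042f\u0410\u0420\u0410"

for encoding in ("cp1251", "utf-8", "utf-16-le"):
    encoded = word.encode(encoding)
    print(f"{encoding:10} {encoded.hex(' ').upper()}")
```

None of the three results shares a single byte pattern with the others, which is why a literal saved in one encoding cannot match data stored in another.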

YARA 4.1 still supports the use of non-ASCII characters in comments and metadata, as shown in the example below.

rule yara_cyrillic {
    meta:
        info = "Finds ЯAPA encoded in Windows-1251 and UTF-8"
    strings:
        $a = "\xDF\xC0\xD0\xC0"  // ЯAPA in Windows-1251
        $b = "\xD0\xAF\xD0\x90\xD0\xA0\xD0\x90"  // ЯAPA in UTF-8
    condition:
        $a or $b
}
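Writing those escape sequences by hand is tedious and error-prone, so you may prefer to generate them. The following is a minimal sketch, not an official tool: `yara_escaped` is a hypothetical helper that encodes a text in the encoding you choose and renders every byte as a `\xNN` escape, the same form used in the rule above.

```python
def yara_escaped(text: str, encoding: str) -> str:
    """Encode text and render each byte as a \\xNN escape inside double quotes."""
    body = "".join(f"\\x{byte:02X}" for byte in text.encode(encoding))
    return f'"{body}"'

# Cyrillic YA, A, ER, A written as escapes to rule out Latin look-alikes.
word = "\u042f\u0410\u0420\u0410"

print("$a =", yara_escaped(word, "cp1251"))
print("$b =", yara_escaped(word, "utf-8"))
print("$c =", yara_escaped(word, "utf-16-le"))
```

The third line produces the UTF-16 (little-endian) form, which is the one you would need for strings embedded by Microsoft compilers.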

I hope this text helps clarify the reasoning behind the change. Text encoding is a fascinating field that hides a lot of complexity and quirks. As you start digging deeper, you may fall down a rabbit hole.