Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Classes (updated) #1379

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Classes (updated) #1379

wants to merge 4 commits into from

Conversation

regeciovad
Copy link
Contributor

@regeciovad regeciovad commented Oct 16, 2020

The goal is to enable the use of rules like the following in the VirusTotal Hunting:

rule contains_btc_address 
{
    strings:
        $btc = /[13][a-km-zA-HJ-NP-Z1-9]{25,34}/ fullword ascii wide
    condition: 
        $btc
}

The implementation allows the use of multiple characters per byte in atoms and per state in the Aho-Corasick automaton. This is done via encoding in the bitmask, a part that is already used in YARA.

Some additional actions are needed. The Aho-Corasick has to be deterministic even for the calculation of failure links.

However, simplification is shown in many areas. For example, the nocase option only involves the additional characters in the bitmask; new atoms are not created. The atoms contain more information about strings, and the Aho-Corasick automaton can be effective when used for searching.

The resulting transition table can be also smaller and more compact than if the same strings were generating multiple atoms.
For more details, I created a blog post about my changes and some test results: https://engineering.avast.io/yara-in-search-of-regular-expressions/.

Copy link
Member

@plusvic plusvic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @regeciovad, I've started reviewing this PR and I'll be adding new comments as I find new things that are worth commenting. This is going to take a while because the changes are profound and I want to get my head around them.

I also deployed this branch in VirusTotal and it looks pretty stable, the only drawback so far is a large increase in the size of compiled rulesets, it went from ~800MB to ~3GB in our case. What could explain that increase?

@@ -428,6 +428,7 @@ single
fail_if($$ == NULL, ERROR_INSUFFICIENT_MEMORY);

$$->re_class = $1;
$$->mask = 0xCC;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear why the node's mask is being set here to 0xCC. Apparently this is checked later at ...

https://github.com/VirusTotal/yara/pull/1379/files#diff-80eca980365d0a935a951470d2fb215bb663508b7354a985795b5a70755bd4a4R879

... but the comparison is with YR_ATOM_TYPE_CLASS instead of 0xCC, so changing the value of YR_ATOM_TYPE_CLASS introduces an issue. We should be able to change the value of YR_ATOM_TYPE_CLASS without breaking the code.

But initializing the mask with YR_ATOM_TYPE_CLASS also feel unnatural, because YR_ATOM_TYPE_CLASS is related to atoms, not to regexp nodes. I think that we should be using the node's type (RE_NODE_CLASS) instead of the mask in this condition:

if ((nodes)[i]->mask == YR_ATOM_TYPE_CLASS)
    {
      atom->bytes[i] = 0x00;
      for (int j = 0; j < 32; j++)
      {
        if ((nodes)[i]->re_class->bitmap[j] != 0)
        {
          chunk = (uint8_t)(nodes)[i]->re_class->bitmap[j];
          atom->bitmap[i][j / diff] |= chunk << ((j % diff) * diff);
        }
      }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry, I am updating my PR after a while, I had some deadlines, and I was also solving issues with this PR.
I changed the code based on this comment. However, I re-discovered my original motivation for this. For some reason, the type is not always correctly set up to RE_NODE_CLASS but RE_NODE_LITERAL.
This is causing missing matches on tests/test-bitmap.c:328: rule does not match.
The PR was, for that reason, failing tests for a few days. On other machines I have access to, the tests are OK.
I honestly have no idea why this is happening, and because I cannot recreate the failed test locally, I have trouble fixing the cause.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last comment regarding memory usage. I updated the compiler.c file with values decreasing memory usage. Hopefully, this will improve the issue you reported the last time.

libyara/ahocorasick.c Show resolved Hide resolved
libyara/ahocorasick.c Outdated Show resolved Hide resolved
int literal = _yr_ac_class_is_literal(input_state->bitmap);
if (literal >= 0)
{
input_state->input = literal;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

input_state is modified here, that's a side effect of dfa_subtree that should avoided. Why is this required here? Is it possible to convert all the states of type YR_ATOM_TYPE_CLASS that can be converted to YR_ATOM_TYPE_LITERAL in a single place?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change to the literal state is needed because it influences the algorithm. I tried to revise the algorithm several times, but I am mainly facing problems with match structure. It is sensitive to changes and copies, and many changes in the algorithm influence the order of matches, for example.

//
// Detect if two states are in a collision
//
static bool _yr_ac_find_conflict_states(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description for this function says that it detects if two states are in collision, but it receives four states, what's the purpose of the other two states? Which are the states that are checked for conflict?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some comments to make the process a little bit clearer, hopefully. When I am changing states, I have to update the matches and children's states as well. This is, however, not easy, with the goal of not changing the results, as I wrote in the previous comment. For that reason, I had to split it into several steps.

// If the class becomes literal, it returns a literal found.
// Otherwise, it returns 0.
//
static int _yr_ac_class_is_literal(YR_BITMASK* bitmap)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename this function to _yr_ac_class_get_literal because the current name suggests that the result from this function should be a boolean. Also, the function actually returns -1 when the bitmap doesn't represent a literal, but the documentation says that it returns 0. I would improve the documentation with something like:

////////////////////////////////////////////////////////////////////////////////
// If the class represented by the bitmap can be reduced to a literal, return
// that literal, otherwise return -1.
//
// For example, a class like [a] actually contains a single character, and is
// equivalent to literal "a", the class [a-z] however can't be represented by a
// single literal.
//

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@regeciovad
Copy link
Contributor Author

Thank you for your review. I will look into a memory usage increase. My version generally takes a little bit of memory, but there should not be such a gap. I noticed that one of my changes from the past that decreased the memory usage (#1518) was reverted (https://github.com/VirusTotal/yara/blob/master/libyara/compiler.c#L269). I will check if this influences it.
I will also update the code based on your comments and recommendations as soon as possible. Thank you for them.

@regeciovad regeciovad force-pushed the classesv2 branch 4 times, most recently from 1022b6b to 32f8802 Compare August 29, 2022 07:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants