Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Which messages can be deduplicated? #128

Open
williballenthin opened this issue Apr 24, 2024 · 9 comments
Open

Which messages can be deduplicated? #128

williballenthin opened this issue Apr 24, 2024 · 9 comments
Labels
question Further information is requested

Comments

@williballenthin
Copy link

williballenthin commented Apr 24, 2024

There are many opportunities for messages to be deduplicated as they are serialized into the BinExport2 format.

For example, mnemonics can trivially be deduplicated, and different instructions can refer to the same mnemonic indices. For example, pop eax and pop ebx might both refer to same mnemonic Mnemonic { name="pop" }. In fact, the proto documentation explicitly calls this out:

binexport/binexport2.proto

Lines 209 to 211 in 39f6445

// Index into the mnemonic array of strings. Used for de-duping the data.
// The default value is used for the most common mnemonic in the executable.
optional int32 mnemonic_index = 3 [default = 0];

Likewise, Expressions and Operands can be deduplicated in a similar way.

My primary question is: can instructions be deduplicated?

The protobuf documentation says so:

// Multiple levels of deduping have been applied to make the format more compact
// and avoid redundant data duplication. Some of this due to hard-earned
// experience trying to cope with intentionally obfuscated malicious binaries.
// Note in particular that the same instruction may occur in multiple basic
// blocks and the same basic block in multiple functions (instruction and basic
// block sharing). Implemented naively, malware can use this to cause
// combinatorial explosion in memory usage, DOSing the analyst. This format
// should store every unique expression, mnemonic, operand, instruction and
// basic block only once instead of duplicating the information for every
// instance of it.

So, perhaps that answers my question. However, I want to confirm that this implies that there is not a 1:1 mapping from instruction index to instruction address, since the same deduplicated instruction could be found in multiple basic blocks. I know the address is not stored within each instruction.

For example, as we process protobuf files with capa, our initial implementation builds an index from instruction index to instruction address, so that we can quickly recover function name -> basic block -> instruction -> address (and the reverse). But, if we cannot rely on a 1:1 mapping, we'll reorganize our index a bit.

(note, I suppose the first instruction of a function/vertex will be at the start of a basic block, so the first instruction will necessarily have the address set explicitly, which is useful. But I'd like to confirm the assumptions (and update the documentation) so BinExport2 consumers don't make mistakes).

@williballenthin
Copy link
Author

williballenthin commented Apr 24, 2024

Assuming that instructions can be deduplicated, and therefore there's not a 1:1 mapping between instruction index and instruction address, then I think there's an ambiguity in the Reference infrastructure:

binexport/binexport2.proto

Lines 273 to 297 in 39f6445

// Generic reference class used for address comments (deprecated), string
// references and expression substitutions. It allows referencing from an
// instruction, operand, expression subtree tuple to a de-duped string in the
// string table.
message Reference {
// Index into the global instruction table.
optional int32 instruction_index = 1;
// Index into the operand array local to an instruction.
optional int32 instruction_operand_index = 2 [default = 0];
// Index into the expression array local to an operand.
optional int32 operand_expression_index = 3 [default = 0];
// Index into the global string table.
optional int32 string_table_index = 4;
}
message DataReference {
// Index into the global instruction table.
optional int32 instruction_index = 1;
// Address being referred.
optional uint64 address = 2;
}

A Reference connects an instruction (and optionally operand/expression) to a string or global address, and it does so via an instruction index. However, not all instances of the same instruction may reference the same string/global address. For example, consider the instruction mov ebx, [eax]. In some scenarios, the binary analysis framework may be able to resolve where eax points to, and in other cases not, or may resolve it to a different address. But because the reference is keyed on the instruction index, rather than instruction address, I don't see a way to disambiguate which instance of the instruction references the global address. Do you have any insight or ideas?

@cblichmann cblichmann added the question Further information is requested label Apr 24, 2024
@cblichmann
Copy link
Member

cblichmann commented Apr 24, 2024

I think there might be a bit of confusion here about what is meant by the de-duplication. True, an instruction can be part of multiple basic blocks and basic blocks and be part of multiple functions. So in order to avoid a storage/memory blow-up, functions only store a list of basic block indices and basic blocks only store instruction index ranges.

So in the end, there still is a 1:1 mapping between an instruction and its address in the binary. However, we only actually store the address if it is

  • the entry-point of a function,
  • the first instruction of a basic block (usually indicated by not having "flow")
  • there exists padding space between the instruction and the next one (as determined by the disassembler)

Now if you have a Reference to an Instruction via instruction_index, you will need to compute the instruction's address if that is not stored (using GetInstructionAddress()).

OTOH, we are directly de-duping mnemonics and strings and only store those once.

@williballenthin
Copy link
Author

williballenthin commented Apr 24, 2024

Thanks for the explanation @cblichmann! I understand that there's an implicit guarantee of a 1:1 mapping between instruction indices and addresses. That makes some things much simpler.

Is it ok for BinExport2 exporters to de-duplicate the storage for Expressions and Operands? I don't think there are address-like details that may leak into this abstraction. Do you know of any BinExport2 consumers that would break on with this assumption?

If its all the same to you, it would be nice to be able to do this. In one I-promise-I-didn't-cherry-pick example, the five most common operands accounted for around 75% of all operands in a substantial program:

https://github.com/mandiant/capa/blob/dc8c7e8861b6d4d6eeef9c03f62b7e1728600de6/scripts/inspect-binexport2.py#L166-L170

Deduplicating the storage of Expressions and Operands would lead to a smaller file size (but so would gzipping the file, so this isn't very compelling). More importantly, tools like capa that process all operands looking for particular values can save a huge amount of time by only considering the unique Expressions and Operands.

Exporters that deduplicate Expressions and Operands could enable better performance for some consumers, while exporters that don't deduplicate are just fine and remain compatible, too.

--

I've been casually writing a BinExport2 exporter for my personal Rust-based binary analysis framework, and its a great learning experience. I appreciate a lot of the design decisions you've made with the format. Thanks for your patience as you answer my questions :-)

@cblichmann
Copy link
Member

[...]
Is it ok for BinExport2 exporters to de-duplicate the storage for Expressions and Operands? I don't think there are address-like details that may leak into this abstraction. Do you know of any BinExport2 consumers that would break on with this assumption?

I think you're right (not 100% sure, though, need to think about this more). But I don't think anyone will break because of this. BinDiff mostly does not care as long as it can render the expressions.

If its all the same to you, it would be nice to be able to do this. In one I-promise-I-didn't-cherry-pick example, the five most common operands accounted for around 75% of all operands in a substantial program:

https://github.com/mandiant/capa/blob/dc8c7e8861b6d4d6eeef9c03f62b7e1728600de6/scripts/inspect-binexport2.py#L166-L170

Deduplicating the storage of Expressions and Operands would lead to a smaller file size (but so would gzipping the file, so this isn't very compelling).

This is not too surprising, actually. This is also the reason we to the instruction histogram before assigning the mnemnonic ids. After all the distribution of instruction mnemonics is similar (hello mov :)).

More importantly, tools like capa that process all operands looking for particular values can save a huge amount of time by only considering the unique Expressions and Operands.

Exporters that deduplicate Expressions and Operands could enable better performance for some consumers, while exporters that don't deduplicate are just fine and remain compatible, too.

If we implement this, we should do it in the C++ binexport2_writer.cc as well in the Ghidra exporter. But maybe we hide it behind a feature flag for now? Care to create a feature request to track this?

[...]
I've been casually writing a BinExport2 exporter for my personal Rust-based binary analysis framework, and its a great learning experience.

Tell me more :)

I appreciate a lot of the design decisions you've made with the format. Thanks for your patience as you answer my questions :-)

That praise should go to Sören. I'll make sure to forward...

@mike-hunhoff
Copy link
Contributor

mike-hunhoff commented Apr 25, 2024

Thanks for the detailed explanations @cblichmann . I'm not seeing the following in practice:

So in the end, there still is a 1:1 mapping between an instruction and its address in the binary. However, we only actually store the address if it is
[...]
the first instruction of a basic block (usually indicated by not having "flow")
[...]

e.g. Ghidra's BinExport exporter logic (source):

// Write the full instruction address iff:
// - there is no previous instruction
// - the previous instruction doesn't have code flow into the current one
// - the previous instruction overlaps the current one
// - the current instruction is a function entry point

Specifically, "the previous instruction doesn't have code flow into the current one" would not record e.g. the address of the first instruction of a basic block that follows a conditional jump?

I'm seeing similar behavior in IDA's BinExport files but I couldn't easily verify. Can you please provide further guidance on this?

@cblichmann
Copy link
Member

Specifically, "the previous instruction doesn't have code flow into the current one" would not record e.g. the address of the first instruction of a basic block that follows a conditional jump?

The intend of phrasing it like this was to capture the situation of "new basic block started" as well as "instruction is a jump target" (might even be a jump into the middle of a function). In those cases one would expect BinExport to contain the instruction address.
Calling this "flow" is due to IDA Pro's FF_FLOW flag, which might be confusing.

@mike-hunhoff
Copy link
Contributor

mike-hunhoff commented Apr 26, 2024

Specifically, "the previous instruction doesn't have code flow into the current one" would not record e.g. the address of the first instruction of a basic block that follows a conditional jump?

The intend of phrasing it like this was to capture the situation of "new basic block started" as well as "instruction is a jump target" (might even be a jump into the middle of a function). In those cases one would expect BinExport to contain the instruction address. Calling this "flow" is due to IDA Pro's FF_FLOW flag, which might be confusing.

My main question here is whether or not BinExport is expected to record the address of the first instruction of all basic blocks? Depending on this answer, what I am seeing in the output may be considered either a bug or a feature.

@williballenthin
Copy link
Author

Based on what I understand @cblichmann said, I think you've found a bug and we should try to develop a test case. Do you see this behavior in all samples? I wonder why this hasn't been noticed in BinDiff before.

I wonder if consumers of BinExport aren't asserting the presence of the address field and are ending up with the default value.

@mike-hunhoff
Copy link
Contributor

Based on what I understand @cblichmann said, I think you've found a bug and we should try to develop a test case. Do you see this behavior in all samples? I wonder why this hasn't been noticed in BinDiff before.

I wonder if consumers of BinExport aren't asserting the presence of the address field and are ending up with the default value.

I'm 4 for 4 samples with BinExport files that exhibit this behavior, including BinExport files generated by IDA and Ghidra, so I'd expect this to be an issue for most. I've opened #130 and #129. Happy to continue discussion there - I've already opened an internal draft for #130 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants