Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decompiler Core Improvements #42

Open
c3rb3ru5d3d53c opened this issue Apr 26, 2022 · 15 comments · Fixed by #131
Open

Decompiler Core Improvements #42

c3rb3ru5d3d53c opened this issue Apr 26, 2022 · 15 comments · Fixed by #131
Assignees
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@c3rb3ru5d3d53c
Copy link
Owner

  • Recursive (rework - yes or no)
  • Coverage for C++ (vtable)
  • Delphi
  • Rust
  • Go
  • VB6 (stop don't run)
@c3rb3ru5d3d53c c3rb3ru5d3d53c added the help wanted Extra attention is needed label Apr 26, 2022
@c3rb3ru5d3d53c c3rb3ru5d3d53c added this to the v1.1.1 milestone Apr 26, 2022
@herrcore
Copy link
Collaborator

herrcore commented Apr 27, 2022

Discovery

Before attempting to add additional disassembly passes we need to do a bit of discovery and documentation first. Mainly we want to have a documentation that explains exactly what coverage we expect, how the basic blocks are separated into a control flow graph (CFG), and how the wild carding is applied. Changes to either the coverage, the CFG, or the wildcarding will make all future traits generated incompatible with past traits so it is important that these not be changed often, and that they be well documented.

General Questions To Answer

  • What sections of a file are submitted for disassembly? Does this depend on the execution permissions?
  • How are the basic blocks parsed out for a trait. Are they split into a normalized CFG or is there overlap?
  • Should the disassembler be responsible for detecting non-native code or should this be handled elsewhere? How do we determine if this is VB6?
  • How is wildcarding applied?
    • Are all immediates removed or only memory references?
    • Can an argument be added to control this?

Coverage

  • Under with pure c code what coverage do we expect? Are there any known limitations ?
  • Is it possible to specify new disassembly start points? Possibly extend this to allow for PDB data later?
  • What sort of coverage is provided for compiled languages that result in non-referenced functions (ie. vtables)?
    • Coverage for C++ with inheritance
    • Coverage for Delphi
    • Coverage for GO
    • Coverage for Rust

Test Methodology

In addition to understand what we expect in coverage we will also want to generate some test data so we can compare before/after. One possible approach is to add some code that will give us a map (json dictionary) of the identified functions and the basic blocks contained within. This could then be used to first generate a "current coverage" data set that could be compared both against other disassembly tools, as well as our final changes.

Research

The following are some open source implementations of disassemblers with decent coverage.

@c3rb3ru5d3d53c
Copy link
Owner Author

c3rb3ru5d3d53c commented Apr 27, 2022

Q: What sections of a file are submitted for disassembly? Does this depend on the execution permissions?
A: Right now only sections marked as executable are extracted and submitted for disassembly.
Reference:

if (it->characteristics() & (uint32_t)SECTION_CHARACTERISTICS::IMAGE_SCN_MEM_EXECUTE){

Q: Should the disassembler be responsible for detecting non-native code or should this be handled elsewhere? How do we determine if this is VB6?
A1: Non-native code should be handled with a different decompiler, if it is not supported by capstone we make another decompiler/lexer, which will provide use the functionality we need. Similar to cil.cpp. However, this is a lot of extra custom work and likely out of scope for the limited time we have.
A2: This maybe out of scope for this GeekWeek, however we can provide a yara signature to end users.

Q: How is wildcarding applied? Can an argument be added to control this?
A1: Right now it is applied only to memory references, however I do believe we should also add support for immediates.
A2: Yes, additional code can be added in a specific function to add this functionality, I can point you to it :)
Reference:

binlex/src/decompiler.cpp

Lines 565 to 583 in 0f84757

string Decompiler::WildcardInsn(cs_insn *insn){
string bytes = HexdumpBE(insn->bytes, insn->size);
string trait = bytes;
for (int j = 0; j < insn->detail->x86.op_count; j++){
cs_x86_op operand = insn->detail->x86.operands[j];
switch(operand.type){
case X86_OP_MEM:
{
if (operand.mem.disp != 0){
trait = WildcardTrait(HexdumpBE(insn->bytes, insn->size), HexdumpMemDisp(operand.mem.disp));
}
break;
}
default:
break;
}
}
return TrimRight(trait);
}

Q: Is it possible to specify new disassembly start points? Possibly extend this to allow for PDB data later?
A1: Yes, the initial disassembly starting points are all based on a C++ queue, this queue is started by the parser for each file format. This is already started for collecting DLL or shared library functions in the code.
A2: It is possible to extend for PDB data later, however I think it maybe out of scope for this GeekWeek
Reference:

binlex/src/pe.cpp

Lines 59 to 84 in 0f84757

void PE::ParseSections(){
uint index = 0;
it_sections local_sections = binary->sections();
for (auto it = local_sections.begin(); it != local_sections.end(); it++){
if (it->characteristics() & (uint32_t)SECTION_CHARACTERISTICS::IMAGE_SCN_MEM_EXECUTE){
sections[index].offset = it->offset();
sections[index].size = it->sizeof_raw_data();
sections[index].data = malloc(sections[index].size);
memset(sections[index].data, 0, sections[index].size);
vector<uint8_t> data = binary->get_content_from_virtual_address(it->virtual_address(), it->sizeof_raw_data());
memcpy(sections[index].data, &data[0], sections[index].size);
if (binary->has_exports()){
Export exports = binary->get_export();
it_export_entries export_entries = exports.entries();
for (auto j = export_entries.begin(); j != export_entries.end(); j++){
uint64_t tmp_offset = binary->rva_to_offset(j->address());
if (tmp_offset > sections[index].offset &&
tmp_offset < sections[index].offset + sections[index].size){
sections[index].functions.insert(tmp_offset-sections[index].offset);
}
}
}
}
index++;
}
}

Q: What sort of coverage is provided for compiled languages that result in non-referenced functions (ie. vtables)?
A: Currently we do not have any code that specifically looks for vtables, it could certainly be additional logic that can be added to the initial parsing of the file, which will then populate more offsets with their associated sections to fill the decompile queue.

@herrcore
Copy link
Collaborator

Basic blocks are not normalized (they overlap):

Simple example: pe.emotet.x86

BB at offset 121979 overlaps with BB at offset 120533. We don't need to normalize this but it might be nice to have a flag make it an option as it will be a large performance hit.

@herrcore
Copy link
Collaborator

herrcore commented May 3, 2022

Added the pe and elf entry points to the functions to be processes instead of just pushing the address of the start of the function b86a709

Improved coverage for pe.x86

Function coverage: 1.24% ( Missing 476 from total 482 ) -> Function coverage: 54.98% ( Missing 217 from total 482 )
Basic block coverage: 1.68% ( Missing 2979 from total 3030 ) -> Basic block coverage: 63.17% ( Missing 1116 from total 3030 )

@c3rb3ru5d3d53c c3rb3ru5d3d53c changed the title Decompiler Core Decompiler Core Improvements May 7, 2022
@c3rb3ru5d3d53c
Copy link
Owner Author

Leaving this one open just changing title to Improvements so we can track

@c3rb3ru5d3d53c c3rb3ru5d3d53c removed this from the v1.1.1 milestone May 7, 2022
@c3rb3ru5d3d53c
Copy link
Owner Author

Having a look into this now, hoping to make some improvements to the linear diassembler pass, reading the paper, maybe important to note the most useful reading is in the implementation section :)

@c3rb3ru5d3d53c c3rb3ru5d3d53c linked a pull request May 22, 2022 that will close this issue
@c3rb3ru5d3d53c
Copy link
Owner Author

c3rb3ru5d3d53c commented May 23, 2022

image

if (IsCallInsn(cs_ins) == true && valid_block_count >= 1){
    CollectInsn(cs_ins, sections, index);
}

I got some improvement on coverage here doing this, not much but some.

Continuing to read the implementation.

@c3rb3ru5d3d53c
Copy link
Owner Author

image

This one appears to be done and implemented as the method IsSemanticNopInsn.

@c3rb3ru5d3d53c
Copy link
Owner Author

image

Would need to implement an IsReachable method for this.

@c3rb3ru5d3d53c c3rb3ru5d3d53c added the enhancement New feature or request label May 23, 2022
@c3rb3ru5d3d53c
Copy link
Owner Author

I created disassembler.cpp and disassembler.h, this should ensure there is no confusion that we are doing disassembly and not decompilation. Once ready, we should be able to substitute the decompiler.cpp and decompiler.h files for this one instead.

I've performed some improvements where now the method CollectOperands also adds the next instruction to the queue as a block.

Function coverage: 49.59%  ( Missing 243 from total 482 )
Basic block coverage: 62.08%  ( Missing 1149 from total 3030 )
Basic block errors: 0
Extra blocks from binlex: 1619

Still having a decent amount of false positives.

The linear pass jmp_address_1 and jmp_address_2 should track their next instruction address to add to queue, I suspect this is why we are still missing more basic blocks.

@c3rb3ru5d3d53c c3rb3ru5d3d53c linked a pull request May 24, 2022 that will close this issue
@c3rb3ru5d3d53c
Copy link
Owner Author

Added the method IsInvalidNopInsn, as this was in nuculeus because Visual Studio sometimes places semantic nops at the function start and Visual Studio uses int3 for padding. Thus, I added the enum BINARY_TYPE. I also changed the places we are using typedef enum with the appropriate enum as parameters.

bool Disassembler::IsInvalidNopInsn(cs_insn *insn){
    return IsNopInsn(insn) ||
        (IsSemanticNopInsn(insn) && (file_reference.binary_type != BINARY_TYPE_PE)) ||
        (IsTrapInsn(insn) && (file_reference.binary_type == BINARY_TYPE_PE));
}

We still get false positives with this, so I'm continuing to investigate.

@c3rb3ru5d3d53c
Copy link
Owner Author

Well we are here now:

Function coverage: 50.41%  ( Missing 239 from total 482 )
Basic block coverage: 61.75%  ( Missing 1159 from total 3030 )
Basic block errors: 0
Extra blocks from binlex: 595

False positives down a lot now, after removing the extra else statement from LinearDisassemble.

@c3rb3ru5d3d53c
Copy link
Owner Author

I added a check for IsFunction as part of AddDiscoveredBlock, since the collection of traits will collect blocks from functions, it makes no sense to have any duplicates between blocks and functions. Should result in a small performance gain.

void Disassembler::AddDiscoveredBlock(uint64_t address, struct Section *sections, uint index) {
    if (IsVisited(sections[index].visited, address) == false &&
    address < sections[index].data_size &&
    IsFunction(sections[index].functions, address) == false) {
        if (sections[index].blocks.insert(address).second == true){
            sections[index].visited[address] = DISASSEMBLER_VISITED_QUEUED;
            sections[index].discovered.push(address);
        }
    }
}

@c3rb3ru5d3d53c
Copy link
Owner Author

image

I'll start looking at detecting switch jump tables, we can probably add this to the CollectOperands method, so instead of doing multiple passes, we should be able to make it part of our continual collection of blocks and functions.

@jbx81-1337
Copy link
Collaborator

jbx81-1337 commented May 25, 2022

Apply a wildcard of already-known function to others blocks:
After we extract all the "easy" functions we can generate the epilogue/prologue wildcards of these and apply the wildcards to all the others blocks, if we find a match we can tag a block as "epilogue" or "prologue" and use it to identify other functions.

@c3rb3ru5d3d53c c3rb3ru5d3d53c removed a link to a pull request Jun 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants