Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decompiler shows array indexing as index_var[0x1000] instead of DATA_0x1000[index_var] (because of incorrect dataflow inference) #6460

Open
fmagin opened this issue Apr 24, 2024 · 4 comments
Assignees
Labels
Feature: Decompiler Status: Triage Information is being gathered

Comments

@fmagin
Copy link
Contributor

fmagin commented Apr 24, 2024

Describe the bug

Decompiler produces C Pseudo Code and P-Code with flipped interpretation of the *(pointer+offset) pattern in some cases

0x20002600 is a location inside the memory space and has the label req_string which is correctly recognized in the same code snippet in other lines.

  memset(&req_string,0,0x50);
  pcVar2 = pcVar6;
  if ((((pcVar6 != (char *)0x0) && (pcVar2 = (char *)0x0, "" < pcVar5)) && (*pcVar6 == 'G')) &&
     ((pcVar6[1] == 'E' && (pcVar6[2] == 'T')))) {
    for (; pcVar2 < pcVar5; pcVar2 = pcVar2 + 1) {
      bVar1 = (pcVar6 + 5)[(int)pcVar2];
      if (((bVar1 == 0x20) || (bVar1 - 9 < 2)) || (pcVar2 == "")) break;
      pcVar2[0x20002600] = bVar1;
    }
  }
  pcVar2[0x20002600] = '\0';
  if (req_string == '\0') {
    strcpy(&req_string,"index.html");
  }

To Reproduce

Steps to reproduce the behavior:

  1. Import and analyze binary samr21_http.elf.zip
  2. Decompile the http_recv method
  3. For loop around lines 30-40 contains flipped indexing, and is followed by flipped indexing

Expected behavior

Ideally Ghidra should correctly decompile this as req_string[pcVar2] instead of pcVar2[0x20002600]

Attachments

samr21_http.elf.zip

Environment (please complete the following information):

  • OS: Linux
  • Java Version: 17
  • Ghidra Version: 11.0.3
  • Ghidra Origin: official GitHub distro

Additional context

This only happens if the RO optimizations are enabled, disabling them flips the order to the correct one (with two variables, instead of variable and 0x20002600)

Architecture for that specific function is ARM THUMB

@Wall-AF
Copy link

Wall-AF commented Apr 24, 2024

It seems Ghidra needs a little help (as is often the case). Try changing the type of pcVar2 to a numeric one.

@fmagin
Copy link
Contributor Author

fmagin commented Apr 24, 2024

I care specifically about the results the Ghidra decompiler generates by default without human intervention. Changing the type of pcVar2 does indeed fix this, but I still consider this a shortcoming in the default decompiler algorithms. IMO the fact that 0x20002600 is a known address should be used by the decompiler in some way to not treat it as an index.

@marcushall42
Copy link
Contributor

Well, what made ghidra think that pcVar2 was a char *? That was the thing that really got it down the wrong track. When it's trying to figure out *(A+B), if it thinks that A is a pointer and B is a number that might be a pointer, then it seems to be the right choice to consider B to be a number, since adding two pointers doesn't make sense..

I'm forever annoyed by ghidra going the other way, when it produces something like fcn((int)&DATA_400010) when it should be fcn(0x400010), but because it is within the address range, ghidra really wants to make it into a pointer.

Decompiling is inherently ambiguous, and ghidra isn't really a project to produce a decompiler to produce perfect code, but rather a tool that can be used to understand what a binary program is doing. To the extent that the decompiler usually produces readable enough code so that I can better understand the binary, and especially if I can work with the decompiler to help that along, then I personally consider ghidra a tremendous success.

@fmagin
Copy link
Contributor Author

fmagin commented Apr 29, 2024

Good counter example, I agree that basing this decision solely on something also being a valid address doesn't always work either.

I don't actually care about the decompiler output (because that is only for human use anyway), what I care about is the underlying analysis that transforms the P-Code:
(unique, 0x98300, 4) INT_ADD (register, 0x2c, 4) , (const, 0x20002600, 4) (if running the decompiler with the normalize simplification level)
into
(unique, 0x98300, 4) PTRADD (register, 0x2c, 4) , (const, 0x20002600, 4) , (const, 0x1, 4) (with the decompile simplification style)

I don't want perfect C code, I want reliable P-Code on which further analyses can be based which can rely on correct results in the previous stages.[0]

You are correct that this is related to the incorrect identification of pcVar2 as a char *, but the problem is actually deeper than this. If looking at the C Code produces by IDA we can see:

      g_hs = (int)a1;
      v7 = *(_BYTE **)(a3 + 4);
      g_pcb = a2;
      v21 = *(unsigned __int16 *)(a3 + 10);
      memset(req_string, 0, sizeof(req_string));
      v8 = (unsigned int)v7;
      if ( v7 )
      {
        v8 = 0;
        if ( v21 > 6 && *v7 == 71 && v7[1] == 69 && v7[2] == 84 )
        {
          while ( v21 > v8 )
          {
            v15 = (unsigned __int8)v7[v8 + 5];
            if ( v15 == 32 || (unsigned int)(v15 - 9) <= 1 || v8 == 79 )
              break;
            req_string[v8++] = v15;
          }
        }
      }
      req_string[v8] = 0;

The crucial difference here is the v8 = 0;, which claims that the value of v8 inside this branch is always 0.
This assignment isn't "real" because there is no assembly instruction related to it which directly zeros this register.
The reason why this is inserted in IDA is because the instruction
00007922 MOVS R3, R6
effectively sets R3 to 0 because at this line the only possible value for R6 is 0.

And Ghidra does correctly recognize that this value must always be 0 at this location!

The P-Code as displayed by the Graph Control Flow action shows this block as:
image

The u_10000106 = COPY.D #0x0 is the effect of this MOVS instruction.

The problem is that this register isn't recognized as being set, and instead only a unique variable is created.

The next time this unique variable is used is in the block at 0x000079ae with the P-Code:
image

The other possible value for the new R3 comes from a block that increments R3 and this increment block is dominated by the block at 0x000079ae.

I can't test this hypothesis easily, but I suspect that setting the value of R3 directly to 0, instead of just assigning 0 to a unique variable could solve this overall problem, because then it might be clear enough that there is no real data flow from the other variable that is correctly recognized as a char* into this new variable, and it's just a compiler that has optimized out the initial zeroing of a register, because it can always assume that it is 0 anyway, and then uses this as a new variable.

[0] This potentially means that I must rely on the normalize output instead of the decompile output if such incorrect transformations of the P-Code are considered an inherent risk of the decompile simplification. But if this is considered a bug, then this would make it viable to use the decompile style for our analyses. This is the main reason why I opened this issue in the first place

@fmagin fmagin changed the title Decompiler shows array indexing as index_var[0x1000] instead of DATA_0x1000[index_var] Decompiler shows array indexing as index_var[0x1000] instead of DATA_0x1000[index_var] (because of incorrect dataflow inference) Apr 30, 2024
@ryanmkurtz ryanmkurtz added Feature: Decompiler Status: Triage Information is being gathered labels Apr 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature: Decompiler Status: Triage Information is being gathered
Projects
None yet
Development

No branches or pull requests

5 participants