cecio/EMOTET-2020-Reversing

EMOTET: a State-Machine reversing exercise

Intro

Around the 20th of December 2020, one of the "usual" EMOTET email campaigns hit several countries. I was able to get some samples and decided to write this short analysis, to take a deep dive into some specific aspects of the malware itself.

In particular, I looked at how the malware was written and analyzed the interesting techniques it uses.

There is a very good analysis done by Fortinet in 2019, in which the first stage was also analyzed. My exercise focuses more on the second stage of a recent sample.

In this repository you will find all the DLLs, scripts and tools used for the analysis, along with the annotated Ghidra project file, which maps all my findings (API calls, program logic, etc). You can use this as a starting point for additional investigation. Enjoy ;-)

The Tools

The infection chain

EMOTET is usually spread through e-mail campaigns (in this case in Italian):

emotet_mail

This particular sample comes from what we can call the usual infection chain:

  1. delivery of an e-mail with a malicious zipped document
  2. once opened, the document runs an obfuscated powershell script and downloads the 2nd stage
  3. the 2nd stage (in form of a DLL) is then executed
  4. the 2nd stage establishes persistence and tries to connect to a C2

The initial triage

All the files used for this analysis are in the repository. The "dangerous" ones are password protected (with the usual pwd).

The DLL (sg.dll) has the following characteristics:

File Name: sg.dll    
Size:      340480 
SHA1:      b08e07b1d91f8724381e765d695601ea785d8276

This DLL exports a single function named RunDLL: once executed, it decrypts "in-memory" an additional DLL. This one, dumped as dump_1_0418.bin, is the target of my analysis:

File Name:  dump_1_0418.bin
Size:       122880
SHA1:       57cd8eac09714effa7b6f70b34039bbace4a3e23

DLL_exports

An initial overview of the dumped DLL immediately shows that it contains no visible strings and no imports, and a first look at the disassembly shows heavily obfuscated code. We need to do some work here.

I fired up Ghidra and started to snoop around. Starting from the only exported function, RunDLL, you quickly end up in FUN_10009716, where you can spot a main loop with a kind of "State-Machine":

ghidra_main

It looks like a given double-word (stored in ECX) controls what the program does. But this is convoluted and not easy to unroll, since nothing is in the clear. For example, if you try to isolate a library API call in x64dbg, you will face something like this:

xdbg_libcall

Every single API call is done this way: a bunch of MOV, XOR, SHIFT and PUSH instructions is followed by a call to xxx606F (first red box), which decodes the address of the function into EAX; that address is then called (second red box). The PUSH instructions just before the CALL EAX are the parameters, which are worth inspecting.

The same "state" approach is also used in several sub-functions, not only in the main loop. So, everything looks time consuming, and I'd like to find a way to get the high level picture of it.
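To make the idea concrete, here is a minimal Python sketch of the pattern: a loop that dispatches on a state value, where each handler returns the next state. The two state values are taken from the emulation trace in this README, but the handler names and what they do are purely hypothetical, and the real sample computes its next state in a far more obfuscated way.

```python
def run_state_machine(initial_state, handlers, max_steps=100):
    """Dispatch on a state value until no handler is registered for it."""
    state = initial_state
    trace = []
    for _ in range(max_steps):
        handler = handlers.get(state)
        if handler is None:
            break
        trace.append(state)
        # Each handler does its work and returns the next state (like ECX).
        state = handler()
    return trace


# Hypothetical handlers: names and behavior are illustrative only.
def alloc_context():
    return 0x05C80354


def load_libraries():
    return None  # no next state: the loop stops


HANDLERS = {
    0x1DE2D3E5: alloc_context,
    0x05C80354: load_libraries,
}

trace = run_state_machine(0x1DE2D3E5, HANDLERS)
print([hex(s) for s in trace])
```

Starting the loop from a different initial_state is the same trick as forcing ECX in a debugger or emulator: you jump straight to the handler you want to study.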

Speakeasy

This tool is a little gem: Speakeasy can emulate the execution of user and kernel mode malware, allowing you to interact with the emulated code through quick Python scripts. What I wanted to do was map every single state of the machine (the ECX value in the main loop) to something more meaningful, like DLL API calls.

I had to work a bit to get what I wanted:

  • the emulation was failing at more than one point with invalid reads. I investigated the reason a bit and saw that sometimes the CALL EAX at some locations was not valid (EAX set to 0). I decided to take the easy way and just skip these calls
  • I had to modify the call to a specific API (CryptStringToBinary)
  • I mapped the machine state
  • I added a --state switch to control the flow of the emulation. You can use it to explore all the states (e.g. --state 0x167196bc). You may encounter errors if the required parts are not initialized, but you can reconstruct the proper flow by looking at the Ghidra decompilation
  • in a second iteration, knowing where strings are decrypted, I added a dump of all the strings in the clear (see the following sections)

Then the execution of the final script (python emu_emotetdll.py -f sg.dll) gave me something very interesting. The list of the imported DLLs (with related addresses):

0x10017a4c: 'kernel32.LoadLibraryW("advapi32.dll")' -> 0x78000000
0x10017a4c: 'kernel32.LoadLibraryW("crypt32.dll")' -> 0x58000000
0x10017a4c: 'kernel32.LoadLibraryW("shell32.dll")' -> 0x69000000
0x10017a4c: 'kernel32.LoadLibraryW("shlwapi.dll")' -> 0x67000000
0x10017a4c: 'kernel32.LoadLibraryW("urlmon.dll")' -> 0x54500000
0x10017a4c: 'kernel32.LoadLibraryW("userenv.dll")' -> 0x76500000
0x10017a4c: 'kernel32.LoadLibraryW("wininet.dll")' -> 0x7bc00000
0x10017a4c: 'kernel32.LoadLibraryW("wtsapi32.dll")' -> 0x63000000
...

and a lot of API calls, mapped to the machine state:

[+] State: 1de2d3e5
0x10010ba0: 'kernel32.GetProcessHeap()' -> 0x7280
0x10018080: 'kernel32.HeapAlloc(0x7280, 0x8, 0x4c)' -> 0x72a0
[+] State: 5c80354
0x10010ba0: 'kernel32.GetProcessHeap()' -> 0x7280
0x10018080: 'kernel32.HeapAlloc(0x7280, 0x8, 0x20)' -> 0x72f0
0x10017a4c: 'kernel32.LoadLibraryW("advapi32.dll")' -> 0x78000000
0x10010ba0: 'kernel32.GetProcessHeap()' -> 0x7280
0x10014b3a: 'kernel32.HeapFree(0x7280, 0x0, 0x72f0)' -> 0x1
0x10010ba0: 'kernel32.GetProcessHeap()' -> 0x7280
...

This list was not complete (because I skipped on purpose some failing calls and probably some calls were not correctly intercepted), but it gave me an overall picture of what was going on. Thanks FireEye!
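If you want to post-process this output, a small parser is enough to turn it into a state-to-API map. The sketch below assumes the log keeps exactly the line format shown above ("[+] State: <hex>" followed by Speakeasy-style API lines); the function name and the regular expressions are mine, not part of emu_emotetdll.py.

```python
import re
from collections import defaultdict

STATE_RE = re.compile(r"^\[\+\] State: ([0-9a-fA-F]+)$")
CALL_RE = re.compile(r"^0x[0-9a-fA-F]+: '(\w+)\.(\w+)\(")


def group_calls_by_state(lines):
    """Return {state: [module.Function, ...]} from the emulation log."""
    calls = defaultdict(list)
    state = None
    for line in lines:
        line = line.strip()
        m = STATE_RE.match(line)
        if m:
            state = int(m.group(1), 16)
            continue
        m = CALL_RE.match(line)
        # API lines seen before the first state marker are ignored.
        if m and state is not None:
            calls[state].append(f"{m.group(1)}.{m.group(2)}")
    return dict(calls)


SAMPLE = """\
[+] State: 1de2d3e5
0x10010ba0: 'kernel32.GetProcessHeap()' -> 0x7280
0x10018080: 'kernel32.HeapAlloc(0x7280, 0x8, 0x4c)' -> 0x72a0
[+] State: 5c80354
0x10010ba0: 'kernel32.GetProcessHeap()' -> 0x7280
0x10017a4c: 'kernel32.LoadLibraryW("advapi32.dll")' -> 0x78000000
""".splitlines()

for state, names in group_calls_by_state(SAMPLE).items():
    print(f"{state:#x}: {', '.join(names)}")
```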

Mapping

With the help of the Speakeasy output and a combination of dynamic and static analysis (done with x64dbg and Ghidra), I was able to reconstruct the main flows of the malware. Keep in mind that these flows are not complete: they are high-level snapshots of what is going on for some (not all) of the "states". I'm sure something is missing. This is the "main" flow:

emotet_diag_general

Then we have the "Persistency" flow (the yellow boxes are the interesting ones):

emotet_diag_persistency

And the initial "C2" communication flow:

emotet_diag_C2

Not all the states were explored. I focused on persistence and initial C2. The great thing about this approach is that you can now alter the execution flow by setting ECX to the value of the state you want to explore or execute.

I added a lot of details to the Ghidra file by renaming the API calls and inserting comments. Every number reported in the graphs (e.g. 19a) appears in the comments, so you can easily track the corresponding code section.

Ghidra_commented

I renamed the functions using this convention:

  • a single underscore in front of API calls
  • a double underscore in front of internal function calls

Interesting findings: encrypted strings

All the strings are encrypted in a BLOB located, in this particular dumped sample, at 0x1C800:

encrypted_blob

The green box is the XOR key and the yellow one is the length of the string. The functions used to perform the decryption are __decrypt_buffer_string_FUN_10006aba and __decrypt_headers_footer_FUN_100033f4:

Ghidra_6aba

Every single string is decrypted and then removed from memory after use. This is true even for C format strings, so you will not find anything in memory if you try to inspect the mapped sections at runtime.

As said before, I added a specific section in the Speakeasy script to dump those strings.
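For readers who prefer to decrypt an entry offline, the following is a rough sketch of an XOR scheme of this kind. The exact layout is an assumption on my side and should be checked against the Ghidra decompilation: I assume a 4-byte little-endian key, followed by a 4-byte length that is itself XORed with the key, followed by the data XORed with the repeating key. The helper names are hypothetical, and encrypt_entry exists only to build test data.

```python
import struct


def xor_with_key(data: bytes, key: int) -> bytes:
    """XOR data with a repeating 4-byte little-endian key."""
    key_bytes = struct.pack('<I', key)
    return bytes(c ^ key_bytes[i % 4] for i, c in enumerate(data))


def decrypt_entry(blob: bytes, offset: int = 0) -> bytes:
    """Decrypt one entry, assuming [key][length ^ key][data ^ key]."""
    key, masked_len = struct.unpack_from('<II', blob, offset)
    length = key ^ masked_len
    start = offset + 8
    if length > len(blob) - start:
        raise ValueError('length field does not fit in the blob')
    return xor_with_key(blob[start:start + length], key)


def encrypt_entry(plaintext: bytes, key: int) -> bytes:
    """Build an entry in the assumed layout (only used for the demo)."""
    header = struct.pack('<II', key, key ^ len(plaintext))
    return header + xor_with_key(plaintext, key)


entry = encrypt_entry(b'%s.%s', 0x1D2C3B4A)
print(decrypt_entry(entry))
```

If the decrypted length looks absurd for the real blob, the length field is probably stored differently from what is assumed here.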

Interesting findings: list of C2 servers

The C2 IPs are dumped from the same BLOB (in this case at 0x1CA00) just after the decryption in step 20a.

ip_list

As stated in the Fortinet analysis, this list is made of IPs (green box) and ports (yellow box). You can decode the whole list by piping this part of the binary into the following Python code:

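import sys
import struct

# Each entry is 8 bytes: the IPv4 address (bytes in reverse order),
# a little-endian 16-bit port, and 2 more bytes that are not used here.
b = sys.stdin.buffer.read()

# Stop before a trailing partial entry to avoid an IndexError.
for x in range(0, len(b) - 7, 8):
    port = struct.unpack('<H', b[x+4:x+6])[0]
    print('%u.%u.%u.%u:%u' % (b[x+3], b[x+2], b[x+1], b[x], port))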

You can find the full extracted list in the IoC section.

Interesting findings: persistence

This particular sample obtains persistence by installing a system service. This campaign deployed different versions of the DLL that also use different techniques: the Run registry key is one of them.

The section installing the service is 20a (state 0x204C3E9E). The high-level steps are the following:

  • decrypt the format string %s.%s

  • generate random characters to build the service name (which results in something like xzyw.qwe)

  • get one random "Service Description" from the existing ones, and use it as description of the new service

Interesting findings: encrypted communications with C2

In section 8a (state 0x1C904052) we can spot the loading of an RSA public key:

RSA_key

After this we have a call to CryptGenKey with the algorithm CALG_AES_128. So it looks like the sample is going to use a symmetric key to encrypt the communication.

In section 20a (state 0x386459ce) we see how the communication is encrypted:

  • CryptGenKey
  • CryptEncrypt of the buffer to send, with the previous key
  • CryptExportKey encrypted with the RSA public key
  • the exported and encrypted symmetric key is then prepended to the buffer sent via HTTP
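When a session key is exported under a public key, CryptExportKey returns a SIMPLEBLOB: an 8-byte BLOBHEADER, a 4-byte ALG_ID naming the algorithm that wrapped the key, and then the encrypted key bytes. The sketch below parses that structure, which is useful when you dump the exported buffer in a debugger. I have not verified whether the sample sends the whole blob or only the encrypted key bytes, so treat the link to the HTTP payload as an open question. The function name and the 16-byte dummy key material are hypothetical.

```python
import struct

SIMPLEBLOB = 0x1
CALG_AES_128 = 0x660E
CALG_RSA_KEYX = 0xA400


def parse_simpleblob(blob: bytes) -> dict:
    """Split a CryptoAPI SIMPLEBLOB into header fields and encrypted key."""
    # BLOBHEADER: bType (1), bVersion (1), reserved (2), aiKeyAlg (4)
    b_type, b_version, _reserved, key_alg = struct.unpack_from('<BBHI', blob, 0)
    if b_type != SIMPLEBLOB:
        raise ValueError('not a SIMPLEBLOB (bType=%#x)' % b_type)
    # ALG_ID of the algorithm used to encrypt the session key
    (wrap_alg,) = struct.unpack_from('<I', blob, 8)
    return {
        'version': b_version,
        'key_alg': key_alg,
        'wrap_alg': wrap_alg,
        'encrypted_key': blob[12:],
    }


# Demo blob with dummy key material (real data comes from the debugger).
demo = struct.pack('<BBHII', SIMPLEBLOB, 2, 0, CALG_AES_128, CALG_RSA_KEYX)
demo += bytes(range(16))
info = parse_simpleblob(demo)
print(hex(info['key_alg']), hex(info['wrap_alg']), len(info['encrypted_key']))
```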

Wrap up

The analysis is far from complete: there are a lot of unexplored parts of the sample. In the end, my goal was to build a procedure that makes the analysis easier, even for different or future samples, so that the overall picture can be understood faster.

Appendix: IoC

C2 IP list:

118.38.110.192:80
181.136.190.86:80
167.71.148.58:443
211.215.18.93:8080
1.234.65.61:80
209.236.123.42:8080
187.162.250.23:443
172.245.248.239:8080
60.93.23.51:80
177.144.130.105:443
93.148.247.169:80
177.144.130.105:8080
110.39.162.2:443
87.106.46.107:8080
83.169.21.32:7080
191.223.36.170:80
95.76.153.115:80
110.39.160.38:443
45.16.226.117:443
46.43.2.95:8080
201.75.62.86:80
190.114.254.163:8080
12.162.84.2:8080
46.101.58.37:8080
197.232.36.108:80
185.94.252.27:443
70.32.84.74:8080
202.79.24.136:443
2.80.112.146:80
202.134.4.210:7080
105.209.235.113:8080
187.162.248.237:80
190.64.88.186:443
111.67.12.221:8080
5.196.35.138:7080
50.28.51.143:8080
181.30.61.163:443
103.236.179.162:80
81.215.230.173:443
190.251.216.100:80
51.255.165.160:8080
149.202.72.142:7080
192.175.111.212:7080
178.250.54.208:8080
24.232.228.233:80
190.45.24.210:80
45.184.103.73:80
177.85.167.10:80
212.71.237.140:8080
181.120.29.49:80
170.81.48.2:80
68.183.170.114:8080
35.143.99.174:80
217.13.106.14:8080
168.121.4.238:80
172.104.169.32:8080
111.67.12.222:8080
62.84.75.50:80
77.78.196.173:443
177.23.7.151:80
213.52.74.198:80
12.163.208.58:80
1.226.84.243:8080
113.163.216.135:80
188.225.32.231:7080
191.182.6.118:80
81.213.175.132:80
104.131.41.185:8080
152.169.22.67:80
185.183.16.47:80
192.232.229.54:7080
186.146.13.184:443
178.211.45.66:8080
122.201.23.45:443
70.32.115.157:8080
190.24.243.186:80
51.15.7.145:80
46.105.114.137:8080
81.214.253.80:443
192.232.229.53:4143
59.148.253.194:8080
191.241.233.198:80
181.61.182.143:80
190.195.129.227:8090
68.183.190.199:8080
138.97.60.140:8080
138.97.60.141:7080
137.74.106.111:7080
85.214.26.7:8080
71.58.233.254:80
94.176.234.118:443
188.135.15.49:80
80.15.100.37:80
82.76.111.249:443
155.186.9.160:80
189.2.177.210:443
