Detection for SUNBURST C2 Stage-1 using Shannon Entropy
Uses Shannon entropy to detect Domain-Generation Algorithms (DGA). Feed the function FQDNs only, either directly or read from a CSV file. SYNTAX: Prefix_or_subdomain.Domain.TLD

Made because APTs leverage long-string DGA in the C2 phase of the Kill Chain. The script identifies long-string URLs whose entropy is more than 2 standard deviations above the mean, which corresponds to roughly the upper 2.5% of URLs with long prefixes/subdomains. Extra metadata (e.g. prefix length) is saved in case CND analysts need it.
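The two-standard-deviation cutoff described above can be sketched as follows. This is a minimal illustration, not the code in shannon.py: the function names (`shannon_entropy`, `split_prefix`, `flag_outliers`) are hypothetical, entropy is computed from each prefix's own character counts, and the prefix split naively assumes a single-label TLD (so names under something like co.uk would be split incorrectly).

```python
import math
from collections import Counter
from statistics import mean, stdev


def shannon_entropy(text):
    """Shannon entropy in bits per character, from the string's own character counts."""
    if not text:
        return 0.0
    counts = Counter(text.lower())
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def split_prefix(fqdn):
    """Return everything left of Domain.TLD (assumption: the TLD is one label)."""
    labels = fqdn.strip().rstrip(".").split(".")
    return ".".join(labels[:-2])


def flag_outliers(fqdns, sigmas=2.0):
    """Return (fqdn, prefix, prefix_length, entropy) rows above mean + sigmas * stdev."""
    rows = []
    for fqdn in fqdns:
        prefix = split_prefix(fqdn)
        if prefix:  # names with no prefix/subdomain are ignored
            rows.append((fqdn, prefix, len(prefix), shannon_entropy(prefix)))
    if len(rows) < 2:  # stdev needs at least two data points
        return []
    scores = [row[3] for row in rows]
    cutoff = mean(scores) + sigmas * stdev(scores)
    return [row for row in rows if row[3] > cutoff]
```

Calling `flag_outliers(open("domains.txt"))` would return the flagged rows together with the prefix length, which matches the kind of extra metadata mentioned above.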
- Python 3.9+
- Python exists in PATH variable
- All imports listed in the script are installed (pip install, if necessary)
- Input DNS log has 1x FQDN per line (no other punctuation)
- Cisco Top-1M (CSV, ZIP) or Majestic (CSV) Top-1M exist in same folder as the script.
- Install Python 3.9.x
- Stage all input files with correct formatting (domains in a text file, 1 per line, no punctuation)
- Ensure input files are in the same directory/folder as this script.
- Run the script: (Windows) open this script in Python IDLE or Visual Studio Code; (Linux) $~: python3 shannon.py
- Enter the input DNS log file to examine.
- Enter the Cisco Umbrella (CSV/ZIP) or Majestic Top-1M file (CSV) to leverage.
- Wait for the Frequency and Shannon Entropy values to generate.
- Examine the output files.
- Install Python 3.9.x
- Stage all input files with correct formatting (domains in a text file, 1 per line, no punctuation)
- Ensure input files are in the same directory/folder as this script.
- Run the script: (Windows) open this script in Python IDLE or Visual Studio Code; (Linux) $~: python3 shannon.py
- Input Log to analyze is "domains.txt", see VARIABLES.
- Pick Top-N file to build dictionary probability, see VARIABLES.
- Wait for the Frequency and Shannon Entropy values to generate.
- Examine the output files.
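Building the character-probability dictionary from a Top-N file could look roughly like the sketch below. It is not the script's actual loader: the function names are hypothetical, and the file layouts are assumptions (Cisco Umbrella rows as `rank,domain` with no header, optionally inside a ZIP; Majestic as a CSV with a header row containing a `Domain` column). Dots are treated as separators and left out of the counts, which is also an assumption.

```python
import csv
import io
import zipfile
from collections import Counter


def _parse(handle):
    """Yield lowercase domains from an open CSV text handle."""
    reader = csv.reader(handle)
    first = next(reader, None)
    if first is None:
        return
    if "Domain" in first:          # assumed Majestic-style header row
        col = first.index("Domain")
    else:                          # assumed Umbrella-style "rank,domain" rows
        col = 1
        if len(first) > col:
            yield first[col].lower()
    for row in reader:
        if len(row) > col:
            yield row[col].lower()


def iter_domains(path):
    """Yield domains from a plain CSV or from the first member of a ZIP archive."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            with archive.open(archive.namelist()[0]) as raw:
                yield from _parse(io.TextIOWrapper(raw, encoding="utf-8", newline=""))
    else:
        with open(path, encoding="utf-8", newline="") as handle:
            yield from _parse(handle)


def char_probabilities(domains):
    """Map each character (dots excluded) to its relative frequency across all domains."""
    counts = Counter()
    for domain in domains:
        counts.update(domain.replace(".", ""))
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}
```

For example, `char_probabilities(iter_domains("top-1m.csv.zip"))` would return a dictionary whose values sum to 1; the file name here is only an illustration.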
Assumptions:
- Capital and lowercase letters have equal probability of appearance. This neglects that TOR nodes alter case-sensitivity in DNS queries, as well as 0x20-encoding.
- The ETAOIN character frequency chart is insufficient because it is missing 0-9, "-", and "_". Domains can contain hyphens (-), underscores (_), and numbers.
- Not designed to counter ExploderBot DGA.
- Input: simple text file, each URL is on its own row - nothing else.
- Input is composed of strictly URL strings w/o prohibited chars.
- Input and output files must be in the same folder as this script.
- Python3 is in PATH variable.
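The input assumptions above (one FQDN per line, no prohibited characters) can be checked before scoring. The sketch below is an illustration only: `is_valid_fqdn` is a hypothetical helper, and its rules are assumptions (labels of 1 to 63 characters drawn from letters, digits, hyphen, and underscore; at least two labels; at most 253 characters overall; a trailing dot is rejected).

```python
import re

# One DNS label: letters, digits, hyphen, underscore, 1 to 63 characters.
LABEL = re.compile(r"[A-Za-z0-9_-]{1,63}")


def is_valid_fqdn(line):
    """Return True if the line looks like Prefix.Domain.TLD with no stray characters."""
    name = line.strip()
    if not name or len(name) > 253:
        return False
    labels = name.split(".")
    return len(labels) >= 2 and all(LABEL.fullmatch(label) for label in labels)
```

Lines that fail the check (for example ones that still carry a scheme such as `http://`) can be logged and skipped rather than scored.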
Data set used was derived from Pi-Hole query log (12 hr window) extracted from SQLite.
$~: sqlite3 /etc/pihole/pihole-FTL.db "SELECT domain FROM queries WHERE timestamp >='$(($(date +%s) - 43200))'" > domains.txt
Data was deduped using Excel > Data > Remove Duplicates. Entries for reverse lookups (PTR) under in-addr.arpa were also removed.
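The same cleanup can be done without Excel. The following is a minimal sketch, assuming one name per line; `clean_domains` is a hypothetical helper, and lowercasing names before comparing them is an added assumption (in line with treating upper and lower case as equivalent).

```python
def clean_domains(lines):
    """Drop blank lines, in-addr.arpa (PTR) names, and duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for line in lines:
        name = line.strip().rstrip(".").lower()
        if not name:
            continue
        if name == "in-addr.arpa" or name.endswith(".in-addr.arpa"):
            continue  # reverse-lookup entry
        if name not in seen:
            seen.add(name)
            cleaned.append(name)
    return cleaned
```

Usage could be as simple as reading the raw export with `open("domains.txt")`, passing the file object to `clean_domains`, and writing the result back out one name per line.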
Leveraging RedCanary's results for DGA analysis, this script takes each URL and calculates its Shannon entropy.
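One plausible way to score a string against a fixed character-probability table is sketched below. This is an assumption about the approach, not a copy of shannon.py: `table_entropy` is a hypothetical name, and the table is a uniform placeholder over a-z, 0-9, "-" and "_". It does not contain RedCanary's published values, which would replace it in practice.

```python
import math
import string

# Placeholder table: uniform probability over characters a label may contain.
# These are NOT RedCanary's values; substitute a real table before use.
ALPHABET = string.ascii_lowercase + string.digits + "-_"
PLACEHOLDER_PROBS = {ch: 1 / len(ALPHABET) for ch in ALPHABET}


def table_entropy(text, probs=PLACEHOLDER_PROBS):
    """Sum -p(c) * log2(p(c)) over the characters of text, with p taken from the table."""
    score = 0.0
    for ch in text.lower():
        p = probs.get(ch)
        if p:  # characters missing from the table (e.g. ".") are skipped
            score -= p * math.log2(p)
    return score
```

A score defined this way grows with string length, so it is worth keeping the prefix length next to the score when comparing results.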
- shannon.py (1.0) - Get calculations to work with static RedCanary Dictionary.
- shannon.py (1.1) - Minor testing merged into 1.0.
- dictionary.py (1.0) - Test opening/closing Cisco + Majestic Top-1M.
- shannon.py (2.0) - Combining 1.0 w/ dictionary.py tested functions.
Splunk Shannon Entropy - where the idea came from https://www.splunk.com/en_us/blog/tips-and-tricks/when-entropy-meets-shannon.html
SANS Mark Baggett - Tool RedCanary used to analyze Alexa Top 1M https://github.com/markbaggett/freq
RedCanary - Blog where Probability scores come from https://redcanary.com/blog/threat-hunting-entropy/
Alexa's Top 1M Domains - Data Corpus used by RedCanary https://www.alexa.com/topsites
DGA Detector https://github.com/exp0se/dga_detector
Shannon Entropy - Formula https://towardsdatascience.com/the-intuition-behind-shannons-entropy-e74820fe9800