ucsdsysnet/mx_inference

MX Inference

Supported Python versions: 3.6, 3.7, 3.8

MX Inference is a tool that infers the mail provider of a domain from DNS data and active port-scanning data. The inference starts from the MX records that the domain publishes.

Installation

  1. Clone our repository.
git clone https://github.com/ucsdsysnet/mx_inference.git
  2. Install OpenSSL and associated development libraries (Reference).
# CentOS:
$ yum install openssl-devel libffi-devel

# Ubuntu:
$ apt-get install libssl-dev libffi-dev

# OS X (with Homebrew installed):
$ brew install openssl
  3. Install the Python libraries.
# Optionally: start a virtual environment
python3 -m venv .
source bin/activate

# Mandatory
python3 -m pip install -r requirements.txt

Tested Version

We tested our code with Python 3.6 and 3.8 and with OpenSSL 1.1.1i and 1.0.2g, on Ubuntu and macOS respectively.

Usage

Default Setup

Several example domains are built in. Run the following command to probe them (this may take a while):

python3 demo_mx_inference.py

Sample output:

...
Domain: ucsd.edu
	ID:pphosted.com, Type:OK, Source:TLS, Debug MSG:Cert ID Ok, Conf_Score:1
		Suggested Company: ProofPoint
	...

Domain: netflix.com
	ID:google.com, Type:OK, Source:TLS, Debug MSG:Cert ID Ok, Conf_Score:2
		Suggested Company: Google
	...

Domain: gsipartners.com
	ID:google.com, Type:OK, Source:TLS, Debug MSG:Cert ID Ok, Conf_Score:2
		Suggested Company: Google
	...

# lodi.gov has two IDs because it has two MX records with same priority
Domain: lodi.gov
	ID:iphmx.com, Type:OK, Source:TLS, Debug MSG:Cert ID Ok, Conf_Score:1
		Suggested Company: Cisco
	
	ID:iphmx.com, Type:OK, Source:TLS, Debug MSG:Cert ID Ok, Conf_Score:1
		Suggested Company: Cisco
	...

Domain: jeniustoto.net
	Infered Provider ID: N/A


Domain: sgnetway.net
	ID:sgnetway.net, Type:OK, Source:Banner/EHLO, Debug MSG:Banner/EHLO ID Ok, Conf_Score:1
	...

Domain: bbw-chan.nl
	ID:bbw-chan.nl, Type:OK, Source:MX, Debug MSG:MX RD Ok, Conf_Score:1
	...

Domain: utexas.edu
	ID:utexas.edu, Type:OK, Source:TLS, Conf_Score:1
		Heuristics suggests new provider id: iphmx.com, reason: All IPs are in Cisco Ironport's AS
		Suggested company: Cisco
	...

Domain: summitorganization.org
	ID:secureserver.net, Type:OK, Source:Banner/EHLO, Conf_Score:1
		Heuristics suggests that this provider id might NOT be accurate, reason: FQDN (s132-148-130-121.secureserver.net) used by Banner/EHLO Indicates Potentially VPS
	...

Domain: arfonts.net
	ID:ovh.net, Type:OK, Source:Banner/EHLO, Conf_Score:1
		Heuristics suggests that this provider id might NOT be accurate, reason: FQDN (vps797297.ovh.net) used by Banner/EHLO Indicates Potentially VPS
	...
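
The lodi.gov comment in the sample output says a domain gets one ID per MX record when several records share the same priority. The sketch below illustrates one way such a selection step could look. It assumes records arrive as (preference, hostname) pairs and that a lower preference value means higher priority, as in standard DNS MX semantics. The function name and the example hostnames are hypothetical and do not come from this repository.

```python
from typing import List, Tuple


def top_priority_mx(records: List[Tuple[int, str]]) -> List[str]:
    """Return every MX host that shares the lowest preference value.

    Hypothetical helper for illustration; not part of mx_inference.
    """
    if not records:
        return []
    # Lower preference value = more preferred mail exchanger.
    best = min(pref for pref, _ in records)
    # Normalize hostnames: drop the trailing root dot and lowercase them.
    return sorted(
        host.rstrip(".").lower() for pref, host in records if pref == best
    )


# Made-up records: two hosts tie at preference 10, one backup at 20.
example = [
    (10, "mx1.example.net."),
    (10, "mx2.example.net."),
    (20, "backup.example.org."),
]
print(top_priority_mx(example))
```

With two records tied at the best preference, both hosts are returned, which matches the two-ID shape of the lodi.gov output.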

All scanning results are saved by default with the following name pattern under the project directory:

mx_inference-data-timestamp.csv

You can load locally saved data and run the inference on the domains it contains.

python3 demo_mx_inference.py --load_data_from_path=/path/to/saved/data
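
If several scans have been saved, a small helper can pick the most recent file to reload. This is a minimal sketch, assuming the files follow the mx_inference-data-timestamp.csv pattern described above. The timestamp format is not documented here, so the sketch orders files by modification time rather than by parsing the name. The function name is hypothetical, and whether --load_data_from_path expects the file itself or its directory is not stated in this README.

```python
from pathlib import Path
from typing import Optional


def latest_scan_file(project_dir: str) -> Optional[Path]:
    """Return the most recently modified saved scan file, or None.

    Hypothetical helper for illustration; not part of mx_inference.
    """
    # Match the documented name pattern; "*" stands in for the timestamp.
    candidates = list(Path(project_dir).glob("mx_inference-data-*.csv"))
    if not candidates:
        return None
    # Use modification time because the timestamp format is an unknown.
    return max(candidates, key=lambda p: p.stat().st_mtime)


print(latest_scan_file("."))
```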

Providers of domains

You can probe domains of your choice.

python3 demo_mx_inference.py --domains eng.ucsd.edu ucsd.edu

Sample output:

...
Domain: eng.ucsd.edu
	ID:google.com, Type:OK, Source:TLS, Debug MSG:Cert ID Ok, Conf_Score:1
		Suggested Company: Google
	...

Domain: ucsd.edu
	ID:pphosted.com, Type:OK, Source:TLS, Debug MSG:Cert ID Ok, Conf_Score:1
		Suggested Company: ProofPoint
	...
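
For post-processing, each result line in the output above can be split into its fields. The sketch below assumes the "Key:Value, Key:Value" layout seen in the sample output stays stable and that values never contain ", ". Both are assumptions drawn from the samples, not a documented format. The function name is hypothetical.

```python
from typing import Dict


def parse_result_line(line: str) -> Dict[str, str]:
    """Split one 'ID:..., Type:..., ...' result line into a dict.

    Hypothetical helper for illustration; not part of mx_inference.
    """
    fields: Dict[str, str] = {}
    for part in line.strip().split(", "):
        # Split on the first colon only, so values keep any later colons.
        key, sep, value = part.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


sample = "\tID:pphosted.com, Type:OK, Source:TLS, Debug MSG:Cert ID Ok, Conf_Score:1"
print(parse_result_line(sample))
```

Lines without a Debug MSG field, which also appear in the sample output, simply yield a dict without that key.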

More verbose debug information

This is helpful when you are not getting the results you expect.

# Additional debug info
python3 demo_mx_inference.py --debug=1

# Very verbose debug info
python3 demo_mx_inference.py --debug=2

Play with different data sources for inference

Make inference based on TLS and MX

python3 demo_mx_inference.py --disable_banner

Make inference based on Banner/EHLO and MX

python3 demo_mx_inference.py --disable_tls

Other Arguments

--disable_tls: do not use TLS certificates for inference
--disable_banner: do not use banner/EHLO information for inference
--disable_heuristics: do not apply any heuristics
--heuristics_threshold: apply heuristics when confidence score <= threshold
--disable_heuristics_as: do not use heuristics that are based on AS information
--disable_heuristics_tls_pattern: do not use heuristics that are based on TLS FQDN pattern
--disable_heuristics_banner_pattern: do not use heuristics that are based on Banner/EHLO FQDN pattern
--map_id_to_company: Try mapping provider id to company. Default: True
--save_scan_data: Save scanned data of domains. Default: True
--debug: Debug information level. 0 = Minimum, 1 = Light, 2 = Verbose. Default: 0.
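
When driving the demo script from another program, the flags above can be assembled programmatically. This is a minimal sketch that uses only flags shown in this README (--domains, --disable_tls, --disable_banner, --disable_heuristics, --debug). The wrapper function and its parameter names are hypothetical, and the sketch only builds the argument list without running anything.

```python
from typing import List, Sequence


def build_command(
    domains: Sequence[str] = (),
    disable_tls: bool = False,
    disable_banner: bool = False,
    disable_heuristics: bool = False,
    debug: int = 0,
) -> List[str]:
    """Build an argument list for demo_mx_inference.py.

    Hypothetical wrapper for illustration; not part of mx_inference.
    """
    if debug not in (0, 1, 2):
        raise ValueError("debug must be 0, 1, or 2")
    cmd = ["python3", "demo_mx_inference.py"]
    if domains:
        cmd += ["--domains", *domains]
    if disable_tls:
        cmd.append("--disable_tls")
    if disable_banner:
        cmd.append("--disable_banner")
    if disable_heuristics:
        cmd.append("--disable_heuristics")
    if debug:
        # Same --debug=N form as in the examples above.
        cmd.append(f"--debug={debug}")
    return cmd


print(build_command(["eng.ucsd.edu", "ucsd.edu"], disable_banner=True, debug=1))
```

The resulting list can be passed to subprocess.run from the repository root.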

Note

  • This tool is not designed for large-scale analysis. Please use third-party datasets instead (e.g., OpenINTEL and Censys).
  • This tool is not tested with IPv6.
  • This tool does NOT infer the eventual mail provider used by end users of a domain.
  • Take our heuristics with a grain of salt.

Project Structure

mx_inference
├── ...
├── lib/                       # Libraries
│   ├── certs/                 # CA/Intermediate Certificates 
│   ├── cert.py                # Handle certificates
│   ├── ds.py                  # Data structures
│   ├── extract_domain.py      # Extract domain from strings
│   ├── helper.py              # I/O helper functions
│   ├── heuristics.py          # Heuristics
│   ├── inference_funcs.py     # Inference functions
│   ├── network_lib.py         # Network functions
│   ├── preprocess.py          # Preprocessing certs
│   └── union_find.py          # Union find algorithm
├── config/                     
│   └── config.py              # Configurations
└── inference.py               # High-level wrapper for inferring mail providers of a domain

Extending This Work

  • How do I import my own data? If you have your own data and want to use our tool, you can add your own data-loading function to helper.py. An example function, load_domain_data_from_path_format_censys, can be found in helper.py. The data associated with each domain is loaded in demo_mx_inference.py by calling load_domain_data_from_path_format_censys.
  • How do I add other information for inference (e.g., rDNS of an IP)? If the information we use for inference (i.e., MX, Banner/EHLO, TLS) is not sufficient, you can modify the data structures defined in ds.py and the scanning functions defined in network_lib.py.
  • How do I add my own heuristics? You can find our heuristics in heuristics.py. Heuristic functions are used by the _infer_provider_id_for_mx_of_one_domain function in inference.py.
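
As a starting point for a custom heuristic, the sketch below flags hostnames that resemble the VPS examples in the sample output (s132-148-130-121.secureserver.net and vps797297.ovh.net). The regular expression is an assumption derived from those two examples only. It is not the pattern used in heuristics.py, and the function name is hypothetical. How a new heuristic plugs into _infer_provider_id_for_mx_of_one_domain is not shown in this README, so the sketch is standalone.

```python
import re

# Assumed patterns: "vps" followed by digits, or an IPv4 address written
# with dashes, optionally prefixed by "s". Derived from two examples only.
_VPS_LABEL = re.compile(r"(?:vps\d+|s?\d{1,3}(?:-\d{1,3}){3})", re.IGNORECASE)


def looks_like_vps(fqdn: str) -> bool:
    """Return True if the leftmost label of the FQDN looks VPS-generated.

    Hypothetical heuristic for illustration; not part of mx_inference.
    """
    first_label = fqdn.strip().rstrip(".").split(".")[0]
    return _VPS_LABEL.fullmatch(first_label) is not None


for name in ("vps797297.ovh.net", "s132-148-130-121.secureserver.net", "mail.example.com"):
    print(name, looks_like_vps(name))
```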

Cite Our Paper

@inproceedings{MxInfer,
  title={Who's Got Your Mail? Characterizing Mail Service Provider Usage},
  author={Liu, Enze and Akiwate, Gautam and Jonker, Mattijs and Mirian, Ariana and Savage, Stefan and Voelker, Geoffrey M.},
  booktitle={ACM Internet Measurement Conference (IMC'21)},
  publisher={ACM},
  year      = {2021},
  address   = {Virtual Event},
  month     = {November}
}

Bugs and Issues

This software is used and maintained for a research project and will likely have bugs and issues. To report a bug or issue, please use the GitHub Issue Page.
