Jarelllama's Scam Blocklist

Blocklist for newly created scam and phishing domains, retrieved automatically every day using the Google Search API, automated detection, and other public sources.

The automated retrieval is done daily at 19:00 UTC.

This blocklist aims to be an alternative to blocking all newly registered domains (NRDs), since many, but not all, NRDs are malicious. A variety of sources are integrated to detect new malicious domains shortly after their registration date.

In the last 30 days, more than 5,653¹ malicious NRDs were found.

Download

Format	Syntax
Adblock Plus	||scam.com^
Wildcard Domains	scam.com
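
The two syntaxes differ only in the wrapper around the domain, so converting between them is mechanical. The following is a minimal sketch, not code from the repository; the function names `to_adblock` and `from_adblock` are made up for illustration.

```python
def to_adblock(domain):
    """Wrap a bare (wildcard) domain in Adblock Plus blocking syntax."""
    return f"||{domain.strip().lower()}^"


def from_adblock(rule):
    """Recover the bare domain from a plain ||domain^ rule."""
    rule = rule.strip()
    if not (rule.startswith("||") and rule.endswith("^")):
        raise ValueError(f"not a plain domain rule: {rule!r}")
    return rule[2:-1]


if __name__ == "__main__":
    print(to_adblock("scam.com"))
    print(from_adblock("||scam.com^"))
```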

Statistics

Total domains: 89703
Light version: 8687

New domains from each source: *
Today | Yesterday | Excluded | Source
  103 |         0 |      62% | Emerging Threats phishing
   35 |         0 |       7% | Google Search
 1457 |         0 |      10% | Jeroengui phishing feed
   12 |         0 |       7% | Jeroengui scam feed
    0 |         1 |       1% | Manual Entries
  981 |         0 |      20% | PhishStats
  144 |         0 |       0% | PhishStats (NRDs)
  521 |         0 |       4% | Regex Matching (NRDs)
   32 |         0 |       8% | aa419.org
   30 |         0 |       0% | dnstwist (NRDs)
   13 |         0 |      43% | fakewebsitebuster.com
   59 |         0 |      22% | guntab.com
    6 |         0 |       9% | petscams.com
    1 |         0 |      65% | scam.directory
    1 |         0 |      39% | scamadviser.com
    0 |         0 |       5% | stopgunscams.com
 3251 |         1 |      17% | All sources

* The new domain numbers reflect what was retrieved, not
 what was added to the blocklist.
* The Excluded % is the share of retrieved domains not included
 in the blocklist, mostly dead, whitelisted, and parked domains.

Important

All data retrieved are publicly available and can be viewed from their respective sources.
Data hidden behind account creation or commercial licenses is never used.

Domains over time (days)

Courtesy of iam-py-test/blocklist_stats.

Light version

Targeted at list maintainers, a light version of the blocklist is available in the lists directory.

Details about the light version

Intended for collated blocklists cautious about size
Only includes sources whose domains can be filtered by date registered/reported
Only includes domains retrieved/reported from February 2024 onwards, whereas the full list goes back further historically
Note that dead and parked domains that become alive/unparked are not added back into the light version (due to limitations in the way these domains are recorded)
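
The date rule above can be pictured as a simple filter. This is only a sketch under stated assumptions: the `(domain, date)` pairs, the `light_subset` name, and the use of `None` for sources that cannot be filtered by date are all hypothetical and are not taken from the repository's scripts.

```python
from datetime import date

# "February 2024 onwards", as stated in the list above.
CUTOFF = date(2024, 2, 1)


def light_subset(entries, cutoff=CUTOFF):
    """Return the sorted domains whose registered/reported date is on or after the cutoff.

    entries: iterable of (domain, date or None). A date of None stands for a
    source that cannot be filtered by date, so the domain is left out.
    """
    return sorted(
        {domain for domain, seen in entries if seen is not None and seen >= cutoff}
    )


if __name__ == "__main__":
    sample = [
        ("new.example", date(2024, 3, 1)),
        ("old.example", date(2023, 12, 31)),
        ("undated.example", None),
    ]
    print(light_subset(sample))
```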

Sources excluded from the light version are marked in SOURCES.md.

The full version should be used where possible as it fully contains the light version and accounts for resurrected/unparked domains.

Other blocklists

NSFW Blocklist

Created in response to requests, a blocklist for NSFW domains is available in Adblock Plus format here: nsfw.txt

Details about the NSFW Blocklist

Domains are automatically retrieved from the Tranco Top Sites Ranking daily
Dead domains are removed daily
Note that resurrected domains are not added back into the blocklist
Note that parked domains are not checked for in this blocklist

Total domains: 9263

This blocklist covers not only adult videos but also NSFW content of the artistic variety (rule34, illustrations, etc.).

Malware Blocklist

A blocklist for malicious domains extracted from Proofpoint's Emerging Threats rulesets can be found here: jarelllama/Emerging-Threats

Parts of the rulesets are integrated into the Scam Blocklist as well.

Sources

Retrieving scam domains using Google Search API

Google provides a Search API to retrieve JSON-formatted results from Google Search. A list of search terms almost exclusively found in scam sites is used by the API to retrieve domains. See the list of search terms here: search_terms.csv

Details

Scam sites often do not have long lifespans; malicious domains may be replaced before they can be manually reported. By programmatically searching Google using paragraphs from real-world scam sites, new domains can be added as soon as Google crawls the site. This requires no manual reporting.

The list of search terms is proactively maintained and is mostly sourced from investigating new scam site templates seen on r/Scams.

Active search terms: 16
API calls made today: 76
Domains retrieved today: 35
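
To illustrate the step from JSON results to domains, here is a minimal sketch. It assumes the response has the shape returned by Google's Custom Search JSON API, where each entry in `items` carries a `link` URL; the function name and the sample response are hypothetical, and no network call is made.

```python
from urllib.parse import urlsplit


def domains_from_results(response):
    """Collect the hostnames of all result links in one parsed JSON response."""
    domains = set()
    # A search with no hits is assumed to have no "items" key at all.
    for item in response.get("items", []):
        host = urlsplit(item.get("link", "")).hostname
        if host:
            domains.add(host)
    return domains


if __name__ == "__main__":
    sample = {
        "items": [
            {"link": "https://www.scam.example/checkout?id=1"},
            {"link": "http://shop.scam.example/"},
        ]
    }
    print(sorted(domains_from_results(sample)))
```

One search term can return several result pages, which is consistent with the number of API calls above being higher than the number of active search terms.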

Retrieving phishing NRDs using dnstwist

New phishing domains are created daily, and unlike other sources that rely on manual reporting, dnstwist can automatically detect new phishing domains within days of their registration date.

dnstwist is an open-source detection tool for common cybersquatting techniques like Typosquatting, Doppelganger Domains, and IDN Homograph Attacks.

Details

dnstwist uses a list of common phishing targets to find permutations of the targets' domains. The target list is a handpicked compilation of cryptocurrency exchanges, delivery companies, etc., chosen with care to avoid false positives. The list of phishing targets can be viewed here: phishing_targets.csv

The generated domain permutations are checked for matches in a newly registered domains (NRDs) feed comprising domains registered within the last 30 days. Each permutation is tested for alternate top-level domains (TLDs) using the 15 most prevalent TLDs from the NRD feed at the time of retrieval.

Active targets: 64
Domains retrieved today: 30
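
The matching step described in the details can be sketched as a set intersection. The permutation generator below is a deliberately tiny stand-in (character omission, repetition, and adjacent swaps) and is not dnstwist's algorithm or API; the function names, the TLD list, and the sample feed are assumptions for illustration.

```python
def simple_permutations(name):
    """Generate a few typo-style variants of a domain label (stand-in for dnstwist)."""
    perms = set()
    for i in range(len(name)):
        perms.add(name[:i] + name[i + 1:])                # omit one character
        perms.add(name[:i] + name[i] * 2 + name[i + 1:])  # repeat one character
        if i + 1 < len(name):
            # swap two adjacent characters
            perms.add(name[:i] + name[i + 1] + name[i] + name[i + 2:])
    perms.discard(name)  # a swap of identical letters reproduces the original
    perms.discard("")
    return perms


def match_nrds(target, tlds, nrd_feed):
    """Return permutations of the target, across the given TLDs, that appear in the NRD feed."""
    name = target.partition(".")[0]
    candidates = {f"{perm}.{tld}" for perm in simple_permutations(name) for tld in tlds}
    return candidates & set(nrd_feed)


if __name__ == "__main__":
    feed = {"paypa.xyz", "payapl.com", "unrelated.com"}
    print(sorted(match_nrds("paypal.com", ["com", "xyz"], feed)))
```

Checking against the NRD feed rather than resolving every candidate keeps the search limited to domains that were actually registered recently.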

Regarding other sources

All sources used presently or formerly are credited here: SOURCES.md

The domain retrieval process for all sources can be viewed in the repository's code.

Automated filtering process

The domains collated from all sources are filtered against an actively maintained whitelist (scam reporting sites, forums, vetted stores, etc.)
The domains are checked against the Tranco Top Sites Ranking for potential false positives which are then vetted manually
Common subdomains like 'www' are stripped. The list of subdomains checked for can be viewed here: subdomains.txt
Only domains are included in the blocklist; URLs are stripped down to their domains and IP addresses are manually checked for resolving DNS records
Redundant rules are removed via wildcard matching. For example, 'abc.example.com' is a wildcard match of 'example.com' and, therefore, is redundant and removed. Wildcards are occasionally added to the blocklist manually to further optimize the number of entries
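
The automated parts of the steps above can be sketched as follows. This is an illustration under assumptions, not the repository's implementation: the subdomain tuple is a hypothetical sample (the real list is in subdomains.txt), the whitelist is treated as an exact-match set, and the Tranco check and IP address handling are left out.

```python
from urllib.parse import urlsplit

COMMON_SUBDOMAINS = ("www", "m")  # hypothetical sample, see subdomains.txt


def to_domain(entry):
    """Reduce a URL or hostname to its domain and strip one common subdomain."""
    entry = entry.strip().lower()
    host = urlsplit(entry if "://" in entry else "//" + entry).hostname or ""
    for sub in COMMON_SUBDOMAINS:
        prefix = sub + "."
        # Only strip when a full domain remains (avoid turning "www.com" into "com").
        if host.startswith(prefix) and "." in host[len(prefix):]:
            return host[len(prefix):]
    return host


def remove_redundant(domains):
    """Drop any domain whose parent domain is already in the set."""
    kept = set(domains)
    result = []
    for domain in sorted(kept):
        labels = domain.split(".")
        # Parent domains, excluding the bare TLD: a.b.example.com -> b.example.com, example.com
        parents = {".".join(labels[i:]) for i in range(1, len(labels) - 1)}
        if not parents & kept:
            result.append(domain)
    return result


def filter_domains(entries, whitelist):
    """Normalize entries, apply the whitelist, then remove redundant subdomains."""
    domains = {to_domain(entry) for entry in entries} - {""}
    return remove_redundant(domains - set(whitelist))


if __name__ == "__main__":
    raw = ["https://www.scam.example/login", "abc.scam.example", "forum.example"]
    print(filter_domains(raw, whitelist={"forum.example"}))
```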

Entries that require manual verification or intervention are sent in a Telegram notification for fast remediation.

The full filtering process can be viewed in the repository's code.

Dead domains

Dead domains are removed daily using AdGuard's Dead Domains Linter.

Dead domains that resolve again are added back to the blocklist.

Dead domains removed today: 0
Resurrected domains added today: 0
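
The bookkeeping behind removal and resurrection can be sketched as two set updates. This is not how AdGuard's Dead Domains Linter works internally; the plain DNS lookup in `resolves` is a simplified stand-in, and all names here are hypothetical.

```python
import socket


def resolves(domain):
    """Simplified liveness check: does the name resolve to any address?"""
    try:
        socket.getaddrinfo(domain, None)
        return True
    except socket.gaierror:
        return False


def update_dead(blocklist, dead, is_alive=resolves):
    """Move dead domains out of the blocklist and resurrected ones back in.

    Returns the updated (blocklist, dead) sets.
    """
    blocklist, dead = set(blocklist), set(dead)
    newly_dead = {d for d in blocklist if not is_alive(d)}
    resurrected = {d for d in dead if is_alive(d)}
    return (blocklist - newly_dead) | resurrected, (dead - resurrected) | newly_dead


if __name__ == "__main__":
    # A fake checker keeps the example offline and repeatable.
    alive = {"a.example", "c.example"}
    print(update_dead({"a.example", "b.example"}, {"c.example"}, is_alive=alive.__contains__))
```

Keeping the dead domains in their own set is what makes resurrection possible: a domain that is simply deleted cannot be re-checked later.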

Parked domains

From initial testing, 9% of the blocklist consisted of parked domains that inflated the number of entries. Because these domains pose no real threat (besides the obnoxious advertising), they are removed from the blocklist daily.

A list of common parked domain messages is used to automatically detect these domains. This list can be viewed here: parked_terms.txt

If these parked sites no longer contain any of the parked messages, they are assumed to be unparked and are added back into the blocklist.
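
Detection by parked messages amounts to a case-insensitive substring search over the fetched page. The sketch below assumes the page text has already been downloaded; the two sample terms and the function names are hypothetical, and the real terms are listed in parked_terms.txt.

```python
PARKED_TERMS = ["this domain is for sale", "parked free"]  # hypothetical samples


def is_parked(page_text, terms=PARKED_TERMS):
    """Return True if the page contains any known parked-domain message."""
    text = page_text.lower()
    return any(term.lower() in text for term in terms)


def split_parked(pages, terms=PARKED_TERMS):
    """Split {domain: page_text} into (parked, not_parked) sets of domains."""
    parked = {domain for domain, text in pages.items() if is_parked(text, terms)}
    return parked, set(pages) - parked


if __name__ == "__main__":
    pages = {
        "a.example": "<title>This domain is FOR SALE</title>",
        "b.example": "<h1>Claim your prize now</h1>",
    }
    print(split_parked(pages))
```

Running the same check on previously parked domains gives the unparked set: any domain whose page no longer matches a term moves back into the blocklist.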

Tip

For list maintainers interested in integrating the parked domains as a source, the list of daily-updated parked domains can be found here: parked_domains.txt (capped to newest 8000 entries)

Parked domains removed today: 220
Unparked domains added today: 74

Resources / See also

AdGuard's Dead Domains Linter: simple tool to check adblock filtering rules for dead domains
AdGuard's Hostlist Compiler: simple tool that compiles hosts blocklists and removes redundant rules
Elliotwutingfeng's repositories: various original blocklists
Google's Shell Style Guide: Shell script style guide
Grammarly: spelling and grammar checker
Jarelllama's Blocklist Checker: generate a simple static report for blocklists or see previous reports of requested blocklists
Legality of web scraping: the law firm of Quinn Emanuel Urquhart & Sullivan's memoranda on web scraping
ShellCheck: static analysis tool for Shell scripts
Tranco: research-oriented top sites ranking hardened against manipulation
VirusTotal: analyze suspicious files, domains, IPs, and URLs to detect malware (also includes WHOIS lookup)
iam-py-test/blocklist_stats: statistics on various blocklists

Appreciation

Thanks to the following people for the help, inspiration, and support!

@T145 - @bongochong - @hagezi - @iam-py-test - @sefinek24 - @sjhgvr

Contributing

You can contribute to this project in the following ways:

Sponsorship
Star this repository
Code reviews
Report domains and false positives
Report false negatives in the whitelist
Suggest search terms for the Google Search source
Suggest phishing targets for the dnstwist and Regex Matching sources
Suggest new sources
Suggest parked terms for the parked domains detection
Report false positives in the parked domains file

¹ Number calculated using NRDs from Hagezi's NRD 30 feed. The actual number of malicious NRDs found is higher because additional feeds are used. See the list of feeds used here: SOURCES.md
