
Adblock Plus crawler

This is a Web crawler that uses Headless Chrome Crawler to load the Adblock Plus extension, crawl a list of provided URLs to a given depth, and log filter hit results.

This requires a small change in adblockpluschrome to expose the HitLogger class. At the end of the file, add this line: ext.HitLogger = HitLogger;
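
For context, the change is just that one assignment; the exact module path in adblockpluschrome is an assumption here:

// At the end of the adblockpluschrome module that defines HitLogger
// (e.g. lib/hitLogger.js -- the exact path is an assumption), expose the
// class on the ext namespace so the crawler can access it:
ext.HitLogger = HitLogger;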

Running a crawl manually

Check out the repository and then run:

npm install

When that is done, launch the crawl with

npm start -- -p ./devenv.chrome -u ./top500.domains.csv

This will run the crawl and store the results in a filterhits.log file in the current directory.
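
The -p option appears to correspond to the abppath setting and -u to the urllist setting described below; this mapping is an inference from the example above. With your own paths, the command would look like:

npm start -- -p /path/to/adblockpluschrome/devenv.chrome -u /path/to/listofurls.csv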

Settings file

There are two ways of providing parameters to the crawler: through the command line, which takes the highest priority, or through a settings.json config file, which looks like this:

{
    // path to a file which specifies a list of URLs to crawl
    "urllist":"settings/listofurls.csv",
    // path to Adblock Plus
    "abppath":"/path/to/adblockpluschrome/devenv.chrome",
    // Crawling depth
    "depth": 0,
    // Output folder
    "output": "/path/to/output/folder",
    // enable taking screenshots
    "screenshots": true,
    // time to wait before making a screenshot
    "screenshotsDelay": 5000
}
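
Note that the comments in the example above are explanatory only; a real settings.json must not contain them, since plain JSON does not allow comments. As a minimal sketch (not the project's actual loader) of how such a file could be combined with command-line options, with the command line taking precedence:

// sketch.js -- hypothetical helper, not part of the crawler itself.
const fs = require("fs");

// Defaults < settings.json < command-line options (highest priority).
function loadSettings(cliOptions) {
  const defaults = {depth: 0, screenshots: false, screenshotsDelay: 5000};
  let fromFile = {};
  if (fs.existsSync("settings.json"))
    fromFile = JSON.parse(fs.readFileSync("settings.json", "utf8"));
  return Object.assign({}, defaults, fromFile, cliOptions);
}

// e.g. with `-p ./devenv.chrome -u ./top500.domains.csv` already parsed:
console.log(loadSettings({abppath: "./devenv.chrome",
                          urllist: "./top500.domains.csv"}));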

Input URLs file structure

Input URLs should be provided in a .csv file, with the URLs in the second column. For example:

Number,URL
1,https://www.firsturl.com/
2,https://www.secondurl.com/
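
As a sketch (not the project's actual parser), reading such a file with plain Node.js and taking the second column of every data row would look something like this:

const fs = require("fs");

function readUrlList(path) {
  return fs.readFileSync(path, "utf8")
    .split("\n")
    .slice(1)                          // skip the "Number,URL" header row
    .map(line => line.split(",")[1])   // take the second column (the URL)
    .filter(url => url && url.trim() !== "");
}

console.log(readUrlList("settings/listofurls.csv"));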

Running a Docker container

Check out the repository and then run:

docker build -t abp-crawler .

This will build a container image based on the Dockerfile and name it abp-crawler. To run a crawl, it is best to mount one volume containing the settings file and input URLs, and a separate volume for the results. Create a settings folder containing settings.json, create a result folder, and mount them like so:

docker run -v ~/settings:/home/crawler/settings -v ~/result:/home/crawlresult abp-crawler
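
With these mounts, the settings.json placed in ~/settings should refer to paths as they appear inside the container. For example (the abppath value depends on where the image puts adblockpluschrome and is only a placeholder here):

{
    "urllist": "/home/crawler/settings/listofurls.csv",
    "abppath": "/path/inside/container/to/devenv.chrome",
    "depth": 0,
    "output": "/home/crawlresult",
    "screenshots": true,
    "screenshotsDelay": 5000
}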

Log file structure

The log file is a list of lines, where each line is a JSON object.

Both blocked and non-blocked requests are logged. Non-blocked requests have the filter parameter set to null:

{"request":{"url":"https://creativecommons.org/wp-content/themes/cc/js/sticky-nav.js","type":"SCRIPT","docDomain":"creativecommons.org","thirdParty":false,"sitekey":null,"specificOnly":false},"filter":null,"level":"info","message":""}

Blocked requests have filter set to the actual filter that triggered the block:

{"request":{"url":"http://simple-adblock.com/adblocktest/files/adbanner.gif","type":"IMAGE","docDomain":"simple-adblock.com","thirdParty":false,"sitekey":null,"specificOnly":false},"filter":{"text":"/adbanner.","subscriptions":null,"regexpSource":null,"sitekeys":null},"level":"info","message":""}

Element hiding filters are also logged when they are hit:

{"request":{"url":"https://www.amazon.com/","type":"ELEMHIDE","docDomain":"www.amazon.com"},"filter":{"text":"@@||amazon.com^$elemhide","subscriptions":null,"contentType":1073741824,"regexpSource":null,"sitekeys":null},"level":"info","message":""}

Screenshots

Screenshots are stored in a screenshots directory inside the output folder.
