modCrawler

Crawler based on a modified browser to detect online tracking. Used in the canvas fingerprinting and evercookie detection experiments in our CCS 2014 paper, The Web Never Forgets. Visit The Web Never Forgets website for more info.

Installation

We strongly suggest that you install modCrawler inside a virtual machine, container, or similar isolated environment.

git clone https://github.com/fpdetective/modCrawler
cd modCrawler
./setup.sh

Running the tests

Please run the tests before running the crawler. For simplicity, you can run py.test from within the test directory; it will discover and run all the tests. Alternatively, you can run individual tests from the command line, for example: python -m test.runenv_test

Command line parameters

Below we give a description of the parameters that are passed to the agents.py module.

--urls: path to file that contains the list of URLs to crawl
--max_rank: maximum line number (rank) of the URLs to be crawled (if the URL file contains rank info)
--min_rank (optional): minimum line number (rank) of the URLs to be crawled (if the URL file contains rank info)
--max_proc: maximum number of browsers that will run in parallel
--flash: Flash support (0: disable, 1: enable (default))
--cookie: Cookie support (0: allow all (default), 1: allow first-party only, 2: disable, 3: allow third-party cookies only from visited sites)
--upload: Upload crawl results to a remote server via SSH. 0: don't upload (default), 1: upload (SSH server info should be completed in crawler/common.py)
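The flags above can be mirrored in a small argparse definition, which may help when wrapping the crawler in your own scripts. This is a minimal sketch, not modCrawler's actual parser: the function name `build_parser`, the choice to make `--max_rank` required, and the defaults for `--min_rank` and `--max_proc` are assumptions for illustration. Only the flag names and the documented defaults for `--flash`, `--cookie`, and `--upload` come from the list above.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags (not modCrawler's own code)."""
    parser = argparse.ArgumentParser(description="Illustrative crawl options")
    parser.add_argument("--urls", required=True,
                        help="path to the file with the list of URLs to crawl")
    # Whether --max_rank is mandatory is an assumption; the README only marks
    # --min_rank as optional.
    parser.add_argument("--max_rank", type=int, required=True,
                        help="maximum line number (rank) to crawl")
    parser.add_argument("--min_rank", type=int, default=1,  # default assumed
                        help="minimum line number (rank) to crawl")
    parser.add_argument("--max_proc", type=int, default=1,  # default assumed
                        help="maximum number of parallel browsers")
    parser.add_argument("--flash", type=int, choices=[0, 1], default=1,
                        help="0: disable, 1: enable")
    parser.add_argument("--cookie", type=int, choices=[0, 1, 2, 3], default=0,
                        help="0: allow all, 1: first party, 2: disable, "
                             "3: third party from visited sites")
    parser.add_argument("--upload", type=int, choices=[0, 1], default=0,
                        help="0: don't upload, 1: upload via SSH")
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list so the sketch runs without a real crawl.
    args = build_parser().parse_args(
        ["--urls", "etc/top-1m.csv", "--max_rank", "100",
         "--max_proc", "10", "--flash", "0"]
    )
    print(args)
```

Restricting the numeric flags with `choices` makes an out-of-range value such as `--cookie 4` fail at parse time rather than partway through a crawl.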

Example:

To crawl the top 100 URLs in the etc/top-1m.csv file using 10 parallel crawlers (Flash disabled):
- python crawl.py --urls etc/top-1m.csv --max_rank 100 --max_proc 10 --flash 0
To crawl the URLs ranked 100 to 1000 in the etc/top-1m.csv file using 5 parallel crawlers (Flash enabled):
- python crawl.py --urls etc/top-1m.csv --max_rank 1000 --min_rank 100 --max_proc 5
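To preview which sites a given rank range would cover before starting a crawl, a short script along these lines may be useful. It is a sketch under stated assumptions: the helper name `select_by_rank` is hypothetical, each line of the URL file is assumed to look like `rank,url`, and both bounds are treated as inclusive. modCrawler's own handling of the range may differ.

```python
import csv
import io


def select_by_rank(lines, min_rank, max_rank):
    """Return (rank, url) pairs with min_rank <= rank <= max_rank.

    Assumes each line has the form "rank,url"; this layout and the
    inclusive bounds are assumptions, not taken from modCrawler.
    """
    selected = []
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        rank, url = int(row[0]), row[1].strip()
        if min_rank <= rank <= max_rank:
            selected.append((rank, url))
    return selected


if __name__ == "__main__":
    # In-memory sample standing in for a ranked URL file.
    sample = io.StringIO("1,example.com\n2,example.org\n3,example.net\n")
    print(select_by_rank(sample, 2, 3))
```

For a real file, pass an open file object instead of the in-memory sample, for example `open("etc/top-1m.csv", newline="")`.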

After the crawl

modCrawler will store the data about the crawls in the jobs directory. For convenience, it places a symlink called latest that points to the directory of the most recent crawl.

During the crawl, you can watch debug.log with: tail -f jobs/latest/debug.log

Once the crawl has finished, you can find the crawl data in the jobs/latest/ directory.

crawl.sqlite: SQLite-based crawl database.
...report.html: An HTML report that gives an overview of the results. The name of the file depends on the date and crawl parameters.
debug.log: Debug logs.
error.log: Error logs. This file is not created if there are no errors.
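Since the schema of crawl.sqlite is not described here, a reasonable first step is to list the tables it contains along with their row counts. The sketch below uses only Python's standard sqlite3 module and reads table names from the database itself; the function name `list_tables` is hypothetical and nothing about the actual schema is assumed.

```python
import sqlite3


def list_tables(db_path):
    """Return {table_name: row_count} for every table in an SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master is SQLite's built-in catalog of schema objects.
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
        counts = {}
        for name in names:
            # Names come from the catalog itself; double quotes guard
            # against unusual identifiers.
            query = 'SELECT COUNT(*) FROM "{}"'.format(name)
            counts[name] = conn.execute(query).fetchone()[0]
        return counts
    finally:
        conn.close()
```

After a crawl, calling `list_tables("jobs/latest/crawl.sqlite")` should give a quick overview of what was recorded. Note that `sqlite3.connect` creates an empty database file if the path does not exist, so check the path first.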

In addition, the crawl directory is gzipped and stored in the jobs directory.

Building your own browser

The setup.sh script will download a modified Firefox that logs function calls related to canvas fingerprinting. Alternatively, you can build your own Firefox using the provided browser patch. Make sure you use the right .mozconfig file for building (e.g., export MOZCONFIG=~/path/to/gecko-dev/.mozconfig-ffstd). The commands below assume you checked out the Firefox repository into ~/dev/gecko-dev/:

cd ~/dev/gecko-dev/
git fetch
git checkout GECKO4401_2016020518_RELBRANCH
git apply ~/dev/modCrawler/browser_patch/0001-Log-canvas-fingerprinting-related-function-calls.patch
./mach build
cd firefox-static
make package
# copy it from dist dir to destination
cp dist/*.bz2 /path/to/modCrawler/bins

You need to place your freshly built browser in the bins/ff-mod directory to make sure it is used by the crawler. Please consult the Mozilla documentation for any errors you may run into.
