CopyCat

CopyCat is an open-source resource that makes deduplication in TREC-style experimental setups more attainable. It provides (1) a compilation of near-duplicate documents within the ClueWeb09, the ClueWeb12, and two Common Crawl snapshots, and (2) a software library for deduplicating arbitrary document sets.

Getting Started

The CopyCat software comes with a CLI for the deduplication of run and qrel files and a Spark library for the deduplication of complete web crawls. It also provides a Docker image with support for Jupyter notebooks (plus common data science libraries).

To start a bash shell with CopyCat installed and your local directory mounted, run:

docker run --rm -ti -v ${PWD}:/home/jovyan webis/chatnoir-copycat:1.0-jupyter bash

Inside this shell, CopyCat is installed. For example, run copy-cat --help to show the help, which prints:

usage: CopyCat: Deduplication of run files and qrels.
       [-h] --input INPUT --output OUTPUT
       [--similarities {url,s3,cosine(3+5-grams),cosine(8-grams),cosine(1-grams),simhash(1-grams),simhash(3+5-grams),md5,text-profile} [{url,s3,cosine(3+5-grams),cosine(8-grams),cosine(1-grams),simhash(1-grams),simhash(3+5-grams),md5,text-profile} ...]]
       --documents {ChatNoirMapfiles,AnseriniIndex} [--anseriniIndex ANSERINIINDEX] [--retrieveDocId RETRIEVEDOCID] [--ranks RANKS]
       [--s3Threshold S3THRESHOLD] [--threads THREADS] [--runFile {true,false}] [--keepStopwords {true,false}]
       [--contentExtraction {Anserini,Boilerpipe,Jericho,No}] [--stemmer {porter,krovetz,null}] [--stopwords STOPWORDS]

named arguments:
  -h, --help             show this help message and exit
  --input INPUT          The run file or qrel file that should be deduplicated.
  --output OUTPUT        The result of the deduplication in jsonl format.
  --similarities {url,s3,cosine(3+5-grams),cosine(8-grams),cosine(1-grams),simhash(1-grams),simhash(3+5-grams),md5,text-profile} [{url,s3,cosine(3+5-grams),cosine(8-grams),cosine(1-grams),simhash(1-grams),simhash(3+5-grams),md5,text-profile} ...]
                         Calculate all passed similarities.
  --documents {ChatNoirMapfiles,AnseriniIndex}
                         Use the passed DocumentResolver to load  the  documents.  E.g.  AnseriniIndex  loads  documents by accessing a local
                         anserini-index.
  --anseriniIndex ANSERINIINDEX
                         When using AnseriniIndex as resolver for documents, we use the specified index.
  --retrieveDocId RETRIEVEDOCID
                         Retrieve a single document from and print it to  the  console.  This  is  useful to check the preprocessing on a few
                         example documents.
  --ranks RANKS          Include documents up to the specified rank in the deduplication.
  --s3Threshold S3THRESHOLD
                         Report only near-duplicate pairs with s3 scores on word 8-grams above the specified threshold.
  --threads THREADS
  --runFile {true,false}
                         Is the specified a run file (pass true), or a qrels file (pass false)
  --keepStopwords {true,false}
                         Switch: keep stopwords or remove them.
  --contentExtraction {Anserini,Boilerpipe,Jericho,No}
                         The name of the content extraction. (Use  'Anserini'  for  Anserini's  default HTML to plain text transformation, or
                         'No' in case documents are already transformed (e.g., because they come from an anserini index)
  --stemmer {porter,krovetz,null}
                         The name of the stemmer (passed to Lucene with Anserini).
  --stopwords STOPWORDS  The list of stopwords is read from this  file.  When  keepStopwords  is  false, and stopwords = null, then Anserinis
                         default is used.
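
The flags above can be combined into a single invocation. The following is a minimal Python sketch, not part of CopyCat, that assembles such a command line using only flags listed in the help text. The helper name build_copycat_command and the file names run.txt, out.jsonl, and /data/index are hypothetical placeholders. Quoting the result with shlex.join is useful because similarity names such as cosine(3+5-grams) contain parentheses that bash would otherwise interpret.

```python
import shlex
from typing import List, Optional, Sequence


def build_copycat_command(
    input_file: str,
    output_file: str,
    documents: str,
    similarities: Sequence[str] = (),
    anserini_index: Optional[str] = None,
    run_file: Optional[bool] = None,
    ranks: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list for `copy-cat` (hypothetical helper).

    Only flags that appear in the `copy-cat --help` output are used.
    """
    # The help text lists exactly these two document resolvers.
    if documents not in ("ChatNoirMapfiles", "AnseriniIndex"):
        raise ValueError(f"unknown document resolver: {documents}")

    cmd = ["copy-cat", "--input", input_file, "--output", output_file]
    if similarities:
        cmd += ["--similarities", *similarities]
    cmd += ["--documents", documents]
    if anserini_index is not None:
        cmd += ["--anseriniIndex", anserini_index]
    if run_file is not None:
        # The CLI expects the literal strings "true" or "false".
        cmd += ["--runFile", "true" if run_file else "false"]
    if ranks is not None:
        cmd += ["--ranks", str(ranks)]
    return cmd


if __name__ == "__main__":
    # Placeholder paths; replace them with your own run file and index.
    command = build_copycat_command(
        "run.txt",
        "out.jsonl",
        "AnseriniIndex",
        similarities=["url", "s3", "cosine(3+5-grams)"],
        anserini_index="/data/index",
        run_file=True,
        ranks=100,
    )
    # shlex.join quotes arguments so the line can be pasted into bash.
    print(shlex.join(command))
```

The printed line can be pasted into the bash shell started above, or the list can be passed to subprocess.run when working from a notebook inside the container.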

To start a jupyter notebook with CopyCat installed and your local directory mounted, run:

docker run --rm -ti -v ${PWD}:/home/jovyan -p 8888:8888 webis/chatnoir-copycat:1.0-jupyter

Now you can point your browser to localhost:8888 to access the notebook.

Deduplication of Run and Qrel Files

CopyCat provides a Docker image to support deduplication experiments on standard IR run and qrel files for various test collections. The following list provides some examples of how to use the Docker image to deduplicate run/qrel files:

Showcases

This repository contains the two showcases from the paper. For a general introduction, see the Getting Started section.

Deduplication of run/qrel files with the copycat-cli (copycat-on-clueweb.ipynb, open in nbviewer)
Transfer of relevance labels

Development Environment:

Please install the following requirements:

Java 8 (our Hadoop cluster runs Hadoop 2.8, so the code must be compiled for compatibility with Java 8)
Maven
Project Lombok for your IDE (used to remove a bit of boilerplate code)
https://approvaltests.com/ (especially the diff tools, which are used in unit tests)
Docker

After installing the requirements, you can run the following command to compile CopyCat and run all unit tests:

make install

Deduplication of Large Crawls with Spark

To detect near-duplicates in large web crawls, CopyCat runs a pipeline of four steps: (1) calculation of the SimHash fingerprint for each document, (2) selection of one representative document if multiple documents have identical fingerprints, (3) partitioning of the remaining fingerprints, and (4) calculation of the Hamming distances between all fingerprints of a partition.
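
To make steps (1) and (4) concrete, here is a minimal Python sketch of a SimHash fingerprint and the Hamming distance between two fingerprints. It is an illustration under stated assumptions and not CopyCat's implementation: the tokenizer, the use of MD5 to derive 64-bit feature hashes, and unweighted word unigrams as features are all choices made for this example.

```python
import hashlib
import re
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    """Lowercase word unigrams (an assumed, simplistic feature extraction)."""
    return re.findall(r"\w+", text.lower())


def hash64(feature: str) -> int:
    """Map a feature to a 64-bit integer (MD5 is an assumption for this sketch)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(features: Iterable[str], bits: int = 64) -> int:
    """Compute a SimHash fingerprint with at most 64 bits.

    Every feature votes +1 or -1 on each bit position, depending on the
    corresponding bit of its hash. A fingerprint bit is 1 where the sum
    of votes is positive.
    """
    votes = [0] * bits
    for feature in features:
        h = hash64(feature)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1

    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    doc_a = simhash(tokenize("CopyCat detects near-duplicate web documents"))
    doc_b = simhash(tokenize("CopyCat detects near-duplicate web pages"))
    print(hamming_distance(doc_a, doc_b))
```

Documents with mostly overlapping features tend to receive fingerprints with a small Hamming distance, which is why the later pipeline steps only need to compare fingerprints rather than full texts.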

Create Document representations
- Run ./copycat-spark/src/main/bash/new-document-representation-spark.sh to produce document representations
- The Makefile contains specific examples. For instance, execute make common-crawl-small-sample-document-representations to create document representations for a small sample of the Common Crawl and test your environment.
- The following commands produce the document-representations used in the paper:
  - make clueweb09-document-representations
  - make clueweb12-document-representations
  - make common-crawl15-document-representations
  - make common-crawl17-document-representations
Create SimHash Deduplication Jobs
- This covers the selection of one representative document (if multiple documents have identical fingerprints), and the partitioning of the remaining fingerprints into deduplication jobs.
- After the creation of the document representations, run ./copycat-spark/src/main/bash/create-deduplication-candidates.sh to produce: (1) a list of removed-documents (e.g., documents that are too short according to your configuration); (2) exact-duplicates; and (3) near-duplicate-tasks.
- The near-duplicate-tasks are the blocks that will be fully all-pairs deduplicated in the Run Deduplication Jobs stage.
- The following commands produce the deduplication jobs used in the paper:
  - make create-deduplication-candidates-cw09
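
The two operations described in this stage can be sketched in a few lines of Python. This is a hedged illustration, not the logic of create-deduplication-candidates.sh: the function names are hypothetical, and the partitioning scheme (splitting each 64-bit fingerprint into four 16-bit blocks) is an assumption chosen because, by the pigeonhole principle, two fingerprints that differ in at most three bits must agree on at least one of the four blocks.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def select_representatives(
    fingerprints: Dict[str, int],
) -> Tuple[Dict[int, str], Dict[str, List[str]]]:
    """Keep one document per distinct fingerprint.

    Returns (fingerprint -> representative id, representative id -> ids of
    the other documents that share its fingerprint).
    """
    groups: Dict[int, List[str]] = defaultdict(list)
    for doc_id in sorted(fingerprints):  # sorted for a deterministic choice
        groups[fingerprints[doc_id]].append(doc_id)

    representatives = {fp: ids[0] for fp, ids in groups.items()}
    exact_duplicates = {ids[0]: ids[1:] for ids in groups.values() if len(ids) > 1}
    return representatives, exact_duplicates


def partition(
    representatives: Dict[int, str], bits: int = 64, blocks: int = 4
) -> Dict[Tuple[int, int], List[str]]:
    """Group representatives that agree on at least one fingerprint block.

    Each key is (block index, block value). Groups with a single member
    are dropped because they contain no candidate pair.
    """
    width = bits // blocks
    mask = (1 << width) - 1
    buckets: Dict[Tuple[int, int], List[str]] = defaultdict(list)
    for fp, doc_id in representatives.items():
        for b in range(blocks):
            buckets[(b, (fp >> (b * width)) & mask)].append(doc_id)
    return {key: ids for key, ids in buckets.items() if len(ids) > 1}


if __name__ == "__main__":
    # Toy fingerprints; real ones would come from the document representations.
    toy = {"a": 0, "b": 0, "c": 1, "d": 0xFFFFFFFFFFFFFFFF}
    reps, dups = select_representatives(toy)
    print(dups)             # documents sharing a fingerprint with a representative
    print(partition(reps))  # candidate groups for the all-pairs comparison
```

With this scheme a pair of documents can appear in several groups, so a later comparison step should collect pairs in a set to avoid reporting them twice.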
Run Deduplication Jobs
- After the creation of the deduplication jobs, run ./copycat-spark/src/main/bash/deduplicate.sh to produce the pairs of near-duplicates.
- The following commands produce the pairs of near-duplicates used in the paper:
  - make deduplicate-cw09
  - make deduplicate-cw09-cw12-cc15
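
The all-pairs comparison inside each deduplication job can be pictured with the following Python sketch. It is an illustration only and not the code behind deduplicate.sh: the function name near_duplicate_pairs and the default threshold of 3 differing bits are assumptions made for this example.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicate_pairs(
    jobs: Iterable[List[str]],
    fingerprints: Dict[str, int],
    max_distance: int = 3,  # assumed threshold, not taken from CopyCat
) -> Set[Tuple[str, str]]:
    """Compare all document pairs within each job.

    A pair is reported when the Hamming distance of its fingerprints is at
    most max_distance. Pairs are stored as sorted tuples in a set, so a
    pair that occurs in several jobs is reported once.
    """
    pairs: Set[Tuple[str, str]] = set()
    for job in jobs:
        for left, right in combinations(sorted(set(job)), 2):
            distance = hamming_distance(fingerprints[left], fingerprints[right])
            if distance <= max_distance:
                pairs.add((left, right))
    return pairs


if __name__ == "__main__":
    # Toy fingerprints and jobs for illustration.
    toy_fingerprints = {"a": 0b0000, "b": 0b0111, "c": 0b1111}
    toy_jobs = [["a", "b", "c"], ["b", "a"]]
    print(sorted(near_duplicate_pairs(toy_jobs, toy_fingerprints)))
```

Because the comparison is quadratic in the size of a job, keeping jobs small in the preceding stage is what makes this step feasible for complete crawls.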
