Basic Statistics of Common Crawl Monthly Archives

Analyze the Common Crawl data to get metrics about the monthly crawl archives:

- size of the monthly crawls, number of
  - fetched pages
  - unique URLs
  - unique documents (by content digest)
  - number of different hosts, domains, top-level domains
- distribution of pages/URLs on hosts, domains, top-level domains
- and ...
  - MIME types
  - protocols / schemes (http vs. https)
  - content languages (since summer 2018)

This document describes how to generate the statistics from the Common Crawl URL index files.

The results are presented on https://commoncrawl.github.io/cc-crawl-statistics/.

Step 1: Count Items

The items (URLs, hosts, domains, etc.) are counted using the Common Crawl index files on AWS S3 (s3://commoncrawl/cc-index/collections/*/indexes/cdx-*.gz).
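
To make the counting concrete, here is a minimal sketch that tallies a few of these items from cdx lines. It assumes that each line consists of a SURT key, a timestamp, and a JSON record with `url`, `mime` and `digest` fields. The function name `count_items` and the sample lines are made up for illustration, and this is not how crawlstats.py itself is implemented.

```python
import json
from collections import Counter
from urllib.parse import urlsplit


def count_items(cdx_lines):
    """Count schemes, hosts and MIME types in an iterable of cdx lines."""
    counts = {"scheme": Counter(), "host": Counter(), "mime": Counter()}
    urls, digests = set(), set()
    for line in cdx_lines:
        # Assumed layout: "<SURT key> <timestamp> <JSON record>"
        _surt, _timestamp, record = line.rstrip("\n").split(" ", 2)
        fields = json.loads(record)
        parts = urlsplit(fields["url"])
        counts["scheme"][parts.scheme] += 1
        counts["host"][parts.hostname] += 1
        counts["mime"][fields.get("mime", "unknown")] += 1
        urls.add(fields["url"])
        digests.add(fields.get("digest"))
    # Exact unique counts: fine for a test set, costly for a full crawl
    return counts, len(urls), len(digests)


# Hypothetical sample lines, not taken from a real crawl
sample = [
    'com,example)/ 20160624000000 {"url": "http://example.com/", "mime": "text/html", "digest": "AAA"}',
    'com,example)/a 20160624000001 {"url": "https://example.com/a", "mime": "text/html", "digest": "BBB"}',
    'org,example)/ 20160624000002 {"url": "https://example.org/", "mime": "application/pdf", "digest": "AAA"}',
]
counts, unique_urls, unique_digests = count_items(sample)
print(counts["scheme"])
print(unique_urls, unique_digests)  # 3 2
```

Keeping every URL and digest in a set only works for small inputs, which is one reason a full crawl is processed as a distributed job.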

Define a pattern of cdx files to process, usually from one monthly crawl (here: CC-MAIN-2016-26):
- either a smaller set of local files for testing:
```
INPUT="test/cdx/cdx-0000[0-3].gz"
```
- or one monthly crawl to be accessed via Hadoop on AWS S3:
```
INPUT="s3a://commoncrawl/cc-index/collections/CC-MAIN-2016-26/indexes/cdx-*.gz"
```

Run crawlstats.py --job=count to process the cdx files and count the items:

```
python3 crawlstats.py --job=count --no-exact-counts \
     --no-output --output-dir .../count/ $INPUT
```

Help on command-line parameters (including mrjob options) is shown by python3 crawlstats.py --help. The option --no-exact-counts is recommended (and is the default) to save storage space and computation time when counting URLs and content digests.
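
To illustrate why approximate counts save space, the sketch below estimates the number of distinct URLs while keeping only a fixed number of hash values in memory. The estimator used by crawlstats.py is not described here, so this example swaps in a k-minimum-values sketch as a stand-in. The function name `estimate_distinct`, the parameter `k` and the sample URLs are assumptions made for illustration.

```python
import hashlib
import heapq


def estimate_distinct(items, k=256):
    """Estimate the number of distinct strings with a k-minimum-values sketch."""
    heap = []     # negated hash values, so heap[0] is the largest kept hash
    kept = set()  # hash values currently stored in the heap
    for item in items:
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash value
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Replace the largest kept hash with the new, smaller one
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values: the count is exact
    kth_smallest = -heap[0] / 2**64  # normalise to the interval [0, 1)
    return int((k - 1) / kth_smallest)


# Hypothetical input: 20,000 distinct URLs, each seen twice
urls = [f"http://example.com/page{i}" for i in range(20000)] * 2
print(estimate_distinct(urls))
```

Memory use stays bounded by `k` regardless of the input size, at the price of a relative error of roughly 1/sqrt(k).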

Step 2: Aggregate Counts

Run crawlstats.py --job=stats on the output of step 1:

```
python3 crawlstats.py --job=stats --max-top-hosts-domains=500 \
     --no-output --output-dir .../stats/ .../count/
```

The maximum number of most frequent hosts and domains contained in the output is set by the option --max-top-hosts-domains=N.
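
The effect of such a limit can be sketched as follows: partial counts are summed, and only the N most frequent entries are kept. The function `aggregate_top` and the sample counts are hypothetical and only illustrate the idea, not the actual aggregation code in crawlstats.py.

```python
from collections import Counter


def aggregate_top(partial_counts, max_top=500):
    """Sum per-part host counts and keep the max_top most frequent hosts."""
    total = Counter()
    for part in partial_counts:
        total.update(part)  # adds the counts of hosts seen in several parts
    return total.most_common(max_top)


# Hypothetical counts from two output parts of the counting step
parts = [
    {"example.com": 100, "example.org": 40},
    {"example.com": 20, "example.net": 7, "example.org": 5},
]
print(aggregate_top(parts, max_top=2))
# [('example.com', 120), ('example.org', 45)]
```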

Step 3: Download the Data

In order to prepare the plots, the output of step 2 must be downloaded to local disk. The simplest way is to fetch the data from the Common Crawl Public Data Set bucket on AWS S3:

```
while read crawl; do
    aws s3 cp s3://commoncrawl/crawl-analysis/$crawl/stats/part-00000.gz ./stats/$crawl.gz
done <<EOF
CC-MAIN-2008-2009
...
EOF
```

One aggregated, gzip-compressed statistics file is about 1 MiB in size, so you can simply run get_stats.sh to download the data files for all released monthly crawls.

The output of step 1 is also provided on s3://commoncrawl/. The counts for each crawl are held in 10 bzip2-compressed files, together about 1 GiB per crawl on average. To download the counts for one crawl:

- if you're on AWS and the AWS CLI is installed and configured:
```
CRAWL=CC-MAIN-2022-05
aws s3 cp --recursive s3://commoncrawl/crawl-analysis/$CRAWL/count stats/count/$CRAWL
```
- otherwise:
```
CRAWL=CC-MAIN-2022-05
mkdir -p stats/count/$CRAWL
for i in $(seq 0 9); do
  curl https://data.commoncrawl.org/crawl-analysis/$CRAWL/count/part-0000$i.bz2 \
    >stats/count/$CRAWL/part-0000$i.bz2
done
```
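
If you prefer to script the download in Python, the URL layout used above can be sketched as follows. The helper names `count_file_urls`, `stats_file_url` and `download` are hypothetical. The HTTPS location of the stats file is an assumption derived from the S3 path in this step by analogy with the count files.

```python
import shutil
import urllib.request
from pathlib import Path

BASE_URL = "https://data.commoncrawl.org/crawl-analysis"


def count_file_urls(crawl, num_parts=10):
    """URLs of the bzip2-compressed count files (output of step 1)."""
    return [f"{BASE_URL}/{crawl}/count/part-{i:05d}.bz2" for i in range(num_parts)]


def stats_file_url(crawl):
    """Assumed URL of the aggregated statistics file (output of step 2)."""
    return f"{BASE_URL}/{crawl}/stats/part-00000.gz"


def download(url, target):
    """Stream one file to disk, creating the target directory if needed."""
    target = Path(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)


crawl = "CC-MAIN-2022-05"
for url in count_file_urls(crawl):
    print(url)
    # Uncomment to fetch the file (about 1 GiB per crawl in total):
    # download(url, f"stats/count/{crawl}/{url.rsplit('/', 1)[-1]}")
```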

Step 4: Plot the Data

To prepare the plots from the downloaded aggregated data, run:

```
gzip -dc stats/CC-MAIN-*.gz | python3 plot/crawl_size.py
```

The full list of commands to prepare all plots is found in plot.sh. Don't forget to install the Python modules required for plotting (see requirements_plot.txt).

Related Projects

The columnar index simplifies counting and analytics a lot: it is easier to maintain, more transparent, reproducible and extensible than running two MapReduce jobs. See the list of example SQL queries and Jupyter notebooks.
