commoncrawl/ml-opt-out-experiments

What is this?

PySpark jobs for investigating the prevalence of ML opt-out protocols, written by Alex Xue as part of the blog post "A Further Look Into the Prevalence of Various ML Opt-Out Protocols".

How Do I Run It?

Requires sparkcc.py from commoncrawl/cc-pyspark.

Setup is the same as for cc-pyspark. Make sure you have an ./input directory containing a file that lists the WARC paths to process.
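As in cc-pyspark, the input file is a plain-text list of WARC locations, one per line. A minimal prep sketch (the WARC path below is an illustrative placeholder, not a real crawl file):

```shell
# Create the ./input directory expected by the jobs, plus a listing file
# with one WARC location per line. The path is a placeholder; substitute
# real Common Crawl WARC paths or local files.
mkdir -p ./input
echo "file:///data/example.warc.gz" > ./input/test_warc.txt
```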

To run the jobs:

$SPARK_HOME/bin/spark-submit job_name.py \
    --num_output_partitions 1 --log_level WARN \
    ./input/test_warc.txt output_file_name

To run html_metatag_count.py specifically, which has a different output schema and takes the --tuple_key_schema flag:

$SPARK_HOME/bin/spark-submit ./html_metatag_count.py \
    --num_output_partitions 1 --log_level WARN --tuple_key_schema True \
    ./input/test_warc.txt output_file_name
