A very simple news crawler with a funny name
news-please - an integrated web crawler and information extractor for news that just works
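news-please exposes a one-call Python API for fetching and extracting a single article. A minimal sketch following the pattern in its documentation; the article URL is a placeholder:

```python
# Minimal news-please usage sketch; the target URL is a placeholder.
from newsplease import NewsPlease

article = NewsPlease.from_url("https://example.com/some-news-article")
print(article.title)
print((article.maintext or "")[:200])  # first 200 chars of extracted body text
```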
Statistics of Common Crawl monthly archives mined from URL index files
Common Crawl fork of Apache Nutch
Tools to construct and process webgraphs from Common Crawl data
Common Crawl's processing tools
A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. This is used for mass-testing of frameworks like Apache POI and Apache Tika
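The general approach can be sketched against the public Common Crawl index API: query a per-crawl CDX endpoint for captures with a given MIME type, then fetch each capture with an HTTP range request into the referenced WARC file. The crawl ID and target pattern below are example values, not taken from the repo:

```python
# Sketch: query the Common Crawl URL index for PDF captures of a domain,
# then fetch one capture via an HTTP range request on the WARC file.
# The crawl ID (CC-MAIN-2024-22) and URL pattern are example values.
import json
import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-22-index"
params = {
    "url": "example.com/*",
    "filter": "mime:application/pdf",
    "output": "json",
    "limit": "5",
}
resp = requests.get(INDEX, params=params, timeout=60)
resp.raise_for_status()
records = [json.loads(line) for line in resp.text.splitlines()]

for rec in records:
    # Each index record points into a WARC file: filename, offset, length.
    start = int(rec["offset"])
    end = start + int(rec["length"]) - 1
    warc_url = "https://data.commoncrawl.org/" + rec["filename"]
    chunk = requests.get(
        warc_url, headers={"Range": f"bytes={start}-{end}"}, timeout=60
    )
    print(rec["url"], rec["mime"], len(chunk.content), "bytes (gzipped WARC record)")
```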
Process Common Crawl data with Python and Spark
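As a rough illustration of the pattern (cc-pyspark itself provides a more complete framework), one can pair Spark's `binaryFiles` with warcio to map over WARC records in parallel. The input path is a placeholder:

```python
# Sketch: tally HTTP content types across WARC files with PySpark and warcio.
# The input glob is a placeholder; suitable only for files that fit in memory.
from io import BytesIO

from pyspark.sql import SparkSession
from warcio.archiveiterator import ArchiveIterator

def content_types(path_and_bytes):
    # Parse one in-memory WARC file and emit its response MIME types.
    _path, data = path_and_bytes
    for record in ArchiveIterator(BytesIO(data)):
        if record.rec_type == "response" and record.http_headers:
            yield record.http_headers.get_header("Content-Type"), 1

spark = SparkSession.builder.appName("warc-content-types").getOrCreate()
counts = (
    spark.sparkContext.binaryFiles("./warcs/*.warc.gz")  # placeholder path
    .flatMap(content_types)
    .reduceByKey(lambda a, b: a + b)
    .collect()
)
for mime, n in sorted(counts, key=lambda kv: -kv[1])[:20]:
    print(n, mime)
spark.stop()
```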
Dataset and model checkpoints for the paper "Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora".
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
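The toolkit presents both index flavors behind one iterator interface. A sketch based on the usage shown in its README, where `source="cc"` targets the Common Crawl index and `source="ia"` the Wayback Machine:

```python
# Sketch of cdx_toolkit usage following its README pattern.
import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source="cc")
for obj in cdx.iter("commoncrawl.org/*", limit=5):
    print(obj["timestamp"], obj["status"], obj["url"])
```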
Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
Crawls the web to generate a huge dataset for training
🕷️ The pipeline for the OSCAR corpus
News crawling with StormCrawler - stores content as WARC
Index Common Crawl archives in tabular format
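The tabular index is published as Parquet, so engines such as Spark or Athena can query it with plain SQL. A sketch assuming a local slice of the index; the Parquet path is a placeholder and the column names follow the published cc-index-table schema:

```python
# Sketch: query a local slice of the columnar URL index with Spark SQL.
# The Parquet path is a placeholder; columns (url, content_mime_type,
# warc_filename, ...) follow the cc-index-table schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cc-index-query").getOrCreate()
spark.read.parquet("./cc-index/table/").createOrReplaceTempView("ccindex")

pdfs = spark.sql("""
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM ccindex
    WHERE content_mime_type = 'application/pdf'
    LIMIT 10
""")
pdfs.show(truncate=False)
spark.stop()
```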
A tool for manual classification of dwtc tables; the results are then used as a training dataset.
A python utility for downloading Common Crawl data
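Downloads of this kind typically start from the per-crawl path listings published alongside each monthly archive. A sketch of that flow; the crawl ID is an example and should be swapped for a current crawl:

```python
# Sketch: list the WARC files for one crawl and download the first one.
# The crawl ID (CC-MAIN-2024-22) is an example value.
import gzip
import requests

BASE = "https://data.commoncrawl.org/"
paths_url = BASE + "crawl-data/CC-MAIN-2024-22/warc.paths.gz"

listing = requests.get(paths_url, timeout=60)
listing.raise_for_status()
paths = gzip.decompress(listing.content).decode().splitlines()
print(len(paths), "WARC files in this crawl")

# WARC files are roughly 1 GB each, so stream to disk in chunks.
with requests.get(BASE + paths[0], stream=True, timeout=60) as r:
    r.raise_for_status()
    with open("sample.warc.gz", "wb") as out:
        for chunk in r.iter_content(chunk_size=1 << 20):
            out.write(chunk)
```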
Inspired by Google's C4, a series of large-scale data-cleaning scripts for Common Crawl processing, including handling for Chinese text and the cleaning methods from MassiveText.
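C4-style cleaning is largely line-level heuristics. A toy sketch of a few of the published C4 rules (keep lines ending in terminal punctuation with at least five words, drop pages with boilerplate markers or too little surviving text); this is an illustration, not this repo's actual code:

```python
# Toy sketch of C4-style page filtering (not this repo's actual code).
TERMINALS = (".", "!", "?", '"')

def clean_page(text: str) -> str | None:
    lowered = text.lower()
    if "lorem ipsum" in lowered or "{" in text:  # boilerplate/code heuristics
        return None
    kept = [
        line.strip()
        for line in text.splitlines()
        if line.strip().endswith(TERMINALS) and len(line.split()) >= 5
    ]
    if len(kept) < 3:  # require a minimum amount of surviving text
        return None
    return "\n".join(kept)

sample = (
    "This is a sentence with enough words in it.\n"
    "short\n"
    "Another full sentence follows here, clearly.\n"
    "And one more proper sentence ends the page."
)
print(clean_page(sample))
```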
Extract web archive data using Wayback Machine and Common Crawl
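On the Wayback Machine side, the simplest entry point is the availability endpoint, which returns the closest archived snapshot for a URL. A sketch with a placeholder target:

```python
# Sketch: find the closest archived snapshot of a URL via the Wayback
# Machine's availability API (the target URL is a placeholder).
import requests

resp = requests.get(
    "https://archive.org/wayback/available",
    params={"url": "example.com", "timestamp": "20200101"},
    timeout=30,
)
resp.raise_for_status()
snap = resp.json().get("archived_snapshots", {}).get("closest")
if snap:
    print(snap["timestamp"], snap["url"])
else:
    print("No snapshot found")
```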