warc

Here are 100 public repositories matching this topic...

webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!

kubernetes cloud archiving warc web-archiving webrecorder web-archive wacz

Updated May 21, 2024
TypeScript

ArchiveBox / ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Updated May 20, 2024
Python

harvard-lil / warc-gpt

WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.

ai warc webarchiving rag

Updated May 20, 2024
Python

webrecorder / browsertrix-crawler

Run a high-fidelity browser-based crawler in a single Docker container

crawler web-crawler crawling warc web-archiving webrecorder wacz

Updated May 20, 2024
TypeScript

maxcountryman / warc-parquet

🗄️ A simple CLI for converting WARC to Parquet.

crawling parquet warc web-archiving duckdb

Updated May 17, 2024
Rust

machawk1 / wail

🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation

python gui warc web-archiving pyinstaller wayback heritrix openwayback

Updated May 16, 2024
Roff

openzim / warc2zim

Command line tool to convert a file in the WARC format to a file in the ZIM format

scraper warc zim

Updated May 17, 2024
Python

nlnwa / warchaeology

Command line tool for digging into WARC files

cli command-line warc

Updated May 16, 2024
Go

webrecorder / warcio

Streaming WARC/ARC library for fast web archive IO

python warc web-archiving web-archives pywb

Updated May 15, 2024
Python

internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

java warc heritrix webcrawling

Updated May 15, 2024
Java

toimik / WarcProtocol

Parser for WARC (aka WebArchive) files

warc webarchive webarchiving warc-files webarchives warc-format warc-reader warc-record

Updated May 10, 2024
C#

CorentinB / warc

Read and write WARC files in Go

go archiving warc

Updated May 8, 2024
Go

natliblux / warc-safe

A tool for detecting viruses and NSFW material in WARC files

antivirus warc webarchiving nsfw-classifier warc-safe

Updated May 3, 2024
Python

webrecorder / replayweb.page

Serverless replay of web archives directly in the browser

service-worker warc web-archiving wayback-machine web-archive replay-web-page web-replay wacz

Updated May 2, 2024
TypeScript

openzim / zimit-frontend

Zimit Public Web UI

spider warc zim

Updated May 2, 2024
Vue

toimik / CommonCrawl

Common Crawl's processing tools

warc wat wet commoncrawl common-crawl warc-files wat-files common-crawl-data wet-files

Updated May 2, 2024
C#

chatnoir-eu / chatnoir-resiliparse

A robust web archive analytics toolkit

python web cpp cython bigdata extraction warc webarchive htmlparser

Updated Apr 29, 2024
Cython

oduwsdl / ipwb

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

python docker service-worker ipfs memento warc web-archiving wayback memento-rfc

Updated Apr 24, 2024
Python

centic9 / CommonCrawlDocumentDownload

A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types. It is used for mass-testing frameworks such as Apache POI and Apache Tika.

java mime-types warc cdx-files commoncrawl

Updated Apr 21, 2024
Java

helgeho / ArchiveSpark

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

spark internet-archive warc web-archiving webarchive archivespark spark-framework

Updated Apr 4, 2024
Scala
