Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections.
Run a high-fidelity browser-based crawler in a single Docker container
🗄️ A simple CLI for converting WARC to Parquet.
🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation
Streaming WARC/ARC library for fast web archive IO
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Parser for WARC (aka WebArchive) files
A tool for detecting viruses and NSFW material in WARC files
Serverless replay of web archives directly in the browser
Common Crawl's processing tools
A robust web archive analytics toolkit
InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
A small tool that uses the Common Crawl URL Index to download documents of certain file types or MIME types. It is used for mass-testing frameworks such as Apache POI and Apache Tika.
An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.