Skip to content

Eventually a search engine, but currently a filtering pipeline for HTML and soon WARC files.

License

Notifications You must be signed in to change notification settings

hadrianw/abracabra

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Abracabra

A search engine project, that will index only sites without ads and trackers. Saying it otherwise - search results should be all uBlock Origin clean.

At this stage it's becoming a pipeline to filter through Common Crawl archives and check a percentage of sites, that do not match criteria taken from AdBlock filter lists like EasyList.

It's written in Rust and uses the lol_html crate for HTML parsing and partially for selector matching. It uses the adblock crate for rule parsing. The current version tries to use it for matching, but it has been proven much too slow. A 200kB file takes 3 seconds on a modern amd64 CPU with heavy optimisations.

About

Eventually a search engine, but currently a filtering pipeline for HTML and soon WARC files.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published