LineDeduplicator

This is a fast, concurrent deduplication tool that removes duplicate lines from a text file. It leverages multiple CPU cores while keeping the memory footprint low.

Install dependencies

Note: Go 1.9+ is required because of sync.Map.

go get github.com/OneOfOne/xxhash
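The core idea combines the two dependencies above: hash each line to a small fixed-size value and record it in a `sync.Map`, whose `LoadOrStore` gives a lock-free duplicate check that is safe across goroutines. A minimal sketch (using the stdlib FNV-1a hash as a stand-in for xxhash, so the snippet is self-contained; `isDuplicate` is an illustrative name, not part of the tool):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// seen maps 32-bit line hashes to an empty struct. sync.Map (Go 1.9+)
// allows concurrent access from many worker goroutines without a mutex.
var seen sync.Map

// isDuplicate hashes a line and records the hash; it reports whether
// the hash was already present. FNV-1a stands in for xxhash here.
func isDuplicate(line string) bool {
	h := fnv.New32a()
	h.Write([]byte(line))
	_, loaded := seen.LoadOrStore(h.Sum32(), struct{}{})
	return loaded
}

func main() {
	for _, l := range []string{"alpha", "beta", "alpha"} {
		fmt.Println(l, isDuplicate(l))
	}
}
```

Storing the 4-byte hash instead of the line itself is what keeps memory usage independent of line length.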

Features

  • Unique lines are written to disk
  • Optionally, duplicate lines can be written to disk as well
  • A non-cryptographic hash (xxHash) keeps hashing close to memory speed
  • Low memory usage thanks to hash-set lookups
  • Uses all cores of the system; optimized for 16 cores and more
  • Linear performance and RAM usage
  • Super fast (when you come from Python or JavaScript)
  • Concurrency follows a producer-consumer pattern

Memory usage is 4 bytes per unique line in the file: set(lines) * 4 bytes. For example, 100 million unique lines need roughly 400 MB.
