go-cardinality is a Go library that calculates the cardinality and distinct count of values in a dataset, providing efficient and accurate estimations.

anthonysyk/go-cardinality


go-cardinality

go-cardinality is a Go library that calculates the cardinality and distinct count of values in a dataset, providing efficient and accurate estimations.

Usage

  • Retrieve all unique values of a specific field in a dataset, which is useful for creating enums or generating dimension tables.
  • Analyze the distribution of values within a particular field in a dataset to gain insights into the most frequently occurring values.
fields := naive.DistinctCount(Movie{}, movies, "Year", "Genres")
genres, err := fields.GetField("Genres")
if err != nil {
	log.Fatal(err) // stop if the requested field could not be retrieved
}
genres.PrettyPrint()

Output:

Comedy = 350
Drama = 338
Thriller = 194
Horror = 162
Action = 162
Romance = 117
...
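
To make the two use cases above concrete, here is a minimal, self-contained sketch of the underlying idea using only a plain map. It does not use the library: the `Movie` type, `countGenres`, and `sortedByCount` are hypothetical names chosen for illustration, and the sample data is made up.

```go
package main

import (
	"fmt"
	"sort"
)

// Movie is a hypothetical record type used only for this sketch.
type Movie struct {
	Title  string
	Year   int
	Genres []string
}

// countGenres returns how many times each genre appears across all movies.
func countGenres(movies []Movie) map[string]int {
	counts := make(map[string]int)
	for _, m := range movies {
		for _, g := range m.Genres {
			counts[g]++
		}
	}
	return counts
}

// sortedByCount returns the keys ordered by descending count,
// with ties broken alphabetically so the order is deterministic.
func sortedByCount(counts map[string]int) []string {
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if counts[keys[i]] != counts[keys[j]] {
			return counts[keys[i]] > counts[keys[j]]
		}
		return keys[i] < keys[j]
	})
	return keys
}

func main() {
	movies := []Movie{
		{Title: "A", Year: 1999, Genres: []string{"Comedy", "Drama"}},
		{Title: "B", Year: 2001, Genres: []string{"Comedy"}},
		{Title: "C", Year: 2001, Genres: []string{"Thriller", "Drama", "Comedy"}},
	}
	counts := countGenres(movies)
	fmt.Println("distinct genres:", len(counts))
	for _, g := range sortedByCount(counts) {
		fmt.Printf("%s = %d\n", g, counts[g])
	}
}
```

The keys of the map are the unique values (useful for enums or dimension tables), and the counts give the distribution.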

For more, see the Examples in the repository.

Progress

Features

  • Naive approach using a map data structure
  • Naive approach with concurrent processing for improved performance
  • Naive approach with RxGo
  • API for calculating cardinality in a list of objects
  • Use HyperLogLog algorithm for accurate estimation of distinct values
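
As an illustration of the concurrent variant listed above, the following sketch splits the input into chunks, counts each chunk in its own goroutine, and merges the partial maps. This is one possible approach under stated assumptions, not necessarily how the library does it; `countConcurrently` and its `workers` parameter are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// countConcurrently splits values into at most `workers` chunks, counts each
// chunk in a separate goroutine, then merges the partial results.
func countConcurrently(values []string, workers int) map[string]int {
	if workers < 1 {
		workers = 1
	}
	chunk := (len(values) + workers - 1) / workers // ceiling division
	partials := make([]map[string]int, workers)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		start := w * chunk
		if start >= len(values) {
			break
		}
		end := start + chunk
		if end > len(values) {
			end = len(values)
		}
		wg.Add(1)
		go func(w int, part []string) {
			defer wg.Done()
			local := make(map[string]int)
			for _, v := range part {
				local[v]++
			}
			// Each goroutine writes only its own slot, so no lock is needed.
			partials[w] = local
		}(w, values[start:end])
	}
	wg.Wait()

	// Merge sequentially after all workers have finished.
	total := make(map[string]int)
	for _, p := range partials {
		for k, n := range p {
			total[k] += n
		}
	}
	return total
}

func main() {
	values := []string{"Comedy", "Drama", "Comedy", "Horror", "Drama", "Comedy"}
	counts := countConcurrently(values, 3)
	fmt.Println("distinct values:", len(counts))
	fmt.Println("Comedy:", counts["Comedy"])
}
```

Keeping a private map per worker avoids contention on a shared map; the price is the final merge pass, which is cheap when the number of distinct values is small relative to the input.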

Types implemented

  • Type int
  • Type string
  • Type []string
  • Type []int
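
One way to handle these four types uniformly is to map each field value to the list of keys it contributes, so that a slice field such as `Genres` counts every element separately. The sketch below assumes this behavior; `keysOf` is a hypothetical helper and not part of the library's API.

```go
package main

import (
	"fmt"
	"strconv"
)

// keysOf converts one field value into the keys that should be counted.
// Scalars yield a single key; slices yield one key per element.
func keysOf(v any) ([]string, error) {
	switch x := v.(type) {
	case int:
		return []string{strconv.Itoa(x)}, nil
	case string:
		return []string{x}, nil
	case []string:
		return x, nil // returned as-is; callers should not modify it
	case []int:
		keys := make([]string, len(x))
		for i, n := range x {
			keys[i] = strconv.Itoa(n)
		}
		return keys, nil
	default:
		return nil, fmt.Errorf("unsupported type %T", v)
	}
}

func main() {
	counts := make(map[string]int)
	for _, v := range []any{1999, "Comedy", []string{"Comedy", "Drama"}, []int{1999, 2001}} {
		keys, err := keysOf(v)
		if err != nil {
			fmt.Println("skipping:", err)
			continue
		}
		for _, k := range keys {
			counts[k]++
		}
	}
	fmt.Println(counts["Comedy"], counts["1999"], counts["Drama"], counts["2001"])
}
```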

Tests

  • Run unit tests
make test
  • Run benchmark tests
make bench
  • Run coverage
make coverage

Considerations

  • Use of the reflect package: although reflection is generally discouraged, the library uses it sparingly to build generic methods driven by struct fields and their types (the schema).
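
To show why reflection is useful here, the following sketch reads a field by name from every element of a slice of structs. It is an illustration of the technique only: `fieldValues` and the `Movie` type are hypothetical, and the library's own implementation may differ.

```go
package main

import (
	"fmt"
	"reflect"
)

// Movie is a hypothetical record type used only for this sketch.
type Movie struct {
	Title  string
	Year   int
	Genres []string
}

// fieldValues extracts the named exported field from each struct in records.
// records must be a slice of structs.
func fieldValues(records any, name string) ([]any, error) {
	rv := reflect.ValueOf(records)
	if rv.Kind() != reflect.Slice {
		return nil, fmt.Errorf("expected a slice, got %s", rv.Kind())
	}
	out := make([]any, 0, rv.Len())
	for i := 0; i < rv.Len(); i++ {
		elem := rv.Index(i)
		if elem.Kind() != reflect.Struct {
			return nil, fmt.Errorf("element %d is %s, not a struct", i, elem.Kind())
		}
		f := elem.FieldByName(name)
		if !f.IsValid() {
			return nil, fmt.Errorf("no field named %q", name)
		}
		if !f.CanInterface() {
			return nil, fmt.Errorf("field %q is unexported", name)
		}
		out = append(out, f.Interface())
	}
	return out, nil
}

func main() {
	movies := []Movie{
		{Title: "A", Year: 1999, Genres: []string{"Comedy"}},
		{Title: "B", Year: 2001, Genres: []string{"Drama"}},
	}
	years, err := fieldValues(movies, "Year")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(years)
}
```

The field name is only checked at run time, which is the usual trade-off of reflection: a typo in `"Year"` surfaces as an error value rather than a compile error.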

Glossary

  • Cardinality is a mathematical term for the number of elements in a set. For a dataset field, it is the number of distinct values the field takes. In data modeling, the word also describes relationships between two tables: how many instances of one entity are related to instances of another.
  • Distribution refers to the way data values are spread or organized within a dataset. It describes the frequency or occurrence of different values or groups of values in a specific attribute or field.

