Large Text Word Count

A study on Exact and Approximate Occurrences Counters

Description

Counting events in parallel in a memory-efficient way is not a new problem, but it remains under active discussion because there is still considerable room for improvement. Most current solutions save memory by using probabilistic counters, which estimate the total number of occurrences of each event rather than storing it exactly.

This project applies two well-known approximate counters to estimate the most frequently used words in literary works by several authors, written in several languages, and compares those estimates against an exact counter. The conclusions drawn from applying the study to the dataset are presented in the project report.
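The idea of comparing an exact counter with approximate ones can be sketched as follows. This is a minimal illustration, not the code in /src: this README does not name the two approximate counters, so the fixed-probability counter and the Morris-style logarithmic counter below are assumptions chosen because they are common textbook examples. The function names and the `p` and `base` parameters are hypothetical.

```python
import random
from collections import Counter


def exact_count(words):
    """Exact counter: one integer per distinct word."""
    return Counter(words)


def fixed_probability_count(words, p=0.5, rng=None):
    """Assumed approximate counter: register each occurrence with probability p.

    Returns the estimated counts (stored value scaled by 1/p).
    """
    rng = rng or random.Random()
    stored = Counter()
    for word in words:
        if rng.random() < p:
            stored[word] += 1
    return {word: value / p for word, value in stored.items()}


def logarithmic_count(words, base=2.0, rng=None):
    """Assumed Morris-style counter: raise the stored exponent k with probability base**-k.

    The estimate for a stored exponent k is (base**k - 1) / (base - 1).
    """
    rng = rng or random.Random()
    exponents = Counter()
    for word in words:
        k = exponents[word]  # 0 for a word not seen yet
        if rng.random() < base ** -k:
            exponents[word] = k + 1
    return {w: (base ** k - 1) / (base - 1) for w, k in exponents.items()}


if __name__ == "__main__":
    text = "the cat and the dog and the bird"
    words = text.split()
    print("exact:      ", exact_count(words).most_common(3))
    print("fixed p=0.5:", fixed_probability_count(words, p=0.5))
    print("logarithmic:", logarithmic_count(words, base=2.0))
```

The approximate counters store smaller values than the exact one (roughly a fraction `p` of the count, or its logarithm), which is where the memory saving comes from. Their output varies between runs unless a seeded `random.Random` is passed as `rng`.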

Repository Structure

/datasets - literary works taken from Project Gutenberg used as input data

/report - documentation of the conducted study

/results - outputs produced by the implemented code

/src - source code of the algorithms

Data Visualization

Counter estimates of each algorithm for the top 10 words.

Counter deviations of each algorithm for the top 50 words.
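As a rough illustration of what a per-word deviation could look like, the sketch below compares estimated counts with exact counts for the most frequent words. The exact metric used in the report is not stated in this README, so the absolute and relative errors here are assumptions, and `deviations` and `top_n` are hypothetical names.

```python
def deviations(exact, estimated, top_n=50):
    """Return (word, exact, estimate, absolute error, relative error) rows
    for the top_n words ranked by their exact count."""
    ranked = sorted(exact.items(), key=lambda item: item[1], reverse=True)[:top_n]
    rows = []
    for word, true_count in ranked:
        estimate = estimated.get(word, 0)  # a word the counter never registered
        abs_err = abs(estimate - true_count)
        rows.append((word, true_count, estimate, abs_err, abs_err / true_count))
    return rows


if __name__ == "__main__":
    exact = {"the": 10, "of": 4}
    estimated = {"the": 12.0}
    for row in deviations(exact, estimated, top_n=2):
        print(row)
```

Ranking by the exact counts keeps the comparison anchored to the true top words, even when an approximate counter would order them differently.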

Instructions to Run

$ cd src
$ pip3 install -r requirements.txt
$ python3 WordOccurrenceCounting.py

Author

The author of this repository is Filipe Pires. The project was developed for the Advanced Algorithms course of the Master's degree in Informatics Engineering at the University of Aveiro.

For further information, read the report or contact me at filipesnetopires@ua.pt.
