Word Count


"... This book will be a great resource for
both readers looking to implement existing
algorithms in a scalable fashion and readers
who are developing new, custom algorithms
using Spark. ..."

Dr. Matei Zaharia
Original Creator of Apache Spark

FOREWORD by Dr. Matei Zaharia

Introduction to Word Count

  • Word Count is a simple, easy-to-understand algorithm that is straightforward to implement as a MapReduce or Spark application. Given a set of text documents, the program counts the number of occurrences of each word.

  • Word Count computes the frequency of each word in a set of documents or files. The goal is to create a dictionary of (key, value) pairs, where the key is a word (a String) and the value is an Integer giving the frequency of that word.

  • A complete set of solutions is given for the Word Count problem using

  • BEFORE reduction filter: you may add filter() to remove undesired words. This can be done right after tokenizing records.

  • AFTER reduction filter: to keep only the desired (word, frequency) pairs in the final result, you may add filter() to remove elements where frequency < N, where N (an integer) is your threshold. This is done after the reduction.
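
The two filters described above can be sketched in plain Python. This is only an illustration, not one of the solutions in this repository and not Spark code: the names `word_count`, `STOP_WORDS`, and `MIN_FREQUENCY` are hypothetical, and lowercasing plus whitespace splitting is an assumed tokenization rule.

```python
from collections import Counter

# Hypothetical set of undesired words (assumption for this sketch).
STOP_WORDS = {"a", "the", "of"}
# Hypothetical threshold N: keep only words with frequency >= N.
MIN_FREQUENCY = 2


def word_count(records, stop_words=STOP_WORDS, n=MIN_FREQUENCY):
    """Count words in an iterable of text records, with both filters applied."""
    # Tokenize: lowercase each record and split it on whitespace.
    words = (w for record in records for w in record.lower().split())
    # BEFORE reduction filter: drop undesired words right after tokenizing.
    kept = filter(lambda w: w not in stop_words, words)
    # Reduction: sum the occurrences of each remaining word.
    counts = Counter(kept)
    # AFTER reduction filter: remove elements where frequency < n.
    return {word: freq for word, freq in counts.items() if freq >= n}


records = ["the fox jumped", "the fox ran", "a red fox"]
result = word_count(records)
print(result)
```

In a Spark program, the same two steps would typically sit on either side of the reduction as `filter()` transformations. Filtering before the reduction also shrinks the data that has to be shuffled.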


Word Count in MapReduce
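
As a rough illustration of how Word Count decomposes into MapReduce phases, the sketch below simulates map, shuffle, and reduce in a single Python process. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `mapreduce_word_count`) are hypothetical and do not come from any MapReduce framework. A real job would run these phases distributed across many workers.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word, 1) for word in record.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)
    return groups


def reduce_phase(word, ones):
    """Reduce: sum the grouped values to get the word's frequency."""
    return (word, sum(ones))


def mapreduce_word_count(records):
    # Run the map phase over every record and collect all (word, 1) pairs.
    pairs = [pair for record in records for pair in map_phase(record)]
    # Group the pairs by word, then reduce each group to (word, frequency).
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(word, ones) for word, ones in groups.items())


counts = mapreduce_word_count(["fox jumped", "fox ran"])
print(counts)
```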

Word Count in Picture



