# spark-LDA-example

A simple Spark LDA example. This project contains a basic document clustering example that also covers data cleaning.

The document clustering pipeline consists of the following steps:

  1. Spark RegexTokenizer : For Tokenization

  2. Stanford NLP Morphology : For Stemming and lemmatization

  3. Spark StopWordsRemover : For removing stop words and punctuation

  4. Spark TF-IDF : For computing term frequencies or TF-IDF weights

  5. Spark LDA : For clustering the documents
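
To make the data-cleaning steps concrete, here is a minimal sketch in plain Python (standard library only, not Spark code and not this project's implementation) of what steps 1, 3 and 4 compute on a toy corpus. The corpus, the stop-word list and the function names are all hypothetical. The smoothed IDF formula `log((N + 1) / (df + 1))` is assumed to match the one used by Spark's `IDF` estimator. Lemmatization (step 2) and LDA (step 5) are left out because they need the Stanford NLP and Spark libraries.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus and stop-word list, for illustration only.
DOCS = [
    "The cat sat on the mat.",
    "The dog chased the cat!",
    "Stocks fell as the markets closed.",
]
STOP_WORDS = {"the", "on", "as", "a", "an", "and"}


def tokenize(text):
    """Step 1: lowercase, then split on runs of non-word characters."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


def remove_stop_words(tokens):
    """Step 3: drop tokens that appear in the stop-word list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


def tf_idf(token_lists):
    """Step 4: raw term counts weighted by a smoothed IDF.

    Assumed formula: idf(t) = log((N + 1) / (df(t) + 1)), where N is the
    number of documents and df(t) the number of documents containing t.
    """
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in token_lists
    ]


cleaned = [remove_stop_words(tokenize(doc)) for doc in DOCS]
vectors = tf_idf(cleaned)
```

In this sketch punctuation disappears during tokenization, because the split pattern treats it as a delimiter. Terms that occur in fewer documents, such as `sat`, receive a higher weight than terms shared across documents, such as `cat`. Weighted vectors of this kind are what the LDA step would take as input.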