Topic Modelling

The main focus of this project is the application of topic modelling to short documents collected from the social media platform Twitter. The algorithm used for this purpose is Latent Dirichlet Allocation (LDA), one of the simplest topic models. The Apache Spark engine, together with the underlying Hadoop Distributed File System (HDFS), is used to distribute work across all nodes/machines.
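
As a rough illustration of such a pipeline, the sketch below tokenises documents, builds term-count vectors, and fits an LDA model with Spark's DataFrame-based API. The object name, column names, the number of topics (k = 10), the iteration count, and reading one tweet per line are assumptions for illustration, not this repository's actual code.

import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer, StopWordsRemover}
import org.apache.spark.sql.SparkSession

object LdaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("lda-sketch").getOrCreate()

    // Assumption: each line of the input files under the given HDFS URL is one tweet.
    val docs = spark.read.textFile(args(0)).toDF("text")

    // Tokenise on non-word characters, then drop common stop words.
    val tokens = new RegexTokenizer()
      .setInputCol("text").setOutputCol("tokens").setPattern("\\W+")
      .transform(docs)
    val cleaned = new StopWordsRemover()
      .setInputCol("tokens").setOutputCol("terms")
      .transform(tokens)

    // Turn each document into a vector of term counts, the input LDA expects.
    val counts = new CountVectorizer()
      .setInputCol("terms").setOutputCol("features")
      .fit(cleaned).transform(cleaned)

    // Fit LDA with an assumed k = 10 topics and print the top terms per topic.
    val model = new LDA().setK(10).setMaxIter(50).fit(counts)
    model.describeTopics(5).show(false)

    spark.stop()
  }
}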

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

  • SBT 0.13.12
  • Apache Spark 2.1.0
  • Scala 2.11.0
  • Hadoop 2.7.3

The dataset was collected from Twitter using the TwitterCollector script.

Installing

  1. Run the following commands for the initial project setup:
git clone http://github.com/arajski/topic-modelling
cd topic-modelling
  2. Edit the submit-spark.sh file to make sure it contains the correct paths for Hadoop and Apache Spark (the file ships with a sample configuration; a hypothetical sketch is shown after this list).
  3. To run the application and submit it to the Apache Spark cluster, execute the following script with an HDFS URL as its parameter. The URL should point to the directory where the data files are stored.
./submit-spark.sh hdfs_url
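
For step 2, the following is a hypothetical sketch of what submit-spark.sh might look like; every path, the master URL, the main class name, and the jar name are assumptions and must be adapted to your installation and this project's build output.

#!/bin/bash
# Hypothetical sample configuration; adjust every value below to your setup.
export HADOOP_HOME=/usr/local/hadoop
export SPARK_HOME=/usr/local/spark
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

# Build the application jar, then submit it to the cluster, forwarding the
# HDFS URL that was passed as the first argument.
sbt package
"$SPARK_HOME"/bin/spark-submit \
  --class Main \
  --master spark://localhost:7077 \
  target/scala-2.11/topic-modelling_2.11-1.0.jar "$1"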

The first run will download all dependencies, including the Stanford CoreNLP library, compile the solution, and run the test suites.
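
CoreNLP is typically used for preprocessing steps such as lemmatisation. The sketch below shows one common way to drive the library from Scala; it is an illustration only, not necessarily how this repository wires CoreNLP into its pipeline.

import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import scala.collection.JavaConverters._

object LemmatizeSketch {
  // Annotators: tokenise, split sentences, tag parts of speech, lemmatise.
  private val props = new Properties()
  props.setProperty("annotators", "tokenize, ssplit, pos, lemma")
  private val pipeline = new StanfordCoreNLP(props)

  // Returns the lemma of every token in the given text.
  def lemmas(text: String): Seq[String] = {
    val doc = new Annotation(text)
    pipeline.annotate(doc)
    doc.get(classOf[CoreAnnotations.TokensAnnotation]).asScala
      .map(_.get(classOf[CoreAnnotations.LemmaAnnotation]))
      .toSeq
  }
}

Under these assumptions, LemmatizeSketch.lemmas("cats were running") would yield Seq("cat", "be", "run"), which reduces the vocabulary the topic model has to cover.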

Running the tests

To run the test suites, simply run sbt test. Test cases are available in the src/test/scala directory; a hypothetical example of what a suite might look like is shown below.
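
The following ScalaTest suite is a hypothetical example only; the tokenize helper is illustrative and not taken from this repository.

import org.scalatest.FlatSpec

class TokenizerSpec extends FlatSpec {
  // Assumed helper for illustration: lowercase and split on non-word characters.
  private def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  "tokenize" should "split a tweet into lowercase terms" in {
    assert(tokenize("Spark LDA demo!") == Seq("spark", "lda", "demo"))
  }
}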

Built With

  • Apache Spark
  • Scala
  • SBT
  • Hadoop
  • Stanford CoreNLP

License

This project is licensed under the MIT License; see the LICENSE file for details.
