texasmichelle/challenge

Interview Challenge

To get started, download and install Scala 2.10, sbt, and Spark 1.6.1.

Download and install Spark

Download the prebuilt version 1.6.1 from the Apache Spark downloads page.
Move it to the standard installation directory on your machine.
Set the $SPARK_HOME environment variable to this directory.
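The steps above can be sketched as a short shell snippet. The install path used here is only an assumption (it follows the naming of the prebuilt Hadoop 2.6 package), and check_spark_home is a hypothetical helper, not part of this repository. It simply confirms that spark-submit sits under bin/ in the directory you chose.

```shell
# Hypothetical install location; substitute the directory you actually used.
export SPARK_HOME="${SPARK_HOME:-/opt/spark-1.6.1-bin-hadoop2.6}"

# Hypothetical helper: report whether spark-submit is executable
# under bin/ of the given directory.
check_spark_home() {
  if [ -x "$1/bin/spark-submit" ]; then
    echo "ok"
  else
    echo "missing"
  fi
}

check_spark_home "$SPARK_HOME"
```

If this prints "missing", $SPARK_HOME most likely points at the wrong directory or at the downloaded archive rather than the extracted directory.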

Compilation

To build from source, execute the package command from sbt:

challenge.git$ sbt package

Input files

Copy the OANC input transcripts into the resources directory. The expected path is:

resources/OANC-GrAF/data/spoken/telephone/switchboard
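Before running the job, it may help to confirm that the transcripts landed in the expected place. The following is a minimal sketch, run from the repository root; count_inputs is a hypothetical helper and makes no assumption about the file names or extensions inside the directory.

```shell
# Expected input location, as given above.
INPUT_DIR="resources/OANC-GrAF/data/spoken/telephone/switchboard"

# Hypothetical helper: count regular files under a directory,
# printing 0 if the directory does not exist.
count_inputs() {
  if [ -d "$1" ]; then
    find "$1" -type f | wc -l | tr -d ' '
  else
    echo 0
  fi
}

echo "input files found: $(count_inputs "$INPUT_DIR")"
```

A count of 0 suggests the OANC data was copied to a different level of the directory tree.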

Execution

To generate output files, submit the jar you just created to Spark in local mode. This runs the job on a single machine, without a cluster.

challenge.git$ $SPARK_HOME/bin/spark-submit target/scala-2.10/interview-challenge_2.10-1.0.jar

Results

The relevant output files can be found here:

output/feature1.txt  
output/feature2.txt

Scalability

While this sample code runs on a single node, the driver could easily be modified to run on a full Spark cluster, whether standalone or Hadoop-based. Because sc.wholeTextFiles() reads each file as a single record, every individual input file must be small enough to fit into memory. Provided that holds, this solution should scale well once file consolidation functionality is added.
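One way to move between a local run and a cluster is spark-submit's --master flag. The sketch below only builds and prints the command line. The master URL spark://master-host:7077 is a placeholder, submit_cmd is a hypothetical helper, and the approach assumes the driver does not hard-code a master through SparkConf, since a value set there takes precedence over the flag.

```shell
# Jar produced by `sbt package`, as referenced above.
JAR="target/scala-2.10/interview-challenge_2.10-1.0.jar"

# Hypothetical helper: print the spark-submit command for a given master URL.
submit_cmd() {
  echo "$SPARK_HOME/bin/spark-submit --master $1 $JAR"
}

submit_cmd "local[*]"                  # all cores on the local machine
submit_cmd "spark://master-host:7077"  # placeholder standalone cluster URL
```

When running the printed command by hand, quote local[*] so the shell does not try to expand the brackets.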
