1 Trillion Row Challenge in Spark.
Also just a good exercise in writing + running a Spark job from scratch 🤷
Mostly set up using sparkProjectTemplate.g8.
(Requires a working Java installation, tested with OpenJDK 11)
I've included 2 Parquet files (~24 MB each, sorry) so that this job can be run fully locally.
Also included is Paul Phillips's SBT runner, so you should be able to run the job like so:
./sbt "run sample_data/measurements-*.parquet ./output"
selecting the local application (TrillionRowChallengeLocalApp) when prompted.
You can then inspect the output, e.g.:
> head -n 5 output/*.csv
Abha,60.1,-27.1,17.996445512228416
Abidjan,65.5,-19.5,25.996608497255856
Abéché,72.5,-12.9,29.427604933154107
Accra,71.0,-14.8,26.397831693177235
Addis Ababa,56.8,-25.0,15.983908400715903
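For illustration, the core of the job is a group-by aggregation over station names. This is a minimal sketch in plain Python (not the actual Spark code, and the exact column order in the CSV above is not guaranteed to match), showing the min/max/mean computation per station:

```python
# Illustrative sketch of the per-station aggregation the Spark job performs.
# This is plain Python, not the project's Scala/Spark code.
from collections import defaultdict


def aggregate(readings):
    """readings: iterable of (station, temperature) pairs.

    Returns {station: (min, max, mean)}.
    """
    # Accumulator per station: [min, max, running sum, count]
    acc = defaultdict(lambda: [float("inf"), float("-inf"), 0.0, 0])
    for station, temp in readings:
        a = acc[station]
        a[0] = min(a[0], temp)
        a[1] = max(a[1], temp)
        a[2] += temp
        a[3] += 1
    return {s: (mn, mx, total / n) for s, (mn, mx, total, n) in acc.items()}
```

In Spark this is a single `groupBy` plus `min`/`max`/`avg` aggregates; the sketch above just makes the arithmetic explicit.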
You can also run the tests with
./sbt test
In order to properly test this job at scale, I've been running it on ephemeral YARN clusters on EMR. I've tried to make this process as reproducible as possible by hardcoding the job configuration in a submit script (scripts/emr_submit.py), which can be run with a little AWS configuration on the user's end.
The steps to get this up and running are (approximately) as follows:
- Build a JAR with ./sbt assembly
- Set up an S3 bucket for storing the JAR as well as output data and logs
- Copy said JAR to the S3 bucket
- Update the hardcoded values in the emr_submit.py script
- Run the script (python scripts/emr_submit.py)
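The submission step can be sketched roughly as follows. This is a hypothetical outline of what a script like scripts/emr_submit.py might do, not its actual contents: it builds an EMR "step" that runs spark-submit against the JAR on S3 (the function name, bucket paths, and main class below are all placeholders), then hands it to boto3.

```python
# Hypothetical sketch of an EMR submit script; all names/paths are placeholders,
# not the repo's actual hardcoded values.
def build_spark_step(jar_s3_path, input_path, output_path, main_class):
    """Build an EMR step dict that runs spark-submit via command-runner.jar."""
    return {
        "Name": "trillion-row-challenge",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--class", main_class,
                jar_s3_path,
                input_path,   # application args follow the JAR path
                output_path,
            ],
        },
    }


# A real script would then do something like (requires boto3 + AWS credentials):
# import boto3
# emr = boto3.client("emr")
# emr.run_job_flow(Name="1trc", Instances={...},
#                  Steps=[build_spark_step(...)], ...)
```

The step dict shape (command-runner.jar wrapping spark-submit) is the standard way to run Spark applications as EMR steps; cluster sizing and instance configuration are left out here.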