MapReduce is a scalable computation model for batch processing of very large datasets. Serverless computing is a cloud programming paradigm in which software is deployed with resources allocated on demand, without the need to manage server infrastructure. We wanted to bring these two models together and create a serverless MapReduce implementation that offers the advantages of both, such as cost-effectiveness and parallelism. We used AWS Lambda to run the cloud functions that perform map and reduce tasks, and Amazon S3 as a distributed object store for inputs, outputs, and intermediate data.
The files in this repository are described below:
runner.py: Client-side code that the user runs to schedule map and reduce workers.
mapper_lambda_function.py: Code running on the Mapper AWS Lambda instances. The user defines a Map(K, V) → [(K, V)] function here.
reducer_lambda_handler.py: Code running on the Reducer AWS Lambda instances. The user defines a Reduce(K, Val-list) → Output function here.
process_output.py: Test script that combines the output of the word count example job.
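
To make the Map(K, V) → [(K, V)] and Reduce(K, Val-list) → Output signatures concrete, here is a minimal word count sketch that runs entirely on a local machine. It is an illustration only, not the code in this repository: the names `map_fn`, `reduce_fn`, and `run_local` are hypothetical, and the in-memory grouping step stands in for the intermediate data that the real job would keep in S3.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_fn(key: str, value: str) -> List[Tuple[str, int]]:
    """Map(K, V) -> [(K, V)]: emit (word, 1) for every word in the input text.

    `key` (for example a document name) is unused for word count.
    """
    return [(word.lower(), 1) for word in value.split()]


def reduce_fn(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce(K, Val-list) -> Output: sum all counts emitted for one word."""
    return key, sum(values)


def run_local(records: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Run map, group-by-key, and reduce in one process (hypothetical helper).

    Assumption: this in-memory grouping only imitates the shuffle step;
    it does not use AWS Lambda or S3.
    """
    grouped: Dict[str, List[int]] = defaultdict(list)
    for in_key, in_value in records:
        for out_key, out_value in map_fn(in_key, in_value):
            grouped[out_key].append(out_value)
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample = [("doc1", "the quick fox"), ("doc2", "The lazy dog the end")]
    print(run_local(sample))
```

In a serverless run, each call to `map_fn` or `reduce_fn` would presumably happen inside a separate Lambda invocation, which is where the parallelism comes from; the sketch only shows the shape of the two user-defined functions.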