Skip to content

Extracts the Top K Common Words between 2 Text Files using Hadoop's MapReduce

Notifications You must be signed in to change notification settings

zhermin/topkcommonwords

Repository files navigation

############### Assignment 1-1: Top K Common Words #####################

Command Format: TopkCommonWords <input_file1> <input_file2> <stopwords> <output_dir> <k>
Example: hadoop jar cm.jar TopkCommonWords commonwords/input/task1-input1.txt commonwords/input/task1-input2.txt commonwords/input/stopwords.txt commonwords/cm_output/ 10
(All the file path in the command is HDFS path)
The output should be stored in a text file commonwords/cm_output/part-r-00000 by default.

Use scripts to compile and submit your codes
$ sbatch slurm_run.sh           # compile and run your code on sample dataset
$ squeue -u <username>          # check status of your task, if the queue is empty means that the task is finished
$ cat ASSIGN_1.out              # check result
$ ./submit			# submit your codes (You can submit multiple times before due time)
These scripts will also be used on marking. So ensure your output format is the same as 'answer.txt'

In this assignment, you should ONLY modify:
-- TopkCommonWords.java

Do NOT add new files. Do NOT define Java packages. The script will compile 'TopkCommonWords.java' as simple java file.

About

Extracts the Top K Common Words between 2 Text Files using Hadoop's MapReduce

Topics

Resources

Stars

Watchers

Forks