GitHub - zhermin/topkcommonwords: Extracts the Top K Common Words between 2 Text Files using Hadoop's MapReduce


Folders and files (2 commits):
data
temp
ASSIGN_1.out
README
TopkCommonWords$File1Mapper.class
TopkCommonWords$File2Mapper.class
TopkCommonWords$InnerJoinReducer$1.class
TopkCommonWords$InnerJoinReducer.class
TopkCommonWords.class
TopkCommonWords.java
answer.txt
check_common.py
cm.jar
output.txt
slurm_run.sh
submit


############### Assignment 1-1: Top K Common Words #####################

Command Format: TopkCommonWords <input_file1> <input_file2> <stopwords> <output_dir> <k>
Example: hadoop jar cm.jar TopkCommonWords commonwords/input/task1-input1.txt commonwords/input/task1-input2.txt commonwords/input/stopwords.txt commonwords/cm_output/ 10
(All file paths in the command are HDFS paths.)
By default, the output is stored in the text file commonwords/cm_output/part-r-00000.
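
For small inputs, it can help to have a local, non-Hadoop reference to compare the job's output against. The sketch below is not the assignment's MapReduce solution. It only illustrates one plausible reading of "top k common words" under stated assumptions: tokens are split on whitespace, stopwords are dropped, a common word is scored by the smaller of its two per-file counts, and ties are broken alphabetically. None of these rules are given in this README, and the class and method names are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TopKCommonWordsSketch {

    // Count every token that is not a stopword.
    // Splitting on whitespace is an assumption, not the assignment's rule.
    static Map<String, Integer> countWords(String text, Set<String> stopwords) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty() || stopwords.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Keep words present in both texts, score each by the smaller count
    // (assumption), sort by score descending, and return the first k.
    static List<Map.Entry<String, Integer>> topK(
            String text1, String text2, Set<String> stopwords, int k) {
        Map<String, Integer> counts1 = countWords(text1, stopwords);
        Map<String, Integer> counts2 = countWords(text2, stopwords);

        List<Map.Entry<String, Integer>> common = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts1.entrySet()) {
            Integer other = counts2.get(entry.getKey());
            if (other != null) {
                int score = Math.min(entry.getValue(), other);
                common.add(new AbstractMap.SimpleEntry<>(entry.getKey(), score));
            }
        }

        // Higher score first; alphabetical order breaks ties (assumption).
        common.sort((a, b) -> {
            int byScore = Integer.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(common.subList(0, Math.min(k, common.size())));
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "on", "and", "with"));
        String text1 = "the cat sat on the mat with the cat";
        String text2 = "the cat and the dog saw the cat and the mat";
        for (Map.Entry<String, Integer> entry : topK(text1, text2, stopwords, 2)) {
            // The count-then-word layout is illustrative; 'answer.txt' defines the real format.
            System.out.println(entry.getValue() + "\t" + entry.getKey());
        }
    }
}
```

If the scoring or tie-breaking rule in the assignment differs, only the score computation and the comparator need to change.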

Use the provided scripts to compile and submit your code:
$ sbatch slurm_run.sh           # compile and run your code on the sample dataset
$ squeue -u <username>          # check the status of your task; an empty queue means the task has finished
$ cat ASSIGN_1.out              # check the result
$ ./submit                      # submit your code (you can submit multiple times before the deadline)
These scripts will also be used for marking, so ensure your output format is the same as 'answer.txt'.
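
Because marking depends on the output matching 'answer.txt', a line-by-line comparison can catch formatting slips before submission. The following is a minimal sketch, assuming the job's output has already been copied to a local file. The class name OutputFormatCheck and its method are hypothetical and not part of the provided scripts.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

class OutputFormatCheck {

    // Returns the 1-based number of the first line that differs,
    // or -1 when both files have identical lines.
    static int firstDifference(List<String> expected, List<String> actual) {
        int shared = Math.min(expected.size(), actual.size());
        for (int i = 0; i < shared; i++) {
            if (!expected.get(i).equals(actual.get(i))) {
                return i + 1;
            }
        }
        // All shared lines match; a length mismatch differs at the next line.
        return expected.size() == actual.size() ? -1 : shared + 1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: java OutputFormatCheck <expected_file> <actual_file>");
            return;
        }
        List<String> expected = Files.readAllLines(Paths.get(args[0]));
        List<String> actual = Files.readAllLines(Paths.get(args[1]));
        int line = firstDifference(expected, actual);
        System.out.println(line < 0 ? "Outputs match" : "First difference at line " + line);
    }
}
```

For example, `java OutputFormatCheck answer.txt <local_copy_of_part-r-00000>` reports the first mismatching line. Keep such a helper outside the submission directory, since new files must not be added to the assignment.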

In this assignment, you should ONLY modify:
-- TopkCommonWords.java

Do NOT add new files. Do NOT define Java packages. The script will compile 'TopkCommonWords.java' as a simple Java file.
