See all results here: Google Drive
To execute the whole process on NYU's Dumbo cluster, make sure that you are logged in and have an internet connection.
Inside the folder where you store all your TSV files, type:
hdfs dfs -put *.tsv /user/<netid>/
If you want to check if all files are uploaded, type:
hdfs dfs -ls /user/<netid>/*.tsv
Don't forget to replace netid with your actual NetID, and remove the angle brackets (< and >).
Type:
for f in `hdfs dfs -ls /user/<netid>/*.tsv | grep <netid> | awk '{print $8}'`;do spark-submit --conf spark.pyspark.python=/share/apps/python/3.4.4/bin/python <your file path>/cleaner.py $f; done;
Note: this will write all files that are ready for clustering as num-*.out, and all files ready for AVF as cat-*.out, in your HDFS folder.
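For reference, here is a minimal sketch of the kind of splitting cleaner.py performs, assuming it separates numeric columns (for clustering) from categorical columns (for AVF). The actual script may differ; the column-detection logic below is purely illustrative.

import sys
from pyspark import SparkContext

def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

sc = SparkContext(appName='cleaner-sketch')
path = sys.argv[1]  # the $f passed in by the for loop above
rows = sc.textFile(path).map(lambda l: l.split('\t'))

# Probe the first row to decide which columns are numeric; a real cleaner
# would sample more rows and handle headers and nulls.
first = rows.first()
num_idx = [i for i, v in enumerate(first) if is_number(v)]
cat_idx = [i for i, v in enumerate(first) if not is_number(v)]

# Derive the output prefix from the input file name, e.g. ABC.tsv -> ABC.
name = path.rstrip('/').split('/')[-1].rsplit('.', 1)[0]
rows.map(lambda r: '\t'.join(r[i] for i in num_idx)).saveAsTextFile('num-' + name + '.out')
rows.map(lambda r: '\t'.join(r[i] for i in cat_idx)).saveAsTextFile('cat-' + name + '.out')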
Inside where you want to store all processed files:
for f in `hdfs dfs -ls -d /user/<netid>/*.out/ | grep <netid> | awk '{print $8}' | xargs -n 1 basename`; do hdfs dfs -getmerge $f ${f%.*}.tsv; done;
This preserves the original .out file name. For example, ABC.out is merged to ABC.tsv
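The ${f%.*} expansion in the loop strips the trailing .out extension before .tsv is appended; in Python terms it is roughly:

f = 'ABC.out'                 # hypothetical directory name
f.rsplit('.', 1)[0] + '.tsv'  # -> 'ABC.tsv'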
./Spark_AVF/run_spark_avf.sh
This will output _outliers and _null files, which contain the outliers and the null data of the input file, respectively. The default outlier percentage is 5%; it can be changed in run_spark_avf.sh.
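If you are curious how AVF scoring works, the sketch below shows one way to express it in PySpark: each row is scored by the mean frequency of its attribute values, and the rows with the lowest scores are flagged as outliers. This is only an illustration under those assumptions, not the code inside run_spark_avf.sh; the input file name is hypothetical and the 5% cutoff mirrors the default mentioned above.

from pyspark import SparkContext

sc = SparkContext(appName='avf-sketch')
rows = sc.textFile('cat-ABC.out').map(lambda l: l.split('\t'))  # hypothetical input

# Count how often each (column, value) pair occurs across the data set.
freqs = (rows.flatMap(lambda r: [((i, v), 1) for i, v in enumerate(r)])
             .reduceByKey(lambda a, b: a + b)
             .collectAsMap())  # fine for a sketch; very large data would need a join
bfreqs = sc.broadcast(freqs)

# AVF score of a row = mean frequency of its attribute values.
def avf(row):
    return sum(bfreqs.value[(i, v)] for i, v in enumerate(row)) / len(row)

scored = rows.map(lambda r: (avf(r), r))
cutoff = int(rows.count() * 0.05)           # default 5% outliers
outliers = scored.sortByKey().take(cutoff)  # lowest AVF scores first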
./MR_AVF/MRAVF_run.sh
<type the location of the file>
This will output answer, which contains the outliers of the file.
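The MapReduce formulation of the same algorithm needs two passes, because the per-value frequencies must be joined back onto each record before scoring. Below is a minimal sketch using the mrjob library; it is not the repo's MR_AVF code (whose internals are not shown here), only an illustration of the usual two-pass shape.

from mrjob.job import MRJob
from mrjob.step import MRStep

class MRAVFSketch(MRJob):
    def steps(self):
        return [
            # Pass 1: count each (column, value) pair, then credit that
            # count back to every record holding the pair.
            MRStep(mapper=self.mapper_pairs, reducer=self.reducer_counts),
            # Pass 2: sum the per-attribute counts into one score per record.
            MRStep(reducer=self.reducer_score),
        ]

    def mapper_pairs(self, _, line):
        fields = line.rstrip('\n').split('\t')
        for col, val in enumerate(fields):
            yield (col, val), line

    def reducer_counts(self, key, lines):
        lines = list(lines)  # all record occurrences sharing this (col, value)
        freq = len(lines)    # frequency of the attribute value
        for line in lines:
            yield line, freq

    def reducer_score(self, line, freqs):
        freqs = list(freqs)
        # AVF score = mean frequency of the record's attribute values;
        # records with the LOWEST scores are the outlier candidates.
        # (Duplicate identical rows collapse into one key; the mean is
        # unaffected.)
        yield sum(freqs) / len(freqs), line

if __name__ == '__main__':
    MRAVFSketch.run()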
./Spark_AVF/run_spark_avf_compare.sh
<type the name of the file>
./MR_AVF/MRAVF_run.sh
<type the location of the file>
This will output the time taken to run MR-AVF and Spark AVF, respectively. Note that the input file for MR-AVF must first be cleaned by an alternative cleaner, cleaner_mr, with header = False.
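cleaner_mr itself is not shown here; presumably it mirrors cleaner.py except that, with header = False, it drops the header row, since a MapReduce job processes each line independently and cannot easily skip a header. A hypothetical sketch of that one difference (all names are illustrative):

from pyspark import SparkContext

sc = SparkContext(appName='cleaner-mr-sketch')
rows = sc.textFile('ABC.tsv')  # hypothetical input

# With header = False, drop the first line so MR-AVF never counts header
# labels as attribute values.
headerless = (rows.zipWithIndex()
                  .filter(lambda pair: pair[1] > 0)
                  .map(lambda pair: pair[0]))
headerless.saveAsTextFile('ABC_mr_clean.out')  # hypothetical output name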