Distributed Data Mining Lab - TUM SS_2017

Summary

This wiki contains resources on how to set up a distributed environment for data mining and analysis.

Technologies/Resources involved are:

Weekly Progress

The wiki for the Distributed Data Mining lab course is organized by week.

Week 1 - Setting up the Virtual Machine
- Introduction to OpenNebula
- VM Creation Steps
- Logging in the Virtual Machine through SSH
Week 2 - Setting up Single-Node and Multi-Node Clusters
- Introduction to Hadoop
- Single Node Hadoop Installation
- Multi Node Hadoop Installation
Week 3 - Exploration of Hadoop & Spark
- Introduction to Spark
- Multi Node Spark Installation
- Examples of Hadoop and Spark
Week 4 - Programming Experience on Hadoop and Spark
- Introduction to MLlib
- Web UI - Hadoop, YARN and Spark Cluster
- Examples of Hadoop and Spark
- Troubleshooting
Week 5 - Programming Experience on Hadoop and Spark (continued)
- Prime Number Examples
- Spark Examples
- Performance Analysis
- Troubleshooting
Week 6 - Extraction of NCBI Database: Part 1
- NCBI API
- Retrieving PubMed Reports
- Troubleshooting
Week 7 - Extraction of Elsevier Data: Part 2
- Elsevier API
- Nalaf, LocText, StringText
- Docker
Week 8 - Elasticsearch Installation, LocText Installation & Parsing of Full-Text Papers
- LocText Installation
- Multi Node Elasticsearch Installation
- Parsing of PubMed Records
Week 9 - Storing Parsed Papers in Elasticsearch
- Basic architecture
- Elasticsearch Storage
- Remarks
Week 10 - Text-mining the relationship of Protein & Cell with LocText
- Elasticsearch Records
- Mining with LocText
- Troubleshooting
Week 11 - Text-mining the relationship of Protein & Cell with LocText and MapReduce
- Creating Additional Disks for Elasticsearch Records
- Mining with LocText and Nalaf
- Writing a MapReduce Job
- Remarks
Week 12 - Text-mining the relationship of Protein & Cell with LocText and MapReduce
- Mining with LocText and Nalaf
- MapReduce Job
- Kibana Visualizations
- Troubleshooting
Final Presentation: prezi.pdf

References

https://linoxide.com/cluster/setup-hadoop-multi-node-cluster-ubuntu/ (Multinode Hadoop Setup)
http://data-flair.training/blogs/apache-spark-installation-on-multi-node-cluster-step-by-step-guide/ (Multinode Spark Setup)
https://spark.apache.org/docs/1.2.0/mllib-guide.html (MLlib Documentation)
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ (MapReduce in Python)
http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_1_--_Running_WordCount (Map Reduce Tutorial)
https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-getting-started (Docker Installation)
https://github.com/Rostlab/LocText (LocText)
https://github.com/Rostlab/nalaf (Nalaf)
https://github.com/titipata/pubmed_parser (PubMed Parser)
https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html (Elasticsearch)
http://hadoop.apache.org/
https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
https://rubenmiddeljans.files.wordpress.com/2015/08/hadoop-cluster.jpg
http://spark.apache.org/faq.html
https://spark.apache.org/mllib/
https://www.digitalocean.com/community/tutorials/how-to-set-up-an-nfs-mount-on-ubuntu-16-04
https://docs.docker.com/engine/installation/linux/ubuntu/

IshmeetKaur/Distributed-Data-Mining-Lab
