This wiki contains resources on how to set up a distributed environment for data mining and analysis.
The wiki pages for the Distributed Data Mining lab course are made available week by week.
-
Week 1 - Setting up the Virtual Machine
- Introduction to OpenNebula
- VM Creation Steps
- Logging in the Virtual Machine through SSH
-
Week 2 - Setting up Single Node and Multi Node Clusters
- Introduction to Hadoop
- Single Node Hadoop Installation
- Multi Node Hadoop Installation
-
Week 3 - Exploration of Hadoop & Spark
- Introduction to Spark
- Multi Node Spark Installation
- Examples of Hadoop and Spark
-
Week 4 - Programming Experience on Hadoop and Spark
- Introduction to MLlib
- Web UI - Hadoop, YARN and Spark Cluster
- Examples of Hadoop and Spark
- Troubleshooting
-
Week 5 - Programming Experience on Hadoop and Spark (continued)
- Prime Number Examples
- Spark Examples
- Performance Analysis
- Troubleshooting
-
Week 6 - Extraction of NCBI Database: Part 1
- NCBI API
- Retrieving PubMed Reports
- Troubleshooting
-
Week 7 - Extraction of Elsevier Data: Part 2
- Elsevier API
- Nalaf, LocText, StringText
- Docker
-
Week 8 - Elasticsearch Installation, LocText Installation & Parsing of Full Text Papers
- LocText Installation
- Multi Node Elasticsearch Installation
- Parsing of PubMed Records
-
Week 9 - Storing Parsed Papers in Elasticsearch
- Basic Architecture
- Elasticsearch Storage
- Remarks
-
Week 10 - Text-mine the relationship of Protein & Cell with LocText
- Elasticsearch Records
- Mining with LocText
- Troubleshooting
-
Week 11 - Text-mine the relationship of Protein & Cell with LocText and MapReduce
- Creating Additional Disks for Elasticsearch Records
- Mining with LocText and Nalaf
- Writing a MapReduce Job
- Remarks
-
Week 12 - Text-mine the relationship of Protein & Cell with LocText and MapReduce
- Mining with LocText and Nalaf
- MapReduce Job
- Kibana Visualizations
- Troubleshooting
-
Resources
- https://linoxide.com/cluster/setup-hadoop-multi-node-cluster-ubuntu/ (Multinode Hadoop Setup)
- http://data-flair.training/blogs/apache-spark-installation-on-multi-node-cluster-step-by-step-guide/ (Multinode Spark Setup)
- https://spark.apache.org/docs/1.2.0/mllib-guide.html (MLlib Documentation)
- http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ (MapReduce in Python)
- http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_1_--_Running_WordCount (MapReduce Tutorial)
- https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-getting-started (Docker Installation)
- https://github.com/Rostlab/LocText (LocText)
- https://github.com/Rostlab/nalaf (Nalaf)
- https://github.com/titipata/pubmed_parser (PubMed Parser)
- https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html (Elasticsearch)
- http://hadoop.apache.org/ (Apache Hadoop)
- https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html (HDFS Design)
- https://rubenmiddeljans.files.wordpress.com/2015/08/hadoop-cluster.jpg (Hadoop Cluster Image)
- http://spark.apache.org/faq.html (Spark FAQ)
- https://spark.apache.org/mllib/ (MLlib)
- https://www.digitalocean.com/community/tutorials/how-to-set-up-an-nfs-mount-on-ubuntu-16-04 (NFS Mount on Ubuntu 16.04)
- https://docs.docker.com/engine/installation/linux/ubuntu/ (Docker Installation on Ubuntu)