Capstone project Open Data Plus

Extraction techniques used to find content related to Nantes in Web Data Commons.

Table of contents

  • Prerequisites
  • Installation
  • Cluster setup
  • Configuration files
  • Deployment
  • Usage

Clone the repository, then build it with Maven:

git clone https://github.com/Callidon/open-data-plus.git
cd open-data-plus/
mvn package

Cluster setup

  • Install Hadoop on every machine of the cluster (master + slaves)
  • Install Apache Spark on the master
  • Create/edit the following configuration files

Configuration files

All configuration files must be placed in $HADOOP_HOME/etc/hadoop.

Master

  • slaves: This file lists the slave hosts, one per line. It must not contain the IP address of the master!
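For example, a slaves file for a cluster with two slaves contains just their addresses (the IPs below are placeholders), one per line:

```
172.16.134.153
172.16.134.154
```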

Master and slaves

  • core-site.xml: replace $HOSTNAME with the IP address of the current host (e.g. 172.16.134.152), and $HADOOP_DIR with the directory where HDFS will store its files (make sure you have enough disk space!)
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://$HOSTNAME:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>$HADOOP_DIR/tmp/hadoop-${user.name}</value>
  </property>
</configuration>
  • yarn-site.xml: replace $MASTER_HOST with the IP address of the master (e.g. 172.16.134.152)
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>$MASTER_HOST</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
  • mapred-site.xml: replace $MASTER_HOST with the IP address of the master (e.g. 172.16.134.152)
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>$MASTER_HOST:9001</value>
  </property>
</configuration>

Deployment

On the master, run the following scripts to start the cluster:

# start master then slaves
$HADOOP_HOME/sbin/start-all.sh

# start YARN resource manager
$HADOOP_HOME/sbin/start-yarn.sh

To check that the cluster is running correctly, run the jps command on each machine.

On the master, you should see:

  • NameNode
  • ResourceManager
  • SecondaryNameNode
  • Jps

On any slave, you should see:

  • DataNode
  • Jps

You can access the Hadoop web control panel at http://localhost:50070 (on the master).

On the master, run the following scripts to shut down the cluster:

# stop master then slaves
$HADOOP_HOME/sbin/stop-all.sh

# stop YARN resource manager
$HADOOP_HOME/sbin/stop-yarn.sh

Usage

Once the cluster has been deployed, you must upload all the files you want to process to the Hadoop file system (HDFS).
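The upload can be done with the standard HDFS shell. A sketch, assuming the data sits locally in ./data and a target HDFS directory of your choosing (both paths below are placeholders):

```shell
# Create the target directory on HDFS (path is an assumption).
hdfs dfs -mkdir -p /user/$USER/data

# Copy the local data files to HDFS.
hdfs dfs -put ./data/* /user/$USER/data/

# Verify the upload.
hdfs dfs -ls /user/$USER/data
```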

Then, you can launch the crawler with the following command:

spark-submit --class com.alma.opendata.NQuadsSearch \
  --master <spark-master-url> \
  --deploy-mode cluster \
  target/open-data-crawler-1.0-SNAPSHOT-jar-with-dependencies.jar \
  path/to/data/files

You can follow the progress of the task at http://localhost:8080 (on the master). This page also gives you the Spark master URL.

Useful spark-submit options:

  • --executor-memory <memory> sets how much memory each executor (slave) uses (by default, 1G).
  • --num-executors <number> sets how many executors will run the application. We recommend setting this to the number of slaves in the cluster.
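Putting the options together, a submission on a hypothetical three-slave cluster could look like the following (the master URL, memory size, and data path are placeholders):

```shell
spark-submit --class com.alma.opendata.NQuadsSearch \
  --master spark://172.16.134.152:7077 \
  --deploy-mode cluster \
  --executor-memory 2G \
  --num-executors 3 \
  target/open-data-crawler-1.0-SNAPSHOT-jar-with-dependencies.jar \
  path/to/data/files
```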

License

MIT License

Authors

Adel Benabadji, Hiba Benyahia, Asma Boussalem, Théo Couraud, Pierre Gaultier, Lenny Lucas & Thomas Minier
