
HIVE-Export

Introduction

A combination of Apache Spark and Sqoop that extracts data from Hive tables into a relational database (MySQL in this source code, but any other RDBMS supported by Sqoop can be used), integrated into a pipeline using Luigi.
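At a high level the flow is: Spark runs the Hive query and stages the result on HDFS, then Sqoop exports the staged files into the relational sink. The sketch below only illustrates that flow; it is not this project's code, and the staging path, JDBC URL, and credentials are placeholders.

    # Illustrative sketch of the Spark -> HDFS -> Sqoop flow (placeholder values).
    import subprocess
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-export-sketch")
             .enableHiveSupport()   # gives spark.sql() access to Hive tables
             .getOrCreate())

    # 1. Run the Hive query and stage the result on HDFS as tab-separated text.
    df = spark.sql("SELECT * FROM table1")
    staging_dir = "/tmp/hive_export/table1"  # placeholder HDFS path
    df.write.mode("overwrite").csv(staging_dir, sep="\t")

    # 2. Export the staged files into the relational sink with Sqoop.
    subprocess.check_call([
        "sqoop", "export",
        "--connect", "jdbc:mysql://localhost/sink_db",  # placeholder JDBC URL
        "--username", "user", "--password", "secret",   # placeholder credentials
        "--table", "table1",
        "--export-dir", staging_dir,
        "--input-fields-terminated-by", "\t",
    ])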

Getting Started

  1. List every table, its anchor columns, and the query that will be exported from Hive.

    In the relational database (sink), a table with the same name and columns must be created first, with a primary key or unique key defined on it.

    The list of queries and tables lives in query.tsv, with fields separated by tabs (\t).

    When a row is inserted into a table, the anchor columns are checked to see whether a matching row already exists. A row with matching anchor columns is updated; otherwise a new row is inserted. This behaviour is controlled in luigi.cfg by the variable SQOOP_UPDATE_MODE (either allowinsert or updateonly).

    A table with no anchor columns is marked with a dash (-). Multiple columns are separated by commas (,); see the parsing sketch at the end of this step.

    table_name  anchor_columns  query
    table1      col1,col2,col3  SELECT * FROM table1
    table2      col1,col2       SELECT * FROM table2 WHERE col1 > 1
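    As an illustration of the format, the sketch below parses one query.tsv row and derives the Sqoop update arguments from its anchor columns. It is a minimal sketch, not code from this repository; the function name is hypothetical.

      # Hypothetical helper: parse one query.tsv line into Sqoop export arguments.
      def parse_query_line(line, update_mode="allowinsert"):
          table, anchors, query = line.rstrip("\n").split("\t")
          sqoop_args = []
          if anchors != "-":  # "-" marks a table with no anchor columns
              # Rows matching the anchor (update-key) columns are updated;
              # SQOOP_UPDATE_MODE decides whether non-matching rows are inserted.
              sqoop_args += ["--update-key", anchors, "--update-mode", update_mode]
          return table, query, sqoop_args

      # parse_query_line("table2\tcol1,col2\tSELECT * FROM table2 WHERE col1 > 1")
      # -> ("table2", "SELECT * FROM table2 WHERE col1 > 1",
      #     ["--update-key", "col1,col2", "--update-mode", "allowinsert"])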
  2. Add the following paths to your environment variables

    export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH

    export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.6-src.zip:$PYTHONPATH

    export PYTHONPATH=/vagrant/hive-export/:$PYTHONPATH

    export LUIGI_CONFIG_PATH=/vagrant/hive-export/luigi.cfg

  3. Create a log file for the Luigi central scheduler

    mkdir /tmp/luigid

    touch /tmp/luigid/luigi-server.log

  4. Start the Luigi central scheduler

    luigid --logdir=/tmp/luigid

    The Luigi central scheduler UI can be accessed at http://localhost:8082

  5. Run the following command in a terminal

    luigi --module HiveExport InsertToDatabase --path query.tsv

    Note that the current module must be on PYTHONPATH (see step 2).
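    As an alternative to the CLI, the same task can be triggered from Python. This is an untested sketch assuming InsertToDatabase takes a path parameter, as the command above suggests:

      # Sketch: run the task against the central scheduler started in step 4.
      import luigi
      from HiveExport import InsertToDatabase

      luigi.build(
          [InsertToDatabase(path="query.tsv")],  # mirrors --path query.tsv
          scheduler_host="localhost",
          scheduler_port=8082,
      )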

Development Environment (optional)

If you don't have access to a Hadoop environment integrated with Spark and Hive, you can set up your own development environment using Vagrant (follow this link).

Troubleshooting

  • key not found: _PYSPARK_DRIVER_CALLBACK_HOST

    Edit the following PYTHONPATH entries (located in ~/.bashrc) and replace the py4j version with the one that matches your Apache Spark installation (py4j is bundled in the Apache Spark installation folder):

      export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
      export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.6-src.zip:$PYTHONPATH
      export PYTHONPATH=/vagrant/hive-export/:$PYTHONPATH  
    
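    To check which py4j version is bundled with your installation, a sketch like the following can also locate it and add it to the path at runtime (assuming SPARK_HOME is set):

      # Sketch: locate the bundled py4j zip and prepend it to sys.path.
      import glob, os, sys

      spark_home = os.environ["SPARK_HOME"]
      py4j_zip = glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))[0]
      sys.path[:0] = [os.path.join(spark_home, "python"), py4j_zip]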
