
# rdbms-to-hdfs

Apache Spark code to extract data from a JDBC relational database to HDFS.

## Setup

The examples use MySQL with some open-data sample datasets on CDH 5.8 (a local install on my laptop, set up following https://github.com/krisgeus/ansible_local_cdh_hadoop).

### MySql

Inspired by Joe Fallon's blog post: http://blog.joefallon.net/2013/10/install-mysql-on-mac-osx-using-homebrew/

```
brew install mysql
mysql.server restart
```

Do NOT run the following steps from that post; they are not needed for this setup:

```
mysql_secure_installation
unset TMPDIR
mysql_install_db --verbose --user=`whoami` --basedir="$(brew --prefix mysql)" --datadir=/usr/local/var/mysql --tmpdir=/tmp
```

Create the source database and user:

```
mysql -u root

create database sourcedb;
create user 'sourcedb'@'localhost' identified by 'sourcedb';
grant all on sourcedb.* to 'sourcedb'@'localhost';
exit
```

Verify that the new user can log in:

```
mysql -u sourcedb -D sourcedb -p
```

### Sample databases

Sample databases are listed at https://www.ntu.edu.sg/home/ehchua/programming/sql/SampleDatabases.html

Download and extract sakila.tar.gz, then load the employees test database:

```
git clone https://github.com/datacharmer/test_db.git
cd test_db
mysql -u root < employees.sql
```

Load the Sakila schema and data, and grant the sourcedb user access to both sample databases:

```
mysql -u root

SOURCE /Users/kgeusebroek/dev/xebia/godatadriven/projects/sourcedbload/sampledb/sakila-db/sakila-schema.sql
SOURCE /Users/kgeusebroek/dev/xebia/godatadriven/projects/sourcedbload/sampledb/sakila-db/sakila-data.sql

grant all on sakila.* to 'sourcedb'@'localhost';
grant all on employees.* to 'sourcedb'@'localhost';
exit
```
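
As a quick sanity check that the databases and grants above are in place, you can open a plain JDBC connection as the sourcedb user. A minimal Scala sketch (the count against Sakila's actor table is just an illustration; depending on your MySQL and Connector/J versions you may need extra JDBC URL parameters such as serverTimezone):

```scala
import java.sql.DriverManager

// Connect as the sourcedb user created above; Connector/J 6.x auto-registers its JDBC driver.
val conn = DriverManager.getConnection(
  "jdbc:mysql://localhost:3306/sakila", "sourcedb", "sourcedb")
try {
  val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM actor")
  while (rs.next()) println(s"actor rows: ${rs.getInt(1)}")
} finally {
  conn.close()
}
```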

## Zeppelin notebook for exploration

The notebooks directory contains a Zeppelin notebook with the experimentation code.

The following Zeppelin setting (in conf/zeppelin-site.xml) is needed to make this work:

```xml
<property>
  <name>zeppelin.notebook.dir</name>
  <value>${git clone dir of rdbms-to-hdfs}/notebooks/</value>
  <description>path or URI for notebook persist</description>
</property>
```

Inside the notebook some specific Spark interpreter settings are mentioned. These mainly consist of adding the dependencies needed to make the JDBC connection.

Since we use MySQL, the dependency is mysql:mysql-connector-java:6.0.3.
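
For reference, the exploration in the notebook boils down to Spark's JDBC data source. A minimal sketch of such a notebook paragraph, assuming the Spark 1.x sqlContext that Zeppelin on CDH 5.8 provides and the sourcedb credentials from the setup above (the table name and options are examples, not the notebook's exact code):

```scala
// Read one table from MySQL over JDBC into a DataFrame.
val employees = sqlContext.read.format("jdbc").options(Map(
  "url"      -> "jdbc:mysql://localhost:3306/employees",
  "dbtable"  -> "employees",
  "user"     -> "sourcedb",
  "password" -> "sourcedb",
  "driver"   -> "com.mysql.cj.jdbc.Driver"  // driver class shipped with mysql-connector-java 6.x
)).load()

employees.printSchema()
employees.show(10)
```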


## Running the steps

### Step 1a: Full load of tables to Parquet files on HDFS

```
spark-submit --class nl.krisgeus.jdbc.JdbcMain target/scala-2.10/rdbms-to-hdfs-assembly-0.1-SNAPSHOT.jar -s 0 -c employees -o /tmp/sourcedb
```

or

```
spark-submit --class nl.krisgeus.jdbc.JdbcMain target/scala-2.10/rdbms-to-hdfs-assembly-0.1-SNAPSHOT.jar --step 0 --config employees --output /tmp/sourcedb
```
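
To verify the result, you can read the Parquet output back in spark-shell. A sketch assuming the job writes one Parquet directory per table under the given output path (e.g. /tmp/sourcedb/employees; the exact layout may differ):

```scala
// sqlContext is provided by spark-shell on Spark 1.x (CDH 5.8).
val employees = sqlContext.read.parquet("/tmp/sourcedb/employees")
employees.printSchema()
println(s"rows loaded: ${employees.count()}")
```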
