ETL Project

A simple ETL project interlinking the various components of the Hadoop ecosystem: it extracts data from an RDBMS (MySQL), transforms the ingested data, and loads it into HDFS, then reads it back through Hive, Pig, Spark, and MongoDB.


Creating a simple dataset in MySQL

  • create the table in MySQL (phone numbers exceed the INT range, so phone is a bigint)
create table studdetails(id int, firstname varchar(20), lastname varchar(30), age int,
phone bigint, location varchar(30));
  • insert some values
insert into studdetails values(001,"Rajiv","Reddy",21,9848022337,"Hyderabad");
insert into studdetails values(002,"siddarth","Battacharya",22,9848022338,"Kolkata");
insert into studdetails values(003,"Rajesh","Khanna",22,9848022339,"Delhi");
insert into studdetails values(004,"Preethi","Agarwal",21,9848022330,"Pune");
insert into studdetails values(005,"Trupthi","Mohanthy",23,9848022336,"Bhuwaneshwar");
insert into studdetails values(006,"Archana","Mishra",23,9848022335,"Chennai");
insert into studdetails values(007,"Komal","Nayak",24,9848022334,"trivendram");
insert into studdetails values(008,"Bharathi","Nambiayar",24,9848022333,"Chennai");
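
A quick sanity check in the MySQL shell confirms all eight rows landed before handing the table over to Sqoop:

select * from studdetails;
-- should return 8 rows, one per insert statement above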

Use Sqoop to ingest the data and import it into Hive

sqoop import --connect jdbc:mysql://localhost/appu1 --driver com.mysql.jdbc.Driver --username user \
  --password 12345 --table studdetails --hive-import -m 1 --hive-table appu.studdetails1
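
Once the import finishes, the data should be queryable straight from Hive (a minimal check, reusing the appu.studdetails1 table name from the command above):

hive -e 'select * from appu.studdetails1 limit 5;'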

Create a Hive partitioned table

create table studentpart(id int, firstname string, lastname string, phone bigint, loc string)
PARTITIONED BY (age int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

-- enable dynamic partitioning, since we are partitioning based on age
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE studentpart PARTITION(age)
SELECT id, firstname, lastname, phone, loc, age FROM studdetails1;

-- check your partitions
SHOW PARTITIONS studentpart;

-- view data based on partitions
SELECT * FROM studentpart WHERE age = 21;
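
Each age value now gets its own subdirectory under the table's warehouse path. You can confirm the layout from HDFS (a quick sketch; the path below assumes the default Hive warehouse location, which may differ on your cluster):

hdfs dfs -ls /user/hive/warehouse/appu.db/studentpart
# expect one subdirectory per partition value, e.g. age=21, age=22, ...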

Loading the data into Pig and converting it to JSON

pig -useHCatalog

A = LOAD 'appu.studentpart' USING org.apache.hive.hcatalog.pig.HCatLoader(); 

B = filter A by age == 24;

store B into 'studentjson' USING JsonStorage();
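
The store statement writes the filtered records as JSON part files under a studentjson directory in HDFS. A quick way to eyeball the output (the exact part-file names depend on how the job ran):

hdfs dfs -cat studentjson/part-*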

Loading the data into Spark

import org.apache.spark.sql.SparkSession

val session = SparkSession.builder().appName("test").master("local").getOrCreate()

val df = session.read.json("part-m-000000")

df.show() 
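
With the DataFrame loaded, you can inspect the schema Spark inferred from the JSON and project a few columns (a minimal sketch using standard Spark SQL calls; the column names come from the Pig output above):

df.printSchema()

df.select("firstname", "lastname", "age").show()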

Import the same data into MongoDB

//create a database and a collection
use student
db.createCollection('studentdetails')
//now exit the mongo shell
//mongoimport reads from the local filesystem, so pull the part file out of HDFS first
hdfs dfs -get studentjson/part-m-000000
mongoimport -d student -c studentdetails part-m-000000
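
To verify the import, reopen the mongo shell and query the collection:

use student
db.studentdetails.find().pretty()
//count should match the number of JSON records imported
db.studentdetails.count()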
