ETLexample

These are examples of my ETL processes, built with Spark on Scala.

This ETL process handles data on tax transactions. We take the initial layer and perform operations on it such as filtering, adding new columns, and joins, followed by aggregations. After that, the data is written to HDFS. The result is a tax report for the bank's internal services.
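A minimal sketch of the pipeline's overall shape. The column names, paths, and the aggregation below are illustrative assumptions, not the repo's actual code; the real steps live in the classes described in the following sections.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Hypothetical shape of the pipeline; column names, paths, and the
// aggregation are illustrative, not the repo's actual ones.
object PipelineSketch {
  def run(spark: SparkSession): Unit = {
    val transactions = spark.read.parquet("res/in/transactions") // initial layer
    val accounts     = spark.read.parquet("res/in/accounts")

    val report = transactions
      .filter(col("amount") > 0)                              // filtering
      .withColumn("mask", substring(col("dt_account"), 1, 5)) // new column
      .join(accounts, Seq("account_id"))                      // join
      .groupBy("mask")
      .agg(sum("amount").as("total_amount"))                  // aggregation

    report.write.mode("overwrite").parquet("hdfs:///reports/tax") // write to HDFS
  }
}
```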

Inside you can find the following objects and classes:

1.0 src/main/scala

1.1 Application

The main app. It accepts Spark arguments and additional arguments such as startDate, endDate, and asnuCode; this information is needed to perform the ETL operation correctly. The Concurrentthought.cla library is used to parse these arguments. asnuCode is an internal technical code of the bank: when asnuCode is 6000 it is called an "EKP" number, and when it is 9000 it is called an "EKS" number.
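A minimal sketch of what the argument handling amounts to. The real app parses these via Concurrentthought.cla; the hand-rolled parser and the EtlArgs case class below are illustrative stand-ins, not the library's API or the repo's actual code.

```scala
// Illustrative stand-in for the argument handling; the real app uses
// Concurrentthought.cla. EtlArgs and the parser below are assumptions.
final case class EtlArgs(startDate: String, endDate: String, asnuCode: Int)

object ArgsSketch {
  // Expects e.g.: --startDate 2020-01-01 --endDate 2020-12-31 --asnuCode 6000
  def parse(args: Array[String]): EtlArgs = {
    val pairs = args
      .sliding(2, 2)
      .collect { case Array(key, value) => key.stripPrefix("--") -> value }
      .toMap
    EtlArgs(pairs("startDate"), pairs("endDate"), pairs("asnuCode").toInt)
  }
}
```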

1.2 ApplicationArguments

This class is needed by Concurrentthought.cla to define the arguments for the main app.

1.3 Const

Constants that are used in this ETL process.
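A couple of illustrative constants. Only the 6000/9000 codes and the 5-digit mask length come from this README; the object layout and names are assumptions.

```scala
// Hypothetical layout; only the 6000/9000 codes and the 5-digit mask
// length come from the README. The repo's actual Const contents are unknown.
object ConstSketch {
  val EkpAsnuCode = 6000 // "EKP" process
  val EksAsnuCode = 9000 // "EKS" process
  val MaskLength  = 5    // a mask is the first 5 digits of an account number
}
```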

1.4 DataSources

Here you can find the methods that read the initial DataFrames.
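A sketch of what a DataSources-style reader could look like; the method names, paths, and formats are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical reader methods; actual formats and paths live in DataSources.
object DataSourcesSketch {
  def readTransactions(spark: SparkSession, path: String): DataFrame =
    spark.read.parquet(path)

  def readReference(spark: SparkSession, path: String): DataFrame =
    spark.read.option("header", "true").csv(path)
}
```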

1.5 EKP and EKS masks

Each tax transaction has a dt_account and a kt_account number. During the ETL process we need to filter these transactions according to a mask, where a mask is the first 5 digits of the dt_account or kt_account number.
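A sketch of the mask filter as just described: keep a transaction if the first 5 digits of its dt_account or kt_account match one of the allowed masks. The column names follow the README; the mask values themselves are not given there.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, substring}

object MaskFilterSketch {
  // Keep rows whose dt_account or kt_account starts with an allowed mask.
  // substring is 1-based in Spark, so (1, 5) takes the first five digits.
  def filterByMasks(df: DataFrame, masks: Seq[String]): DataFrame =
    df.filter(
      substring(col("dt_account"), 1, 5).isin(masks: _*) ||
        substring(col("kt_account"), 1, 5).isin(masks: _*)
    )
}
```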

1.6 Sheet30AggregationTransform

Here you can find the transform method which performs all the aggregation functions. Depending on asnuCode, it chooses the right process.
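A sketch of the dispatch on asnuCode (6000 = EKP, 9000 = EKS, per the README). The two branch bodies are stubbed, since the actual aggregations live in Sheet30AggregationTransform.

```scala
import org.apache.spark.sql.DataFrame

object TransformSketch {
  // Route to the right process based on asnuCode (6000 = EKP, 9000 = EKS).
  def transform(df: DataFrame, asnuCode: Int): DataFrame = asnuCode match {
    case 6000  => ekpProcess(df)
    case 9000  => eksProcess(df)
    case other => throw new IllegalArgumentException(s"Unknown asnuCode: $other")
  }

  private def ekpProcess(df: DataFrame): DataFrame = df // placeholder for the real EKP aggregations
  private def eksProcess(df: DataFrame): DataFrame = df // placeholder for the real EKS aggregations
}
```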

2.0 src/main/test

2.1 AggregationEKP30Test

Here are the tests for the EKP process. The first test is called the main app test; it simply starts the main app with arguments. The following tests check that all filters and aggregations work correctly: we know for sure that transactions with certain masks should pass all filters and aggregations and stay in the final DataFrame. We create a case class that represents one tax transaction, run the transform operation on it, and check all combinations of masks. If we get an empty DataFrame after the ETL process, the test fails (see the sketch after section 2.2 below).

2.2 DTO case class for tests
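A sketch of the test pattern from 2.1 together with a DTO case class: build one synthetic transaction, run it through the transform, and fail if the resulting DataFrame is empty. The TaxTransaction fields and account values are assumptions, and the sketch reuses the TransformSketch above in place of Sheet30AggregationTransform.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical DTO; the real case class's fields are not listed in this README.
final case class TaxTransaction(dt_account: String, kt_account: String, amount: BigDecimal)

object AggregationTestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ekp-test").getOrCreate()
    import spark.implicits._

    // One synthetic transaction; real tests would use the actual EKP masks.
    val input  = Seq(TaxTransaction("1234500000", "5432100000", BigDecimal(100))).toDF()
    val result = TransformSketch.transform(input, asnuCode = 6000)

    // A transaction with valid masks must survive all filters and aggregations.
    assert(result.count() > 0, "transaction with valid masks was filtered out")
    spark.stop()
  }
}
```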

3.0 res

3.1 in

The initial DataFrames.

3.2 out

The folder where the final DataFrame will be written.
