
Data_Engineering_module

Data Pipeline, ETL and other implementations

Data Processing

  1. Online Transaction Processing (OLTP)
  2. Online Analytical Processing (OLAP)

Data Pipeline

  • A data pipeline describes a workflow consisting of one or more tasks that ingest, move, and transform raw data from one or more sources to a destination.
  • Data pipelines generally fall into two categories: batch processing (the most common) and real-time processing pipelines.
  • 4 components
    1. Data source ----> data ingestion
    2. Business logic (cleaning, filtering, applying logic) --> Extract, Transform
    3. Destination (data warehouse, data lake) --> Load
    4. Scheduler/Orchestration (for batch)
  • Use cases
    • If a business sells software as a service, a data engineer might create a data pipeline that runs on a daily basis, takes some of the data generated by the software application, combines it with data the marketing department has, and sends the result to a dashboard. The dashboard could then be used to better understand how customers are using the software. (A minimal batch ETL sketch follows this list.)
    • Change Data Capture (CDC) Pipeline - for transactional data
    • ELT Pipeline
      • Transforming before loading into the destination (classic ETL) loses the flexibility to re-compute from the raw data.
      • ELT instead loads the raw data and transforms it in the destination itself.
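A minimal sketch of the daily batch pipeline described above, assuming a hypothetical orders.csv export from the application and a local SQLite file as the destination; file, column, and table names are illustrative only, and the join with marketing data is omitted for brevity.

```python
# Minimal batch ETL sketch: extract from a CSV export, apply business logic,
# load into a destination table. Paths, columns, and table names are hypothetical.
import csv
import sqlite3
from datetime import date

def extract(path: str) -> list[dict]:
    # Extract: read raw rows from the application's daily CSV export.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows: list[dict]) -> list[tuple]:
    # Transform: clean, filter, and apply simple business logic.
    cleaned = []
    for row in rows:
        if not row.get("customer_id"):            # drop incomplete records
            continue
        amount = round(float(row["amount"]), 2)   # normalize the amount
        cleaned.append((row["customer_id"], amount, date.today().isoformat()))
    return cleaned

def load(records: list[tuple], db_path: str = "warehouse.db") -> None:
    # Load: write the transformed records into the destination table.
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS daily_sales "
            "(customer_id TEXT, amount REAL, load_date TEXT)"
        )
        conn.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)", records)

if __name__ == "__main__":
    # A scheduler/orchestrator (e.g. cron or Airflow) would trigger this daily.
    load(transform(extract("orders.csv")))
```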

Key Components:

  • Databases:
    • Desired Properties - ACID (a minimal transaction sketch appears after this section)

      • A - Atomicity
      • C - Consistency
      • I - Isolation
      • D - Durability
    • Data modeling: mapping out how the multiple parts of an information system are connected, typically with an entity-relationship diagram; 2 main categories:

      1. Relational modeling - using tables, columns, and rows to represent data. A key component of relational modeling is normalization (reduces data redundancy). Great for transactional/operational workloads where data is constantly inserted, updated, or deleted. Analytical queries become slow at larger data scales.
        • Relational Databases: Each table has one or more columns whose values uniquely identify a row (primary key); a column in another table that references that key (foreign key) forms the relationship between the two tables. (See the schema sketch after this section.)
          • Easier to do complex queries
          • Supports atomic transactions
          • Harder to scale (vertical scaling)
        • Use cases: Customer Relationship Management, Enterprise Resource Planning, SaaS applications, eCommerce and web, payment and booking systems
        • Tools: RDBMSs use SQL

      1.1 Dimensional Modeling: organizes data into two types of datasets: fact tables and dimension tables. Facts are usually comprised of numerical values that can be aggregated, while dimensions hold descriptive attributes of entities/objects.
        • A key tradeoff the dimensional model makes is that it denormalizes data (increases data redundancy) in order to speed up queries.
        • Within dimensional modeling there are a few different schema design patterns: star schema (recommended in most cases), snowflake schema, and galaxy schema.

      1.2 Data Vault Modeling: the most flexible modeling technique, adapting to changes and new datasets easily while storing all historical data by default.
        • Hubs: tables that contain a list of unique business keys (natural keys), surrogate keys, and metadata describing the data source for each hub item.
        • Links: tables that associate hubs and satellites via the business key.
        • Satellites: tables that hold the descriptive data about the entities being modeled, as well as start and end date columns to track historical changes.

      2. Non-relational modeling - a storage model that is optimized for the specific type of data being stored. For example, data may be stored as a key/value pair, as JSON documents, or as a graph consisting of nodes and edges. (A small key/value and document sketch follows this section.)
        • Better at simpler queries
        • Easier to scale (horizontal scaling)
        • Flexible schema makes development faster.
        • Types of Databases
          • Key-Value Database
          • Document Database
          • Graph Database
          • In-Memory Database
          • Search-Engine Database
          • Timeseries Database
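To make the ACID properties above concrete, here is a minimal atomicity sketch using Python's built-in sqlite3 module; the accounts table and the simulated failure are hypothetical.

```python
# Atomicity sketch: both updates in a transfer commit together or not at all.
# The "accounts" table and the transfer amount are hypothetical examples.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 80 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 80 WHERE id = 2")
        raise RuntimeError("simulated failure before commit")
except RuntimeError:
    pass

# Balances are unchanged because the whole transaction was rolled back.
print(conn.execute("SELECT id, balance FROM accounts").fetchall())
# [(1, 100.0), (2, 50.0)]
```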
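A small sketch contrasting a normalized relational schema (primary/foreign keys) with a denormalized star-schema layout, run against an in-memory SQLite database; all table and column names are hypothetical.

```python
# Schema sketches executed against an in-memory SQLite database.
# Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized relational model: customers and orders linked by a foreign key.
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL
);

-- Dimensional (star schema) model: one fact table surrounded by dimensions.
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    name         TEXT,
    segment      TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    day      TEXT,
    month    TEXT,
    year     INTEGER
);
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    amount       REAL,     -- facts: numeric, aggregatable measures
    quantity     INTEGER
);
""")
```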
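And a brief sketch of the key/value and document styles using plain Python structures (no particular NoSQL database assumed), to show how a flexible schema lets records in the same collection carry different fields.

```python
# Key/value style: an opaque value looked up by key (here, an in-process dict).
session_store = {
    "session:abc123": {"user_id": 42, "expires": "2024-01-01T00:00:00Z"},
}

# Document style: JSON-like records in one collection may carry different
# fields, so the schema can evolve without migrations. Values are made up.
products = [
    {"_id": 1, "name": "keyboard", "price": 49.9, "tags": ["usb", "mechanical"]},
    {"_id": 2, "name": "ebook", "price": 9.9, "download_url": "https://example.com/e"},
]

# Queries are typically simple lookups or filters rather than multi-table joins.
cheap = [p for p in products if p["price"] < 20]
print(session_store["session:abc123"]["user_id"], [p["name"] for p in cheap])
```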

Data Architecture:

  • Data Warehouse: A data warehouse is a relational database in which the data is stored in a schema that is optimized for data analytics rather than transactional workloads.

    • The fact and dimension table schema is called a star schema, though it's often extended into a snowflake schema by adding additional tables related to the dimension tables.
    • Data warehouses are made for complex queries on large datasets.
  • Data Lake: A data lake is a file store, usually on a distributed file system for high-performance data access, used to store all your structured and unstructured data at any scale.

    • Data lakes often apply a schema-on-read approach to define tabular schemas on semi-structured data files at the point where the data is read for analysis, without applying constraints when it's stored. (A schema-on-read sketch follows this section.)
    • Technologies like Spark or Hadoop are often used to process queries on the stored files and return data.
  • Lake Database: The raw data is stored as files in a data lake, and a relational storage layer abstracts the underlying files and exposes them as tables, which can be queried using SQL.

    • Databricks combines data warehouses & data lakes into a lakehouse architecture.
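A minimal schema-on-read sketch, assuming PySpark is available and that the lake holds hypothetical JSON event files under the path shown; the tabular schema is only derived when the files are read, not enforced when they are written.

```python
# Schema-on-read sketch with PySpark; the lake path and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The files were written to the lake without any schema enforcement;
# a tabular schema is inferred only now, at read time.
events = spark.read.json("s3a://my-lake/raw/events/2024/*.json")

events.printSchema()  # schema derived from the data itself

daily = (
    events.where(events.event_type == "purchase")
          .groupBy("event_date")
          .count()
)
daily.show()
```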
