Data Engineering Nanodegree - Part 1 - Data Modeling

Data Engineering

Data Modeling

Module 1

Introduction To Data Modeling

Q: What is a database?Answer:
  • A database is a structured repository or organised collection of data 
  • It is stored and retrieved electronically for use in applications

Remarks: Data can be stored, updated, or deleted from a database.

Q: What is a database management system (DBMS)?Answer: It is the software that interacts with end users, applications, and the database itself to capture and analyze the data.

Remarks:

  1. Introduction to DBMS
  2. DBMS as per Wikipedia

Q: Why can't everything be stored in a giant Excel spreadsheet?Answer: There are limitations to the amount of data that can be stored in an Excel sheet.
So, a database helps organize the elements into tables - rows and columns, etc.
Also, reading and writing operations at a large scale are not practical with an Excel sheet, so it's better to use a database to handle most business functions.

Q: Does data modeling happen before you create a database, or is it an iterative process?Answer:
  • It's definitely an iterative process.
  • Data engineers continually reorganize, restructure, and optimize data models to fit the needs of the organization.

Q: What is a data model?Answer: It is an abstraction that organizes elements of data and how they will relate to each other.

Remarks: See Wikipedia for more details.

Q: What is a schema?Answer: The term "schema" refers to the organization of data as a blueprint of how the database is constructed (divided into database tables in the case of relational databases). 

Q: What are advantages of using a relational database?Answer:
  • Writing SQL queries: SQL is the most common database query language.
  • Modeling the data, not the queries
  • Ability to do JOINS
  • Simplicity: If you have a smaller data volume (and not big data), you can use a relational database for its simplicity.
  • ACID Transactions: A set of properties of database transactions intended to guarantee validity even in the event of errors or power failures, thereby maintaining data integrity.

Q: Describe atomicity (ACID)Answer: "The whole transaction is processed or nothing is processed."

Example:

  • A commonly cited example of an atomic transaction is a money transfer between two bank accounts.
  • The transaction of transferring money from one account to the other is made up of two operations.
  • First, you withdraw money from one account, and second, you deposit the withdrawn money into the second account.
  • An atomic transaction, i.e., one where either all operations occur or nothing occurs, keeps the database in a consistent state.
  • This ensures that if either of those two operations (withdrawing money from the 1st account or depositing it into the 2nd account) fails, the money is neither lost nor created.
  • See Wikipedia for a detailed description of this example; a minimal SQL sketch follows this list.
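A minimal SQL sketch of the bank-transfer example above, assuming a hypothetical accounts table with account_id and balance columns (not from the course material); either both UPDATE statements take effect or neither does:

CREATE TABLE IF NOT EXISTS accounts (
    account_id int PRIMARY KEY,
    balance numeric NOT NULL
);

BEGIN;
-- Withdraw from the first account
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
-- Deposit into the second account
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
COMMIT;
-- If anything fails before COMMIT, a ROLLBACK leaves both balances untouched.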

Q: Describe consistency (ACID).Answer: Only transactions that abide by constraints and rules are written into the database; otherwise, the database keeps the previous state.

Remarks: The data should be correct across all rows and tables. Check out additional information about consistency on Wikipedia.

Q: Describe isolation (ACID).Answer: Transactions are processed independently and securely; order does not matter.

Remarks:

  • A low level of isolation enables many users to access the data simultaneously, however this also increases the possibilities of concurrency effects (e.g., dirty reads or lost updates).
  • On the other hand, a high level of isolation reduces these chances of concurrency effects, but also uses more system resources and increases the chance of transactions blocking each other. Source: Wikipedia

Q: Describe durability (ACID).Answer: Completed transactions are saved to the database even in cases of system failure.

Example: A commonly cited example includes tracking flight seat bookings. So once the flight booking records a confirmed seat booking, the seat remains booked even if a system failure occurs. Source: Wikipedia.

Q: When shouldn't you use relational databases?Answer:
  • Have large amounts of data: Relational Databases are not distributed databases and because of this they can only scale vertically by adding more storage in the machine itself. You are limited by how much you can scale and how much data you can store on one machine. You cannot add more machines like you can in NoSQL databases.
  • Need to be able to store different data type formats: Relational databases are not designed to handle unstructured data.
  • Need high throughput -- fast reads: While ACID transactions bring benefits, they also slow down the process of reading and writing data. If you need very fast reads and writes, using a relational database may not suit your needs.
  • Need a flexible schema: Flexible schema can allow for columns to be added that do not have to be used by every row, saving disk space.
  • Need high availability: The fact that relational databases are not distributed (and even when they are, they have a coordinator/worker architecture), they have a single point of failure. When that database goes down, a fail-over to a backup system occurs and takes time.
  • Need horizontal scalability: Horizontal scalability is the ability to add more machines or nodes to a system to increase performance and space for data.

Q: What is PostgreSQL?Answer: An open-source object-relational database system that uses and builds on the SQL language.

Q: List five common types of NoSQL Databases.Answer:
  1. Apache Cassandra (partition row store)
  2. MongoDB (document store)
  3. DynamoDB (key-value store)
  4. Apache HBase (wide column store)
  5. Neo4J (graph database)

Q: Which concept in Apache Cassandra is similar to a schema in PostgreSQL?Answer: The keyspace; it is a collection of tables.
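A minimal CQL sketch of creating a keyspace, assuming a hypothetical keyspace name and simple replication settings (illustrative only, not from the course):

CREATE KEYSPACE IF NOT EXISTS music_library
    WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE music_library;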

Q: What are good use cases for NoSQL (and more specifically Apache Cassandra)?Answer:
  1. Transaction logging (retail, health care)
  2. Internet of Things (IoT)
  3. Time series data
  4. Any workload that is heavy on writes to the database (since Apache Cassandra is optimized for writes). 

Example: Uber uses Apache Cassandra for their entire backend. Netflix uses Apache Cassandra to serve all their videos to customers.

Q: When to use a NoSQL Database?Answer:
  • Store different data type formats: NoSQL was also created to handle different data configurations: structured, semi-structured, and unstructured data. JSON, XML documents can all be handled easily with NoSQL.
  • Large amounts of data: Relational databases are not distributed databases and because of this they can only scale vertically by adding more storage in the machine itself. NoSQL databases were created to be able to be horizontally scalable. The more servers/systems you add to the database the more data that can be hosted with high availability and low latency (fast reads and writes).
  • Need horizontal scalability: Horizontal scalability is the ability to add more machines or nodes to a system to increase performance and space for data
  • Need high throughput: While ACID transactions bring benefits they also slow down the process of reading and writing data. If you need very fast reads and writes using a relational database may not suit your needs.
  • Need a flexible schema: Flexible schema can allow for columns to be added that do not have to be used by every row, saving disk space.
  • Need high availability: Relational databases have a single point of failure. When that database goes down, a failover to a backup system must happen and takes time.
  • Users are distributed

Q: When NOT to use a NoSQL Database?Answer:
  • When you have a small dataset: NoSQL databases were made for big datasets, not small ones; while they work for small datasets, they weren't designed for that.
  • When you need ACID Transactions: If you need a consistent database with ACID transactions, then most NoSQL databases will not be able to serve this need. NoSQL databases are eventually consistent and do not provide ACID transactions. However, there are exceptions: some non-relational databases, like MongoDB, can support ACID transactions.
  • When you need the ability to do JOINS across tables: NoSQL does not support JOINS, because they would result in full table scans.
  • If you want to be able to do aggregations and analytics
  • If you have changing business requirements: Ad-hoc queries are possible but difficult, as the data model was designed to fit particular queries.
  • If your queries are not known in advance and you need flexibility: You need your queries in advance. If they are not available, or you need flexibility in how you query your data, you might need to stick with a relational database.

Remarks: Caveats to NoSQL and ACID Transactions

  • There are some NoSQL databases that offer some form of ACID transaction. 
  • As of v4.0, MongoDB added multi-document ACID transactions within a single replica set. With their later version, v4.2, they have added multi-document ACID transactions in a sharded/partitioned deployment.
  • Check out the documentation from MongoDB on multi-document ACID transactions

Module 2

Relational Data Models

Q: What are advantages of relational databases?Answer:
  • Standardization of data model: Once your data is transformed into rows and columns, it is standardized and you can query it with SQL.
  • Flexibility in adding and altering tables: Relational databases give you the flexibility to add tables, alter tables, and add and remove data.
  • Data Integrity: Data integrity is the backbone of using a relational database.
  • Structured Query Language (SQL): A standard, predefined language can be used to access the data.
  • Simplicity: Data is systematically stored and modeled in tabular format.
  • Intuitive Organization: The spreadsheet-like format makes data modeling in relational databases intuitive.

Q: Describe the difference between OLAP and OLTP.Answer:
Online Analytical Processing (OLAP):
Databases optimized for these workloads allow for complex analytical and ad hoc queries, including aggregations. These types of databases are optimized for reads.

Online Transactional Processing (OLTP):
Databases optimized for these workloads allow for less complex queries in large volume. The types of queries for these databases are read, insert, update, and delete.

Example:

  • The key to remember the difference between OLAP and OLTP is analytics (A) vs transactions (T). 
  • If you want to get the price of a shoe then you are using OLTP (this has very little or no aggregations). 
  • If you want to know the total stock of shoes a particular store sold, then this requires using OLAP (since this will require aggregations).

Remarks: This Stackoverflow post describes it well.
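Two hedged SQL sketches contrasting the shoe examples above, assuming hypothetical shoes and sales tables:

-- OLTP: a point lookup with little or no aggregation
SELECT price FROM shoes WHERE shoe_id = 42;

-- OLAP: an analytical query that aggregates across many rows
SELECT store_id, SUM(quantity) AS total_shoes_sold
FROM sales
GROUP BY store_id;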

Q: What are the two most important concepts for structuring your database?Answer:
  1. Normalization
  2. Denormalization

Q: What is normalization?Answer: The process of structuring a relational database in accordance with a series of normal forms in order to reduce data redundancy and increase data integrity.

Q: What are the objectives of normal form?Answer:
  1. To free the database from unwanted insertions, updates, & deletion dependencies
  2. To reduce the need for refactoring the database as new types of data are introduced
  3. To make the relational model more informative to users
  4. To make the database neutral to the query statistics

Remarks: See this Wikipedia page to learn more.

Q: What are the three steps of normalization?Answer:
  1. First Normal Form (1NF)
  2. Second Normal Form (2NF)
  3. Third Normal Form (3NF)

Q: How to reach First Normal Form (1NF)?Answer:
  • Atomic values: each cell contains unique and single values
  • Be able to add data without altering tables
  • Separate different relations into different tables
  • Keep relationships between tables together with foreign keys

Q: How to reach second normal form (2NF)?Answer:
  • First, reach 1NF
  • All non-key columns in the table must depend on the whole primary key (no partial dependencies)

Q: How to reach third normal form (3NF)?Answer:
  • Must be in 2nd Normal Form
  • No transitive dependencies
  • Remember, with transitive dependencies, to get from A -> C you want to avoid going through B.

Q: What is denormalization?Answer: It is the process of trying to improve the read performance (select) of a database at the expense of losing some write performance (insert, update, delete) by adding redundant copies of data.

Remarks:

  • JOINS on the database allow for outstanding flexibility but are extremely slow.
  • If you are dealing with heavy reads on your database, you may want to think about denormalizing your tables.
  • You get your data into normalized form first, and then you proceed with denormalization; so denormalization comes after normalization (a small sketch follows this list).
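A hedged sketch of the trade-off, with hypothetical table and column names: instead of joining separate song and album tables on every read, a denormalized table keeps a redundant copy of the album name on each song row, speeding up reads at the cost of slower, more careful writes:

-- Denormalized, read-optimized table: album_name is duplicated on every song row
CREATE TABLE IF NOT EXISTS song_with_album (
    song_id int PRIMARY KEY,
    song_name text,
    album_name text  -- redundant copy; avoids a JOIN on reads, must be kept in sync on writes
);

-- Reads need no JOIN
SELECT song_name, album_name FROM song_with_album;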

Q: What does the following table look like in 3NF?

Answer:
  1. No row contains a list of items. For example, the list of songs has been replaced with each song having its own row in the Song table.
  2. Transitive dependencies have been removed. For example, albumID is the PRIMARY KEY for the album year in the Album table. Similarly, each of the other tables has a unique primary key that can identify the other values in the table (e.g., song id and song name within the Song table). A DDL sketch of these tables follows below.

Song_Table
Album_Table
Artist_Table
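A minimal DDL sketch of those three tables, with hypothetical column names (the course's exact columns may differ):

CREATE TABLE IF NOT EXISTS artist_table (
    artist_id int PRIMARY KEY,
    artist_name text
);

CREATE TABLE IF NOT EXISTS album_table (
    album_id int PRIMARY KEY,
    album_name text,
    album_year int,
    artist_id int REFERENCES artist_table (artist_id)
);

CREATE TABLE IF NOT EXISTS song_table (
    song_id int PRIMARY KEY,
    song_name text,
    album_id int REFERENCES album_table (album_id)
);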

Q: What are the two most popular data mart schemas for data warehouses (because of their simplicity) ?Answer:
  1. Star Schema
  2. Snowflake Schema

Q: What is a dimension table?Answer: A dimension is a structure that categorizes facts and measures in order to enable users to answer business questions.

Example: Commonly used dimensions are people, products, place and time.

Remarks: People and time are sometimes not modeled as dimensions.

Q: What is a fact table?Answer: In data warehousing, a fact table consists of the measurements, metrics or facts of a business process.

Example: A star schema; the central table is the fact table.

By Jesuja - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=3886973

Remarks: It is located at the center of a star schema or a snowflake schema surrounded by dimension tables. 

Q: What is a star schema?Answer:
  • It is the simplest style of data mart schema 
  • The star schema consists of one or more fact tables referencing any number of dimension tables. 

Example:

By SqlPac at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=76901169

Remarks:

  • It is the approach most widely used to develop data warehouses and dimensional data marts.
  • The star schema is an important special case of the snowflake schema, and is more effective for handling simpler queries.

Q: What are benefits of star schemas?Answer:
Star schemas are denormalized meaning the typical rules of normalization applied to transactional relational databases are relaxed during star-schema design and implementation. The benefits of star-schema denormalization are:
  1. Simpler queries – star-schema join-logic is generally simpler than the join logic required to retrieve data from a highly normalized transactional schema.
  2. Simplified business reporting logic – when compared to highly normalized schemas, the star schema simplifies common business reporting logic, such as period-over-period and as-of reporting.
  3. Query performance gains – star schemas can provide performance enhancements for read-only reporting applications when compared to highly normalized schemas.
  4. Fast aggregations – the simpler queries against a star schema can result in improved performance for aggregation operations (see the query sketch after this list).
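A hedged sketch of the kind of simple join-and-aggregate query a star schema enables, assuming a hypothetical sales_fact table surrounded by date_dim and store_dim dimension tables:

SELECT d.year,
       s.store_name,
       SUM(f.sales_amount) AS total_sales
FROM sales_fact f
JOIN date_dim d ON f.date_id = d.date_id
JOIN store_dim s ON f.store_id = s.store_id
GROUP BY d.year, s.store_name;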

Q: What are drawbacks of star schemas?Answer:
  1. The main disadvantage of the star schema is that it's not as flexible in terms of analytical needs as a normalized data model. Normalized models allow any kind of analytical query to be executed, so long as it follows the business logic defined in the model. Star schemas tend to be more purpose-built toward a particular view of the data, thus not really allowing more complex analytics.
  2. Star schemas don't easily support many-to-many relationships between business entities. Typically these relationships are simplified in a star schema in order to conform to the simple dimensional model.
  3. Data integrity is not well-enforced due to its denormalized state. One-off inserts and updates can result in data anomalies, which normalized schemas are designed to avoid. 

Q: What is the snowflake schema?Answer:
  • It is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake shape. 
  • The snowflake schema is represented by centralized fact tables which are connected to multiple dimensions.

Example: The snowflake schema is a variation of the star schema, featuring normalization of dimension tables.


By SqlPac - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=4285113

Remarks: The snowflake schema is similar to the star schema. However, in the snowflake schema, dimensions are normalized into multiple related tables, whereas the star schema's dimensions are denormalized with each dimension represented by a single table.

Q: Describe the NOT NULL constraint (SQL)Answer: The NOT NULL constraint indicates that the column cannot contain a null value.

Example:

Here is the syntax for adding a NOT NULL constraint to the CREATE statement:

CREATE TABLE IF NOT EXISTS customer_transactions (
    customer_id int NOT NULL, 
    store_id int, 
    spent numeric
);

You can add NOT NULL constraints to more than one column. Usually this occurs when you have a COMPOSITE KEY, which will be discussed further below.
Here is the syntax for it:
CREATE TABLE IF NOT EXISTS customer_transactions (
    customer_id int NOT NULL, 
    store_id int NOT NULL, 
    spent numeric
);

Q: Describe the UNIQUE constraint (SQL)Answer: The UNIQUE constraint is used to specify that the data across all the rows in one column are unique within the table. The UNIQUE constraint can also be used for multiple columns, so that the combination of the values across those columns will be unique within the table. In this latter case, the values within 1 column do not need to be unique.

Example:

Let's look at an example.

CREATE TABLE IF NOT EXISTS customer_transactions (
    customer_id int NOT NULL UNIQUE, 
    store_id int NOT NULL UNIQUE, 
    spent numeric 
);
Another way to write a UNIQUE constraint is to add a table constraint using commas to separate the columns.
CREATE TABLE IF NOT EXISTS customer_transactions (
    customer_id int NOT NULL, 
    store_id int NOT NULL, 
    spent numeric,
    UNIQUE (customer_id, store_id, spent)
);

Q: Describe the PRIMARY KEY constraint (SQL)Answer:
  • The PRIMARY KEY constraint is defined on a single column (or a group of columns, forming a composite key), and every table should contain a primary key.
  • The values in this column uniquely identify the rows in the table.

Example:

Let's look at the following example:

CREATE TABLE IF NOT EXISTS store (
    store_id int PRIMARY KEY, 
    store_location_city text,
    store_location_state text
);
Here is an example of a group of columns serving as a composite key:
CREATE TABLE IF NOT EXISTS customer_transactions (
    customer_id int, 
    store_id int, 
    spent numeric,
    PRIMARY KEY (customer_id, store_id)
);

Remarks:

  • If a group of columns are defined as a primary key, they are called a composite key. That means the combination of values in these columns will uniquely identify the rows in the table. 
  • By default, the PRIMARY KEY constraint has the unique and not null constraint built into it.

Q: What is an upsert in RDBMS language?Answer:
  • In RDBMS language, the term upsert refers to the idea of inserting a new row in an existing table, or updating the row if it already exists in the table.
  • The action of updating or inserting has been described as "upsert".

Example:

Now let's assume that the customer moved and we need to update the customer's address. However we do not want to add a new customer id. In other words, if there is any conflict on the customer_id, we do not want that to change.

This would be a good candidate for using the ON CONFLICT DO NOTHING clause.
INSERT INTO customer_address (customer_id, customer_street, customer_city, customer_state)
VALUES
 (
 432, '923 Knox Street', 'Albany', 'NY'
 ) 
ON CONFLICT (customer_id) 
DO NOTHING;
Now, let's imagine we want to add more details in the existing address for an existing customer. This would be a good candidate for using the ON CONFLICT DO UPDATE clause.
INSERT INTO customer_address (customer_id, customer_street)
VALUES
    (
    432, '923 Knox Street, Suite 1' 
) 
ON CONFLICT (customer_id) 
DO UPDATE
    SET customer_street  = EXCLUDED.customer_street;

Remarks: The way this is handled in PostgreSQL is by using the INSERT statement in combination with the ON CONFLICT clause.

Q: How can you perform an upsert in PostgreSQL?Answer: By using the INSERT statement in combination with the ON CONFLICT clause.

Q: What is the extract, transform, load (ETL) process?Answer: It is a general procedure of copying data from one or more sources into a destination system which represents the data differently from the source(s) or in a different context than the source(s). 

Remarks: The ETL process became a popular concept in the 1970s and is often used in data warehousing.
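A minimal, hedged SQL-only sketch of the pattern, assuming hypothetical source and destination tables; real ETL pipelines usually involve external tooling, but an INSERT ... SELECT captures the extract-transform-load idea in one statement:

-- Extract rows from the source, transform them (normalize the state, aggregate spend),
-- and load the result into the destination table
INSERT INTO customer_summary (customer_id, state, total_spent)
SELECT customer_id,
       UPPER(customer_state) AS state,
       SUM(spent) AS total_spent
FROM customer_transactions_raw
GROUP BY customer_id, UPPER(customer_state);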

Module 3

Nosql Data Models

Q: What is eventual consistency?Answer:
  • Over time (if no new changes are made) each copy of the data will be the same, but if there are new changes, the data may be different in different locations.
  • The data may be inconsistent for only milliseconds. There are workarounds in place to prevent getting stale data.

Remarks: See also: https://en.wikipedia.org/wiki/Eventual_consistency

Q: What does the CAP theorem state?Answer: It states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:
  1. Consistency
  2. Availability
  3. Partition tolerance

Q: What do consistency, availability and partition tolerance mean in the CAP theorem?Answer:
  1. Consistency: Every read receives the most recent write or an error.
  2. Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write.
  3. Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.

Q: How is CAP Consistency different from ACID Consistency?Answer:
  • ACID consistency is all about database rules. If a schema declares that a value must be unique, then a consistent system will enforce uniqueness of that value across all operations. If a foreign key implies deleting one row will delete related rows, then a consistent system will ensure the state can’t contain related rows once the base row is deleted.
  • CAP consistency promises that every replica of the same logical value, spread across nodes in a distributed system, has the same exact value at all times. Note that this is a logical guarantee, rather than a physical one. Due to the speed of light, it may take some non-zero time to replicate values across a cluster. The cluster can still present a logical view by preventing clients from viewing different values at different nodes.

Remarks: Source: https://www.voltdb.com/blog/2015/10/disambiguating-acid-cap/

Q: List a few important rules for data modeling in Apache Cassandra.Answer:
  • Denormalization must be done for fast reads
  • ALWAYS think Queries first
  • One table per query is a great strategy
  • Apache Cassandra does not allow for JOINs between tables (see the CQL sketch after this list)
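A hedged CQL sketch of the "one table per query" idea, with hypothetical table and column names: the same song data is duplicated into two tables, each partitioned for the query it serves:

-- Query 1: fetch songs by artist
CREATE TABLE IF NOT EXISTS songs_by_artist (
    artist_name text,
    song_title text,
    album_name text,
    PRIMARY KEY (artist_name, song_title)
);

-- Query 2: fetch songs by album
CREATE TABLE IF NOT EXISTS songs_by_album (
    album_name text,
    song_title text,
    artist_name text,
    PRIMARY KEY (album_name, song_title)
);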

Q: Why do we need to use a WHERE statement in Apache Cassandra?Answer:
  • The WHERE statement allows us to do fast reads.
  • With Apache Cassandra, we are talking about big data -- think terabytes of data -- so we are making it fast for read purposes.
  • Data is spread across all the nodes. By using the WHERE statement, we know which node to go to in order to get that data and serve it back.

Example:

  • For example, imagine we have 10 years of data on 10 nodes or servers, so each year's data is on a separate node.
  • By using the WHERE year = 1 statement, we know which node to visit to quickly pull the data from (see the CQL sketch below).
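A hedged CQL sketch of that years example, assuming a hypothetical table partitioned by year:

CREATE TABLE IF NOT EXISTS events_by_year (
    year int,
    event_id int,
    payload text,
    PRIMARY KEY (year, event_id)  -- year is the partition key
);

-- Filtering on the partition key tells Cassandra exactly which node holds the data
SELECT * FROM events_by_year WHERE year = 1;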

Q: Can you do SELECT * FROM myTable in Apache Cassandra?Answer: It is highly discouraged because performance will be slow (or it may just fail) but it is possible with the "ALLOW FILTERING" configuration.

Q: What is the purpose of the PRIMARY KEY in Apache Cassandra?Answer: It is used to uniquely identify each row and to specify how the data is distributed across the nodes (or servers) in our system.

Q: What is the difference between a simple PRIMARY KEY and a composite PRIMARY KEY?Answer:
  1. A Simple PRIMARY KEY is just one column that is also the PARTITION KEY. 
  2. A Composite PRIMARY KEY is made up of more than one column and will assist in creating a unique value in your retrieval queries.

Q: What is the PRIMARY KEY made up of in Cassandra?Answer: It is made up of either just the PARTITION KEY, or it may additionally include one or more CLUSTERING COLUMNS.
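A hedged CQL sketch contrasting the two, with hypothetical tables and columns:

-- Simple PRIMARY KEY: one column, which is also the PARTITION KEY
CREATE TABLE IF NOT EXISTS users (
    user_id int PRIMARY KEY,
    user_name text
);

-- Composite PRIMARY KEY: a PARTITION KEY (user_id) plus a CLUSTERING COLUMN (order_date);
-- rows are distributed by user_id and sorted within each partition by order_date
CREATE TABLE IF NOT EXISTS orders_by_user (
    user_id int,
    order_date date,
    amount decimal,
    PRIMARY KEY (user_id, order_date)
);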

Acronyms

DBMS: Database Management System Wiki link

DDL: Data Definition Language Wiki link

RDBMS: Relational Database Management System Wiki link

SQL: Structured Query Language Wiki link

ACID: Atomicity, Consistency, Isolation, Durability

CQL: Cassandra Query Language

ERD: Entity Relationship Diagram Wiki link

OLAP: Online Analytical Processing Wiki link

OLTP: Online Transaction Processing

ETL: Extract, Transform, Load Wiki link
