Big Data Lake Solution for Warehousing Stock Data and Tweet Data

Created by Stuart Miller, Paul Adams, and Rikel Djoko.

Problem Statement

We want to build a large-scale data framework that will enable us to store and analyze financial market data as well as drive future predictions for investment.

For this project, we will use the following types of data.

Daily stock prices for all companies traded on the NYSE and the NASDAQ.
Intra-day values for all companies traded on the NYSE and the NASDAQ.
- Prices: high, low, open, close
- Supporting values: Bollinger Bands, stochastic oscillators, and moving average convergence divergence (MACD)
- Intra-day values are at 15-minute intervals
Tweets from over 100 investment-related Twitter accounts
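
As an illustration of one of the supporting values listed above, the following is a minimal sketch of how Bollinger Bands can be derived from a series of closing prices. It is not taken from this project's code: the function name `bollinger_bands`, the 20-period window, and the 2-standard-deviation width are assumptions (the last two are common conventions), and the population standard deviation is used here as one possible choice.

```python
from statistics import mean, pstdev


def bollinger_bands(closes, window=20, num_std=2.0):
    """Return a list of (middle, upper, lower) tuples, one per full window.

    `window` and `num_std` are conventional defaults, assumed here for
    illustration; they are not values taken from the project.
    """
    bands = []
    for end in range(window, len(closes) + 1):
        window_prices = closes[end - window:end]
        middle = mean(window_prices)                  # simple moving average
        spread = num_std * pstdev(window_prices)      # band half-width
        bands.append((middle, middle + spread, middle - spread))
    return bands
```

Calling `bollinger_bands(closes, window=20)` on a list of 15-minute closing prices would yield one band triple for each interval that has at least 20 prior observations.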

Overview of the Big Data Solution

Data Warehouse Overview

Two schemas were designed for this data warehouse: a fully normalized (snowflake) schema and a denormalized star schema. We will investigate the performance of the two schemas in the context of this problem. Conceptual diagrams of the data warehouse schemas are shown below.

More detailed schema diagrams were created with MySQL Workbench; the schema design can be accessed here.

Snowflake Schema

A diagram of the data warehouse snowflake schema is shown below.

Denormalized Star Schema

A diagram of the data warehouse star schema is shown below.
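
To make the difference between the two designs concrete, here is a small, self-contained sketch using Python's built-in `sqlite3` module. It is only an illustration: the table names (`sf_*` for snowflake, `st_*` for star), the columns, the ticker symbols, and the prices are all hypothetical toy values, not the project's actual Hive schema or data. It shows that the same question needs two joins against the normalized layout and one join against the denormalized layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflake layout (hypothetical): exchange is normalized out of company.
CREATE TABLE sf_dim_exchange (exchange_id INTEGER PRIMARY KEY, exchange_name TEXT);
CREATE TABLE sf_dim_company (company_id INTEGER PRIMARY KEY, symbol TEXT, exchange_id INTEGER);
CREATE TABLE sf_fact_price (company_id INTEGER, trade_date TEXT, close REAL);

-- Star layout (hypothetical): exchange name is folded into the company dimension.
CREATE TABLE st_dim_company (company_id INTEGER PRIMARY KEY, symbol TEXT, exchange_name TEXT);
CREATE TABLE st_fact_price (company_id INTEGER, trade_date TEXT, close REAL);

-- Made-up toy rows, loaded identically into both layouts.
INSERT INTO sf_dim_exchange VALUES (1, 'NYSE'), (2, 'NASDAQ');
INSERT INTO sf_dim_company VALUES (1, 'AAA', 1), (2, 'BBB', 2);
INSERT INTO sf_fact_price VALUES (1, '2020-01-02', 10.0), (1, '2020-01-03', 12.0), (2, '2020-01-02', 20.0);
INSERT INTO st_dim_company VALUES (1, 'AAA', 'NYSE'), (2, 'BBB', 'NASDAQ');
INSERT INTO st_fact_price SELECT * FROM sf_fact_price;
""")

# Average close per exchange: two joins in the snowflake layout.
snowflake_rows = conn.execute("""
    SELECT e.exchange_name, AVG(f.close)
    FROM sf_fact_price f
    JOIN sf_dim_company c ON f.company_id = c.company_id
    JOIN sf_dim_exchange e ON c.exchange_id = e.exchange_id
    GROUP BY e.exchange_name
    ORDER BY e.exchange_name
""").fetchall()

# The same question: one join in the star layout.
star_rows = conn.execute("""
    SELECT c.exchange_name, AVG(f.close)
    FROM st_fact_price f
    JOIN st_dim_company c ON f.company_id = c.company_id
    GROUP BY c.exchange_name
    ORDER BY c.exchange_name
""").fetchall()
```

Both queries return the same rows. The trade-off is that the star layout repeats the exchange name on every company row in exchange for avoiding the extra join.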

Big Data Solution Implementation

The big data solution is built on AWS.

Results

Queries were run on the two schemas with different EMR cluster sizes to see the impact of normalization on query time. The collected data is located here. A plot summarizing the results is shown below.
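
For readers who want to run a similar comparison, the sketch below shows one generic way to time a query. It is an assumption-laden illustration, not the harness used in this project: `time_query` is a hypothetical helper, `run_query` stands for any zero-argument callable that submits a query and waits for it to finish, and taking the median of three runs is an arbitrary choice.

```python
import time
from statistics import median


def time_query(run_query, repeats=3):
    """Call `run_query` `repeats` times and return the median wall-clock seconds.

    `run_query` is a hypothetical zero-argument callable that executes one
    query to completion; how it talks to the cluster is left to the caller.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return median(durations)
```

Recording one such measurement per schema and per cluster size gives a table of timings that can be plotted against cluster size.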

Reports

These reports were created during the course of this project.

Project Proposal: Initial proposal regarding the problem and the proposed solution for investigation.
Initial Project Presentation: A high level overview of the project idea and current status.
Final Project Presentation: A presentation describing the project goals, findings, and conclusions.
Project Paper: A paper describing the project.

Repo Structure

.
├── HQL              # HQL files for creating the data warehouses
├── cli              # The cli for this project
├── nifi             # NiFi control scripts
├── reports          # Reports generated for the project
├── sample_data      # Samples of data used in the project
├── scrape_utils     # All code for scraping data
├── LICENSE          # All code and analysis is licensed under the MIT license.
├── Project_Outline  # General outline of the project and milestone status
└── README.md
