SF-Transactions-Modeling

Project Description

The San Francisco Controller's Office maintains a database of spending and revenue data sourced from its citywide financial system. Transaction data is scraped from the official DataSF website and conformed into a dimensional model in a data warehouse. The purpose of this data engineering project is to support dashboarding and ad-hoc queries from business users.

Architecture

architecture

  • Transfer Family: Assumes a role on the client's behalf as an SSH user to transfer data to a remote AWS server through the SSH File Transfer Protocol (SFTP). SFTP enables secure data transfer from a local SFTP-enabled server to the AWS environment over the internet. A pair of SSH keys (public and private) is required for user authentication with the SFTP server.
  • S3: The SFTP server uses the S3 bucket as its persistent file storage, so data transferred through SFTP is stored directly in S3. All CSV data is transferred to and stored in S3.
  • Redshift: The Redshift cluster assumes a role on the client's behalf to pull data from S3 and load it into the data warehouse reporting area. To enable working with Redshift remotely, a VPC security group attached to the cluster allows incoming traffic on port 5439, the default port on which the Redshift database is exposed.
  • S3 Policies: Define the set of actions (e.g. GET, PUT, DELETE) that Transfer Family and Redshift are allowed to perform on the S3 bucket when assuming roles on the client's behalf. Transfer Family has permission to put transferred objects into the bucket, whereas Redshift has permission to fetch and copy data from the bucket over to the cluster. A rough sketch of these policies appears below.
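For illustration only, the two policies might look roughly like the following boto3 sketch. The bucket and role names are placeholders, not the project's actual identifiers, and the real policy documents are whatever infrastructures.py provisions.

# Hedged sketch: bucket and role names below are placeholders.
import json
import boto3

BUCKET = "sf-transactions-bucket"  # hypothetical bucket name

transfer_policy = {  # lets the SFTP server write incoming files
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:ListBucket"],
         "Resource": f"arn:aws:s3:::{BUCKET}"},
        {"Effect": "Allow", "Action": ["s3:PutObject"],
         "Resource": f"arn:aws:s3:::{BUCKET}/*"},
    ],
}

redshift_policy = {  # lets the cluster read objects for COPY
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:ListBucket"],
         "Resource": f"arn:aws:s3:::{BUCKET}"},
        {"Effect": "Allow", "Action": ["s3:GetObject"],
         "Resource": f"arn:aws:s3:::{BUCKET}/*"},
    ],
}

iam = boto3.client("iam")
iam.put_role_policy(RoleName="sftp-transfer-role",   # hypothetical role
                    PolicyName="transfer-s3-access",
                    PolicyDocument=json.dumps(transfer_policy))
iam.put_role_policy(RoleName="redshift-copy-role",   # hypothetical role
                    PolicyName="redshift-s3-access",
                    PolicyDocument=json.dumps(redshift_policy))

Scoping PutObject to the Transfer Family role and GetObject to the Redshift role keeps each service limited to the single action it needs on the bucket.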

Dimensional Model (ERD)

ERD

Dimension Tables

  • Program: A program identifies the service under which a transaction is recorded. Each program belongs to a department, which in turn belongs to an organization group.
  • Type: Specifies the kind of resources involved in a transaction. Sub-object is the most granular level of detail and rolls up to object and then character, in that order.
  • Fund: An activity receives governmental funding under a particular category based on its characteristics. Fund category is the lowest level of the hierarchy, rolling up to fund and then fund type, in that order.
  • Finance: Indicates whether a transaction is spending or revenue.

Fact Table

  • Transaction: The dollar amount processed in a transaction within an accounting period (fiscal year) that begins on July 1 and ends on June 30 of the following year. A rough schema sketch follows.
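For orientation, the star schema could be created roughly as in the sketch below. Every table and column name here is an illustrative guess; the ERD above is the authoritative definition.

# Hedged sketch of the star schema: names are illustrative only.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS dim_program (
    program_key        INT PRIMARY KEY,
    program            VARCHAR(256),
    department         VARCHAR(256),
    organization_group VARCHAR(256)
);

CREATE TABLE IF NOT EXISTS fact_transaction (
    transaction_key BIGINT IDENTITY(1, 1),
    program_key     INT REFERENCES dim_program (program_key),
    type_key        INT,               -- references the type dimension
    fund_key        INT,               -- references the fund dimension
    finance_key     INT,               -- references the finance dimension
    fiscal_year     INT,               -- July 1 through June 30 of the following year
    amount          DECIMAL(18, 2)     -- dollar amount of the transaction
);
"""

conn = psycopg2.connect(host="<redshift_endpoint>", port=5439,
                        dbname="dev", user="<user>", password="<password>")
with conn, conn.cursor() as cur:
    cur.execute(DDL)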

Sample Dashboard

dashboard

How to Run the Project

Make sure your local server is SFTP-enabled

1. Preliminary Setup

pip3 install -r requirements.txt
  • Make sure your AWS account has full access to work with S3, Transfer Family, Redshift, and IAM
  • Follow the boto3 configuration link if boto3 isn't already set up
  • Complete the params.cfg file wherever a 'TODO: Replace the value below' comment indicates (see the sketch after this list for how the file might be read)
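For orientation only, a script might read params.cfg with configparser along these lines; the section and key names here are hypothetical and will not match the file exactly.

# Hedged sketch: section and key names are placeholders.
import configparser

config = configparser.ConfigParser()
config.read("params.cfg")

region      = config.get("AWS", "REGION")                  # hypothetical keys
bucket_name = config.get("S3", "BUCKET_NAME")
cluster_id  = config.get("REDSHIFT", "CLUSTER_IDENTIFIER")
print(region, bucket_name, cluster_id)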

2. Generate an SSH Key Pair

  • Create a folder called ssh in the project root directory
  • Generate an RSA 4096-bit key pair in the ssh directory
ssh-keygen -t rsa -b 4096 -N "" -m PEM -f ssh/<key_name>  # TODO: Replace <key_name> with a key name of your choice
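infrastructures.py may already register the public key with the SFTP user; if you ever need to do it manually, boto3's Transfer Family client provides import_ssh_public_key. The server ID and user name below are placeholders.

# Hedged sketch: server ID and user name are placeholders.
import boto3

with open("ssh/<key_name>.pub") as f:       # public half of the key pair
    public_key_body = f.read().strip()

transfer = boto3.client("transfer")
transfer.import_ssh_public_key(
    ServerId="s-xxxxxxxxxxxxxxxxx",          # printed by infrastructures.py
    UserName="<sftp_server_username>",
    SshPublicKeyBody=public_key_body,
)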

3. Build AWS Infrastructures

python3 infrastructures.py   # Make sure you are in the project root directory
  • After all AWS resources have been provisioned, the SFTP server endpoint will be printed to the terminal (a sketch of how the endpoint can be derived appears below)
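For a public Transfer Family endpoint the hostname follows the pattern <server-id>.server.transfer.<region>.amazonaws.com. A minimal sketch of deriving it once the server is online, with a placeholder region and server ID:

# Hedged sketch: region and server ID are placeholders.
import time
import boto3

REGION = "us-west-2"                  # placeholder region
transfer = boto3.client("transfer", region_name=REGION)
server_id = "s-xxxxxxxxxxxxxxxxx"     # returned by transfer.create_server()

# Wait until the server is online before printing the endpoint.
while transfer.describe_server(ServerId=server_id)["Server"]["State"] != "ONLINE":
    time.sleep(10)

print(f"{server_id}.server.transfer.{REGION}.amazonaws.com")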

4. Connect to the Transfer Family SFTP Server

sftp -i ssh/<key_name> <sftp_server_username>@<sftp_server_endpoint>
  • Enter 'yes' if prompted to confirm the connection; you will be dropped into the SFTP prompt

5. Transfer Local Datasets to S3 Buckets over SFTP

put data/*.csv

files

  • Disconnect from the SFTP server, or open a new terminal
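Optionally, you can confirm the files arrived before loading them. A minimal sketch, assuming a hypothetical bucket name:

# Hedged sketch: the bucket name is a placeholder.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="sf-transactions-bucket")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])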

6. Load the Datasets into the Redshift Data Warehouse

python3 load_tables.py
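load_tables.py presumably issues Redshift COPY statements to move the CSV files from S3 into the warehouse tables. A minimal sketch of one such load, with placeholder endpoint, credentials, bucket, IAM role, and table names (the script's actual statements may differ):

# Hedged sketch of a single COPY; all identifiers are placeholders.
import psycopg2

conn = psycopg2.connect(host="<redshift_endpoint>", port=5439,
                        dbname="dev", user="<user>", password="<password>")
copy_sql = """
    COPY dim_program
    FROM 's3://sf-transactions-bucket/program.csv'
    IAM_ROLE 'arn:aws:iam::<account_id>:role/redshift-copy-role'
    FORMAT AS CSV
    IGNOREHEADER 1;
"""
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)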

7. Query Dimensional Model in Redshift Query Editor V2

redshift query editor

8. Tear Down AWS Infrastructures

python3 clean_up.py
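clean_up.py tears the environment down. A minimal sketch of the kinds of calls involved, with placeholder identifiers (the script's actual steps, such as IAM role and security group cleanup, may differ):

# Hedged sketch of teardown; identifiers are placeholders.
import boto3

transfer = boto3.client("transfer")
transfer.delete_server(ServerId="s-xxxxxxxxxxxxxxxxx")

s3 = boto3.resource("s3")
bucket = s3.Bucket("sf-transactions-bucket")
bucket.objects.all().delete()        # a bucket must be empty before deletion
bucket.delete()

redshift = boto3.client("redshift")
redshift.delete_cluster(ClusterIdentifier="sf-transactions-cluster",
                        SkipFinalClusterSnapshot=True)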