
SAG Award Nominations for Content Found on Netflix in 2019

  • Netflix source data note: the movies/TV shows in this data set are titles that Netflix had in its streaming catalog in 2019; the release dates of the content itself vary.

Source Data

Netflix data (2019): https://www.kaggle.com/shivamb/netflix-shows

SAG Awards data: https://www.kaggle.com/unanimad/screen-actors-guild-awards

Purpose

The purpose of this project is to import the above CSVs, transform the data and shape it for a database, create a SQL database with tables, populate those tables with the data transformed in Python, and run queries against the data.

Ultimately, I am trying to build a database one can query to discover which of the movies/TV shows Netflix had on its streaming service in 2019 were nominated for a SAG Award, whether they won, which actor/actress/director was associated with the nomination, and which category the award was in.

Process

Diagramming My Tables To Understand My Data

Netflix / SAG ERD (diagram). Note that the keys shown on the junction tables in this diagram are their composite keys, not actual foreign keys.

Cleaning The Individual Data Files

In order to prepare my files to be merged, I had to determine which series I wanted to join my dataframes on. I decided to merge the data on the movie/show name series, which was called 'title' in the Netflix data and 'show' in the SAG data. To ensure the values matched up, I converted both series to title case and stripped them to remove any extra whitespace.
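
A minimal sketch of that normalization, assuming the two Kaggle CSVs have been downloaded locally (the file names here are placeholders):

```python
import pandas as pd

# Placeholder file names -- substitute the paths of the downloaded Kaggle CSVs.
netflix_df = pd.read_csv("netflix_titles.csv")
sag_df = pd.read_csv("screen_actor_guild_awards.csv")

# Normalize the join columns: title case plus stripped whitespace, so values
# like ' the irishman ' and 'The Irishman' line up during the merge.
netflix_df["title"] = netflix_df["title"].astype(str).str.strip().str.title()
sag_df["show"] = sag_df["show"].astype(str).str.strip().str.title()
```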

In the Netflix data, I had to split apart the actors listed in the 'cast' series, since I wanted only one name per cell in each row. To do this, I brought in a chainer function that loops through the series, splits each cell on ', ', and appends the resulting names to a single flat list. To build the new dataframe, I used np.repeat to repeat the values in the other series so they lined up with the names produced by the split.
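
A sketch of that chainer/np.repeat pattern, continuing from the dataframes above (the 'actor_name' column name is illustrative):

```python
from itertools import chain

import numpy as np
import pandas as pd

def chainer(series):
    # Split every cell on ', ' and flatten the resulting lists into one long list.
    return list(chain.from_iterable(series.str.split(", ")))

# Drop rows with no cast listed so the split does not hit NaN values.
cast_df = netflix_df.dropna(subset=["cast"])

# How many names each row contributes after splitting.
lengths = cast_df["cast"].str.split(", ").map(len)

# Repeat the title once per cast member so each row holds a single name.
exploded = pd.DataFrame({
    "title": np.repeat(cast_df["title"].values, lengths),
    "actor_name": chainer(cast_df["cast"]),
})
```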

For the SAG data, I had to create a cleaner 'year' value, since the existing values were long strings I could not use directly. To clean this up, I wrote a for loop that traversed a range equal to the length of the dataframe, split each value on ' ', and kept the item at index 0. I saved the results to a list and then added that list to the original dataframe as my years_released series.
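
Roughly, that loop looks like the following (column names mirror my description rather than the exact Kaggle headers):

```python
# The raw 'year' values are long strings that begin with the ceremony year;
# keep only that leading token.
years_released = []
for i in range(len(sag_df)):
    years_released.append(str(sag_df["year"].iloc[i]).split(" ")[0])

sag_df["years_released"] = years_released
```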

From there, I grabbed the series from my new dataframe that I wanted to use during the merge, stripped some of them as a safety measure, and set the dataframe's index to 'id'.

Merging The Two DataFrames

With both of my dataframes cleaned up, I was ready to merge them on the 'title' series. The next step was to create a new dataframe that merged the information on the actor/actress name. This is an important step because it shows not only the titles that were on Netflix in 2019 and won SAG awards, but also the actors/actresses associated with each title. Had I not filtered the list in this way, there would be instances of titles without actors/actresses and vice versa.

Lastly, I exported this data to a CSV to use later in the process, when I create the dataframes that will populate my SQL tables.
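
A rough sketch of those two steps and the export, continuing from the earlier sketches and assuming the SAG nominee column is called 'full_name' (as in the Kaggle file):

```python
import pandas as pd

# Step 1: join the exploded Netflix cast rows to the SAG data on the normalized title.
on_title = pd.merge(exploded, sag_df, left_on="title", right_on="show", how="inner")

# Step 2: keep only rows where the Netflix cast member is also the SAG nominee,
# so every title carries its associated actor/actress and vice versa.
on_actor = on_title[on_title["actor_name"] == on_title["full_name"]]

# Persist the merged result for building the table-specific dataframes later.
on_actor.to_csv("netflix_sag_merged.csv", index=False)
```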

Create All Necessary SQL Tables in pgAdmin 4

In pgAdmin 4, I set up my tables based on the ERD shown above. I was careful to create the tables in the exact same order I diagrammed them, since there are dependencies between my tables.

I had to be careful here to create my junction tables in the right order so that I could leverage the proper primary keys for their composite keys.
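
To illustrate that ordering, here is a rough sketch of three of the eight tables, written as DDL executed through SQLAlchemy rather than the pgAdmin UI; the column names are assumptions based on the ERD description:

```python
from sqlalchemy import create_engine, text

# Placeholder connection string -- substitute real credentials and database name.
engine = create_engine("postgresql://postgres:password@localhost:5432/netflix_sag_db")

# Parent tables first, then the junction table that references them,
# so its foreign keys have primary keys to point at.
ddl_statements = [
    """
    CREATE TABLE actors (
        actor_id INT PRIMARY KEY,
        actor_name VARCHAR NOT NULL
    )
    """,
    """
    CREATE TABLE titles (
        title_id INT PRIMARY KEY,
        title VARCHAR NOT NULL
    )
    """,
    """
    CREATE TABLE act_ttl (
        actor_id INT REFERENCES actors(actor_id),
        title_id INT REFERENCES titles(title_id),
        PRIMARY KEY (actor_id, title_id)
    )
    """,
]

with engine.begin() as conn:
    for stmt in ddl_statements:
        conn.execute(text(stmt))
```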

Create DataFrames With Data That Matches The SQL Tables

After creating my SQL tables in the database, I was ready to go back to my CSV and create new dataframes matching the content I wanted to load into the database. This meant I had to make eight dataframes in total (actors, titles, act_ttl**, directors, dir_ttl**, awards, act_award**, dir_award**).

The exact process I took to create these tables is laid out in my code comments, but largely it involved dropping duplicate / unneeded data and creating unique integer series for tables that did not have proper 'primary key' values.

** these are my junction tables
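
As a sketch of that pattern for three of the eight dataframes, continuing from the merged CSV above (column names are illustrative):

```python
import pandas as pd

merged = pd.read_csv("netflix_sag_merged.csv")  # the CSV exported earlier

# Deduplicate, then add an integer id where the data has no natural primary key.
actors_df = merged[["actor_name"]].drop_duplicates().reset_index(drop=True)
actors_df["actor_id"] = actors_df.index + 1

titles_df = merged[["title"]].drop_duplicates().reset_index(drop=True)
titles_df["title_id"] = titles_df.index + 1

# Junction table: map each (actor, title) pair onto the integer ids above.
act_ttl_df = (
    merged[["actor_name", "title"]]
    .drop_duplicates()
    .merge(actors_df, on="actor_name")
    .merge(titles_df, on="title")
    [["actor_id", "title_id"]]
)
```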

Push DataFrame Information To SQL Database

I used SQLAlchemy's create_engine and connection string functionality to connect to my SQL database, and then used pandas' to_sql() to push my data into the database.
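
Roughly, with the engine and dataframes from the earlier sketches:

```python
# Append into the pre-built tables rather than letting pandas create them,
# so the primary/foreign key constraints defined in pgAdmin stay intact.
actors_df.to_sql("actors", con=engine, if_exists="append", index=False)
titles_df.to_sql("titles", con=engine, if_exists="append", index=False)
act_ttl_df.to_sql("act_ttl", con=engine, if_exists="append", index=False)
```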

Verify That Information Was Sent Successfully

I then wrote basic queries to verify my data was present in my tables.
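
One quick way to do that check from Python, using the same engine, is something like:

```python
import pandas as pd

# Pull a few rows back from each table as a sanity check on the load.
for table in ["actors", "titles", "act_ttl"]:
    preview = pd.read_sql(f"SELECT * FROM {table} LIMIT 5", con=engine)
    print(f"{table}: {len(preview)} preview rows")
    print(preview)
```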

Run Queries On SQL Table Data

Below are snippets of the queries I ran, and I posted links to the results of those queries in the next section.
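
As a flavor of the kind of query involved, here is an illustrative example (not one of the original eight; table and column names are assumed from the ERD description):

```python
import pandas as pd

# Which actors/actresses were nominated, in which category, and did they win?
example_query = """
SELECT a.actor_name, aw.category, aw.won
FROM actors AS a
JOIN act_award AS aa ON aa.actor_id = a.actor_id
JOIN awards AS aw ON aw.award_id = aa.award_id
ORDER BY a.actor_name;
"""
results = pd.read_sql(example_query, con=engine)
print(results.head())
```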

Query Results

Query 1 - results

Query 2 - results

Query 3 - results

Query 4 - results

Query 5 - results

Query 6 - results

Query 7 - results

Query 8 - results

