In this project, I analyzed a dataset and then communicated my findings about it. I used the Python libraries NumPy, pandas, and Matplotlib to make the analysis easier. This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. Certain columns, like ‘cast’ and ‘genres’, contain multiple values separated by pipe (|) characters. The final two columns ending with “_adj” show the budget and revenue of the associated movie in terms of 2010 dollars, accounting for inflation over time.
It is recommended to install Anaconda, which comes with all of the necessary packages, as well as IPython notebook.
I recommend installing Anaconda, which comes with all of the necessary packages, as well as IPython notebook. You can find installation instructions here.
- pandas
- NumPy
- Matplotlib
- csv
- What is the number of movies released each year?
- How long are the movies?
- What is the average budget and revenue for the movies?
- What is the relationship between movie budget and revenue?
- How many movies in each genre?
- What is the most profitable movie?
After completing the project, I learned the following :
- All the steps involved in a typical data analysis process
- Being comfortable posing questions that can be answered with a given dataset and then answering those questions
- Knowing how to investigate problems in a dataset and wrangle the data into a format I can use
- Having practice communicating the results of my analysis
- Being able to use vectorized operations in NumPy and pandas to speed up your data analysis code
- familiarity with pandas' Series and DataFrame objects, which allows the data to be accessed more conveniently
- how to use Matplotlib to produce plots showing the findings
I observed that Budget and Revenue have a positively correlated relationship.The revenue has remained higher than the budget throughout the years. Drama and Comedy are most popular evident by amount of movies within each genre.