nazhanHarzula/Machine-Learning-Rate-Divorce-Celebrity

Machine-Learning-Process (Analysis and Predictions of Celebrity Divorce)

This project covers the data scraping and machine learning steps used to analyze and predict divorce among actors and actresses.

Inside Script "Scraping Data Artist.ipynb"
  • First, install the packages at the core of the scraping process: pandas, selenium, and bs4. Install any that are missing with pip install "name_package".
  • Second, because this project uses the Chrome web browser, download a ChromeDriver (http://chromedriver.chromium.org/downloads) that matches the Chrome version installed on your computer. To check the version, open Chrome, select the menu, choose Help, then About Google Chrome.
  • Third, get a list of actor and actress names from IMDb (https://www.imdb.com/search/name/?gender=male,female&ref_=rlm).
  • Fourth, with the list of names in hand, look up each person's biographical data on Wikipedia, using the name as the search keyword (https://en.wikipedia.org/wiki/name).
  • Fifth, save the scraped data to a file with a .csv or .xlsx extension.
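
The fourth and fifth steps can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the notebook's actual code, which relies on pandas, selenium, and bs4. The helper names (wikipedia_url, rows_to_csv), the column names, and the two placeholder names are assumptions made for the example. It only builds the Wikipedia URL for each name and serializes the result as CSV; it does not fetch any page.

```python
import csv
import io
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"


def wikipedia_url(name: str) -> str:
    """Build the Wikipedia article URL for a person's name (hypothetical helper)."""
    # Wikipedia titles use underscores instead of spaces; quote() percent-encodes
    # any remaining characters that are not URL-safe (e.g. accented letters).
    return WIKI_BASE + quote(name.strip().replace(" ", "_"))


def rows_to_csv(rows, fieldnames) -> str:
    """Serialize a list of dicts to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder names standing in for the list scraped from IMDb.
names = ["Jane Doe", "José Example"]
rows = [{"name": n, "wikipedia_url": wikipedia_url(n)} for n in names]

# Write csv_text to a file to persist it as a .csv.
csv_text = rows_to_csv(rows, ["name", "wikipedia_url"])
print(csv_text)
```

With pandas installed, the same rows could instead be passed to pandas.DataFrame(rows).to_csv(...) to produce the .csv file.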
Inside Script "Process Data Artist.ipynb"
  • This script analyzes patterns in the data scraped from IMDb and Wikipedia.
  • EDA (Exploratory Data Analysis) and data visualization are used to understand the contents of the data and examine it closely.
  • The next stage is feature engineering: finding and handling blank / NaN values, duplicate data, and outliers, extracting new (or known) features, and labeling the data for classification models.
  • This is followed by data balancing (imbalanced data handling), normalization, and encoding, and then by predictive modeling on the training data.
  • Finally, the model is assessed with a confusion matrix and ROC visualization to see how good its predictions are.
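
To make the final assessment step concrete, here is a dependency-free sketch of what a confusion matrix and the area under the ROC curve measure for a binary classifier. It is not the notebook's code: the labels and scores are made-up toy values, the 0.5 threshold is an assumption, and label 1 is assumed to mean "divorced". In practice, a library such as scikit-learn offers equivalents (confusion_matrix and roc_auc_score).

```python
def confusion_matrix(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels, where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive example gets a higher score than a random negative one
    (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(confusion_matrix(y_true, y_pred))  # (3, 0, 1, 2)
print(round(roc_auc(y_true, scores), 3))  # 0.889
```

The confusion matrix depends on the chosen threshold, while the AUC summarizes ranking quality across all thresholds, which is why the two are usually reported together.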

Note: A dataset of 831 lines is attached (folder: External Output) if you want to go straight to the modeling and prediction process.