EnkiDoctor/The_TMDB_data_analysis

Readme File


This is the README file of the SDSC2102 Group Project.
Name: Du Junye         SID: 56641800
Name: Yang Wentao       SID: 56643528
Name: Wu Jianrui         SID: 56641885
Name: Zhou Xin         SID: 56644501


Python Version and Package Versions:
Note: Please make sure that the packages below are installed correctly so that the program runs normally.
Python version and CUDA configuration:

  • Python 3.9.7
  • CUDA 11.0

Missing value handling and processing:

  • missingno 0.5.1
  • ast.literal_eval

Scientific computing packages:

  • NumPy 1.21.5
  • pandas 1.3.4
  • SciPy 1.8.0

Machine Learning tools:

  • scikit-learn 1.0.2
  • xgboost 1.5.2
  • LightGBM 3.3.2
  • catboost 1.0.5
  • torch 1.10.1+cu113
  • torchaudio 0.10.1+cu113
  • torchvision 0.11.2+cu113

Visualization tools:

  • Matplotlib 3.4.3
  • Seaborn 0.11.2
  • Plotly 5.5.0
  • Cufflinks 0.17.3

Interpretation tools:

  • Shap, eli5
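
Since the program depends on the pinned versions above, a quick check of the installed environment can save a failed run. The following is a minimal sketch, not part of the project code: the helper name `check_versions` is hypothetical, and the dictionary keys assume the usual PyPI distribution names (for example `lightgbm` for LightGBM). Shap and eli5 are left out because no version is listed for them.

```python
from importlib import metadata

# Versions copied from the lists above. The keys are assumed to be the
# PyPI distribution names, which may differ from the spelling in the lists.
EXPECTED = {
    "missingno": "0.5.1",
    "numpy": "1.21.5",
    "pandas": "1.3.4",
    "scipy": "1.8.0",
    "scikit-learn": "1.0.2",
    "xgboost": "1.5.2",
    "lightgbm": "3.3.2",
    "catboost": "1.0.5",
    "torch": "1.10.1+cu113",
    "torchaudio": "0.10.1+cu113",
    "torchvision": "0.11.2+cu113",
    "matplotlib": "3.4.3",
    "seaborn": "0.11.2",
    "plotly": "5.5.0",
    "cufflinks": "0.17.3",
}


def check_versions(expected):
    """Return {name: (expected, installed)} for every mismatch.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    # Print one line per package that is missing or at a different version.
    for name, (wanted, found) in check_versions(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

An empty result means every listed package is installed at exactly the listed version. A different version is reported but may still work.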

Outline:

I. Data Exploration and Data Cleaning

Handling missing data

Detect and visualize missing values

Filling missing values with supplementary data, or with the mean of the column
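
The fallback order described in this step (supplementary data first, column mean otherwise) can be sketched as follows. This is a dependency-free illustration under stated assumptions, not the project's code: the function `fill_missing`, the `id` key, and the `runtime` column are hypothetical names chosen for the example.

```python
from statistics import fmean


def fill_missing(rows, column, supplement=None, key="id"):
    """Fill None values in `column`, modifying `rows` in place.

    Order of preference: a value from `supplement` (a dict keyed by the
    row's `key` field), then the mean of the values originally present.
    """
    supplement = supplement or {}
    present = [row[column] for row in rows if row[column] is not None]
    column_mean = fmean(present) if present else None
    for row in rows:
        if row[column] is None:
            row[column] = supplement.get(row.get(key), column_mean)
    return rows


# Hypothetical records; the column names are illustrative only.
movies = [
    {"id": 1, "runtime": 100.0},
    {"id": 2, "runtime": None},
    {"id": 3, "runtime": 120.0},
    {"id": 4, "runtime": None},
]
extra = {2: 95.0}  # supplementary value available for movie 2 only
fill_missing(movies, "runtime", supplement=extra)
```

Here movie 2 takes the supplementary value, while movie 4 falls back to the mean of the two values that were present. The mean is computed before any filling, so supplementary values do not shift it. With pandas, the same order can be expressed as two chained `fillna` calls.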

Filtering and extracting useful information from the data

Extract key information as categorical features

Formatting non-standardized dates
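
The extraction and date steps above can be sketched with `ast.literal_eval`, which is listed among the tools. This is an illustration under assumptions that this README does not confirm: that the raw cells hold stringified lists of dictionaries with a `name` field, and that the raw dates look like `M/D/YY`. The function names and the `latest_year` cutoff are hypothetical.

```python
import ast
from datetime import date, datetime


def extract_names(cell, field="name"):
    """Parse a stringified list of dicts and return the `field` values."""
    if not isinstance(cell, str) or not cell.strip():
        return []  # missing or empty cell
    try:
        items = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []  # not a valid Python literal
    if not isinstance(items, list):
        return []
    return [
        item[field]
        for item in items
        if isinstance(item, dict) and field in item
    ]


def parse_short_date(text, latest_year=2019):
    """Convert an 'M/D/YY' string into a datetime.date.

    Two-digit years are ambiguous. Any year that would land after
    `latest_year` (an assumed cutoff) is moved back one century.
    """
    parsed = datetime.strptime(text, "%m/%d/%y").date()
    if parsed.year > latest_year:
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed


# Hypothetical cell contents, for illustration only.
genres = extract_names("[{'id': 18, 'name': 'Drama'}, {'id': 35, 'name': 'Comedy'}]")
released = parse_short_date("7/4/65")
assert released == date(1965, 7, 4)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for cells read from a data file. The extracted names can then be encoded as categorical features.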

II. Data Visualization

Distribution of numeric features

Distribution of categorical features

III. Prediction

XGBoost model
LightGBM model
CatBoost model
Feature importance and model interpretation

IV. Summary

Findings from the process


Device Information and running time:
CPU: Intel(R) Xeon(R) Gold 5216R CPU @ 2.10GHz
GPU: 2 × Tesla V100S
Expected running time: 5-10 minutes


References:

  • Hands-On Machine Learning with Scikit-Learn & TensorFlow by Aurélien Géron (O'Reilly). Copyright 2017 Aurélien Géron
  • SDSC2102 lecture notes
