scikit_test

RedCarpetUp internship application

First step (for a batch of movies):

Primary dataset

Download primary data from https://www.kaggle.com/PromptCloudHQ/imdb-data/data

Feature generation

Casts: https://archive.ics.uci.edu/ml/machine-learning-databases/movies-mld/data/casts.html
Awards_types (dataset AW): https://archive.ics.uci.edu/ml/machine-learning-databases/movies-mld/data/awtypes.html
Actors (dataset A): https://archive.ics.uci.edu/ml/machine-learning-databases/movies-mld/data/actors.html
Movies (dataset M): https://archive.ics.uci.edu/ml/machine-learning-databases/movies-mld/data/main.html

First exercise (time: 12 hours):

Load the primary dataset into pandas.
Scrape the data from the secondary links and load it into pandas. Any method is fine, but Beautiful Soup is recommended.
Use Levenshtein distance to match movie names in the primary dataset with the movies in dataset M. (Recommendation: https://github.com/seatgeek/fuzzywuzzy)
Persist the Levenshtein scores between movies in the primary dataset and movies in dataset M, and share them in a CSV.
Assume that the pair of movies with the highest Levenshtein similarity score (fuzzywuzzy scores are higher for closer matches, which corresponds to a smaller edit distance) is the same movie, and use that to merge the primary dataset with dataset M.
Using this merge, use the data in datasets AW and A to create additional features.
After this exercise, you should have multiple features for each movie. Share the processed data in CSV format.
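The matching steps above can be sketched with the standard library alone. This is a minimal illustration, not the repository's implementation: the function names, the CSV column names, and the toy titles are all assumptions, and the 0 to 100 score is a simple length-normalised edit distance rather than the exact ratio that fuzzywuzzy computes.

```python
import csv
import io


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Score in [0, 100]; higher means closer (assumed normalisation)."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def score_pairs(primary_titles, m_titles):
    """Score every (primary title, dataset M title) pair."""
    return [(p, m, round(similarity(p, m), 1))
            for p in primary_titles for m in m_titles]


def write_scores(scores, handle):
    """Persist all pair scores as CSV (hypothetical column names)."""
    writer = csv.writer(handle)
    writer.writerow(["primary_title", "m_title", "score"])
    writer.writerows(scores)


def best_matches(scores):
    """For each primary title keep the dataset M title with the highest score."""
    best = {}
    for p, m, s in scores:
        if p not in best or s > best[p][1]:
            best[p] = (m, s)
    return best


# Toy titles for illustration only, not taken from either dataset.
scores = score_pairs(["Prometheus", "La La Land"],
                     ["Promethius", "La-La Land", "Casablanca"])
buffer = io.StringIO()
write_scores(scores, buffer)
matches = best_matches(scores)
```

The `matches` dictionary can then serve as the join key between the two tables. Scoring every pair is quadratic in the number of titles, so a real run may need blocking (for example by release year) to stay within the time limit.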

Second exercise (time: 12 hours):

For modelling, divide the data into at least three samples:

Training
Testing
Out-of-time testing: this dataset must contain all 2016 releases. DO NOT INCLUDE 2016 releases in the previous two datasets. Feel free to experiment with the distribution of the training and testing datasets.
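One way to produce the three samples is sketched below. It assumes each row is a dictionary with a `Year` key (the column name is an assumption about the primary CSV), and the function name, the 25% test fraction, and the synthetic rows are illustrative choices only.

```python
import random


def split_by_time(rows, holdout_year=2016, test_fraction=0.25, seed=0):
    """Return (train, test, out_of_time) with no holdout-year rows in train/test."""
    out_of_time = [r for r in rows if r["Year"] == holdout_year]
    in_time = [r for r in rows if r["Year"] != holdout_year]
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    rng.shuffle(in_time)
    n_test = int(len(in_time) * test_fraction)
    return in_time[n_test:], in_time[:n_test], out_of_time


# Synthetic rows for illustration: 4 placeholder movies per year, 2006..2016.
rows = [{"Title": f"movie_{i}", "Year": 2006 + i % 11} for i in range(44)]
train, test, out_of_time = split_by_time(rows)
```

Filtering out the holdout year before shuffling is what keeps 2016 releases from leaking into the training or testing samples.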

Models to be implemented:

SVM for multiclass prediction - http://scikit-learn.org/stable/modules/svm.html#classification
LARS Lasso - http://scikit-learn.org/stable/modules/linear_model.html#lars-lasso
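A minimal sketch of fitting both model types with scikit-learn follows. The features and ratings here are randomly generated placeholders, not the movie data, and the rating buckets, `C`, and `alpha` values are arbitrary assumptions that would need tuning on the real features.

```python
import numpy as np
from sklearn.linear_model import LassoLars
from sklearn.svm import SVC

# Synthetic stand-in data: 200 "movies" with 5 numeric features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
rating = 6.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# For the multiclass SVM, bucket the continuous rating into three classes
# (0: below 5, 1: from 5 up to 7, 2: 7 and above). The cut points are assumed.
rating_class = np.digitize(rating, bins=[5.0, 7.0])

X_train, X_test = X[:150], X[150:]
rating_train, rating_test = rating[:150], rating[150:]
class_train, class_test = rating_class[:150], rating_class[150:]

# SVM classifier on the bucketed rating.
svm = SVC(kernel="rbf", C=1.0)
svm.fit(X_train, class_train)
svm_accuracy = svm.score(X_test, class_test)   # mean accuracy

# LARS Lasso regression on the continuous rating.
lars = LassoLars(alpha=0.01)
lars.fit(X_train, rating_train)
lars_r2 = lars.score(X_test, rating_test)      # coefficient of determination
```

Because the SVM predicts a class and LARS Lasso predicts a number, the two scores are not directly comparable; comparing them fairly would require either bucketing the regression output or choosing metrics per task.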

Model comparison metrics to be generated for each of your models:

Share a Jupyter notebook that reads in the .csv and contains all modelling code. Try to optimize the models as much as possible in the given time frame.

Brownie points:

Create a function for a single-movie run, something like this: def fun_name(movie_name): ......

Calling fun_name(movie_name) should predict the rating of that movie.
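As a rough illustration of the shape such a function could take, the sketch below looks up a movie's precomputed features and passes them to an already fitted model. Everything here is hypothetical: `predict_rating`, `feature_table`, and the `ConstantModel` stand-in are invented for the example, and a real version would use the features and model built in the exercises above.

```python
class ConstantModel:
    """Stand-in for a fitted estimator; always predicts the same rating."""

    def __init__(self, value):
        self.value = value

    def predict(self, feature_rows):
        return [self.value for _ in feature_rows]


def predict_rating(movie_name, feature_table, model):
    """Look up the movie's feature vector and return the model's prediction."""
    key = movie_name.strip().lower()
    if key not in feature_table:
        raise KeyError(f"No features available for {movie_name!r}")
    return float(model.predict([feature_table[key]])[0])


# Placeholder feature vector keyed by a normalised title.
feature_table = {"toy movie": [120.0, 3.0, 1.0]}
predicted = predict_rating(" Toy Movie ", feature_table, ConstantModel(6.5))
```

Any object with a scikit-learn style `predict` method could be passed as `model`, which keeps the single-movie entry point independent of which estimator was trained.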

Prateek2901/scikit_test

Folders and files

Latest commit

History

Repository files navigation

scikit_test

About

Topics

Resources

Stars

Watchers

Forks

Languages