
Best Question Author Prediction - Enigma CodeFest - Analytics Vidhya

Source: https://datahack.analyticsvidhya.com/contest/enigma-codefest-machine-learning/

Leaderboard: https://datahack.analyticsvidhya.com/contest/enigma-codefest-machine-learning/lb

Overview

The Department of Computer Science and Engineering at IIT (BHU) Varanasi is proud to present the fifth instalment of its highly anticipated coding festival, Codefest, which will be held from 31st August to 2nd September 2018. The previous editions have witnessed remarkable success at a global level, from the inaugural edition in 2010 to the rebooted 2016 edition, which was especially notable in terms of its reach.

Problem Statement

Problem

An online question-and-answer platform has hired you as a data scientist to identify the best question authors on the platform. This identification will provide insight into how to increase user engagement. Given the tag of a question, the number of views it received, the number of answers, and the username and reputation of its author, the task is to predict the number of upvotes the question will receive.

DATA DICTIONARY

| Variable   | Definition                                     |
| ---------- | ---------------------------------------------- |
| ID         | Question ID                                    |
| Tag        | Anonymised tags representing question category |
| Reputation | Reputation score of question author            |
| Answers    | Number of times question has been answered     |
| Username   | Anonymised user id of question author          |
| Views      | Number of times question has been viewed       |
| Upvotes    | (Target) Number of upvotes for the question    |
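
As a reference point, here is a minimal sketch of loading the data with pandas; the file name train.csv is an assumption about how the contest files are stored locally.

```python
import pandas as pd

# Assumed local file name for the contest training data.
train = pd.read_csv("train.csv")

# Columns expected per the data dictionary above.
cols = ["ID", "Tag", "Reputation", "Answers", "Username", "Views", "Upvotes"]
print(train[cols].head())
print(train[cols].dtypes)
```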

EVALUATION METRIC

The evaluation metric for this competition is RMSE (root mean squared error).
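
For local validation, RMSE can be computed with scikit-learn as in the short sketch below; the arrays here are purely illustrative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([10, 250, 3, 47])  # actual upvote counts (illustrative)
y_pred = np.array([12, 230, 5, 40])  # predicted upvote counts (illustrative)

# RMSE = sqrt(mean((y_true - y_pred)^2))
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"RMSE: {rmse:.4f}")
```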

Leaderboard Rankings and Score

| Public LB Rank | Public Score       | Pvt LB Rank | ML Model                                   |
| -------------- | ------------------ | ----------- | ------------------------------------------ |
| 77             | 3543.8523122425    |             | LinearRegression                           |
| 22             | 1100.3336222340    |             | BaggingRegressor                           |
| 12             | 1016.7805765708    |             | BaggingRegressor                           |
|                | 1452.xxx           |             | BaggingRegressor(PCA=5)                    |
|                | 1268.1448673016037 |             | BaggingRegressor(PCA=5 and Grid)           |
|                | 1601.xx            |             | AdaBoostRegressor                          |
| 45             | 1177.7464239328351 |             | GradientBoostingRegressor(default params)  |
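
The table above mentions BaggingRegressor with PCA=5 and a grid search; the sketch below shows one plausible shape of such a pipeline. The feature columns, the one-hot encoding of Tag, and the parameter grid are illustrative assumptions, not the exact configuration used for these scores.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV

train = pd.read_csv("train.csv")  # assumed file name

# One-hot encode Tag so that PCA has more than five input columns to reduce.
X = pd.get_dummies(train[["Reputation", "Answers", "Views", "Tag"]], columns=["Tag"])
y = train["Upvotes"]

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),   # "PCA=5" in the table above
    ("bag", BaggingRegressor()),
])

# Small illustrative grid; the grid actually searched is not recorded here.
param_grid = {
    "bag__n_estimators": [10, 50],
    "bag__max_samples": [0.5, 1.0],
}

search = GridSearchCV(pipe, param_grid, scoring="neg_mean_squared_error", cv=3)
search.fit(X, y)
print(search.best_params_)
print("CV RMSE:", (-search.best_score_) ** 0.5)
```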

PUBLIC AND PRIVATE SPLIT (Leaderboard)

Note that the test data is further randomly divided into Public (30%) and Private (70%) data. Your initial responses will be checked and scored on the Public data.

The final rankings will be based on your private score, which will be published once the competition is over.

Solution

References

  1. A Complete Tutorial to Learn Data Science with Python from Scratch
  2. Pipelines, FeatureUnions, GridSearchCV, and Custom Transformers -- GoodReads
  3. A new categorical encoder for handling categorical features in scikit-learn
  4. Feature Union with Heterogeneous Data Sources -- GoodReads
  5. Building Scikit-Learn Pipelines With Pandas DataFrames -- GoodReads
  6. Using scikit-learn Pipelines and FeatureUnions -- GoodReads
  7. StackOverflow - Unable to use FeatureUnion to combine processed numeric and categorical features in Python
  8. StackOverflow - Issue with OneHotEncoder for categorical features
  9. StackOverflow - How to make pipeline for multiple dataframe columns?
  10. StackOverflow - How many principal components to take? To decide how many eigenvalues/eigenvectors to keep, consider why you are doing PCA in the first place: to reduce storage requirements, to reduce dimensionality for a classification algorithm, or for some other reason. If there are no strict constraints, plot the cumulative sum of the eigenvalues (assuming they are in descending order); dividing each value by the total sum before plotting shows the fraction of total variance retained versus the number of eigenvalues, which gives a good indication of the point of diminishing returns (i.e., little variance is gained by retaining additional eigenvalues). A sketch of this plot is given after this list.
  11. Github -- kennethclitton/Kaggle-College-Students-on-Loans
  12. PCA using Python (scikit-learn)
  13. scikit-learn Doc - PCA example with Iris Data-set
  14. API - FeatureUnion : class sklearn.pipeline.FeatureUnion(transformer_list, n_jobs=1, transformer_weights=None). A usage sketch combining numeric and categorical features is given after this list.
    1. Concatenates results of multiple transformer objects.
    2. This estimator applies a list of transformer objects in parallel to the input data, then concatenates the results. This is useful to combine several feature extraction mechanisms into a single transformer.
    3. Parameters of the transformers may be set using its name and the parameter name separated by a ‘__’. A transformer may be replaced entirely by setting the parameter with its name to another transformer, or removed by setting to None.
  15. API - Pipeline : class sklearn.pipeline.Pipeline(steps, memory=None).
    1. Pipeline of transforms with a final estimator.
    2. Sequentially apply a list of transforms and a final estimator.
    3. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods.
    4. The final estimator only needs to implement fit.
    5. The transformers in the pipeline can be cached using memory argument.
    6. The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a ‘__’, as in the example below. A step’s estimator may be replaced entirely by setting the parameter with its name to another estimator, or a transformer removed by setting to None.
  16. API - make_pipeline : sklearn.pipeline.make_pipeline(*steps, **kwargs).
    1. Construct a Pipeline from the given estimators.
    2. This is a shorthand for the Pipeline constructor; it does not require, and does not permit, naming the estimators. Instead, their names will be set to the lowercase of their types automatically.
  17. ML-Ensemble (flennerhag) - Scikit-learn style ensemble learning
  18. StackOverflow - Ensemble of different kinds of regressors using scikit-learn (or any other python framework)
  19. mlxtend - StackingRegressor
  20. scikit-learn Doc - Decision Tree Regression with AdaBoost
  21. Boosting and Bagging: How To Develop A Robust Machine Learning Algorithm
  22. Kaggle -- GridSearchCV + XGBRegressor (0.556+ LB)
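
Relating to reference 10, the sketch below plots the cumulative explained-variance ratio from a fitted PCA to decide how many components to keep; the data is random and purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Random stand-in for a scaled feature matrix (500 samples, 20 features).
rng = np.random.RandomState(0)
X = rng.rand(500, 20)

pca = PCA().fit(X)

# Fraction of total variance retained versus number of components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
plt.plot(range(1, len(cumulative) + 1), cumulative, marker="o")
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.show()
```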
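
Relating to references 14-16, the sketch below shows the usual FeatureUnion pattern of processing numeric and categorical columns in parallel and concatenating the results before a final estimator. The ColumnSelector helper and the column names are illustrative assumptions.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import BaggingRegressor

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns (illustrative helper)."""
    def __init__(self, columns):
        self.columns = columns
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.columns]

features = FeatureUnion([
    # Numeric branch: select and scale.
    ("numeric", Pipeline([
        ("select", ColumnSelector(["Reputation", "Answers", "Views"])),
        ("scale", StandardScaler()),
    ])),
    # Categorical branch: select and one-hot encode.
    ("categorical", Pipeline([
        ("select", ColumnSelector(["Tag"])),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])),
])

model = Pipeline([
    ("features", features),
    ("regressor", BaggingRegressor()),
])

# Nested parameters follow the name__param convention described above,
# e.g. features__numeric__scale__with_mean or regressor__n_estimators.
train = pd.read_csv("train.csv")  # assumed file name
model.fit(train[["Reputation", "Answers", "Views", "Tag"]], train["Upvotes"])
```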