LeData/ML-ensemble-with-Feature-Crawler

Feature Engineering Module

This module provides classes that help automate feature extraction and management. It contains the following tools:

A FeatureCrawler class which maintains a graph of possible feature combinations. It recursively explores discrete feature spaces to find independent optimized sets of features for blending, or the best combination of features given a starting feature set (a usage sketch follows the attribute list below).

  • Methods:
    • update_graph(): takes a list of features and expands the feature graph if new features are found.
    • get_unscored_node(): returns a list of features that hasn't been tested yet.
    • record_score(): takes in the score of a list of features and updates the graph.
    • get_leaves_features(): returns a list of sets of features for the current leaves of the feature space.
    • check_condition(): checks if a certain condition on the features space is met.
    • prune(): removes all scored peripheral nodes of the graph that are not a global maximum. This should only be used when looking for a single model; it is not suitable for blending.
  • Attributes:
    • status_ : percentage crawled / current proportion of scored nodes in the graph
    • leaves_ : {current peripheral nodes that were scored : their score}
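
As a usage sketch (assuming an `evaluate()` helper that cross-validates a model on the given features; the import path and the exact `record_score` and `check_condition` signatures are my assumptions):

```python
from feature_crawler import FeatureCrawler  # assumed import path

crawler = FeatureCrawler()
crawler.update_graph(["ip_count", "app_channel_count", "hour"])  # candidate features (illustrative names)

# Crawl until some stopping condition on the feature space is met.
while not crawler.check_condition():
    features = crawler.get_unscored_node()   # an untested combination of features
    score = evaluate(features)               # placeholder: e.g. cross-validated roc_auc on these features
    crawler.record_score(features, score)    # write the score back into the graph (signature assumed)

print(crawler.status_)                       # proportion of nodes scored so far
blend_sets = crawler.get_leaves_features()   # local-maxima feature sets to blend
```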

A FeatureManager class which simplifies feature creation, lightweight storage and fast retrieval through a configuration file and the parquet format. It currently handles only binary classification (a retrieval sketch follows the attribute list below).

  • Methods:

    • update_features():
    • get_sample(): returns a balanced dataset sample with the requested features. If an index is provided, it simply returns the features at that index.
    • get_sample_index(): returns the index of a balanced sample, if target=True, also returns a sample series of the target variable.
    • get_training_data(): returns the whole training data with the given features.
    • get_test_data(): returns the test data with the given features.
    • feat_xxxxx: feature generators. For the moment each generator must save its series to file and return the series' name. In the future this will be taken care of by a wrapper, and the feature generator will only have to return the generated series.
  • Attributes:

    • feature_list_: simple list of feature names. Each name is generated by its feature generator and corresponds to the file on disk.
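
A hypothetical retrieval flow (the import path, constructor argument, keyword names and feature names are illustrative; only the method names come from the list above):

```python
from feature_manager import FeatureManager  # assumed import path

fm = FeatureManager("features_config.yaml")         # assumed configuration file
fm.update_features()                                # register / generate any new features

# Work on a balanced sample of the very unbalanced target.
idx, y = fm.get_sample_index(target=True)           # balanced index plus the matching target series
X = fm.get_sample(["ip_count", "hour"], index=idx)  # only the requested feature columns, parquet-backed

# Full datasets for the final fit and the submission.
X_train = fm.get_training_data(["ip_count", "hour"])
X_test = fm.get_test_data(["ip_count", "hour"])
```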

Model Blending module

This module is a blending architecture for gradient boosted machines. It has 3 layers:

  • 1st layer: Takes a dataset and some given features and learns the best combinations of features for XGBoost, LightGBM and CatBoost independently through cross validation. Once a given score threshold has been reached or the whole feature space has been explored, it returns a dataframe with the predictions of all three models for the feature combinations found to be local maxima in the feature space.
  • 2nd layer: Trains 3 models on the predictions from layer 1. Returns a dataframe of 3 prediction columns.
  • 3rd layer: Blends the predictions of the 2nd layer using weighted averages. Returns a single prediction.

Note - hyper-parameter tuning is currently not part of the blender. Parameters must be provided by hand.
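
The three layers roughly correspond to the following flow (a sketch only: `X_tr`, `y_tr`, `X_va`, `y_va`, the `feats_*` sets and the layer-3 weights are placeholders, and the out-of-fold bookkeeping of the real blender is omitted):

```python
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Layer 1: each GBM is fit on its own crawler-selected feature combinations;
# the real blender produces one column per local-maximum feature set.
layer1 = pd.DataFrame({
    "xgb": XGBClassifier().fit(X_tr[feats_xgb], y_tr).predict_proba(X_va[feats_xgb])[:, 1],
    "lgb": LGBMClassifier().fit(X_tr[feats_lgb], y_tr).predict_proba(X_va[feats_lgb])[:, 1],
    "cat": CatBoostClassifier(verbose=0).fit(X_tr[feats_cat], y_tr).predict_proba(X_va[feats_cat])[:, 1],
})

# Layer 2: three models trained on the layer-1 prediction columns.
layer2 = pd.DataFrame({
    "xgb": XGBClassifier().fit(layer1, y_va).predict_proba(layer1)[:, 1],
    "lgb": LGBMClassifier().fit(layer1, y_va).predict_proba(layer1)[:, 1],
    "cat": CatBoostClassifier(verbose=0).fit(layer1, y_va).predict_proba(layer1)[:, 1],
})

# Layer 3: weighted average of the layer-2 columns into a single prediction.
weights = np.array([0.4, 0.3, 0.3])  # illustrative, hand-picked weights
final_prediction = layer2.values @ weights
```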

Kaggle AdTracking Competition

This repository is work stemming from my participation in the competition. The problem tackled is a very modern one and crucial to our current economy: how to prevent the click fraud plaguing the online advertising industry. The Chinese firm TalkingData serves advertisements for its clients through many channels. It pays these channels by the click and is in turn paid by the client for the exposure received, counted by engagement, in other words clicks as opposed to views. Bad actors in this system generate clicks that do not correspond to actual potential customers and collect revenue for them. These actors are often called, or rely on the services of, "click farms".

The challenge is a simpler version of the problem, where one needs to predict which clicks result in actual engagement, i.e. a download of the advertised app. The target feature, is_attributed, is a boolean indicating whether a download occurred. The problem is therefore a binary classification.

The evaluation metric chosen for the competition is roc_auc, the area under the ROC curve.
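
For reference, the score can be computed on predicted probabilities with scikit-learn (`y_true` and `y_proba` are placeholders):

```python
from sklearn.metrics import roc_auc_score

# y_true: the 0/1 is_attributed labels, y_proba: predicted download probabilities
score = roc_auc_score(y_true, y_proba)
```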

Reference: https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection

// Dataset description and optimization //

This dataset is far from being BIG DATA, but at a few hundred million rows the raw data fills almost 4 GB of memory once loaded. It is nevertheless one of the biggest datasets in Kaggle competitions. For many Kagglers that means very tight optimization is required to work locally (8 GB of RAM in my case). Even with Kaggle kernels (17 GB of RAM), one needs to get extremely crafty to generate any significant number of new features.
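
One common way to stay inside these memory budgets is to declare narrow dtypes while loading; a sketch (the column names are from the competition data, the dtype choices are mine):

```python
import pandas as pd

dtypes = {
    "ip": "uint32",
    "app": "uint16",
    "device": "uint16",
    "os": "uint16",
    "channel": "uint16",
    "is_attributed": "uint8",
}

train = pd.read_csv(
    "train.csv",
    dtype=dtypes,
    usecols=list(dtypes) + ["click_time"],  # skip the columns we do not need
    parse_dates=["click_time"],
)
print(train.memory_usage(deep=True).sum() / 2**30, "GB")
```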

You can find some EDA (Exploratory Data Analysis) here:

The original features (click_time, ip, app, channel, os, device) are, aside from the timestamp and the target, all categorical.

// Model Selection //

To my knowledge, the models available to us for binary classification are:

  • Logistic Regression
  • Decision Trees and their ensembles: Random Forest, Gradient Boosted Trees (XGBoost, LightGBM, CatBoost)
  • Support Vector Classifiers
  • Naive Bayes Classifiers (poor at probability prediction, so low performance on roc_auc)
  • Neural Networks

Since I do not have the resources for neural nets, and since SVMs and logistic regression would give subpar performance due to the highly non-continuous nature of the data, I picked decision-tree-based techniques for the competition. At first, I went for the underdog: CatBoost.

CatBoost is the newest of the gradient boosting machines listed above and comes from the Russian firm Yandex. It leads most benchmarks in performance and speed. It also specializes in handling categorical features, which made it sound like the perfect candidate for this competition. Moreover, its newness meant there might be an opportunity to gain an edge over those who stayed with the old and trusted XGBoost. CatBoost came with its own challenges, though: its documentation is much less extensive than that of the other two and, more importantly, it is extremely memory-hungry, so much so that training on a 4 GB dataset requires over 66 GB of memory. Like most participants in this competition, I ended up favoring Microsoft's LightGBM instead, for its memory efficiency.

// Categorical Features //

At first glance, it seems that our 5 features are categorical variables. They have been anonymized, so the variable type is integer. No indication was given as to how the encoding was done, and a quick EDA reveals some very surprising things about the distribution of the values. In any case, a choice needs to be made for each of them (a small sketch of the first two options follows the list). We can:

  • Method 1 - keep them as integers, possibly re-ordering them by some metric (what catboost does).
  • Method 2 - Use One-Hot-Encoding, creating as many boolean features as there are distinct values (minus one).
  • Method 3 - Use Entity Embeddings.
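
A small sketch of the first two methods with pandas (Method 3, entity embeddings, would require a neural network and is omitted; `train` is assumed to be the loaded dataframe):

```python
import pandas as pd

# Method 1: keep the codes as integers, optionally re-ordered by a metric
# such as each category's mean target rate (roughly what CatBoost does internally).
order = train.groupby("app")["is_attributed"].mean().rank(method="dense")
train["app_ranked"] = train["app"].map(order).astype("uint16")

# Method 2: one-hot encoding, one boolean column per distinct value (minus one).
# Only realistic for low-cardinality columns such as device or os.
device_dummies = pd.get_dummies(train["device"], prefix="device", drop_first=True)
```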

Moreover, the target feature is extremely unbalanced: about 0.2% of all rows are positive. This adds complexity, and there are several ways to deal with the distribution imbalance:

  • Fix 1 - Undersampling
  • Fix 2 - Random Oversampling (boostrap of the positive class)
  • Fix 3 - Clustering Oversampling (stratified bootstrap)
  • Fix 4 - Synthetic Minority Oversampling Technique (SMOTE)
  • Fix 5 - Modified Synthetic Minority Oversampling Technique (MSMOTE)

The last 3 may not make sense with categorical features; see https://www.analyticsvidhya.com/blog/2017/03/imbalanced-classification-problem/ for details. Moreover, another contestant's analysis shows there is no gain from them.
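
Fix 1 is essentially what FeatureManager's get_sample does; a minimal standalone version (the 1:1 ratio is an assumption):

```python
import pandas as pd

def balanced_sample(df, target="is_attributed", ratio=1, seed=0):
    """Fix 1, undersampling: keep every positive row and a random subset of negatives."""
    pos = df[df[target] == 1]
    neg = df[df[target] == 0].sample(n=len(pos) * ratio, random_state=seed)
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed)  # shuffle the rows

sample = balanced_sample(train)
```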

Ideally, we also want to measure the run time of each of these models so as to know the marginal cost of the performance gains and make business sense of it all once the competition is over.

// Feature Engineering //

Improving the predictive power of a given model requires feeding it either more data or better data, which translates to either finding relevant external data sources or crafting new features out of the existing ones. Given the anonymized nature of the dataset, external data sources are not an option, so we can focus our attention on new features, keeping in mind which model each would benefit (a sketch of the first two ideas follows the list). I have thought of:

  • Time windows, e.g. morning/afternoon/evening.
  • Group-split-combine aggregates.
  • Moving sums and/or averages of the target variable for different features.
  • Moving sums and probabilities of different features.
  • Non-linear variables for linear models (logistic regression and SVC), which need numeric features.
  • Other time-series-based variables.
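
For instance, the time windows and group-split-combine aggregates could look like this (feature names and groupings are illustrative):

```python
import pandas as pd

# Time windows: bucket the click hour into night / morning / afternoon / evening.
train["hour"] = train["click_time"].dt.hour
train["day_part"] = pd.cut(train["hour"], bins=[0, 6, 12, 18, 24], right=False,
                           labels=["night", "morning", "afternoon", "evening"])

# Group-split-combine aggregate: clicks per (ip, app) pair, broadcast back to every row.
train["ip_app_clicks"] = (train.groupby(["ip", "app"])["channel"]
                               .transform("count").astype("uint32"))
```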

// Step towards production quality code //

This repository is meant to be useful to others, and I welcome any help with its development. I took this challenge as a chance to learn to write better code more than as a quest to reach the top of the leaderboard (which some achieve by blending publicly posted models). I have read plenty about test-driven development but have not managed to implement it yet. I would also like to add visualization tools to the crawler, to understand at a glance how far along it is and how fast the crawling is going. For example:

  • Proportion of graph scored vs number of rounds
  • Best Score vs number of rounds
  • Number of 'leaves' vs number of rounds
  • Distribution of feature representation

If this sounds like something you'd enjoy doing, please get in touch or submit a PR directly.
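
As a rough idea of the instrumentation I have in mind, the crawler state could be logged every round and plotted afterwards (the `history` structure is hypothetical):

```python
import matplotlib.pyplot as plt

history = []  # appended once per crawl round inside the loop, e.g.:
# history.append((crawler.status_, max(crawler.leaves_.values()), len(crawler.leaves_)))

rounds = range(1, len(history) + 1)
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
titles = ["Proportion of graph scored", "Best score", "Number of leaves"]
for ax, series, title in zip(axes, zip(*history), titles):
    ax.plot(rounds, series)
    ax.set_xlabel("round")
    ax.set_title(title)
plt.tight_layout()
plt.show()
```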
