Twitter RecSys Challenge 2021

https://recsys-twitter.com/

Run an experiment

Locally

Requirements

  • Python 3.7.9
    • You can install this Python version (or any other version) easily with pyenv. Install pyenv and run pyenv install 3.7.9.
  • Java 8 for PySpark
    • If you're running on a Debian-based system, use adoptopenjdk. Link to SO (that'll save you hours of googling).
  • Pipenv
    • pip install pipenv

How to

  1. Pull/clone recsys-2021 on your system
  2. Download one of the sampled datasets and place it into data/raw/.
  3. cd to the repository root
  4. pipenv install -d (installs both develop and default packages from the Pipfile)
  5. pipenv shell (starts the shell for the environment associated with the repository)
  6. cd src
  7. python run.py ../data/raw/<your_dataset_file> <your_model_of_choice> false

Check python run.py --help for information on additional arguments.

On the cluster

Requirements

  • Koalas
    • pip install koalas

How to

  1. Pull/clone recsys-2021 on your system
  2. cd to the repository root
  3. cd src
  4. spark-submit --master yarn --deploy-mode cluster run.py <args> or (if the former does not work) spark-submit --master local --deploy-mode client run.py <args>

Add a custom feature

To create a new custom feature extractor:

  1. Add a file inside src/preprocessor/<category>/your_feature.py. If there's no appropriate category directory for your new feature, you can create one.
  2. You can use this boilerplate code.
    import pandas as pd
    import numpy as np
    import databricks.koalas as ks
    from pyspark.sql import SparkSession
    
    def your_feature(raw_data, features, auxiliaries):
        """Describe what the feature represents and how it is extracted."""
        
        # your feature extraction code...
        # raw_data: default features; features: previously extracted custom
        # features; auxiliaries: auxiliary DataFrames (see the section below)
        
        return your_feature_series
  3. Import your feature in features_store.py
  4. Use your new feature in your preferred model

Inside this function, you can:

  • Access any feature available in the raw dataset, i.e., all default features, even ones you don't use directly in your model. Default features are accessible through raw_data.
  • Access any custom feature extracted before yours. Custom features are available through features, a Python dict with the structure {feature_name: ks.Series}.

A single custom feature extractor can extract more than one feature; in that case, return a Koalas DataFrame instead of a Series.

Please make sure that your custom feature extractor returns one row for each input dataset row.
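
For instance, a rough sketch of a multi-output extractor is shown below. The column name text_tokens and the output feature names are hypothetical (they assume the raw dataset stores the tweet text as tab-separated token ids); adapt them to the actual schema.

import databricks.koalas as ks

def tweet_length_features(raw_data, features, auxiliaries):
    """Hypothetical extractor producing two features from the tweet text."""
    # Select only the column we need; derived columns keep the same row
    # alignment, so the extractor returns exactly one row per input row.
    df = raw_data[["text_tokens"]]

    # Number of (tab-separated) token ids in the tweet text.
    df["n_text_tokens"] = df["text_tokens"].apply(
        lambda t: len(t.split("\t")) if t is not None else 0
    )

    # A simple derived boolean feature.
    df["is_long_tweet"] = df["n_text_tokens"] > 64

    # More than one feature is extracted, so return a Koalas DataFrame.
    return df[["n_text_tokens", "is_long_tweet"]]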

Add and use an auxiliary dataset extractor

Auxiliary datasets are DataFrames which may be extracted from raw data (e.g., a tabular representation of an engagement graph). These DataFrames are not used directly as features in the model: instead, they can be used by feature extractors as an auxiliary pre-compiled data source.

Auxiliary DataFrames are extracted by means of a function (see, for example, src/preprocessor/graph/auxiliary_engagement_graph.py). Once extracted, one or more DataFrames are returned in the form of a Python dict. The return format is the following (even if you return only one DataFrame):

return {
    "auxiliary_df_1": ks.DataFrame,
    "auxiliary_df_2": ks.DataFrame,
    ...
}

auxiliary_df_1, auxiliary_df_2, ..., will be available to all feature extractors as keys of the Python dict auxiliaries, which is passed as an argument to every feature extractor. This means that any feature extractor must be defined with an additional argument auxiliaries, even if it doesn't use any auxiliary DataFrame. For instance:

def feature_extractor(raw_data, features, auxiliaries):
    # e.g., auxiliaries["auxiliary_df_1"] is the corresponding ks.DataFrame
    ...

The auxiliary dataset extractor function must be defined with two arguments, raw_data (which may be either training or test data) and auxiliary_train, which defaults to None.

This is because, at inference time (when running inference.py), the materialized auxiliary datasets (auxiliary DataFrames pre-computed on training data at training time and saved in data/auxiliary) may be integrated with additional data available in the test set, provided you implement the extra logic required to do that.

To support this, check whether auxiliary_train is not None: if it is not None, you are at inference time, raw_data is the test data, and auxiliary_train holds the pre-compiled auxiliary datasets, which must be appended (or integrated in some other way) to the auxiliary data newly extracted from raw_data, i.e., the test data.

Please note that this is not mandatory: you can decide not to integrate any further data from the test set. In that case, put all your training-time extractor code under an if clause such as if auxiliary_train is None: ... and, under else: ... (i.e., at inference time), return the keys of auxiliary_train corresponding to the auxiliary DataFrames of interest.
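
Putting this together, a rough sketch of an auxiliary dataset extractor might look like the following. The function name, the dict key user_activity, the column name engaging_user_id, and the concatenation step are all hypothetical; the sketch also assumes that auxiliary_train is a dict keyed like the extractor's own return value, as described above.

import databricks.koalas as ks

def auxiliary_user_activity(raw_data, auxiliary_train=None):
    # Hypothetical aggregation: number of engagements per engaging user.
    extracted = (
        raw_data.groupby("engaging_user_id")
        .size()
        .rename("n_engagements")
        .reset_index()
    )

    if auxiliary_train is None:
        # Training time: return the freshly extracted auxiliary DataFrame.
        return {"user_activity": extracted}

    # Inference time: extend the materialized training-time DataFrame
    # with the rows extracted from the test data.
    combined = ks.concat([auxiliary_train["user_activity"], extracted])
    return {"user_activity": combined}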

Important note: if you upload a submission with an auxiliary data extractor that you choose to extend at inference time, you must also upload the training-time auxiliary DataFrame that will be extended at test time.

To use an auxiliary dataset extractor in a model, add the auxiliary dataset extractor function to self.enabled_auxiliaries in the model constructor.
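
For instance, in a model's constructor this might look roughly like the sketch below. The class, the import path, and the extractor name are hypothetical (reusing the extractor sketched above); only the attribute name self.enabled_auxiliaries comes from the repository, and it is assumed here to be a plain list of extractor functions.

from preprocessor.graph.auxiliary_user_activity import auxiliary_user_activity

class MyModel:
    def __init__(self):
        # ... other model setup ...
        # Auxiliary dataset extractors this model's features depend on.
        self.enabled_auxiliaries = [auxiliary_user_activity]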

Check src/preprocessor/graph/auxiliary_engagement_graph.py to see a working example.

Useful Commands

Testing the run (training) script without Docker

If you want to clean up all preprocessed columns

tput reset && \
rm -rf data/preprocessed/* && \
time ARROW_PRE_0_15_IPC_FORMAT=1 PYARROW_IGNORE_TIMEZONE=1 python src/run.py \
    data/raw/sample_200k_rows native_xgboost_baseline false

If you want to keep all preprocessed columns

tput reset && \
time ARROW_PRE_0_15_IPC_FORMAT=1 PYARROW_IGNORE_TIMEZONE=1 python src/run.py \
    data/raw/sample_200k_rows native_xgboost_baseline false

Testing the inference script without Docker

Remember to run pipenv shell first.

This requires that you have data/raw/test with some *part* files inside. I suggest testing with ~1-10 GB of data to check whether memory is being wasted somewhere.

# Delete the old results_folder, a to_csv artifact
rm -rf results_folder && \
# Clear the terminal (optional)
tput reset && \
# Launch inference.py with the environment variables needed by Apache Arrow
ARROW_PRE_0_15_IPC_FORMAT=1 PYARROW_IGNORE_TIMEZONE=1 python src/inference.py
