xfeat

Slides | Tutorial | Document | Installation

Flexible Feature Engineering & Exploration Library using GPUs and Optuna.

xfeat provides sklearn-like transformation classes for feature engineering and exploration. Unlike the sklearn API, xfeat offers a dataframe-in, dataframe-out interface and supports both pandas and cuDF dataframes. By using cuDF and CuPy, xfeat can generate features 10 to 30 times faster than naive pandas operations.

[Figures: group-by aggregation benchmark result, target encoding benchmark result]

Document

More examples are available in the ./examples directory.

Quick Start

xfeat provides a dataframe-in, dataframe-out interface:

[Figure: arithmetic combinations example]
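For example, a minimal sketch of this interface (the toy dataframe below is illustrative and not part of the library; the exact names of the generated columns may differ):

import pandas as pd
from xfeat import ArithmeticCombinations

# Illustrative toy input with numeric feature columns and a target column.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "target": [0, 1, 0]})

# dataframe-in, dataframe-out: the pairwise "+" combinations of the
# feature columns are returned as a new dataframe.
encoder = ArithmeticCombinations(
    exclude_cols=["target"], drop_origin=True, operator="+", r=2,
)
print(encoder.fit_transform(df))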

Feature Engineering

Encoder objects can be chained sequentially with xfeat.Pipeline. To avoid repeating the same feature extraction, it is convenient to save the results in the feather file format.

  • More encoder classes available here.
import pandas as pd
from xfeat import Pipeline, SelectNumerical, ArithmeticCombinations

# 2nd-order arithmetic combinations.
Pipeline(
    [
        SelectNumerical(),
        ArithmeticCombinations(
            exclude_cols=["target"], drop_origin=True, operator="+", r=2,
        ),
    ]
).fit_transform(pd.read_feather("train_test.ftr")).reset_index(
    drop=True
).to_feather(
    "feature_arithmetic_combi2.ftr"
)
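
The cached feature file can later be reloaded without recomputing the pipeline, for example (using the file name from the snippet above):

import pandas as pd

# Reload the cached features instead of recomputing them.
df_feat = pd.read_feather("feature_arithmetic_combi2.ftr")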

Target Encoding with cuDF/CuPy

[Figure: target encoding benchmark result]

Target encoding can be greatly accelerated with cuDF. Internally, aggregation is computed on the GPU using CuPy.

import cudf  # optional; pandas dataframes also work
from sklearn.model_selection import KFold
from xfeat import TargetEncoder

fold = KFold(n_splits=5, shuffle=False)
# `cols` holds the names of the categorical columns to encode.
encoder = TargetEncoder(input_cols=cols, fold=fold)

df = cudf.from_pandas(df)  # if cuDF is available.
df_encoded = encoder.fit_transform(df)

Groupby features with cuDF

[Figure: group-by aggregation benchmark result]

Benchmark result: group-by aggregation.

import cudf  # optional; pandas dataframes also work
from xfeat import aggregation

df = cudf.from_pandas(df)  # if cuDF is available.
df_agg = aggregation(
    df,
    group_key="user_id",
    group_values=["price", "purchased_amount"],
    agg_methods=["sum", "min", "max"],
).to_pandas()

Feature Selection with GBDT feature importance

Example code: examples/feature_selection_with_gbdt.py

from xfeat import GBDTFeatureSelector

params = {
    "objective": "regression",
    "seed": 111,
}
fit_kwargs = {
    "num_boost_round": 10,
}

selector = GBDTFeatureSelector(
    input_cols=cols,  # candidate feature columns
    target_col="target",
    threshold=0.5,  # keep the top 50% of features by importance
    lgbm_params=params,
    lgbm_fit_kwargs=fit_kwargs,
)
df_selected = selector.fit_transform(df)
print("Selected columns:", selector._selected_cols)

Feature Selection with Optuna

GBDTFeatureSelector uses a percentile hyperparameter to select features with the highest scores. By using Optuna, we can search for the best value for this hyperparameter to maximize the objective.

Example code: examples/feature_selection_with_gbdt_and_optuna.py

from functools import partial

import lightgbm as lgb
import optuna
from xfeat import GBDTFeatureExplorer


def objective(df, selector, trial):
    selector.set_trial(trial)
    selector.fit(df)
    input_cols = selector.get_selected_cols()

    # Evaluate with the selected columns only.
    # LGBM_PARAMS is assumed to set metric="rmse".
    train_set = lgb.Dataset(df[input_cols], label=df["target"])
    scores = lgb.cv(LGBM_PARAMS, train_set, num_boost_round=100, stratified=False, seed=1)
    rmse_score = scores["rmse-mean"][-1]
    return rmse_score


selector = GBDTFeatureExplorer(
    input_cols=input_cols,
    target_col="target",
    fit_once=True,
    threshold_range=(0.6, 1.0),
    lgbm_params=params,
    lgbm_fit_kwargs=fit_kwargs,
)

study = optuna.create_study(direction="minimize")
study.optimize(partial(objective, df_train, selector), n_trials=20)

selector.from_trial(study.best_trial)
print("Selected columns:", selector.get_selected_cols())

Installation

$ python setup.py install

If you want to use GPUs, cuDF and CuPy are required. See the cuDF installation guide.

For Developers

$ python setup.py test
