
Enhance binning strategy #39

Open
NicolasHug opened this issue Nov 6, 2018 · 5 comments
@NicolasHug
Collaborator

Results are comparable to LightGBM when n_samples <= n_bins, because both libraries then use the actual feature values as bin thresholds.

This is no longer the case when n_samples > n_bins. In particular, on this very easy dataset (target = X[:, 0] > 0), LightGBM finds a perfect threshold of 1e-35 while pygbm's is -0.262. This leads to different trees and less accurate predictions (accuracy of 1 vs. 0.9).

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
from pygbm import GradientBoostingMachine
from lightgbm import LGBMClassifier
from pygbm.plotting import plot_tree

rng = np.random.RandomState(seed=2)

n_leaf_nodes = 5
n_trees = 1
lr = 1.
min_samples_leaf = 1

max_bins = 5
n_samples = 100

X = rng.normal(size=(n_samples, 5))
y = (X[:, 0] > 0)  # the target depends only on the sign of feature 0

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)

pygbm_model = GradientBoostingMachine(
    loss='log_loss', learning_rate=lr, max_iter=n_trees, max_bins=max_bins,
    max_leaf_nodes=n_leaf_nodes, random_state=0, scoring=None, verbose=1,
    validation_split=None, min_samples_leaf=min_samples_leaf)
pygbm_model.fit(X_train, y_train)
predicted_test = pygbm_model.predict(X_test)
acc = accuracy_score(y_test, predicted_test)
print(acc)

lightgbm_model = LGBMClassifier(
    objective='binary', n_estimators=n_trees, max_bin=max_bins,
    num_leaves=n_leaf_nodes, learning_rate=lr, verbose=10, random_state=0,
    boost_from_average=False, min_data_in_leaf=min_samples_leaf)
lightgbm_model.fit(X_train, y_train)
predicted_test = lightgbm_model.predict(X_test)
acc = accuracy_score(y_test, predicted_test)
print(acc)

plot_tree(pygbm_model, lightgbm_model, view=True)
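
For context, here is a minimal sketch of why quantile-style thresholds miss zero here (this is generic quantile binning, not necessarily pygbm's exact BinMapper logic):

import numpy as np

rng = np.random.RandomState(2)
x = rng.normal(size=100)

# With 5 bins, quantile binning puts thresholds at the 20th, 40th,
# 60th and 80th percentiles of the observed values.
thresholds = np.percentile(x, [20, 40, 60, 80])
print(thresholds)

# None of these is guaranteed to fall at 0, so a target defined by
# x > 0 can only be split approximately, whereas using the actual
# feature values as thresholds (the n_samples <= n_bins case) allows
# a split right at the boundary between negative and positive values.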


@NicolasHug
Collaborator Author

Here is another example, this time with pre-binned data.

I can't explain why the left root child has a different split gain. When I print the split gain values considered by LightGBM, no split is equal to 0.923.

The discrepancy may not come from the actual binning strategy here, but could be due to how the bins are treated afterwards. Some of them may not be considered, or they may be merged; I don't know.

from sklearn.model_selection import train_test_split
from pygbm import GradientBoostingMachine
from lightgbm import LGBMClassifier
from pygbm.plotting import plot_tree
from pygbm.binning import BinMapper
import numpy as np

rng = np.random.RandomState(seed=2)

n_leaf_nodes = 4
n_trees = 1
lr = 1.
min_samples_leaf = 1

max_bins = 255
n_samples = 100

X = rng.normal(size=(n_samples, 5))
y = (X[:, 0] > 0) & (X[:, 1] > .5)


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)

X_train = BinMapper().fit_transform(X_train)  # pre-bin the training data

pygbm_model = GradientBoostingMachine(
    loss='log_loss', learning_rate=lr, max_iter=n_trees, max_bins=max_bins,
    max_leaf_nodes=n_leaf_nodes, random_state=0, scoring=None, verbose=1,
    validation_split=None, min_samples_leaf=min_samples_leaf)
pygbm_model.fit(X_train, y_train)

lightgbm_model = LGBMClassifier(
    objective='binary', n_estimators=n_trees, max_bin=max_bins,
    num_leaves=n_leaf_nodes, learning_rate=lr, verbose=10, random_state=0,
    boost_from_average=False, min_data_in_leaf=min_samples_leaf)
lightgbm_model.fit(X_train, y_train)

plot_tree(pygbm_model, lightgbm_model, view=True)


@ogrisel
Owner

ogrisel commented Nov 8, 2018

It's not just the split gain that differs on the left root child: it also doesn't split on the same feature.

@NicolasHug
Collaborator Author

OK, I made some small progress on this. I still don't know the details of LightGBM's binning, but I can explain the two previous comments.


For the first comment (#39 (comment)), it looks like LightGBM forces -1e-35 and 1e-35 as binning thresholds, regardless of the binning strategy (see here and here). Now I understand why the function is called FindBinWithZeroAsOneBin... They also add 0 as one of the 'unique values' (see here), but that is not directly relevant here.

Do we want to do such a thing as well?

For the binning threshold, something like

midpoints = np.insert(midpoints, np.searchsorted(midpoints, 0), 0)

would do it, but that would make midpoints larger than 256 entries in most cases.
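
A rough sketch of how 0 could be forced in while staying within the size budget (hypothetical: the policy of dropping the existing threshold closest to zero is my own assumption, not LightGBM's or pygbm's actual behaviour):

import numpy as np

def insert_zero_threshold(midpoints, max_thresholds=255):
    # Force 0 to be a bin threshold so that positive and negative
    # values can never share a bin.
    idx = np.searchsorted(midpoints, 0)
    midpoints = np.insert(midpoints, idx, 0)
    if len(midpoints) > max_thresholds:
        # Over budget: drop the existing threshold closest to 0,
        # which the new 0 threshold largely supersedes.
        candidates = [i for i in (idx - 1, idx + 1)
                      if 0 <= i < len(midpoints)]
        drop = min(candidates, key=lambda i: abs(midpoints[i]))
        midpoints = np.delete(midpoints, drop)
    return midpoints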


For my second comment (#39 (comment)), the discrepancy comes from the min_data_in_bin parameter of LightGBM, which defaults to 3. Setting it to 1 gives the same trees. I should have seen this sooner :s
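
Concretely, reusing the parameters from the script above, something like this should make the trees match (min_data_in_bin is a documented LightGBM parameter, passed through the estimator's keyword arguments):

lightgbm_model = LGBMClassifier(
    objective='binary', n_estimators=n_trees, max_bin=max_bins,
    num_leaves=n_leaf_nodes, learning_rate=lr, verbose=10, random_state=0,
    boost_from_average=False, min_data_in_leaf=min_samples_leaf,
    min_data_in_bin=1)  # default is 3; 1 avoids merging sparse bins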

Side note: when debugging, it's helpful to set enable_bundle to False, because the bundling (of mutually exclusive features) changes the inner order of the features in LightGBM, which makes it harder to debug: feature 0 of LightGBM is not feature 0 of pygbm, etc. Regardless of the debugging, we should probably set it to False all the time in our checks, just in case.
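
For instance (a minimal sketch; enable_bundle is LightGBM's switch for Exclusive Feature Bundling and is accepted as a keyword argument):

# Keep LightGBM's internal feature order aligned with pygbm's so that
# trees can be compared feature by feature.
lightgbm_model = LGBMClassifier(objective='binary', enable_bundle=False)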

@ogrisel
Owner

ogrisel commented Nov 20, 2018

Do we want to do such a thing as well?

Maybe we should ask the LightGBM developers to explain why this is useful.

Regardless of the debugging, we should probably set it to False all the time in our checks, just in case.

+1, and we can re-enable it the day we implement feature bundling (hopefully).

@ogrisel
Owner

ogrisel commented Nov 20, 2018

For my second comment (#39 (comment)), the discrepancy comes from the min_data_in_bin parameter of LightGBM, which defaults to 3. Setting it to 1 gives the same trees. I should have seen this sooner :s

Nice catch.
