
How does XGBoost generate class probabilities in a multiclass classification problem? #1746

Closed
sosata opened this issue Nov 8, 2016 · 15 comments


sosata commented Nov 8, 2016

When I try to dump a trained XGBoost tree, I get it in the following format:

0:[f37<39811] 
	1:[f52<199.5] 
		3:leaf=-0.021461
		4:[f2<0.00617284] 
			9:leaf=-0.0118755
			10:[f35<-83.548] 
				19:[f13<0.693844] 
					23:[f37<1831]
						29:leaf=-0
						30:leaf=0.123949
					24:leaf=0.0108628
				20:[f35<-74.6198] 
					25:leaf=-0.0175897
					26:leaf=0.051898
	2:leaf=-0.0239901

While it is clear what the tree structure is, it is not clear how to interpret the leaf values. For a binary classification problem and a log-loss cost function, the leaf values can be converted to class probabilities using a logistic function: 1/(1+exp(value)). However, for a multiclass classification problem, there is no information on which class those values belong to, and without that information, it is not clear how to calculate the class probabilities.

Any ideas? Or is there another function that gets that information out of the trained trees?
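For reference, in the binary case the conversion from summed leaf values (the margin) to a probability is the standard sigmoid, 1/(1+exp(-margin)); a minimal sketch (the function name is mine, and the base score is ignored for simplicity):

```python
import math

def margin_to_prob(margin):
    # binary:logistic maps the summed leaf values (the margin) to a
    # probability with the standard sigmoid; note the minus sign
    return 1.0 / (1.0 + math.exp(-margin))

print(margin_to_prob(0.0))  # 0.5: a zero margin means maximal uncertainty
```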


ghost commented Dec 15, 2016

I have the same question. The only thing I noticed is that if you have 20 classes and set the number of estimators to 100, you'll have 20*100 = 2000 trees in your dumped model. My guess was that the first 100 estimators classify the first class vs. the rest, the next 100 classify the second class vs. the rest, and so on.

I can't confirm this, but maybe it works something like that?


sosata commented Dec 15, 2016

That's a very good observation, which I hadn't noticed before. It's true: the number of trees in the model is n_estimators x n_classes. However, to figure out the order of the trees, I used the following toy example:

import numpy as np
import xgboost as xgb

x = np.concatenate([np.ones([10,1])*np.array([1,0,0]),
                    np.ones([10,1])*np.array([0,1,0]),
                    np.ones([10,1])*np.array([0,0,1])], axis=0)
y = np.array([['a']*10+['c']*10+['b']*10]).reshape([30,1])
# note: 'mlogloss' is an evaluation metric, not an objective;
# the multiclass probability objective is 'multi:softprob'
model = xgb.XGBClassifier(n_estimators=2, objective='multi:softprob').fit(x, y)
model.booster().dump_model('trees.txt')

which basically creates a training dataset where [1,0,0] is mapped to 'a', [0,1,0] is mapped to 'c', and [0,0,1] is mapped to 'b'. I intentionally switched the order of 'b' and 'c' to see if xgboost sorts classes based on their labels before dumping the trees.

Here is the resulting dumped model:

booster[0]:
0:[f0<0.5] yes=1,no=2,missing=1
	1:leaf=-0.0674157
	2:leaf=0.122449
booster[1]:
0:[f2<0.5] yes=1,no=2,missing=1
	1:leaf=-0.0674157
	2:leaf=0.122449
booster[2]:
0:[f1<0.5] yes=1,no=2,missing=1
	1:leaf=-0.0674157
	2:leaf=0.122449
booster[3]:
0:[f0<0.5] yes=1,no=2,missing=1
	1:leaf=-0.0650523
	2:leaf=0.10941
booster[4]:
0:[f2<0.5] yes=1,no=2,missing=1
	1:leaf=-0.0650523
	2:leaf=0.10941
booster[5]:
0:[f1<0.5] yes=1,no=2,missing=1
	1:leaf=-0.0650523
	2:leaf=0.10941

and here are the conclusions:

  1. As you mentioned, although n_estimators=2, the number of trees is 2x3=6.
  2. Looking at the order of the features that the trees use, it seems that the first tree belongs to the first class, the second tree to the second class, and so on, down to the last class. Then, the same pattern is repeated until all estimators are covered.
  3. The order of the classes seems to be consistent with what numpy.unique() returns - here it is alphabetical, and that's why the first tree is using f0 while the second is using f2.

It would be nice if this information were added to the xgboost documentation.
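The interleaved ordering in conclusion 2 can be sketched as a one-line index rule (an assumption inferred from the dump above, not a documented API; the helper name is mine):

```python
def tree_class(tree_index, n_classes):
    # trees appear in the dump interleaved by class: 0, 1, 2, 0, 1, 2, ...
    return tree_index % n_classes

# booster[0..5] in the 3-class, 2-estimator example above
order = [tree_class(i, 3) for i in range(6)]
print(order)  # [0, 1, 2, 0, 1, 2]
```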


sosata commented Jan 24, 2017

I'm closing this issue now.

@sosata sosata closed this as completed Jan 24, 2017
@GillesVandewiele

I still have a question here. I agree with how the trees are structured in the multi-class case, but how are the leaf values converted to probabilities?

It seems like for the binary case you have to apply 1/(1+exp(value)), while for the multi-class case you have to apply 1/(1+exp(-value)).


sosata commented Mar 29, 2017

That seems to be true, looking at the examples. Thanks for the tip!

But I don't understand why these are defined differently for the binary and multiclass cases.


GillesVandewiele commented Mar 29, 2017

Yes indeed, it is very weird and confusing to say the least.

I came to that conclusion by testing it on the UCI Car Dataset (where you classify cars into unacceptable, acceptable, good or very good).

It would be nice to have an extra feature that converts the trees to sklearn trees or something like that (currently I have a code base where I can convert these xgb trees and sklearn trees to my own decision-tree structure). For the latter, the sklearn trees, I'm at least 100% sure that they are converted correctly. For the xgb trees, doubt remains.


sosata commented Mar 31, 2017

Agreed, that would be a great feature.


mandarup commented Apr 20, 2017

@GillesVandewiele @sosata: so in that logistic (say, for class i) [1/(1+exp(-value))], the value is the sum of all leaf scores, over that class's trees, for the given sample?


sosata commented Apr 20, 2017

For the example in my earlier comment, if I try to predict the class probabilities for [1 0 0]:

print(model.predict_proba(np.array([[1,0,0]])))

it will generate these results:

[[ 0.41852772  0.29073614  0.29073614]]

There is no way for 1/(1+exp(-value)) to generate this. The only explanation is that these probabilities are generated by a softmax over the summed values across the classes:

p[i] = exp(val[i])/(exp(val[1])+exp(val[2])+...+exp(val[N]))

where i is the target class (out of N classes), and val[i] is the sum of all values generated from the trees belonging to that class.
In our example:

print(np.exp(+0.122449+0.10941)/(np.exp(+0.122449+0.10941)+np.exp(-0.0674157-0.0650523)+np.exp(-0.0674157-0.0650523)))
print(np.exp(-0.0674157-0.0650523)/(np.exp(+0.122449+0.10941)+np.exp(-0.0674157-0.0650523)+np.exp(-0.0674157-0.0650523)))
print(np.exp(-0.0674157-0.0650523)/(np.exp(+0.122449+0.10941)+np.exp(-0.0674157-0.0650523)+np.exp(-0.0674157-0.0650523)))

will generate:

0.418527719063
0.290736140469
0.290736140469

which is exactly what the predict_proba() function gave us.
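The same calculation, written as a small reusable sketch (softmax over the per-class leaf sums taken from the dump above):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))  # subtract the max for numerical stability
    return e / e.sum()

# per-class sums of leaf values for input [1, 0, 0]
val = np.array([0.122449 + 0.10941,        # class 'a': booster[0] + booster[3]
                -0.0674157 + -0.0650523,   # class 'b': booster[2] + booster[5]
                -0.0674157 + -0.0650523])  # class 'c': booster[1] + booster[4]
probs = softmax(val)
print(probs)  # ~[0.41852772, 0.29073614, 0.29073614]
```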

@sosata sosata reopened this Apr 20, 2017
@GillesVandewiele

That seems to be correct, @sosata.

In my case, I wanted to convert each of the trees individually (thus not using leaf values of other trees). There, a sigmoid function seemed to do the job.

@sosata sosata closed this as completed Aug 11, 2017

nimitpattanasri commented Nov 16, 2017

@sosata Is there any way to get feature importance per class? I think the current implementation just sums the contributions over all classes. The R API seems to have a parameter for that (see CodingCat@e940523), though.


sosata commented Nov 18, 2017

@MLxAI I have also only tried the Python API, and I don't remember being able to get them per class.

However, in my opinion, XGBoost's definition of feature importance as the total number of splits on a feature is a bit simplistic. Personally, I calculate feature importance by removing a feature from the feature set, retraining, and measuring the resulting drop (or increase) in accuracy when the model is trained and tested without it. I do this for all features and define "feature importance" as the amount of the drop (or minus the increase) in accuracy. Since training and testing can be repeated with different train/test subsets for each feature removal, one can also estimate confidence intervals for the importances. And by calculating the importance separately for each class, you get per-class feature importances. This has produced more meaningful results in my projects than counting the number of splits.
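A minimal sketch of that retrain-without-one-feature scheme, using a scikit-learn decision tree and synthetic data purely for illustration (the helper name and dataset are mine, not from XGBoost):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def drop_one_importance(X, y, make_model):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    base = accuracy_score(y_te, make_model().fit(X_tr, y_tr).predict(X_te))
    importances = []
    for j in range(X.shape[1]):
        keep = [k for k in range(X.shape[1]) if k != j]
        acc = accuracy_score(
            y_te, make_model().fit(X_tr[:, keep], y_tr).predict(X_te[:, keep]))
        importances.append(base - acc)  # drop in accuracy without feature j
    return importances

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = (X[:, 0] > 0.5).astype(int)  # only feature 0 is informative
imp = drop_one_importance(X, y, lambda: DecisionTreeClassifier(random_state=0))
```

Repeating the loop over several random train/test splits gives the spread needed for confidence intervals, and restricting `y_te` to one class at a time gives the per-class version.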

@chrisplyn

@sosata Your example is very intuitive, but I still don't know how you came up with the scores for each class. Can you explain that a little? Thanks!

@GillesVandewiele

@chrisplyn You evaluate all decision trees in your ensemble belonging to a specific class and sum the resulting scores.

As @sosata stated correctly earlier: the number of trees in your ensemble will be equal to n_estimators * n_classes.


csgwma commented Mar 7, 2018

@sosata your example is very clear. Thank you very much!

@chrisplyn The test instance [1,0,0] is evaluated by booster[0] through booster[5], yielding the leaf values [0.122449, -0.0674157, -0.0674157, 0.10941, -0.0650523, -0.0650523] respectively.
Boosters 0 and 3 belong to class 0, boosters 1 and 4 to class 1, and boosters 2 and 5 to class 2, so val[0] = (0.122449 + 0.10941), val[1] = (-0.0674157 + -0.0650523), val[2] = (-0.0674157 + -0.0650523).
Is that clearer?
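That grouping can be computed with a stride over the dumped leaf values (a sketch, assuming the interleaved booster order discussed above):

```python
# leaf values for instance [1,0,0], in booster order 0..5
leaf_values = [0.122449, -0.0674157, -0.0674157,  # round 1: classes 0, 1, 2
               0.10941, -0.0650523, -0.0650523]   # round 2: classes 0, 1, 2
n_classes = 3

# booster i belongs to class i % n_classes, so stride through the list
val = [sum(leaf_values[c::n_classes]) for c in range(n_classes)]
print(val)  # per-class sums, ~[0.2319, -0.1325, -0.1325]
```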

@lock lock bot locked as resolved and limited conversation to collaborators Oct 25, 2018