-
I simplified the example to be able to get some insight:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

df = pd.read_csv("data.csv", header=None)
X = df.iloc[:18, :-1].to_numpy()
y = df.iloc[:18, -1].to_numpy()

_, axs = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))
for ax, feature_idx in zip(axs.ravel(), range(X.shape[1])):
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=1, random_state=0)
    tree.fit(X[:, [feature_idx]], y)
    plot_tree(tree, ax=ax)
    gain = tree.tree_.impurity[0] - tree.tree_.impurity[1:].sum()
    ax.set_title(f"Split along feature #{feature_idx}\nGain in terms of entropy:\n{gain}")
```

Features #2 and #3 (zero-indexed) lead to the same gain, so choosing one or the other would be the same.
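Note that `impurity[0] - impurity[1:].sum()` subtracts the raw child entropies; the textbook information gain weights each child's entropy by its sample fraction. A sketch of the weighted version, using a hypothetical toy dataset (the `weighted_entropy_gain` helper is not part of the original code):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def weighted_entropy_gain(clf):
    """Information gain at the root of a depth-1 tree: parent entropy
    minus the child entropies weighted by their sample fractions."""
    t = clf.tree_
    n = t.weighted_n_node_samples
    # in a depth-1 tree, nodes 1 and 2 are the root's children
    children = (n[1] * t.impurity[1] + n[2] * t.impurity[2]) / n[0]
    return t.impurity[0] - children

# toy check: a perfectly separating feature recovers the full parent entropy
X = np.array([[0.0], [0.1], [1.0], [1.1]])
y = np.array([0, 0, 1, 1])
clf = DecisionTreeClassifier(criterion="entropy", max_depth=1).fit(X, y)
print(weighted_entropy_gain(clf))  # parent entropy is 1 bit, children are pure
```

Either formula ranks the four single-feature stumps the same way here, but the weighted version is what the standard information-gain definition prescribes.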
-
I used a simplified iris dataset to build a decision tree, and I also implemented attribute selection based on the CART algorithm. However, the results of sklearn and our implementation are not the same:
```
6.9,3.1,4.9,1.5,1.0
5.6,2.9,3.6,1.3,1.0
6.1,2.6,5.6,1.4,2.0
4.9,2.5,4.5,1.7,2.0
6.4,2.9,4.3,1.3,1.0
6.1,2.6,5.6,1.4,2.0
5.0,3.3,1.4,0.2,0.0
5.6,2.7,4.2,1.3,1.0
6.5,3.0,5.2,2.0,2.0
5.0,3.4,1.6,0.4,0.0
7.9,3.8,6.4,2.0,2.0
5.0,3.2,1.2,0.2,0.0
7.7,2.6,6.9,2.3,2.0
5.1,3.7,1.5,0.4,0.0
5.0,3.3,1.4,0.2,0.0
4.6,3.4,1.4,0.3,0.0
6.1,2.9,4.7,1.4,1.0
6.0,2.7,5.1,1.6,1.0
6.2,2.9,4.3,1.3,1.0
5.7,2.9,4.2,1.3,1.0
6.8,3.2,5.9,2.3,2.0
6.7,3.0,5.2,2.3,2.0
6.9,3.1,5.4,2.1,2.0
5.0,3.3,1.4,0.2,0.0
5.7,4.4,1.5,0.4,0.0
```
Obviously, sklearn selects the fourth attribute (X[3], zero-indexed) to split the dataset.
The following is the output of our implementation:
```
i: 0 infoGain 0.32862568602758824,bestMid 5.35,Ent 1.2563368146935678
i: 1 infoGain 0.2144132483281731,bestMid 3.15,Ent 1.370549252392983
i: 2 infoGain 0.784962500721156,bestMid 2.6,Ent 0.8
i: 3 infoGain 0.3849625007211561,bestMid 0.85,Ent 1.2
2 2.6
```
Obviously, the third attribute (i=2) should be selected as the first attribute to split the dataset, which is not consistent with the result above.
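One likely source of the mismatch (an assumption, since the sklearn call is not shown here): `DecisionTreeClassifier` defaults to `criterion="gini"` (the CART criterion), while the `infoGain` values above are entropy-based, and when several candidate splits are equally good, sklearn breaks the tie arbitrarily rather than by feature index. A sketch comparing the two criteria on one hypothetical candidate split (the labels, threshold, and helper functions are illustrative, not the original implementation):

```python
import numpy as np

def entropy(y):
    """Shannon entropy (base 2) of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def gain(y, mask, impurity):
    """Impurity decrease for splitting y by a boolean mask,
    with children weighted by their sample fractions."""
    n = len(y)
    left, right = y[mask], y[~mask]
    return (impurity(y)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))

# hypothetical feature values, 3-class labels, and one candidate threshold
x = np.array([1.4, 4.9, 5.6, 1.5, 4.7])
y = np.array([0, 1, 2, 0, 1])
mask = x <= 2.6
print("entropy gain:", gain(y, mask, entropy))
print("gini gain:   ", gain(y, mask, gini))
```

The two criteria assign different numeric scores to the same split, so rankings can differ between an entropy-based implementation and sklearn's default; when comparing against sklearn, set `criterion="entropy"` explicitly and keep tie-breaking in mind.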