-
I simplified the example to be able to get some insight:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

df = pd.read_csv("data.csv", header=None)
X = df.iloc[:18, :-1].to_numpy()
y = df.iloc[:18, -1].to_numpy()

_, axs = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))
for ax, feature_idx in zip(axs.ravel(), range(X.shape[1])):
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=1, random_state=0)
    tree.fit(X[:, [feature_idx]], y)
    plot_tree(tree, ax=ax)
    gain = tree.tree_.impurity[0] - tree.tree_.impurity[1:].sum()
    ax.set_title(f"Split along feature #{feature_idx}\nGain in terms of entropy:\n{gain}")
```

Features #2 and #3 (zero-indexed) lead to the same gain, so choosing one or the other would be the same.
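Note that `impurity[0] - impurity[1:].sum()` subtracts the raw child entropies; the textbook information gain weights each child's entropy by its sample fraction. A sketch of the weighted version, using a hypothetical toy dataset (the `weighted_entropy_gain` helper is not part of the original code):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def weighted_entropy_gain(clf):
    """Information gain at the root of a depth-1 tree: parent entropy
    minus the child entropies weighted by their sample fractions."""
    t = clf.tree_
    n = t.weighted_n_node_samples
    # in a depth-1 tree, nodes 1 and 2 are the root's children
    children = (n[1] * t.impurity[1] + n[2] * t.impurity[2]) / n[0]
    return t.impurity[0] - children

# toy check: a perfectly separating feature recovers the full parent entropy
X = np.array([[0.0], [0.1], [1.0], [1.1]])
y = np.array([0, 0, 1, 1])
clf = DecisionTreeClassifier(criterion="entropy", max_depth=1).fit(X, y)
print(weighted_entropy_gain(clf))  # parent entropy is 1 bit, children are pure
```

Either formula ranks the four single-feature stumps the same way here, but the weighted version is what the standard information-gain definition prescribes.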
-
I used a simplified iris dataset to build a decision tree, and I also implemented attribute selection based on the CART algorithm. However, the results of sklearn and our implementation are not the same:
```
6.9,3.1,4.9,1.5,1.0
5.6,2.9,3.6,1.3,1.0
6.1,2.6,5.6,1.4,2.0
4.9,2.5,4.5,1.7,2.0
6.4,2.9,4.3,1.3,1.0
6.1,2.6,5.6,1.4,2.0
5.0,3.3,1.4,0.2,0.0
5.6,2.7,4.2,1.3,1.0
6.5,3.0,5.2,2.0,2.0
5.0,3.4,1.6,0.4,0.0
7.9,3.8,6.4,2.0,2.0
5.0,3.2,1.2,0.2,0.0
7.7,2.6,6.9,2.3,2.0
5.1,3.7,1.5,0.4,0.0
5.0,3.3,1.4,0.2,0.0
4.6,3.4,1.4,0.3,0.0
6.1,2.9,4.7,1.4,1.0
6.0,2.7,5.1,1.6,1.0
6.2,2.9,4.3,1.3,1.0
5.7,2.9,4.2,1.3,1.0
6.8,3.2,5.9,2.3,2.0
6.7,3.0,5.2,2.3,2.0
6.9,3.1,5.4,2.1,2.0
5.0,3.3,1.4,0.2,0.0
5.7,4.4,1.5,0.4,0.0
```
Obviously, sklearn selects the fourth attribute (X[3], zero-indexed) to split the dataset.
The following is the output of our implementation:
```
i: 0 infoGain 0.32862568602758824,bestMid 5.35,Ent 1.2563368146935678
i: 1 infoGain 0.2144132483281731,bestMid 3.15,Ent 1.370549252392983
i: 2 infoGain 0.784962500721156,bestMid 2.6,Ent 0.8
i: 3 infoGain 0.3849625007211561,bestMid 0.85,Ent 1.2
2 2.6
```
Obviously, the third attribute (i=2) should be selected as the first attribute to split the dataset, which is not consistent with the result above.
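One likely source of the mismatch (an assumption, since the sklearn call is not shown here): `DecisionTreeClassifier` defaults to `criterion="gini"` (the CART criterion), while the `infoGain` values above are entropy-based, and when several candidate splits are equally good, sklearn breaks the tie arbitrarily rather than by feature index. A sketch comparing the two criteria on one hypothetical candidate split (the labels, threshold, and helper functions are illustrative, not the original implementation):

```python
import numpy as np

def entropy(y):
    """Shannon entropy (base 2) of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def gain(y, mask, impurity):
    """Impurity decrease for splitting y by a boolean mask,
    with children weighted by their sample fractions."""
    n = len(y)
    left, right = y[mask], y[~mask]
    return (impurity(y)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))

# hypothetical feature values, 3-class labels, and one candidate threshold
x = np.array([1.4, 4.9, 5.6, 1.5, 4.7])
y = np.array([0, 1, 2, 0, 1])
mask = x <= 2.6
print("entropy gain:", gain(y, mask, entropy))
print("gini gain:   ", gain(y, mask, gini))
```

The two criteria assign different numeric scores to the same split, so rankings can differ between an entropy-based implementation and sklearn's default; when comparing against sklearn, set `criterion="entropy"` explicitly and keep tie-breaking in mind.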