User Should Have An Option To Assign Different Criteria To Different Percentages Of Trees In Random Forest #28970

Closed
thesahibnanda opened this issue May 7, 2024 · 3 comments
Labels: Needs Triage, New Feature

Comments


thesahibnanda commented May 7, 2024

Describe the workflow you want to enable

Detailed Explanation Of Proposed Workflow

Users can specify what percentage of trees in sklearn.ensemble.RandomForestClassifier & sklearn.ensemble.RandomForestRegressor will follow which split criterion.

Advantages Of Implementing The Above Functionality

Better results can be achieved in certain domains, and this feature will help researchers.

Describe your proposed solution

This is how the feature would look on the user's end in Python 3:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pass multiple criteria as a dict mapping criterion -> fraction of trees (proposed API)
n_estimators = 100

rfc = RandomForestClassifier(n_estimators=n_estimators, criterion={"gini": 0.4, "entropy": 0.3, "random": 0.3}, random_state=42)

# Model training
rfc.fit(X_train, y_train)

# Prediction
print(rfc.predict(X_test))

Explanation Of The Above Code

After this feature is implemented, the criterion parameter will also accept a dict, where each key is a criterion and the corresponding value is the percentage of trees that should use it.
If the sum of all values is less than 1, the remaining trees will follow the default criterion,
and if it is more than 1, an error will be raised.

In the above code, besides gini and entropy, there is also a random criterion: each tree assigned to random is given a criterion chosen at random. A sketch of this expansion logic is shown below.
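Here is a minimal sketch (not part of scikit-learn) of how such a dict could be expanded into one criterion per tree; the function name, the criteria pool, and the rounding behaviour are all assumptions:

import random

# Hypothetical pool a "random" tree could draw from; scikit-learn's
# classifier criteria are "gini", "entropy" and "log_loss"
VALID_CRITERIA = ["gini", "entropy", "log_loss"]

def expand_criterion_dict(criterion, n_estimators, default="gini", seed=None):
    """Expand {"gini": 0.4, "entropy": 0.3, "random": 0.3} into a per-tree list."""
    if sum(criterion.values()) > 1:
        raise ValueError("criterion percentages must not sum to more than 1")
    rng = random.Random(seed)
    per_tree = []
    for name, frac in criterion.items():
        n = round(frac * n_estimators)
        if name == "random":
            # each "random" tree independently draws its own criterion
            per_tree += [rng.choice(VALID_CRITERIA) for _ in range(n)]
        else:
            per_tree += [name] * n
    # trees not covered by the dict follow the default criterion;
    # truncate in case rounding overshoots the tree budget
    per_tree += [default] * max(0, n_estimators - len(per_tree))
    return per_tree[:n_estimators]

print(expand_criterion_dict({"gini": 0.4, "entropy": 0.3, "random": 0.3}, 100, seed=0))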

Describe alternatives you've considered, if relevant

Alternative Code Using np.argmax

import numpy as np
from scipy.stats import mode
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Random Forest classifiers with Gini and Entropy
n_estimators = 100

rf_gini = RandomForestClassifier(n_estimators=n_estimators // 2, criterion="gini", random_state=42)
rf_entropy = RandomForestClassifier(n_estimators=n_estimators // 2, criterion="entropy", random_state=42)

# Fit the models
rf_gini.fit(X_train, y_train)
rf_entropy.fit(X_train, y_train)

# Predict with both models
pred_gini = rf_gini.predict(X_test)
pred_entropy = rf_entropy.predict(X_test)

# Average the predictions (convert to probabilities first, then average, then take the argmax)
avg_proba = (rf_gini.predict_proba(X_test) + rf_entropy.predict_proba(X_test)) / 2
avg_pred = np.argmax(avg_proba, axis=1)

# Majority voting (count the most common prediction per sample);
# keepdims=True keeps a (1, n_samples) result across SciPy >= 1.9, and with
# only two voters ties are broken toward the smaller class label
majority_pred = mode(np.vstack([pred_gini, pred_entropy]), axis=0, keepdims=True).mode[0]

# Evaluate accuracy
accuracy_avg = accuracy_score(y_test, avg_pred)
accuracy_majority = accuracy_score(y_test, majority_pred)

print(f"Accuracy using averaged probabilities: {accuracy_avg:.2f}")
print(f"Accuracy using majority voting: {accuracy_majority:.2f}")

Additional context

I will be able to make a PR once this issue gets approval.

@thesahibnanda added the Needs Triage and New Feature labels on May 7, 2024
@lorentzenchr (Member) commented

I would say this is solved by the VotingClassifier, see https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier. Each tree in a RF is independent anyway, so you can train two RFs with the criteria of your choice and then combine them with the VotingClassifier.
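For concreteness, a minimal sketch of that workaround (the 40/60 split of the tree budget is arbitrary):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# One forest per criterion
rf_gini = RandomForestClassifier(n_estimators=40, criterion="gini", random_state=42)
rf_entropy = RandomForestClassifier(n_estimators=60, criterion="entropy", random_state=42)

# Soft voting averages predicted class probabilities across the two forests,
# much like a single forest averages over its trees
clf = VotingClassifier(estimators=[("gini", rf_gini), ("entropy", rf_entropy)], voting="soft")
clf.fit(X_train, y_train)
print(clf.predict(X_test))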

@thesahibnanda (Author) commented

But currently we can't give random as a criterion. My vision is to add random as a criterion, where each tree in the RF could get a random criterion, adding more randomness to the RF.
I could make the PR; it can be done easily using Python's standard random module.

@lorentzenchr (Member) commented

You can randomly choose the number of trees in the random forests that you pass to VotingClassifier; a sketch is below.
If you want the criterion chosen randomly in each split of a tree, then in my opinion that's out of scope for scikit-learn, unless there are strong reasons for inclusion; see https://scikit-learn.org/stable/faq.html#id19.
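A minimal sketch of randomly splitting the tree budget among criteria with the random module before combining via VotingClassifier; the budget and criteria list are assumptions ("log_loss" requires scikit-learn >= 1.1):

import random
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

n_estimators = 100
criteria = ["gini", "entropy", "log_loss"]

# Randomly cut the total tree budget into one chunk per criterion
rng = random.Random(42)
cuts = sorted(rng.sample(range(1, n_estimators), len(criteria) - 1))
counts = [b - a for a, b in zip([0] + cuts, cuts + [n_estimators])]

clf = VotingClassifier(
    estimators=[
        (crit, RandomForestClassifier(n_estimators=n, criterion=crit, random_state=42))
        for crit, n in zip(criteria, counts)
    ],
    voting="soft",
)
# then clf.fit(X_train, y_train) / clf.predict(X_test) as in the earlier snippets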
