User Should Have An Option To Assign Different Criteria To Different Percentages Of Trees In Random Forest #28970

Closed
thesahibnanda opened this issue May 7, 2024 · 3 comments
Labels: Needs Triage, New Feature

Comments


thesahibnanda commented May 7, 2024

Describe the workflow you want to enable

Detailed Explanation Of Proposed Workflow

Users can specify what percentage of trees in sklearn.ensemble.RandomForestClassifier & sklearn.ensemble.RandomForestRegressor will follow which split criterion.

Advantages Of Implementing The Above Functionality

Better results can be achieved in certain domains, and this feature will help researchers.

Describe your proposed solution

This is how the feature would look on the user's end in Python 3:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pass multiple criteria as a dict mapping criterion -> fraction of trees (proposed API)
n_estimators = 100

rfc = RandomForestClassifier(n_estimators=n_estimators, criterion={"gini": 0.4, "entropy": 0.3, "random": 0.3}, random_state=42)

# Model training
rfc.fit(X_train, y_train)

# Prediction
print(rfc.predict(X_test))

Explanation Of The Above Code

After this feature is implemented, the criterion parameter will also accept a dict, where each key is a criterion and the corresponding value is the percentage of trees that should use it.
If the sum of all values is less than 1, the remaining trees will follow the default criterion,
and if it is more than 1, an error will be raised.

In the above code, besides gini and entropy, there is also a random criterion: each tree assigned to random is given a criterion chosen at random. A sketch of this expansion logic is shown below.
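Here is a minimal sketch (not part of scikit-learn) of how such a dict could be expanded into one criterion per tree; the function name, the criteria pool, and the rounding behaviour are all assumptions:

import random

# Hypothetical pool a "random" tree could draw from; scikit-learn's
# classifier criteria are "gini", "entropy" and "log_loss"
VALID_CRITERIA = ["gini", "entropy", "log_loss"]

def expand_criterion_dict(criterion, n_estimators, default="gini", seed=None):
    """Expand {"gini": 0.4, "entropy": 0.3, "random": 0.3} into a per-tree list."""
    if sum(criterion.values()) > 1:
        raise ValueError("criterion percentages must not sum to more than 1")
    rng = random.Random(seed)
    per_tree = []
    for name, frac in criterion.items():
        n = round(frac * n_estimators)
        if name == "random":
            # each "random" tree independently draws its own criterion
            per_tree += [rng.choice(VALID_CRITERIA) for _ in range(n)]
        else:
            per_tree += [name] * n
    # trees not covered by the dict follow the default criterion;
    # truncate in case rounding overshoots the tree budget
    per_tree += [default] * max(0, n_estimators - len(per_tree))
    return per_tree[:n_estimators]

print(expand_criterion_dict({"gini": 0.4, "entropy": 0.3, "random": 0.3}, 100, seed=0))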

Describe alternatives you've considered, if relevant

Alternative Code Using np.argmax

import numpy as np
from scipy.stats import mode
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Random Forest classifiers with Gini and Entropy
n_estimators = 100

rf_gini = RandomForestClassifier(n_estimators=n_estimators // 2, criterion="gini", random_state=42)
rf_entropy = RandomForestClassifier(n_estimators=n_estimators // 2, criterion="entropy", random_state=42)

# Fit the models
rf_gini.fit(X_train, y_train)
rf_entropy.fit(X_train, y_train)

# Predict with both models
pred_gini = rf_gini.predict(X_test)
pred_entropy = rf_entropy.predict(X_test)

# Average the predictions (convert to probabilities first, then average, then take the argmax)
avg_proba = (rf_gini.predict_proba(X_test) + rf_entropy.predict_proba(X_test)) / 2
avg_pred = np.argmax(avg_proba, axis=1)

# Majority voting (count the most common prediction per sample);
# keepdims=True keeps a (1, n_samples) result across SciPy >= 1.9, and with
# only two voters ties are broken toward the smaller class label
majority_pred = mode(np.vstack([pred_gini, pred_entropy]), axis=0, keepdims=True).mode[0]

# Evaluate accuracy
accuracy_avg = accuracy_score(y_test, avg_pred)
accuracy_majority = accuracy_score(y_test, majority_pred)

print(f"Accuracy using averaged probabilities: {accuracy_avg:.2f}")
print(f"Accuracy using majority voting: {accuracy_majority:.2f}")

Additional context

I will be able to make a PR once this issue gets approval.

@thesahibnanda added the Needs Triage and New Feature labels on May 7, 2024
@lorentzenchr (Member) commented

I would say this is solved by the VotingClassifier, see https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier. Each tree in a RF is independent anyway, so you can train two RFs with the criteria of your choice and then combine them with the VotingClassifier.
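For concreteness, a minimal sketch of that workaround (the 40/60 split of the tree budget is arbitrary):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# One forest per criterion
rf_gini = RandomForestClassifier(n_estimators=40, criterion="gini", random_state=42)
rf_entropy = RandomForestClassifier(n_estimators=60, criterion="entropy", random_state=42)

# Soft voting averages predicted class probabilities across the two forests,
# much like a single forest averages over its trees
clf = VotingClassifier(estimators=[("gini", rf_gini), ("entropy", rf_entropy)], voting="soft")
clf.fit(X_train, y_train)
print(clf.predict(X_test))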

@thesahibnanda (Author) commented

But currently we can't give random as a criterion. My vision is to add random as a criterion, where each tree in the RF could get a random criterion, adding more randomness to the RF.
I could make the PR; it can be done easily using Python's standard random module.

@lorentzenchr (Member) commented

You can randomly choose the number of trees in the random forests that you pass to VotingClassifier; a sketch is below.
If you want the criterion chosen randomly in each split of a tree, then in my opinion that's out of scope for scikit-learn, unless there are strong reasons for inclusion; see https://scikit-learn.org/stable/faq.html#id19.
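A minimal sketch of randomly splitting the tree budget among criteria with the random module before combining via VotingClassifier; the budget and criteria list are assumptions ("log_loss" requires scikit-learn >= 1.1):

import random
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

n_estimators = 100
criteria = ["gini", "entropy", "log_loss"]

# Randomly cut the total tree budget into one chunk per criterion
rng = random.Random(42)
cuts = sorted(rng.sample(range(1, n_estimators), len(criteria) - 1))
counts = [b - a for a, b in zip([0] + cuts, cuts + [n_estimators])]

clf = VotingClassifier(
    estimators=[
        (crit, RandomForestClassifier(n_estimators=n, criterion=crit, random_state=42))
        for crit, n in zip(criteria, counts)
    ],
    voting="soft",
)
# then clf.fit(X_train, y_train) / clf.predict(X_test) as in the earlier snippets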
