Why are some fundamental algorithms like LR, DT, and RF comparable with DES methods on my dataset? #259

Open
chenz1hao opened this issue Oct 28, 2021 · 3 comments

Comments

@chenz1hao

I mean that the DES methods do not improve the metrics on my dataset, and in some cases make them worse.

@Menelau
Collaborator

Menelau commented Oct 29, 2021

Hello,

It is impossible to say why without knowing more about the data and all the methodological steps used to run the algorithms.

Did you normalize all your data before applying dynamic selection? Did you try different approaches, such as DES based on clustering, to see whether that gives you better performance?
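To illustrate why normalization matters here: many dynamic selection techniques estimate classifier competence on the nearest neighbours of the query sample in the DSEL set, so a single large-scale feature can dominate the distance. Below is a minimal pure-Python sketch with made-up points. The helper names `standardize` and `nearest` are hypothetical and not part of DESlib, and the numbers are chosen only for illustration.

```python
import math
from statistics import mean, pstdev


def standardize(rows, reference):
    """Z-score each column of `rows` using the mean/std of `reference`."""
    columns = list(zip(*reference))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def nearest(query, points):
    """Index of the point closest to `query` (Euclidean distance)."""
    return min(range(len(points)), key=lambda i: math.dist(query, points[i]))


# Made-up DSEL points: feature 0 lives in [0, 1], feature 1 in the thousands.
dsel = [[0.10, 5400.0], [0.90, 5010.0], [0.50, 4600.0]]
query = [0.12, 5000.0]

# Raw distances are driven almost entirely by the large-scale feature 1.
raw_idx = nearest(query, dsel)

# After scaling with statistics from the reference set, both features count.
scaled_idx = nearest(standardize([query], dsel)[0], standardize(dsel, dsel))

print("nearest neighbour on raw features:   ", raw_idx)
print("nearest neighbour on scaled features:", scaled_idx)
```

The two calls pick different neighbours, which means the region of competence (and therefore the selected classifiers) can change depending on whether the data were scaled.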

@chenz1hao
Author

chenz1hao commented Oct 31, 2021

Dataset: http://bit.ly/xMLdataset (a binary classification task). I ran logistic regression (from sklearn) on this dataset and compared it with DES methods (code copied from the documentation). I applied no normalization or other preprocessing; the original dataset was simply split into training and test sets. I found no obvious performance improvement from using the DES methods.
Maybe you could try this dataset yourself. Thank you very much.
The code and result details are as follows:

@chenz1hao
Author

chenz1hao commented Nov 1, 2021

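import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from deslib.des import METADES
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def AUC_plot(algorithmName, test_y, pred_y_prob):
    # print(algorithmName, "plotting the AUC curve:")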
    fpr, tpr, thresholds = roc_curve(test_y, pred_y_prob)
    auc = roc_auc_score(test_y, pred_y_prob)
    plt.plot(fpr, tpr)
    plt.title(algorithmName+" AUC=%.4f" % (auc))
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.fill_between(fpr, tpr, where=(tpr > 0), color='green', alpha=0.5)
    plt.show()


# Print the performance metrics of an algorithm
def print_performance(algorithm_name, test_y, pred_y, pred_y_prob):
    # TP (True Positive): predicted 1, actual 1
    # FN (False Negative): predicted 0, actual 1
    # FP (False Positive): predicted 1, actual 0
    # TN (True Negative): predicted 0, actual 0

    TP = []
    FN = []
    FP = []
    TN = []

    for i in range(len(pred_y)):
        if pred_y[i] == 1 and test_y[i] == 1:
            TP.append(i)
        elif pred_y[i] == 0 and test_y[i] == 1:
            FN.append(i)
        elif pred_y[i] == 1 and test_y[i] == 0:
            FP.append(i)
        elif pred_y[i] == 0 and test_y[i] == 0:
            TN.append(i)

    accuracy = (len(TP)+len(TN))/(len(TP)+len(FP)+len(TN)+len(FN))
    precision = len(TP) / (len(TP) + len(FP))
    recall = len(TP) / (len(TP) + len(FN))
    F1_score = 2 * ((precision*recall)/(precision+recall))
    print(algorithm_name, ':')
    print('Accuracy:', accuracy)
    print('Precision:', precision)
    print('Recall:', recall)
    print('F1-SCORE:', F1_score)
    AUC_plot(algorithm_name, test_y, pred_y_prob)
    print('\n')

if __name__ == '__main__':
    dataset = pd.read_csv('data/heloc_dataset_v2.csv')
    X_train, X_test, y_train, y_test = train_test_split(dataset.drop(['target'],axis=1), dataset['target'], test_size=0.30, random_state=666)
    com_lr = LogisticRegression(max_iter=10000)
    com_lr.fit(X_train, y_train)
    print_performance('LR compare', np.array(y_test), com_lr.predict(X_test), com_lr.predict_proba(X_test)[:,1])
    pool_classifiers = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                                         n_estimators=100,
                                         random_state=666)
    X_train, X_dsel, y_train, y_dsel = train_test_split(X_train, y_train,
                                                        test_size=0.50,
                                                        random_state=666)
    pool_classifiers.fit(X_train, y_train)
    meta = METADES(pool_classifiers, random_state=666)
    names = ['META-DES']
    methods = [meta]
    # Fit the DS techniques
    scores = []
    for method, name in zip(methods, names):
        method.fit(X_dsel, y_dsel)
        scores.append(method.score(X_test, y_test))
        print_performance(name, np.array(y_test), method.predict(X_test), method.predict_proba(X_test)[:,1])

[image: screenshot of the results]

As you can see from the picture above, where LR is scikit-learn's logistic regression, META-DES is worse than logistic regression on nearly all of the metrics. I wonder how this could happen?
