shivin9/CAC

Don't Just Divide; Polarize and Conquer!

Python implementation of

We design a classification algorithm called Clustering Aware Classification (CAC) that finds clusters in the data that are tailor-made to be easily classifiable when used as training sets for the classifiers of each underlying subpopulation. CAC is theoretically motivated, efficient, convergent, and provably guaranteed to improve the performance of classifiers that use the logistic loss function. The CAC framework improves the performance of 9 different machine learning classifiers on 5 standard datasets and 1 large real-world dataset.

CAC problem setting. Data points (here p1) are selected iteratively and assigned to clusters (here C1) based on the cluster update equations. At testing time, x* is assigned to the cluster that lies nearest to x*.
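
The test-time rule in the caption can be sketched as a nearest-center lookup. This is a minimal illustration, assuming Euclidean distance; the helper name `assign_to_nearest_cluster` is hypothetical and is not taken from the repository.

```python
import numpy as np


def assign_to_nearest_cluster(X, centers):
    """Return, for each row of X, the index of the closest cluster center.

    Assumes Euclidean distance; the helper name is hypothetical.
    """
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise squared distances, shape (n_samples, n_clusters).
    diffs = X[:, None, :] - centers[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    return np.argmin(sq_dists, axis=1)


centers = np.array([[0.0, 0.0], [5.0, 5.0]])
x_star = np.array([[4.0, 4.5], [0.5, -1.0]])
nearest = assign_to_nearest_cluster(x_star, centers)
```

Each test point would then be scored by the classifier trained on the cluster it was assigned to.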

Datasets

  1. Titanic
  2. Magic
  3. Adult
  4. Creditcard
  5. Diabetes
  6. CIC Mortality Prediction

Demo

Run python3 CAC_experiments.py --init KM --verbose False --classifier LR --dataset adult --cv False --alpha 0.01

Command line options

  • --dataset: The dataset to be used for training. Choices: 'adult', 'credit', 'titanic', 'magic', 'cic' (default: 'ALL')
  • --alpha: The alpha value to be used (default: 0.01)
  • --classifier: The base classifier to be used with CAC (default: LR)
  • --cv: Test CAC with 5-fold Cross Validation (default: False)
  • --verbose: Train base classifier on every intermediate iteration (default: False)
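
Because the boolean options are passed as text (`--cv False`), a parser needs an explicit string-to-bool conversion. The sketch below shows one way the listed options could be parsed with `argparse`. It is an assumption about how such a script might be structured, not the code in `CAC_experiments.py`, and it leaves out the `--init` flag from the demo command because that flag is not described in the list.

```python
import argparse


def str2bool(value):
    """Parse 'True'/'False'-style strings, since flags arrive as text."""
    if isinstance(value, bool):
        return value
    if value.lower() in ("true", "t", "yes", "1"):
        return True
    if value.lower() in ("false", "f", "no", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    # Hypothetical parser mirroring the documented options and defaults.
    parser = argparse.ArgumentParser(description="CAC experiments (sketch)")
    parser.add_argument("--dataset", default="ALL")
    parser.add_argument("--alpha", type=float, default=0.01)
    parser.add_argument("--classifier", default="LR")
    parser.add_argument("--cv", type=str2bool, default=False)
    parser.add_argument("--verbose", type=str2bool, default=False)
    return parser


args = build_parser().parse_args(
    ["--classifier", "LR", "--dataset", "adult", "--cv", "False", "--alpha", "0.01"]
)
```

Using `type=bool` instead would treat the string "False" as true, which is why a helper such as `str2bool` is used here.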

Input File Format

CAC expects every dataset to be in its own folder within the data folder. In each folder, X.csv is the comma-separated data file and y.csv contains the corresponding binary labels for the data points.
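
A loader for this layout might look like the sketch below. The function name `load_dataset` is hypothetical, and the code assumes purely numeric CSV files without a header row, which is not stated here and may not hold for every dataset.

```python
import os

import numpy as np


def load_dataset(name, data_dir="data"):
    """Load <data_dir>/<name>/X.csv and y.csv as NumPy arrays.

    Assumes numeric CSV files with no header row (an assumption).
    """
    folder = os.path.join(data_dir, name)
    X = np.loadtxt(os.path.join(folder, "X.csv"), delimiter=",", ndmin=2)
    y = np.loadtxt(os.path.join(folder, "y.csv"), delimiter=",").astype(int).ravel()
    if X.shape[0] != y.shape[0]:
        raise ValueError("X.csv and y.csv must have the same number of rows")
    if np.unique(y).size > 2:
        raise ValueError("y.csv must contain binary labels")
    return X, y
```

For example, `load_dataset("adult")` would read `data/adult/X.csv` and `data/adult/y.csv`.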

API

Training Phase:

import numpy as np

clf = CAC(n_clusters, alpha, beta=-np.inf, n_epochs=100, classifier="LR", decay="fixed", init="KM", verbose=False)
clf.fit(X_train, y_train)

Parameters:

  • X_train: array-like of shape (n_samples, n_features)
    • Training Samples.
  • y_train: array-like of shape (n_samples,)
    • Training Labels.
  • n_clusters: Int
    • Number of clusters.
  • alpha: Float
    • Learning Rate of CAC.
  • beta: Float
    • The maximum allowed decrease in CAC cost function.
  • n_epochs: Int
    • Number of training epochs.
  • classifier: The choice of base classifier. Choose from
    • LR: Logistic Regression (default)
    • RF: Random Forest with 10 estimators
    • SVM: Linear SVM
    • Perceptron: Linear Perceptron
    • DT: Decision Tree
    • Ridge: Ridge Classifier
    • SGD: Stochastic Gradient Descent classifier
    • LDA: Fisher's LDA classifier
    • KNN: k-Nearest Neighbour (k=5)
    • NB: Naive Bayes Classifier
  • decay: {"inv", "fixed", "exp"}
    • Decay strategy for alpha.
  • init: {"RAND", "KM"}
    • Initialization scheme. "RAND": random initialization, "KM": k-means initialization.
  • verbose: bool
    • Parameter to control whether to train models at every intermediate iteration.
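
The decay options name a schedule for alpha but do not give its formula. The sketch below uses common textbook definitions of fixed, inverse, and exponential decay to show what such a schedule could look like. The formulas, the `rate` parameter, and the function name `decayed_alpha` are all illustrative assumptions and may differ from what CAC actually implements.

```python
import math


def decayed_alpha(alpha0, epoch, decay="fixed", rate=0.1):
    """Return the learning rate for a given 0-based epoch.

    The formulas and the `rate` parameter are illustrative assumptions.
    """
    if decay == "fixed":
        return alpha0  # constant learning rate
    if decay == "inv":
        return alpha0 / (1.0 + rate * epoch)  # inverse-time decay
    if decay == "exp":
        return alpha0 * math.exp(-rate * epoch)  # exponential decay
    raise ValueError(f"unknown decay strategy: {decay!r}")


schedule = [decayed_alpha(0.01, epoch, decay="inv") for epoch in range(3)]
```

With "fixed", the default shown in the constructor call, alpha stays at its initial value for all epochs.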

Attributes:

  • k: n_clusters.
  • alpha: alpha.
  • beta: beta.
  • classifier: Classifier used.
  • decay: decay.
  • init: Initialization scheme.
  • verbose: verbose parameter.
  • n_epochs: n_epochs.
  • centers: An array of cluster centers at every iteration of CAC.
  • cluster_stats: An array containing counts of +ve and -ve class points in every cluster.
  • models: An array containing trained models on the initial and final clusters.
  • scores: Training scores (Accuracy, F1, AUC, Sensitivity, Specificity) of models trained at intermediate iterations.
  • labels: Cluster labels of points at every iteration.
  • clustering_loss: Total CAC loss at every iteration.
  • classification_loss: Total classification loss (log-loss) at every iteration.
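
To make the `cluster_stats` attribute concrete, the sketch below counts class-0 and class-1 points per cluster from a cluster-label vector. The array layout (one row per cluster, columns ordered negative then positive), the 0/1 label encoding, and the helper name `compute_cluster_stats` are assumptions; the layout CAC actually stores may differ.

```python
import numpy as np


def compute_cluster_stats(labels, y, n_clusters):
    """Count negative and positive points per cluster.

    Returns an array of shape (n_clusters, 2): column 0 holds the
    class-0 count and column 1 the class-1 count (assumed layout).
    Assumes labels in {0, ..., n_clusters - 1} and y in {0, 1}.
    """
    labels = np.asarray(labels, dtype=int)
    y = np.asarray(y, dtype=int)
    stats = np.zeros((n_clusters, 2), dtype=int)
    # np.add.at accumulates correctly even when index pairs repeat.
    np.add.at(stats, (labels, y), 1)
    return stats


labels = np.array([0, 0, 1, 1, 1, 2])
y = np.array([1, 0, 1, 1, 0, 0])
stats = compute_cluster_stats(labels, y, n_clusters=3)
```

A cluster with a zero in either column contains a single class, so some base classifiers cannot be fitted to it.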

Methods:

  • fit(X_train, y_train): Fit the model according to the given training data.
  • predict(X_test, ITERATION): Predict class labels and confidence scores for samples at the specified iteration.

Output:

The trained model.

Testing/Evaluation Phase:

from sklearn.metrics import f1_score, roc_auc_score

y_pred, y_proba = clf.predict(X_test, ITERATION)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_proba)

Input:

  • X_test: array-like of shape (n_samples, n_features). Testing Samples.
  • y_test: array-like of shape (n_samples,). Testing Labels.
  • ITERATION: Int. The iteration whose predictions are returned.

Output:

  • test_scores: A tuple (y_pred, y_proba) containing the predicted labels and the confidence score of each prediction.
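
The five scores listed under Attributes (Accuracy, F1, AUC, Sensitivity, Specificity) can be computed from such a prediction tuple with scikit-learn. The sketch below assumes 0/1 labels and that the confidence scores are positive-class probabilities. The `evaluate` helper is hypothetical, and the arrays are toy data standing in for the output of `clf.predict`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score


def evaluate(y_true, y_pred, y_proba):
    """Compute five binary-classification scores for 0/1 labels.

    Assumes both classes occur in y_true; otherwise the ratios
    below would divide by zero.
    """
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_proba),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Toy data standing in for y_test and the output of clf.predict.
y_test = np.array([0, 0, 1, 1, 1, 0])
y_proba = np.array([0.1, 0.6, 0.8, 0.4, 0.9, 0.2])
y_pred = (y_proba >= 0.5).astype(int)
scores = evaluate(y_test, y_pred, y_proba)
```

The scikit-learn metric functions take the true labels first and the predictions second.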
