
Don't Just Divide; Polarize and Conquer!

Python implementation of

Don't Just Divide; Polarize and Conquer! Shivin Srivastava, Siddharth Bhatia, Lingxiao Huang, Jun Heng Lim, Kenji Kawaguchi, Vaibhav Rajan. (Under Review)

We design a classification algorithm called Clustering Aware Classification (CAC) that finds clusters in the data that are tailor-made to be easily classifiable when used as training sets for the classifiers of each underlying subpopulation. CAC is theoretically motivated, efficient, and convergent, and it is provably guaranteed to improve the performance of classifiers that use the logistic loss function. The CAC framework improves the performance of 9 different machine learning classifiers on 5 standard datasets and 1 large real-world dataset.

CAC problem setting. Data points (here p₁) are selected iteratively and assigned to clusters (here C₁) based on the cluster update equations. At testing time, x^* is assigned to the cluster that lies nearest to x^*.
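The test-time rule in the caption (route x^* to the nearest cluster, then let that cluster's classifier decide) can be sketched as follows. This is a minimal illustration, not the code in CAC.py: Euclidean distance is an assumption, and `nearest_cluster`, `predict_one`, and the toy `models` are hypothetical names made up for this example.

```python
import math

def nearest_cluster(x, centers):
    """Return the index of the center closest to x (Euclidean distance assumed)."""
    distances = [math.dist(x, c) for c in centers]
    return min(range(len(centers)), key=distances.__getitem__)

def predict_one(x, centers, models):
    """Route x to its nearest cluster and apply that cluster's classifier.

    `models` is a hypothetical list of callables, one per cluster.
    """
    j = nearest_cluster(x, centers)
    return j, models[j](x)

centers = [(0.0, 0.0), (5.0, 5.0)]
# Stand-ins for trained per-cluster classifiers (assumption for illustration).
models = [lambda x: 0, lambda x: 1]

cluster, label = predict_one((4.0, 6.0), centers, models)
print(cluster, label)
```
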

Datasets

Demo

Run python3 CAC_experiments.py --init KM --verbose False --classifier LR --dataset adult --cv False --alpha 0.01

Command line options

--dataset: The dataset to be used for training. Choices: 'adult', 'credit', 'titanic', 'magic', 'cic' (default: 'ALL')
--alpha: The alpha value to be used (default: 0.01)
--classifier: The base classifier to be used with CAC (default: LR)
--cv: Test CAC with 5-fold Cross Validation (default: False)
--verbose: Train base classifier on every intermediate iteration (default: False)
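Because the demo passes booleans as the strings True and False, a parser for these options has to convert them explicitly; argparse's type=bool would treat the string "False" as truthy. The sketch below shows one way such options could be parsed. It is not the parser used in CAC_experiments.py: `str2bool` and `build_parser` are hypothetical helpers, and the default for --init is an assumption taken from the API signature.

```python
import argparse

def str2bool(value):
    """Parse 'True'/'False' style strings into real booleans."""
    if isinstance(value, bool):
        return value
    lowered = value.strip().lower()
    if lowered in ("true", "t", "yes", "1"):
        return True
    if lowered in ("false", "f", "no", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")

def build_parser():
    """Hypothetical parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(description="CAC experiment options (sketch)")
    parser.add_argument("--dataset", default="ALL")
    parser.add_argument("--alpha", type=float, default=0.01)
    parser.add_argument("--classifier", default="LR")
    parser.add_argument("--cv", type=str2bool, default=False)
    parser.add_argument("--verbose", type=str2bool, default=False)
    # --init appears in the demo command; the default here is an assumption.
    parser.add_argument("--init", default="KM")
    return parser

demo = "--init KM --verbose False --classifier LR --dataset adult --cv False --alpha 0.01"
args = build_parser().parse_args(demo.split())
print(args.dataset, args.alpha, args.cv, args.verbose)
```
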

Input File Format

CAC expects every dataset in its own folder within the data folder. In each dataset folder, X.csv is the comma-separated data file and y.csv contains the corresponding binary labels for the data points.
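A dataset laid out this way could be read with a few lines of standard-library Python. This is only a sketch under stated assumptions: `load_dataset` is a hypothetical helper (not part of the repository), the files are assumed to be purely numeric, and whether they carry a header row is not stated above, so it is exposed as the `has_header` flag.

```python
import csv
import tempfile
from pathlib import Path

def load_dataset(name, data_root="data", has_header=False):
    """Read <data_root>/<name>/X.csv and y.csv into Python lists."""
    folder = Path(data_root) / name
    with open(folder / "X.csv", newline="") as f:
        rows = [row for row in csv.reader(f) if row]
    with open(folder / "y.csv", newline="") as f:
        label_rows = [row for row in csv.reader(f) if row]
    if has_header:  # assumption: at most one header line per file
        rows, label_rows = rows[1:], label_rows[1:]
    X = [[float(v) for v in row] for row in rows]
    y = [int(float(row[0])) for row in label_rows]
    if len(X) != len(y):
        raise ValueError(f"{len(X)} samples but {len(y)} labels in {folder}")
    return X, y

# Build a tiny throwaway dataset to show the expected layout.
with tempfile.TemporaryDirectory() as root:
    toy = Path(root) / "toy"
    toy.mkdir()
    (toy / "X.csv").write_text("1.0,2.0\n3.0,4.0\n")
    (toy / "y.csv").write_text("0\n1\n")
    X, y = load_dataset("toy", data_root=root)

print(X, y)
```

The returned lists can be wrapped with np.asarray before being passed to a classifier.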

API

Training Phase:

clf = CAC(n_clusters, alpha, beta=-np.inf, n_epochs=100, classifier="LR", decay="fixed", init="KM", verbose=False)
clf.fit(X_train, y_train)

Parameters:

X_train: array-like of shape (n_samples, n_features)
- Training Samples.
y_train: array-like of shape (n_samples,)
- Training Labels.
alpha: Float
- Learning Rate of CAC.
beta: Float
- The maximum allowed decrease in CAC cost function.
n_epochs: Int
- Number of training epochs.
classifier: The choice of base classifier. Choose from
- LR: Logistic Regression (default)
- RF: Random Forest with 10 estimators
- SVM: Linear SVM
- Perceptron: Linear Perceptron
- DT: Decision Tree
- Ridge: Ridge Classifier
- SGD: Stochastic Gradient Descent classifier
- LDA: Fisher's LDA classifier
- KNN: k-Nearest Neighbour (k=5)
- NB: Naive Bayes Classifier
decay: {"inv", "fixed", "exp"}
- Decay strategy for alpha.
init: {"RAND", "KM"}
- Initialization scheme. "RAND": random initialization, "KM": k-means initialization.
verbose: bool
- Parameter to control whether to train models at every intermediate iteration.
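The decay parameter names three strategies for alpha but does not give their formulas. The sketch below shows what schedules with these names commonly look like; the formulas, the `rate` parameter, and the function name `decayed_alpha` are assumptions for illustration and may differ from what CAC.py actually computes.

```python
import math

def decayed_alpha(alpha, epoch, decay="fixed", rate=1.0):
    """Return the learning rate at a given epoch (textbook schedules, assumed)."""
    if decay == "fixed":
        # Constant learning rate.
        return alpha
    if decay == "inv":
        # Inverse-time decay: alpha / (1 + rate * epoch).
        return alpha / (1.0 + rate * epoch)
    if decay == "exp":
        # Exponential decay: alpha * exp(-rate * epoch).
        return alpha * math.exp(-rate * epoch)
    raise ValueError(f"unknown decay strategy: {decay!r}")

for name in ("fixed", "inv", "exp"):
    print(name, [round(decayed_alpha(0.01, t, name), 5) for t in range(4)])
```
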

Attributes:

k: n_clusters.
alpha: alpha.
beta: beta.
classifier: Classifier used.
decay: decay.
init: Initialization scheme.
verbose: verbose parameter.
n_epochs: n_epochs.
centers: An array of cluster centers at every iteration of CAC.
cluster_stats: An array containing counts of +ve and -ve class points in every cluster.
models: An array containing trained models on the initial and final clusters.
scores: Training scores (Accuracy, F1, AUC, Sensitivity, Specificity) of models trained at intermediate iterations
labels: Cluster labels of points at every iteration
clustering_loss: Total CAC loss at every iteration
classification_loss: Total classification loss (log-loss) at every iteration
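For reference, four of the metrics listed under scores can be computed from binary labels and predictions as below. This sketch uses the standard definitions and is not taken from CAC.py; `binary_scores` is a hypothetical helper, label 1 is assumed to be the positive class, and AUC is left out because it needs confidence scores rather than hard predictions.

```python
def binary_scores(y_true, y_pred):
    """Accuracy, F1, sensitivity and specificity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Report 0.0 instead of dividing by zero on degenerate inputs.
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
    }

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
scores = binary_scores(y_true, y_pred)
print(scores)
```
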

Methods

fit(X_train, y_train): Fit the model according to the given training data.
predict(X_test): Predict class labels and confidence scores for samples.

Output:

The trained model.

Testing/Evaluation Phase:

y_pred, y_proba = clf.predict(X_test, ITERATION)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_proba)

Input:

X_test: array-like of shape (n_samples, n_features). Testing Samples.
y_test: array-like of shape (n_samples,). Testing Labels.
ITERATION: Int. The iteration at which to get the predictions.

Output:

test_scores: A tuple (y_pred, y_proba) containing the predicted labels and the confidence score of every prediction.
