MISSION

MISSION: Ultra Large-Scale Feature Selection using Count-Sketches

An ICML 2018 paper by Amirali Aghazadeh*, Ryan Spring*, Daniel LeJeune, Gautam Dasarathy, Anshumali Shrivastava, Richard G. Baraniuk

* These authors contributed equally and are listed alphabetically.

How-To-Run + Code Versions

  1. Build the executables by running the Makefile. The hyperparameters listed under each program below are compile-time constants, so set them in the source before building; a rough sketch of the count-sketch update they configure appears after this list.
  2. Mission Logistic Regression
// Hyperparameters
// Size of Top-K Heap
const size_t TOPK = (1 << 14) - 1;

// Size of Count-Sketch Array
const size_t D = (1 << 18) - 1;

// Number of Arrays in Count-Sketch
const size_t N = 3;

// Learning Rate
const float LR = 5e-1;

./mission_logistic train_data test_data
  3. Fine-Grained Mission Softmax Regression
// Hyperparameters

// Size of Top-K Heap
const size_t TOPK = (1 << 20) - 1;

// Number of Classes
const size_t K = 193;

// Size of Count-Sketch Array
const size_t D = (1 << 24) - 1;

// Number of Arrays in Count-Sketch
const size_t N = 3;

// Learning Rate
const float LR = 1e-2;

// Length of String Feature Representation
const size_t LEN = 12;

./fine_mission_softmax train_data test_data
  4. Coarse-Grained Mission Softmax Regression
// Hyperparameters

// Size of Top-K Heap
const size_t TOPK = (1 << 22) - 1;

// Number of Classes
const size_t K = 193;

// Size of Count-Sketch Array
const size_t D = (1 << 24) - 1;

// Number of Arrays in Count-Sketch
const size_t N = 3;

// Learning Rate
const float LR = 1e-1;

// Length of String Feature Representation
const size_t LEN = 12;

./coarse_mission_softmax [train_data_part_1 train_data_part_2 ... train_data_part_n] test_data
  5. Feature Hashing Softmax Regression
// Hyperparameters

// Number of Classes
const size_t K = 193;

// Size of Count-Sketch Array
const size_t D = (1 << 24) - 1;

// Learning Rate
const float LR = 1e-2;

// Length of String Feature Representation
const size_t LEN = 12;

./softmax [train_data_part_1 train_data_part_2 ... train_data_part_n] test_data
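
The hyperparameter constants above (TOPK, D, N, LR, K, LEN) all configure the same core idea: a count-sketch that stores approximate weights for every feature, plus a top-k heap that keeps only the strongest features. The following is a minimal, illustrative sketch of that idea and not the repository's implementation; the CountSketch struct, the bucket/sign helpers, and the simplistic hash mixing (a stand-in for properly independent hash functions) are assumptions made for this example.

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

// Illustrative values mirroring the Mission Logistic Regression constants above.
const size_t TOPK = (1 << 14) - 1;   // capacity of the top-k heap (not used below)
const size_t D = (1 << 18) - 1;      // buckets per count-sketch array
const size_t N = 3;                  // number of count-sketch arrays
const float LR = 5e-1;               // learning rate

// Hash a feature id into a bucket of row `r`, and to a +/-1 sign for that row.
static size_t bucket(size_t r, uint64_t feature) {
    return std::hash<uint64_t>{}(feature * 0x9E3779B97F4A7C15ULL + r) % D;
}
static float sign(size_t r, uint64_t feature) {
    return (std::hash<uint64_t>{}(feature + r * 0xD1B54A32D192ED03ULL) & 1) ? 1.0f : -1.0f;
}

struct CountSketch {
    std::vector<float> cells = std::vector<float>(N * D, 0.0f);

    // The median of the N signed cells is the estimated weight of `feature`.
    float query(uint64_t feature) const {
        float est[N];
        for (size_t r = 0; r < N; ++r)
            est[r] = sign(r, feature) * cells[r * D + bucket(r, feature)];
        std::nth_element(est, est + N / 2, est + N);
        return est[N / 2];
    }

    // One SGD step: subtract LR * gradient from every row, with the row's sign.
    void update(uint64_t feature, float grad) {
        for (size_t r = 0; r < N; ++r)
            cells[r * D + bucket(r, feature)] -= LR * sign(r, feature) * grad;
    }
};

int main() {
    CountSketch sketch;
    sketch.update(42, -1.0f);                           // gradient step on feature 42
    std::printf("estimated weight: %f\n", sketch.query(42));
    return 0;
}

A min-heap keyed by the magnitude of query(feature) would then keep only the TOPK strongest features; the sketch after the Mission Softmax Regression section below illustrates that part for the coarse-grained case.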

Optimizations

  • MISSION streams the dataset via memory-mapped I/O instead of loading everything into memory, which is
    necessary for tera-scale datasets (a minimal sketch appears after this list).
  • AVX SIMD optimizations for fast Softmax Regression
  • The code is currently optimized for the Splice-Site and DNA Metagenomics datasets.
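
As a rough illustration of the memory-mapped I/O point above (a minimal POSIX sketch, not the repository's loader, and it assumes one training example per line), the file is mapped once and the kernel pages it in on demand, so the whole dataset never has to fit in RAM:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s data_file\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat sb;
    if (fstat(fd, &sb) < 0) { std::perror("fstat"); return 1; }

    // Map the whole file read-only; pages are faulted in lazily as they are touched.
    void* map = mmap(nullptr, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { std::perror("mmap"); return 1; }

    // Tell the kernel the scan is sequential so it can prefetch and drop pages early.
    madvise(map, sb.st_size, MADV_SEQUENTIAL);

    const char* data = static_cast<const char*>(map);
    size_t examples = 0;
    for (off_t i = 0; i < sb.st_size; ++i) {
        if (data[i] == '\n') ++examples;   // a real loader would parse the line here
    }
    std::printf("%zu examples\n", examples);

    munmap(map, sb.st_size);
    close(fd);
    return 0;
}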

Mission Softmax Regression

  1. Fine-Grained Feature Set - Each class maintains a separate feature set, so there is a top-k heap for each class.
  2. Coarse-Grained Feature Set - All the classes share a common set of features, so there is only one top-k heap;
     each feature is ranked by the L1 norm of its weights across all classes (see the sketch after this list).
  3. Data Parallelism - Each worker maintains a separate heap, while aggregating gradients in the same count-sketch.
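
As a rough sketch of the coarse-grained variant (item 2 above): all classes share one heap, and a feature's score is the L1 norm of its estimated weights over the K classes. The names estimate_weight, l1_score, and TopK are assumptions made for this example, not the repository's API, and estimate_weight is a dummy stub standing in for the count-sketch query.

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <queue>
#include <unordered_set>
#include <utility>
#include <vector>

const size_t TOPK = (1 << 22) - 1;   // shared heap capacity, as in the constants above
const size_t K = 193;                // number of classes

// Stub standing in for the count-sketch query of class k's weight on `feature`.
float estimate_weight(uint64_t feature, size_t k) {
    return static_cast<float>((feature + k) % 7) - 3.0f;   // dummy values for illustration
}

// Coarse-grained score: L1 norm of the feature's weights across all classes.
float l1_score(uint64_t feature) {
    float score = 0.0f;
    for (size_t k = 0; k < K; ++k)
        score += std::fabs(estimate_weight(feature, k));
    return score;
}

// One shared min-heap ordered by score: the weakest tracked feature sits on top
// and is evicted when a stronger feature arrives while the heap is full.
struct TopK {
    using Entry = std::pair<float, uint64_t>;                // (score, feature id)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::unordered_set<uint64_t> members;

    void offer(uint64_t feature) {
        if (members.count(feature)) return;                  // already tracked
        float score = l1_score(feature);
        if (heap.size() < TOPK) {
            heap.push({score, feature});
            members.insert(feature);
        } else if (score > heap.top().first) {
            members.erase(heap.top().second);                // evict the weakest feature
            heap.pop();
            heap.push({score, feature});
            members.insert(feature);
        }
    }
};

int main() {
    TopK topk;
    for (uint64_t f = 0; f < 1000; ++f) topk.offer(f);
    std::printf("tracked features: %zu\n", topk.heap.size());
    return 0;
}

During training the scores keep changing, so the real algorithm re-scores features as their sketch estimates are updated; this sketch only shows the static ranking and eviction logic. The fine-grained variant (item 1) would keep one such heap per class, keyed by that class's weight magnitude instead of the L1 norm.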

Datasets

  1. KDD 2012
  2. RCV1
  3. Webspam - Trigram
  4. DNA Metagenomics
  5. Criteo 1TB
  6. Splice-Site 3.2TB