
Protocol

James Bergstra edited this page Mar 14, 2013 · 6 revisions

The protocol for communication between view objects and learning algorithms is defined in base.py. The key components are:

  • Task
  • View
  • Protocol
  • Learning algorithm

The following text explains how these pieces fit together, and then enumerates the different Task Semantics that are used in skdata so far.

As a researcher using skdata, you will probably be writing your own learning algorithm implementation, in order to collect the statistics you care about for your work. Writing your learning algorithm in the form described here will ease the process of adapting your experiment code to get results from different data sets that you can compare directly to other results from the relevant literature.

Task

A task represents some data together with brief meta-data describing what kind of data it is and what to do with it. It is not meant to encapsulate behaviour; it is a container object. Taking the Iris view.py as an example, we see at line 30 that there is a method for creating tasks. The Iris task method binds together (a) an input feature matrix, (b) an output label vector, and (c) the semantic meta-data string "vector_classification". In the course of running an experiment, it is typical to create many task objects (e.g. one task for a training set and another for a testing set), but they will typically all carry the same semantic meta-data descriptor. The meta-data is the only mandatory task attribute, and its purpose is to tell a learning algorithm what other attributes to expect in the task.
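
A task container along these lines can be sketched as follows. This is an illustration of the idea, not skdata's exact base.py definition; the class signature and attribute handling here are assumptions.

```python
class Task(object):
    """Plain container: `semantics` is the only mandatory attribute;
    the remaining keyword attributes depend on the semantics."""
    def __init__(self, semantics, **attrs):
        self.semantics = semantics
        for name, value in attrs.items():
            setattr(self, name, value)

# e.g. a training task for Iris-like data
train_task = Task('vector_classification',
                  x=[[5.1, 3.5], [4.9, 3.0]],  # feature matrix (rows = examples)
                  y=[0, 0],                    # integer labels
                  n_classes=3)
```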

View

A view represents an interpretation of a data set (which is not generally standardized in any way) as a standard type of learning problem, and often specifies particular train/test splits, feature representations, and metrics for judging the success of models. Technically, a view draws on a data set to define several tasks, and sequences them into a protocol. The K-fold cross-validation protocol implemented by the Iris example creates a train task and a test task for each evaluation fold. The tasks here are all labeled with the same "vector_classification" meta-data, which indicates to the learning algorithm that each task has a .x feature matrix and a .y label vector, and that a model should be a classifier that predicts y from x while minimizing the number of classification errors.
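
A minimal sketch of such a view, in the spirit of the Iris K-fold example; the class name `KfoldView` and the `task`/`splits` helpers are illustrative assumptions, not skdata's actual API.

```python
import numpy as np
from types import SimpleNamespace

class KfoldView(object):
    """Interprets a (x, y) data set as K train/test splits of
    'vector_classification' tasks."""
    def __init__(self, x, y, n_classes, K=5):
        self.x = np.asarray(x)
        self.y = np.asarray(y)
        self.n_classes = n_classes
        self.K = K

    def task(self, idxs):
        # Every task carries the same semantics descriptor.
        return SimpleNamespace(semantics='vector_classification',
                               x=self.x[idxs], y=self.y[idxs],
                               n_classes=self.n_classes)

    def splits(self):
        # Yield one (train_task, test_task) pair per fold.
        folds = np.array_split(np.arange(len(self.y)), self.K)
        for k in range(self.K):
            train_idxs = np.concatenate(
                [folds[j] for j in range(self.K) if j != k])
            yield self.task(train_idxs), self.task(folds[k])

view = KfoldView(x=[[float(i)] for i in range(10)],
                 y=[i % 2 for i in range(10)], n_classes=2, K=5)
pairs = list(view.splits())  # five (train_task, test_task) pairs
```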

Protocol

A protocol is generally implemented by a view method called protocol that takes a learning algorithm as an argument. For example, the view implementation skdata.iris.view.KfoldClassification has a protocol method that creates K pairs of train and test tasks, and then for each split calls model = algo.best_model(train) to train a model on the training data, followed by algo.loss(model, test) to tell the learning algorithm to measure generalization error on the corresponding test data. The protocol method works entirely by side-effect on the learning algorithm; it does not typically modify the view object itself in any way, and it returns its algo argument as the method return value.
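
The driving loop described above can be sketched as follows. The `splits` attribute and the `CountingAlgo` stand-in are assumptions made for illustration; only the `protocol` / `best_model` / `loss` call pattern comes from the text.

```python
class KfoldClassification(object):
    """Sketch of a view's protocol method: drive the experiment
    purely by side-effect on `algo`, and return `algo`."""
    def __init__(self, splits):
        self.splits = splits  # list of (train_task, test_task) pairs

    def protocol(self, algo):
        for train_task, test_task in self.splits:
            model = algo.best_model(train_task)
            algo.loss(model, test_task)
        return algo

class CountingAlgo(object):
    """Trivial stand-in learning algorithm that just records calls."""
    def __init__(self):
        self.calls = []
    def best_model(self, task):
        self.calls.append('best_model')
        return 'model'
    def loss(self, model, task):
        self.calls.append('loss')
        return 0.0

view = KfoldClassification([('train0', 'test0'), ('train1', 'test1')])
algo = view.protocol(CountingAlgo())
```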

Learning Algorithm

A learning algorithm is an object that provides the methods called by the protocol. The idea is that algo.best_model, to continue our example, will inspect the meta-data of the training task and produce an appropriate model for the data. The best_model implementation may also log statistics of the learning process to internal variables, output files, etc. When the protocol later tells the learning algorithm to measure loss, the learning algorithm inspects the meta-data and measures loss in a way appropriate to a model that it previously produced. The learning algorithm object works mainly by side-effect, storing internally any interesting logs or statistics about the model-fitting process or the generalization error.
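
A toy learning algorithm along these lines might look like the sketch below. The majority-class "model" and the `losses` attribute are my own illustrative choices; only the `best_model`/`loss` interface and the side-effect convention come from the text.

```python
from collections import Counter
from types import SimpleNamespace

class MajorityClassAlgo(object):
    """Illustrative learning algorithm: its "model" is just the most
    frequent training label; per-split losses are logged internally."""
    def __init__(self):
        self.losses = []

    def best_model(self, task):
        # Inspect the semantics before trusting the task's attributes.
        assert task.semantics == 'vector_classification'
        return Counter(task.y).most_common(1)[0][0]

    def loss(self, model, task):
        # Fraction of test labels that differ from the predicted label.
        err = sum(1 for label in task.y if label != model) / float(len(task.y))
        self.losses.append(err)
        return err

train = SimpleNamespace(semantics='vector_classification', y=[0, 0, 1])
test = SimpleNamespace(semantics='vector_classification', y=[0, 1])
algo = MajorityClassAlgo()
model = algo.best_model(train)
err = algo.loss(model, test)
```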

When the protocol function call returns, the machine learning experiment is done. The various results of the experiment should be stored in the learning algorithm object, and the data set view object should be in the same state that it was before the experiment began.

Task Semantics Used in Skdata

The data sets in skdata (those which have been written or upgraded to use this design) use the following task semantics. If you want to define new semantics for your own work, you can go ahead and do so; skdata does not need to be modified or notified in any way.

"vector_classification"

Task objects with this semantics must have:

  • x - a matrix whose rows are feature vectors (of floats)
  • y - a vector whose entries are integer labels in {0, 1, ..., n_classes - 1}
  • n_classes - the number of possible classes

The x and y attributes must have the same length: one row of x and one entry of y per example.
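
A learning algorithm can sanity-check these attributes before using them; the helper below is my own sketch (the function name is not part of skdata).

```python
import numpy as np
from types import SimpleNamespace

def check_vector_classification(task):
    """Verify the attributes required by 'vector_classification'."""
    x = np.asarray(task.x)
    y = np.asarray(task.y)
    assert x.ndim == 2                  # rows are feature vectors
    assert y.ndim == 1                  # one label per row
    assert len(x) == len(y)             # x and y have the same length
    assert y.dtype.kind in 'iu'         # labels are integers
    assert 0 <= y.min() and y.max() < task.n_classes
    return True

task = SimpleNamespace(x=[[5.1, 3.5], [4.9, 3.0], [6.3, 2.5]],
                       y=[0, 0, 2], n_classes=3)
```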

"indexed_vector_classification"

Task objects with this semantics must have:

  • all_vectors - a matrix with shape: (examples, features)
  • all_labels - a vector with shape (examples,) whose entries are integer labels in {0, 1, ..., n_classes - 1}
  • idxs - a vector of non-negative integer indices identifying the "active" examples, for advanced indexing into all_vectors and all_labels.
  • n_classes - the number of possible classes
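
The "active" subset named by idxs can be recovered with NumPy advanced indexing, as in this small sketch (the example values are made up):

```python
import numpy as np

all_vectors = np.arange(12.0).reshape(6, 2)  # (examples, features)
all_labels = np.array([0, 1, 0, 2, 1, 2])    # (examples,)
idxs = np.array([1, 3, 4])                   # the "active" examples

x = all_vectors[idxs]  # features of the active examples, shape (3, 2)
y = all_labels[idxs]   # labels of the active examples, shape (3,)
```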

"indexed_image_classification"

Task objects with this semantics must have:

  • all_images - a 4-tensor with shape: (examples, height, width, channels)
  • all_labels - a vector whose entries are integer labels in {0, 1, ..., n_classes - 1}
  • idxs - a vector of non-negative integer indices identifying the "active" examples, for advanced indexing into all_images and all_labels.
  • n_classes - the number of possible classes