
Concept

The marginal reduction implements a technique similar to how multi-armed bandits operate. For each given id, a feature is maintained whose value is updated as follows:

numerator = numerator * (1.0 - decay) + (label * weight)
denominator = denominator * (1.0 - decay) + weight

This allows you to track the value of a given id or arm based on the rewards it has received, without any contextual information.
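As a minimal sketch of the idea (plain Python for illustration, not VW's actual implementation; the class and method names are made up):

class Marginal:
    def __init__(self, initial_numerator=0.5, initial_denominator=1.0, decay=0.0):
        # Defaults match VW's --initial_numerator and --initial_denominator.
        self.numerator = initial_numerator
        self.denominator = initial_denominator
        self.decay = decay

    def update(self, label, weight=1.0):
        # Each event down-weights older observations by (1 - decay).
        self.numerator = self.numerator * (1.0 - self.decay) + label * weight
        self.denominator = self.denominator * (1.0 - self.decay) + weight

    def estimate(self):
        # Current marginal value for this id/arm.
        return self.numerator / self.denominator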

How to use it

Marginal Options:
  --marginal arg                   Substitute marginal label estimates for ids
  --initial_denominator arg (=1, ) Initial denominator
  --initial_numerator arg (=0.5, ) Initial numerator
  --compete                        Enable competition with marginal features
  --update_before_learn            Update marginal values before learning
  --unweighted_marginals           Ignore importance weights when computing 
                                   marginals
  --decay arg (=0, )               Decay multiplier per event (1e-3 for 
                                   example)

The reduction is enabled with --marginal <namespace>. The given argument is matched against the first character of the namespace containing the marginal features. More than one marginal namespace is supported.

The marginal namespace needs to be carefully constructed, as it is interpreted in a specific way. It should contain one or more pairs of features. The first feature in each pair determines the feature index of the resulting marginal feature; the second is the id whose marginal is tracked. Below is an example VW data file with 4 lines and 3 different ids. Notice that every line in the data file uses the same value for the first feature.

0.5 |m constant id1
1.0 |m constant id2
0.25 |m constant id3
0.4 |m constant id1

If we train on this file:

vw --marginal m -d <data> --noconstant --readable_model readable.txt

We can inspect the readable model to see the calculated marginals. The marginal triplets appear after the marginals size = x line and before :0. Each triplet is hash:numerator:denominator.

Readable model output:

Version 8.11.0
Id 
Min label:0
Max label:1
bits:18
lda:0
0 ngram:
0 skip:
options: --marginal m
Checksum: 1964076403
marginals size = 3
262109:0.75:2
134578:1.5:2
251020:1.4:3
:0
m^constant:6788:0.877014

Note: this invert hash output should contain readable ids. This issue is being tracked by #3496.

Since we used the default initial numerator of 0.5 and denominator of 1, we can see the counts add up. For example, for id1:

numerator = 0.5 + 0.5 + 0.4
          = 1.4
denominator = 1 + 1 + 1
            = 3
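Using the sketch from the Concept section (again, just illustrative Python), all three triplets can be reproduced from the data file:

marginals = {}
for id_, label in [("id1", 0.5), ("id2", 1.0), ("id3", 0.25), ("id1", 0.4)]:
    m = marginals.setdefault(id_, Marginal())  # defaults: 0.5 / 1, decay = 0
    m.update(label)

for id_, m in marginals.items():
    print(id_, m.numerator, m.denominator)
# id1 1.4 3.0
# id2 1.5 2.0
# id3 0.75 2.0

These match the numerator/denominator values in the readable model above.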

Since we used constant for each of the lines, only a single model weight was learned.

Suppose we make a prediction with this model on this example:

| constant id2

We would expect the prediction to be:

prediction = 1.5/2 * 0.877014
           = 0.658
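Continuing the illustrative sketch, the prediction substitutes id2's marginal estimate and multiplies it by the learned weight for constant:

weight = 0.877014  # m^constant weight from the readable model
print(round(marginals["id2"].estimate() * weight, 3))  # 0.658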

Of course, this is a simple example chosen to be easy to calculate by hand. The marginal reduction can be applied in a larger system alongside non-marginal features in order to learn a feature that directly corresponds to the value of a given id or arm.
