Caching a submission object #33

kwgoodman · 2017-10-20T16:33:35Z

The originality server caches each submission as a 2 column numpy array. Using a numpy array limits what we can cache. I propose using a Submission object instead of a np array, something like (untested):

import numpy as np


class Submission(object):

    def __init__(self, df):

        # make sure probability predictions are aligned so that we can
        # calculate correlation between submissions; sort a copy rather
        # than mutating the caller's dataframe in place
        df = df.sort_values("id")
        # .values works across pandas versions (as_matrix was removed in 1.0)
        self._y_id = df["probability"].values

        # precalc check for zero std
        self.std_is_zero = bool(self._y_id.std() == 0)

        # do a lazy sorting by prediction value only when needed
        self._y_p = None

    @property
    def y_id_sorted(self):
        return self._y_id

    @property
    def y_prob_sorted(self):
        if self._y_p is None:
            self._y_p = np.sort(self._y_id)
        return self._y_p

It provides:

- better encapsulation and data description
- lazy sorting of predictions by probability value (after "speed up originality check by running corr first" #32, sorting is only needed if the corr test passes)
- reduced memory use by the cache thanks to lazy sorting, especially after "check originality only vs accepted submissions" #20
- precalculation of the zero std check, which with 1000 active submissions saves about 1 sec per controlling capital submission
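As a rough illustration of the lazy sorting and the precalculated std check, here is a small usage sketch. The class is restated in condensed form only so the snippet runs on its own; the toy ids and probabilities are made up, and `.to_numpy()` assumes a reasonably recent pandas.

```python
import numpy as np
import pandas as pd


class Submission(object):
    """Condensed restatement of the proposed class (sketch, not the server code)."""

    def __init__(self, df):
        # align predictions by id; sort_values returns a new frame
        self._y_id = df.sort_values("id")["probability"].to_numpy()
        # precalculated zero-std check
        self.std_is_zero = bool(self._y_id.std() == 0)
        # sorted-by-probability copy is built only on first access
        self._y_p = None

    @property
    def y_id_sorted(self):
        return self._y_id

    @property
    def y_prob_sorted(self):
        if self._y_p is None:
            self._y_p = np.sort(self._y_id)
        return self._y_p


# toy data (hypothetical ids and probabilities), deliberately out of id order
df = pd.DataFrame({"id": ["c", "a", "b"], "probability": [0.5, 0.9, 0.1]})
sub = Submission(df)

print(sub.y_id_sorted)        # predictions ordered by id: a, b, c
print(sub._y_p is None)       # True: no sorted copy has been allocated yet
print(sub.y_prob_sorted)      # first access triggers np.sort and caches it
print(sub._y_p is not None)   # True: later accesses reuse the cached array
```

Until `y_prob_sorted` is first read, each cached object holds a single array, which is where the memory saving in the list above would come from.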


kwgoodman · 2017-10-20T16:39:03Z

The benefits of this proposal are especially good once #20 is done.

kwgoodman · 2017-10-20T16:44:47Z

When we take up this proposal we may as well take the zscore of the predictions (self.y_id_sorted) to save about 3 seconds per submission when there are 1000 active submissions:

In [1]: a = np.random.randn(350000)
In [2]: b = np.random.randn(350000)
In [3]: from scipy.stats import pearsonr

In [4]: pearsonr(a, b)[0]
Out[4]: 0.003398074841035899

In [5]: az = (a - a.mean()) / a.std()
In [6]: bz = (b - b.mean()) / b.std()
In [7]: np.dot(az, bz) / az.size
Out[7]: 0.0033980748410358929
In [8]: timeit pearsonr(a, b)[0]
100 loops, best of 3: 3.36 ms per loop
In [9]: timeit np.dot(az, bz) / az.size
1000 loops, best of 3: 153 µs per loop

In [10]: 3.36 * 1e-3 * 1000
Out[10]: 3.3600000000000003
In [11]: 153 * 1e-6 * 1000
Out[11]: 0.153

If self.std_is_zero then we'd skip the zscore of course.
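A minimal sketch of how the z-score idea could look in code, assuming the z-scored array is computed once when a submission is cached. The helper names `zscore` and `corr_from_zscores` are hypothetical and not part of the originality server. The identity relies on the population standard deviation (numpy's default, `ddof=0`), which is what makes dividing the dot product by the array size equal to the Pearson correlation.

```python
import numpy as np


def zscore(y):
    """Return (y - mean) / std with population std, or None when std is zero."""
    std = y.std()
    if std == 0:
        # a constant submission has no defined z-score; the caller skips it
        return None
    return (y - y.mean()) / std


def corr_from_zscores(az, bz):
    """Pearson correlation of two equal-length arrays that are already z-scored."""
    return np.dot(az, bz) / az.size


rng = np.random.RandomState(0)
a = rng.randn(1000)
b = rng.randn(1000)

# z-score once per submission (this is the part that would be cached)
az = zscore(a)
bz = zscore(b)

# each pairwise comparison is then a single dot product
r = corr_from_zscores(az, bz)
print(np.isclose(r, np.corrcoef(a, b)[0, 1]))
```

Returning `None` for a zero-std array is just one way to signal the skip; in the proposed class the existing `std_is_zero` flag could serve the same purpose.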
