This repository has been archived by the owner on Dec 19, 2023. It is now read-only.

Caching a submission object #33

Open
kwgoodman opened this issue Oct 20, 2017 · 2 comments

Comments

@kwgoodman
Contributor

kwgoodman commented Oct 20, 2017

The originality server caches each submission as a two-column NumPy array, which limits what we can cache. I propose caching a Submission object instead of an np array, something like (untested):

import numpy as np


class Submission(object):

    def __init__(self, df):

        # make sure probability predictions are aligned (sorted by id) so
        # that we can calculate correlation between submissions; sort a copy
        # so that the caller's dataframe is left unchanged
        df = df.sort_values("id")
        self._y_id = df["probability"].values

        # precalc check for zero std
        self.std_is_zero = bool(self._y_id.std() == 0)

        # do a lazy sorting by prediction value only when needed
        self._y_p = None

    @property
    def y_id_sorted(self):
        # predictions ordered by id
        return self._y_id

    @property
    def y_prob_sorted(self):
        # predictions ordered by value; sorted on first access, then cached
        if self._y_p is None:
            self._y_p = np.sort(self._y_id)
        return self._y_p

It provides

@kwgoodman
Contributor Author

The benefits of this proposal are especially good once #20 is done.

@kwgoodman
Contributor Author

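When we take up this proposal, we may as well cache the z-score of the predictions (self.y_id_sorted). The Pearson correlation between two z-scored arrays is just their dot product divided by the array length, which would save about 3 seconds per submission when there are 1000 active submissions: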

In [1]: a = np.random.randn(350000)
In [2]: b = np.random.randn(350000)
In [3]: from scipy.stats import pearsonr

In [4]: pearsonr(a, b)[0]
Out[4]: 0.003398074841035899

In [5]: az = (a - a.mean()) / a.std()
In [6]: bz = (b - b.mean()) / b.std()
In [7]: np.dot(az, bz) / az.size
Out[7]: 0.0033980748410358929
In [8]: timeit pearsonr(a, b)[0]
100 loops, best of 3: 3.36 ms per loop
In [9]: timeit np.dot(az, bz) / az.size
1000 loops, best of 3: 153 µs per loop

In [10]: 3.36 * 1e-3 * 1000
Out[10]: 3.3600000000000003
In [11]: 153 * 1e-6 * 1000
Out[11]: 0.153

If self.std_is_zero, then we'd skip the z-score of course, since it would mean dividing by zero.
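
A minimal sketch of how the cached z-score could be used (untested against the server). The names ZScoredSubmission and correlation are hypothetical and not part of the existing code, and returning NaN for a constant submission is an assumption about what the caller would want:

```python
import numpy as np


class ZScoredSubmission(object):
    """Hypothetical cache entry that stores z-scored predictions."""

    def __init__(self, y_id_sorted):
        # predictions are assumed to be already ordered by id
        y = np.asarray(y_id_sorted, dtype=np.float64)
        std = y.std()  # population std (ddof=0), as in the timing session
        self.std_is_zero = bool(std == 0)
        # skip the z-score for constant predictions to avoid dividing by zero
        self.z = None if self.std_is_zero else (y - y.mean()) / std


def correlation(s1, s2):
    """Pearson correlation from cached z-scores (NaN if either is constant)."""
    if s1.std_is_zero or s2.std_is_zero:
        return float("nan")
    return float(np.dot(s1.z, s2.z) / s1.z.size)
```

With this layout the mean and std of each submission are computed once, when it enters the cache, and every later comparison against it costs a single dot product.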
