This repository has been archived by the owner on Dec 19, 2023. It is now read-only.
The originality server caches each submission as a 2 column numpy array. Using a numpy array limits what we can cache. I propose using a Submission object instead of a np array, something like (untested):
    import numpy as np

    class Submission(object):

        def __init__(self, df):
            # make sure probability predictions are aligned so that we can
            # calculate correlation between submissions; sort a copy so the
            # caller's dataframe is left untouched
            df = df.sort_values("id")
            # .values replaces as_matrix(), which was removed in pandas 1.0
            self._y_id = df["probability"].values
            # precalc check for zero std
            self.std_is_zero = bool(self._y_id.std() == 0)
            # do a lazy sorting by prediction value only when needed
            self._y_p = None

        @property
        def y_id_sorted(self):
            return self._y_id

        @property
        def y_prob_sorted(self):
            if self._y_p is None:
                self._y_p = np.sort(self._y_id)
            return self._y_p
If we adopt this proposal, we may as well cache the z-score of the predictions (self.y_id_sorted). That would save about 3 seconds per submission when there are 1000 active submissions:
In [1]: a = np.random.randn(350000)
In [2]: b = np.random.randn(350000)
In [3]: from scipy.stats import pearsonr
In [4]: pearsonr(a, b)[0]
Out[4]: 0.003398074841035899
In [5]: az = (a - a.mean()) / a.std()
In [6]: bz = (b - b.mean()) / b.std()
In [7]: np.dot(az, bz) / az.size
Out[7]: 0.0033980748410358929
In [8]: timeit pearsonr(a, b)[0]
100 loops, best of 3: 3.36 ms per loop
In [9]: timeit np.dot(az, bz) / az.size
1000 loops, best of 3: 153 µs per loop
In [10]: 3.36 * 1e-3 * 1000
Out[10]: 3.3600000000000003
In [11]: 153 * 1e-6 * 1000
Out[11]: 0.153
If self.std_is_zero is True then we'd skip the z-score, of course, since it would divide by zero.
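As a rough illustration of the idea, here is a minimal sketch of caching z-scored predictions and computing the correlation with a dot product. The helper names `zscore` and `correlation` are hypothetical and not part of the originality server; the sketch assumes the two arrays are already aligned by id and uses the population standard deviation (NumPy's default), which is what makes the dot product divided by the size equal to the Pearson correlation.

```python
import numpy as np


def zscore(y):
    """Return a z-scored copy of y, or None when y has zero std.

    Hypothetical helper: returning None stands in for the
    std_is_zero flag discussed above.
    """
    std = y.std()  # population std (ddof=0)
    if std == 0:
        return None
    return (y - y.mean()) / std


def correlation(az, bz):
    """Pearson correlation of two arrays that are already z-scored."""
    return np.dot(az, bz) / az.size


# Random stand-ins for two submissions' id-aligned predictions
rng = np.random.RandomState(0)
a = rng.randn(1000)
b = rng.randn(1000)

# z-score once per submission (this is what would be cached) ...
az, bz = zscore(a), zscore(b)
# ... so each pairwise comparison is a single dot product
r = correlation(az, bz)
```

The z-score is paid once per submission, while the dot product is paid once per pair, which is where the saving against 1000 cached submissions would come from.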