Fix caching calls to `_vector_for_key_cached` and `_out_of_vocab_vector_cached` #47

zfang · 2019-02-07T05:33:45Z

`_query_is_cached` will always return False because `key` must be wrapped in a tuple.
`lru_cache` unifies args, kwargs, and default args in a call (using the `get_default_args` magic) in order to generate a consistent cache key. This means that:
a. all the default args become part of kwargs;
b. any args with a default value are also converted to kwargs;
c. for a parameter that has no default value, providing it as args in one call and as kwargs in another produces different cache keys.
Therefore `_out_of_vocab_vector_cached._cache.get(((key,), frozenset([('normalized', normalized)])))` will always return False, since the actual cache key is `((key,), frozenset([('normalized', normalized), ('force', force)]))`.
It's also wasteful to call `_cache.get` and throw away the result, so I changed `_query_is_cached` to `_query_cached`.
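
To make the key mismatch concrete, here is a minimal sketch that models the cache as a plain dict keyed by `(args, frozenset(kwargs.items()))`, the key shape quoted above. The `make_key` helper and the stored value are hypothetical and only illustrate the lookup behaviour; this is not pymagnitude's actual cache implementation.

```python
def make_key(args, kwargs):
    """Hypothetical helper: build a key shaped like the one quoted above."""
    return (tuple(args), frozenset(kwargs.items()))


# Entry stored under the full key, with every defaulted argument folded into kwargs.
stored_key = make_key(("hello",), {"normalized": True, "force": False})
cache = {stored_key: [0.1, 0.2]}  # placeholder vector, not real data

# Probe that omits `force`: the frozenset differs, so the key differs.
incomplete_probe = make_key(("hello",), {"normalized": True})

# Probe that passes the bare key instead of a 1-tuple.
bare_probe = ("hello", frozenset({"normalized": True, "force": False}.items()))

# Probe that matches the stored key exactly.
full_probe = make_key(("hello",), {"normalized": True, "force": False})

print(cache.get(incomplete_probe))  # None: miss
print(cache.get(bare_probe))        # None: miss
print(cache.get(full_probe))        # [0.1, 0.2]: hit
```

Only the probe that reproduces both the 1-tuple and the complete keyword set finds the entry, which is why a lookup with a partial key can never report a hit.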


zfang · 2019-02-09T06:25:57Z

I'm surprised that this PR has received no attention, because it improves the performance of our service by a large margin. Here is a code snippet that shows the effect of this change:

```python
from collections import defaultdict

import pandas as pd
from pymagnitude import Magnitude, MagnitudeUtils

# A mix of ordinary words and elongated spellings, queried forwards and then reversed.
words = ['hello', 'world', 'cache', 'helllloooo', 'wooooorlddddd', 'caaaaache']
reversed_words = list(reversed(words))

# A positive lazy_loading value bounds the LRU cache (see the `size` column in the output).
vector = Magnitude(path=MagnitudeUtils.download_model('glove/medium/glove.twitter.27B.25d', log=True),
                   language='en',
                   lazy_loading=2400000)
# Cached methods whose cache statistics are inspected.
vector_attrs = ['query', '_vector_for_key_cached', '_out_of_vocab_vector_cached']


def log_cached(vector):
    """Print the cache statistics of each cached method as a table."""
    data = defaultdict(list)
    cache_attrs = ['size', 'lookups', 'hits', 'misses', 'evictions']
    for attr in vector_attrs:
        for cache_attr in cache_attrs:
            data[cache_attr].append(getattr(getattr(vector, attr)._cache, cache_attr))
    df = pd.DataFrame(data, index=vector_attrs)
    print(df, '\n')


print('### Query ...')
vector.query(words)
log_cached(vector)

# The same words in reverse order: per-word lookups should now be served from the cache.
print('### Query reverse ...')
vector.query(reversed_words)
log_cached(vector)
```

Output before the change:

```
### Query ...
                                size  lookups  hits  misses  evictions
query                           1000        1     0       1          0
_vector_for_key_cached       2400000        6     0       6          0
_out_of_vocab_vector_cached  2400000        9     0       9          0

### Query reverse ...
                                size  lookups  hits  misses  evictions
query                           1000        2     0       2          0
_vector_for_key_cached       2400000       12     0      12          0
_out_of_vocab_vector_cached  2400000       18     3      15          0
```

Output after the change:

```
### Query ...
                                size  lookups  hits  misses  evictions
query                           1000        1     0       1          0
_vector_for_key_cached       2400000        6     0       6          0
_out_of_vocab_vector_cached  2400000        9     0       9          0

### Query reverse ...
                                size  lookups  hits  misses  evictions
query                           1000        2     0       2          0
_vector_for_key_cached       2400000       12     6       6          0
_out_of_vocab_vector_cached  2400000       12     3       9          0
```
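
As a quick way to read the two "Query reverse" tables, the cumulative hit rates can be computed from the reported `lookups` and `hits` columns. This is a small sketch that only redoes the arithmetic on the numbers above; the `hit_rate` helper is a hypothetical name and not part of pymagnitude.

```python
# (lookups, hits) after both queries, copied from the "Query reverse" tables.
before = {
    "_vector_for_key_cached": (12, 0),
    "_out_of_vocab_vector_cached": (18, 3),
}
after = {
    "_vector_for_key_cached": (12, 6),
    "_out_of_vocab_vector_cached": (12, 3),
}


def hit_rate(lookups, hits):
    """Fraction of lookups served from the cache (0.0 when there were no lookups)."""
    return hits / lookups if lookups else 0.0


# Map each cached method to its (before, after) hit rate.
rates = {name: (hit_rate(*before[name]), hit_rate(*after[name])) for name in before}

for name, (rate_before, rate_after) in rates.items():
    print(f"{name}: {rate_before:.1%} -> {rate_after:.1%}")
# _vector_for_key_cached: 0.0% -> 50.0%
# _out_of_vocab_vector_cached: 16.7% -> 25.0%
```

With the patch, all six `_vector_for_key_cached` lookups in the second query hit the cache, and `_out_of_vocab_vector_cached` performs six fewer lookups overall.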

I also created https://github.com/zfang/benchmark_pymagnitude just for testing my patch.


AjayP13 · 2019-02-12T01:41:15Z

Hi @zfang,

Thanks for this PR. This likely broke at some point when I modified the underlying LRU cache. Sorry, I've been traveling for the last week or so; I'll get around to reviewing this tonight and merging it in the next few days.

I'll also add some tests to make sure the cache works and prevent regressions in the future.

zfang added 2 commits February 6, 2019 21:16

Update __init__.py

c452881

`_query_is_cached` will always return false because `_cache.get` expects `key` to be in a tuple. This renders the caching useless.

Merge pull request #1 from zfang/zfang-patch-1

4b2afc3

Fix _query_is_cached to allow caching

Felix Fang added 2 commits February 8, 2019 23:10

Fix another caching issue with _out_of_vocab_vector_cached

1043fc1

Fix calls args and kwargs inconsistencies because lru_cache treats them differently

25cf26c

zfang changed the title ~~Fix _query_is_cached to actually enable caching~~ Fix caching calls to _vector_for_key_cached and _out_of_vocab_vector_cached Feb 9, 2019

Use looked up vectors directly instead of calling query; fix _out_of_vocab_vector_cached._cache.get call

fd5f132

AjayP13 self-assigned this Feb 12, 2019

AjayP13 added the bug and caching labels Feb 12, 2019
