Fix caching calls to _vector_for_key_cached and _out_of_vocab_vector_cached #47

Open
wants to merge 5 commits into
base: master

@zfang zfang commented Feb 7, 2019

  1. _query_is_cached always returns False because the key passed to _cache.get must be wrapped in a tuple.

  2. lru_cache unifies args, kwargs, and default args in a call, using the get_default_args magic, so that it can generate a consistent cache key. This means that:
    a. all default args become part of kwargs;
    b. any args that have a default value are also converted to kwargs;
    c. for a parameter that has no default value, passing it as args in one call and as kwargs in another produces different cache keys.
    Therefore _out_of_vocab_vector_cached._cache.get(((key,), frozenset([('normalized', normalized)]))) will always return False, since the actual cache key is ((key,), frozenset([('normalized', normalized), ('force', force)])).

  3. It's wasteful to call _cache.get and throw away the result, so I changed _query_is_cached to _query_cached.
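
To make the key mismatch concrete, here is a minimal sketch. ToyCache is a hypothetical stand-in written for this illustration, not pymagnitude's lru_cache; it only assumes the key shape quoted above, (args_tuple, frozenset_of_kwargs_items).

```python
class ToyCache:
    """Hypothetical stand-in for the LRU cache; not pymagnitude's code."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def make_key(args, kwargs):
        # Positional args stay a tuple; keyword args become a frozenset of pairs.
        return (tuple(args), frozenset(kwargs.items()))

    def put(self, args, kwargs, value):
        self._store[self.make_key(args, kwargs)] = value

    def get(self, key, default=None):
        return self._store.get(key, default)


cache = ToyCache()
# Assume the decorated call stored its result with the defaults folded into kwargs.
cache.put(("hello",), {"normalized": True, "force": False}, "vector")

full_key = (("hello",), frozenset([("normalized", True), ("force", False)]))
partial_key = (("hello",), frozenset([("normalized", True)]))  # 'force' is missing
bare_key = ("hello", frozenset([("normalized", True), ("force", False)]))  # key not in a tuple

hit = cache.get(full_key)              # finds the stored value
miss_partial = cache.get(partial_key)  # None: kwargs set differs
miss_bare = cache.get(bare_key)        # None: "hello" != ("hello",)

# Point (c): the same value passed positionally vs. by keyword gives different keys.
positional_key = ToyCache.make_key(("hello",), {})
keyword_key = ToyCache.make_key((), {"key": "hello"})
```

Because get returns the stored value itself, a caller can use that value directly instead of checking for presence and then looking it up again, which is the idea behind renaming _query_is_cached to _query_cached.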

`_query_is_cached` always returns False because `_cache.get` expects `key` to be in a tuple. This renders the caching useless.
Fix _query_is_cached to allow caching
@zfang
Author

zfang commented Feb 9, 2019

I'm surprised that this PR has received no attention, since it improves the performance of our service by a large margin. Here is a code snippet that shows the effect of this change:

from collections import defaultdict

import pandas as pd
from pymagnitude import Magnitude, MagnitudeUtils

# Three in-vocabulary words followed by three misspelled (out-of-vocabulary) words.
words = ['hello', 'world', 'cache', 'helllloooo', 'wooooorlddddd', 'caaaaache']
reversed_words = list(reversed(words))

vector = Magnitude(path=MagnitudeUtils.download_model('glove/medium/glove.twitter.27B.25d', log=True),
                   language='en',
                   lazy_loading=2400000)
# Methods whose caches are inspected after each query.
vector_attrs = ['query', '_vector_for_key_cached', '_out_of_vocab_vector_cached']


def log_cached(vector):
    """Print the cache counters of each method listed in vector_attrs."""
    data = defaultdict(list)
    cache_attrs = ['size', 'lookups', 'hits', 'misses', 'evictions']
    for attr in vector_attrs:
        for cache_attr in cache_attrs:
            # Each cached method exposes its cache object as `_cache`.
            data[cache_attr].append(getattr(getattr(vector, attr)._cache, cache_attr))
    df = pd.DataFrame(data, index=vector_attrs)
    print(df, '\n')


print('### Query ...')
vector.query(words)
log_cached(vector)

print('### Query reverse ...')
vector.query(reversed_words)
log_cached(vector)

Output before the change:

### Query ...
                                size  lookups  hits  misses  evictions
query                           1000        1     0       1          0
_vector_for_key_cached       2400000        6     0       6          0
_out_of_vocab_vector_cached  2400000        9     0       9          0 

### Query reverse ...
                                size  lookups  hits  misses  evictions
query                           1000        2     0       2          0
_vector_for_key_cached       2400000       12     0      12          0
_out_of_vocab_vector_cached  2400000       18     3      15          0 

Output after the change:

### Query ...
                                size  lookups  hits  misses  evictions
query                           1000        1     0       1          0
_vector_for_key_cached       2400000        6     0       6          0
_out_of_vocab_vector_cached  2400000        9     0       9          0 

### Query reverse ...
                                size  lookups  hits  misses  evictions
query                           1000        2     0       2          0
_vector_for_key_cached       2400000       12     6       6          0
_out_of_vocab_vector_cached  2400000       12     3       9          0 

I also created https://github.com/zfang/benchmark_pymagnitude just for testing my patch.
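
As a quick reading aid, the "Query reverse" counters above can be turned into hit rates. This is only arithmetic on the numbers already shown in the two outputs; the helper name hit_rate is made up for this sketch.

```python
# (lookups, hits) copied from the "Query reverse" tables above.
before = {
    "_vector_for_key_cached": (12, 0),
    "_out_of_vocab_vector_cached": (18, 3),
}
after = {
    "_vector_for_key_cached": (12, 6),
    "_out_of_vocab_vector_cached": (12, 3),
}


def hit_rate(lookups, hits):
    """Fraction of lookups served from the cache (0.0 when there were no lookups)."""
    return hits / lookups if lookups else 0.0


for name in before:
    rate_before = hit_rate(*before[name])
    rate_after = hit_rate(*after[name])
    print(f"{name}: {rate_before:.0%} -> {rate_after:.0%}")
```

With these counters, _vector_for_key_cached goes from no hits to half of its lookups being hits, and _out_of_vocab_vector_cached does six fewer lookups for the same three hits.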

@zfang zfang changed the title from "Fix _query_is_cached to actually enable caching" to "Fix caching calls to _vector_for_key_cached and _out_of_vocab_vector_cached" Feb 9, 2019
@AjayP13
Contributor

AjayP13 commented Feb 12, 2019

Hi @zfang,

Thanks for this PR. This likely broke at some point when I modified the underlying LRU cache. Sorry, I've been traveling for the last week or so. I'll get around to reviewing this tonight and merging it in the next few days.

I'll also add some tests to make sure the cache works and prevent regressions in the future.
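
A regression test of this kind could be sketched as follows. This is only an illustration under stated assumptions: it uses the standard library's functools.lru_cache as a stand-in, and lookup and hits_after are hypothetical names, not Magnitude's cache or its test suite.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def lookup(key, normalized=True):
    """Hypothetical cached function standing in for a vector lookup."""
    return (key, normalized)


def hits_after(calls):
    """Reset the cache, replay the calls, and return the number of cache hits."""
    lookup.cache_clear()
    for args, kwargs in calls:
        lookup(*args, **kwargs)
    return lookup.cache_info().hits


words = ["hello", "world", "cache"]
first_pass = [((w,), {}) for w in words]
second_pass = [((w,), {}) for w in reversed(words)]

# Every word is new on the first pass, so nothing can hit.
hits_first = hits_after(first_pass)
# Replaying the same words in reverse order should hit once per word.
hits_both = hits_after(first_pass + second_pass)
```

The check mirrors the benchmark earlier in the thread: a repeated query should show hits, and a test that asserts on the hit counter would fail if caching silently stopped working.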

@AjayP13 AjayP13 self-assigned this Feb 12, 2019
@AjayP13 AjayP13 added the bug (Something isn't working) and caching labels Feb 12, 2019