LRU cache is not an LRU cache #141

Open
JamesOldfield opened this issue Dec 29, 2020 · 1 comment

Hi! Many thanks for all your hard work on pysqlite over the years.

Like many others, I've been using pysqlite through the fork in the Python standard library. I've been trying to make sure my prepared statements are being reused properly, and after finding the documentation about this a bit thin I resorted to reading the source code, where I found several potential problems. The Python documentation says 'sqlite3 is developed externally under the name “pysqlite”', so I'm reporting these directly here.

The problem

The comment at the start of cache.c proclaims it to be an "LRU cache". But it seems to do something rather different from a traditional LRU cache: it maintains usage counts and uses those to decide which entries to evict. Overall this extra effort seems to reduce the cache's effectiveness in most situations I can think of. There are three separate problems I can see:

  • The main problem I'm reporting is that items that are evicted from the cache will have their count forgotten (unsurprisingly), and so if they re-enter the cache their count will be reset to 1.
    • This means that any entries that manage to get a high usage count can potentially hog the cache, while other entries getting cycled in and out never manage to get higher up.
    • This is easiest to see by thinking about an example – see the illustration below.
  • Even among entries with the same usage count, the item that most recently reached that count (typically the one with that count most recently added to the cache) is sorted last within the group, rather than first as you would expect.
    • This applies to newly added items (count = 1), which are always put at the very end of the list …
    • … And it applies to items that have just had their usage count incremented (count > 1), which are bubbled up before items with a lower count, but not before other items with an equal count.
    • Overall this exacerbates the previous problem (again, see the example below).
  • A somewhat related problem is that the cache is not capable of being updated if there is an existing item with the same key that remains in use.
    • In practice this is probably not too big a deal, because in this context it would mean that a statement cannot be cached while another statement with the same SQL is still open and partially iterated on another cursor. Keeping a statement open for a long time is already an application flaw, since it keeps a transaction open (even in autocommit mode), which causes its own problems. I thought it was still worth mentioning though.

To reiterate what I said at the start: this is all based on a careful reading of the code. I haven't actually done any experiments, so I might have made a mistake; sorry in advance if I have.

Illustration of the problem

Imagine a database server program with a cache that holds up to 7 entries.

On startup it runs one-time queries I and J (e.g. pragma statements). You would normally expect this to result in an LRU cache of [J, I] because J was accessed more recently, but instead you get this:

[I: 1, J: 1]

Now it services a request that requires running queries X, Y and Z – again these end up at the end:

[I: 1, J: 1, X: 1, Y: 1, Z: 1]

Now it services that same request again. This time X, Y and Z manage to get pushed ahead of I and J, although within each count group the order is still not most-recent-first:

[X: 2, Y: 2, Z: 2, I: 1, J: 1]

Now here's the clincher. A new request is received, this time requiring queries A, B, C and D. Note that there are only 2 free spaces in the cache, so after A and B are inserted, C and then D each evict whatever occupies the final slot:

[X: 2, Y: 2, Z: 2, I: 1, J: 1, A: 1, D: 1]

If we service this request a few more times then A gets promoted, but the final slot just cycles between B, C and D, which never achieve a count greater than 1. Even I and J never get out of their way!

[A: 10, X: 2, Y: 2, Z: 2, I: 1, J: 1, D: 1]

This looks like it could be a degenerate case, but most practical workloads I can think of suffer from the same problem, unless the cache never fills up (in which case the eviction strategy doesn't matter anyway). I think evicting a random element might actually do better than evicting the last one!
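
To make this concrete, here is a rough Python model of the count-based scheme as I read it from cache.c (a sketch of my understanding, not the real implementation). Feeding it the workload above reproduces the states shown:

# Rough model of the count-based scheme described above; this reflects my
# reading of cache.c, not the real code.
class CountCache:
    def __init__(self, size):
        self.size = size
        self.entries = []  # [key, count] pairs; eviction takes the last pair

    def get(self, key):
        for i, entry in enumerate(self.entries):
            if entry[0] == key:
                entry[1] += 1
                # Bubble up past strictly lower counts only, so the entry
                # lands after existing entries with an equal count.
                j = i
                while j > 0 and self.entries[j - 1][1] < entry[1]:
                    j -= 1
                self.entries.insert(j, self.entries.pop(i))
                return
        if len(self.entries) >= self.size:
            self.entries.pop()  # evict the last entry; its count is forgotten
        self.entries.append([key, 1])  # new entries go at the very end

cache = CountCache(7)
for key in "IJ" + "XYZ" + "XYZ" + "ABCD":
    cache.get(key)
print(cache.entries)
# [['X', 2], ['Y', 2], ['Z', 2], ['I', 1], ['J', 1], ['A', 1], ['D', 1]]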

Suggested solution 1: Use an LRU cache

An obvious fix is to make the cache an actual LRU cache:

  • Remove the usage count.
  • When new items are added, put them at the start rather than the end.
  • When existing items are used, move them directly to the start (no need to bubble them gradually up).
  • When the cache is full, discard the last entry.

Aside from making the cache more useful (or at least more predictable), this will also simplify its implementation.
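
For illustration, here is a minimal sketch of that structure in Python using collections.OrderedDict (the real fix would of course be in cache.c; the factory argument is just a stand-in for statement construction):

from collections import OrderedDict

# Minimal sketch of a plain LRU cache; front of the dict = most recently used.
class LRUCache:
    def __init__(self, size, factory):
        self.size = size
        self.factory = factory        # builds a value on a cache miss
        self.entries = OrderedDict()

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key, last=False)  # straight to the front
            return self.entries[key]
        if len(self.entries) >= self.size:
            self.entries.popitem()                     # discard the last entry
        value = self.entries[key] = self.factory(key)
        self.entries.move_to_end(key, last=False)      # new items start at the front
        return value

Running the workload from the illustration through this gives what you would expect: after the first A, B, C, D request completes, every later repeat of it is served entirely from the cache (I and J are the entries that get evicted).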

Suggested solution 2 (my preferred option): Get rid of the cache, allow explicit statement reuse

I think part of the problem here is too much mysterious cleverness behind the scenes. Maybe a plain LRU cache is sometimes best, maybe the existing count-based scheme is sometimes best, and at other times yet another strategy is best. Leave it to the user! Get rid of the cache (or at least make it optional) and allow statements to be reused explicitly; then users can apply whatever caching structure or strategy they like. This would also bring clarity: they (we!) wouldn't have to hunt around in the documentation or the pysqlite source code to figure out what's going on.

The simplest interface I can think of for this is to allow Cursor objects to reuse a statement (if the cache is disabled) when a new query is started with SQL identical to the previous one. Essentially, each cursor becomes a cache of size one. This idea has the benefit of requiring little or no change to the interface while still giving complete control. It also makes it completely clear to users of the library that a statement cannot be reused while an iteration is still in progress, which is true at the SQLite C library level.

A modest change to the interface to support this idea would be to allow the SQL to be omitted in calls to execute() and executemany() to mean "use the same SQL as last time". You could even allow passing SQL to Cursor on construction:

conn = sqlite3.connect(path_to_db, isolation_level=None)
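# Proposed interface (not the current sqlite3 API): bind the SQL to the
# cursor up front, then reuse it implicitly on each execute() call.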
cur = conn.cursor(sql="SELECT * FROM Foo WHERE i > ?")
print(cur.execute(parameters=(3,)).fetchall())
for row in cur.execute(parameters=(2,)):
    print(row)
@rogerbinns

For the record, the current sqlite3 code in CPython does use functools.lru_cache.
You can see this in the function load_functools_lru_cache, currently at https://github.com/python/cpython/blob/main/Modules/_sqlite/module.c#L226
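
A quick way to see the LRU behaviour that gives (a sketch with a hypothetical prepare function standing in for statement construction; the CPython internals differ in detail):

import functools

# Hypothetical stand-in for statement preparation, wrapped in a true LRU
# cache in the same spirit as the CPython module's statement constructor.
@functools.lru_cache(maxsize=2)
def prepare(sql):
    return object()  # stand-in for a prepared statement

prepare("SELECT i FROM Foo")  # miss
prepare("SELECT j FROM Foo")  # miss
prepare("SELECT i FROM Foo")  # hit: refreshes this SQL's recency
prepare("SELECT k FROM Foo")  # miss: evicts "SELECT j FROM Foo", the LRU entry
print(prepare.cache_info())   # CacheInfo(hits=1, misses=3, maxsize=2, currsize=2)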
