oid caching leads to broken queries after schema changes #3995

davepacheco opened this issue Apr 19, 2024 · 1 comment

Comments

@davepacheco

All of the details asked for in the template below are specified in my reproducer https://github.com/davepacheco/diesel-oid. I'll also fill it in as requested below, but I think it'll be easier to just look at the reproducer.

Setup

Versions

  • Rust: 1.77.0
  • Diesel: 2.1.5
  • Database: PostgreSQL 16.2
  • Operating System: macOS Ventura 13.6.1

Feature Flags

  • diesel: default + "postgres"

Problem Description

We have a program that:

  • runs some queries that insert rows containing user-defined types
  • some other program runs a schema change that's compatible with the Diesel schema but changes the type's oid (i.e., drop the type and re-add it)
  • runs more of those queries

After the schema change, such queries start failing with errors like "type with ID 218 does not exist" (CockroachDB) or "cache lookup failed for type 16455" (PostgreSQL).

You could argue that this is not supposed to work. But restarting the application or executing the queries from new connections fixes the problem. It's the stale cache in the Diesel connection that's broken, and if it didn't cache the OIDs, I think the application would work.

I don't know if PostgreSQL provides any way for Diesel to handle this without the application's help. If not, how should an application deal with this? Should the application have a way to disable caching? Or maybe a way to communicate to Diesel that the cache needs to be flushed?

To make things more complicated, the application where we actually hit this proactively populates the Diesel cache with all of our user-defined types to avoid lots of tiny queries for very short-lived connections (our test suite).

Relatedly: why is the cache per-connection? We're using a connection pool. So even if Diesel had a way to flush the cache and re-fill it, it's hard for us to go do that on every connection. It'd be really nice if the cache were distributed via a watch channel or something where we could update a canonical copy and then propagate that out to all the consumers that want it. I guess that might be hard if consumers need to modify it themselves as they run into new types that they need?
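To illustrate the shared-cache idea suggested above, here is a minimal stdlib-only sketch (not Diesel code; `SharedOidCache` and its methods are made up for illustration): one canonical map behind `Arc<RwLock<...>>`, cloned handles per connection, and a single `flush` that invalidates the cache for every consumer at once.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Hypothetical shared OID cache: one canonical copy, cheap cloned handles
// handed to each connection in the pool.
#[derive(Clone, Default)]
pub struct SharedOidCache {
    inner: Arc<RwLock<HashMap<String, u32>>>,
}

impl SharedOidCache {
    pub fn lookup(&self, type_name: &str) -> Option<u32> {
        self.inner.read().unwrap().get(type_name).copied()
    }

    // A connection that resolves a new type publishes it for everyone.
    pub fn insert(&self, type_name: &str, oid: u32) {
        self.inner.write().unwrap().insert(type_name.to_string(), oid);
    }

    // After a schema change, one flush invalidates the cache for all handles.
    pub fn flush(&self) {
        self.inner.write().unwrap().clear();
    }
}

fn main() {
    let cache = SharedOidCache::default();
    let conn_a = cache.clone();
    let conn_b = cache.clone();

    conn_a.insert("my_type", 16455);
    assert_eq!(conn_b.lookup("my_type"), Some(16455)); // visible to every handle

    cache.flush(); // schema changed: all handles now see an empty cache
    assert_eq!(conn_a.lookup("my_type"), None);
    println!("one flush invalidated the cache for every connection handle");
}
```

The open question noted above remains: connections mutate the cache as they encounter new types, so a real design would need to decide how concurrent fills and flushes interleave.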

What are you trying to accomplish?

We're trying to manage an application where the database schema can change over time. We're not yet doing online schema updates, in that we're not expecting the application to work seamlessly as the schema is updated. We do expect that we can start the program, have it wait for the schema version to match the one it understands, and then proceed.

First, we store an integer version number for our schema in a database table.

To update our system to a new version of the software + schema, we do the following:

  • shut the application down ("old" application version, "old" schema version)
  • replace the application binaries on disk
  • start the application again ("new" application version)
    • on startup, it checks the schema version. It'll still be the old one. It blocks, re-checking every few seconds until the version matches what's expected.
  • an operator executes a separate program to perform the schema update.
    • this unblocks the application above

Our problem is that Diesel may have cached some types' OIDs before the schema change. If that happens, then the application is totally broken (lots of queries fail) after the schema change.

For much more detail, see oxidecomputer/omicron#5561.

What is the expected output?

For my reproducer, I expect no "cache lookup failed" error.

What is the actual output?

For my reproducer, I see the "cache lookup failed" error.

Are you seeing any additional errors?

See reproducer README or output below.

Steps to reproduce

  1. Set up PostgreSQL 16.2 (version probably doesn't matter much) running on 127.0.0.1 port 5432 with a database called dap.
  2. Clone https://github.com/davepacheco/diesel-oid. This fully specifies the Rust version (rust-toolchain.toml) and dependencies (Cargo.lock).
  3. In diesel-oid, run cargo run -- postgresql://127.0.0.1:5432/dap.

I see:

$ cargo run -- postgresql://127.0.0.1:5432/dap
   Compiling diesel-oid v0.1.0 (/Users/dap/diesel-oid)
    Finished dev [unoptimized + debuginfo] target(s) in 0.44s
     Running `target/debug/diesel-oid 'postgresql://127.0.0.1:5432/dap'`
connecting to: postgresql://127.0.0.1:5432/dap
connected!
initialized schema
inserted rows: 1
found rows: [MyTable { a: 0, b: One }]
updated schema
ERROR: second insert failed: cache lookup failed for type 16455
establishing second connection
connected!
insert on second connection worked: 1
contents of table after second insert (second conn):
rows: [MyTable { a: 0, b: One }, MyTable { a: 0, b: One }]
contents of table after second insert (first conn):
Error: listing contents of table

Caused by:
    cached plan must not change result type

I'm not worried about "cached plan must not change result type" -- I get that one. The problem is the "cache lookup failed for type 16455". I think what's going on is:

  1. From one connection, we successfully INSERT a row into table my_table, which has a column of user-defined type my_type. This populates Diesel’s OID cache for the type my_type.
  2. Then we change the schema in a way that’s compatible with the Diesel definition of that schema, but with a different OID (by dropping and re-creating the user-defined type).
  3. Then we try the INSERT again. It fails unexpectedly. It appears that what’s happening is Diesel is sending the same OID as before, from its cache, but that’s wrong now.
  4. To prove it’s just a caching issue: we create a second connection and do the same INSERT again. It works.
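The four steps above can be modeled with a toy per-connection cache over a mutable "catalog" (an illustrative sketch, not Diesel's actual implementation; `Connection` and `resolve` are made up):

```rust
use std::collections::HashMap;

// Toy model: the catalog maps type names to their current OIDs, like pg_type.
// Each connection keeps its own lookup cache, analogous to Diesel's
// per-connection metadata cache.
struct Connection {
    cache: HashMap<String, u32>,
}

impl Connection {
    fn new() -> Self {
        Connection { cache: HashMap::new() }
    }

    // First lookup hits the catalog; every later lookup is served from the
    // cache, even if the catalog has changed underneath.
    fn resolve(&mut self, catalog: &HashMap<String, u32>, ty: &str) -> u32 {
        *self
            .cache
            .entry(ty.to_string())
            .or_insert_with(|| catalog[ty])
    }
}

fn main() {
    let mut catalog = HashMap::from([("my_type".to_string(), 16455u32)]);
    let mut conn1 = Connection::new();

    // Step 1: the first INSERT populates conn1's cache.
    assert_eq!(conn1.resolve(&catalog, "my_type"), 16455);

    // Step 2: DROP TYPE + CREATE TYPE gives the type a new OID.
    catalog.insert("my_type".to_string(), 16789);

    // Step 3: conn1 still sends the stale OID -- the broken INSERT.
    assert_eq!(conn1.resolve(&catalog, "my_type"), 16455);

    // Step 4: a fresh connection resolves the current OID and works.
    let mut conn2 = Connection::new();
    assert_eq!(conn2.resolve(&catalog, "my_type"), 16789);

    println!("stale cache on conn1; fresh conn2 sees the new OID");
}
```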

Checklist

I am not sure how to reproduce this without diesel_derive_enum but I do not believe it's related to the problem.

@weiznich
Copy link
Member

Thanks for filing this bug report. It's definitely helpful to have a good minimal example for this.

That said: I unfortunately see no good way to change the current behavior without running into massive performance problems. I think this essentially boils down to: changing the schema is only supported as long as the various internal caches in the connection are not yet populated. After that it's not safe to reuse the connection and expect it to continue to work. At that point you need to establish new connections, by marking your connection pool as stale and reestablishing connections. After all, at the point where you run these queries you know that something might have gone stale, which is more than Diesel knows about those particular queries.
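One way an application could implement "mark the pool stale" even without pool support is a generation counter: the schema-update tool bumps it after running DDL, and callers discard any connection created under an older generation instead of reusing it (a stdlib sketch under those assumptions; real pools like r2d2 would need their own hooks):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Bumped by whoever performs the schema change.
static SCHEMA_GENERATION: AtomicU64 = AtomicU64::new(0);

// A checked-out connection remembers the generation it was created under.
struct PooledConn {
    generation: u64,
}

impl PooledConn {
    fn new() -> Self {
        PooledConn {
            generation: SCHEMA_GENERATION.load(Ordering::SeqCst),
        }
    }

    // Stale connections carry stale OID caches and must be dropped, not reused.
    fn is_stale(&self) -> bool {
        self.generation != SCHEMA_GENERATION.load(Ordering::SeqCst)
    }
}

fn main() {
    let conn = PooledConn::new();
    assert!(!conn.is_stale());

    // The schema-update tool bumps the generation after its DDL commits.
    SCHEMA_GENERATION.fetch_add(1, Ordering::SeqCst);
    assert!(conn.is_stale()); // discard and reconnect instead of reusing

    let fresh = PooledConn::new();
    assert!(!fresh.is_stale());
    println!("stale connection discarded; fresh one will resolve current OIDs");
}
```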

That said: I'm open to accepting contributions to improve the current behavior, but I would strongly prefer to have a suggested design first, before someone starts to implement something. For me that's currently not a priority (well, maybe beside just adding a documentation note about this specific behavior).

> Relatedly: why is the cache per-connection? We're using a connection pool. So even if Diesel had a way to flush the cache and re-fill it, it's hard for us to go do that on every connection. It'd be really nice if the cache were distributed via a watch channel or something where we could update a canonical copy and then propagate that out to all the consumers that want it. I guess that might be hard if consumers need to modify it themselves as they run into new types that they need?

Simply because there is no pool in Diesel itself; we just provide support for r2d2, and that does not allow storing this information in the pool. That only leaves storing it in the connection. I cannot really comment on other third-party pool implementations here. Again, that's something where I would accept contributions to change.
