core: Collapse DBIC into HostController #1186

Merged · 1 commit merged into master from kim/collapse-dbic on May 21, 2024

Conversation

@kim (Contributor) commented Apr 30, 2024

Make it so HostController manages both the module host (wasm machinery) and the database (RelationalDB / DatabaseInstanceContext) of spacetime databases deployed to a server.

The DatabaseInstanceContextController (DBIC) is removed in the process.

This makes database accesses panic-safe, in that uncaught panics will cause all resources to be released and the database to be restarted on subsequent access. This is a prerequisite for #985.

It also moves us towards storing the module binary directly in the database / commitlog. This patch, however, makes some contortions in order to not introduce a breaking change just yet.
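As a rough illustration of the panic-safety contract described above (not the actual HostController code; the types, locking, and error handling are simplified stand-ins, and the anyhow crate is assumed):

```rust
use std::{
    collections::HashMap,
    panic::{catch_unwind, AssertUnwindSafe},
    sync::Mutex,
};

/// Simplified stand-in for the per-database host (module host + RelationalDB).
struct Host;

/// Sketch of the contract: run `f` against the host for `instance_id`; if it
/// panics, evict the host so its resources are dropped, and let a later call
/// lazily recreate (i.e. restart) the database.
fn using_database<T>(
    hosts: &Mutex<HashMap<u64, Host>>,
    instance_id: u64,
    f: impl FnOnce(&Host) -> T,
) -> anyhow::Result<T> {
    let mut hosts = hosts.lock().unwrap();
    hosts.entry(instance_id).or_insert(Host);
    let host = hosts.get(&instance_id).expect("just inserted");
    match catch_unwind(AssertUnwindSafe(|| f(host))) {
        Ok(value) => Ok(value),
        Err(_) => {
            // Uncaught panic: release everything and report an error; the next
            // access goes through the `or_insert` path again and restarts the host.
            hosts.remove(&instance_id);
            Err(anyhow::anyhow!("database access panicked; host was released"))
        }
    }
}
```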

Expected complexity level and risk

2.5

Testing

  • All API things should still work without changes.
  • Insert a panic! in the durability crate's background tasks, triggered with some probability (a sketch of such an injection follows this list).
    Deploy a test module and access it using the CLI (call reducer, get logs, sql).
    Observe that the panic is triggered (in the server logs).
    Run another command and observe that it returns an error.
    Run it again and observe that the database restarts and the command succeeds.
    On a scale from 1-10, rate how confusing the error messages are.
    On a scale from 1-10, rate how confusing it is that the database does not eagerly restart.
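A minimal sketch of such an injection, assuming the rand crate and a helper like the maybe_panic that shows up in the backtraces later in this thread; the call site is illustrative and none of this is part of the committed patch:

```rust
use rand::Rng;

/// Test-only fault injection: panic with the given probability.
/// `who` names the calling task so the panic message identifies it.
fn maybe_panic(who: &str, probability: f64) {
    let sample: f64 = rand::thread_rng().gen();
    if sample < probability {
        panic!("{who}: Random panic with probability {probability} fires on sample {sample}");
    }
}

/// Illustrative call site: drop this at the top of a durability background
/// task's loop iteration (e.g. FlushAndSyncTask::run or PersisterTask::run).
fn background_task_iteration() {
    maybe_panic("FlushAndSyncTask::run", 0.1);
    // ... the task's real flush/sync work would go here ...
}
```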

@kim (Contributor, Author) commented Apr 30, 2024

Please feel free to suggest naming / terminology changes. I felt uninspired so stuck with ecru.

@kim (Contributor, Author) commented Apr 30, 2024

Also, I am unsure if we tracked energy in standalone before, or if I introduced it accidentally.

};
let database_id = database.id;
let database_addr = database.address;
self.control_db.update_database(database.clone())?;
kim (Contributor, Author):

The logic did not change in this patch, but I'll point out that it is flawed:

If we fail to run the update, the new database will keep running.
There is no obvious solution for that within the current event-driven design, because it is impossible to guarantee that we can restore the previous state.

It will need to be revised in a later patch.

@kim (Contributor, Author) commented Apr 30, 2024

@kazimuth @RReverser It is somewhat mysterious to me how the repeated StandaloneEnv::init with the same database directory could possibly work. Is there something I'm missing, or was it just incidental that it worked?

@gefjon (Contributor) left a comment:

Injected panics into a few places in durability::imp::local and published a module. Without RUST_BACKTRACE, saw:

Created new database with domain: chat, address: 93dda09db9a56d8fa6c024d843e805d8
thread 'tokio-runtime-worker' panicked at crates/durability/src/imp/local.rs:180:9:
FlushAndSyncTask::run: Random panic with probability 0.1 fires on sample 0.048514128
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

With RUST_BACKTRACE=1, saw:

thread 'tokio-runtime-worker' panicked at crates/durability/src/imp/local.rs:180:9:
PersisterTask::run: Random panic with probability 0.1 fires on sample 0.016180456
stack backtrace:
Updated database with domain: chat, address: 93dda09db9a56d8fa6c024d843e805d8
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: spacetimedb_durability::imp::local::maybe_panic
   3: spacetimedb_durability::imp::local::PersisterTask<T>::run::{{closure}}::{{closure}}
   4: spacetimedb_durability::imp::local::PersisterTask<T>::run::{{closure}}
   5: tokio::runtime::task::core::Core<T,S>::poll
   6: tokio::runtime::task::harness::Harness<T,S>::poll
   7: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
   8: tokio::runtime::scheduler::multi_thread::worker::Context::run
   9: tokio::runtime::context::runtime::enter_runtime
  10: tokio::runtime::scheduler::multi_thread::worker::run
  11: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
  12: tokio::runtime::task::core::Core<T,S>::poll
  13: tokio::runtime::task::harness::Harness<T,S>::poll
  14: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

In both cases, the database was inaccessible after the panic.

These messages are far from exceptional, but they're sufficient.

Code looks reasonable. I'm not sure what the best way is to address the test failure with repeated StandaloneEnv::init, but my first thought would be to add a global static STANDALONE_INIT: std::sync::Once and do STANDALONE_INIT.call_once(init_with_global_config).
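A minimal sketch of that suggestion; init_with_global_config stands in for whatever global setup StandaloneEnv::init currently performs:

```rust
use std::sync::Once;

// Run the global part of StandaloneEnv initialization exactly once, no matter
// how many times StandaloneEnv::init is called (e.g. across tests).
static STANDALONE_INIT: Once = Once::new();

fn ensure_global_init() {
    STANDALONE_INIT.call_once(init_with_global_config);
}

// Placeholder for the real global setup (logging, metrics, config, ...).
fn init_with_global_config() {}
```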

@kim (Contributor, Author) commented Apr 30, 2024

> In both cases, the database was inaccessible after the panic.

That sounds wrong, it should come up again (lazily). Will investigate…

Thanks for testing!

@kim force-pushed the kim/collapse-dbic branch 2 times, most recently from 4018d12 to 7ea0b1a on May 2, 2024 09:59
@kim mentioned this pull request on May 2, 2024
@kim force-pushed the kim/collapse-dbic branch 2 times, most recently from 64e46dc to 7962d00 on May 2, 2024 11:54
@@ -457,15 +457,6 @@ impl ControlDb {
Ok(id)
}

pub fn update_database_instance(&self, database_instance: DatabaseInstance) -> Result<()> {
kim (Contributor, Author):

Turns out the DatabaseInstance never actually changes, so this is not needed.

@gefjon (Contributor) commented May 2, 2024

> In both cases, the database was inaccessible after the panic.
>
> That sounds wrong, it should come up again (lazily). Will investigate…

Disregard. It does in fact come up again; I was just confused by my quickstart-chat client being unable to connect due to stale credentials.

@bfops added the release-any (To be landed in any release window) label on May 13, 2024
@kim force-pushed the kim/collapse-dbic branch 2 times, most recently from 1b7e5d2 to 19af629 on May 14, 2024 11:26
/// If the computation `F` panics, the host is removed from this controller,
/// releasing its resources.
#[tracing::instrument(skip_all)]
pub async fn using_database<F, T>(&self, database: Database, instance_id: u64, f: F) -> anyhow::Result<T>
kim (Contributor, Author):

Less than pleasant, arguably. Can't think of a better way, though, because RelationalDB is not UnwindSafe.

A Collaborator:

You could just use AssertUnwindSafe - we don't much care about unwinding anyway

kim (Contributor, Author):

Maybe, but that would still mean catching unwinds in every single method of RelationalDB. That seems silly.

A Contributor:

Just out of pure curiosity, what makes RelationalDB not UnwindSafe? Keep in mind I know very little about Rust stack unwinding

kim (Contributor, Author):

parking_lot locks used in the datastore keep their data in a literal UnsafeCell. Catching a panic means code can continue to run (e.g. destructors), and the compiler cannot prove anything about the memory safety of an UnsafeCell (because that is what it's for :)). At least that's how I understand it.

I believe what Noa is saying is that our goal is to drop the whole Host when there is a panic, so unless we have Drop impls which do silly things, this should be safe.

Thinking about it some more, it might be possible and sufficient to only catch panics of durability.append_tx in RelationalDB::commit_tx. This is the place where we deliberately panic, and once the method has panicked it will continue to panic (to signal that the durability layer is in an irrecoverably broken state). Any other panic (e.g. a bad unwrap) will make the current request panic, but the next one is likely to succeed.

This still requires asserting unwind safety, so I'll try it in a follow-up for easier review.
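A minimal sketch of that follow-up idea (illustrative only; the helper name and anyhow-based error are not the actual RelationalDB::commit_tx code):

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Confine the unwind-safety assertion to the single call that is expected to
/// panic deliberately (like durability.append_tx), turning an unwind into an
/// ordinary error for the caller.
fn append_tx_panic_safe<T>(append: impl FnOnce() -> T) -> anyhow::Result<T> {
    catch_unwind(AssertUnwindSafe(append))
        .map_err(|_| anyhow::anyhow!("durability layer is in an irrecoverably broken state"))
}
```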

@cloutiertyler (Contributor) left a comment:

This is quite a positive improvement. I also like your naming scheme. I thought a bit and I could not come up with anything better than Host. I think it works.

I left one nit about renaming modules: Modules to hosts: Hosts. Could you make that change before merging?

/// Storage inside the [`RelationalDB`] itself, but with a retrieval
/// function chosen by the user.
SameDb(Arc<SameDbStorage>),
}
A Contributor:

Ah this is cool

pub energy_monitor: Arc<dyn EnergyMonitor>,
/// Map of all modules managed by this controller,
/// keyed by database instance id.
modules: Modules,
A Contributor:

I would rename this to hosts: Hosts for consistency

@Centril (Contributor) left a comment:

Minor stuff

@@ -26,11 +24,9 @@ pub mod util;
///
/// Types returned here should be considered internal state and **never** be
/// surfaced to the API.
#[async_trait]
pub trait NodeDelegate: Send + Sync {
A Contributor:

cc @lcodes, relevant changes wrt. your PR.

Comment on lines +91 to +94
let module = host
.get_or_launch_module_host(database, instance_id)
.await
.map_err(log_and_500)?;
A Contributor:

Saw this 5x; can this be refactored to its own function over host, db, inst_id?

kim (Contributor, Author):

At least I have reduced the line count of the 4 occurrences by ~50% each :P

I think it is a courtesy to #1147 not to make those lines conflict in a non-obvious way.

The whole routes module tree could certainly benefit from a grand cleanup, tho.

})
.collect::<Vec<_>>()
});
let json = worker_ctx
A Contributor:

cc @lcodes

/// The registry of all running modules.
type Modules = Arc<Mutex<IntMap<u64, HostCell>>>;

type ExternalStorage = dyn Fn(&Hash) -> anyhow::Result<Option<AnyBytes>> + Send + Sync + 'static;
A Contributor:

Could you type alias anyhow::Result<Option<AnyBytes>>?

kim (Contributor, Author):

To what?

A Contributor:

E.g., type StorageResult = anyhow::Result<Option<AnyBytes>>; or some such.

kim (Contributor, Author):

Maybe it's just me, but I find it common that the idiom Result&lt;Option&lt;T&gt;&gt; denotes "may error, or not found". Obscuring that for the sake of fewer angle brackets doesn't help with recognizing the pattern imho, so I'd ask for a more precise name. Not sure if I could be convinced to name everything sentence-like, like MaybeErrorOrNotFound.
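For context, a generic illustration of the idiom being discussed; the map and byte types are stand-ins, not the actual storage code:

```rust
use std::collections::HashMap;

/// The `Result<Option<T>>` idiom: an outer `Err` means the lookup itself
/// failed, while `Ok(None)` means it ran successfully but found nothing.
fn lookup(store: &HashMap<u64, Vec<u8>>, key: u64) -> anyhow::Result<Option<Vec<u8>>> {
    // A real implementation might hit disk or the network and therefore fail;
    // this in-memory stand-in can only report "found" or "not found".
    Ok(store.get(&key).cloned())
}
```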

@coolreader18 (Collaborator) commented:

just fyi, I've got a version of #1147 ready for when this merges.

@kim added this pull request to the merge queue on May 21, 2024
Merged via the queue into master with commit 2de1475 on May 21, 2024 (6 checks passed)
@kim deleted the kim/collapse-dbic branch on May 21, 2024 17:49
Labels: release-any (To be landed in any release window)
6 participants