Merge pull request #5460 from psafont/private/paus/double-gardon

CA-383867: Add local disk cache library for xapi guard
xapi-project · Feb 29, 2024 · 1ae1dd8 · 1ae1dd8
2 parents b27ecda + dcb8042
commit 1ae1dd8
Show file tree

Hide file tree

Showing 16 changed files with 1,268 additions and 132 deletions.
diff --git a/doc/content/xapi-guard/_index.md b/doc/content/xapi-guard/_index.md
@@ -17,6 +17,8 @@ Principles
 2. Xenopsd is able to control xapi-guard through message switch, this access is
    not limited.
 3. Listening to domain socket is restored whenever the daemon restarts to minimize disruption of running domains.
+4. Disruptions to requests when xapi is unavailable is minimized.
+   The startup procedure is not blocked by the availability of xapi, and write requests from domains must not fail because xapi is unavailable.
 
 
 Overview
@@ -26,3 +28,71 @@ Xapi-guard forwards calls from domains to xapi to persist UEFI variables, and up
 To do this, it listens to 1 socket per service (varstored, or swtpm) per domain.
 To create these sockets before the domains are running, it listens to a message-switch socket.
 This socket listens to calls from xenopsd, which orchestrates the domain creation.
+
+To protect the domains from xapi being unavailable transiently, xapi-guard provides an on-disk cache for vTPM writes.
+This cache acts as a buffer and stores the requests temporarily until xapi can be contacted again.
+This situation usually happens when xapi is being restarted as part of an update.
+SWTPM, the vTPM daemon, reads the contents of the TPM from xapi-guard on startup, suspend, and resume.
+During normal operation SWTPM does not send read requests from xapi-guard.
+
+Structure
+---------
+
+The cache module consists of two Lwt threads, one that writes to disk, and another one that reads from disk.
+The writer is triggered when a VM writes to the vTPM.
+It never blocks if xapi is unreachable, but responds as soon as the data has been stored either by xapi or on the local disk, such that the VM receives a timely response to the write request.
+Both try to send the requests to xapi, depending on the state, to attempt write all the cached data back to xapi, and stop using the cache.
+The threads communicate through a bounded queue, this is done to limit the amount of memory used.
+This queue is a performance optimisation, where the writer informs the reader precisely which are the names of the cache files, such that the reader does not need to list the cache directory.
+And a full queue does not mean data loss, just a loss of performance; vTPM writes are still cached.
+
+This means that the cache operates in three modes:
+- Direct: during normal operation the disk is not used at all
+- Engaged: both threads use the queue to order events
+- Disengaged: A thread dumps request to disk while the other reads the cache
+  until it's empty
+
+```mermaid
+---
+title: Cache State
+---
+stateDiagram-v2
+    Disengaged
+    note right of Disengaged
+        Writer doesn't add requests to queue
+        Reader reads from cache and tries to push to xapi
+    end note
+    Direct
+    note left of Direct
+        Writer bypasses cache, send to xapi
+        Reader waits
+    end note
+    Engaged
+    note right of Engaged
+        Writer writes to cache and adds requests to queue
+        Reader reads from queue and tries to push to xapi
+    end note
+
+    [*] --> Disengaged
+
+    Disengaged --> Disengaged : Reader pushed pending TPMs to xapi, in the meantime TPMs appeared in the cache
+    Disengaged --> Direct : Reader pushed pending TPMs to xapi, cache is empty
+
+    Direct --> Direct : Writer receives TPM, sent to xapi
+    Direct --> Engaged : Writer receives TPM, error when sent to xapi
+
+    Engaged --> Direct : Reader sent TPM to xapi, finds an empty queue
+    Engaged --> Engaged : Writer receives TPM, queue is not full
+    Engaged --> Disengaged : Writer receives TPM, queue is full
+```
+
+Startup
+------
+
+At startup, there's a dedicated routine to transform the existing contents of the cache.
+This is currently done because the timestamp reference change on each boot.
+This means that the existing contents might have timestamps considered more recent than timestamps of writes coming from running events, leading to missing content updates.
+This must be avoided and instead the updates with offending timestamps are renamed to a timestamp taken from the current timestamp, ensuring a consistent
+ordering.
+The routine is also used to keep a minimal file tree: unrecognised files are deleted, temporary files created to ensure atomic writes are left untouched, and empty directories are deleted.
+This mechanism can be changed in the future to migrate to other formats.