
Stack upgrades & ProcessSync: What happens if apps use a stack the platform no longer supports?

braa braa braa edited this page May 26, 2020 · 9 revisions

If I upgrade CC to a version that drops support for a stack, but some of my running apps still use that stack...

Summary

  • Diego will be very careful not to cause unexpected app downtime.
  • Those newly invalid, old-stack apps will continue to run and be routable, but CC will no longer be able to send updates for them to Diego.
  • The system will recognize this and refuse to delete any compute resources until it can confirm that they aren't the old-stack apps that it can no longer sync.

What happens to the apps that are still using that unsupported stack?

  • They continue to exist in CCDB
  • They continue to exist in BBS as Diego DesiredLRPs
  • They continue to run on Diego Cells as Diego ActualLRPs (?)
  • They continue to be routable (?)
  • They can no longer be updated or created in Diego
    • Updates and creates will result in the error "no compiler defined for requested stack"
    • Any change to the process' updated_at will make Diego's DesiredLRP out-of-date
    • The ProcessSync loop will attempt to update all out-of-date DesiredLRPs
  • Because the domain is unfresh:
    • They can be deleted in the CF API, but Diego will not stop running their ActualLRPs (?)
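
The out-of-date check described above can be sketched as follows. This is a minimal illustration, not Cloud Controller's actual code: the class names, the `annotation` field, and the idea that the DesiredLRP stores the process' updated_at as a string are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CCProcess:
    guid: str
    stack: str
    updated_at: float  # hypothetical: epoch seconds of the last change in CCDB


@dataclass
class DesiredLRP:
    process_guid: str
    annotation: str  # hypothetical: the updated_at value recorded at the last sync


def is_out_of_date(process: CCProcess, lrp: DesiredLRP) -> bool:
    """Return True when CCDB has changed since the DesiredLRP was last written."""
    return str(process.updated_at) != lrp.annotation


process = CCProcess(guid="app-1", stack="cflinuxfs2", updated_at=1590000000.0)
lrp = DesiredLRP(process_guid="app-1", annotation=str(process.updated_at))
print(is_out_of_date(process, lrp))  # CCDB and Diego agree at this point

process.updated_at = 1590000500.0  # e.g. the app was scaled or restaged
print(is_out_of_date(process, lrp))  # the sync loop would now attempt an update
```

Under this model, an old-stack app that nobody touches is never out-of-date, so it never triggers the stack error; the first change to it after the upgrade is what starts the failing updates.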

What happens to the sync loop?

  • It continues to run
  • In parallel, it continues to sync as many CC processes as possible to Diego as DesiredLRPs
  • Any app with an unsupported stack will error on update if Diego's DesiredLRP is out-of-date.
  • Update errors will prevent freshness from being bumped
  • All errors encountered should be logged by the clock

What happens when freshness isn't bumped?

  • See the BBS documentation for domain freshness
    • TL;DR:
      • No destructive action will be taken against LRPs in that domain
      • Processes with unsupported stacks will continue to run (unless Diego has dropped them during evacuation?)
      • Processes that have been deleted in CC but exist in Diego will continue to run
      • Creates and updates of processes with supported stacks will continue to work fine
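
The freshness rule summarized above can be modeled in a few lines. This is a simplified illustration of the rule as stated on this page, not Diego's convergence code; the function and parameter names are hypothetical.

```python
def lrps_to_stop(actual_lrp_guids, desired_lrp_guids, domain_is_fresh):
    """Return the ActualLRPs that would be stopped for lacking a DesiredLRP."""
    if not domain_is_fresh:
        # Unfresh domain: the desired state may be incomplete, so nothing is destroyed.
        return []
    desired = set(desired_lrp_guids)
    return [guid for guid in actual_lrp_guids if guid not in desired]


running = ["app-1", "app-2", "deleted-in-cc"]
desired = ["app-1", "app-2"]

print(lrps_to_stop(running, desired, domain_is_fresh=True))
print(lrps_to_stop(running, desired, domain_is_fresh=False))
```

With a fresh domain the orphaned ActualLRP is cleaned up; with an unfresh domain it keeps running, which is why processes deleted in CC linger until a sync succeeds without errors.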

What happens to apps that synced successfully?

  • They can be created, updated, scaled, etc
  • Because the domain is unfresh:
    • They can be deleted in the CF API
    • BUT Diego will not stop running their ActualLRPs

What happens to "mysterious" ActualLRPs that CCDB has no record of?

  • They cannot be deleted in the CF API
  • Because the domain is unfresh:
    • Diego will not stop running their ActualLRPs

Open Questions

  • Is this the best we can do to handle this class of failure?
  • Should we tolerate unknown stack errors for bumping freshness?
  • What does Diego do if you're evacuating the last cflinuxfs2 cells?
    • Do the apps stop running?
    • Does the deployment error?
  • If the apps stop running and the mitigation here is to STOP them in CCDB, would it be better to bump freshness if the only errors during sync are about unknown stacks?

Collated context of how we came to have this behavior

  • October 2018: #156029607 We made uncaught errors on the clock log and exit 1.
  • November 2018: #162064721 We made most errors log, but continue to sync and refuse to bump freshness.
  • November 2018: #161800100 We verified this behavior applies to apps with absent stacks.
  • December 2018: A KB Article was written about recovering from this issue
  • May 2020 (Pivotal Slack): We started seeing a rash of these in escalations, with log lines in which cc.diego.sync.processes logged sync-failed and error-updating-lrp-state.