Runtime Conditions #293

Open

krisnova opened this issue Jan 17, 2023 · 6 comments

Comments

@krisnova
Contributor

We will need to bake in a way for pods, cells, etc. to support generic runtime conditions that must remain true for the duration of execution.

For example, we may want an in-memory cache to only "run" as long as at least a configurable amount of memory is available on the system.

These conditions will likely need to be extensible. We will want the ability to check status via various mechanisms: remote APIs, network connectivity, local health checks, remote health checks, and so on.

What is the best way to define these conditions in Aurae? Do we want to implement a "reverse health check" style system that follows a proof-of-exhaustion set of checks and breaks if something fails?
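
As a strawman, a condition could just be a small trait that anything (memory checks, remote API probes, connectivity tests) can implement. A minimal sketch in Rust; every name here is hypothetical, nothing below is existing Aurae API:

```rust
use std::time::Duration;

/// A condition that must hold for a workload to keep running.
trait RuntimeCondition {
    /// Human-readable name for logs and status reporting.
    fn name(&self) -> &str;

    /// Ok(()) while the condition holds, Err(reason) otherwise.
    fn check(&self) -> Result<(), String>;

    /// How often the condition should be re-evaluated.
    fn interval(&self) -> Duration {
        Duration::from_secs(5)
    }
}

/// Example: keep running only while at least `min_bytes` of memory is free.
struct MinFreeMemory {
    min_bytes: u64,
}

impl RuntimeCondition for MinFreeMemory {
    fn name(&self) -> &str {
        "min-free-memory"
    }

    fn check(&self) -> Result<(), String> {
        // One Linux-specific way to read available memory; a real
        // implementation would parse /proc/meminfo more carefully.
        let meminfo = std::fs::read_to_string("/proc/meminfo")
            .map_err(|e| format!("cannot read /proc/meminfo: {e}"))?;
        let available_bytes = meminfo
            .lines()
            .find(|l| l.starts_with("MemAvailable:"))
            .and_then(|l| l.split_whitespace().nth(1))
            .and_then(|kb| kb.parse::<u64>().ok())
            .map(|kb| kb * 1024)
            .ok_or("MemAvailable not found in /proc/meminfo")?;
        if available_bytes >= self.min_bytes {
            Ok(())
        } else {
            Err(format!(
                "only {available_bytes} bytes available, need {}",
                self.min_bytes
            ))
        }
    }
}
```

The "reverse health check" would then just iterate the registered conditions and break on the first failure.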

@krisnova
Contributor Author

Note: We will likely need these at the "service level" as well as at the "node level".

@dmah42
Contributor

dmah42 commented Jan 17, 2023

this sounds like a scheduling problem ("run cache on nodes with >X GiB available").

am I thinking of something different?

@krisnova
Contributor Author

I was thinking more about failure modes.

"Run sidekiq as long as we can talk to the database"

I think we want to "fail quickly" in these situations, so that scheduling mechanisms can quickly try to address whatever problem is going on.
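
A sketch of what that fail-fast loop could look like, reusing the hypothetical `RuntimeCondition` trait from the issue body; `TcpReachable` and `supervise` are made-up names, not existing Aurae API:

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Example: keep running only while the database accepts TCP connections.
struct TcpReachable {
    addr: String,
}

impl RuntimeCondition for TcpReachable {
    fn name(&self) -> &str {
        "tcp-reachable"
    }

    fn check(&self) -> Result<(), String> {
        let addr: SocketAddr = self
            .addr
            .parse()
            .map_err(|e| format!("bad address {}: {e}", self.addr))?;
        TcpStream::connect_timeout(&addr, Duration::from_secs(2))
            .map(|_| ())
            .map_err(|e| format!("cannot reach {}: {e}", self.addr))
    }
}

/// Poll all conditions; on the first failure, stop the workload immediately
/// so the scheduler can react, rather than limping along degraded.
fn supervise(conditions: &[Box<dyn RuntimeCondition>], stop_workload: impl Fn(&str)) {
    loop {
        for cond in conditions {
            if let Err(reason) = cond.check() {
                stop_workload(&format!("{}: {reason}", cond.name()));
                return;
            }
        }
        std::thread::sleep(Duration::from_secs(1));
    }
}
```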

@krisnova
Contributor Author

Maybe a better example:

"Point all traffic at production as long as the backend is online, otherwise fail over to the replica"

I am unsure if this is a step in the "Turing-complete YAML" direction again -- this is just a thought I had.
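
One way to stay out of Turing-complete-YAML territory might be to attach a closed, enumerable on-failure action to each condition instead of arbitrary logic. A hypothetical sketch:

```rust
/// A closed set of declarative on-failure actions attached to a condition.
enum OnFailure {
    /// Stop the workload and let the scheduler reschedule it.
    Stop,
    /// Redirect traffic to a named fallback target, e.g. a replica.
    FailOver { target: String },
}

struct ConditionSpec {
    /// Which registered condition to evaluate, e.g. "backend-online".
    condition: String,
    /// What to do when it stops holding.
    on_failure: OnFailure,
}
```

Because the set of actions is closed, the config stays data rather than becoming a program.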

@dmah42
Contributor

dmah42 commented Jan 17, 2023

i think of all of these as scheduling issues. something needs to monitor the jobs that were started, and if they're no longer running (or if the service returns a failure code) then it needs to be rescheduled.

what we might need is plumbing from "job" to outer aurae health check/service discovery.
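
that plumbing could be as simple as a node-local registry the supervisor writes into and health check / service discovery reads from. a hypothetical sketch, not existing aurae API:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

#[derive(Clone, Debug)]
enum JobHealth {
    Healthy,
    Failed { condition: String, reason: String },
}

/// Node-local registry: supervisors write job health in, the health check /
/// service discovery endpoints read it out.
#[derive(Clone, Default)]
struct HealthRegistry {
    inner: Arc<Mutex<HashMap<String, JobHealth>>>,
}

impl HealthRegistry {
    fn report(&self, job: &str, health: JobHealth) {
        self.inner.lock().unwrap().insert(job.to_string(), health);
    }

    /// What service discovery would consult before routing traffic to this
    /// node's instance of `job`.
    fn is_healthy(&self, job: &str) -> bool {
        matches!(
            self.inner.lock().unwrap().get(job),
            Some(JobHealth::Healthy)
        )
    }
}
```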

@krisnova
Contributor Author

So, think about edge networking and failures.

What do we do if a "node goes away"? We should have some basic guarantees that a service won't end up running in 2 places just because WireGuard broke, for example.
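
One way to get that guarantee is to make a control-plane lease just another runtime condition: the partitioned node stops its copy locally before the lease TTL expires, and the scheduler never reschedules before the TTL. A hypothetical sketch; `Lease` and its fields are made up:

```rust
use std::time::{Duration, Instant};

/// A time-bounded lease renewed against the control plane. The scheduler
/// promises not to place the service elsewhere until `ttl` elapses without
/// a renewal, so a partitioned node always stops its copy first.
struct Lease {
    /// Last successful renewal.
    renewed_at: Instant,
    /// How long the scheduler waits before rescheduling the service.
    ttl: Duration,
    /// Stop locally this much before the TTL, so clock skew and shutdown
    /// latency can't open a window with two running copies.
    safety_margin: Duration,
}

impl Lease {
    /// True once renewal has been failing long enough that we must stop the
    /// local copy to preserve the "runs in at most one place" guarantee.
    fn must_stop(&self) -> bool {
        self.renewed_at.elapsed() + self.safety_margin >= self.ttl
    }
}
```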
