Runtime Conditions #293

Open

krisnova opened this issue Jan 17, 2023 · 6 comments

Comments

@krisnova
Contributor

We will need to bake in a way for pods, cells, etc. to support generic runtime conditions that must remain true for the duration of execution.

For example, we may want an in-memory cache to only "run" as long as at least a configurable amount of memory is available on the system.

These conditions will likely need to be extensible. We will want the ability to check status via various mechanisms: remote APIs, network connectivity, local health checks, remote health checks, and so on.

What is the best way to define these conditions in Aurae? Do we want to implement a "reverse health check" style system that follows a proof-of-exhaustion set of checks and breaks if something fails?
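
As a strawman, a condition could just be a small trait that anything (memory checks, remote API probes, connectivity tests) can implement. A minimal sketch in Rust; every name here is hypothetical, nothing below is existing Aurae API:

```rust
use std::time::Duration;

/// A condition that must hold for a workload to keep running.
trait RuntimeCondition {
    /// Human-readable name for logs and status reporting.
    fn name(&self) -> &str;

    /// Ok(()) while the condition holds, Err(reason) otherwise.
    fn check(&self) -> Result<(), String>;

    /// How often the condition should be re-evaluated.
    fn interval(&self) -> Duration {
        Duration::from_secs(5)
    }
}

/// Example: keep running only while at least `min_bytes` of memory is free.
struct MinFreeMemory {
    min_bytes: u64,
}

impl RuntimeCondition for MinFreeMemory {
    fn name(&self) -> &str {
        "min-free-memory"
    }

    fn check(&self) -> Result<(), String> {
        // One Linux-specific way to read available memory; a real
        // implementation would parse /proc/meminfo more carefully.
        let meminfo = std::fs::read_to_string("/proc/meminfo")
            .map_err(|e| format!("cannot read /proc/meminfo: {e}"))?;
        let available_bytes = meminfo
            .lines()
            .find(|l| l.starts_with("MemAvailable:"))
            .and_then(|l| l.split_whitespace().nth(1))
            .and_then(|kb| kb.parse::<u64>().ok())
            .map(|kb| kb * 1024)
            .ok_or("MemAvailable not found in /proc/meminfo")?;
        if available_bytes >= self.min_bytes {
            Ok(())
        } else {
            Err(format!(
                "only {available_bytes} bytes available, need {}",
                self.min_bytes
            ))
        }
    }
}
```

The "reverse health check" would then just iterate the registered conditions and break on the first failure.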

@krisnova
Contributor Author

Note: We will likely need these at the "service level" as well as at the "node level".

@dmah42
Contributor

dmah42 commented Jan 17, 2023

this sounds like a scheduling problem ("run cache on nodes with >X GiB available").

am I thinking of something different?

@krisnova
Contributor Author

I was thinking more about failure modes.

"Run sidekiq as long as we can talk to the database"

I think we want to "fail quickly" in these situations, so that scheduling mechanisms can quickly try to address whatever problem is going on.
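
A sketch of what that fail-fast loop could look like, reusing the hypothetical `RuntimeCondition` trait from the issue body; `TcpReachable` and `supervise` are made-up names, not existing Aurae API:

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Example: keep running only while the database accepts TCP connections.
struct TcpReachable {
    addr: String,
}

impl RuntimeCondition for TcpReachable {
    fn name(&self) -> &str {
        "tcp-reachable"
    }

    fn check(&self) -> Result<(), String> {
        let addr: SocketAddr = self
            .addr
            .parse()
            .map_err(|e| format!("bad address {}: {e}", self.addr))?;
        TcpStream::connect_timeout(&addr, Duration::from_secs(2))
            .map(|_| ())
            .map_err(|e| format!("cannot reach {}: {e}", self.addr))
    }
}

/// Poll all conditions; on the first failure, stop the workload immediately
/// so the scheduler can react, rather than limping along degraded.
fn supervise(conditions: &[Box<dyn RuntimeCondition>], stop_workload: impl Fn(&str)) {
    loop {
        for cond in conditions {
            if let Err(reason) = cond.check() {
                stop_workload(&format!("{}: {reason}", cond.name()));
                return;
            }
        }
        std::thread::sleep(Duration::from_secs(1));
    }
}
```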

@krisnova
Contributor Author

Maybe a better example:

"Point all traffic at production as long as the backend is online, otherwise fail over to the replica"

I am unsure if this is a step in the "Turing-complete YAML" direction again -- this is just a thought I had.
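
One way to stay out of Turing-complete-YAML territory might be to attach a closed, enumerable on-failure action to each condition instead of arbitrary logic. A hypothetical sketch:

```rust
/// A closed set of declarative on-failure actions attached to a condition.
enum OnFailure {
    /// Stop the workload and let the scheduler reschedule it.
    Stop,
    /// Redirect traffic to a named fallback target, e.g. a replica.
    FailOver { target: String },
}

struct ConditionSpec {
    /// Which registered condition to evaluate, e.g. "backend-online".
    condition: String,
    /// What to do when it stops holding.
    on_failure: OnFailure,
}
```

Because the set of actions is closed, the config stays data rather than becoming a program.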

@dmah42
Contributor

dmah42 commented Jan 17, 2023

i think of all of these as scheduling issues. something needs to monitor the jobs that were started, and if they're no longer running (or if the service returns a failure code) then it needs to be rescheduled.

what we might need is plumbing from "job" to outer aurae health check/service discovery.
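
that plumbing could be as simple as a node-local registry the supervisor writes into and health check / service discovery reads from. a hypothetical sketch, not existing aurae API:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

#[derive(Clone, Debug)]
enum JobHealth {
    Healthy,
    Failed { condition: String, reason: String },
}

/// Node-local registry: supervisors write job health in, the health check /
/// service discovery endpoints read it out.
#[derive(Clone, Default)]
struct HealthRegistry {
    inner: Arc<Mutex<HashMap<String, JobHealth>>>,
}

impl HealthRegistry {
    fn report(&self, job: &str, health: JobHealth) {
        self.inner.lock().unwrap().insert(job.to_string(), health);
    }

    /// What service discovery would consult before routing traffic to this
    /// node's instance of `job`.
    fn is_healthy(&self, job: &str) -> bool {
        matches!(
            self.inner.lock().unwrap().get(job),
            Some(JobHealth::Healthy)
        )
    }
}
```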

@krisnova
Contributor Author

So, think about edge networking and failures.

What do we do if a "node goes away"? We should have some basic guarantees that a service won't end up running in 2 places just because WireGuard broke, for example.
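
One way to get that guarantee is to make a control-plane lease just another runtime condition: the partitioned node stops its copy locally before the lease TTL expires, and the scheduler never reschedules before the TTL. A hypothetical sketch; `Lease` and its fields are made up:

```rust
use std::time::{Duration, Instant};

/// A time-bounded lease renewed against the control plane. The scheduler
/// promises not to place the service elsewhere until `ttl` elapses without
/// a renewal, so a partitioned node always stops its copy first.
struct Lease {
    /// Last successful renewal.
    renewed_at: Instant,
    /// How long the scheduler waits before rescheduling the service.
    ttl: Duration,
    /// Stop locally this much before the TTL, so clock skew and shutdown
    /// latency can't open a window with two running copies.
    safety_margin: Duration,
}

impl Lease {
    /// True once renewal has been failing long enough that we must stop the
    /// local copy to preserve the "runs in at most one place" guarantee.
    fn must_stop(&self) -> bool {
        self.renewed_at.elapsed() + self.safety_margin >= self.ttl
    }
}
```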
