Replies: 4 comments 4 replies
-
Thanks for this great discussion! I hope I can contribute by providing some additional background from a fairly recent effect system that is quite similar to all the JVM ones mentioned above: KotlinX Coroutines.

KotlinX Coroutines has a back-pressured cancellation system, meaning that the following has to hold true: the timeout cannot complete until the non-cancellable block does, so the whole expression takes about 5 seconds, not 2:

```kotlin
withTimeout(2.seconds) {
    // NonCancellable shields the inner block, so the cancellation
    // triggered by the timeout has to wait for the delay to finish
    withContext(NonCancellable) {
        delay(5.seconds)
    }
}
```

In contrast to CE, KotlinX Coroutines cancellation is co-operative, meaning that functions are responsible for checking cancellation. While it's technically possible to build in auto-cancellation, KotlinX Coroutines chooses the co-operative model instead.

I'd be happy to answer any additional questions that may arise from my reply!
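The co-operative contract can be sketched in plain Scala without any coroutine machinery — `CancelFlag` and `countUntilCanceled` are hypothetical names, with the flag standing in for Kotlin's `isActive` check:

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical sketch of co-operative cancellation: the running code
// itself must check the flag; cancel() merely raises it.
final class CancelFlag {
  private val canceled = new AtomicBoolean(false)
  def cancel(): Unit = canceled.set(true)
  def isActive: Boolean = !canceled.get()
}

def countUntilCanceled(flag: CancelFlag, limit: Int): Int = {
  var i = 0
  // The loop is responsible for checking cancellation between steps;
  // if it never checks, cancel() has no effect — that's the contract.
  while (i < limit && flag.isActive) i += 1
  i
}
```

If the loop never consulted `flag.isActive`, `cancel()` would be a no-op — which is exactly the trade-off of co-operative cancellation.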
-
It's understandable that you wanted to move this discussion here from Twitter, but have you considered that limiting answers on Twitter may significantly reduce the visibility of your tweet?
-
This won't add much value beyond historical curiosity, but I just want to let you know that that's actually how the idea came to be: I wanted "Alex's continual, but on demand". You can also look at the finished model as the Haskell model, except with …
-
Some quick thoughts, fwiw…

I think there are two philosophical questions in here which are interesting. The first is whether we want a cooperative or a preemptive model for interruption. Cooperative models are easier to reason about and, by definition, more controlled, but they're also much more prone to resource leaks and have historically been a lot less useful (see: Java's thread interruption for a great example). There is, after all, a very good reason that all kernel schedulers are preemptive in nature.

The second interesting question is whether we want cancelation to be fire-and-forget, or if back-pressuring it is required. Note that even in the case where cancelation is fire-and-forget, some form of back-pressure during finalization is still necessary, otherwise you cannot reasonably have nested finalizers on a single fiber (since the order of evaluation would be nondeterministic). So really this is just the question of whether …

When it comes to both of these, there certainly is not a "right" answer on first principles, and so I tend to fall back on the types of failure modes that I've seen in production. And in that vein, I have sincerely lost count of the number of production outages I've witnessed which ultimately boiled down to a failure to backpressure. Building distributed systems that can self-heal in the face of traffic beyond capacity is a difficult thing, and it's also, in my experience, the most important thing to focus on. Losing an instance is not really a big deal in practice: the scheduler will just kill it for you and you move on. Losing a cluster will often mean losing an entire region, unless you can either perform some globally selective traffic rerouting (using something like Envoy), or you have the infrastructural robustness to deploy and spin up duplicate clusters easily. And this is really what it comes down to for me.

When you fail to backpressure reliably, not only do you leak resources, but you also lose the ability to reliably load-shed. Failing to reliably load-shed is how clusters bomb out and fall into negative feedback cycles under pressure. At the opposite end of the spectrum, over-backpressuring and being too speculative about possible resource leaks will sometimes result in reduced instance utilization, up to and potentially including instance death (if you have a bug which blocks finalization indefinitely), but there is no conceivable circumstance in which over-backpressuring results in negative feedback cycles at the system level. At the very least, I've never seen anything remotely like that in practice.

So to me, the tie-breaker is just that: what kind of outage would you rather have? Losing an instance is automatically recoverable and doesn't even need to ping PagerDuty. Losing a cluster to a negative feedback loop is not recoverable, has a huge blast radius (often also affecting up- and downstream dependencies), and in many cases can only be solved by redirecting inbound traffic at the edges. I would much, much rather bias in favor of instance loss.

Which is to say that I don't agree that leaking a process is worse than leaking a resource. :-) Leaking a resource is worse, because of how it interacts in practice with the broader infrastructure of a distributed system.
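The nested-finalizer point can be illustrated with a minimal, library-free Scala sketch — `Fiber` here is a hypothetical toy, not the Cats-Effect one: when the canceler waits on finalization, nested finalizers run in a fixed LIFO order rather than racing.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical sketch: finalizers are registered like nested brackets,
// and a back-pressured cancel runs them deterministically, innermost first.
final class Fiber {
  private var finalizers: List[() => Unit] = Nil
  def onCancel(f: () => Unit): Unit = finalizers = f :: finalizers

  // Back-pressured cancel: the caller returns only after every
  // finalizer has run, in LIFO (innermost-first) order.
  def cancelAndWait(): Unit = {
    finalizers.foreach(_.apply())
    finalizers = Nil
  }
}

val order = ListBuffer.empty[String]
val fiber = new Fiber
fiber.onCancel(() => order += "outer") // registered first (outer bracket)
fiber.onCancel(() => order += "inner") // registered second (nested inside)
fiber.cancelAndWait()
// order is now List("inner", "outer"): innermost finalizer first
```

A fire-and-forget cancel that merely signaled the fiber would give no such ordering guarantee to the canceler.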
-
In Monix (and Cats-Effect) it's `cancelation`, with a single `l`, because I'm lazy, and my spell checkers have been set to `en-US`.

Please provide your input

I need thoughts on the Cats Effect 3 cancelation model.

What positive or negative experiences have you had, in production, with it? What about the prior Cats-Effect 2?
History
Cats Effect 3 has changed the cancelation model, compared with Cats Effect 2, being now closer to that of ZIO. This happened in 2 acts:

1. `bracket` was introduced for managing resources, inspired by Haskell, but with ZIO semantics (e.g., `acquire` and `release` not being cancelable); the `flatMap`-driven run-loop became auto-cancelable, so there was an urgent need to handle resources safely; and note that Monix had a simpler, more cooperative model that was not invalidating `flatMap` (aka the "continual model", which failed to gain traction);
2. `bracket` is insufficient, as we needed a cancelable `acquire` and `release` in `bracket`, so Fabio proposed interruptible/uninterruptible regions; this inspired ZIO too, and became the design of Cats-Effect 3; the `uncancelable` primitive is quite nice, actually; it's like Fabio wanted to reconcile `bracket` and auto-cancelable run-loops with my "continual" idea.

You can read: Let's discuss the cancelation model for my objections to the current behavior. There I'm explaining why the new model is very surprising, actually, not being what users expect, invalidating the idea behind cancelation (in Monix's original design), in order to optimize for resource management via `bracket`-like abstractions.

And this is where we have an ideological issue…
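The `bracket` shape from act 1 reduces, in spirit, to try/finally; here is a minimal, library-free sketch — a hypothetical helper, not the real `cats.effect` signature, and without the uncancelability guarantees:

```scala
// Hypothetical, simplified bracket (not cats.effect.IO.bracket):
// `release` always runs, whether `use` succeeds or throws. In Cats-Effect,
// `acquire` and `release` are additionally uncancelable.
def bracket[A, B](acquire: => A)(use: A => B)(release: A => Unit): B = {
  val a = acquire
  try use(a)
  finally release(a)
}

var released = false
val result =
  try bracket("resource")(_ => throw new RuntimeException("boom"))(_ => released = true)
  catch { case _: RuntimeException => "recovered" }
// released is now true: the finalizer ran despite the failure in `use`
```

What `uncancelable` regions add on top of this shape is the ability to mark arbitrary sub-regions (not just `acquire`/`release`) as immune to interruption.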
Note, here I'm talking about Cats-Effect 3, as I'm not familiar with the latest ZIO, but given the former's influences, I suspect it's still the same.
Prior art
I might be wrong, this is highly subjective, but the lingering thought on my mind is the prior art:

- `IO` does not back-pressure cancelation; cancelation is just a `promise.failure(new CancelationException)`, but it's possible to close any resources prematurely in the producer, of course;
- Scala's `Future` isn't directly cancelable, but that's because cancelation can be achieved by having access to its `Promise` (see the reasoning for why `Future` is not cancelable); note, implementing cancelation with `Promise` involves sending just a signal, e.g., `promise.complete(Failure(new CancelationException))`;
- Python's asyncio cancels tasks via `CancelledError`; methods like `timeout` do wait for a task to be "actually canceled" (per their docs), but this depends on how tasks get implemented, as cancelation itself isn't a task. Canceling `shield(task)` will cancel the `shield` (the task's protection against cancelation), and not the task being shielded; therefore, the task can keep on running, while `timeout` returns immediately;
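The `Future`/`Promise` point above can be sketched with just the standard library: cancelation is a mere signal, and nothing back-pressures on the producer.

```scala
import java.util.concurrent.CancellationException
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._
import scala.util.{Failure, Try}

// Cancelation as a plain signal over scala.concurrent.Promise:
// completing the Promise with a failure "cancels" the Future, but the
// producer keeps running and must tear down its own resources.
val p = Promise[Int]()
val canceled = p.tryComplete(Failure(new CancellationException("canceled")))

// A later result from the still-running producer is simply ignored:
val producerWon = p.trySuccess(42)
```

Here `canceled` is `true` and `producerWon` is `false`: the first completion wins, the consumer observes the `CancellationException`, and any resources the producer still holds are its own problem — which is exactly the leak concern discussed below.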
returns immediately;There is one exception that I'm noticing — Kotlin's Coroutines — their
cancel()
operation doesn't back-pressure, but it's implemented via an exception, and you can thenjoin()
, so I believe cancelAndJoin() back-pressures the finalizers, but I'll have to check (TODO).Why doesn't the industry back-pressure cancelation?
The prior art for how cancelation gets implemented in the industry is overwhelming. Why are we doing this in the Scala/FP community? Doesn't the industry fear leaked resources, given that's the whole point of having a cancelation mechanism? Did they just copy from each other? Do they have different concerns? Or could it be that back-pressuring cancelation is a bad idea for actual I/O (e.g., like what happens with the protocol for TCP close)?
It might be that Cats-Effect is in groundbreaking-innovation territory, being inspired by prior art from both the JVM and Haskell.
What comes next?
Having abstractions like `Resource` is a super-power, which is why Monix has to make an effort in making `Task` compatible with Cats-Effect's type classes, in order to benefit from what's out there.

But what about the rest of the project?
Looking at `monix.execution.cancelables` and `monix.reactive`, it seems pretty clear to me that the story would get much more complicated. I'm getting the feeling that implementing back-pressuring in `Cancelable` / `CancelableFuture` / `Observable` isn't just hard (due to resource management, and the potential for memory leaks), but actually wrong.

This is because, even with `Resource`, you still need a way to avoid back-pressuring (e.g., with an installed `timeout`), while chaining those finalizers. This needs to be configurable. That is possible with `Task`, since the run-loop is controlled. But everything else is a different story entirely.

The whole notion that, if it's async, you can simply not wait, is wrong — unless the run-loop accepts injecting a configuration for "not waiting".
`Task` could do that. But for everything else, I have serious doubts, and there are excellent reasons for them.

Maybe we can inject such configuration via the `Scheduler` 🤷‍♂️ To be determined…

In this discussion, I'm trying to document the thought process of what Monix's cancelation model should be. I will also follow up with implementation samples, pros/cons, etc. And I'm hoping for input.
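As a strawman for the "injected configuration" idea above, here is a minimal Scala sketch — all names (`CancelConfig`, `CancelToken`, `chain`) are hypothetical, not Monix APIs — of a cancel operation whose waiting behavior is configurable:

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical configuration injected into a cancel operation:
// whether the canceler waits on the chained finalizers or not.
final case class CancelConfig(awaitFinalizers: Boolean)

final class CancelToken {
  private var finalizers: List[() => Unit] = Nil
  @volatile var signaled = false

  // Chain a finalizer, like nested brackets would.
  def chain(f: () => Unit): Unit = finalizers = f :: finalizers

  def cancel(config: CancelConfig): Unit = {
    signaled = true
    if (config.awaitFinalizers)
      finalizers.foreach(_.apply()) // back-pressured: caller waits
    else
      new Thread(() => finalizers.foreach(_.apply())).start() // fire-and-forget
  }
}
```

A run-loop (or a `Scheduler`) could carry such a configuration, so that `Task` back-pressures by default while `Observable`-like producers opt out — this is only a sketch of the design question, not a proposal for the actual API.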