Replies: 4 comments 4 replies
-
Thanks for this great discussion! I hope I can contribute by providing some additional background from a fairly recent effect system that is quite similar to all the JVM ones mentioned above: KotlinX Coroutines.

KotlinX Coroutines has a back-pressured cancellation system, meaning that the following has to hold true: the timeout cannot complete until the non-cancellable block does, so the whole expression takes about 5 seconds, not 2:

```kotlin
withTimeout(2.seconds) {
    // NonCancellable shields the inner block, so the cancellation
    // triggered by the timeout has to wait for the delay to finish
    withContext(NonCancellable) {
        delay(5.seconds)
    }
}
```

In contrast to CE, KotlinX Coroutines cancellation is co-operative, meaning that functions are responsible for checking cancellation. While it's technically possible to build in auto-cancellation, KotlinX Coroutines chooses the co-operative model instead.

I'd be happy to answer any additional questions that may arise from my reply!
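The co-operative contract can be sketched in plain Scala without any coroutine machinery — `CancelFlag` and `countUntilCanceled` are hypothetical names, with the flag standing in for Kotlin's `isActive` check:

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical sketch of co-operative cancellation: the running code
// itself must check the flag; cancel() merely raises it.
final class CancelFlag {
  private val canceled = new AtomicBoolean(false)
  def cancel(): Unit = canceled.set(true)
  def isActive: Boolean = !canceled.get()
}

def countUntilCanceled(flag: CancelFlag, limit: Int): Int = {
  var i = 0
  // The loop is responsible for checking cancellation between steps;
  // if it never checks, cancel() has no effect — that's the contract.
  while (i < limit && flag.isActive) i += 1
  i
}
```

If the loop never consulted `flag.isActive`, `cancel()` would be a no-op — which is exactly the trade-off of co-operative cancellation.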
-
It's understandable that you wanted to move this discussion here from Twitter, but have you considered that limiting answers on Twitter may significantly reduce the visibility of your tweet?
-
This won't add much value beyond historical curiosity, but I just want to let you know that that's actually how the idea came to be: I wanted "Alex's continual, but on demand". You can also look at the finished model as the Haskell model, except with …
-
Some quick thoughts, fwiw…

I think there are two philosophical questions in here which are interesting. The first is whether we want a cooperative or a preemptive model for interruption. Cooperative models are easier to reason about and, by definition, more controlled, but they're also much more prone to resource leaks and have historically been a lot less useful (see: Java's thread interruption for a great example). There is, after all, a very good reason that all kernel schedulers are preemptive in nature.

The second interesting question is whether we want cancelation to be fire-and-forget, or if back-pressuring it is required. Note that even in the case where cancelation is fire-and-forget, some form of back-pressure during finalization is still necessary, otherwise you cannot reasonably have nested finalizers on a single fiber (since the order of evaluation would be nondeterministic). So really this is just the question of whether …

When it comes to both of these, there certainly is not a "right" answer on first principles, and so I tend to fall back on the types of failure modes that I've seen in production. And in that vein, I have sincerely lost count of the number of production outages I've witnessed which ultimately boiled down to a failure to backpressure. Building distributed systems that can self-heal in the face of traffic beyond capacity is a difficult thing, and it's also, in my experience, the most important thing to focus on. Losing an instance is not really a big deal in practice: the scheduler will just kill it for you and you move on. Losing a cluster will often mean losing an entire region, unless you can either perform some globally selective traffic rerouting (using something like Envoy), or you have the infrastructural robustness to deploy and spin up duplicate clusters easily. And this is really what it comes down to for me.

When you fail to backpressure reliably, not only do you leak resources, but you also lose the ability to reliably load-shed. Failing to reliably load-shed is how clusters bomb out and fall into negative feedback cycles under pressure. At the opposite end of the spectrum, over-backpressuring and being too speculative about possible resource leaks will sometimes result in reduced instance utilization, up to and potentially including instance death (if you have a bug which blocks finalization indefinitely), but there is no conceivable circumstance in which over-backpressuring results in negative feedback cycles at the system level. At the very least, I've never seen anything remotely like that in practice.

So to me, the tie-breaker is just that: what kind of outage would you rather have? Losing an instance is automatically recoverable and doesn't even need to ping PagerDuty. Losing a cluster to a negative feedback loop is not recoverable, has a huge blast radius (often also affecting up- and downstream dependencies), and in many cases can only be solved by redirecting inbound traffic at the edges. I would much, much rather bias in favor of instance loss.

Which is to say that I don't agree that leaking a process is worse than leaking a resource. :-) Leaking a resource is worse, because of how it interacts in practice with the broader infrastructure of a distributed system.
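The nested-finalizer point can be illustrated with a minimal, library-free Scala sketch — `Fiber` here is a hypothetical toy, not the Cats-Effect one: when the canceler waits on finalization, nested finalizers run in a fixed LIFO order rather than racing.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical sketch: finalizers are registered like nested brackets,
// and a back-pressured cancel runs them deterministically, innermost first.
final class Fiber {
  private var finalizers: List[() => Unit] = Nil
  def onCancel(f: () => Unit): Unit = finalizers = f :: finalizers

  // Back-pressured cancel: the caller returns only after every
  // finalizer has run, in LIFO (innermost-first) order.
  def cancelAndWait(): Unit = {
    finalizers.foreach(_.apply())
    finalizers = Nil
  }
}

val order = ListBuffer.empty[String]
val fiber = new Fiber
fiber.onCancel(() => order += "outer") // registered first (outer bracket)
fiber.onCancel(() => order += "inner") // registered second (nested inside)
fiber.cancelAndWait()
// order is now List("inner", "outer"): innermost finalizer first
```

A fire-and-forget cancel that merely signaled the fiber would give no such ordering guarantee to the canceler.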
-
In Monix (and Cats-Effect) it's `cancelation`, with a single `l`, because I'm lazy, and my spell checkers have been set to `en-US`.

Please provide your input

I need thoughts on the Cats Effect 3 cancelation model.

What positive or negative experiences have you had, in production, with it? What about the prior Cats-Effect 2?
History
Cats Effect 3 has changed the cancelation model, compared with Cats Effect 2, being now closer to that of ZIO. This happened in 2 acts:

1. `bracket` was introduced for managing resources, inspired by Haskell, but with ZIO semantics (e.g., `acquire` and `release` not being cancelable); the `flatMap`-driven run-loop became auto-cancelable, so there was an urgent need to handle resources safely; and note that Monix had a simpler, more cooperative model that was not invalidating `flatMap` (aka the "continual model", which failed to gain traction);
2. `bracket` is insufficient, as we needed a cancelable `acquire` and `release` in `bracket`, so Fabio proposed interruptible/uninterruptible regions; this inspired ZIO too, and became the design of Cats-Effect 3; the `uncancelable` primitive is quite nice, actually; it's like Fabio wanted to reconcile `bracket` and auto-cancelable run-loops with my "continual" idea.

You can read: Let's discuss the cancelation model for my objections to the current behavior. There I'm explaining why the new model is very surprising, actually, not being what users expect, invalidating the idea behind cancelation (in Monix's original design), in order to optimize for resource management via `bracket`-like abstractions.

And this is where we have an ideological issue…
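The `bracket` shape from act 1 reduces, in spirit, to try/finally; here is a minimal, library-free sketch — a hypothetical helper, not the real `cats.effect` signature, and without the uncancelability guarantees:

```scala
// Hypothetical, simplified bracket (not cats.effect.IO.bracket):
// `release` always runs, whether `use` succeeds or throws. In Cats-Effect,
// `acquire` and `release` are additionally uncancelable.
def bracket[A, B](acquire: => A)(use: A => B)(release: A => Unit): B = {
  val a = acquire
  try use(a)
  finally release(a)
}

var released = false
val result =
  try bracket("resource")(_ => throw new RuntimeException("boom"))(_ => released = true)
  catch { case _: RuntimeException => "recovered" }
// released is now true: the finalizer ran despite the failure in `use`
```

What `uncancelable` regions add on top of this shape is the ability to mark arbitrary sub-regions (not just `acquire`/`release`) as immune to interruption.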
Note, here I'm talking about Cats-Effect 3, as I'm not familiar with the latest ZIO, but given the former's influences, I suspect it's still the same.
Prior art
I might be wrong, this is highly subjective, but the lingering thought on my mind is the prior art:

- `IO` does not back-pressure cancelation; cancelation is just a `promise.failure(new CancelationException)`, but it's possible to close any resources prematurely in the producer, of course;
- Scala's `Future` isn't directly cancelable, but that's because cancelation can be achieved by having access to its `Promise` (see the reasoning for why `Future` is not cancelable); note, implementing cancelation with `Promise` involves sending just a signal, e.g., `promise.complete(Failure(new CancelationException))`;
- Python's asyncio cancels tasks via `CancelledError`; methods like `timeout` do wait for a task to be "actually canceled" (per their docs), but this depends on how tasks get implemented, as cancelation itself isn't a task. Canceling `shield(task)` will cancel the `shield` (the task's protection against cancelation), and not the task being shielded; therefore, the task can keep on running, while `timeout` returns immediately;
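The `Future`/`Promise` point above can be sketched with just the standard library: cancelation is a mere signal, and nothing back-pressures on the producer.

```scala
import java.util.concurrent.CancellationException
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._
import scala.util.{Failure, Try}

// Cancelation as a plain signal over scala.concurrent.Promise:
// completing the Promise with a failure "cancels" the Future, but the
// producer keeps running and must tear down its own resources.
val p = Promise[Int]()
val canceled = p.tryComplete(Failure(new CancellationException("canceled")))

// A later result from the still-running producer is simply ignored:
val producerWon = p.trySuccess(42)
```

Here `canceled` is `true` and `producerWon` is `false`: the first completion wins, the consumer observes the `CancellationException`, and any resources the producer still holds are its own problem — which is exactly the leak concern discussed below.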
returns immediately;There is one exception that I'm noticing — Kotlin's Coroutines — their
cancel()
operation doesn't back-pressure, but it's implemented via an exception, and you can thenjoin()
, so I believe cancelAndJoin() back-pressures the finalizers, but I'll have to check (TODO).Why doesn't the industry back-pressure cancelation?
The prior art for how cancelation gets implemented in the industry is overwhelming. Why are we doing this in the Scala/FP community? Doesn't the industry fear leaked resources, given that's the whole point of having a cancelation mechanism? Did they just copy from each other? Do they have different concerns? Or could it be that back-pressuring cancelation is a bad idea for actual I/O (e.g., like what happens with the protocol for TCP close)?
It might be that Cats-Effect is in groundbreaking-innovation territory, being inspired by prior art from both the JVM and Haskell.
What comes next?
Having abstractions like `Resource` is a super-power, which is why Monix has to make an effort in making `Task` compatible with Cats-Effect's type classes, in order to benefit from what's out there.

But what about the rest of the project?
Looking at `monix.execution.cancelables` and `monix.reactive`, it seems pretty clear to me that the story would get much more complicated. I'm getting the feeling that implementing back-pressuring in `Cancelable` / `CancelableFuture` / `Observable` isn't just hard (due to resource management, and the potential for memory leaks), but actually wrong.

This is because, even with `Resource`, you still need a way to avoid back-pressuring (e.g., with an installed `timeout`), while chaining those finalizers. This needs to be configurable. That is possible with `Task`, since the run-loop is controlled. But everything else is a different story entirely.

The whole notion that, if it's async, you can simply not wait, is wrong — unless the run-loop accepts injecting a configuration for "not waiting".
`Task` could do that. But for everything else, I have serious doubts, and there are excellent reasons for them.

Maybe we can inject such configuration via the `Scheduler` 🤷‍♂️ To be determined…

In this discussion, I'm trying to document the thought process of what Monix's cancelation model should be. I will also follow up with implementation samples, pros/cons, etc. And I'm hoping for input.
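As a strawman for the "injected configuration" idea above, here is a minimal Scala sketch — all names (`CancelConfig`, `CancelToken`, `chain`) are hypothetical, not Monix APIs — of a cancel operation whose waiting behavior is configurable:

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical configuration injected into a cancel operation:
// whether the canceler waits on the chained finalizers or not.
final case class CancelConfig(awaitFinalizers: Boolean)

final class CancelToken {
  private var finalizers: List[() => Unit] = Nil
  @volatile var signaled = false

  // Chain a finalizer, like nested brackets would.
  def chain(f: () => Unit): Unit = finalizers = f :: finalizers

  def cancel(config: CancelConfig): Unit = {
    signaled = true
    if (config.awaitFinalizers)
      finalizers.foreach(_.apply()) // back-pressured: caller waits
    else
      new Thread(() => finalizers.foreach(_.apply())).start() // fire-and-forget
  }
}
```

A run-loop (or a `Scheduler`) could carry such a configuration, so that `Task` back-pressures by default while `Observable`-like producers opt out — this is only a sketch of the design question, not a proposal for the actual API.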