
Proposal: Avoid blocking sends to a channel without timeout #131

Open
prashantv opened this issue Jul 9, 2021 · 8 comments

Comments

@prashantv
Contributor

prashantv commented Jul 9, 2021

Semi-related to Channel Size is One or None

We've seen cases of deadlocks/unexpected blocking caused by blocking sends to a channel (example: tchannel-go#528). It's generally much safer to use a select with another case for timeout / ctx.Done(), or a non-blocking send.

There are some use cases where it's safe; typically when the number of writers is bounded by the channel capacity and can be verified easily, so I'm not sure how generally applicable this is.

@peterbourgon

peterbourgon commented Jul 10, 2021

Oof, I hope not!

I can't speak to specific packages that may serve specific purposes for you, but in general, it should be very rare that you make a nonblocking send on a channel. Go already gives you async where it matters, on syscalls; you want your application code to be nice and synchronous, i.e. deterministic, so that it takes advantage of the runtime. In the hopefully rare cases you use channels to communicate between components, they should almost always be blocking sends and recvs on chans with capacity zero.

Buffered channels are usually queues, and queues exist solely to solve burstiness. That means they belong at the edges of a system, not within it 😇

@prashantv
Contributor Author

This issue isn't to indicate that all sends must be non-blocking, but that blocking sends probably should have a fallback, e.g., context cancellation or timeout.

@peterbourgon

blocking sends probably should have a fallback, e.g., context cancellation or timeout.

When a chan send occurs in a context-cancelable context 😉 I agree, you want that cancelation respected. But I don't think it's correct to claim blocking sends should have a fallback/timeout generally.

@mway
Member

mway commented Jul 12, 2021

It's true that there's not really a "one-size-fits-all" approach, but I don't think the issue is whether the send is sync/async, or whether channel usage is error-prone: the issue is that it isn't defensive - it's fragile - unless you add an unblocking strategy, and channel capacity is not an unblocking strategy.

In any event, I would argue that one should have a good reason for not having any control mechanisms in selects - avoiding them should be intentional, not the default, especially if there is any API exposure to users that could cause unexpected behavior (deadlocks, goroutine leaks, close-send races, etc.). Even for situations without context propagation/timeout/etc. plumbing, having a basic "has a done chan struct{}, which is managed by Close()" pattern would be an improvement over YOLO mode, such that there is always at least one component-level guard against locks/leaks/etc.
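The "done chan struct{} managed by Close()" guard mentioned above might look like this (the `Worker` type and its method names are hypothetical, just to make the pattern concrete):

```go
package main

import "fmt"

// Worker guards every blocking send with a done channel, so that
// Close can always unblock a pending sender.
type Worker struct {
	tasks chan int
	done  chan struct{}
}

func NewWorker() *Worker {
	return &Worker{tasks: make(chan int), done: make(chan struct{})}
}

// Submit never blocks forever: closing w.done unblocks any pending send.
func (w *Worker) Submit(v int) error {
	select {
	case w.tasks <- v:
		return nil
	case <-w.done:
		return fmt.Errorf("worker closed")
	}
}

// Close releases every goroutine blocked in Submit.
func (w *Worker) Close() { close(w.done) }

func main() {
	w := NewWorker()
	w.Close()
	fmt.Println(w.Submit(1)) // prints "worker closed" instead of deadlocking
}
```

Even without context plumbing, this gives the component one lifecycle-level escape hatch for otherwise-unbounded blocking.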

@peterbourgon

channel capacity is not an unblocking strategy

Absolutely true! A rule like "Don't add capacity to a channel without knowing exactly why" would be 😎

it isn't defensive - it's fragile - unless you add an unblocking strategy

I'm not sure the rule generalizes in the way you're expressing it here. A goroutine blocked in a chan send, or recv, or select statement, isn't by itself a problem — it's only a problem if that introduces deadlocks, races, etc. as a consequence. And if that can happen, adding a timeout to the select statement isn't any better than adding capacity to the channel — both just paper over the underlying structural errors in the code.

To be clear, I get the idea you're trying to convey, here. I tend to express it slightly differently, not in terms of the channel operations, but in terms of goroutine lifecycle management. From my own style guide —

Contain your concurrency

  • Treat goroutines like stack variables -- don't let them outlive their originating function (i.e. structured concurrency)
  • Avoid "fire and forget" goroutines -- know how to stop the goroutine, and verify it's stopped
  • Avoid spaghetti synchronization, e.g. sync/atomic as a state bit, starting/stopping/done channels, etc.
  • Keep concurrency management and business logic separate
  • No: wg.Add(1); go f(wg, input)
  • Yes: wg.Add(1); go func() { defer wg.Done(); f(input) }()
  • There is no such thing as a benign data race -- races are showstopper bugs that must be fixed immediately

The key points here being the first two.
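The "No/Yes" bullets above expand to something like the following (the `process` function is a stand-in for arbitrary business logic):

```go
package main

import (
	"fmt"
	"sync"
)

// process is plain business logic: it knows nothing about WaitGroups
// or goroutines, keeping concurrency management separate.
func process(input int) {
	fmt.Println("processed", input)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		i := i // capture the loop variable (needed before Go 1.22)
		wg.Add(1)
		// Concurrency management lives here, in the caller,
		// not threaded through process itself.
		go func() {
			defer wg.Done()
			process(i)
		}()
	}
	wg.Wait() // the goroutines do not outlive their originating function
}
```

The `wg.Wait()` at the end is what makes the first two bullets hold: the function both knows how to stop its goroutines and verifies they have stopped before returning.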

And just for context, so it doesn't look like I'm swooping in here with all these unsolicited complaints 😅 — the Uber style guide is frequently referenced as a source of truth among the Gophers I help and mentor and stuff, so I'm somewhat motivated to nudge the (few!) points that in my experience tend to lead those folks down suboptimal paths.

@mway
Member

mway commented Jul 12, 2021

The key points here being the first two

They're good points overall. Like anything, there are exceptions, but in common cases these hold true. This is effectively the spirit in which this select guidance is being drafted.

A goroutine blocked in a chan send, or recv, or select statement, isn't by itself a problem — it's only a problem if that introduces deadlocks, races, etc. as a consequence

I think that's a situational assessment, predicated on the effect the blocking has on the rest of the program (i.e., it could be either a feature or a defect, depending on how you're rationalizing it). If the effects of said blocking are insulated from the rest of the program, or if transitive backpressure is intended, then yeah, I agree with you that it's not an issue in the same sense I'm describing. That argument would not hold for situations with leaks, obviously.

However, we should avoid assuming specialized application behaviors/contracts/semantics/etc, so I think that we need to treat those "situationally okay" scenarios as exceptions, rather than the common case (hence taking a more general stance that "a bug is the possibility of defective behavior, regardless of how many times it manifests" rather than "it's not a bug if we haven't observed it [yet]"), and observing that a lot of the time, basic flow management or other synchronization tools reduce the possibility of various related bugs.

(We definitely can't cover every possible related bug, and there is certainly some amount of RTFM required because folks should understand the implications of what they're doing, but I'm personally of the opinion that it's good to document behaviors that encourage thoughtfulness, clear intent, and consistently nominal behavior, which I hope that this falls into.)

And just for context, so it doesn't look like I'm swooping in here with all these unsolicited complaints

We appreciate the feedback! Glad to hear that folks are (hopefully) finding it helpful, and we're also quite motivated to ensure not only that it's as useful to others as it is for us, but also that it evolves as we (and the community) do.

@peterbourgon

peterbourgon commented Jul 13, 2021

a bug is the possibility of defective behavior, regardless of how many times it manifests

Sure! But a goroutine blocked in a chan send, recv, or select does not by itself constitute the possibility of defective behavior, any more than does an integer with a value of 0, say. The possibility of deadlock, or a goroutine leak, or whatever else is the defective behavior, and that requires more than just the one goroutine's code to assess. Similarly, a function that uses that integer as a denominator is required for its 0 value to be worrisome.

The thing I'm guarding against here is the addition of timeout cases to channel operations that serve no purpose. Today I wrote some code that selected on one of four channels; three were errors that may or may not come, and one was the result of an operation that would take an indeterminate amount of time. The function containing that select statement is meant to block until one of those results has manifested. It would be incorrect to add a timeout to the select, or to thread a context through, in this situation. But this rule would suggest that my code is faulty.

Or, even easier: consider signal.Notify — same semantics!

Sometimes goroutines can block indefinitely. And it's not a rare situation: the Go runtime is optimized for exactly this situation. So it's not necessarily a problem. Right?
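The signal.Notify case mentioned above can be sketched as follows; the recv blocks indefinitely by design. (The self-signaling goroutine is only there so the sketch terminates on its own; a real program would simply block. Sending a signal to the current process this way assumes a Unix-like platform.)

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// Capacity 1 because signal.Notify does a non-blocking send:
	// an unbuffered channel here could silently drop signals.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	// Simulate an external signal so this example exits on its own.
	go func() {
		p, _ := os.FindProcess(os.Getpid())
		p.Signal(syscall.SIGTERM)
	}()

	// This recv may legitimately block forever; adding a timeout
	// here would change the program's meaning, not make it safer.
	fmt.Println("got signal:", <-sigs)
}
```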

@mway
Member

mway commented Jul 19, 2021

Sometimes goroutines can block indefinitely. And it's not a rare situation: the Go runtime is optimized for exactly this situation. So it's not necessarily a problem. Right?

This is essentially the crux of the issue. I'm not suggesting that we add channel dogma that should be followed in all situations, but rather that we be sensitive to context/semantics and provide reasonable heuristics for when you probably should (and if there are any overwhelmingly common cases, call them out as examples). Agreed that it doesn't make sense to "just add timeouts" everywhere, as if that by itself magically makes channel usage "resilient" or "correct"; it doesn't.

To that end, whatever guidance we add should probably be something to the effect of: "if your channel send/recv should not block for an unknown amount of time (potentially forever), add a timeout/etc case". (TBD, of course; we're not in a hurry to rush any guidance.)
